Home > Uncategorized > Distribution of add/multiply expression values

Distribution of add/multiply expression values

The implementation of scientific/engineering applications requires evaluating many expressions, and can involve billions of adds and multiplies (divides tend to be much less common).

Input values percolate through an application, as it is executed. What impact does the form of the expressions evaluated have on the shape of the output distribution; in particular: What is the distribution of the result of an expression involving many adds and multiplies?

The number of possible combinations of add/multiply grows rapidly with the number of operands (faster than the rate of growth of the Catalan numbers). For instance, given the four interchangeable variables: w, x, y, z, the following distinct expressions are possible: w+x+y+z, w*x+y+z, w*(x+y)+z, w*(x+y+z), w*x*y+z, w*x*(y+z), w*x*y*z, and w*x+y*z.

Obtaining an analytic solution seems very impractical. Perhaps it’s possible to obtain a good enough idea of the form of the distribution by sampling a subset of calculations, i.e., the output from plugging random values into a large enough sample of possible expressions.

What is needed is a method for generating random expressions containing a given number of operands, along with a random selection of add/multiply binary operators, then to evaluate these expressions with random values assigned to each operand.

The approach I used was to generate C code, and compile it, along with the support code needed for plugging in operand values (code+data).

A simple method of generating postfix expressions is to first generate operations for a stack machine, followed by mapping this to infix form to a postfix form.

The following awk generates code of the form (‘executing’ binary operators, o, right to left): o v o v v. The number of variables in an expression, and number of lines to generate, are given on the command line:

# Build right to left
 
function addvar()
{
num_vars++
cur_vars++
expr="v " expr # choose variable later
}
 
 
function addoperator()
{
cur_vars--
expr="o " expr # choose operator later
}
 
 
function genexpr(total_vars)
{
cur_vars=2
num_vars=2
expr="v v" # Need two variables on stack
while (num_vars <= total_vars)
   {
   if (cur_vars == 1) # Must push a variable
      addvar()
   else
      if (rand() > 0.2) # Seems to produce reasonable expressions
         addvar()
      else
         addoperator()
   }
while (cur_vars > 1) # Add operators to use up operands
   addoperator()
print expr
}
 
BEGIN {
# NEXPR and VARS are command line options to awk
        if (NEXPR != "")
           num_expr=NEXPR
        else
           num_expr=10
        for (e=1; e <=num_expr; e++)
           genexpr(VARS)
        }

the following awk code maps each line of previously generated stack machine code to a C expression, wraps them in a function, and defines the appropriate variables.

function eval(tnum)
{
if ($tnum == "o")
   {
   printf("(")
   tnum=eval(tnum+1) #recurse for lhs
   if (rand() <= PLUSMUL) # default 0.5
      printf("+")
   else
      printf("*")
   tnum=eval(tnum) #recurse for rhs
   printf(")")
   }
else
   {
   tnum++
   printf("v[%d]", numvars)
   numvars++ # C is zero based
   }
 
return tnum
}     
 
BEGIN {
        numexpr=0
        print "expr_type R[], v[];"
        print "void eval_expr()"
        print "{"
        }
 
 
        {
        numvars=0 
        printf("R[%d] = ", numexpr)
        numexpr++ # C is zero based
        eval(1)
        printf(";\n")
        next
        }
 
END {
        # print "return(NULL)"
        printf("}\n\n\n")
 
        printf("#define NUMVARS %d\n", numvars)
        printf("#define NUMEXPR %d\n", numexpr)
        printf("expr_type R[NUMEXPR], v[NUMVARS];\n")
        }

The following is an example of a generated expression containing 100 operands (the array v is filled with random values):

R[0] = (((((((((((((((((((((((((((((((((((((((((((((((((
((((((((((((((((((((((((v[0]*v[1])*v[2])+(v[3]*(((v[4]*
v[5])*(v[6]+v[7]))+v[8])))*v[9])+v[10])*v[11])*(v[12]*
v[13]))+v[14])+(v[15]*(v[16]*v[17])))*v[18])+v[19])+v[20])*
v[21])*v[22])+(v[23]*v[24]))+(v[25]*v[26]))*(v[27]+v[28]))*
v[29])*v[30])+(v[31]*v[32]))+v[33])*(v[34]*v[35]))*(((v[36]+
v[37])*v[38])*v[39]))+v[40])*v[41])*v[42])+v[43])+
v[44])*(v[45]*v[46]))*v[47])+v[48])*v[49])+v[50])+(v[51]*
v[52]))+v[53])+(v[54]*v[55]))*v[56])*v[57])+v[58])+(v[59]*
v[60]))*v[61])+(v[62]+v[63]))*v[64])+(v[65]+v[66]))+v[67])*
v[68])*v[69])+v[70])*(v[71]*(v[72]*v[73])))*v[74])+v[75])+
v[76])+v[77])+v[78])+v[79])+v[80])+v[81])+((v[82]*v[83])+
v[84]))*v[85])+v[86])*v[87])+v[88])+v[89])*(v[90]+v[91]))
v[92])*v[93])*v[94])+v[95])*v[96])*v[97])+v[98])+v[99])*v[100]);

The result of each generated expression is collected in an array, i.e., R[0], R[1], R[2], etc.

How many expressions are enough to be a representative sample of the vast number that are possible? I picked 250. My concern was the C compilers code generator (I used gcc); the greater the number of expressions, the greater the chance of encountering a compiler bug.

How many times should each expression be evaluated using a different random selection of operand values? I picked 10,000 because it was a big number that did not slow my iterative trial-and-error tinkering.

The values of the operands were randomly drawn from a uniform distribution between -1 and 1.

What did I learn?

  • Varying the number of operands in the expression, from 25 to 200, had little impact on the distribution of calculated values.
  • Varying the form of the expression tree, from wide to deep, had little impact on the distribution of calculated values.
  • Varying the relative percentage of add/multiply operators had a noticeable impact on the distribution of calculated values. The plot below shows the number of expressions (binned by rounding the absolute value to three digits) having a given value (the power law exponents fitted over the range 0 to 1 are: -0.86, -0.61, -0.36; code+data):

Number of expressions having a given absolute value (binned by rounding to three digits).

A rationale for the change in the slopes, for expressions containing different percentages of add/multiply, can be found in an earlier post looking at the distribution of many additions (the Irwin-Hall distribution, which tends to a Normal distribution), and many multiplications (integer power of a log).

  • The result of many additions is to produce values that increase with the number of additions, and spread out (i.e., increasing variance).
  • The result of many multiplications is to produce values that become smaller with the number of multiplications, and become less spread out.

Which recent application spends a huge amount of time adding/multiplying values in the range -1 to 1?

Large language models. Yes, under the hood they are all linear algebra; adding and multiplying large matrices.

LLM expressions are often evaluated using a half-precision floating-point type. In this 16-bit type, the exponent is biased towards supporting very small values, at the expense of larger values not being representable. While operations on a 10-bit significand will experience lots of quantization error, I suspect that the behavior seen above not noticeable change.

  1. Nemo
    June 7, 2023 20:47 | #1

    Did the compiler optimization levels make any difference?

  2. June 8, 2023 10:42 | #2

    @Nemo
    I used the default optimization level, because I think that is likely to produce the most reliable code.
    The C Standard does not specify order of evaluation, so printf("Hello")+printf("World") could produce HelloWorld or WorldHello, but the result of the addition will always be the same, i.e., the number of characters.

  1. No trackbacks yet.