Multiprecision in JavaScript: Fast Multiplying

Today we will take a look at multiplication and fast multiplication. As promised 😉
The code for fast multiplication is easily adapted to Tom Wu’s bigint implementation and if you have a need for it…

Multiplying

Multiplying is as simple as learned in primary school:

Bigint.prototype.kmul = function(bi){
  var ret,i,j,a,b;
  a = this;
  b = bi;
  ret = new Bigint(0);
  ret.dp = new Array(a.used + b.used);
  for(var k=0;k<ret.dp.length;k++)
    ret.dp[k] = 0>>>0;
  ret.used = a.used + b.used;
  for(i=0;i<a.used;i++){
    carry = 0;
    for(j=0;j<b.used;j++){
      temp = ret.dp[i+j]+(a.dp[i]*b.dp[j])+carry;
      carry = Math.floor(temp / MP_DIGIT_MAX);
      ret.dp[i+j] = temp & MP_MASK;
    }
    ret.dp[i+b.used] = carry;
  }
  ret.clamp();
  return ret;
}

(Several shortcuts (is one of the multiplicants zero or one), checks, signs, and balances ommited)

As I said: just the same.

With one exception: you most probably have learned some shortcuts beside rote-memorizing the multiplication tables. I don’t know how you got it taught but I do paper&pencil multiplication with nothing but addition. I make a table with the bigger multiplicant multiplied by 1 – 9 and I do it by adding:

1 * 234534 = 234534
2 * 234534 = 234534 + 234534
...

If you write that one under the other

    234534 1x
    234534
==========
    469068 2x
    234534
==========
    703602 3x

    469068
    469068
==========
    938136 4x
    234534
==========
   1172670 5x

    703602
    703602
==========
   1407204 6x
    234534
==========
   1641738 7x

    938136
    938136
==========
   1876272 8x
    234534
==========
   2110806 9x

The curious sytem is not to save additions but to avoid typos—copying the digits from the line above at the same places is just less error-prone.

Now you have the ability to just write them one under the other with one digit shifted to the left and add the whole thing together to get the product. You never actually multiply when the numbers are large enough, that is, have mroe than two small digits.

We cannot do that with a comnputer. OK, we could but that would not be very reasonable when the individual digits are a wee bit larger than ten. Such a table with 26 bit long digits would be 2^26 \times 2^26 = 2^{26+26} = 2^52 =  4\,50359\,96273\,70496 bits large, that’s half a petabyte (in IEC/ISO: pebibyte) of memory.
If you plan to use The Cloud you’ll be short between a half and up to 1.5 million dollar!
Or do it at home: the new Backblaze 4U racks have 135 TByte for less than eight grands (plus taxes, I think), that makes about 32 grands total or 64 for a full petabyte.

Hey, I didn’t know it got that cheap!

That means that you can get three petabytes plus the people to maintain it 24/7 (including spare parts) plus an software engineer to kick them for close to half a million dollar, that is the price of the cheapest bidder for a sixth of the space for just one year. Without the engineer btw, they cost a fortune or two extra.
And you have the hardware already which you can either sell at the end of the first year and buy more for less or keep it (will be good for another two years bfore it gets too expensive to maintain) and make some money instead of spending it.

I digress, I know 😉

But replacing multiplication with addition is the right idea. The simplest way is the Karatsuba 2-way. GP/PARI code (b is the base, X,Y the two multiplicants):

X = x1*b + x0
Y =  y1*b + y0
x0y0 = x0 * y0
x1y1 = x1 * y1
t1 = x1 + x0
x0 = y1 + y0
t1 = t1 * x0
x0 = x0y0 + x1y1
t1 = t1 - x0
xy = x1y1*b^2 + t1 * b + x0y0
xy - X*Y

If all is correct it prints zero at the end.

To implement it in Javascript we need something to cut the Bigints in slices:

Bigint.prototype.slice = function(start,end){
  var ret = new Bigint(0);
  ret = this.dp.slice(end - start);
  ret.used = end - start;
  ret.sign = this.sign;
  return ret;
};

If you want it for Tom Wu’s Bigint:

Bigint.prototype.slice = function(start,end){
  var ret = nbi();
  ret = this.slice(end - start);
  ret.t = end - start;
  ret.s = this.s;
  return ret;
};

The actual code:

var KARATSUBA_CUTOFF = 125; // YYMV
Bigint.prototype.mul = function(bint){
  if(Math.min(this.t,bint.t) < 2*KARATSUBA_CUTOFF){
    return this.multiply(bint);
  }
  var sign = this.sign * bint.sign;
  var ret =  this.abs().karatsuba(bint.abs());
  ret.sign = sign;
  return ret;
};
Bigint.prototype.karatsuba = function(bint){
  var x0,x1,y0,y1,x0y0,x1y1,t1,xy;

  var tlen = this.used;
  var blen = bint.used;

  var m = Math.min(tlen, blen)>>>1;
  if(m <= KARATSUBA_CUTOFF)return this.multiply(bint)
  x1 = this.slice(m,tlen);
  x0 = this.slice(0,m);
  y1 = bint.slice(m,blen);
  y0 = bint.slice(0,m);

  x0y0 = x0.karatsuba(y0);
  x1y1 = x1.karatsuba(y1);

  t1 = x1.add(x0);
  x0 = y1.add(y0);
  t1 = t1.karatsuba(x0);

  x0 = x0y0.add(x1y1);
  t1 = t1.sub(x0);

  // z2 * B^(2*m) + z1 * B^m + z0
  xy = (x1y1.dlShift(2*m)).add(t1.dlShift(m)).add(x0y0);
  
  return xy;
};

If you want it for Tom Wu’s Bigint:

Bigint.prototype.dlShift = function(n){
  var r = nbi();
  this.dlShiftTo(n,r);
  return r;
};
var KARATSUBA_CUTOFF = 125; // YYMV
Bigint.prototype.mul = function(bint){
  if(Math.min(this.t,bint.t) < 2*KARATSUBA_CUTOFF){
    return this.multiply(bint);
  }
  var sign = this.s * bint.s
  var ret =  this.abs().karatsuba(bint.abs());
  ret.s = sign;
  return ret;
};
Bigint.prototype.karatsuba = function(bint){
  var x0,x1,y0,y1,x0y0,x1y1,t1,xy;

  var tlen = this.t;
  var blen = bint.t;

  var m = Math.min(tlen, blen)>>>1;
  if(m <= KARATSUBA_CUTOFF)return this.multiply(bint)
  x1 = this.slice(m,tlen);
  x0 = this.slice(0,m);
  y1 = bint.slice(m,blen);
  y0 = bint.slice(0,m);

  x0y0 = x0.karatsuba(y0);
  x1y1 = x1.karatsuba(y1);

  t1 = x1.add(x0);
  x0 = y1.add(y0);
  t1 = t1.karatsuba(x0);

  x0 = x0y0.add(x1y1);
  t1 = t1.subtract(x0);

  // z2 * B^(2*m) + z1 * B^m + z0
  xy = (x1y1.dlShift(2*m)).add(t1.dlShift(m)).add(x0y0);
  
  return xy;
};

It worked the time I tested it, but YMMV, as always.

This algorithm got expanded by Toom and Cook (please ask Wikipedia for the background) to n-way algorithms. Useful are the 3 and 4-way algorithms and in parts the 5-way.

Several algorithms are known for the 3-way Toom-Cook algorithm, the two most prominent are probably the one from Paul Zimmermann. The following is also GP/PARI code but this time shamelessly stolen from Jörg Arndt’s book “Matters Computational” freely downloadable at several places on the internet. It’s a good read, I recommend it heartily.

A = a2*x^2 + a1*x + a0
B = b2*x^2 + b1*x + b0
S0 = a0 * b0
S1 = (a2+a1+a0) * (b2+b1+b0)
S2 = (4*a2+2*a1+a0) * (4*b2+2*b1+b0)
S3 = (a2-a1+a0) * (b2-b1+b0)
S4 = a2 * b2
T1 = 2*S3 + S2
T1 /= 3 \\ division by 3
T1 += S0
T1 /= 2
T1 -= 2*S4
T2 = (S1 + S3)/2
S1 -= T1
S2 = T2 - S0 - S4
S3 = T1 - T2
P = S4*x^4 + S3*x^3 + S2*x^2 + S1*x + S0
P - A*B

And the second one from Bodrato and Zanoni is

A = a2*x^2 + a1*x + a0
B = b2*x^2 + b1*x + b0
S0 = a0 * b0
S1 = (a2+a1+a0) * (b2+b1+b0)
S2 = (4*a2+2*a1+a0) * (4*b2+2*b1+b0)
S3 = (a2-a1+a0) * (b2-b1+b0)
S4 = a2 * b2
S2 = (S2 - S3)/3 \\ division by 3
S3 = (S1 - S3)/2
S1 = S1 - S0
S2 = (S2 - S1)/2
S1 = S1 - S3 - S4
S2 = S2 - 2*S4
S3 = S3 - S2
P = S4*x^4+ S2*x^3+ S1*x^2+ S3*x + S0
P - A*B

I won’t give code for it, it is just the same spiel as above, you can copy the GP/PARI code nearly verbatim. Jörg Arndt has also GP/PARI code for 4-way and 5-way in his book.

FFT code comes later, this post is already way too long.

Squaring

Squaring can be doen a bit faster than multiplying two instances of the same number. If both numbers have the same digits at the same places you will do the multiplication of the digits x, y x \times y two times (at least), the second time as y \times x. That can safe some amount of multiplications, \frac{n^2+2}{1} for n-digit numbers to be exact, and also, for very large numbers, cache misses.

Optimization by the Toom-Cook 2-way method introduced by Karatsuba. Well, C. F. Gauss actually. He used a similar trick for mutiplying complex numbers. But the term “Gauss did it already!” is a good example for redundancy. There’s even an old mathematician’s joke:

“How many mathematicians do you need to change a bulb?”
“None, Gauss did it already!”

X = x1*b + x0
x0_2 = x0^2
x1_2 = x1^2
t1 = x1 + x0
t1 = t1^2
x0 = x0_2 + x1_2
t1 = t1 - x0
xx = x1_2*b^2 + t1 * b + x0_2
xx - X^2

The difference to the multiplication algorithm

X = x1*b + x0                     X = x1*b + x0
Y =  y1*b + y0                    /* Y = x1*b + x0 */
x0y0 = x0 * y0                    x0y0 = x0^2
x1y1 = x1 * y1                    x1x1 = x1^2
t1 = x1 + x0                      t1 = x1 + x0
x0 = y1 + y0	                  /* x0 = x1 + x0 */
t1 = t1 * x0                      t1 = t1^2
x0 = x0y0 + x1y1                  x0 = x0y0 + x1y1
t1 = t1 - x0                      t1 = t1 - x0
xy = x1y1*b^2 + t1 * b + x0y0     xy = x1y1*b^2 + t1 * b + x0y0
xy - X*Y                          xy - X^2

The cutoff-point is probably a good amount higher (I have 125 for multiplication and about 250 for squaring but YMMV ).

Cutoff Points

The cutoff-points depend on the CPU (the ratio between multiplication and addition and the processor cache) and the complexity of the algorithm. The latter point has the larger influence: 2-way has shifts only, 3-way has a division by three and some more operations and 4-way has so many operations already that I don’t put them here but point you to Jörg Arndt’s book mentioned above but also Tom St Denise’s book which is a must-read, too, if you want to build a multi-precision integer library.

Balancing

The Toom-Cook algorithms work best when both multiplicants are roughly of the same digit length. There exist asymmetric Toom-Cook algorithms, for asymetric input but it can be more optimal to do asymmetric sliceing. See for example Bodrato’s 4-way assymmetric squaring but also Chung et al.

It is simpler to make multiplicants of equal length.

\displaystyle{ a * b = (xa_1 + a_0) *  b }

Where x is a positive integer power of the digit size and chosen such that \#a_0 = \#b where \#y = \big\lceil \log_b \max \left(1,|y| + 1 \right) \big\rceil is the digit count. It is probably a good idea to do it recursively up to a certain ratio/cutoff whatever comes first. I used the Karatsuba cutoff C_k (you might take two different values for the two different cases below)

\displaystyle{\min\left(\#a,\#b\right) > C_k \land \min\left(\#a,\#b\right) > C_k}

Together with the ratio

\displaystyle{ \frac{\min\left(\#a,\#b\right)}{\max\left(\#a,\#b\right)} > C_l \land \frac{\min\left(\#a,\#b\right)}{\max\left(\#a,\#b\right)} < C_h  }

The limiting values I found for libtommath are C_l = \frac{1}{1\,000} and C_h = 1 - \frac{1}{1\,000} but YMMV, as the old saying goes.

I think I'll do FFT next.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s