Multiple Precision in JavaScript: FFT multiplication

Nobody believed it but finally: FFT multiplication for my Little big-integer library. It is quite general and can be used for Tom Wu’s bigint library, too, if the the digit size is not larger than 28 bits (default for Mozilla but check the start of Tom Wu’s file).
I don’t go into the mathematical details, there is more than enough to find online but feel free to ask if you have any questions.
The code-base I used is my own implementation of FFT multiplication for Libtommath, so no license problems.

The individual digits (limbs) of the Bigint are to large for the 64 bit data type used as the native Number in JavaScript, it will lead very quick to rounding errors if we use the full 26 bit of the big-digit. Half of it, 13 bits, is ok and good for 16,777,216 limbs (ca 208 Mega-bits) of 13 bits (with 3 bits angst-allowance). With the full 26 bits and the same 3 bits we couldn’t do it all[1].

A second condition must be met: the digit length must be a power of two for the FFT algorithm I use.

Let’s start with the uncomfortable fact that I wasn’t planning to use the code for JavaScript, too, and didn’t spared some good pointer-juggling. Not very much for C, really not, but a wee bit much for JavaScript. But I don’t want to rewrite the whole thing, so…

Let’s start at the beginning. This time. I’ll use simple functions to avoid any mess, ok?
So…

function fft_mul(a,b){
  var fa = [],fb=[],len = [];
  var c;
  // put checks and balances here if used stand-alone
  bi_to_fft(a, fa, b, fb,len);
  MP_fft(fa, fb,len);
  c = fft_to_bi(fb,len);
  return c;
}

We’ll use and re-use a lot of memory here, hence the Arrays to pass around.
The first function chops the numbers, the second does the FFT convolution and the last one sums it all up.
As said above, we can use only smaller limbs, so we cut ours in half

var MP_DIGIT_BIT = 26 // change accordingly
var MP_DIGIT_SIZE     = (1<<MP_DIGIT_BIT)
var MP_DIGIT_BIT_HALF = (MP_DIGIT_BIT>>1)
var MP_DIGIT_HALF     = (1<< MP_DIGIT_BIT_HALF )
var MP_DIGIT_MASK     = (MP_DIGIT_HALF-1)

/* base two integer logarithm for convenience*/
function highbit( n){
  var r=0>>>0;
  var m=n>>>0;
  while (m >>>= 1) {
    r++;
  }
  return r;
}

function bi_to_fft(a,fa,b,fb,len){
   var length_a, length_b, length_needed, i,j, hb;
   /* Check of the multiplicants happens earlier */
   length_a = a.used;
   length_b = b.used;

   /* Digits get split in half, so twice the length is needed*/
   length_needed = ( length_a + length_b ) * 2 ;
   /* final length must be a power of two to keep the FFTs simple */
   hb = highbit( length_needed );
   /* check for the rare case that it is already a power of 2 */
   if(length_needed != 1<<hb){
     length_needed = 1<<(hb+1);
   }
   /* Send computed length back to caller */
   len[0] = length_needed;

   /* Put splitted digits in double-array, in the same order as in bigint.dp */
   for(i = 0,j=0;i<length_needed/2;i++,j+=2){
      if(i < length_a){
        fa[j]   = 1.0 *  (a.dp[i]                         & MP_DIGIT_MASK);
        fa[j+1] = 1.0 * ((a.dp[i] >>> MP_DIGIT_BIT_HALF ) & MP_DIGIT_MASK);
      }
      /* padding a */
      if(i >= length_a){
        fa[j]   = 0.0;
        fa[j+1] = 0.0;
      }
      /* sweep up rests */
      if(i < length_b){
        fb[j]   = 1.0 *  (b.dp[i]                         & MP_DIGIT_MASK);
        fb[j+1] = 1.0 * ((b.dp[i] >>> MP_DIGIT_BIT_HALF ) & MP_DIGIT_MASK);
      }
      /* padding b if necessary*/
      if(i >= length_b){
        fb[j]   = 0.0;
        fb[j+1] = 0.0;
      }
   }
}

Quite straightforward, simple even. I try to persuade the engine to make double Arrays by multiplying with a float but using typed arrays from the beginning would be much cleaner of course.
The next function called is MP_fft

function MP_fft(x, y,len){
  var n;
  n = len[0]; 
  if(n < 2) return MP_VAL;
  fht_dif_rec(x,n,0);
  fht_dif_rec(y,n,0);
  fht_conv_core(x, y,n, 0.0);
  fht_dit_rec(y, n,0),0;
}

And here starts the real stuff as described in you textbook. We do two transformations (in frequency), one for every number each, do the convolution (multiplication), do an inverse transformation (in time) over the result of the convolution and are done. At least with the good stuff.

The Algorithm I used does a small optimization by calling itself recursively and if the size of the Array is small enough to fit in the CPU L1 cache it does the actual transform. With the value set sets in at < 4096 elements if the size of a double is 8 bytes. I hope it works, had no time to test it yet. But the it works without recursion, of course, just switch it off. I’ll update this post when I (or you?) found some time to test it.
[UPDATE] Yes, I was shot by my own pointer juggling 😉 I forgot to place the pointer offset in the recursions correctly and everywhere.
I’ll c&p my comments from Libtommath here.

/*
  The size of the L1-cache in bytes. The number here is that of the data cache
  part of an AMD Duron. The Linux kernel gives a lot of information e.g.:
    grep . /sys/devices/system/cpu/cpu0/cache/index*//*
  There is also lscpu(1) wich is easier to use.
  On Windows:
    http://msdn.microsoft.com/en-us/library/ms683194.aspx
    http://www.cpuid.com/softwares/cpu-z.htm
  Lack of access to a Mac leaves that part blank. The new MacOS is based on BSD,
  so 'dmesg' might work or
    cat /var/run/dmesg.boot | grep CPU
 */
if(typeof  L1_SIZE === 'undefined')
  var L1_SIZE = 65536;

var M_PI = 3.14159265358979323846264338327950288419716939937511

var TWOPI = (2.0 * M_PI);

/*
  The recursive version of a Hartley transform, decimation in frequency.
  The trick is a kind of binary splitting: recurse until length of array
  is short enough to fit into the L1-cache.
  Idea found in FFTW3 but it is way older:
    Singleton, R. C., "On Computing the Fast Fourier Transform," 
    Communications of the ACM 10:647, 1967
  Seeking for this citation online found the software "am" at
  https://www.cfa.harvard.edu/~spaine/am/ where the same method can be found
  in https://www.cfa.harvard.edu/~spaine/am/download/src/transform.c
  
  The construction of the functions here have been slightly adapted to the
  functions in "am" from above to make the actual FFT function easier replaceable
  with others e.g.: those in "src/transform.c". They are legible now, too.
*/
function fht_dif_rec(x, n,offset)
{
    var nh=0>>>0;;
    if (n == 1 )
        return;
    if (n < Math.floor(L1_SIZE / (2 * 8) ) ) { // 8 = sizeof(double)
        fht_dif_iterative(x, n, offset, true); 
        return;
    }
    fht_dif_iterative(x, n, offset,false);
    nh = n >>> 1;
    fht_dif_rec(x, nh,offset);
    fht_dif_rec(x , nh, nh+offset);
    return;
}

The Hartley transform is a FFT over reals, so no complex arithmetic needed. Makes it simpler to read, too.
It is the iterative version of a Hartley transform, decimation in frequency. For the technical details see the excellent description in Jörg Arndt’s book “Matters Computational”[1]. See pages 515 ff there.
One optimization is used here, the hartley_shift (p. 516 f. in [1]) which avoids a large number of trigonetric computations.

In the case of computationally very expensive or even non-existent trigonometric functions (they are in JavaScript, believe me, this comment is c&p’d from Libtomfloat which is able to run on very reduced hardware that may not have any trigonometric functions)t he calculation of the variables {c,s} can be replaced with two square roots which are easier to implement.
The following pseudo-code calculates the numbers needed, although in reverse order:

    c = 0.0;s = 1.0;
    for(i=0;i<n;i++){
      s = sqrt((1.0 - c) / 2.0);
      c = sqrt((1.0 + c) / 2.0);
    }

Also: the trigonometric values are always an ordered subset out of an ordered set of sizeof(unsigned long)*CHAR_BIT tuples if the length of the data is a power of two. That set can be pre-computed and needs sizeof(unsigned long)*CHAR_BIT*2 * sizeof(double)*CHAR_BIT bits of memory. With the current 64-bit implementations (64 bit integers and 64 bit doubles) it’s 8 kibi.

function fht_dif_iterative(x, n,offset, do_loop)
{
    var m=0>>>0,mh=0>>>0,mq=0>>>0;
    var i,j,k;
    var a,b,t, c,s, u,v,tmp;
    // a pointer to x in the original
    var dp=offset;
    for (m=n; m > 1; m >>>= 1) {
        mh = m >>> 1;
        mq = mh >>> 1;
        t = M_PI / mh;
        a = Math.sin(0.5 * t);
        a *= 2.0 * a;
        b = Math.sin(t);
        for (i = 0; i < n; i += m) {
            dp = offset + i; // dp = x + i in original
            for (j = 0, k = mh; j < mh; ++j, ++k) {
                u = x[dp+j];
                v = x[dp+k];
                x[dp+j] = u + v;
                x[dp+k] = u - v;
            }
            dp += mh;
            c = 1.0;
            s = 0.0;
            for (j = 1, k = mh - 1; j < mq; ++j, --k) {
                tmp = c;
                c -= a * c + b * s;
                s -= a * s - b * tmp;
                u = x[dp+j];
                v = x[dp+k];
                x[dp+j] = u * c + v * s;
                x[dp+k] = u * s - v * c;
            }
        }
        if(!do_loop)break;
    }
    return;
}

The awkward indexing of the Array is caused by the pointer-juggling in the original C-code and my unwillingness to rewrite it.

The inverse transforms:

/* The iterative Hartley transform, decimation in time. Description above */
function fht_dit_iterative(x, n,offset, do_loop)
{
    var m=0>>>0, mh=0>>>0 ,mq=0>>>0;
    var i,j,k;
    var a,b,t, u,v, c,s, tmp;
    var dp=offset;

    m = (do_loop)?2:n;
    for (; m <= n; m <<= 1) {
        mh = m >>> 1;
        mq = mh >>> 1;
        t = M_PI / mh;
        a = Math.sin(0.5 * t);
        a *= 2.0 * a;
        b = Math.sin(t);
        for (i = 0; i < n; i += m) {
            dp = offset + i + mh; // was dp = x + i + mh
            c = 1.0;
            s = 0.0;
            for (j = 1, k = mh - 1; j < mq; ++j, --k) {
                tmp = c;
                c -= a * c + b * s;
                s -= a * s - b * tmp;
                u = x[dp+j];
                v = x[dp+k];
                x[dp+j] = u * c + v * s;
                x[dp+k] = u * s - v * c;
            }
            dp -= mh;
            for (j = 0, k = mh; j < mh; ++j, ++k) {
                u = x[dp+j];
                v = x[dp+k];
                x[dp+j] = u + v;
                x[dp+k] = u - v;
            }
        }
    }
    return;
}
function fht_dit_rec(x, n,offset)
{
    var nh;

    if (n == 1)
        return;
    if (n < Math.floor(L1_SIZE / (2 *  8) ) ) { // 8 = sizeof(double)
        fht_dit_iterative(x,n,offset,true);
        return;
    }
    nh = n >>> 1;
    fht_dit_rec(x, nh,offset);
    fht_dit_rec(x, nh,nh+offset);
    fht_dit_iterative(x,n,offset,false);
    return;
}

And finally the actual convolution. It is the FHT convolution from Jörg Arndt’s book[1]. The code looks a bit messy but only on the face of it. This method avoids the otherwise costly bit reversing (which is called “revbin” in J. Arndt’s book).
And yes, it really looks complicated but it isn’t at all, try to follow the logic yourself. It is a bit like playing Tower of Hanoi 😉

function fht_conv_core(f, g, n, v /*=0.0*/ )
{
    var nh, r, rm, k, km, tr, m;
    var xi, xj, yi, yj;
    if (v == 0.0)
	v = 1.0 / n;

    g[0] *= (v * f[0]);
    if (n >= 2)
	g[1] *= (v * f[1]);
    if (n < 4)
	return;
    v *= 0.5;
    nh = (n >>> 1);
    r = nh;
    rm = n - 1;
    xi = f[r];
    xj = f[rm];
    yi = g[r];
    yj = g[rm];
    g[r] = v * ((xi + xj) * yi + (xi - xj) * yj);
    g[rm] = v * ((-xi + xj) * yi + (xi + xj) * yj);

    k = 2;
    km = n - 2;
    while (k < nh) {
	rm -= nh;
	tr = r;
	r ^= nh;
	for (m = (nh >>> 1); !((r ^= m) & m); m >>>= 1) {;
	}

	xi = f[r];
	xj = f[rm];
	yi = g[r];
	yj = g[rm];
	g[r] = v * ((xi + xj) * yi + (xi - xj) * yj);
	g[rm] = v * ((-xi + xj) * yi + (xi + xj) * yj);
	--km;
	++k;
	rm += (tr - r);
	r += nh;
	xi = f[r];
	xj = f[rm];
	yi = g[r];
	yj = g[rm];
	g[r] = v * ((xi + xj) * yi + (xi - xj) * yj);
	g[rm] = v * ((-xi + xj) * yi + (xi + xj) * yj);
	--km;
	++k;
    }
    return;
}

And now let’s glue the whole thing together.
We have to carry the carry. The FFT multiplication does no carry (it’s one of the tricks of it), so we have to do it.
And we have to make a whole out of to halfs, too.
[UPDATE 2] fixed a bug, see comment below

function fft_to_bi(fft_array, len){
    var new_length, i,j,e;
    var carry,temp;
    var a = new Bigint(0);

    new_length = len;
    carry = 0;
    for(i=0;i<len;i++){
      temp = carry;
      carry = 0;
      /*
          BUGFIX:
          Doing the >>>0 to press unsigned values is one step too early
          the number is still a double with 53 bits and not the 32 bit the
          >>>0 makes out of it.
          JavaScript's loose typing is really a pain in the rear!
          Even when it was my own fault here 😉
       */
      temp  += Math.round(fft_array[i])/* >>>0 */;
      if(temp >= MP_DIGIT_HALF){
        carry = Math.floor(temp / MP_DIGIT_HALF);
        temp  = temp & MP_DIGIT_MASK;
      }
      /* memory is still expensive, not a thing to waste easily */
      fft_array[i] = temp;
    }

    /* re-marry the digits */
    for(i=0,j=0;j<new_length;i++,j+=2){
      a.dp[i]   = (Math.round(fft_array[j+1]))& MP_DIGIT_MASK;
      a.dp[i] <<= MP_DIGIT_BIT_HALF;
      a.dp[i]  |= (Math.round(fft_array[j]))  & MP_DIGIT_MASK;
      /* and count them all */
      a.used++;
    }
    if(carry){
      a.dp[i] = carry;
      a.used++;
    }
    a.clamp();
    return a;
}

Yes, that was all, we’re done.
Was it really that complicated?
Tsk, tsk, tsk! 😉

[1] Arndt, Jorg. Matters computational. Springer, 2011.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s