Adventures of a Programmer: Parser Writing Peril III

The last post showed an implementation of the calculation of the factorial with Borwein’s trick. Running time was about 12 minutes, if I remember it correctly, not that bad a result with Libtommath, a big-integer library written with a pedagogical purpose in mind and not for speed. There is a fast version of it, aptly named Tom’s fast math which has a lot of handwritten assembler code in it without changing the API too much.

But 12 minutes for the factorial of a mere million is still too slow. So I wrote a simple FFT multiplier. One warning: I don’t have a 64 bit computer at hand for testing, so everything should work on a 32 bit CPU and might work with a 64 bit-CPU but I doubt it.

As usual: the code is as simple as possible but not simpler and it is useful, which means in this case: fast enough to replace the Toom-Cook algorithm at some limit. The algorithm for FFT-based multiplication has also an upper limit because of the limited precision of the used floating point types.
To keep things even simpler, I used the algorithms presented in Jörg Arndt’s book “Matters Computational” as close as possible (most are written in C++) to enable the reader to make use of the more thorough descriptions there. It also saves me from typing too much mathematical formulas here. The LaTeX support is good here but still quite tedious to write when it is not possible to use macros.

So, long intro, short text? Well, sort of.

The problem with rounding errors mentioned above is caused by the fact that the FFT needs floating point computations (sine, cosine, or sqrt) and the floating point type used (`double` or `long double` if we stay with Standard C) needs to hold the product of two integers without any loss of precision. The `double` type has a mantissa of 53 bits, the ` long double` type has one of 64 bits in C-99 (some compilers, like e.g.: Microsoft Visual C++ in some versions for x86 CPU’s don’t). This restricts the length of the numbers, that is the number of digits we can multiply. There is a table in Jörg Arndt’s book that lists the lengths allowed in relation to the size of the digits. Libtommath has a default of 28 bit per digit, 30 are possible under some circumstances and it uses 60 bit digits on a 64-bit CPU. The last one is a problem. Not a very large one, even simple to handle but it causes a branch in the code. The code here in this post assumes 28 bit digits but I will show how to build the necessary branch.
If you have read Jörg Arndt’s book…what? Ohhh-kaaay, well, then…
To fit a product into 53 bit we can have at most 53/2 = 26 bits per multiplicand if they are of the same size (one with 26 bit and one with 27 bit would work, too, yes). With some bits reserved as an insurance against rounding errors, say 3 (three) we are down to 25 bits. The LTM-digit has 28 per default, so we have to get it smaller. To avoid too much of the error-prone bit-juggling the half of it told a good number and taken. According to the table at page 561 we can multiply with numbers up to 56 MiBi length (still 15 MiBi with the half of the 30 bit digits). That should suffice for most—but not all!—uses[1].
For the 60 bit variant it gets a little bit more complicated. The formula to calculate the need is
$\displaystyle{ N_{max}(R) = 2^{M-S-2\log_2(R)} }$
where $M$ is the number of bits of the mantissa, $S$ the safety bits and $R$ the size of the digit. With $R = 2^{30}$ $S = 3$ $M = 53$ we get `2^(50 - 2* log(2^30)/log(2)) ~ 0.0009765625`, with $R = 2^{14}$ we get `4194304` and the value gets down to one at $R = 2^{25}$.
So, a digit-size (or radix to use the correct word for it) of 30 is not usable. The larger floating point type is `long double` with a 64 bit mantissa which gives, with a 3 bit safety net, `2`. Some 64-bit CPU’s have extended types with 106 bits or even 113 bits. That would give (again with 3 bits safety) `8,796,093,022,208` and `~1,125,899,906,842,624` respectively. To check for them use ` sizeof(the_long_floats_name) * CHAR_BIT` to find out. GCC has e.g.: `__float128` as such a type. You might have more luck with the GPU but that is completely out of any standard and far from the theme of this post. Maybe in a later post?
So, 14 bit it shall be and 28 we have, that is twice as much as we need.
Wasting a bit of memory we put every half in the individual buckets of a freshly allocated array of `double`.
We need some constants here with the ability to be independent of the individual sizes of the digits (or limps as they get called), at least the 28/30 bit variants

```#define MP_DIGIT_SIZE (1<<DIGIT_BIT)
#if (DIGIT_BIT == 60)
#define MP_DIGIT_BIT_HALF (DIGIT_BIT>>2)
#else
#define MP_DIGIT_BIT_HALF (DIGIT_BIT>>1)
#endif
#define MP_DIGIT_HALF (1<< MP_DIGIT_BIT_HALF )
```

The 60 bit variant gets parted into four chunks a 15 bit each which will make it slower. Probably. Maybe. Who knows?
A little helper is needed: an integer logarithm function with base 2 (two).

```static int highbit(unsigned  long n){
int r=0;
unsigned long m=n;
while (m >>= 1) {
r++;
}
return r;
}
```

We do not implement the 60 bit stuff here, so we let the precompiler utter a little note and stop.

```#if (DIGIT_BIT == 28)
#else
#error tested with 28 bit digits only, 30 bit might, 60 bit will not
#endif
```

OK, I admit: I lied, we do not implement the 30 bit stuff either. It is the very same code but needs extra branches for the maximum value which I will not do for simplicity reasons.
The function takes the two multiplicands and fills two arrays of `double` with the result. The length of the arrays land in `length`. The type of `length` should be `size_t` before it gets put into production. Let’s dive into the code.

```int dp_to_fft(mp_int *a, double **fa,
mp_int *b, double **fb, int *length){
int length_a, length_b, length_needed, i,j, hb;
double *fft_array_a,*fft_array_b;

length_a = a->used;
length_b = b->used;
/*
As with every multiplication the number of digits of the result is at most
the sum of the digits of the individual multiplicands.
Twice that because because the digits are half the size after the mangling
here, so twice as many are needed to represent the same number.
*/
length_needed = ( (length_a + length_b)  )*2 ;
/*
Final length must be a power of two to be able to use simple FFT algorithms
*/
hb = highbit((unsigned  long) length_needed );
/* check for the rare case that it is already a power of 2 */
if(length_needed != 1<<hb){
/* Otherwise choose the next larger power of two */
length_needed = 1<<(hb+1);
}
/* Put the result where we can find it later*/
*length = length_needed;

/* checks and balances omitted */
fft_array_a = malloc(sizeof(double) * length_needed);
fft_array_b = malloc(sizeof(double) * length_needed);

/* Now put it all together, or better: rip it apart */
for(i = 0,j=0;i<length_needed/2;i++,j+=2){
if(i < length_a){
/* mask off the lower half, cast and put it into the array */
fft_array_a[j]   = (double) (a->dp[i]                        & MP_DIGIT_MASK);
/* Shift upper half to the right,  mask off the lower half,
cast and put it into the array */
fft_array_a[j+1] = (double)((a->dp[i] >> MP_DIGIT_BIT_HALF ) & MP_DIGIT_MASK);
}
/* The rest needs to be set to zero. */
if(i >= length_a){
fft_array_a[j]   = 0.0;
fft_array_a[j+1] = 0.0;
}
/* Rinse and repeat */
if(i < length_b){
fft_array_b[j]   = (double) (b->dp[i]                        & MP_DIGIT_MASK);
fft_array_b[j+1] = (double)((b->dp[i] >> MP_DIGIT_BIT_HALF ) & MP_DIGIT_MASK);
}
if(i >= length_b){
fft_array_b[j]   = 0.0;
fft_array_b[j+1] = 0.0;
}
}
/* Link the result to the pointers given. */
*fa = fft_array_a;
*fb = fft_array_b;
return MP_OKAY;
}
```

The result we get from the FFT process is not normalized and needs some massage before we make a `mp_int` out of it. Nothing complicated, just compute the carries.

```int fft_to_dp(double *fft_array, mp_int *a,int length){
int new_length,i,j,e;
mp_word carry,temp;
/* Checks and balances omited */
new_length = length;
/* Allocate some memory in advance */
if (a->alloc < new_length  ) {
if ((e = mp_grow (a, new_length  )) != MP_OKAY) {
return e;
}
}
/* This can be omitted, but, well ... */
memset(a->dp,0,a->used);
carry = 0;
for(i=0;i<length;i++){
temp = carry;
carry = 0;
/* mp_word is the type that can hold twice DIGIT_BIT */
temp  += (mp_word)(round(fft_array[i]));
if(temp >= MP_DIGIT_HALF){
carry = temp / (mp_word)MP_DIGIT_HALF;
temp  = temp % (mp_word)MP_DIGIT_HALF;
}
/* be friendly to the environment and recycle */
fft_array[i] = (double)temp;
}
/* This is the mirror of dp_to_fft() */
for(i=0,j=0;j<length;i++,j+=2){
/*This will be the upper half of the bits, so use the _next_ entry*/
/* shift it to the left */
a->dp[i] <<= MP_DIGIT_BIT_HALF;
/* add teh first entry n */
/* Keep the ledger uptodate */
a->used++;
}
/* Do not forget the left-overs, very tasty when reheated the next day! */
if(carry)a->dp[i] = carry;
/* We most probably have accumulated some leading zeros, get rid of them */
mp_clamp(a);
return MP_OKAY;
}
```

That was the code to get the numbers to and fro. The code for the FFTs is more complex, but not much. We use the Hartley transform here, from the book, without the bit-reversal. The actual kind of transform is quite arbitrary, I’ll use another FFT later, to show that you can use different kinds just like plug-ins.
I changed as little as possible from Jörg Arndt’s example code, so here are the two helper functions he uses, although it makes more sense in C to weave them in.

```static inline void sumdiff(double *a, double *b){
double t;
t=*a-*b;
*a+=*b;
*b=t;
}

static inline void SinCos(double p, double *s, double *c){
*s = sin(p);
*c = cos(p);
}
```

We need two versions for the algorithm, decimation in frequency and decimation in time and in that exact order. See the book for the reason.
The decimation in frequency:

```void fht_dif(double *f,unsigned long ldn){
unsigned long n,m,m4,j,r,ldm;
unsigned long k,mh;
double c,s,phi0;

n = 1<<ldn;
for( ldm=ldn;ldm>= 1;ldm--){
m = 1<<ldm;
mh = m>>1;
m4 = m>>2;
phi0 = PI/mh;
for (r=0; r<n; r+=m){
unsigned long t1 = r;
unsigned long t2 = t1 + mh;
sumdiff(&f[t1], &f[t2]);
if( m4 ){
unsigned long t1 = r + m4;
unsigned long t2 = t1 + mh;
sumdiff(&f[t1], &f[t2]);
}
}
for (j=1, k=mh-1; j<k; ++j,--k){
SinCos(phi0*j, &s, &c);
for (r=0; r<n; r+=m) {
unsigned long tj = r + mh + j;
unsigned long tk = r + mh + k;
unsigned long t1 = r + j;
unsigned long t2 = tj; // == t1 + mh;
sumdiff(&f[t1], &f[t2]);
t1 = r + k;
t2 = tk; // == t1 + mh;
sumdiff(&f[t1], &f[t2]);
double fj = f[tj];
double fk = f[tk];
f[tj] = fj * c + fk * s;
f[tk] = fj * s - fk * c;
}
}
}
return;
}
```

(The book has pseudo-code listings for a better understanding)
The decimation in time:

```void fht_dit(double *f,unsigned long ldn){
unsigned long n,m,mh,m4,k,j,r,ldm;
double c,s,phi0;
n = 1<<ldn;
for( ldm=1;ldm<= ldn;++ldm){
m = 1<<ldm;
mh = m>>1;
m4 = m>>2;
phi0 = PI/mh;
for(r=0; r< n; r+=m){
unsigned long t1 = r;
unsigned long t2 = t1 + mh;
sumdiff(&f[t1], &f[t2]);
if ( m4 ){
unsigned long t1 = r + m4;
unsigned long t2 = t1 + mh;
sumdiff(&f[t1], &f[t2]);
}
for (j=1, k=mh-1; j<k; ++j,--k){
SinCos(phi0*j, &s, &c);
unsigned long tj = r + mh + j;
unsigned long tk = r + mh + k;
double fj = f[tj];
double fk = f[tk];
f[tj] = fj * c + fk * s;
f[tk] = fj * s - fk * c;
unsigned long t1 = r + j;
unsigned long t2 = tj; // == t1 + mh;
sumdiff(&f[t1], &f[t2]);
t1 = r + k;
t2 = tk; // == t1 + mh;
sumdiff(&f[t1], &f[t2]);
}
}
}
return;
}
```

Both o the multiplicands get mangled through the FFTs (decimation in frequency), multiplied without carry, and mangled again through the FFT, but this time it is decimation in time. The part in the middle is called convolution and is a bit more complicated because we did not do the reversing in the FFTs to save two full rounds.

```void fht_conv_core(double *f, double *g,unsigned long ldn, double v/*=0.0*/){
unsigned long n = (1UL<<ldn);
if ( v==0.0 )  v = 1.0/n;

g[0] *= (v * f[0]); // 0 == revbin(0)
if ( n>=2 ) g[1] *= (v * f[1]); // 1 == revbin(nh)
if ( n<4 )   return;
v *= 0.5;
unsigned long nh = (n>>1);
unsigned long r=nh, rm=n-1; // nh == revbin(1),   n1-1 == revbin(n-1)
double xi  ,    xj, yi,yj;
//fht_mul(f[r], f[rm], g[r], g[rm], v);
//         xi    xj     y1     y2
xi = f[r];
xj = f[rm];
yi = g[r];
yj = g[rm];
g[r]  = v*( ( xi + xj)*yi + (xi - xj)*yj );
g[rm] = v*( (-xi + xj)*yi + (xi + xj)*yj );

unsigned long k=2, km=n-2;
while ( k<nh ){
// k even:
rm -= nh;
unsigned long tr = r;
r^=nh; for (unsigned long m=(nh>>1); !((r^=m)&m); m>>=1)    {;}
//fht_mul(f[r], f[rm], g[r], g[rm], v);
xi = f[r];
xj = f[rm];
yi = g[r];
yj = g[rm];
g[r]  = v*( ( xi + xj)*yi + (xi - xj)*yj );
g[rm] = v*( (-xi + xj)*yi + (xi + xj)*yj );
--km;
++k;
// k odd:
rm += (tr-r);
r += nh;
//fht_mul(f[r], f[rm], g[r], g[rm], v);
xi = f[r];
xj = f[rm];
yi = g[r];
yj = g[rm];
g[r]  = v*( ( xi + xj)*yi + (xi - xj)*yj );
g[rm] = v*( (-xi + xj)*yi + (xi + xj)*yj );
--km;
++k;
}
return;
}
```

That code looks more complicated than it actually is. See the pseudo-code in the book for some explanation.
Getting all together:

```int fft(double *x, double *y, unsigned long length){
unsigned long n;
/* Checks and balances omitted */
n = length;
fht_dif(x, highbit(n));
fht_dif(y, highbit(n));
fht_conv_core(x, y,highbit(n), 0.0);
fht_dit(y, highbit(n));
return MP_OKAY;
}
```

Which allows for the following little function to multiply two `mp_int`:

```  double *fa, *fb;
fa = NULL;fb = NULL;
int length, e;
if( ( e = dp_to_fft(a, &fa, b, &fb, &length)) != MP_OKAY){
if(fa != NULL) free(fa);
if(fb != NULL) free(fb);
return e;
}
if( ( e = fft(fa, fb, length)) != MP_OKAY){
if(fa != NULL) free(fa);
if(fb != NULL) free(fb);
return e;
}
if( ( e = fft_to_dp(fb, c, length)) != MP_OKAY){
if(fa != NULL) free(fa);
if(fb != NULL) free(fb);
return e;
}
free(fa);free(fb);
return MP_OKAY;
```

The result is a bit slower than native multiplication up to about 300.000 limbs. We can speed it up, very slightly, with a recursive variant that loops until the length of the chunk is smaller than twice the size of a `double>` times the size of the L1-cache in bytes (Find out with e.g.: Linux `cat /proc/cpuinfo | grep cache`).

Searching the web gave a link to a program named am for modeling atmosphere by the Smithsonian Astrophysical Observatory. The file transform.c contains several transforms derived from Jörg Arndt’s book (without the convolution code(s)). The license is very relaxed short of being public domain (it was once the case for everything the US-governement published, they seem to have changed it. Shall I guess who lobbied for that?).
Using the implementation of their Hartley transform (revbin code stripped) makes the whole thing even faster than the native multiplication. For very large numbers only (above ca 4.000 limbs), but still! Modern CPUs with large L1-caches might be able to take a lot more advantage of the algorithm.
Oh, squaring, yes, forgot squaring, sorry. But squaring is just the same, only with a single multiplicand this time. It is called auto-convolution and the book I mentioned about half a dozen times has it, too. No, wait, it doesn’t? OK, the simplest way is, of course, to abuse the normal version

```int fht_autoconv_core(double *f,unsigned long ldn, double v/*=0.0*/){
unsigned long n = (1UL<<ldn);
if ( v==0.0 )  v = 1.0/n;

f[0] *= (v * f[0]); // 0 == revbin(0)
if ( n>=2 ) f[1] *= (v * f[1]); // 1 == revbin(nh)
if ( n<4 )   return;
v *= 0.5;
unsigned long nh = (n>>1);
unsigned long r=nh, rm=n-1; // nh == revbin(1),   n1-1 == revbin(n-1)
double xi  ,    xj, yi,yj;
//fht_mul(f[r], f[rm], f[r], f[rm], v);
//         xi    xj     y1     y2
xi = f[r];
xj = f[rm];
yi = f[r];
yj = f[rm];
f[r]  = v*( (xi + xj)*yi + (xi - xj)*yj );
f[rm] = v*( (-xi + xj)*yi + (xi + xj)*yj );

unsigned long k=2, km=n-2;
while ( k<nh ){
// k even:
rm -= nh;
unsigned long tr = r;
r^=nh; for (unsigned long m=(nh>>1); !((r^=m)&m); m>>=1)    {;}
//fht_mul(f[r], f[rm], f[r], f[rm], v);
xi = f[r];
xj = f[rm];
yi = f[r];
yj = f[rm];
f[r]  = v*( (xi + xj)*yi + (xi - xj)*yj );
f[rm] = v*( (-xi + xj)*yi + (xi + xj)*yj );
--km;
++k;
// k odd:
rm += (tr-r);
r += nh;
//fht_mul(f[r], f[rm], f[r], f[rm], v);
xi = f[r];
xj = f[rm];
yi = f[r];
yj = f[rm];
f[r]  = v*( (xi + xj)*yi + (xi - xj)*yj );
f[rm] = v*( (-xi + xj)*yi + (xi + xj)*yj );
--km;
++k;

}
return MP_OKAY;
}
```

And because I have it all written to test I can just paste it here without further cost 😉

```int dp_to_fft_single(mp_int *a, double **fa, int *length){
int length_a,  length_needed, i,j, hb;
double *fft_array_a;
length_a = a->used;
length_needed = ( length_a * 2   )*2 ;
/* final length must be a power of two to keep it simple */
hb = highbit((unsigned  long) length_needed );
/* check for the rare case that it is already a power of 2 */
if(length_needed != 1<<hb){
length_needed = 1<<(hb+1);
}
*length = length_needed;
fft_array_a = malloc(sizeof(double) * length_needed);
for(i = 0,j=0;i<length_needed/2;i++,j+=2){
if(i < length_a){
fft_array_a[j]   = (double) (a->dp[i]         & MP_DIGIT_MASK);
fft_array_a[j+1] = (double)((a->dp[i] >> MP_DIGIT_BIT_HALF ) & MP_DIGIT_MASK);
}
if(i >= length_a){
fft_array_a[j]   = 0.0;
fft_array_a[j+1] = 0.0;
}
}
*fa = fft_array_a;
return MP_OKAY;
}
```

The reversal `dp_to_fft()` is the same. The FFT function is a bit different here, it calls the functions from the atmosphere-modeling software mentioned above. The main difference is one argument more (which does not get used, btw.) and the length is the full length, not the integral base two logarithm.

```int fft_sqr(double *x, unsigned long length){
unsigned long n;
n = (length);
fht_dif_rec(x,(n),1);
fht_autoconv_core(x,highbit(n), 0.0);
fht_dit_rec(x, (n),1);
return MP_OKAY;
}
```

And the main caller:

```int mp_sqr_fft(mp_int *a, mp_int *c){
double *fa;
fa = NULL;
int length, e;
if( ( e = dp_to_fft_single(a, &fa,  &length)) != MP_OKAY){
if(fa != NULL) free(fa);
return e;
}
if( ( e = fft_sqr(fa, length)) != MP_OKAY){
if(fa != NULL) free(fa);
return e;
}
if( ( e = fft_to_dp(fa, c, length)) != MP_OKAY){
if(fa != NULL) free(fa);
return e;
}
free(fa);
return MP_OKAY;
}
```

Some benchmarks:

```/* Code used to build large numbers */
void mp_bignum(mp_int *a,int length){
int n=length;
mp_rand(a,10);
while(n--) mp_sqr(a,a);
}
```

The benchmarks here have been done with the FHT implementation from the am-software.

Multiplying (cut-off at about ca 4,000 limbs):
Limbs : 1,308,945
Mult. native: 98.02000000 seconds
Mult. FFT : 9.28000000 seconds

Squaring (cut-off at about ca 20,000 limbs):
Limbs : 1,308,945
Mult. native: 42.79000000 seconds
Mult. FFT : 5.78000000 seconds

The benchmarks and cut-offs are a bit distracting because the length of the working array grows stepwise, one power of two after the other, so exact cut-offs are not one single number but a function or a look-up table.

Another benchmark: the factorial of a million.
Normal: 762.60000000 seconds
FFT : 746.66000000 second
Mmh…a bit disappointing? The small difference has a simple reason: the number of times the FFT squaring runs is only four. The places are
hit at 30125 limbs
hit at 64705 limbs
hit at 138313 limbs
hit at 294459 limbs
Much more is possible with the final primorial. If we would be able to compute it with binary splitting we could save a significant amont of time.

If we use FFT with the Binary Splitting algorithm (var.:Luschny) we get a very significant result with the factorial of 100,000:

BinSplit: 10.55000000 seconds
MUL hit at 4986 limbs
MUL hit at 5879 limbs
MUL hit at 10864 limbs
MUL hit at 6911 limbs
MUL hit at 12650 limbs
MUL hit at 23513 limbs
BinSplit fft: 1.32000000 seconds

That is even faster than the Borwein variant! Lets try one million:
MUL hit at 6412
MUL hit at 4254
MUL hit at 7528
MUL hit at 13940
MUL hit at 4170
MUL hit at 4264
MUL hit at 8517
MUL hit at 16171
MUL hit at 30110
MUL hit at 4286
MUL hit at 4314
MUL hit at 8583
MUL hit at 4297
MUL hit at 4308
MUL hit at 8596
MUL hit at 17159
MUL hit at 33213
MUL hit at 63323
MUL hit at 4297
MUL hit at 4305
MUL hit at 8598
MUL hit at 4308
MUL hit at 4318
MUL hit at 8600
MUL hit at 17215
MUL hit at 4300
MUL hit at 4311
MUL hit at 8607
MUL hit at 4320
MUL hit at 4313
MUL hit at 8624
MUL hit at 17229
MUL hit at 34435
MUL hit at 67558
MUL hit at 130881
MUL hit at 4325
MUL hit at 4313
MUL hit at 8631
MUL hit at 4321
MUL hit at 4310
MUL hit at 8634
MUL hit at 17258
MUL hit at 4306
MUL hit at 4323
MUL hit at 8623
MUL hit at 4316
MUL hit at 4323
MUL hit at 8631
MUL hit at 17267
MUL hit at 34532
MUL hit at 4310
MUL hit at 4326
MUL hit at 8635
MUL hit at 4323
MUL hit at 4329
MUL hit at 8647
MUL hit at 17278
MUL hit at 4327
MUL hit at 4332
MUL hit at 8648
MUL hit at 4325
MUL hit at 4327
MUL hit at 8650
MUL hit at 17300
MUL hit at 34568
MUL hit at 69081
MUL hit at 136467
MUL hit at 267347
BinSplit fft: 479.58000000 seconds

That is definitely a progress! 😉

[1] An implementation of the Schönhage–Strassen algorithm is in the works.