# On the numerical evaluation of factorials VIII

The code for the FFT multiplication is working now and has been uploaded to my fork of libtommath. It is still limited to 28-bit digits; the Schönhage-Strassen algorithm should let us get rid of that ancient limit, but it is still in the works.
BTW: for reasons unknown to me, not all commits of the original libtommath made it into my fork. One of those commits fixed a bug that prevented GCC from compiling libtommath, so I put it in manually, which is not the best way to do it, but as mentioned above: I still don't know what I did wrong when I forked libtommath.

The cut-offs and limits may need some adjusting for processors and architectures other than my poor little 1 GHz AMD Duron.

Now that FFT multiplication is working we can get back to the factorial. The last state was about 12 minutes for the factorial of one million, if I remember correctly. That is definitely way too much, even for such an old processor as the one in my box. To avoid linking to the last post… oops, never mind. However, the last code was:

```c
int factorial(unsigned long n, mp_int *result)
{
   unsigned long *p_list;
   unsigned long *exp_list;
   unsigned long p, i, cut;
   long bit;
   int shift;

   unsigned long r = 0;
   p_list = fill_prime_list(3, n + 1, &r);
   exp_list = malloc(sizeof(unsigned long) * (r + 1));
   if (exp_list == NULL) {
      return MP_MEM;
   }

   mp_set_int(result, 1);
   /* the exponent of two is handled by a single shift at the end */
   shift = (int) prime_divisors(n, 2LU);
   /* primes above n/2 occur in n! with an exponent of exactly one */
   cut = n / 2;
   for (p = 0; p < r; p++) {
      if (p_list[p] > cut) {
         break;
      }
      exp_list[p] = prime_divisors(n, p_list[p]);
   }
   /* square-and-multiply over the bits of the exponents */
   bit = highbit(exp_list[0]);
   for (; bit >= 0; bit--) {
      mp_sqr(result, result);
      for (i = 0; i < p; i++) {
         if ((exp_list[i] & (1UL << bit)) != 0) {
            mp_mul_d(result, (mp_digit) p_list[i], result);
         }
      }
   }
   /* the remaining primes, all with exponent one */
   for (; p < r; p++) {
      mp_mul_d(result, (mp_digit) p_list[p], result);
   }
   mp_mul_2d(result, shift, result);
   free(exp_list);
   return MP_OKAY;
}
```
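The helpers `fill_prime_list`, `prime_divisors`, and `highbit` were introduced in earlier parts of this series. As a reminder of what they have to deliver, here is a rough sketch of the latter two (hypothetical stand-ins, not the originals): the exponent of a prime p in n! follows from Legendre's formula, and `highbit` is just the index of the highest set bit.

```c
/* Exponent of the prime p in n!, by Legendre's formula:
   floor(n/p) + floor(n/p^2) + floor(n/p^3) + ... */
static unsigned long legendre_exponent(unsigned long n, unsigned long p)
{
   unsigned long e = 0;
   while (n > 0) {
      n /= p;
      e += n;
   }
   return e;
}

/* Index of the highest set bit of x; -1 for x == 0 */
static long highest_bit(unsigned long x)
{
   long b = -1;
   while (x != 0) {
      x >>= 1;
      b++;
   }
   return b;
}
```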

Anything open for optimization? What about the last loop, the one over the primes with an exponent of 1 (one)? It multiplies each of those primes into the already large result, one by one. What about collecting them in a temporary variable instead and multiplying the result by that temporary variable in a single step?

```c
#define MP_FACTORIAL_BORW_PRIMORIAL_CUTOFF 200000

int factorial(unsigned long n, mp_int *result)
{
   unsigned long *p_list;
   unsigned long *exp_list;
   unsigned long p, i, cut;
   long bit;
   int shift;
   mp_int temp;

   unsigned long r = 0;
   p_list = fill_prime_list(3, n + 1, &r);
   exp_list = malloc(sizeof(unsigned long) * (r + 1));
   if (exp_list == NULL) {
      return MP_MEM;
   }

   mp_set_int(result, 1);
   shift = (int) prime_divisors(n, 2LU);
   cut = n / 2;
   for (p = 0; p < r; p++) {
      if (p_list[p] > cut) {
         break;
      }
      exp_list[p] = prime_divisors(n, p_list[p]);
   }
   bit = highbit(exp_list[0]);
   for (; bit >= 0; bit--) {
      mp_sqr(result, result);
      for (i = 0; i < p; i++) {
         if ((exp_list[i] & (1UL << bit)) != 0) {
            mp_mul_d(result, (mp_digit) p_list[i], result);
         }
      }
   }
   if (n < MP_FACTORIAL_BORW_PRIMORIAL_CUTOFF) {
      for (; p < r; p++) {
         mp_mul_d(result, (mp_digit) p_list[p], result);
      }
   } else {
      /* collect the remaining primes in a temporary first and do one
         balanced big multiplication instead of many lopsided small ones */
      mp_init(&temp);
      mp_set_int(&temp, 1);
      for (; p < r; p++) {
         mp_mul_d(&temp, (mp_digit) p_list[p], &temp);
      }
      mp_mul(result, &temp, result);
      mp_clear(&temp);
   }
   mp_mul_2d(result, shift, result);
   free(exp_list);
   return MP_OKAY;
}
```

After some tests I found a cut-off point at 200,000. The factorial of one million is now available in 6 (six) minutes, half the time it took before!
Can we do more? Well, let's see: the primes are consecutive, increasing, and already in an array. That should enable us to use the binary-splitting method. It is a bit more complicated here because we get into big-number territory in parts of it, but that is good, because we can use some of the optimizations for the multiplication of large numbers. The sizes, or better: the magnitudes, the numbers of digits, of the multiplicands are also roughly equal, ideal for the implemented Toom-Cook 2 and 3 and FFT.
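The recursion shape of binary splitting is easiest to see with machine integers first (a toy illustration only, names made up for this sketch, no big numbers involved):

```c
#include <stdint.h>

/* Product of list[a..b] by binary splitting: cut the range in the
   middle and multiply the two half-products.  The two halves stay
   roughly the same magnitude, which is exactly what the balanced
   multiplication algorithms (Toom-Cook, FFT) like best. */
static uint64_t split_product(const uint64_t *list,
                              unsigned long a, unsigned long b)
{
   unsigned long c;
   if (a == b) {
      return list[a];
   }
   c = a + (b - a) / 2;
   return split_product(list, a, c) * split_product(list, c + 1, b);
}
```

The big-number version below does the same thing, only with `mp_int`s and a shortcut for products that still fit into a single digit.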

```c
static unsigned long *primelist;
static mp_int p_temp;

mp_int *primorial__lowlevel(unsigned long a, unsigned long b,
                            unsigned long p)
{
   unsigned long c;
   mp_word prod;
   mp_int ta, tb;

   if (b == a) {
      mp_set_int(&p_temp, p);
      return &p_temp;
   }
   if (b - a > 1) {
      c = (b + a) >> 1;
      mp_init_multi(&ta, &tb, NULL);
      /* the deep copies: each recursive call returns its result in the
         shared p_temp, so it must be copied out immediately */
      mp_copy(primorial__lowlevel(a, c, primelist[a]), &ta);
      mp_copy(primorial__lowlevel(c + 1, b, primelist[b]), &tb);
      /* two single-digit halves may still fit into a single digit */
      if (ta.used == 1 && tb.used == 1) {
         if (ta.dp[0] <= MP_DIGIT_HALF || tb.dp[0] <= MP_DIGIT_HALF) {
            prod = (mp_word) ta.dp[0] * tb.dp[0];
            if (prod < MP_DIGIT_SIZE) {
               mp_set_int(&p_temp, (mp_digit) prod);
               mp_clear_multi(&ta, &tb, NULL);
               return &p_temp;
            }
         }
      }
      mp_mul(&ta, &tb, &p_temp);
      mp_clear_multi(&ta, &tb, NULL);
      return &p_temp;
   }
   /* exactly two primes left */
   if (primelist[a] >= MP_DIGIT_HALF || primelist[b] >= MP_DIGIT_HALF) {
      mp_set_int(&p_temp, primelist[a]);
      mp_mul_d(&p_temp, (mp_digit) primelist[b], &p_temp);
   } else {
      c = primelist[a] * primelist[b];
      mp_set_int(&p_temp, c);
   }
   return &p_temp;
}

void primorial(unsigned long a, unsigned long b, mp_int *result)
{
   unsigned long r = 0;
   primelist = fill_prime_list(a + 1, b + 1, &r);
   /* initialize the shared temporary once, not on every recursion */
   mp_init(&p_temp);
   mp_copy(primorial__lowlevel(0, r - 1, 1), result);
   mp_clear(&p_temp);
}
```

It even needs some quasi-global (static) variables, which is rarely a good idea, but avoiding them would clutter the code even more. However, let's put it in to see if it helps.

```c
#define MP_FACTORIAL_BORW_PRIMORIAL_CUTOFF 200000

int factorial(unsigned long n, mp_int *result)
{
   unsigned long *p_list;
   unsigned long *exp_list;
   unsigned long p, i, cut;
   long bit;
   int shift;
   mp_int temp;

   unsigned long r = 0;
   p_list = fill_prime_list(3, n + 1, &r);
   exp_list = malloc(sizeof(unsigned long) * (r + 1));
   if (exp_list == NULL) {
      return MP_MEM;
   }

   mp_set_int(result, 1);
   shift = (int) prime_divisors(n, 2LU);
   cut = n / 2;
   for (p = 0; p < r; p++) {
      if (p_list[p] > cut) {
         break;
      }
      exp_list[p] = prime_divisors(n, p_list[p]);
   }
   bit = highbit(exp_list[0]);
   for (; bit >= 0; bit--) {
      mp_sqr(result, result);
      for (i = 0; i < p; i++) {
         if ((exp_list[i] & (1UL << bit)) != 0) {
            mp_mul_d(result, (mp_digit) p_list[i], result);
         }
      }
   }
   if (n < MP_FACTORIAL_BORW_PRIMORIAL_CUTOFF) {
      for (; p < r; p++) {
         mp_mul_d(result, (mp_digit) p_list[p], result);
      }
   } else {
      /* the primes above n/2, computed by binary splitting */
      mp_init(&temp);
      primorial(cut, n, &temp);
      mp_mul(result, &temp, result);
      mp_clear(&temp);
   }
   mp_mul_2d(result, shift, result);
   free(exp_list);
   return MP_OKAY;
}
```

Result: roughly the same. The theoretical complexity is sometimes different from that of the implementation; the main culprit here is most probably the deep copy necessary in our implementation. It will win for much larger numbers, though.
But the main loop contains another loop of the same kind. Let's apply our first trick, the temporary variable, there as well:

```c
#define MP_FACTORIAL_BORW_LOOP_CUTOFF      500000
#define MP_FACTORIAL_BORW_PRIMORIAL_CUTOFF 200000

int factorial(unsigned long n, mp_int *result)
{
   unsigned long *p_list;
   unsigned long *exp_list;
   unsigned long p, i, cut;
   long bit;
   int shift;
   mp_int temp;

   unsigned long r = 0;
   p_list = fill_prime_list(3, n + 1, &r);
   exp_list = malloc(sizeof(unsigned long) * (r + 1));
   if (exp_list == NULL) {
      return MP_MEM;
   }

   mp_set_int(result, 1);
   shift = (int) prime_divisors(n, 2LU);
   cut = n / 2;
   for (p = 0; p < r; p++) {
      if (p_list[p] > cut) {
         break;
      }
      exp_list[p] = prime_divisors(n, p_list[p]);
   }
   mp_init(&temp);
   bit = highbit(exp_list[0]);
   if (n < MP_FACTORIAL_BORW_LOOP_CUTOFF) {
      for (; bit >= 0; bit--) {
         mp_sqr(result, result);
         for (i = 0; i < p; i++) {
            if ((exp_list[i] & (1UL << bit)) != 0) {
               mp_mul_d(result, (mp_digit) p_list[i], result);
            }
         }
      }
   } else {
      for (; bit >= 0; bit--) {
         mp_sqr(result, result);
         /* collect the small factors of this round in temp and
            multiply them into the big result in one go */
         mp_set_int(&temp, 1);
         for (i = 0; i < p; i++) {
            if ((exp_list[i] & (1UL << bit)) != 0) {
               mp_mul_d(&temp, (mp_digit) p_list[i], &temp);
            }
         }
         mp_mul(&temp, result, result);
      }
   }
   if (n < MP_FACTORIAL_BORW_PRIMORIAL_CUTOFF) {
      for (; p < r; p++) {
         mp_mul_d(result, (mp_digit) p_list[p], result);
      }
   } else {
      primorial(cut, n, &temp);
      mp_mul(result, &temp, result);
   }
   mp_mul_2d(result, shift, result);
   mp_clear(&temp);
   free(exp_list);
   return MP_OKAY;
}
```

After some test runs the cut-off found is somewhere around half a million.
Run time for the factorial of one million: a mere 25 (twenty-five) seconds! Oh, and with the linear multiplication instead of the binary splitting of the final primorial it is about 30 seconds. Now we have a significant difference. It is the same difference as before, but five seconds hidden in six minutes are hard to detect.

I tried ten million, but with only half a gig of RAM and some memory-hungry apps like the Firefox I use to write this post, it starts to swap like hell quite fast and needs a bit less than 13 minutes. That sounds like a lot, but the factorial of ten million is a large number, about 1.2024e65657059. Yes, a number with 65,657,060 decimal digits.
I tried the ten-million factorial without a running Firefox and got about 10 seconds saved. Not much, even if the sound the hard drive made gave the impression of ten minutes instead 😉
The factorial of ten million is large enough to allow a better test of whether binary splitting is indeed faster. And it is: with linear multiplication instead of the binary-splitting method the run time nearly doubles and passes 24 minutes. So the theory is correct, at least 😉

BTW: I just saw that libtommath shares its birthday with me 😉