# On Binary Splitting

The method of binary splitting to evaluate sums of rational numbers is well known. A slightly more detailed overview (with some code examples) is given in [2] (preprint in [1]) for the sums of rational numbers.

The binary splitting algorithm uses the divide&conquer method to keep the single operations as small as possible for as long as possible. Another advantage of this method is that all operands are of roughly the same size which is favored by all of the fast multiplication algorithms like e.g.: FFT. Asymmetric Toom-Cook algorithms exist but only for a handful of different relations.

The basic idea for divide&conquer is to evaluate $a \odot b \odot c \odot d$ (with $a \odot b = b \odot a$) as $(a \odot b) \odot (c \odot d)$. That keeps the sizes equal if the sizes of $\{a,b,c,d\}$ are equal, too. In times of multi-core processors the ability to process as many of the operations in parallel as there are cores available is also not a thing one should deem as insignificant too easily.

But there is also something named CPU-cache, a still very finite but very fast memory where the CPU keeps its bricks and mortar it works with, so it might be a good idea too keep as many things in this memory as possible. Here divide&conquer is not the most ideal algorithm because it has to grab its material from different parts of the memory a lot of times. Although the access to memory can be assumed to be $\mathcal{O}(1)$ but the hidden constant can get very large for the different types of memory. So large, that it might be better to use an asymptotically worse algorithm if it is able to keep all data in the CPU-cache. In the case of multiplication it is the school book algorithm here for the first level if $a\odot b$. fits into a single CPU register, e.g. the product of two 32-bit values fits into one 64-bit CPU-register.

An example with libtommath (checks and balances omitted)

#define BIN_SPLIT_TUNABLE_CONSTANT 64
void mp_bin_split(mp_digit *array, int n, mp_int * result)
{
mp_digit first_half, second_half;
int i;
mp_int tmp;

if (n == 0) {
mp_set(result, 1);
return;
}
if (n <= BIN_SPLIT_TUNABLE_CONSTANT) {
mp_set(result, array[0]);
for (i = 1; i < n; i++)
mp_mul_d(result, array[i], result);
return;
}

first_half = n / 2;
second_half = n - first_half;
mp_bin_split(array, second_half, result);
mp_init(&tmp);
mp_bin_split(array + second_half, first_half, &tmp);
mp_mul(result, &tmp, result);
mp_clear(&tmp);
}


If BIN_SPLIT_TUNABLE_CONSTANT is set very small it could be a fruitful idea to do the splitting by hand (mp_word is defined to be large enough to hold the product of two mp_digit).

/* ---- snip --- */
if (n <= 8 ) {
mp_set(result, array[0]);
for (i = 1; i < n; i++)
mp_mul_d(result, array[i], result);
return;
}
if (n == 8 ) {
mp_word s1,s2,s3,s4;
mp_digit carry, a[8],b[8],
c1[8]={0,0,0,0,0,0,0,0},
c2[8]={0,0,0,0,0,0,0,0},*temp;
double C1[16],C2[16];
s1 = array[0] * array[1];
array[0] = (unsigned long) (s1  & ((mp_word) MP_MASK));
array[1] = (unsigned long) (s1 >> ((mp_word) DIGIT_BIT));
s2 = array[2] * array[3];
array[2] = (unsigned long) (s2  & ((mp_word) MP_MASK));
array[3] = (unsigned long) (s2 >> ((mp_word) DIGIT_BIT));
s3 = array[4] * array[5];
array[4] = (unsigned long) (s3  & ((mp_word) MP_MASK));
array[5] = (unsigned long) (s3 >> ((mp_word) DIGIT_BIT));
s4 = array[6] * array[7];
array[6] = (unsigned long) (s4  & ((mp_word) MP_MASK));
array[7] = (unsigned long) (s4 >> ((mp_word) DIGIT_BIT));
if(array[1] == 0 && array[3] == 0 && array[5] == 0 && array[7] == 0){
s1 = s1 * s2;
a[0] = (unsigned long) (s1  & ((mp_word) MP_MASK));
a[1] = (unsigned long) (s1 >> ((mp_word) DIGIT_BIT));
s2 = s3 * s4;
b[0] = (unsigned long) (s2  & ((mp_word) MP_MASK));
b[1] = (unsigned long) (s2 >> ((mp_word) DIGIT_BIT));
if(a[1] == 0 && b[1] == 0){
s1 = s1 * s2;
a[0] = (unsigned long) (s1  & ((mp_word) MP_MASK));
a[1] = (unsigned long) (s1 >> ((mp_word) DIGIT_BIT));
if(a[1] == 0){
result->dp[0] = a[0];
result->used = 1;
return;
}
result->dp[0] = a[0];
result->dp[1] = a[1];
result->used = 2;
return;
}
k = (a[1] == 0)?1:2;
l = (b[1] == 0)?1:2;
temp = result->dp;
for(i=0;i<k;i++){
carry = 0;
temp = result->dp + i;
for(j=0;j<MIN(l,4-i);j++){
s1 =  ((mp_word)*temp) + (mp_word)a[i] * (mp_word)b[j] + (mp_word)carry;
*temp++ = (unsigned long) (s1  & ((mp_word) MP_MASK));
carry  = (unsigned long) (s1 >> ((mp_word) DIGIT_BIT));
}
if(i+j<4){
*temp = carry;
}
}
result->used = 4;
mp_clamp(result);
return;
}
else{
a[0] = array[0];
a[1] = array[1];
b[0] = array[2];
b[1] = array[3];
k = (a[1] == 0)?1:2;
l = (b[1] == 0)?1:2;
temp = c1;
for(i=0;i<k;i++){
carry = 0;
temp = c1 + i;
for(j=0;j<MIN(l,4-i);j++){
s1 =  ((mp_word)*temp) + (mp_word)a[i] * (mp_word)b[j] + (mp_word)carry;
*temp++ = (unsigned long) (s1  & ((mp_word) MP_MASK));
carry  = (unsigned long) (s1 >> ((mp_word) DIGIT_BIT));
}
if(i+j<4){
*temp = carry;
}
}
a[0] = array[4];
a[1] = array[5];
b[0] = array[6];
b[1] = array[7];

k = (a[1] == 0)?1:2;
l = (b[1] == 0)?1:2;
temp = c2;
for(i=0;i<k;i++){
carry = 0;
temp = c2 + i;
for(j=0;j<MIN(l,4-i);j++){
s1 =  ((mp_word)*temp) + (mp_word)a[i] * (mp_word)b[j] + (mp_word)carry;
*temp++ = (unsigned long) (s1  & ((mp_word) MP_MASK));
carry  = (unsigned long) (s1 >> ((mp_word) DIGIT_BIT));
}
if(i+j<4){
*temp = carry;
}
}
for(i = 0,j=0;i<8;i++,j+=2){
if(i < 8){
C1[j]   = (double) (c1[i] & MP_DIGIT_MASK);
C1[j+1] = (double)((c1[i] >> MP_DIGIT_BIT_HALF ) & MP_DIGIT_MASK);
C2[j]   = (double) (c2[i] & MP_DIGIT_MASK);
C2[j+1] = (double)((c2[i] >> MP_DIGIT_BIT_HALF ) & MP_DIGIT_MASK);
}
if(i >= 8){
C1[j]   = 0.0;
C1[j+1] = 0.0;
C2[j]   = 0.0;
C2[j+1] = 0.0;
}
}

mp_fft(C1,C2,16);

carry = 0;
for(i=0;i<16;i++){
s1 = carry;
carry = 0;
s1  += (mp_word)(round(C2[i]));
if(s1 >= MP_DIGIT_HALF){
carry = s1 / (mp_word)MP_DIGIT_HALF;
s1  = s1 % (mp_word)MP_DIGIT_HALF;
}
C2[i] = (double)s1;
}
mp_grow(result,17);
for(i=0,j=0;j<16;i++,j+=2){
result->dp[i] <<= MP_DIGIT_BIT_HALF;
result->used++;
}
if(carry){
result->dp[i] = carry;
result->used++;
}

mp_clamp(result);

return;
}
}
/* ---- snap --- */


And if you think you have seen the worst waste of blog-space you’ve never met the kind of programmers who’s products are described in detail at thedailywtf.com.

[1] Haible, Bruno, and Thomas Papanikolaou. “Fast multiprecision evaluation of series of rational numbers.” (1997). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.3698&rep=rep1&type=pdf
[2] Haible, Bruno, and Thomas Papanikolaou. “Fast multiprecision evaluation of series of rational numbers”. Springer Berlin Heidelberg, 1998.

# Adventures of a Programmer: Parser Writing Peril XIII

The biggest native integer libtommath allowed to set directly seems to be[1] an unsigned long in the function mp_set_int. The biggest native integer used, on the other side, is hidden behind mp_word which is the type able to hold twice the size of mp_digit and can be larger than an unsigned long.

I need for my calculator some ways to work with native numbers without much ado where ado means a lot of conditionals, preprocessor directives, complicated data structures and all that mess. One of the ways to avoid it is to use the digits of the large integers directly if the large integer has only one. An example? Ok, an example.

Take the partial harmonic series, for example

${\displaystyle { \mathrm{H}_n = \sum_{k=1}^n \frac{1}{k} }}$

If you calculate it with the help of the binary-splitting algorithm, especially than, a lot of the numbers involved are in the range of native integers and hold only on digit of the big numbers. The initialization of the big numbers in libtommath set them to 8 digits at least (the responsible variable is MP_PREC in tommath.h) and consumes costly heap memory to do so.

Fredrik Johansson proposed in a blogpost to postpone the reducing of the fraction to the very end. It is not much but it is something so let’s follow his advice and do so using my rational library (as mostly: without any error checking for less code-cluttering).

static mp_rat h_temp;
mp_rat *_harmonics(unsigned long a, unsigned long b)
{
unsigned long m;
mp_rat ta, tb;
mp_int p, q, r, s;
mp_word ps, qr;
int e;

if (b - a == 1) {
mpq_set_int(&h_temp, (long) 1, (long) a);
return &h_temp;
}
m = (a + b) >> 1;
mpq_init_multi(&ta, &tb, NULL);
// This is not necessarily necessary
mp_init_multi(&p, &q, &r, &s, NULL);

mpq_exch(_harmonics(a, m), &ta);
mpq_exch(_harmonics(m, b), &tb);

mp_exch(&ta.numerator, &p);
mp_exch(&ta.denominator, &q);
mp_exch(&tb.numerator, &r);
mp_exch(&tb.denominator, &s);

if ((&p)->used == 1 && (&s)->used == 1) {
ps = (&p)->dp[0] * (mp_word) (&s)->dp[0];
mp_set_word(&ta.numerator, ps);
} else {
mp_mul(&p, &s, &ta.numerator);
}
if ((&q)->used == 1 && (&r)->used == 1) {
qr = (&q)->dp[0] * (mp_word) (&r)->dp[0];
mp_set_word(&tb.numerator, qr);

} else {
mp_mul(&q, &r, &tb.numerator);
}
mp_mul(&q, &s, &h_temp.denominator);

mp_clear_multi(&p, &q, &r, &s, NULL);
mpq_clear_multi(&ta, &tb, NULL);
return &h_temp;
}
int harmonics(unsigned long n, mp_rat * c)
{
mpq_init(&h_temp);
mpq_exch(_harmonics(1, n + 1), c);
mpq_reduce(c);
mpq_clear(&h_temp);
return 1;
}


The library libtommath is not very friendly if used in such a way but I cannot find out a better way to implement the binary splitting algorithm. This implementation of the partial harmonic series for example, is much slower than Fredriks implementation with gmpy (but faster than the native Python one, at least ðŸ˜‰ ). It takes about 0.67 seconds for $\mathop{\mathrm{H}}_{10\,000}$ but already 193.61 seconds—yes, over 3 minutes!— for $\mathop{\mathrm{H}}_{100\,000}$. That is definitely too much.
Funny thing: the normal algorithm is much faster, just 40 seconds for $\mathop{\mathrm{H}}_{100\,000}$ but also slower for smaller values, like about 1.09 seconds for $\mathop{\mathrm{H}}_{10\,000}$ whith the cut-off point at about $\mathop{\mathrm{H}}_{21\,000}$ on my machine. And it is asymptotically slower, I measured some points in between to find out. It is really time to implement fast multiplication in full in libtommath.
Some of the problems with the first algorithm might have their reason in the nonexistent balancing of the multiplicands. There is a balancing function in the pull-queue but it seems to be a port from Ruby which makes it impossible to accept because of Ruby’s license (given that the submitter is not the original programmer of the Ruby code which I haven’t checked.)

static mp_rat h_temp;
mp_rat * _harmonics(unsigned long a,unsigned long b){
unsigned long m;
mp_rat ta,tb;
if(b-a==1){
mpq_set_int(&h_temp,(long)1,(long)a);
return &h_temp;
}
m = (a+b)>>1;
mpq_init_multi(&ta,&tb,NULL);

mpq_exch(_harmonics(a,m),&ta);
mpq_exch(_harmonics(m,b),&tb);

mpq_clear_multi(&ta,&tb,NULL);
return &h_temp;
}
// use the same caller as above


However, it was just meant to be used as an example for mp_set_word which I’ll present here:

#if (MP_PREC > 1)
int mp_set_word(mp_int *c,mp_word w){
mp_zero(c);
if(w == 0){
return MP_OKAY;
}
do{
}while( (w >>= DIGIT_BIT) > 0 && c->used < MP_PREC);
if( w != 0 ){
return MP_VAL;
}
return MP_OKAY;
}
#else
#warning variable "MP_PREC" should be at least 2
#endif


The preprocessor mess is necessary even if the constant MP_PREC should be at least 8 (eight) but, as I can tell you from some very bad experience, one never knows.

[1] The reason for the stylistically awkward subjunctive is: I mean it. I really could have overseen an already implemented function doing exactly what I wanted in the first place and hence is not a case of NIH. This time ðŸ˜‰

# On the numerical evaluation of factorials VIII

The code for the FFT multiplication is working now and had been uploaded to my fork of libtommath. It is still limited to the 28 bit digits, The SchÃ¶nberg-Strassen algorithm should be able to let us get rid of that ancient limit bt it is still in the works.
BTW: for some reasons unknown to me, not all commits of the original libtommath are in my fork of it. One of those commits fixed a bug that prevented GCC from compiling libtommath, I put it in there manually wich is not the very best method to do it, but as mentioned above: I still don’t know what I made wrong when I forked libtommath.

The cut-offs and limits may need some adjusting for other processors and architectures then my poor little 1GHz AMD Duron.

Now that FFT multiplication is working we can get back to the factorial. The last state was about 12 minutes for the factorial of one million if I remember it correctly. That is definitely way too much even for such an old processor like the one in my box. To avoid linking to the last post…oops, never mind. However, the last code was:

int factorial(unsigned long n, mp_int *result){
unsigned long *p_list;
unsigned long *exp_list;
unsigned long p, i,cut;
long bit;
int shift;
mp_int temp;

unsigned long  r=0;
p_list = fill_prime_list(3, n+1, &r);
exp_list = malloc(sizeof(unsigned long)* (r+1));
if(exp_list == NULL){ return MP_MEM;}

mp_set_int(result,1);
shift = (int)prime_divisors(n,2LU);
cut = n / 2 ;
for(p = 0; p < r; p++) {
if(p_list[p] > cut) break;
exp_list[p] = prime_divisors(n,p_list[p]);
}
bit = highbit(exp_list[0]);
for( ; bit>=0; bit--) {
mp_sqr(result, result);
for(i=0; i<p; i++) {
if ((exp_list[i] & (1 << bit)) != 0) {
mp_mul_d(result,(mp_digit)p_list[i], result);
}
}
}
for(; p < r; p++){
mp_mul_d(result,(mp_digit)p_list[p], result);
}
mp_mul_2d (result, shift, result);
return MP_OKAY;
}


Anything open for optimization? What’s about the last loop over the primes with an exponent of 1 (one)? The loop adds all of the last primes to the result. What’s about summing them in a temporary variable and multiply the result with the temporary variable instead?

#define MP_FACTORIAL_BORW_PRIMORIAL_CUTOFF 200000
int factorial(unsigned long n, mp_int *result){
unsigned long *p_list;
unsigned long *exp_list;
unsigned long p, i,cut;
long bit;
int shift;
mp_int temp;

unsigned long  r=0;
p_list = fill_prime_list(3, n+1, &r);
exp_list = malloc(sizeof(unsigned long)* (r+1));
if(exp_list == NULL){ return MP_MEM;}

mp_set_int(result,1);

shift = (int)prime_divisors(n,2LU);

cut = n / 2 ;

for(p = 0; p < r; p++) {
if(p_list[p] > cut) break;
exp_list[p] = prime_divisors(n,p_list[p]);
}
mp_init(&temp);
bit = (int)highbit(exp_list[0]);
for( ; bit>=0; bit--) {
mp_sqr(result, result);
for(i=0; i<p; i++) {
if ((exp_list[i] & (1 << bit)) != 0) {
mp_mul_d(result,(mp_digit)p_list[i], result);
}
}
}
if(n<MP_FACTORIAL_BORW_PRIMORIAL_CUTOFF){
for(; p < r; p++){
mp_mul_d(result,(mp_digit)p_list[p], result);
}
}
else {
mp_init(&temp);
mp_set_int(&temp,1);
for(; p < r; p++){
mp_mul_d(&temp,(mp_digit)p_list[p], &temp);
}
mp_mul(result,&temp,result);
}
mp_mul_2d (result, shift, result);

return MP_OKAY;
}


After some tests I found a cut-off point at 200,000. The factorial of one million is now available in 6 (six) minutes. Half of the time than before!
Can we do more? Well, let’s see: the primes are consecutive, increasing and already in an array. That should enable us to use the binary splitting method. It is a bit more complicated here because we get into big-number territory in parts of it but that is good, because we can use some of the optimizations for the multiplication of large numbers. The size, or better: magnitude, number of digits, of the multiplicands is also the same, ideal for the implemented Toom-Cook 2 and 3 and FFT.

static unsigned long *primelist;
static   mp_int p_temp;

mp_int *primorial__lowlevel(unsigned long a, unsigned long b ,
unsigned long p)
{
unsigned long  c;
unsigned long long prod;

mp_init(&p_temp);

mp_int ta,tb;

if( b == a){
mp_set_int(&p_temp,p);
return  &(p_temp);
}
if( b-a > 1){
c= (b + a) >> 1;
mp_init_multi(&ta,&tb,NULL);
mp_copy(primorial__lowlevel( a   , c , primelist[a] ),&ta);
mp_copy(primorial__lowlevel( c+1 , b , primelist[b] ),&tb);
if(ta.used == 1 && tb.used == 1){
if(ta.dp[0] <= MP_DIGIT_HALF || tb.dp[0] <= MP_DIGIT_HALF ){
prod = (mp_word)ta.dp[0] * tb.dp[0];
if( prod< MP_DIGIT_SIZE){
mp_set_int(&p_temp,(mp_digit)prod);
return &p_temp;
}
}
}
mp_mul(&ta,&tb,&p_temp) ;
mp_clear_multi(&ta,&tb,NULL);
return &p_temp;
}
if(   (primelist[a]) >=  (MP_DIGIT_HALF )
|| (primelist[b]) >=  (MP_DIGIT_HALF )   ){
mp_set_int( &p_temp,primelist[a]);
mp_mul_d(&p_temp,primelist[b],&p_temp);
}
else{
c = primelist[a]*primelist[b];
mp_set_int(&p_temp,c);
}
return &p_temp;
}
void primorial(unsigned long a, unsigned long b, mp_int *result){
unsigned long  r=0;
primelist = fill_prime_list(a+1, b+1, &r);
*result = *primorial__lowlevel(0,r-1,1);
}


It even needs some quasi global (static) variables which is rarely a good idea but avoiding them wold clobber the code even more. However, let’s put it in to see if it helps.

#define MP_FACTORIAL_BORW_PRIMORIAL_CUTOFF 200000
int factorial(unsigned long n, mp_int *result){
unsigned long *p_list;
unsigned long *exp_list;
unsigned long p, i,cut;
long bit;
int shift;
mp_int temp;

unsigned long  r=0;
p_list = fill_prime_list(3, n+1, &r);
exp_list = malloc(sizeof(unsigned long)* (r+1));
if(exp_list == NULL){ return MP_MEM;}

mp_set_int(result,1);

shift = (int)prime_divisors(n,2LU);

cut = n / 2 ;

for(p = 0; p < r; p++) {
if(p_list[p] > cut) break;
exp_list[p] = prime_divisors(n,p_list[p]);
}
mp_init(&temp);
bit = (int)highbit(exp_list[0]);
for( ; bit>=0; bit--) {
mp_sqr(result, result);
for(i=0; i<p; i++) {
if ((exp_list[i] & (1 << bit)) != 0) {
mp_mul_d(result,(mp_digit)p_list[i], result);
}
}
}
if(n<MP_FACTORIAL_BORW_PRIMORIAL_CUTOFF){
for(; p < r; p++){
mp_mul_d(result,(mp_digit)p_list[p], result);
}
}
else {
mp_init(&temp);
primorial(cut,n,&temp);
mp_mul(result,&temp,result);
}
mp_mul_2d (result, shift, result);

return MP_OKAY;
}


Result: roughly the same. The theoretical complexity is sometimes different from the one of the implementation. The main culprit here is most probably the deep copy nexessary in our implementation, but it will win for much larger numbers.
But there is another one of the same in the main loop. Let’s replace that with our first algorithm:

#define MP_FACTORIAL_BORW_LOOP_CUTOFF 500000
#define MP_FACTORIAL_BORW_PRIMORIAL_CUTOFF 200000

int factorial(unsigned long n, mp_int *result){
unsigned long *p_list;
unsigned long *exp_list;
unsigned long p, i,cut;
long bit;
int shift;
mp_int temp;

unsigned long  r=0;
p_list = fill_prime_list(3, n+1, &r);
exp_list = malloc(sizeof(unsigned long)* (r+1));
if(exp_list == NULL){ return MP_MEM;}

mp_set_int(result,1);

shift = (int)prime_divisors(n,2LU);

cut = n / 2 ;

for(p = 0; p < r; p++) {
if(p_list[p] > cut) break;
exp_list[p] = prime_divisors(n,p_list[p]);
}
mp_init(&temp);
bit = (int)highbit(exp_list[0]);
if(n < MP_FACTORIAL_BORW_LOOP_CUTOFF){
for( ; bit>=0; bit--) {
mp_sqr(result, result);
for(i=0; i<p; i++) {
if ((exp_list[i] & (1 << bit)) != 0) {
mp_mul_d(result,(mp_digit)p_list[i], result);
}
}
}
}
else{
for( ; bit>=0; bit--) {
mp_sqr(result, result);
mp_set_int(&temp,1);
for(i=0; i<p; i++) {
if ((exp_list[i] & (1 << bit)) != 0) {
mp_mul_d(&temp,(mp_digit)p_list[i], &temp);
}
}
mp_mul(&temp,result,result);
}
}
if(n<MP_FACTORIAL_BORW_PRIMORIAL_CUTOFF){
for(; p < r; p++){
mp_mul_d(result,(mp_digit)p_list[p], result);
}
}
else {
primorial(cut,n,&temp);
mp_mul(result,&temp,result);
}
mp_mul_2d (result, shift, result);
return MP_OKAY;
}


After some test runs the cut-off found is something around half a million.
Run time for the factorial of one million: a mere 25 (twenty-five) seconds! Oh, and with the linear multiplication instead of the binary splitting of the final primorial it is about 30 seconds. Now we have a significant difference. It is the same difference as before but five seconds hidden in six minutes are hard to detect.

I tried ten million but with only half a gig of RAM and some memory hungry apps like the Firefox I use to write this post, it starts to swap like hell quite fast and needs about just a bit less than 13 minutes. Sounds a lot but the factorial of ten million is a large number, about 1.2024e65657052. Yes, a number with 65,657,053 decimal digits.
I tried the ten million factorial without a running Firefox and got about 10 seconds saved. Not much, even if the sound the harddrive made gave the impression of ten minutes instead ðŸ˜‰
The factorial of ten million is large enough to allow us a better test if the binary splitting is indeed faster. And it is: with the linear multiplication instead of the binary splitting method the run time nearly doubles and passed 24 minutes. So the theory is correct at least ðŸ˜‰

BTW: I just saw that libtommath shares its birthday with me ðŸ˜‰

# On the numerical evaluation of factorials VII

Having another Big-Number library at hand it was only natural or me to write a function—several, to be exact—to evaluate the factorial.
The first one is the naÃ¯ve version. Well, not stupid, just naÃ¯ve. A simple iterative algorithm that multiplies the integers but stripped them from their factors of two beforehand and adds them later at once with a single stepshift to the left. It also uses native multiplication before it needs to use the big-integer library by trying the multiplicator if its size is small enough to make the product fit into a single digit of the big-integer. It does it by trial-divison. I saw this method in the implementation of the factorial function in calc (in zfunc.c if you want to look for yourself) and found it odd at first but it is faster, nevertheless. Probably depending on the kind of big-integer library but I haven’t checked others.

int factorial_naive(unsigned long n, mp_int *result){
unsigned long m,l=1,k=0;
mp_set_int(result,1);
for (; n > 1; n--) {
for (m = n; !(m & 0x1); m >>= 1)k++;
if (l <= MP_DIGIT_SIZE/m) {
l *= m;
continue;
}
mp_mul_d(result,l, result);
l = m;
}
if (l > 1) {
mp_mul_d(result, l, result);
}
mp_mul_2d (result, k, result);
return MP_OKAY;
}


The next step on the complexity ladder is the binary splitting algorithm. The direct implementation is slower than the simple iterative one above. The reason lies for one part on the details of the implementation of the big-integer library and for another part on my inability to do it correctly (the difference is too large in time in relation to the specific differences in the implementation of the big-integers, so I assume its me.). Nevertheless, here is the result, maybe you can tell me where I f’d it up. It is stripped off all of the error checks to make it readable.

static void fbinsplit(unsigned long m,unsigned long n,
mp_int *temp1, mp_int *temp2){
mp_int t1,t2;
unsigned long c;
if(m == n){
for (;((m & 0x1) == 0); m >>= 1);
mp_set_int(temp1, m);
}
else{
c = ((m+n)>>1);
mp_init(&t1); mp_init(&t2);
fbinsplit(m, c,temp1,temp2);
mp_copy(temp1,&t1);
mp_copy(temp2,&t2);
fbinsplit(c + 1,n,&t1,&t2);
mp_mul( &t1,&t2, temp2);
mp_clear(&t2);
mp_clear(&t1);
}
}
int factorial2(unsigned long n, mp_int *result){
mp_int t1,t2;
mp_init_multi(&t1,&t2,NULL);
mp_set_int(&t2,1);
mp_set_int(&t1,1);
fbinsplit(1,n,&t1,&t2);
mp_mul_2d (&t2, (int)prime_divisors(n,2LU), result);
mp_clear_multi(&t1,&t2,NULL);
return MP_OKAY;
}


The main speed-brake is probably the deep copy which allocates memory on the heap and does all the other stuff that lets even the most well-meaning program come to a grinding halt.
The function prime_divisors() does what the name suggests. It will be used later, so here it is:

static unsigned long prime_divisors(unsigned long n,unsigned long p){
unsigned long q,m;
q = n;
m = 0LU;
if (p > n) return 0LU;
if (p > n/2LU) return 1LU;
while (q >= p) {
q  = q/p;
m += q;
}
return m;
}


No complicated bit-juggling, just simple trial-division.

Peter Lushny has published a simple algorithm at his page that seemed to work better. Not as good as the theory suggests but at least better than the naÃ¯ve, iterative implementation at the top of this post. He has a lot of different algorithms shown at his page with implementations mostly in Java and C# but sadly without much comments, so you have to study the code to get the gist of it. The code is quite legible, at least.
The following code has all of the error checking stripped, too, just as the one above and probably all of the listings I post.

static mp_int *fbinsplit2b(unsigned long n, unsigned long m){
mp_int t1,t2;
if (m <= (n + 1)){
mp_set_int(&p_temp,n);
return &p_temp;
}
if (m == (n + 2)){
mp_set_int(&p_temp,n * m);
return &p_temp;
}
unsigned long k =  (n + m) / 2;
if ((k & 1) != 1) k = k - 1;
mp_init_multi(&t1,&t2,NULL);
mp_copy(fbinsplit2b(n, k),&t1);
mp_copy(fbinsplit2b(k + 2, m),&t2);
mp_mul(&t1 ,&t2 ,&p_temp);
mp_clear_multi(&t1,&t2,NULL);
return &p_temp;
}
static void fbinsplit2a(unsigned long n,mp_int *p, mp_int *r){
if (n <= 2) return;
fbinsplit2a(n / 2, p, r);
mp_mul( p , fbinsplit2b(n / 2 + 1 + ((n / 2) & 1), n - 1 + (n & 1)),p);
mp_mul(r, p,r);
}
int factorial_binary_splitting(unsigned long n, mp_int *result){
mp_int p, r;
mp_init_multi(&p,&r,NULL);
mp_set_int(&p,1);
mp_set_int(&r,1);
fbinsplit2a(n, &p, &r);
mp_copy(&r,result);
mp_mul_2d (&r,(int)prime_divisors(n,2LU) , result);
return MP_OKAY;
}


It still needs some computationally expensive deep copying but not as much as my own version.
The next one, the final one until somebody finds something faster, is Borwein’s trick as described in some of my earlier posts.

It works with primes, so we need some. We need as many primes for n! as there are up to and including n which isn’t much, so we need nothing fancy here and use a simple sieve to fill an array with the primes. We could use the sieve directly but that would complicate the code too much for too little a gain.

It needs a data structure named bitset, implemented in a way I mentioned earlier when the structure for a flexible 2-d matrix was described.

/* A general bitset, can be used elsewhere, too */
#   define ERAT_BITS (sizeof(uint32_t)*CHAR_BIT)
#   define GET_BIT(s,n)  ((*(s+(n/ERAT_BITS)) &   ( 1<<( n % ERAT_BITS )))  != 0)
#   define SET_BIT(s,n)   (*(s+(n/ERAT_BITS)) |=  ( 1<<( n % ERAT_BITS )))
#   define CLEAR_BIT(s,n) (*(s+(n/ERAT_BITS)) &= ~( 1<<( n % ERAT_BITS )))
#   define TOG_BIT(s,n)   (*(s+(n/ERAT_BITS)) ^=  ( 1<<( n % ERAT_BITS )))
/* bst.size is the size in bits, the overall size might be bigger */
typedef struct bitset_t {
uint32_t size;
uint32_t *content;
} bitset_t;
#   define bitset_alloc(bst, n) \
do {\
(bst)->content=malloc(( n /(sizeof(uint32_t)) + 1 ));\
if ((bst)->content == NULL) {\
fprintf(stderr, "memory allocation for bitset failed");\
exit(EXIT_FAILURE);\
}\
(bst)->size = n;\
} while (0)
#   define bitset_size(bst)  ((bst)->size)
#   define bitset_setall(bst) memset((bst)->content,~0lu,\
(bst->size /(sizeof(uint32_t) ) +1 ))
#   define bitset_clearall(bst) memset((bst)->content,0,\
(bst->size /(sizeof(uint32_t) ) +1 ))
#   define bitset_clear(bst,n) CLEAR_BIT((bst)->content, n)
#   define bitset_set(bst,n)     SET_BIT((bst)->content, n)
#   define bitset_get(bst,n)     GET_BIT((bst)->content, n)
#   define bitset_free(bst) \
do {\
free((bst)->content);\
free(bst);\
} while (0)

uint32_t bitset_nextset(bitset_t * bst, uint32_t n);
uint32_t bitset_prevset(bitset_t * bst, uint32_t n);
uint32_t bitset_nextset(bitset_t * bst, uint32_t n)
{
while ((n < bitset_size(bst)) && (!bitset_get(bst, n))) {
n++;
}
return n;
}

uint32_t bitset_prevset(bitset_t * bst, uint32_t n)
{
while (n > 1 && (!bitset_get(bst, n))) {
n--;
}
return n;
}



The actual sieve is just the basic Eratosthenes method.

#define LN_113 1.25505871293247979696870747618124469168920275806274
unsigned long *fill_prime_list(unsigned long start, unsigned long stop,
unsigned long *R)
{
bitset_t *bst;
uint32_t n, k, j,i;
unsigned long pix,*list;

pix = (unsigned long) (LN_113 *  stop/log(stop) );
list = malloc(pix * sizeof(unsigned long));
if(list == NULL){
return 0;
}
bst = malloc(sizeof(bitset_t));
bitset_alloc(bst, stop);
bitset_setall(bst);
bitset_clear(bst, 0);
bitset_clear(bst, 1);

n = bitset_size(bst);
for (k = 4; k < n; k += 2)
bitset_clear(bst, k);
k = 0;
i = 0;
while ((k = bitset_nextset(bst, k + 1)) < n) {
if(k >= start){
list[i++] = k;
}
for (j = k * k; j < n; j += k * 2) {
bitset_clear(bst, j);
}
}
bitset_free(bst);
list = realloc(list, sizeof(unsigned long)*(i + 1));
*R = i;
return list;
}


The only interessting point is the curious constant LN_113. This is actually the upper bound for the number of primes up to a limit if that limit is bigger than two. To explain it in detail would cause quite some deep diving into number theoretical abysses but I found a post at stackexchange where some A. Walker gives a rough overview.

The function returns a pointer to the list, so do not forget to free the memory after using it. The size of the array is put into R.
One little helper function is also needed: a method to find the index of the MSB.

long highbit(unsigned  long n){
long r=0;
while (n >>= 1) {
r++;
}
return r;
}


Also without any juggled bits, just a simple loop.
Forgot anything? No? Ok.

int factorial(unsigned long n, mp_int *result){
unsigned long *p_list;
unsigned long *exp_list;
unsigned long p, i,cut;
long bit;
int shift;
mp_int temp;

unsigned long  r=0;
p_list = fill_prime_list(3, n+1, &r);
exp_list = malloc(sizeof(unsigned long)* (r+1));
if(exp_list == NULL){ return MP_MEM;}

mp_set_int(result,1);
shift = (int)prime_divisors(n,2LU);
cut = n / 2 ;
for(p = 0; p < r; p++) {
if(p_list[p] > cut) break;
exp_list[p] = prime_divisors(n,p_list[p]);
}
bit = highbit(exp_list[0]);
for( ; bit>=0; bit--) {
mp_sqr(result, result);
for(i=0; i<p; i++) {
if ((exp_list[i] & (1 << bit)) != 0) {
mp_mul_d(result,(mp_digit)p_list[i], result);
}
}
}
for(; p < r; p++){
mp_mul_d(result,(mp_digit)p_list[p], result);
}
mp_mul_2d (result, shift, result);
return MP_OKAY;
}


The primes with a unit exponent have been treated with a binary splitting algorithm in my implementation in Calc but that did not work out very well with libtommath for the same reasons as for the failure of my try at the factorials.
This implementation needs about 13 minutes to evaluate the factorial of one million. The result has 5,565,703 decimal digits which is quite a large number. PARI/GP needs…uhm…more than 45 minutes. I stopped it there, but I have only the version in Debian Squeeze running. It is linked against GMP 4.3.2. which is more than three yeas old. The higher, higher than libtommath that is, optimized Calc needs about eight minutes for the factorial of one million.