Still no grammar, mainly because of the “Things are falling apart” theme I mentioned in my last post. But away with boring physical masonry and its dirt and on to some cleaner hobbies. One of those is the implementation of the numerical computation of binomial coefficients. Binomial coefficients are useful in a lot of places, to many to even think about listing them, although all have one thing in common: they want them all, in the right order and fast. Really fast.

No problem.

If somebody remembers my last posts about the numerical evaluation of the factorial function like the one here for example and the one describing Borwein’s trick in detail.

With less details, more in a nutshell, the trick speeds up computing by using the prime factorization of the factorial.

The formula for the binomial coefficients is

The exclamation marks mark factorials and we have a fast algorithm for factorials. It works with prime factorization and from the Fundamental Theorem of Arithmetic

Now the formula above is not quite right, the product is actually infinite because of the rule (with ):

To multiply two numbers in, say, the decimal system, we multiply the first place with the first place (ignoring the carry for now)), the second with the second place and so on. We can do the same with the polynomials from the prime factorization. Given the prime factorization of two numbers and we can multiply both numbers by adding the exponents of the primes in the prime factorization. We do not have to care if and are of the same length, we can just pad with primes with an exponent of zero. Let me steal (shamelessly, as always) the equation from wikipedia here:

It is of not much use under normal circumstances because factoring of an integer is complicated. Factoring a factorial, on the other side, is very simple.

We could write functions for it but I wanted to put it in libtommath (I already did in my fork) and they want single, independent functions. It is also easier because there are no carries or negative exponents in the case of the binomial coefficient and we can spare us the trouble of implementing it.

The implementation is straightforward:

Some general shortcuts for are: , , and .

We should also make use of the symmetry of the binomial coefficient:

(with o course). We need an array then, twice the size of the number of primes in (we use the approximation instead of countig them. The amount of wasted memory is small) and fill it with primes up to . Well, actually, we don't fill it, we do it all inline.

If we replace the factorials , in the equation of the binomial coefficient with their prime factorizations , we get

We have taken care that , which means also that , so we can do it in three consecutive loops, the irst one up to and including , the second one up to and including and the last one up to and including . In order

with the last on just padding. We can do that because the real prime factorization is an infinite product but we can ignore all exponents that are zero which leads to the truncation in the above listed way.

The last step is the actual computation of the resulting polynomial and that is the only thing I did in an external function because it is simple and used more often. It will not get used in the factorial function because of the optimizations there which are not applicable for a more general function. The implementation of this function (`bn_mp_compute_factored_factorial.c`

) is as you might have expected:

/* I think this is the third version, isn't it?*/ static long local_highbit(unsigned long n){ long r=0; while (n >>= 1) { r++; } return r; } unsigned long BN_MP_FACTORED_FACTORIAL_CUTOFF = 0x10000000; /* Input is to be sorted as: prime,exponent,prime,exponent,... */ int mp_compute_factored_factorial(unsigned long *f, unsigned long f_length, mp_int *c, unsigned long stop){ unsigned long length,start,i; unsigned long shift = 0; long bit,k=0,hb; mp_int temp; int e; if(stop == 0){ length = f_length; } else{ length = stop; } if(f[0] == 2 && f[1] > 0){ shift = f[1]; if(f_length == 2){ if( (e = mp_set_int(c,1LU<<f[1]) ) != MP_OKAY){ return e; } return MP_OKAY; } } start = 0; if(shift){start+=2;k=2;} bit = 0; for(;k<(long)f_length;k+=2){ hb=local_highbit(f[k+1]); if(bit < hb)bit=hb; } /* just for safety reasons, y'know */ if(bit >(long) DIGIT_BIT){ return MP_VAL; } if( (e = mp_set_int(c,1) ) != MP_OKAY){ return e; } if(f_length > BN_MP_FACTORED_FACTORIAL_CUTOFF){ if( (e = mp_init(&temp) ) != MP_OKAY){ return e; } for( ; bit>=0; bit--) { if( (e = mp_sqr(c, c) ) != MP_OKAY){ return e; } if( (e = mp_set_int(&temp,1) ) != MP_OKAY){ return e; } for(i=start; i<f_length; i+=2) { if ((f[i+1] & (1LU << (unsigned long)bit)) != 0) { if( (e = mp_mul_d(&temp,(mp_digit)f[i], &temp) ) != MP_OKAY){ return e; } } } if( (e = mp_mul(&temp,c,c) ) != MP_OKAY){ return e; } } } else{ for( ; bit>=0; bit--) { if( (e = mp_sqr(c, c) ) != MP_OKAY){ return e; } for(i=start; i<f_length; i+=2) { if ((f[i+1] & (1LU << (unsigned long)bit)) != 0) { if( (e = mp_mul_d(c,(mp_digit)f[i], c) ) != MP_OKAY){ return e; } } } } } if(shift){ /* DIGIT_BIT might be 60 which makes this extra check necessary */ if(shift < INT_MAX){ /* The choice of types is a bit questionable. Sometimes. */ if( (e = mp_mul_2d (c, (int) shift, c) ) != MP_OKAY){ return e; } } else{ int multiplicator=0; while(shift > INT_MAX){ shift >>= 1; multiplicator++; } if( (e = mp_mul_2d (c, (int) shift, c) ) != MP_OKAY){ return e; } if( (e = mp_mul_2d (c, 1<<multiplicator, c) ) != MP_OKAY){ return e; } } } return MP_OKAY; }

The branch in the middle with the cutoff `BN_MP_FACTORED_FACTORIAL_CUTOFF`

makes no difference in this case but might in other. To be tested.

The actual code for the binomial coefficients uses two simpler fallbacks, on with `unsigned long`

and one simple iterative algorithm with big integers. The one with native integers uses one of the normal algorithm but divides by first. Seems to be a waste of good electrons to compute the GCD but it keeps the full range for an `unsigned long`

without having to use an intermediate `unsigned long long`

.

static unsigned long gcd ( unsigned long a, unsigned long b ){ unsigned long c; while ( a != 0 ) { c = a; a = b%a; b = c; } return b; } /* This version using the gcd is a bit slower but avoids the long-long. Upper bound of the number of recursions is n/2. */ static unsigned long comb(unsigned long n, unsigned long k){ unsigned long d; unsigned long q; if(n<k)return 0; if(n==k || k == 0)return 1; if (k > n/2){ k = n-k; } d = gcd( n, k ); q = k / d; return ( comb( n-1, k-1 ) / q ) * ( n / d ); }

And the small big integer function is

static int mp_comb(unsigned long n, unsigned long k,mp_int *c){ unsigned long i; mp_int temp; int e; if(n<k){ mp_set_int(c,0); return MP_OKAY; } if(n==k){ mp_set_int(c,1); return MP_OKAY; } /* That's over-simplified, e.g. comb(99,6) = 1120529256 is still smaller than 2^32. An upper limit would be (the magic number is 1/sqrt(2*pi)) ulim = (0.398942280401430*pow(n,(n+.5 ))*pow((n-k),(k-n-.5 )) /pow(p,(k + 0.5 ))); Stanica, Pantelimon. "Good lower and upper bounds on binomial coefficients." Journal of Inequalities in Pure and Applied Mathematics 2.3 (2001): 30. */ if(n < 35){ if( (e = mp_set_int(c,comb(n,k)) ) != MP_OKAY){ return e; } return MP_OKAY; } if( (e = mp_set_int(c,1) ) != MP_OKAY){ return e; } if (k > n/2){ k = n-k; } if( (e = mp_init(&temp) ) != MP_OKAY){ return e; } for (i=1; i<=k;i++){ if( (e = mp_set_int(&temp,(n - k + i) ) ) != MP_OKAY){ return e; } if( (e = mp_mul(c,&temp,c) ) != MP_OKAY){ return e; } if( (e = mp_set_int(&temp,i) ) != MP_OKAY){ return e; } if( (e = mp_div(c,&temp,c,NULL) ) != MP_OKAY){ return e; } } mp_clear(&temp); return MP_OKAY; }

As you can see from the comment in the code, that the points of the cutoffs are hard to find.

int mp_binomial(unsigned long n,unsigned long k, mp_int *c){ /* Idea shamelessly stolen from Calc. Hey, wait, I wrote that calc-script myself! Oh well, my age's starting to show, I'm afraid ;-)*/ unsigned long *prime_list; unsigned long pix=0,prime,K,diff; mp_bitset_t *bst; int e; if(n<k){ mp_set_int(c,0); return MP_OKAY; } if(n==k || k==0){ mp_set_int(c,1LU); return MP_OKAY; } if((n-k)==1 || k==1){ /* TODO: either check for sizeof(unsigned long) == 4 or repair mp_set_int() */ mp_set_int(c,n); return MP_OKAY; } /* The function mp_comb(n,k) is faster if k<<n (and v.v.), the exact cut-off has not been calculated yet but is quite large.*/ if(k < n/900 || (n-k) < n/900){ if( (e = mp_comb(n, k, c) ) != MP_OKAY){ return e; } return MP_OKAY; } if (k > n/2){ k = n-k; } bst = malloc(sizeof(mp_bitset_t)); if (bst == NULL) { return MP_MEM; } mp_bitset_alloc(bst, n+1); mp_eratosthenes(bst); /* One could also count the number of primes in the already filled sieve */ pix = (unsigned long) (LN_113 * n/log(n) ); prime_list = malloc(sizeof(unsigned long) * (pix) * 2); if (prime_list == NULL) { return MP_MEM; } prime = 2;K=0; do{ diff = mp_prime_divisors(n,prime)- ( mp_prime_divisors(n-k,prime)+mp_prime_divisors(k,prime)); if(diff != 0){ prime_list[K] = prime; prime_list[K+1] = diff; K+=2; } prime = mp_bitset_nextset(bst,prime+1); }while(prime <= k); do { diff = mp_prime_divisors(n,prime)-mp_prime_divisors(n-k,prime); if(diff != 0){ prime_list[K] = prime; prime_list[K+1] = diff; K+=2; } prime = mp_bitset_nextset(bst,prime+1); }while(prime <= n-k); do { prime_list[K] = prime; prime_list[K+1] = mp_prime_divisors(n,prime); prime = mp_bitset_nextset(bst,prime+1); K+=2; }while(prime <= n); prime_list = realloc(prime_list,sizeof(unsigned long) * K); if (prime_list == NULL) { return MP_MEM; } if( (e = mp_compute_factored_factorial(prime_list,K-1,c,0) ) != MP_OKAY){ return e; } free(bst); free(prime_list); return MP_OKAY; }

Your milage may vary but that cut-off point is *quite* off, isn’t it?

More optimiztions are possible, especially to save some memory if we don’t keep the actual values of the primes, only start and end-point but that would make the code more complicated. taking the `unsigned long`

to have 32 bits, we have a need for about 927 MiBytes for the exponents alone or around 2 GiBytes for all, using the upper bound for the number of primes listed above (the thing with the 113) or about 1.5 GiBytes for all using the exact number of primes lower than . How large is that number? With and

using the logarithm of the Gamma function instead of the factorials we get

Which is roundabout . I have no idea how long the computation would last but over one and a quarter billion decimal digits? That’ll last a while and I’m pretty sure about that guess! 😉

Hint: I have an old machine, and AMD Duron with just 1 GHz. Computing with needs about half a second and with it already needs nearly a minute, so with assuming linear growth it’s about 2 years if I calculated it correctly. The whole thing is parallelizable to some very large extent with very little overhead, say 1.2. 64 cores are no rarity anymore, so we could do it in about 13 days. Oh, who would have thought!

So if you need more: just drop me a note.

Pingback: Adventures of a Programmer: Parser Writing Peril X | De Amentiae Mundi

Pingback: On the numerical evaluation of factorials IX | De Amentiae Mundi