# On the numerical evaluation of factorials VII

Having another Big-Number library at hand it was only natural or me to write a function—several, to be exact—to evaluate the factorial.
The first one is the naïve version. Well, not stupid, just naïve. A simple iterative algorithm that multiplies the integers but stripped them from their factors of two beforehand and adds them later at once with a single stepshift to the left. It also uses native multiplication before it needs to use the big-integer library by trying the multiplicator if its size is small enough to make the product fit into a single digit of the big-integer. It does it by trial-divison. I saw this method in the implementation of the factorial function in calc (in zfunc.c if you want to look for yourself) and found it odd at first but it is faster, nevertheless. Probably depending on the kind of big-integer library but I haven’t checked others.

int factorial_naive(unsigned long n, mp_int *result){
unsigned long m,l=1,k=0;
mp_set_int(result,1);
for (; n > 1; n--) {
for (m = n; !(m & 0x1); m >>= 1)k++;
if (l <= MP_DIGIT_SIZE/m) {
l *= m;
continue;
}
mp_mul_d(result,l, result);
l = m;
}
if (l > 1) {
mp_mul_d(result, l, result);
}
mp_mul_2d (result, k, result);
return MP_OKAY;
}

The next step on the complexity ladder is the binary splitting algorithm. The direct implementation is slower than the simple iterative one above. The reason lies for one part on the details of the implementation of the big-integer library and for another part on my inability to do it correctly (the difference is too large in time in relation to the specific differences in the implementation of the big-integers, so I assume its me.). Nevertheless, here is the result, maybe you can tell me where I f’d it up. It is stripped off all of the error checks to make it readable.

static void fbinsplit(unsigned long m,unsigned long n,
mp_int *temp1, mp_int *temp2){
mp_int t1,t2;
unsigned long c;
if(m == n){
for (;((m & 0x1) == 0); m >>= 1);
mp_set_int(temp1, m);
}
else{
c = ((m+n)>>1);
mp_init(&t1); mp_init(&t2);
fbinsplit(m, c,temp1,temp2);
mp_copy(temp1,&t1);
mp_copy(temp2,&t2);
fbinsplit(c + 1,n,&t1,&t2);
mp_mul( &t1,&t2, temp2);
mp_clear(&t2);
mp_clear(&t1);
}
}
int factorial2(unsigned long n, mp_int *result){
mp_int t1,t2;
mp_init_multi(&t1,&t2,NULL);
mp_set_int(&t2,1);
mp_set_int(&t1,1);
fbinsplit(1,n,&t1,&t2);
mp_mul_2d (&t2, (int)prime_divisors(n,2LU), result);
mp_clear_multi(&t1,&t2,NULL);
return MP_OKAY;
}

The main speed-brake is probably the deep copy which allocates memory on the heap and does all the other stuff that lets even the most well-meaning program come to a grinding halt.
The function prime_divisors() does what the name suggests. It will be used later, so here it is:

static unsigned long prime_divisors(unsigned long n,unsigned long p){
unsigned long q,m;
q = n;
m = 0LU;
if (p > n) return 0LU;
if (p > n/2LU) return 1LU;
while (q >= p) {
q  = q/p;
m += q;
}
return m;
}

No complicated bit-juggling, just simple trial-division.

Peter Lushny has published a simple algorithm at his page that seemed to work better. Not as good as the theory suggests but at least better than the naïve, iterative implementation at the top of this post. He has a lot of different algorithms shown at his page with implementations mostly in Java and C# but sadly without much comments, so you have to study the code to get the gist of it. The code is quite legible, at least.
The following code has all of the error checking stripped, too, just as the one above and probably all of the listings I post.

static mp_int *fbinsplit2b(unsigned long n, unsigned long m){
mp_int t1,t2;
if (m <= (n + 1)){
mp_set_int(&p_temp,n);
return &p_temp;
}
if (m == (n + 2)){
mp_set_int(&p_temp,n * m);
return &p_temp;
}
unsigned long k =  (n + m) / 2;
if ((k & 1) != 1) k = k - 1;
mp_init_multi(&t1,&t2,NULL);
mp_copy(fbinsplit2b(n, k),&t1);
mp_copy(fbinsplit2b(k + 2, m),&t2);
mp_mul(&t1 ,&t2 ,&p_temp);
mp_clear_multi(&t1,&t2,NULL);
return &p_temp;
}
static void fbinsplit2a(unsigned long n,mp_int *p, mp_int *r){
if (n <= 2) return;
fbinsplit2a(n / 2, p, r);
mp_mul( p , fbinsplit2b(n / 2 + 1 + ((n / 2) & 1), n - 1 + (n & 1)),p);
mp_mul(r, p,r);
}
int factorial_binary_splitting(unsigned long n, mp_int *result){
mp_int p, r;
mp_init_multi(&p,&r,NULL);
mp_set_int(&p,1);
mp_set_int(&r,1);
fbinsplit2a(n, &p, &r);
mp_copy(&r,result);
mp_mul_2d (&r,(int)prime_divisors(n,2LU) , result);
return MP_OKAY;
}

It still needs some computationally expensive deep copying but not as much as my own version.
The next one, the final one until somebody finds something faster, is Borwein’s trick as described in some of my earlier posts.

It works with primes, so we need some. We need as many primes for n! as there are up to and including n which isn’t much, so we need nothing fancy here and use a simple sieve to fill an array with the primes. We could use the sieve directly but that would complicate the code too much for too little a gain.

It needs a data structure named bitset, implemented in a way I mentioned earlier when the structure for a flexible 2-d matrix was described.

/* A general bitset, can be used elsewhere, too */
#   define ERAT_BITS (sizeof(uint32_t)*CHAR_BIT)
#   define GET_BIT(s,n)  ((*(s+(n/ERAT_BITS)) &   ( 1<<( n % ERAT_BITS )))  != 0)
#   define SET_BIT(s,n)   (*(s+(n/ERAT_BITS)) |=  ( 1<<( n % ERAT_BITS )))
#   define CLEAR_BIT(s,n) (*(s+(n/ERAT_BITS)) &= ~( 1<<( n % ERAT_BITS )))
#   define TOG_BIT(s,n)   (*(s+(n/ERAT_BITS)) ^=  ( 1<<( n % ERAT_BITS )))
/* bst.size is the size in bits, the overall size might be bigger */
typedef struct bitset_t {
uint32_t size;
uint32_t *content;
} bitset_t;
#   define bitset_alloc(bst, n) \
do {\
(bst)->content=malloc(( n /(sizeof(uint32_t)) + 1 ));\
if ((bst)->content == NULL) {\
fprintf(stderr, "memory allocation for bitset failed");\
exit(EXIT_FAILURE);\
}\
(bst)->size = n;\
} while (0)
#   define bitset_size(bst)  ((bst)->size)
#   define bitset_setall(bst) memset((bst)->content,~0lu,\
(bst->size /(sizeof(uint32_t) ) +1 ))
#   define bitset_clearall(bst) memset((bst)->content,0,\
(bst->size /(sizeof(uint32_t) ) +1 ))
#   define bitset_clear(bst,n) CLEAR_BIT((bst)->content, n)
#   define bitset_set(bst,n)     SET_BIT((bst)->content, n)
#   define bitset_get(bst,n)     GET_BIT((bst)->content, n)
#   define bitset_free(bst) \
do {\
free((bst)->content);\
free(bst);\
} while (0)

uint32_t bitset_nextset(bitset_t * bst, uint32_t n);
uint32_t bitset_prevset(bitset_t * bst, uint32_t n);
uint32_t bitset_nextset(bitset_t * bst, uint32_t n)
{
while ((n < bitset_size(bst)) && (!bitset_get(bst, n))) {
n++;
}
return n;
}

uint32_t bitset_prevset(bitset_t * bst, uint32_t n)
{
while (n > 1 && (!bitset_get(bst, n))) {
n--;
}
return n;
}

The actual sieve is just the basic Eratosthenes method.

#define LN_113 1.25505871293247979696870747618124469168920275806274
unsigned long *fill_prime_list(unsigned long start, unsigned long stop,
unsigned long *R)
{
bitset_t *bst;
uint32_t n, k, j,i;
unsigned long pix,*list;

pix = (unsigned long) (LN_113 *  stop/log(stop) );
list = malloc(pix * sizeof(unsigned long));
if(list == NULL){
return 0;
}
bst = malloc(sizeof(bitset_t));
bitset_alloc(bst, stop);
bitset_setall(bst);
bitset_clear(bst, 0);
bitset_clear(bst, 1);

n = bitset_size(bst);
for (k = 4; k < n; k += 2)
bitset_clear(bst, k);
k = 0;
i = 0;
while ((k = bitset_nextset(bst, k + 1)) < n) {
if(k >= start){
list[i++] = k;
}
for (j = k * k; j < n; j += k * 2) {
bitset_clear(bst, j);
}
}
bitset_free(bst);
list = realloc(list, sizeof(unsigned long)*(i + 1));
*R = i;
return list;
}

The only interessting point is the curious constant LN_113. This is actually the upper bound for the number of primes up to a limit if that limit is bigger than two. To explain it in detail would cause quite some deep diving into number theoretical abysses but I found a post at stackexchange where some A. Walker gives a rough overview.

The function returns a pointer to the list, so do not forget to free the memory after using it. The size of the array is put into R.
One little helper function is also needed: a method to find the index of the MSB.

long highbit(unsigned  long n){
long r=0;
while (n >>= 1) {
r++;
}
return r;
}

Also without any juggled bits, just a simple loop.
Forgot anything? No? Ok.

int factorial(unsigned long n, mp_int *result){
unsigned long *p_list;
unsigned long *exp_list;
unsigned long p, i,cut;
long bit;
int shift;
mp_int temp;

unsigned long  r=0;
p_list = fill_prime_list(3, n+1, &r);
exp_list = malloc(sizeof(unsigned long)* (r+1));
if(exp_list == NULL){ return MP_MEM;}

mp_set_int(result,1);
shift = (int)prime_divisors(n,2LU);
cut = n / 2 ;
for(p = 0; p < r; p++) {
if(p_list[p] > cut) break;
exp_list[p] = prime_divisors(n,p_list[p]);
}
bit = highbit(exp_list[0]);
for( ; bit>=0; bit--) {
mp_sqr(result, result);
for(i=0; i<p; i++) {
if ((exp_list[i] & (1 << bit)) != 0) {
mp_mul_d(result,(mp_digit)p_list[i], result);
}
}
}
for(; p < r; p++){
mp_mul_d(result,(mp_digit)p_list[p], result);
}
mp_mul_2d (result, shift, result);
return MP_OKAY;
}

The primes with a unit exponent have been treated with a binary splitting algorithm in my implementation in Calc but that did not work out very well with libtommath for the same reasons as for the failure of my try at the factorials.
This implementation needs about 13 minutes to evaluate the factorial of one million. The result has 5,565,703 decimal digits which is quite a large number. PARI/GP needs…uhm…more than 45 minutes. I stopped it there, but I have only the version in Debian Squeeze running. It is linked against GMP 4.3.2. which is more than three yeas old. The higher, higher than libtommath that is, optimized Calc needs about eight minutes for the factorial of one million.