On Binary Splitting

The method of binary splitting for evaluating sums of rational numbers is well known. A slightly more detailed overview, with some code examples, is given in [2] (preprint: [1]).

The binary splitting algorithm uses divide-and-conquer to keep the individual operations as small as possible for as long as possible. Another advantage of this method is that all operands are of roughly the same size, which all of the fast multiplication algorithms, e.g. FFT-based ones, favor. Asymmetric Toom-Cook variants exist, but only for a handful of size ratios.

The basic idea of divide-and-conquer here is to evaluate $a \odot b \odot c \odot d$ (with $a \odot b = b \odot a$) as $(a \odot b) \odot (c \odot d)$. That keeps the operand sizes equal if the sizes of $\{a,b,c,d\}$ are equal, too. In times of multi-core processors, the ability to process as many of these operations in parallel as there are cores available is not something to dismiss too easily, either.

But there is also the CPU cache, a still very finite but very fast memory where the CPU keeps the bricks and mortar it works with, so it is a good idea to keep as much as possible in this memory. Here divide-and-conquer is not ideal, because it has to grab its material from different parts of memory a lot of the time. Access to memory can be assumed to be $\mathcal{O}(1)$, but the hidden constant can get very large for the different types of memory. So large that it might be better to use an asymptotically worse algorithm if it is able to keep all data in the CPU cache. In the case of multiplication that is the schoolbook algorithm, used at the lowest level when $a\odot b$ fits into a single CPU register, e.g. the product of two 32-bit values fits into one 64-bit register.

An example with libtommath (checks and balances omitted):

#define BIN_SPLIT_TUNABLE_CONSTANT 64
void mp_bin_split(mp_digit *array, int n, mp_int *result)
{
    int i, half;
    mp_int tmp;

    if (n == 0) {
        mp_set(result, 1);
        return;
    }
    if (n <= BIN_SPLIT_TUNABLE_CONSTANT) {
        mp_set(result, array[0]);
        for (i = 1; i < n; i++)
            mp_mul_d(result, array[i], result);
        return;
    }

    /* split in the middle and recurse; both halves are of (nearly) equal size */
    half = n / 2;
    mp_bin_split(array, half, result);
    mp_init(&tmp);
    mp_bin_split(array + half, n - half, &tmp);
    mp_mul(result, &tmp, result);
    mp_clear(&tmp);
}


If BIN_SPLIT_TUNABLE_CONSTANT is set very small, it can be fruitful to do the lowest splitting levels by hand (mp_word is defined to be large enough to hold the product of two mp_digits).

/* ---- snip --- */
if (n < 8) {
mp_set(result, array[0]);
for (i = 1; i < n; i++)
mp_mul_d(result, array[i], result);
return;
}
if (n == 8) {
mp_word s1, s2, s3, s4;
mp_digit carry, a[8], b[8],
c1[8] = {0,0,0,0,0,0,0,0},
c2[8] = {0,0,0,0,0,0,0,0}, *temp;
double C1[16], C2[16];
int i, j, k, l;
s1 = (mp_word)array[0] * array[1];
array[0] = (unsigned long) (s1  & ((mp_word) MP_MASK));
array[1] = (unsigned long) (s1 >> ((mp_word) DIGIT_BIT));
s2 = (mp_word)array[2] * array[3];
array[2] = (unsigned long) (s2  & ((mp_word) MP_MASK));
array[3] = (unsigned long) (s2 >> ((mp_word) DIGIT_BIT));
s3 = (mp_word)array[4] * array[5];
array[4] = (unsigned long) (s3  & ((mp_word) MP_MASK));
array[5] = (unsigned long) (s3 >> ((mp_word) DIGIT_BIT));
s4 = (mp_word)array[6] * array[7];
array[6] = (unsigned long) (s4  & ((mp_word) MP_MASK));
array[7] = (unsigned long) (s4 >> ((mp_word) DIGIT_BIT));
if(array[1] == 0 && array[3] == 0 && array[5] == 0 && array[7] == 0){
s1 = s1 * s2;
a[0] = (unsigned long) (s1  & ((mp_word) MP_MASK));
a[1] = (unsigned long) (s1 >> ((mp_word) DIGIT_BIT));
s2 = s3 * s4;
b[0] = (unsigned long) (s2  & ((mp_word) MP_MASK));
b[1] = (unsigned long) (s2 >> ((mp_word) DIGIT_BIT));
if(a[1] == 0 && b[1] == 0){
s1 = s1 * s2;
a[0] = (unsigned long) (s1  & ((mp_word) MP_MASK));
a[1] = (unsigned long) (s1 >> ((mp_word) DIGIT_BIT));
if(a[1] == 0){
result->dp[0] = a[0];
result->used = 1;
return;
}
result->dp[0] = a[0];
result->dp[1] = a[1];
result->used = 2;
return;
}
k = (a[1] == 0)?1:2;
l = (b[1] == 0)?1:2;
temp = result->dp;
for(i=0;i<k;i++){
carry = 0;
temp = result->dp + i;
for(j=0;j<MIN(l,4-i);j++){
s1 =  ((mp_word)*temp) + (mp_word)a[i] * (mp_word)b[j] + (mp_word)carry;
*temp++ = (unsigned long) (s1  & ((mp_word) MP_MASK));
carry  = (unsigned long) (s1 >> ((mp_word) DIGIT_BIT));
}
if(i+j<4){
*temp = carry;
}
}
result->used = 4;
mp_clamp(result);
return;
}
else{
a[0] = array[0];
a[1] = array[1];
b[0] = array[2];
b[1] = array[3];
k = (a[1] == 0)?1:2;
l = (b[1] == 0)?1:2;
temp = c1;
for(i=0;i<k;i++){
carry = 0;
temp = c1 + i;
for(j=0;j<MIN(l,4-i);j++){
s1 =  ((mp_word)*temp) + (mp_word)a[i] * (mp_word)b[j] + (mp_word)carry;
*temp++ = (unsigned long) (s1  & ((mp_word) MP_MASK));
carry  = (unsigned long) (s1 >> ((mp_word) DIGIT_BIT));
}
if(i+j<4){
*temp = carry;
}
}
a[0] = array[4];
a[1] = array[5];
b[0] = array[6];
b[1] = array[7];

k = (a[1] == 0)?1:2;
l = (b[1] == 0)?1:2;
temp = c2;
for(i=0;i<k;i++){
carry = 0;
temp = c2 + i;
for(j=0;j<MIN(l,4-i);j++){
s1 =  ((mp_word)*temp) + (mp_word)a[i] * (mp_word)b[j] + (mp_word)carry;
*temp++ = (unsigned long) (s1  & ((mp_word) MP_MASK));
carry  = (unsigned long) (s1 >> ((mp_word) DIGIT_BIT));
}
if(i+j<4){
*temp = carry;
}
}
for(i = 0, j = 0; i < 8; i++, j += 2){
/* split every digit into two half-digits for the FFT;
   the zero padding is already in the c1/c2 initializations */
C1[j]   = (double) (c1[i] & MP_DIGIT_MASK);
C1[j+1] = (double)((c1[i] >> MP_DIGIT_BIT_HALF ) & MP_DIGIT_MASK);
C2[j]   = (double) (c2[i] & MP_DIGIT_MASK);
C2[j+1] = (double)((c2[i] >> MP_DIGIT_BIT_HALF ) & MP_DIGIT_MASK);
}

mp_fft(C1,C2,16);

carry = 0;
for(i=0;i<16;i++){
s1 = carry;
carry = 0;
s1  += (mp_word)(round(C2[i]));
if(s1 >= MP_DIGIT_HALF){
carry = s1 / (mp_word)MP_DIGIT_HALF;
s1  = s1 % (mp_word)MP_DIGIT_HALF;
}
C2[i] = (double)s1;
}
mp_grow(result,17);
for(i=0,j=0;j<16;i++,j+=2){
/* recombine two half-digits into one digit */
result->dp[i] = (mp_digit)C2[j] | ((mp_digit)C2[j+1] << MP_DIGIT_BIT_HALF);
result->used++;
}
if(carry){
result->dp[i] = carry;
result->used++;
}

mp_clamp(result);

return;
}
}
/* ---- snap --- */


And if you think you have seen the worst waste of blog space, you’ve never met the kind of programmers whose products are described in detail at thedailywtf.com.

[1] Haible, Bruno, and Thomas Papanikolaou. “Fast multiprecision evaluation of series of rational numbers.” Preprint, 1997. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.3698&rep=rep1&type=pdf
[2] Haible, Bruno, and Thomas Papanikolaou. “Fast multiprecision evaluation of series of rational numbers.” Algorithmic Number Theory (ANTS-III), Lecture Notes in Computer Science 1423, Springer, 1998.

On the numerical evaluation of factorials XI

Superfactorial

The superfactorial is, according to Sloane and Plouffe

${\displaystyle{ \mathop{\mathrm{sf}}\left(n\right)= \prod_{k=1}^n k! }}$

Another definition is according to Clifford A. Pickover’s “Keys to Infinity” (Wiley, 1995, ISBN-10: 0471118575, ISBN-13: 978-0471118572; seen at Amazon offered by different sellers between $2.57 for a used copy and $6.89 for new ones. I just ordered one and will let you know if it is worth the money). The notation with the arrows is Donald Knuth’s up-arrow notation.

${\displaystyle{ n\$ = (n!)\uparrow\uparrow(n!) }}$

This post is about the first variation. The superfactorial is also equivalent to the following product of powers.

${\displaystyle{ \mathop{\mathrm{sf}}\left(n\right)=\prod_{k=1}^n k^{n-k+1}}}$

It is possible to multiply such a product in a fast way with the help of nested squaring, as Borwein and Schönhage have shown. It was also the basis for the fast algorithm for factorials described in this series of postings.

The original product runs over the natural numbers up to and including $n$. It can be made faster by restricting it to the primes: roughly, this makes the exponents bigger, which gives a chance for more squarings and lowers the number of intermediate linear multiplications.

The question is: What is faster, factorizing every single power of the product or factorizing every single factorial?

Factorizing every power of the product reduces to factorizing every base, which in turn reduces to the bases that are not themselves prime. The time complexity of Eratosthenes’ sieve is $\mathop{\mathrm{O}}(n \log\log n)$, which can be called, without much loss of accuracy, linear. With this sieve (there are faster ones, but we need the full sequence, which makes it difficult to implement the following algorithm with other sieves) we get in one wash the prime factors of every number. The prime factors, not the exponents of these prime factors (i.e., the only information we get is whether an exponent is bigger than zero).

An example, sieving up to $10$, should make this clearer.

${\displaystyle{ \begin{matrix} 7 & & & & & & x & & & \\ 5 & & & & x & & & & & x \\ 3 & & x & & & x & & & x & \\ 2 & x & & x & & x & & x & & x \\ / & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \end{matrix} }}$

Once we have this result we are bound to trial division to find the exponents[1]. That is costly, but how much does it cost in the end?
We have to do at most $\lfloor\log_2 n \rfloor$ trial divisions (the biggest possible exponent is reserved for 2) with at most $\mathop{\pi}\left(\lfloor\sqrt{n} \rfloor\right)$ primes (e.g.: $2\cdot 3\cdot 5 = 30$ but $5^2 = 25$). The same has to be done for the exponents[1] of the powers of the superfactorial without the prime bases, which are $n-\pi(n)$ in all.

Assuming Riemann…no, we don’t need to do that 😉

Because all of the trial divisions are in fact equivalent to taking the nth root, which can be done in $\mathop{\mathrm{O}}(\mathop{\mathrm{M}}(m)\log m)$, it is quite fast. In theory.

Calculating the individual factorials, on the other hand, needs $\mathop{\mathrm{O}}(\mathop{\mathrm{M}}(m\log m))$ time, but that includes the final multiplication, too.
Now, how much does this cost?
The number of digits of $\mathop{\mathrm{sf}}(n)$ is asymptotically $\sum_{k=1}^n k\log k$, which is the logarithm of the hyperfactorial $\mathop{\mathrm{hf}}(n) = \prod_{k=1}^n k^k$ and can be approximated by $\log \left( \mathop{\mathrm{hf}}\left( n\right)\right) \sim \tfrac{1}{2} n^2 \log n$. Putting this into our time complexity we can solemnly conclude: the final multiplication is the largest part of the work and the rest pales in contrast.

It also means that we can just choose the simplest way to do the rest, and that is what we do. We have already implemented methods to do simple arithmetic with lists of prime factors, to find the prime factors of a factorial, and to compute the final result.
Let’s put it all together:

int superfactorial(unsigned long n, mp_int * c)
{
    long *t, *s, *u;
    unsigned long i, length_t, length_s;
    int e;
    if (n < 2) {
        mp_set(c, 1);
        return MP_OKAY;
    }
    if (n == 2) {
        mp_set(c, 2);
        return MP_OKAY;
    }
    /* start with the factorization of 2! */
    if ((e = mp_factor_factorial(2, 0, &s, &length_s)) != MP_OKAY) {
        return e;
    }
    for (i = 3; i <= n; i++) {
        if ((e = mp_factor_factorial(i, 1, &t, &length_t)) != MP_OKAY) {
            return e;
        }
        /* add the factorization of i! to the accumulated exponents */
        if ((e = mp_add_factored_factorials(t, length_t, s, length_s,
                                            &u, &length_s)) != MP_OKAY) {
            return e;
        }
        free(s);
        s = u;
        free(t);
        t = NULL;
        length_t = 0;
    }
    if ((e = mp_compute_signed_factored_factorials(s, length_s, c,
                                                   NULL)) != MP_OKAY) {
        return e;
    }
    free(s);
    return MP_OKAY;
}


The relevant functions are in my fork of libtommath.

A first test against the naïve implementation, admittedly quite unfair, gave a time ratio of 1:300 for $\mathop{\mathrm{sf}}(1\,000)$. But more elaborate algorithms do not give much back for such small numbers; maybe binary splitting could gain something by making the numbers to multiply more balanced in size.
With $\mathop{\mathrm{sf}}(1\,000) \sim 3.2457\mathrm{\mathbf{e}}1\,177\,245$ the cutoff to FFT multiplication is reached.

[1] The trial division to find the exponents is actually integer root finding, as you might have guessed already, and it can be done much faster than plain trial division. With $\mathop{\mathrm{M}}(m)$ the time to multiply two $m$-digit integers, the time complexity for the nth root is $\mathop{\mathrm{O}}(\mathop{\mathrm{M}}(m)\log m)$ with the AGM.

On the numerical evaluation of factorials X

The last post in this sequence was about some more general way to work with the prime factorization of the factorials and about the Narayana numbers:

${\displaystyle{ N(n,k) = \frac{1}{n}{ \binom nk}{\binom n{k-1}} }}$

Which expands to

${\displaystyle{ {{n!^2}\over{\left(k-1\right)!\,k!\,n\,\left(n-k\right)!\,\left(n-k +1\right)!}} }}$

Using $(n+1)! = (n+1)\,n!$ and, vice versa, $(n-1)! = \frac{n!}{n}$ we can shove more things together and get

${\displaystyle{ \frac{k n!^2}{k!^2 n \left(n-k+1\right) \left(n-k\right)!^2} }}$

Get the non-factorials out of the way

${\displaystyle{ \frac{k}{n \left(n-k+1\right)}\; \frac{n!^2}{k!^2 \left(n-k\right)!^2} }}$

With $P_n$ the prime factorization of $n!$ and with their arithmetic resembling the relation of the arithmetic of numbers and the arithmetic of their logarithms (e.g.:$x\cdot y = \exp\left(\log\left(x\right)+\log\left(y\right) \right)$ ) we get (dropping the fraction with the non-factorials)

${\displaystyle{ \begin{aligned} \phantom{\Leftrightarrow\;} &2P_n - \left( 2P_k + 2P_{n-k} \right) \\ \Leftrightarrow\; &2P_n - 2P_k - 2P_{n-k} \\ \Leftrightarrow\; &2\left(P_n - P_k - P_{n-k} \right) \end{aligned} }}$

We have to be careful with the term $n \left(n-k+1\right)$, for it can overflow, so we should divide the prime factorization by it bit by bit instead of multiplying $n \left(n-k+1\right)$ out and dividing by the whole product at once.

The actual implementation is left as an exercise for the reader.
Or just wait until I do it myself in the next days 😉

On the numerical evaluation of factorials IX

Number nine, who would have thought?!

As is well known and laid down elsewhere, even by myself in a different post, it is quite easy to multiply and divide integers when their prime factorization is known. Finding that factorization is too expensive for almost all integers, but it can be done very fast for factorials.
We have implemented algorithms to compute the binomial coefficient and the Catalan numbers, but we did all of it inline and not as a generally applicable function. Such a function needs to be able to handle negative exponents, for example, and we are going to implement one now.
Why now? Because of the rule “one, two, many”, and we have used the pattern three times already. The usage in the factorial function is excused for reasons of speed, so we are back down to two if we ignore the double factorial. The next function I want to implement is the so-called Narayana number:

${\displaystyle{ N(n,k) = \frac{1}{n}{ \binom nk}{\binom n{k-1}} }}$

Which can be written with factorials as

${\displaystyle{ \frac{1}{n} \; {{n!}\over{k!\,\left(n-k\right)!}} \;{{n!}\over{\left(k-1\right)!\,\left(n-k+1\right)!}} }}$

Shoving all together

${\displaystyle{ {{n!^2}\over{\left(k-1\right)!\,k!\,n\,\left(n-k\right)!\,\left(n-k +1\right)!}} }}$

It is most probably faster to use the first formula with the binomials, but this is a good example with many factorials to multiply, divide, and even raise to an (integer) power.

The first function is the one we need to decompose the factorial. It is not much different from the one in the factorial or binomial function but I took the chance to clean it up a little bit.

#include <tommath.h>
#ifndef LN_113
#define LN_113 1.25505871293247979696870747618124469168920275806274
#endif
#include <math.h>
int mp_factor_factorial(unsigned long n, unsigned long start,
long **prime_list,unsigned long *length){

unsigned long pix=0,prime,K;
mp_bitset_t *bst;

/* To be able to handle negative values in subtraction and to be sure that
n is positive*/
if(n > LONG_MAX){
return MP_VAL;
}

bst = malloc(sizeof(mp_bitset_t));
if (bst == NULL) {
return MP_MEM;
}
mp_bitset_alloc(bst, n+1);
mp_eratosthenes(bst);
pix = (unsigned long) (LN_113 *  n/log(n) );

*prime_list = malloc(sizeof(long) * (pix) * 2);
if (*prime_list == NULL) {
return MP_MEM;
}
if(start <= 2){
prime = 2;
}
else{
prime = start;
}
K=0;
do {
(*prime_list)[K]   = (long)prime;
(*prime_list)[K+1] = (long)mp_prime_divisors(n,prime);
prime              = mp_bitset_nextset(bst,prime+1);
K+=2;
}while(prime <= n);
*prime_list = realloc(*prime_list,sizeof(long) * K);
if (*prime_list == NULL) {
return MP_MEM;
}
free(bst);
*length = K;
return MP_OKAY;
}


The library we use is libtommath with my extensions available at my Github account and spread over this blog.
The function itself hasn’t changed: make a list of primes, walk over it, and for every prime compute its exponent in the factorization. The result is put in a list, in order, with each exponent following its prime, starting with the prime 2.
The mess of different types is due to a desperate attempt to keep all of the stuff compatible.

What next? Subtraction, OK.

These lists of primes and their exponents are nothing but simple polynomials, and subtraction can be done likewise. One little exception: the polynomials are multivariate. That means we need all primes in a consecutive row, and we need as many of them as the bigger of subtrahend and minuend has.
It is a good idea to save some memory and skip any primes whose exponent comes out of the subtraction as zero, but only as long as the result doesn’t get squeezed through the function another time. The rule of consecutiveness is broken then, so we have to check for such blank spots and fill them in if necessary.

The version without these checks first:

int mp_subtract_factored_factorials( long *subtrahend,
unsigned long l_subtrahend,
long *minuend,
unsigned long l_minuend,
long **difference,
unsigned long /* Vive */ *l_difference /*!*/
){
unsigned long k, counter=0;
unsigned long max_length = MAX(l_subtrahend,l_minuend );
unsigned long min_length = MIN(l_subtrahend,l_minuend );
long d,p,d_1,d_2;
/* TODO: check for sizes > 0 here */

/* Things work a bit different from ordinary arithmetic from now on */
*difference = malloc(sizeof(unsigned long)*(max_length+1));
if(*difference == NULL){
return MP_MEM;
}

/* Loop over smaller chunk first */
for(k=0;k<min_length;k+=2){
/* both have same length, we can take the value of the prime from any */
p = subtrahend[k];
/*d = subtrahend[k+1] - minuend[k+1];*/
d_1 = subtrahend[k+1];
d_2 = minuend[k+1];
/* handle over/underflow */
if((d_2 > 0) && (d_1 < LONG_MIN + d_2)){
return MP_VAL;
}
if((d_2 < 0) && (d_1 > LONG_MAX + d_2)){
return MP_VAL;
}
d = d_1 - d_2;
/* We only keep results that are different from zero */
if(d!=0){
(*difference)[counter]   = p;
(*difference)[counter+1] = d;
counter += 2;
}
}
/* We need to find out which one is the smaller array and we have basically
two ways to approach the problem:
a) complicated and error-prone pointer juggling
b) just two loops
Here I have chosen b.
*/

/* If both arrays have the same length we can just stop here */
if(max_length == min_length){
/* We made nothing dirty, so there's nothing to clean up here; let's just
grab our stuff and run */
return MP_OKAY;
}

/* If the subtrahend is bigger we subtract zeros, so simply copy */
if(l_subtrahend >= l_minuend ){
for(k=min_length;k<max_length;k+=2){
p = subtrahend[k];
d = subtrahend[k+1];
/* We still check, one never knows where things came from */
if(d!=0){
(*difference)[counter]   = p;
(*difference)[counter+1] = d;
counter += 2;
}
}
}
/* If the minuend is bigger we subtract from zeros, so negate before copy*/
else{
for(k=min_length;k<max_length;k+=2){
p = minuend[k];
/* Yes, even negation can overflow */
d_1 = minuend[k+1];
if(d_1 == LONG_MIN){
return MP_VAL;
}
d = -d_1;
if(d!=0){
(*difference)[counter]   = p;
(*difference)[counter+1] = d;
counter += 2;
}
}
}
/* Take care of the little difference the length might have */
*l_difference = counter;
return MP_OKAY;
}


Yes, I even checked for under/overflow this time 😉
To implement the test for consecutiveness we have to check at every step. Let’s pick out the first loop and put a comment where it has to change:

  for(k=0;k<min_length;k+=2){
/* Both have same length, we can take the value of the prime from any?
No, that is not true, we have to check and insert if none is there.
We don't have to insert every prime, just enough that both have the
same amount
*/
p_1 = subtrahend[k];
p_2 = minuend[k];
if(p_1 != p_2){
/* where is the gap? */
if(p_1 < p_2){
/* the gap is in the minuend, fill up subtrahend  */
}
if(p_1 > p_2){
/* the gap is in the subtrahend, fill up minuend */
}
}
/* handle difference in indices */
/*d = subtrahend[k+1] - minuend[k+1];*/
d_1 = subtrahend[k+1];
d_2 = minuend[k+1];
/* handle over/underflow */
if((d_2 > 0) && (d_1 < LONG_MIN + d_2)){
return MP_VAL;
}
if((d_2 < 0) && (d_1 > LONG_MAX + d_2)){
return MP_VAL;
}
d = d_1 - d_2;
/* We only keep results that are different from zero */
if(d!=0){
(*difference)[counter]   = p;
(*difference)[counter+1] = d;
counter += 2;
}
}


As you might have guessed already, it is not very straightforward. We cannot simply fill up the minuend, because we have to subtract the exponents, so we have to fill up with negated exponents. To keep track of the difference in the indices we have to know which one, subtrahend or minuend, differs, and by how much. Which results in this (untested! I tend to be off by one in these situations the first time, so please check with a debugger before you use it) loop:

  unsigned long gap_in_subtrahend = 0;
  unsigned long gap_in_minuend    = 0;
  for(k=0;k<min_length;k+=2){
    p_1 = subtrahend[k+gap_in_minuend];
    p_2 = minuend[k+gap_in_subtrahend];
    if(p_1 != p_2){
      if(p_1 < p_2){
        /* primes missing in the minuend: pad with neutral entries */
        for(i=0;p_1 < p_2;i+=2){
          p_1 = subtrahend[k+i+gap_in_minuend];
          (*difference)[counter]   = 1;
          (*difference)[counter+1] = 0;
          counter += 2;
        }
        gap_in_minuend += i;
      }
      else{
        /* primes missing in the subtrahend: copy negated minuend exponents */
        for(i=0;p_1 > p_2;i+=2){
          p_2 = minuend[k+i+gap_in_subtrahend];
          d_2 = minuend[k+i+1+gap_in_subtrahend];
          /* Yes, even negation can overflow */
          if(d_2 == LONG_MIN){
            return MP_VAL;
          }
          d = -d_2;
          (*difference)[counter]   = p_2;
          (*difference)[counter+1] = d;
          counter += 2;
        }
        gap_in_subtrahend += i;
      }
    }
    /*d = subtrahend[k+1] - minuend[k+1];*/
    d_1 = subtrahend[k+1+gap_in_minuend];
    d_2 = minuend[k+1+gap_in_subtrahend];
    /* handle over/underflow */
    if((d_2 > 0) && (d_1 < LONG_MIN + d_2)){
      return MP_VAL;
    }
    if((d_2 < 0) && (d_1 > LONG_MAX + d_2)){
      return MP_VAL;
    }
    d = d_1 - d_2;
    /* We only keep results that are different from zero */
    if(d!=0){
      (*difference)[counter]   = p_1;
      (*difference)[counter+1] = d;
      counter += 2;
    }
  }


A check for the bounds of the participating arrays is still missing. And all we would gain is some saved memory. Besides, how often do such gaps happen at all? Only if you manipulate the arrays by hand, e.g. divide by a small integer (which means subtracting the prime factorization of a small integer), or after a couple of divisions by factorials.
Two small examples:

${\displaystyle{ \frac{\left(\frac{20!}{10!}\right)}{10!} = 2^2 \cdot 3^0 \cdot 5^0 \cdot 7^0 \cdot 11^1 \cdot 13^1 \cdot 17^1 \cdot 19^1 = 184\,756 }}$

And

${\displaystyle{ \frac{10!}{25} = 2^8 \cdot 3^4 \cdot 5^0 \cdot 7^1 = 145\,152 }}$

So it does happen, and not rarely; we cannot ignore it.
Another basic rule of programming is: “Make it work first, optimize later!”, where “works unusably slowly” counts as non-working, too. (One example is the too optimistic starting value for the nth-root function in libtommath, which makes it unusably slow.)
Skip the array compression? At what cost? Two examples

${\displaystyle{ \frac{20!}{19!} = 2^2 \cdot 3^0 \cdot 5^1 \cdot 7^0 \cdot 11^0 \cdot 13^0 \cdot 19^0 = 20 }}$

${\displaystyle{ \frac{20!}{18!} = 2^2 \cdot 3^0 \cdot 5^1 \cdot 7^0 \cdot 11^0 \cdot 13^0 \cdot 19^1 = 380 }}$

If we divide two factorials $\frac{m!}{n!}$, the gaps start to build up if $n>\frac{m}{2}$ (proof omitted), and similarly if we divide by a small integer $\frac{m!}{k}$. Such a case just needs a place to keep the original, un-evaluated factorials, maybe at the end of the array with the prime part set to one, which does not change the behavior of the final multiplication ($1^x = 1$). It should be treated, though, because otherwise the final multiplication loops several times for nothing (at most 32 or 64 times, depending on the bit size of the unsigned long type, but it loops over the whole array every time).
So it is the check for $n>\frac{m}{2}$ we can fall back to if the result without it is too slow (unlikely: the heaviest computational burden is the multiplication and squaring at the end, but cache misses can hurt; it has to be tested) or uses too much memory (more likely, because we do this not for small factorials but for large ones, too large to be handled otherwise, somewhere in the range of $n > 10^6$ for $n!$).
So (I am such a so-so man 😉 ) we go without the memory optimizations for now. That makes the code look a bit different. Not much, but, well, a bit. We just strip the checks for a zero difference and leave the rest untouched.

int mp_subtract_factored_factorials( long *subtrahend,
unsigned long l_subtrahend,
long *minuend,
unsigned long l_minuend,
long **difference,
unsigned long /* Vive */ *l_difference /*!*/
){
unsigned long k, counter=0;
unsigned long max_length = MAX(l_subtrahend,l_minuend );
unsigned long min_length = MIN(l_subtrahend,l_minuend );
long d,p,d_1,d_2;
/* TODO: check for sizes > 0 here */

/* Things work a bit different from ordinary arithmetic from now on */
*difference = malloc(sizeof(unsigned long)*(max_length+1));
if(*difference == NULL){
return MP_MEM;
}

/* Loop over smaller chunk first */
for(k=0;k<min_length;k+=2){
/* both have same length, we can take the value of the prime from any */
p = subtrahend[k];
/*d = subtrahend[k+1] - minuend[k+1];*/
d_1 = subtrahend[k+1];
d_2 = minuend[k+1];
/* handle over/underflow */
if((d_2 > 0) && (d_1 < LONG_MIN + d_2)){
return MP_VAL;
}
if((d_2 < 0) && (d_1 > LONG_MAX + d_2)){
return MP_VAL;
}
d = d_1 - d_2;
(*difference)[counter]   = p;
(*difference)[counter+1] = d;
counter += 2;
}
/* We need to find out which one is the smaller array and we have basically
two ways to approach the problem:
a) complicated and error-prone pointer juggling
b) just two loops
Here I have chosen b.
*/

/* If both arrays have the same length we can just stop here */
if(max_length == min_length){
/* We made nothing dirty, so there's nothing to clean up here, let's just
grab our stuff and run */
return MP_OKAY;
}

/* If the subtrahend is bigger we subtract zeros, so simply copy */
if(l_subtrahend >= l_minuend ){
for(k=min_length;k<max_length;k+=2){
p = subtrahend[k];
d = subtrahend[k+1];
(*difference)[counter]   = p;
(*difference)[counter+1] = d;
counter += 2;
}
}
/* If the minuend is bigger we subtract from zeros, so negate before copy*/
else{
for(k=min_length;k<max_length;k+=2){
p = minuend[k];
/* Yes, even negation can overflow */
d_1 = minuend[k+1];
if(d_1 == LONG_MIN){
return MP_VAL;
}
d = -d_1;
(*difference)[counter]   = p;
(*difference)[counter+1] = d;
counter += 2;
}
}
/* this is now the same as max_length */
*l_difference = counter;
return MP_OKAY;
}


Addition follows the example of subtraction

int mp_add_factored_factorials(  long *summand_1,
unsigned long l_summand_1,
long *summand_2,
unsigned long l_summand_2,
long **sum,
unsigned long *l_sum){
unsigned long k, counter=0;
unsigned long max_length = MAX(l_summand_1,l_summand_2);
unsigned long min_length = MIN(l_summand_1,l_summand_2);
long s,p, s_1,s_2;
/* For more detailed comments see mp_subtract_factored_factorials() */
*sum = malloc(sizeof(unsigned long)*(max_length+1));
if(*sum == NULL){
return MP_MEM;
}
for(k=0;k<min_length;k+=2){
p = summand_1[k];
/* Over/underflow possible! */
/*s = summand_1[k+1] + summand_2[k+1];*/
s_1 = summand_1[k+1];
s_2 = summand_2[k+1];

if((s_2 > 0) && (s_1 > LONG_MAX - s_2)){
/* overflow */
return MP_VAL;
}
if((s_2 < 0) && (s_1 < LONG_MIN - s_2)){
/* underflow */
return MP_VAL;
}
s = s_1 + s_2;
(*sum)[counter]   = p;
(*sum)[counter+1] = s;
counter += 2;
}
if(l_summand_1 >= l_summand_2){
for(k=min_length;k<max_length;k+=2){
p = summand_1[k];
s = summand_1[k+1];
(*sum)[counter]   = p;
(*sum)[counter+1] = s;
counter += 2;
}
}
else{
for(k=min_length;k<max_length;k+=2){
p = summand_2[k];
s = summand_2[k+1];
(*sum)[counter]   = p;
(*sum)[counter+1] = s;
counter += 2;
}
}
*l_sum = counter;
return MP_OKAY;
}


Multiplication and division of the polynomials make not much sense, because they are multivariate, but multiplication by a small integer does; it amounts to raising the factored number to a power. Example

${\displaystyle{ \begin{aligned}20! &= 2^{18} \cdot 3^8 \cdot 5^4 \cdot 7^2 \cdot 11^1 \cdot 13^1 \cdot 17^1 \cdot 19^1 \\ (20!)^2 &= 2^{36} \cdot 3^{16} \cdot 5^8 \cdot 7^4 \cdot 11^2 \cdot 13^2 \cdot 17^2 \cdot 19^2 \end{aligned} }}$

We need to check for overflow at the multiplications here, and we have two alternatives: use a data type twice the size of the multiplicands’ type, or do it the hard way.
The hard, but safe and portable, way is to check every possible sign combination against the corresponding quotient before multiplying.

if (a > 0) {
if (b > 0) {
if (a > (SIGNED_TYPE_MAX / b)) {
return ERROR;
}
}
else {
if (b < (SIGNED_TYPE_MIN / a)) {
return ERROR;
}
}
}
else {
if (b > 0) {
if (a < (SIGNED_TYPE_MIN / b)) {
return ERROR;
}
}
else {
if ( (a != 0) && (b < (SIGNED_TYPE_MAX / a))) {
return ERROR;
}
}
}


The other alternative is to use long long and hope for the best, or to use the explicit-width data types from stdint.h throughout.
We use libtommath here, where a data type mp_word is defined to be able to hold the product of two mp_digits, but mp_digit can be as small as a char.

int mp_power_factored_factorials(  long *input,
unsigned long l_input,
long multiplicator,
long **product,
unsigned long *l_product){

unsigned long k, counter=0;

long prod,p, p_1,p_2;
#ifdef MP_USE_LONG_LONG_AND_HOPE_FOR_THE_BEST
long long temp;
#endif
*product = malloc(sizeof(unsigned long)*(l_input+1));
if(*product == NULL){
return MP_MEM;
}
p_2 = multiplicator;
for(k=0;k<l_input;k+=2){
p = input[k];
/* Over/underflow possible! */
/*prod = input[k+1] * multiplicator;*/
p_1 = input[k+1];
/* Two alternatives: use "long long" and hope for the best or do it the
hard way */
#ifdef MP_USE_LONG_LONG_AND_HOPE_FOR_THE_BEST
temp = (long long)p_2 * (long long)p_1;
if ((temp > LONG_MAX) || (temp < LONG_MIN)) {
return MP_VAL;
}
prod = (long) temp;
#else
if (p_2 > 0) {
if (p_1 > 0) {
if (p_2 > (LONG_MAX / p_1)) {
return MP_VAL;
}
}
else {
if (p_1 < (LONG_MIN / p_2)) {
return MP_VAL;
}
}
}
else {
if (p_1 > 0) {
if (p_2 < (LONG_MIN / p_1)) {
return MP_VAL;
}
}
else {
if ( (p_2 != 0) && (p_1 < (LONG_MAX / p_2))) {
return MP_VAL;
}
}
}
prod = p_2 * p_1;
#endif
(*product)[counter]   = p;
(*product)[counter+1] = prod;
counter += 2;
}

*l_product = counter;
return MP_OKAY;
}


We can produce the reciprocal of a number if we negate all exponents of its prime factorization.

int mp_negate_factored_factorials(  long *input,
unsigned long l_input,
long **output,
unsigned long *l_output){
unsigned long k, counter=0;
long neg;

*output = malloc(sizeof(unsigned long)*(l_input+1));
if(*output == NULL){
return MP_MEM;
}
for(k=0;k<l_input;k+=2){
neg = input[k+1];
if(neg == LONG_MIN){
return MP_VAL;
}
(*output)[counter]   = input[k];
(*output)[counter+1] = -neg;
counter += 2;
}
*l_output = counter;
return MP_OKAY;
}


One can do the above inline, of course, and save a copy, but doing things inline is always a rich source of unexpected side effects and an abundantly gushing spring of gunshot-injured feet, so pleeease be careful.

The final step, computing the polynomials, is different from the function mp_compute_factored_factorials() in that we have to be able to handle negative exponents, too.
One simple way to handle negative exponents is to keep them in a second array, with the absolute values of the exponents, and divide at the end. Another alternative is to return both arrays and let the user decide. Either way, we have to check, and if there is at least one negative exponent, decide what to do.

It is also the place where we have to decide what to do about the fact that the useful function mp_mul_d() accepts only mp_digit values as the small integers. If we approximate the maximum exponent in the factorization of $n!$ as the exponent of two, with a magnitude of roughly $n$, and the following exponents roughly half the magnitude of their predecessors, then the biggest possible exponent for the smallest possible mp_digit type is 255, which makes the largest processable factorial $256!$, if I calculated it correctly (the biggest prime smaller than 255 is 251). That is not very much.

From here on I assume DIGIT_BIT == 28 and check for it. We will handle the exponent of two separately, which allows us to compute $n!$ for $n<2^{29} = 536\,870\,912$. The factorial of half a billion is about $1.6631e4\,453\,653\,131$, a number with nearly four and a half billion digits. Close to the age of the Earth in years (4.54 ± 0.05 billion according to Wikipedia), not so close to its age in minutes (3.15 billion, according to one James Ussher).
What, me and digressing? Ow, c'mon!

static long local_highbit(unsigned  long n){
long r=0;
while (n >>= 1) {
r++;
}
return r;
}
/*
To compute the resulting polynomials we need an approach different from
mp_compute_factored_factorials() because we have to be able to handle negative
exponents, too.
*/
int mp_multiply_factored_factorials(long *f, unsigned long f_length,
mp_int *c, mp_int *r){
unsigned long start=0,i,count=0,neg_count = 0,k=0;
long shift = 0;
long bit=0,neg_bit=0,hb=0,neg_hb=0,*p=NULL;
mp_int temp;
int e;

if(f[0] == 2){
shift = f[1];
if(f_length == 2){
if(shift > 0){
if( (e = mp_set_int(c,1LU<<(unsigned long)f[1]) ) != MP_OKAY){
return e;
}
}
return MP_OKAY;
}
}
if(shift){
start+=2;
k=2;
}
/*
This loop is expensive and should be done inline instead but for now...
*/
for(k=start;k<f_length;k+=2){
if(f[k+1] < 0){
/* count how many to save some memory */
count++;
}
/* Only primes up to 2^DIGIT_BIT because of mp_mul_d() */
if(f[k] >= 1<<DIGIT_BIT){
return MP_VAL;
}
}
/* all are negative */
//if(count && count == f_length/2){
/* if the alternative with two outputs has been chosen just skip the
computation of the polynomial with all positive exponents and set
*/
/* goto: negative_exponents; */

/* The other alternative would compute 1/x that gives always 0 in integer
divisions if x>1. But it would need exactly one exponent with zero which
has been filtered out already. It is debatable now if that was a good
decision.*/
//return MP_OKAY;
//}

if(count){
p = malloc( sizeof(long) * (count * 2) );
if(p == NULL){
return MP_MEM;
}
}

for(k=start;k<f_length;k+=2){
/* won't branch if count == 0 */
if(f[k+1] < 0){
p[neg_count]   = f[k];
p[neg_count+1] = labs(f[k+1]);
neg_count += 2;
/* And while we are already here ... */
neg_hb = local_highbit((unsigned long)labs(f[k+1]));
if(neg_bit < neg_hb)neg_bit = neg_hb;
/* This allows for some optimization, mainly w.r.t memory. Not done yet */
f[k]   = 1;
/* Set to zero to let the main loop skip it */
f[k+1] = 0;
}
else{
hb = local_highbit(f[k+1]);
if(bit < hb)bit = hb;
}
}

/* DIGIT_BIT can be as small as 8 */
if(bit >(long) DIGIT_BIT || neg_bit >(long) DIGIT_BIT){
return MP_VAL;
}
/* Anything times zero is zero, so set the result to one in advance */
if( (e = mp_set_int(c,1) ) != MP_OKAY){
return e;
}

/* Everything is in place, so let's start by computing with the positive
exponents first */

/* The other function mp_compute_factored_factorials() has a branch
but the cases where it makes sense are quite rare. Feel free to add it
yourself */
for( ; bit>=0; bit--) {
if( (e = mp_sqr(c, c) ) != MP_OKAY){
return e;
}
for(i=start; i<f_length; i+=2) {
if ((f[i+1] & (1LU << (unsigned long)bit)) != 0) {
/* This is problematic if DIGIT_BIT is too small. checked above */
if( (e = mp_mul_d(c,(mp_digit)f[i], c) ) != MP_OKAY){
return e;
}
}
}
}

/* Compute c * 2^n if n!=0 */
if(shift && shift > 0){
if(shift < 1<<DIGIT_BIT){
/* The choice of types is a bit questionable. Sometimes. */
if( (e = mp_mul_2d (c, (mp_digit) shift, c) ) != MP_OKAY){
return e;
}
}
else{
long multiplicator = 0;
another_round:
while(shift > 1<<DIGIT_BIT){
shift >>= 1;
multiplicator++;
}
if( (e = mp_mul_2d (c, (mp_digit) shift, c) ) != MP_OKAY){
return e;
}
if(multiplicator< DIGIT_BIT){
if( (e = mp_mul_2d (c, (mp_digit)(1<<multiplicator), c) ) != MP_OKAY){
return e;
}
}
else{
shift = 1<<multiplicator;
multiplicator = 0;
goto another_round;
}
}
}

/* None are negative*/
if(count == 0){
/* Clean up */
/* Nothing to clean up */
return MP_OKAY;
}

/* Compute with the negative exponents */
if( (e = mp_init(&temp) ) != MP_OKAY){
return e;
}
if( (e = mp_set_int(&temp,1) ) != MP_OKAY){
return e;
}
for( ; neg_bit>=0; neg_bit--) {
if( (e = mp_sqr(&temp, &temp) ) != MP_OKAY){
return e;
}
/* The exponents of 2 have been stripped already so we start at zero.
"count" counts only the occurrences, the size itself is twice as large.
*/
for(i=0; i<count*2; i+=2) {
if ((p[i+1] & (1LU << (unsigned long)neg_bit)) != 0) {
if( (e = mp_mul_d(&temp,(mp_digit)p[i], &temp) ) != MP_OKAY){
return e;
}
}
}
}

/* Compute c * 2^n if n!=0  for negative n*/
if(shift && shift < 0){
/* check for overflow */
if(shift == LONG_MIN){
return MP_VAL;
}
shift = -shift;
if(shift < 1<<DIGIT_BIT){
/* The choice of types is a bit questionable. Sometimes. */
if( (e = mp_mul_2d (&temp, (mp_digit) shift, &temp) ) != MP_OKAY){
return e;
}
}
else{
long multiplicator = 0;
another_neg_round:
while(shift > 1<<DIGIT_BIT){
shift >>= 1;
multiplicator++;
}
if( (e = mp_mul_2d (&temp, (mp_digit) shift, &temp) ) != MP_OKAY){
return e;
}
if(multiplicator < DIGIT_BIT){
if( (e = mp_mul_2d (&temp, (mp_digit)(1<<multiplicator), &temp) ) != MP_OKAY){
return e;
}
}
else{
shift = 1<<multiplicator;
multiplicator = 0;
goto another_neg_round;
}
}
}
#ifdef BN_MP_DO_LAST_DIVISION
if( (e = mp_div(c,&temp,c,r) ) != MP_OKAY){
return e;
}
#else
if(r != NULL){
if( (e = mp_copy(&temp,r) ) != MP_OKAY){
return e;
}
}
#endif
free(p);
mp_clear(&temp);
return MP_OKAY;
}


It compiles and works for small inputs but it cannot be called well tested. Use with care and drop me a note if you find a bug.

How to compute Narayana numbers? As I said in the beginning: computing them with the formula

${\displaystyle{ N(n,k) = \frac{1}{n}{ \binom nk}{\binom n{k-1}} }}$

would most probably be the fastest way and, if not, it is at least the easiest to implement.
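For small inputs the formula can be cross-checked with native integers. A hypothetical sketch (binom_small() and narayana_small() are made-up names, not libtommath functions); the multiply-then-divide order in the loop keeps every intermediate value integral:

```c
/* C(n,k) with native integers; the division in the loop is exact
   because r*(n-k+i) is always a multiple of i at that point. */
static unsigned long binom_small(unsigned long n, unsigned long k)
{
    unsigned long r = 1, i;
    if (k > n) return 0;
    if (k > n - k) k = n - k;   /* use the symmetry C(n,k) = C(n,n-k) */
    for (i = 1; i <= k; i++) {
        r = r * (n - k + i) / i;
    }
    return r;
}

/* N(n,k) = C(n,k)*C(n,k-1)/n; the final division is exact, too. */
static unsigned long narayana_small(unsigned long n, unsigned long k)
{
    return binom_small(n, k) * binom_small(n, k - 1) / n;
}
```

Row $n=4$ of the Narayana triangle, for example, comes out as $1, 6, 6, 1$.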

int mp_narayana(unsigned long n, unsigned long k, mp_int *c){
mp_int temp;
int e;
if( (e = mp_init(&temp) ) != MP_OKAY){
return e;
}
if( (e = mp_binomial(n,k,c) ) != MP_OKAY){
return e;
}
if( (e = mp_binomial(n,k-1,&temp) ) != MP_OKAY){
return e;
}
if( (e = mp_mul(c,&temp,c) ) != MP_OKAY){
return e;
}
if( (e = mp_set_int(&temp,n) ) != MP_OKAY){
return e;
}
if( (e = mp_div(c,&temp,c,NULL) ) != MP_OKAY){
return e;
}
mp_clear(&temp);
return MP_OKAY;
}


The above code will be merged as mp_narayana() into my fork of libtommath in the next days. The rest only after some thorough tests.

Adventures of a Programmer: Parser Writing Peril IX

Now that we have a function to compute the binom…(what? The grammar? Well)…ial coefficient we can use it to compute the Catalan numbers.

${\displaystyle{ \mathrm{C}_n = \frac{\binom {2n}{n}}{n+1} }}$
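For small $n$ the recurrence $\mathrm{C}_{n+1} = \mathrm{C}_n \cdot 2(2n+1)/(n+2)$ suffices; a native-integer sketch (catalan_small() is a made-up helper, not the mp_catalan() below):

```c
/* Catalan numbers with native integers.  The division is exact at
   every step since c*2*(2i+1) equals (i+2)*C_{i+1} exactly. */
static unsigned long catalan_small(unsigned long n)
{
    unsigned long c = 1, i;
    for (i = 0; i < n; i++) {
        c = c * 2 * (2 * i + 1) / (i + 2);
    }
    return c;
}
```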

The term $2n$ in the binomial coefficient needs some care, and even the $n+1$ causes problems if it is larger than 1<<DIGIT_BIT, because the function mp_div_d() only accepts an mp_digit as the divisor. These thoughts result in the following short function:

int mp_catalan(unsigned long n, mp_int *c){
int e;
if(n == 0 || n == 1){
if( (e = mp_set_int(c,1LU) ) != MP_OKAY){
return e;
}
return MP_OKAY;
}
/*  C(2^31) is already a very large number
C(2^31) ~ 1.7593e1,292,913,972
The author's box would need approximately one year and seven months
of computing time.
The number of digits follow loosely the size of n here, e.g.
C(2^32)  ~ 1.9303e2,585,827,958
C(2^64)  ~ 2.5891e11,106,046,577,046,714,235
C(2^128) ~ 1.2728e204,870,398,877,478,727,500,024,218,500,206,465,342
*/
if(n >= ULONG_MAX/2){
return MP_VAL;
}
if( (e = mp_binomial(2*n,n,c) ) != MP_OKAY){
return e;
}
if((n+1) >= 1<<DIGIT_BIT){
mp_int temp;
if( (e = mp_init(&temp) ) != MP_OKAY){
return e;
}
if( (e = mp_set_int(&temp,n+1) ) != MP_OKAY){
return e;
}
if( (e = mp_div(c,&temp,c,NULL) ) != MP_OKAY){
return e;
}
mp_clear(&temp);
}
else{
if( (e = mp_div_d(c,n+1,c,NULL) ) != MP_OKAY){
return e;
}
}
return MP_OKAY;
}


It has been merged into my fork of libtommath.

Further optimizations possible? Yes.
The binomial coefficient in question is the central binomial coefficient, which comes with the interesting equality:

${\displaystyle{ \binom {2n}{n} = \frac{\left(2n\right)!}{\left(n!\right)^2} }}$

which means that we can skip one part of the calculation of the final prime factorization.

1. ${\displaystyle{ f_{2n} - \left( f_n + f_n\right) } }$
2. ${\displaystyle{ f_{2n} } }$

The following snippet might be of use for a better understanding.

  prime = 2;K=0;
do{
diff = mp_prime_divisors(2*n,prime)-(2*mp_prime_divisors(n,prime));
if(diff != 0){
prime_list[K] = prime;
prime_list[K+1] = diff;
K+=2;
}
prime          = mp_bitset_nextset(bst,prime+1);
}while(prime <= n);
do {
prime_list[K] = prime;
prime_list[K+1] = mp_prime_divisors(n,prime);
prime          = mp_bitset_nextset(bst,prime+1);
K+=2;
}while(prime <= 2*n);


The time spent calculating the prime factorization is negligible compared with the final computation, but all exponents of the primes larger than $n$ are one for $(2n)!$, so it might be a good idea to use binary splitting for this primorial. Earlier experiments showed that the cutoff is above the factorial of two million (2,000,000!) on my machine.
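The snippets assume that mp_prime_divisors(n, p) returns the exponent of the prime p in $n!$, i.e. Legendre's formula $\sum_{i\geq 1}\lfloor n/p^i\rfloor$; a self-contained sketch of that assumption:

```c
/* Legendre's formula: the exponent of the prime p in n! is
   floor(n/p) + floor(n/p^2) + ...  This is the role played by
   mp_prime_divisors() in the snippets above. */
static unsigned long prime_exponent_in_factorial(unsigned long n,
                                                 unsigned long p)
{
    unsigned long e = 0;
    while (n > 0) {
        n /= p;
        e += n;
    }
    return e;
}
```

For the central binomial coefficient the exponent of $p$ is then the difference of two such sums; e.g. $\binom{10}{5} = 252 = 2^2\cdot 3^2\cdot 7$ gets its exponent of two as $8 - 2\cdot 3 = 2$.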

Adventures of a Programmer: Parser Writing Peril VIII

Still no grammar, mainly because of the “Things are falling apart” theme I mentioned in my last post. But away with boring physical masonry and its dirt, and on to some cleaner hobbies. One of those is the implementation of the numerical computation of binomial coefficients. Binomial coefficients are useful in a lot of places, too many to even think about listing them, although all have one thing in common: they want them all, in the right order, and fast. Really fast.
No problem.
You may remember my last posts about the numerical evaluation of the factorial function, for example the one here and the one describing Borwein’s trick in detail.
In a nutshell: the trick speeds up the computation by using the prime factorization of the factorial.
The formula for the binomial coefficients is

${\displaystyle{ \frac{n!}{k!\,(n-k)!} } }$

The exclamation marks denote factorials, and we have a fast algorithm for factorials. It works with the prime factorization, which exists by the Fundamental Theorem of Arithmetic:

${\displaystyle { n = \prod_{i=1}^{k}p_i^{\alpha_i} } }$

Now the formula above is not quite right; the product is actually infinite, because the rule $x^0 = 1$ (with $x \not= 0$) lets us include every prime:

${\displaystyle { n = \prod_{i=1}p_i^{\alpha_i} } }$

To multiply two numbers in, say, the decimal system, we multiply the first place with the first place (ignoring the carry for now), the second with the second place, and so on. We can do the same with the polynomials from the prime factorization. Given the prime factorizations of two numbers $f_1 = \prod_{i=1}^{k_1}p_i^{\alpha_i}$ and $f_2 =\prod_{i=1}^{k_2}p_i^{\beta_i}$, we can multiply the numbers by adding the exponents of the primes. We do not have to care whether $k_1$ and $k_2$ are equal; we can just pad with primes with an exponent of zero. Let me steal (shamelessly, as always) the equation from Wikipedia here:

${\displaystyle {a\cdot b=2^{a_2+b_2}\,3^{a_3+b_3}\,5^{a_5+b_5}\,7^{a_7+b_7}\cdots=\prod p_i^{a_{p_i}+b_{p_i}}} }$
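If both numbers use the same prime list (padding with zero exponents where needed), the multiplication collapses to a single element-wise addition over the exponent vectors; a minimal sketch with made-up names:

```c
#include <stddef.h>

/* Multiply two factored numbers by adding exponents prime-by-prime.
   Both inputs are exponent vectors over the same prime list. */
static void mul_factored(const long *a, const long *b, long *out,
                         size_t n_exponents)
{
    for (size_t i = 0; i < n_exponents; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Over the primes $\{2,3,5\}$, $12 = (2,1,0)$ times $45 = (0,2,1)$ gives $(2,3,1)$, i.e. $4\cdot 27\cdot 5 = 540$.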

This is of not much use under normal circumstances because factoring an integer is complicated. Factoring a factorial, on the other hand, is very simple.
We could write functions for it, but I wanted to put it in libtommath (I already did, in my fork) and they want single, independent functions. It is also easier because there are no carries or negative exponents in the case of the binomial coefficient, so we can spare ourselves the trouble of implementing them.
The implementation is straightforward:
Some general shortcuts for ${}_nC_k$ are: $n < k \to 0$, $n = k \vee k = 0 \to 1$, and $n-k=1 \vee k = 1 \to n$.
We should also make use of the symmetry of the binomial coefficient:

${\displaystyle {\binom nk = \binom n{n-k}}}$

(with $0\leq k\leq n$ of course). We then need an array twice the size of the number of primes up to $n$ (we use the upper bound $\tfrac{30\log 113}{113}\tfrac{n}{\log n}$ instead of counting them; the amount of wasted memory is small) and fill it with primes up to $n$. Well, actually, we don't fill it, we do it all inline.
If we replace the factorials $n!$, $k!$, and $(n-k)!$ in the equation of the binomial coefficient with their prime factorizations $f_n$, $f_k$, and $f_{n-k}$ we get

${\displaystyle{ f_n - \left( f_k + f_{n-k}\right) } }$

We have taken care that $k \leq n-k$, which means that $k \leq (n-k) \leq n$, so we can do it in three consecutive loops: the first one up to and including $k$, the second one up to and including $n-k$, and the last one up to and including $n$. In order:

1. ${\displaystyle{ f_n - \left( f_k + f_{n-k}\right) } }$
2. ${\displaystyle{ f_n - f_{n-k} } }$
3. ${\displaystyle{ f_n } }$

with the last one just padding. We can do that because the real prime factorization is an infinite product, but we can ignore all exponents that are zero, which leads to the truncation in the way listed above.
The last step is the actual computation of the resulting polynomial and that is the only thing I did in an external function because it is simple and used more often. It will not get used in the factorial function because of the optimizations there which are not applicable for a more general function. The implementation of this function (bn_mp_compute_factored_factorial.c) is as you might have expected:

/* I think this is the third version, isn't it?*/
static long local_highbit(unsigned  long n){
long r=0;
while (n >>= 1) {
r++;
}
return r;
}
unsigned long BN_MP_FACTORED_FACTORIAL_CUTOFF = 0x10000000;
/* Input is to be sorted as: prime,exponent,prime,exponent,... */
int mp_compute_factored_factorial(unsigned long *f, unsigned long f_length,
mp_int *c,   unsigned long stop){

unsigned long length,start,i;
unsigned long shift = 0;
long bit,k=0,hb;
mp_int temp;
int e;

if(stop == 0){
length = f_length;
}
else{
length = stop;
}
if(f[0] == 2 && f[1] > 0){
shift = f[1];
if(f_length == 2){
if( (e = mp_set_int(c,1LU<<f[1]) ) != MP_OKAY){
return e;
}
return MP_OKAY;
}
}

start = 0;
if(shift){start+=2;k=2;}
bit = 0;
for(;k<(long)f_length;k+=2){
hb=local_highbit(f[k+1]);
if(bit < hb)bit=hb;
}
/* just for safety reasons, y'know */
if(bit >(long) DIGIT_BIT){
return MP_VAL;
}
if( (e = mp_set_int(c,1) ) != MP_OKAY){
return e;
}
if(f_length > BN_MP_FACTORED_FACTORIAL_CUTOFF){
if( (e = mp_init(&temp) ) != MP_OKAY){
return e;
}
for( ; bit>=0; bit--) {
if( (e =  mp_sqr(c, c) ) != MP_OKAY){
return e;
}
if( (e =  mp_set_int(&temp,1) ) != MP_OKAY){
return e;
}
for(i=start; i<f_length; i+=2) {
if ((f[i+1] & (1LU << (unsigned long)bit)) != 0) {
if( (e =  mp_mul_d(&temp,(mp_digit)f[i], &temp) ) != MP_OKAY){
return e;
}
}
}
if( (e =  mp_mul(&temp,c,c) ) != MP_OKAY){
return e;
}
}
}
else{
for( ; bit>=0; bit--) {
if( (e = mp_sqr(c, c) ) != MP_OKAY){
return e;
}
for(i=start; i<f_length; i+=2) {
if ((f[i+1] & (1LU << (unsigned long)bit)) != 0) {
if( (e = mp_mul_d(c,(mp_digit)f[i], c) ) != MP_OKAY){
return e;
}
}
}
}
}
if(shift){
/* DIGIT_BIT might be 60 which makes this extra check necessary */
if(shift < INT_MAX){
/* The choice of types is a bit questionable. Sometimes. */
if( (e = mp_mul_2d (c, (int) shift, c) ) != MP_OKAY){
return e;
}
}
else{
int multiplicator=0;
while(shift > INT_MAX){
shift >>= 1;
multiplicator++;
}
if( (e = mp_mul_2d (c, (int) shift, c) ) != MP_OKAY){
return e;
}
if( (e = mp_mul_2d (c, 1<<multiplicator, c) ) != MP_OKAY){
return e;
}
}
}
return MP_OKAY;
}


The branch in the middle with the cutoff BN_MP_FACTORED_FACTORIAL_CUTOFF makes no difference in this case but might in others. To be tested.
The actual code for the binomial coefficients uses two simpler fallbacks, one with unsigned long and one simple iterative algorithm with big integers. The one with native integers uses one of the normal algorithms but divides by $\gcd \left( n,k\right)$ first. It seems a waste of good electrons to compute the GCD, but it keeps the full range of an unsigned long without having to use an intermediate unsigned long long.

static unsigned long gcd ( unsigned long a, unsigned long b ){
unsigned long c;
while ( a != 0 ) {
c = a; a = b%a;  b = c;
}
return b;
}
/*
This version using the gcd is a bit slower but avoids the long-long.
Upper bound of the number of recursions is n/2.
*/
static unsigned long comb(unsigned long n, unsigned long k){
unsigned long d;
unsigned long q;
if(n<k)return 0;
if(n==k || k == 0)return 1;
if (k > n/2){
k = n-k;
}
d = gcd( n, k );
q = k / d;
return ( comb( n-1, k-1 ) / q ) * ( n / d );
}


And the small big integer function is

static int mp_comb(unsigned long n, unsigned long k,mp_int *c){
unsigned long i;
mp_int temp;
int e;

if(n<k){
mp_set_int(c,0);
return MP_OKAY;
}
if(n==k){
mp_set_int(c,1);
return MP_OKAY;
}
/* That's over-simplified, e.g. comb(99,6) = 1120529256 is still
smaller than 2^32.
An upper limit would be (the magic number is 1/sqrt(2*pi))

ulim = (0.398942280401430*pow(n,(n+.5 ))*pow((n-k),(k-n-.5 ))
/pow(k,(k + 0.5 )));

Stanica, Pantelimon. "Good lower and upper bounds on binomial
coefficients." Journal of Inequalities in Pure and Applied
Mathematics 2.3 (2001): 30.
*/
if(n < 35){
if( (e = mp_set_int(c,comb(n,k)) ) != MP_OKAY){
return e;
}
return MP_OKAY;
}
if( (e = mp_set_int(c,1) ) != MP_OKAY){
return e;
}
if (k > n/2){
k = n-k;
}
if( (e = mp_init(&temp) ) != MP_OKAY){
return e;
}
for (i=1; i<=k;i++){
if( (e = mp_set_int(&temp,(n - k + i) ) ) != MP_OKAY){
return e;
}
if( (e = mp_mul(c,&temp,c) ) != MP_OKAY){
return e;
}
if( (e = mp_set_int(&temp,i) ) != MP_OKAY){
return e;
}
if( (e =  mp_div(c,&temp,c,NULL) ) != MP_OKAY){
return e;
}
}
mp_clear(&temp);
return MP_OKAY;
}


As you can see from the comment in the code, the cutoff points are hard to find.

int mp_binomial(unsigned long n,unsigned long  k, mp_int *c){
/*  Idea shamelessly stolen from Calc.
Hey, wait, I wrote that calc-script myself!
Oh well, my age's starting to show, I'm afraid ;-)*/

unsigned long *prime_list;
unsigned long pix=0,prime,K,diff;
mp_bitset_t *bst;
int e;

if(n<k){
mp_set_int(c,0);
return MP_OKAY;
}
if(n==k || k==0){
mp_set_int(c,1LU);
return MP_OKAY;
}
if((n-k)==1 || k==1){
/* TODO: either check for sizeof(unsigned long) == 4 or repair mp_set_int() */
mp_set_int(c,n);
return MP_OKAY;
}

/* The function mp_comb(n,k) is faster if k<<n (and v.v.), the exact cut-off
has not been calculated yet but is quite large.*/
if(k < n/900 || (n-k) < n/900){
if( (e = mp_comb(n, k, c) ) != MP_OKAY){
return e;
}
return MP_OKAY;
}

if (k > n/2){
k = n-k;
}
bst = malloc(sizeof(mp_bitset_t));
if (bst == NULL) {
return MP_MEM;
}
mp_bitset_alloc(bst, n+1);
mp_eratosthenes(bst);
/* One could also count the number of primes in the already filled sieve */
pix = (unsigned long) (LN_113 *  n/log(n) );

prime_list = malloc(sizeof(unsigned long) * (pix) * 2);
if (prime_list == NULL) {
return MP_MEM;
}
prime = 2;K=0;
do{
diff = mp_prime_divisors(n,prime)-
( mp_prime_divisors(n-k,prime)+mp_prime_divisors(k,prime));
if(diff != 0){
prime_list[K] = prime;
prime_list[K+1] = diff;
K+=2;
}
prime          = mp_bitset_nextset(bst,prime+1);
}while(prime <= k);
do {
diff = mp_prime_divisors(n,prime)-mp_prime_divisors(n-k,prime);
if(diff != 0){
prime_list[K] = prime;
prime_list[K+1] = diff;
K+=2;
}
prime          = mp_bitset_nextset(bst,prime+1);
}while(prime <= n-k);
do {
prime_list[K] = prime;
prime_list[K+1] = mp_prime_divisors(n,prime);
prime          = mp_bitset_nextset(bst,prime+1);
K+=2;
}while(prime <= n);
prime_list = realloc(prime_list,sizeof(unsigned long) * K);
if (prime_list == NULL) {
return MP_MEM;
}
if( (e = mp_compute_factored_factorial(prime_list,K-1,c,0) ) != MP_OKAY){
return e;
}
free(bst);
free(prime_list);
return MP_OKAY;
}


Your mileage may vary, but that cut-off point is quite off, isn’t it?
More optimizations are possible, especially to save some memory: we could keep only the start and end points instead of the actual prime values, but that would make the code more complicated. Taking the unsigned long to have 32 bits, we need about 927 MiB for the exponents alone, or around 2 GiB for everything using the upper bound for the number of primes listed above (the thing with the 113), or about 1.5 GiB using the exact count $\pi\left(2^{32}-1\right) = 203\,280\,221$. How large is that number? With $n = 2^{32}$ and $k = 2^{31}$ in

${\displaystyle{ { \binom nk } = \frac{n!}{k!\,(n-k)!} }}$

using the logarithm of the Gamma function instead of the factorials we get

${\displaystyle{ { \binom nk } = \exp\bigg(\log\Gamma\left( n +1\right) - \Big( \log\Gamma\left( k +1\right) + \log\Gamma\left( n- k +1\right) \Big)\bigg) }}$

That is roughly $3.778 \times 10^{1\,292\,913\,981}$. I have no idea how long the computation would take, but over one and a quarter billion decimal digits? That’ll last a while, and I’m pretty sure about that guess! 😉
Hint: I have an old machine, an AMD Duron with just 1 GHz. Computing $\binom {2n}{n}$ with $n = 10^5$ needs about half a second and with $n = 10^6$ already nearly a minute, so with $n = 2\times 10^9$, assuming linear growth, it’s about 2 years if I calculated it correctly. The whole thing is parallelizable to a very large extent with very little overhead, say a factor of 1.2. 64 cores are no rarity anymore, so we could do it in about 13 days. Oh, who would have thought!
So if you need more: just drop me a note.

On the numerical evaluation of factorials IX

Subfactorial

The family of the factorials is larger than the mere number of permutations. One of the many offspring is the subfactorial, or derangement number.
The normal implementation, using only integers, can be done with a recursion or, to save some memory and leave the stack where it is, in an iterative way.
The recursion is quite complicated with libtommath (more probably I am just too stupid to get it right 😉 ):

static mp_int kill_the_stack;
static int subfactorial_error = MP_OKAY;
static mp_int * __subfactorialrecursive(unsigned long n){
mp_digit sign;
int e;
if(n==0){
if ( ( e =  mp_set_int(&kill_the_stack,1) ) != MP_OKAY){
subfactorial_error = e;
}
return &kill_the_stack;
}
if(n==1){
if ( ( e =  mp_set_int(&kill_the_stack,0) ) != MP_OKAY){
subfactorial_error = e;
}
return &kill_the_stack;
}
if(n==2){
if ( ( e =  mp_set_int(&kill_the_stack,1) ) != MP_OKAY){
subfactorial_error = e;
}
return &kill_the_stack;
}
if ( ( e =  mp_mul_d(__subfactorialrecursive(n-1),n,&kill_the_stack) ) != MP_OKAY){
subfactorial_error = e;
}
/* add (-1)^n; mp_digit is unsigned, so subtract instead of adding -1 */
if(n&01){
if ( ( e =  mp_sub_d(&kill_the_stack,1,&kill_the_stack) ) != MP_OKAY){
subfactorial_error = e;
}
}
else{
if ( ( e =  mp_add_d(&kill_the_stack,1,&kill_the_stack) ) != MP_OKAY){
subfactorial_error = e;
}
}
return &kill_the_stack;
}
static int subfactorial_recursive(unsigned long n, mp_int *result){
mp_init(&kill_the_stack);
subfactorial_error = MP_OKAY;
__subfactorialrecursive(n);
mp_copy(&kill_the_stack,result);
mp_clear(&kill_the_stack);
return subfactorial_error;
}


I left all of the error checks in this time to show one of the problems with recursion: making clear that an error occurred. The deep copy in subfactorial_recursive() is needed to be able to clear the global-ish variable kill_the_stack. Keeping it would cause a small but significant memory leak.

The iterative version is more…uhm…straightforward:

static int subfactorial_iterative(unsigned long n, mp_int *result){
mp_int temp1, temp2;
mp_digit k;
int e;

if(n==0){
return mp_set_int(result,1);
}
if(n==1){
return mp_set_int(result,0);
}
if(n==2){
return mp_set_int(result,1);
}
if ( ( e = mp_init_multi(&temp1, &temp2,NULL) ) != MP_OKAY){
return e;
}
if ( ( e = mp_set_int(&temp1, 0) ) != MP_OKAY){
mp_clear_multi(&temp1, &temp2,NULL);
return e;
}
if ( ( e = mp_set_int(result,1) ) != MP_OKAY){
mp_clear_multi(&temp1, &temp2,NULL);
return e;
}
for(k=3;k<=n;k++){
mp_exch ( &temp1,&temp2 );
mp_exch ( result, &temp1);
if ( ( e = mp_add(&temp1, &temp2,result))  != MP_OKAY){
mp_clear_multi(&temp1, &temp2,NULL);
return e;
}
if ( ( e = mp_mul_d(result,(k-1),result))  != MP_OKAY){
mp_clear_multi(&temp1, &temp2,NULL);
return e;
}
}
mp_clear_multi(&temp1, &temp2,NULL);
return MP_OKAY;
}


The use of libtommath’s mp_set_int() needs a restriction on the size of n. The allowed type is an unsigned long; I restrict it further to the actual bit-size of mp_digit, which is probably way too conservative, especially when those digits are smaller than the default of 28 bits.
The subfactorial of 2^28 is about 1.7862e2,146,019,442, a number with over two billion decimal digits. One needs a different method to calculate it exactly. This one is good up to about 100,000, which already takes a minute on this old computer I have here, more with the iterative version.
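For a quick sanity check of the versions above, the recurrence $!n = (n-1)\left(!(n-1) + !(n-2)\right)$ with $!0 = 1$, $!1 = 0$ fits into an unsigned long for small $n$; subfactorial_small() is a hypothetical helper, not part of the libtommath code:

```c
/* Subfactorial via !n = (n-1)*(!(n-1) + !(n-2)); overflows an
   unsigned long quickly, so this is a small-n sketch only. */
static unsigned long subfactorial_small(unsigned long n)
{
    unsigned long a = 1, b = 0, c, k;   /* a = !(k-2), b = !(k-1) */
    if (n == 0) return 1;
    if (n == 1) return 0;
    for (k = 2; k <= n; k++) {
        c = (k - 1) * (a + b);
        a = b;
        b = c;
    }
    return b;
}
```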

#ifndef MP_DIGIT_SIZE
#define MP_DIGIT_SIZE (1<<DIGIT_BIT)
#endif
int mp_subfactorial(unsigned long n, mp_int *result){
if(n > MP_DIGIT_SIZE ){
return MP_RANGE;
}
#ifdef SAVE_SOME_MEMORY
return subfactorial_iterative(n, result);
#else
return  subfactorial_recursive(n, result);
#endif
}


The code above has been merged into my fork of libtommath over at GitHub.