Adventures of a Programmer: Parser Writing Peril XVI

Back to the experiment with balancing multiplicands, which failed despite the advantage it must have according to the theory. If the margin of this blog post were wider I could even prove that it is better!
*stomps foot*
And it *is* better! All I had done was comment out the wrong line: instead of getting two distinct benchmarks, the second benchmark was the sum of both, which made me think that balancing needs about twice the time of normal multiplication.
Moral of the story: if some results look suspicious, they mostly are. Or would you buy a machine that generates energy for free?

So, how does the real benchmark look?
Well, differently 😉

The method is as described before, but let me talk a bit about the numbers to be tested.
It should be obvious that balancing only makes sense for numbers large enough to pass the cut-off points of the Toom-Cook algorithms (T-C 2 {Karatsuba} and 3 are implemented in libtommath); otherwise it would slow the multiplication down: costly overhead without any effect. The cut-off points differ from architecture to architecture; mine are (in mp_digits): TC2 = 48, TC3 = 190 and, for the very big numbers, FFT = 4000 (which is oversimplified, but I'm working on it).
The tests for the small numbers run 1,000 times each. Time is in seconds.
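The benchmark harness itself is not shown in the post; as a rough sketch of how such a loop might be timed with `clock()` (the names `workload` and `time_loop` are my own placeholders, not the actual test code):

```c
#include <time.h>

/* stands in for the real work, e.g. one mp_mul or mp_balance_mult call */
static void workload(void) { }

/* time `reps` calls of f, result in seconds */
static double time_loop(void (*f)(void), int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        f();
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```
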


Number Pair Normal Multiplication Balanced Multiplication
50 * 100 0.04 0.07
100 * 150 0.14 0.15
100 * 200 0.18 0.19
150 * 300 0.35 0.34
100 * 400 0.37 0.44
200 * 400 0.79 0.61
300 * 400 0.98 0.91
150 * 500 0.62 0.63
250 * 500 1.13 0.81
350 * 500 1.38 1.19
400 * 500 1.35 1.28
450 * 500 1.30 1.30
50 * 600 0.53 0.54
100 * 600 1.00 0.60
150 * 600 1.42 0.75
200 * 600 1.31 1.12
250 * 600 1.44 1.14
300 * 600 1.78 1.37
350 * 600 1.81 1.45
400 * 600 1.83 1.78
450 * 600 1.86 1.57
500 * 600 1.74 1.68
550 * 600 1.62 1.88
50 * 700 0.61 0.63
100 * 700 1.18 1.21
150 * 700 1.69 0.97
200 * 700 2.83 1.39
250 * 700 3.23 1.55
300 * 700 2.16 1.91
350 * 700 2.22 1.44
400 * 700 2.34 1.87
450 * 700 2.35 2.06
500 * 700 2.32 2.27
550 * 700 2.26 2.09
600 * 700 2.68 2.76
650 * 700 2.39 2.53
50 * 800 0.69 0.74
100 * 800 1.36 1.42
150 * 800 1.95 1.91
200 * 800 3.34 1.70
250 * 800 3.84 1.90
300 * 800 4.46 2.35
350 * 800 3.18 2.25
400 * 800 2.85 1.70
450 * 800 2.91 2.22
500 * 800 2.93 2.60
550 * 800 2.88 2.70
600 * 800 3.73 3.07
650 * 800 3.51 3.51
700 * 800 3.16 3.37
750 * 800 2.82 2.96
50 * 900 0.78 0.83
100 * 900 1.54 1.57
150 * 900 2.21 2.15
200 * 900 3.94 3.43
250 * 900 4.48 2.20
300 * 900 5.29 2.64
350 * 900 5.72 2.42
400 * 900 6.00 2.29
450 * 900 6.20 2.01
500 * 900 3.58 2.64
550 * 900 3.55 2.99
600 * 900 4.91 3.56
650 * 900 4.79 3.76
700 * 900 4.33 5.07
750 * 900 4.07 4.21
800 * 900 3.73 4.07
850 * 900 3.68 3.89
50 * 1000 0.87 0.93
100 * 1000 1.72 1.78
150 * 1000 2.50 2.49
200 * 1000 4.35 4.01
250 * 1000 5.12 4.38
300 * 1000 5.99 3.03
350 * 1000 6.55 3.07
400 * 1000 6.96 2.80
450 * 1000 7.25 2.62
500 * 1000 7.44 2.32
550 * 1000 7.57 2.96
600 * 1000 5.86 3.64
650 * 1000 5.74 3.99
700 * 1000 5.55 4.40
750 * 1000 5.35 6.00
800 * 1000 5.08 6.12
850 * 1000 5.08 5.26
900 * 1000 4.87 5.01
950 * 1000 4.28 4.51
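For orientation, the "balanced" column splits the larger multiplicand in two and recombines, per the identity $(a_1\beta + a_0)\,b = (a_1 b)\beta + a_0 b$. A toy native-integer illustration of that identity (my own sketch, not the libtommath code):

```c
/* split a at bit position beta: a = a1*2^beta + a0, so
   a*b = ((a1*b) << beta) + a0*b  -- exact, just regrouped */
unsigned long long balanced_mul(unsigned long long a, unsigned long long b,
                                unsigned beta)
{
    unsigned long long a1 = a >> beta;
    unsigned long long a0 = a & ((1ULL << beta) - 1);
    return ((a1 * b) << beta) + a0 * b;
}
```
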

Some points are more off than others; the reason might lie in the actual numbers, which get produced with a cheap PRNG and are reused over the whole loop. Let me repeat the last round ([50,950] * 1,000) with a different number each time. (Generating thousands of large numbers takes some time, but we can ignore it: it is the same for both.)

Number Pair Normal Multiplication Balanced Multiplication
50 * 1000 5.03 3.87
100 * 1000 5.05 3.84
150 * 1000 4.99 3.85
200 * 1000 4.99 3.86
250 * 1000 5.00 3.89
300 * 1000 5.00 3.85
350 * 1000 5.00 3.86
400 * 1000 5.00 3.92
450 * 1000 4.99 3.85
500 * 1000 5.05 3.87
550 * 1000 4.99 3.89
600 * 1000 5.05 3.86
650 * 1000 4.98 3.88
700 * 1000 5.08 3.86
750 * 1000 5.05 3.88
800 * 1000 4.98 3.84
850 * 1000 5.01 3.94
900 * 1000 5.12 3.86
950 * 1000 5.03 3.85

Oh?
Let’s use libtommath’s own tool mp_rand, too:

Number Pair Normal Multiplication Balanced Multiplication
50 * 1000 8.50 7.34
100 * 1000 8.50 7.37
150 * 1000 8.49 7.38
200 * 1000 8.47 7.35
250 * 1000 8.48 7.34
300 * 1000 8.50 7.39
350 * 1000 8.48 7.36
400 * 1000 8.49 7.35
450 * 1000 8.47 7.30
500 * 1000 8.47 7.33
550 * 1000 8.49 7.37
600 * 1000 8.47 7.33
650 * 1000 8.47 7.32
700 * 1000 8.47 7.34
750 * 1000 8.51 7.34
800 * 1000 8.46 7.36
850 * 1000 8.49 7.37
900 * 1000 8.47 7.33
950 * 1000 8.46 7.38

The function mp_rand is more exact, as it makes sure that the MSD is always different from zero. This gives an interesting effect: the balanced version is even faster when both multiplicands have been ordered to have the same size. So let me get something to read while the following script runs:

```shell
for i in $(seq 50 50 1000); do
   for j in $(seq 50 50 $i); do
      ./testbalancing $i $j
   done
done
```

The last round is the most significant:

Number Pair Normal Multiplication Balanced Multiplication
1000 * 50 1.31 1.32
1000 * 100 1.72 1.67
1000 * 150 2.03 1.90
1000 * 200 2.75 2.30
1000 * 250 2.99 2.35
1000 * 300 3.26 2.60
1000 * 350 3.37 2.66
1000 * 400 3.49 2.79
1000 * 450 3.57 2.98
1000 * 500 3.66 3.31
1000 * 550 3.73 3.63
1000 * 600 4.26 4.20
1000 * 650 4.49 4.41
1000 * 700 4.85 4.86
1000 * 750 5.15 5.18
1000 * 800 5.63 5.40
1000 * 850 6.28 5.88
1000 * 900 6.92 6.30
1000 * 950 7.67 6.80
1000 * 1000 8.47 7.36

As I said at the start of this post: “If some results look suspicious, they mostly are.” So let’s do a check. Mmh…

```c
n = strtoul(argv[1], NULL, 10);
m = strtoul(argv[2], NULL, 10);
/* --- */
for (n = 0; n < 1000; n++) {
   ...
   mp_rand(&a, n);
   mp_rand(&b, m);
}
```

Ouch! The loop counter overwrites the first operand’s size. Now if that’s not embarrassing, I don’t know what is 😉

OK, now on with the real one. The round with one thousand first, to see if the results are reasonable now.

Number Pair Normal Multiplication Balanced Multiplication
50 * 1000 0.92 1.06
100 * 1000 1.77 1.93
150 * 1000 2.40 2.46
200 * 1000 4.64 4.07
250 * 1000 5.14 4.56
300 * 1000 6.38 2.88
350 * 1000 7.08 2.98
400 * 1000 7.42 2.86
450 * 1000 7.21 2.52
500 * 1000 7.36 2.41
550 * 1000 7.56 2.98
600 * 1000 5.94 3.63
650 * 1000 5.91 3.60
700 * 1000 5.76 4.27
750 * 1000 5.44 6.06
800 * 1000 5.21 5.88
850 * 1000 5.46 5.38
900 * 1000 5.19 5.10
950 * 1000 4.41 4.40
1000 * 1000 3.54 3.53

Yepp, that makes more sense; the data supports the theory. There is a jump at about 300*1,000, it increases smoothly (more or less) up to about 700*1,000 and… oh, forgot to switch off the shortcuts.
Aaaaaand again 😉

Number Pair Normal Multiplication Balanced Multiplication
50 * 1000 1.04 0.87
100 * 1000 1.87 2.01
150 * 1000 2.67 2.40
200 * 1000 4.61 3.98
250 * 1000 5.18 4.51
300 * 1000 6.33 3.00
350 * 1000 6.77 2.85
400 * 1000 7.03 2.89
450 * 1000 7.49 2.43
500 * 1000 7.60 2.45
550 * 1000 7.73 2.98
600 * 1000 5.68 3.62
650 * 1000 6.11 3.99
700 * 1000 5.89 4.35
750 * 1000 5.63 5.91
800 * 1000 5.15 6.11
850 * 1000 5.36 5.25
900 * 1000 5.18 5.09
950 * 1000 4.62 4.34
1000 * 1000 4.16 4.29

Nearly the same. There are two peaks where the differences are close to the Toom-Cook cut-off point. I’ll put the full table after the fold, but the conclusion is that this kind of balancing makes most sense for size ratios between about 3/10 and 7/10, and both multiplicands should be larger than the Toom-Cook 3 cut-off.

Adventures of a Programmer: Parser Writing Peril XV

To test the last try at balancing multiplication in libtommath I needed to generate some large numbers. Really large numbers: numbers tens of millions of decimal digits long. Feeding them in via e.g. stdin needs patience and runs into the limits of the argument buffer, so it is more elegant to produce them directly, under the conditions that the bits be more or less uniformly distributed, that generation be fast, and that there be no unwelcome side effects. There is a function in libtommath called mp_rand which produces a pseudo-random integer of a given size, but it does not meet the above conditions. It uses a slow method involving elementary functions like add and shift, but that is negligible. It uses rand(), which has side effects.
The first one is that it is not in the C standard ISO/IEC 9899:2011 but in POSIX (POSIX.1-2001), and the second one is that calls to rand() might be implemented in a cryptographically secure way, so its usage reduces the entropy pool without a good reason when used to generate large numbers for mere testing. The method I used was to take a simple PRNG (pseudo-random number generator) and copy the generated small 32-bit integers directly into the digits.

```c
int make_large_number(mp_int * c, int size)
{
   int e;

   if ((e = mp_grow(c, size)) != MP_OKAY) {
      return e;
   }
   c->used = size;
   while (size--) {
      c->dp[size] = (mp_digit) (light_random() & MP_MASK);
   }
   mp_clamp(c);
   return MP_OKAY;
}
```

Here light_random() is a small generator of the form $x_{n+1} = 48271\,x_n \bmod \left(2^{31}-1\right)$ [[4], based on [3]; see also [2]] (48271 is a primitive root modulo $2^{31}-1$).

```c
#include <stdint.h>

static unsigned int seed = 1;

int light_random(void)
{
   int32_t result;

   result = seed;
   result = 48271 * (result % 44488) - 3399 * (int32_t) (result / 44488);
   if (result < 0) {
      result += ((1U << 31) - 1);
   }
   seed = result;
   return result;
}

void slight_random(unsigned int grain)
{
   /* The seed of the Lehmer PRNG above must be co-prime to the modulus.
      The modulus 2^31-1 is prime (Mersenne prime M_31), so every number
      in [1, 2^31-2] qualifies; only zero does not. */
   seed = (grain) ? grain : 1;
}
```

The method used to compute the remainder is called Schrage’s method [1]. Let me give a short description. Given $x_{n+1} = ax_n \bmod m$, write $m = aq + r$ with $q = \lfloor\frac{m}{a}\rfloor$ and $r = m \bmod a$, such that we can write

{\displaystyle{ \begin{aligned} x_{n+1} &= ax_n \bmod m \\ &= ax_n - \bigg\lfloor\frac{ax_n}{m}\bigg\rfloor m \\ &= ax_n - \bigg\lfloor\frac{ax_n}{aq+r}\bigg\rfloor (aq+r) \end{aligned} } }

I’ll omit the full proof and point to the paper, but will give a short sketch of it.
Expanding the inner fraction

${\displaystyle{ \frac{ax_n}{aq+r} = \frac{x_n}{q}\, \frac{1}{1+\frac{r}{aq}} } }$

and noting the additional conditions $r < a$ and $r < q$, the fraction $\tfrac{r}{aq}$ is much smaller than unity. With the first-order Taylor expansion $\frac{1}{1+\epsilon} \approx 1-\epsilon$ in hand and replacing $\epsilon$ with $\frac{r}{aq}$ we get

{\displaystyle{ \begin{aligned} \frac{ax_n}{aq+r} &= \frac{x_n}{q}\, \frac{1}{1+\frac{r}{aq}} \\ &\approx \frac{x_n}{q} \biggl(1 - \frac{r}{aq}\biggr) \\ &= \frac{x_n}{q} - \frac{x_n}{aq}\cdot \frac{r}{q} \end{aligned} } }

As both $\frac{x_n}{aq}$ and $\frac{x_n}{m}$ are smaller than unity, and $\frac{r}{q}$ is much smaller than unity, we can state that

${\displaystyle{ 0 \le \biggl\lfloor\frac{x_n}{q}\biggr\rfloor - \biggl\lfloor\frac{x_n}{q}\biggl(1-\frac{r}{aq}\biggr)\biggr\rfloor \le 1 }}$

This allows us to conclude that

${\displaystyle{ x_{n+1} = a\left(x_n \bmod q\right) - r \biggl\lfloor\frac{x_n}{q}\biggr\rfloor + \begin{cases} 0 & \quad\text{if}\quad a\left(x_n \bmod q\right) - r \bigl\lfloor\frac{x_n}{q}\bigr\rfloor \ge 0 \\[3ex] m & \quad\text{if}\quad a\left(x_n \bmod q\right) - r \bigl\lfloor\frac{x_n}{q}\bigr\rfloor < 0 \end{cases} }}$

Put in the values and we have the code from above. The period is $2^{31}-2$, which is enough to fill the maximum number of digits, assuming sizeof(int) = 4. With 28-bit digits it is good enough for numbers up to 18,100,795,794 decimal digits.

[1] Schrage, L., A More Portable Fortran Random Number Generator, ACM Trans. Math. Software 5, 132-138, 1979.
[2] Park, Stephen K., Miller, Keith W., and Stockmeyer, Paul K., Another Test for Randomness: Response, Communications of the ACM 36.7 (1993): 108-110.
[3] Lehmer, D. H., Mathematical methods in large-scale computing units, Proceedings of a Second Symposium on Large-Scale Digital Calculating Machinery, 1949, 141-146. Harvard University Press, Cambridge, Mass., 1951.
[4] Park, Stephen K., and Keith W.
Miller, Random number generators: good ones are hard to find, Communications of the ACM 31.10 (1988): 1192-1201.

Adventures of a Programmer: Parser Writing Peril XIV

As I noted in my last post, the big-integer library libtommath lacks a method to balance the multiplicands in size. The method to do it is quite simple and based on the rule

${ \displaystyle{ (a_1\beta+a_0) \cdot b = (a_1b)\beta + a_0\cdot b }}$

where $a, b$ are the multiplicands and $\beta$ is a positive integer multiplier. Take $12345678 \cdot 8765$ as an example:

${ \displaystyle{ (1234\cdot 10^4 + 5678) \cdot 8765 = (1234\cdot 8765)\cdot 10^4 + 5678\cdot 8765 }}$

If we use a binary multiplier instead of the decimal one we can do the multiplication by $\beta$ with simple shifting, and we should use the big-number digits instead of the decimal ones, too, I think.

{ \displaystyle{ \begin{aligned} & \text{\textbf{function}}\;\text{\textit{Balance Multiplication}}\;n\cdot m \\ & \qquad \text{\textbf{Ensure: }}\; \mathop{\mathrm{min}}(n,m) > C_1\\ & \qquad \text{\textbf{Ensure: }}\; \frac{\mathop{\mathrm{min}}(n,m)}{\mathop{\mathrm{max}}(n,m)} < C_2 \\ & \qquad \beta \gets \mathop{\mathrm{length}}\left(\mathop{\mathrm{min}}\left(n,m\right) \right) \\ & \qquad a_1 \gets \bigg\lfloor \frac{\mathop{\mathrm{max}}(n,m)}{2^\beta} \bigg\rfloor \\ & \qquad a_1 \gets a_1\cdot\mathop{\mathrm{min}}\left(n,m\right) \\ & \qquad a_1 \gets a_1\cdot 2^\beta \\ & \qquad a_0 \gets \mathop{\mathrm{max}}\left(n,m\right) \bmod 2^\beta \\ & \qquad a_0 \gets a_0 \cdot \mathop{\mathrm{min}}\left(n,m\right) \\ & \qquad \text{\textbf{Return}}\; a_1 + a_0 \\ \end{aligned} }}

Here $C_1$ denotes the cut-off point marking the minimum size of the smaller multiplicand. This could be as small as 1 (one), but I would take that as a special value where the algorithm used in mp_*_d will show better results (and it should be handled in mp_mul directly). The other cut-off point, $C_2$, is the ratio of $a$ to $b$.
It should be smaller than 1 (one), of course, but how much? $\tfrac{1}{2}$? Or even earlier, at $\tfrac{2}{3}$? Hard to tell without a test, but I think $C_1 = 10$ and $C_2 = \tfrac{2}{3}$ will do for a start. A straightforward implementation could be:

```c
#define MP_C1 2
#define MP_C2 0.5f

int mp_balance_mult(mp_int *a, mp_int *b, mp_int *c)
{
   int count, len_a, len_b;
   mp_int a_0, a_1;

   /* get digit size of a and b */
   len_a = a->used;
   len_b = b->used;

   /* check if the size of the smaller one is larger than C1 */
   if (MIN(len_a, len_b) < MP_C1) {
      mp_mul(a, b, c);
      return MP_OKAY;
   }
   /* check if the sizes differ enough (ratio smaller than C2) */
   if (((float)MIN(len_a, len_b) / (float)MAX(len_a, len_b)) >= MP_C2) {
      mp_mul(a, b, c);
      return MP_OKAY;
   }
   /* make sure that a is the larger one */
   if (len_a < len_b) {
      mp_exch(a, b);
   }
   /* cut the larger one in two parts a_1, a_0 with the lower part a_0
      of the same length as the smaller input number b */
   mp_init_size(&a_0, b->used);
   a_0.used = b->used;
   mp_init_size(&a_1, a->used - b->used);
   a_1.used = a->used - b->used;

   /* fill the lower part a_0 */
   for (count = 0; count < b->used; count++) {
      a_0.dp[count] = a->dp[count];
   }
   /* fill the upper part a_1, with the counter already at the right place */
   for (; count < a->used; count++) {
      a_1.dp[count - b->used] = a->dp[count];
   }
   /* Special offer: Zeeland mussels, only 1.11 EUR/kg! */
   mp_clamp(&a_0);
   mp_clamp(&a_1);

   /* a_1 = a_1 * b */
   mp_mul(&a_1, b, &a_1);
   /* a_1 = a_1 * 2^(length(a_0)) */
   mp_lshd(&a_1, b->used);
   /* a_0 = a_0 * b */
   mp_mul(&a_0, b, &a_0);
   /* c = a_1 + a_0 */
   mp_add(&a_1, &a_0, c);

   /* don't mess with the input more than necessary */
   if (len_a < len_b) {
      mp_exch(a, b);
   }
   return MP_OKAY;
}
```

To make it short: it is slower, needing about twice the time of the native multiplication algorithms on average, tested with two numbers in the ratio $\tfrac{a}{b} = \tfrac{1}{2}$ with $a \le 4\,000\,000\cdot 28\,\text{bits}$; other ratios make it even worse. So we can call it, with good conscience, an utter failure. Back to the blackboard.

Adventures of a Programmer: Parser Writing Peril XIII

The biggest native integer libtommath allows to be set directly seems to be[1] an unsigned long, in the function mp_set_int. The biggest native integer used, on the other side, is hidden behind mp_word, which is the type able to hold twice the size of an mp_digit and can be larger than an unsigned long. For my calculator I need some ways to work with native numbers without much ado, where ado means a lot of conditionals, preprocessor directives, complicated data structures and all that mess. One way to avoid it is to use the digits of the large integers directly if the large integer has only one. An example? OK, an example. Take the partial harmonic series:

${\displaystyle { \mathrm{H}_n = \sum_{k=1}^n \frac{1}{k} }}$

If you calculate it with the help of the binary-splitting algorithm especially, a lot of the numbers involved are in the range of native integers and occupy only one digit of the big numbers. The initialization of big numbers in libtommath sets them to at least 8 digits (the responsible variable is MP_PREC in tommath.h) and consumes costly heap memory to do so. Fredrik Johansson proposed in a blog post to postpone the reducing of the fraction to the very end.
It is not much, but it is something, so let’s follow his advice and do so using my rational library (as mostly: without any error checking, for less code clutter).

```c
static mp_rat h_temp;

mp_rat *_harmonics(unsigned long a, unsigned long b)
{
   unsigned long m;
   mp_rat ta, tb;
   mp_int p, q, r, s;
   mp_word ps, qr;

   if (b - a == 1) {
      mpq_set_int(&h_temp, (long) 1, (long) a);
      return &h_temp;
   }
   m = (a + b) >> 1;
   mpq_init_multi(&ta, &tb, NULL);  /* this is not necessarily necessary */
   mp_init_multi(&p, &q, &r, &s, NULL);

   mpq_exch(_harmonics(a, m), &ta);
   mpq_exch(_harmonics(m, b), &tb);

   mp_exch(&ta.numerator, &p);
   mp_exch(&ta.denominator, &q);
   mp_exch(&tb.numerator, &r);
   mp_exch(&tb.denominator, &s);

   if (p.used == 1 && s.used == 1) {
      ps = p.dp[0] * (mp_word) s.dp[0];
      mp_set_word(&ta.numerator, ps);
   } else {
      mp_mul(&p, &s, &ta.numerator);
   }
   if (q.used == 1 && r.used == 1) {
      qr = q.dp[0] * (mp_word) r.dp[0];
      mp_set_word(&tb.numerator, qr);
   } else {
      mp_mul(&q, &r, &tb.numerator);
   }
   mp_add(&ta.numerator, &tb.numerator, &h_temp.numerator);
   mp_mul(&q, &s, &h_temp.denominator);

   mp_clear_multi(&p, &q, &r, &s, NULL);
   mpq_clear_multi(&ta, &tb, NULL);
   return &h_temp;
}

int harmonics(unsigned long n, mp_rat * c)
{
   mpq_init(&h_temp);
   mpq_exch(_harmonics(1, n + 1), c);
   mpq_reduce(c);
   mpq_clear(&h_temp);
   return 1;
}
```

The library libtommath is not very friendly when used in such a way, but I cannot find a better way to implement the binary-splitting algorithm. This implementation of the partial harmonic series, for example, is much slower than Fredrik’s implementation with gmpy (but faster than the native Python one, at least 😉). It takes about 0.67 seconds for $\mathop{\mathrm{H}}_{10\,000}$ but already 193.61 seconds (yes, over 3 minutes!) for $\mathop{\mathrm{H}}_{100\,000}$. That is definitely too much.
Funny thing: the normal algorithm is much faster there, just 40 seconds for $\mathop{\mathrm{H}}_{100\,000}$, but slower for smaller values, at about 1.09 seconds for $\mathop{\mathrm{H}}_{10\,000}$, with the cut-off point at about $\mathop{\mathrm{H}}_{21\,000}$ on my machine. And it is asymptotically slower; I measured some points in between to find out. It is really time to implement fast multiplication in full in libtommath. Some of the problems with the first algorithm might have their reason in the nonexistent balancing of the multiplicands. There is a balancing function in the pull queue, but it seems to be a port from Ruby, which makes it impossible to accept because of Ruby’s license (given that the submitter is not the original programmer of the Ruby code, which I haven’t checked).

```c
static mp_rat h_temp;

mp_rat *_harmonics(unsigned long a, unsigned long b)
{
   unsigned long m;
   mp_rat ta, tb;

   if (b - a == 1) {
      mpq_set_int(&h_temp, (long) 1, (long) a);
      return &h_temp;
   }
   m = (a + b) >> 1;
   mpq_init_multi(&ta, &tb, NULL);
   mpq_exch(_harmonics(a, m), &ta);
   mpq_exch(_harmonics(m, b), &tb);
   mpq_add(&ta, &tb, &h_temp);
   mpq_clear_multi(&ta, &tb, NULL);
   return &h_temp;
}
/* use the same caller as above */
```

However, it was just meant to be used as an example for mp_set_word, which I’ll present here:

```c
#if (MP_PREC > 1)
int mp_set_word(mp_int *c, mp_word w)
{
   mp_zero(c);
   if (w == 0) {
      return MP_OKAY;
   }
   do {
      c->dp[c->used++] = (mp_digit) (w & MP_MASK);
   } while ((w >>= DIGIT_BIT) > 0 && c->used < MP_PREC);
   if (w != 0) {
      return MP_VAL;
   }
   return MP_OKAY;
}
#else
#warning variable "MP_PREC" should be at least 2
#endif
```

The preprocessor mess is necessary even though the constant MP_PREC should be at least 8 (eight) but, as I can tell you from some very bad experience, one never knows.

[1] The reason for the stylistically awkward subjunctive is: I mean it. I really could have overlooked an already implemented function doing exactly what I wanted in the first place, hence it is not a case of NIH.
This time 😉

On the numerical evaluation of factorials XIII

Rising Factorial

We have

${\displaystyle{ x^{(n)} = {(x + n - 1)}_n }}$

That gives:

```c
int mp_rising_factorial(unsigned long n, unsigned long k, mp_int * c)
{
   int e;
   /* n^(k) = (n + k - 1)_k */
   if ((e = mp_falling_factorial(n + k - 1, k, c)) != MP_OKAY) {
      return e;
   }
   return MP_OKAY;
}
```

hth, bye 😉

On the numerical evaluation of factorials XII

Falling Factorial

The algorithm for the falling factorial $(x)_n$ shamelessly makes use of $\frac{(x)_n}{n!} = \binom xn$, which can be written in terms of factorials as $(x)_n = \frac{x!}{n!\,(x-n)!}\,n! = \frac{x!}{(x-n)!}$, which is our algorithm for the binomial coefficients without the first loop. Oh, that was simple!

```c
#ifndef LN_113
#  define LN_113 1.25505871293247979696870747618124469168920275806274
#endif
#include <math.h>

int mp_falling_factorial(unsigned long n, unsigned long k, mp_int * c)
{
   unsigned long *prime_list;
   unsigned long pix = 0, prime, K, diff;
   mp_bitset_t *bst;
   int e;

   if (n < k) {
      mp_set(c, 0);
      return MP_OKAY;
   }
   if (k == 0) {
      mp_set(c, 1);
      return MP_OKAY;
   }
   if (k == 1) {
      if ((e = mp_set_int(c, n)) != MP_OKAY) {
         return e;
      }
      return MP_OKAY;
   }
   if (k == n) {
      if ((e = mp_factorial(n, c)) != MP_OKAY) {
         return e;
      }
      return MP_OKAY;
   }
   bst = malloc(sizeof(mp_bitset_t));
   if (bst == NULL) {
      return MP_MEM;
   }
   mp_bitset_alloc(bst, n + 1);
   mp_eratosthenes(bst);
   /* one could also count the number of primes in the already filled sieve */
   pix = (unsigned long) (LN_113 * n / log(n)) + 2;

   prime_list = malloc(sizeof(unsigned long) * (pix) * 2);
   if (prime_list == NULL) {
      return MP_MEM;
   }
   prime = 2;
   K = 0;
   do {
      diff = mp_prime_divisors(n, prime) - mp_prime_divisors(n - k, prime);
      if (diff != 0) {
         prime_list[K] = prime;
         prime_list[K + 1] = diff;
         K += 2;
      }
      prime = mp_bitset_nextset(bst, prime + 1);
   } while (prime <= n - k);

   do {
      prime_list[K] = prime;
      prime_list[K + 1] = mp_prime_divisors(n, prime);
      prime = mp_bitset_nextset(bst, prime + 1);
      K += 2;
   } while (prime <= n);

   prime_list = realloc(prime_list, sizeof(unsigned long) * K);
   if (prime_list == NULL) {
      return MP_MEM;
   }
   if ((e = mp_compute_factored_factorial(prime_list, K - 1, c, 0)) != MP_OKAY) {
      return e;
   }
   free(bst);
   free(prime_list);
   return MP_OKAY;
}
```

But if you think that was simple, wait for the next post 😉

On the numerical evaluation of factorials XI

Superfactorial

The superfactorial is, according to Sloane and Plouffe,

${\displaystyle{ \mathop{\mathrm{sf}}\left(n\right)= \prod_{k=1}^n k! }}$

Another definition is from Clifford A. Pickover’s “Keys to Infinity” (Wiley, 1995, ISBN-10: 0471118575, ISBN-13: 978-0471118572; I saw it at Amazon offered by different sellers at between \$2.57 for a used copy and \$6.89 for new ones. I just ordered one and will let you know if it is worth the money). The notation with the arrows is Donald Knuth’s up-arrow notation.

${\displaystyle{ n\$ = (n!)\uparrow\uparrow(n!) }}$

This post is about the first definition. The superfactorial is also equivalent to the following product of powers.

${\displaystyle{ \mathop{\mathrm{sf}}(n)=\prod_{k=1}^n k^{n-k+1}}}$

It is possible to multiply such a product in a fast way with the help of nested squaring, as Borwein and Schönhage have shown. It was also the basis for the fast algorithm for factorials described in this series of postings.
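The nested-squaring idea, stripped down to native integers: per exponent bit, square the accumulator once and multiply in exactly the bases whose exponent has that bit set. A toy sketch of my own (the real thing works on big integers, of course):

```c
#include <stddef.h>

/* computes prod_k base[k]^exp[k] with one squaring per bit level */
unsigned long long pow_product(const unsigned long long *base,
                               const unsigned *exp, size_t n)
{
    unsigned maxbit = 0;
    for (size_t k = 0; k < n; k++) {
        while (exp[k] >> (maxbit + 1)) {
            maxbit++;
        }
    }
    unsigned long long r = 1;
    for (int b = (int)maxbit; b >= 0; b--) {
        r *= r;                       /* one squaring per level */
        for (size_t k = 0; k < n; k++) {
            if ((exp[k] >> b) & 1u) {
                r *= base[k];         /* linear multiplications */
            }
        }
    }
    return r;
}

/* sf(5) = 2^4 * 3^3 * 4^2 * 5^1 = 34560 */
unsigned long long sf5_demo(void)
{
    const unsigned long long base[] = {2, 3, 4, 5};
    const unsigned exp[] = {4, 3, 2, 1};
    return pow_product(base, exp, 4);
}
```
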

The original product runs over the natural numbers up to and including $n$. It can be made faster by restricting it to the primes; this works, roughly, by making the exponents bigger, which gives a chance for more squarings while lowering the number of intermediate linear multiplications.

The question is: What is faster, factorizing every single power of the product or factorizing every single factorial?

Factorizing every power of the product reduces to factorizing every base, which reduces to all bases without the primes. The time complexity of Eratosthenes’ sieve is $\mathop{\mathrm{O}}(n \log\log n)$, which can be called, without much loss of accuracy, linear. With this sieve (there are faster ones, but we need the full sequence, which makes it difficult to implement the following algorithm with other sieves) we get in one wash the prime factors of every number. The prime factors, not the exponents of those prime factors (i.e., the only information we get is whether the exponent is bigger than zero).

An example, factorizing the numbers up to $10$, should make this clearer.

${\displaystyle{ \begin{matrix} 7 & & & & & & x & & & \\ 5 & & & & x & & & & & x \\ 3 & & x & & & x & & & x & \\ 2 & x & & x & & x & & x & & x \\ / & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \end{matrix} }}$

Once we have this result we are bound to trial division to find the exponents[1]. That is costly but how much does it cost at the end?
We have to do at most $\lfloor\log_2 n \rfloor$ trial divisions (the biggest exponent possible is reserved for 2) with at most $\mathop{\pi}\left(\lfloor\sqrt{n} \rfloor\right)$ primes (e.g.: $2\cdot 3\cdot 5 = 30$ but $5^2 = 25$). The same has to be done for the exponents[1] of the powers of the superfactorial whose bases are not prime, of which there are all in all $n-\pi(n)$.
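In native integers, the two counting steps look like this: trial division for the exponent of a prime in a single base and, for the factorials, Legendre's formula (which is presumably what mp_prime_divisors in the code above computes; that name is from the source, these sketches are mine):

```c
/* exponent of prime p in the single number n, by trial division */
unsigned trial_exponent(unsigned long n, unsigned long p)
{
    unsigned e = 0;
    while (n % p == 0) {
        n /= p;
        e++;
    }
    return e;
}

/* exponent of prime p in n!: sum of floor(n/p^i), Legendre's formula */
unsigned long legendre_exponent(unsigned long n, unsigned long p)
{
    unsigned long e = 0;
    while (n > 0) {
        n /= p;
        e += n;
    }
    return e;
}
```
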

Assuming Riemann…no, we don’t need to do that 😉

Because all of the trial divisions are in fact equivalent to taking an nth root, which can be done in $\mathop{\mathrm{O}}(\mathop{\mathrm{M}}(m)\log m)$, it is quite fast. In theory.

Calculating the individual factorials, on the other side, would need $\mathop{\mathrm{O}}(\mathop{\mathrm{M}}(m)\log m)$ time, but that includes the final multiplication, too.
Now, how much does this cost?
The number of digits of $\mathop{\mathrm{sf}}(m)$ is proportional to $\sum_{k=1}^m k\log k$, which is the logarithm of the hyperfactorial $\mathop{\mathrm{hf}}(m) = \prod_{k=1}^m k^k$ and can be approximated with $\log \left( \mathop{\mathrm{hf}}\left( m\right)\right) \sim 0.5\, m^2 \log m$. Putting this into our time complexity we can solemnly conclude: this is the largest part of the work and the rest pales in comparison.

It also means that we can just choose the simplest way to do it, and that’s what we do. We have implemented some methods before to do simple arithmetic with lists of prime factors, to find the prime factors of a factorial, and to compute the final result.
Let’s put it all together:

```c
int superfactorial(unsigned long n, mp_int * c)
{
   long *t, *s;
   unsigned long i, length_t, length_s;
   int e;

   if (n < 2) {
      mp_set(c, 1);
      return MP_OKAY;
   }
   if (n == 2) {
      mp_set(c, 2);
      return MP_OKAY;
   }
   if ((e = mp_factor_factorial(2, 0, &s, &length_s)) != MP_OKAY) {
      return e;
   }
   for (i = 3; i < n; i++) {
      if ((e = mp_factor_factorial(i, 1, &t, &length_t)) != MP_OKAY) {
         return e;
      }
      if ((e =
                  &length_s)) != MP_OKAY) {
         return e;
      }
      free(t);
      t = NULL;
      length_t = 0;
   }
   if ((e =
          mp_compute_signed_factored_factorials(s, length_s, c,
                                                NULL)) != MP_OKAY) {
      return e;
   }
   return MP_OKAY;
}
```
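For cross-checking the factored version, a naive native-integer superfactorial is handy for very small n (it overflows quickly, but sf(5) = 34560 is already a useful test value); this helper is my own, not part of the library:

```c
/* naive superfactorial: sf(n) = 1! * 2! * ... * n!  (native ints, small n only) */
unsigned long long sf_naive(unsigned n)
{
    unsigned long long fact = 1, sf = 1;
    for (unsigned k = 1; k <= n; k++) {
        fact *= k;   /* k!                              */
        sf *= fact;  /* running product of factorials   */
    }
    return sf;
}
```
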


The relevant functions are in my fork of libtommath.

A first test against the naïve implementation, admittedly quite unfair, gave a time difference of 1:300 for $\mathop{\mathrm{sf}}(1\,000)$. But more elaborate algorithms do not give much back for such small numbers; maybe binary splitting could gain something by making the numbers to multiply more balanced in size.
With $\mathop{\mathrm{sf}}(1\,000) \sim 3.2457\mathrm{\mathbf{e}}1\,177\,245$ the cut-off for FFT multiplication is reached.

[1] The trial division to find the exponents is actually integer root finding, as you might have guessed already, which can be done faster than plain trial division, much faster. With $\mathop{\mathrm{M}}(m)$ the time to multiply two m-digit integers, the time complexity for the nth root is $\mathop{\mathrm{O}}(\mathop{\mathrm{M}}(m)\log m)$ with the AGM.
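For native widths even a plain binary search finds $\lfloor x^{1/n}\rfloor$; a simple stand-in sketch of mine for the much faster Newton/AGM approach mentioned in the footnote:

```c
/* does b^n exceed x? early exit keeps the product from overflowing */
static int pow_exceeds(unsigned long long b, unsigned n, unsigned long long x)
{
    unsigned long long p = 1;
    while (n--) {
        if (b != 0 && p > x / b) {
            return 1;
        }
        p *= b;
    }
    return p > x;
}

/* floor(x^(1/n)): binary search for the largest r with r^n <= x */
unsigned long long iroot(unsigned long long x, unsigned n)
{
    unsigned long long lo = 0, hi = x;
    while (lo < hi) {
        unsigned long long mid = lo + (hi - lo + 1) / 2;
        if (pow_exceeds(mid, n, x)) {
            hi = mid - 1;
        } else {
            lo = mid;
        }
    }
    return lo;
}
```
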

On the numerical evaluation of factorials X

The last post in this sequence was about a more general way to work with the prime factorization of factorials and about the Narayana numbers:

${\displaystyle{ N(n,k) = \frac{1}{n}{ \binom nk}{\binom n{k-1}} }}$

Which expands to

${\displaystyle{ {{n!^2}\over{\left(k-1\right)!\,k!\,n\,\left(n-k\right)!\,\left(n-k +1\right)!}} }}$

Using $(n+1)! = (n+1)\,n!$ and, vice versa, $(n-1)! = \frac{n!}{n}$ we can shove more things together and get

${\displaystyle{ \frac{k n!^2}{k!^2 n \left(n-k+1\right) \left(n-k\right)!^2} }}$

Get the non-factorials out of the way

${\displaystyle{ \frac{k}{n \left(n-k+1\right)}\; \frac{n!^2}{k!^2 \left(n-k\right)!^2} }}$

With $P_n$ the prime factorization of $n!$, and with their arithmetic resembling the relation between the arithmetic of numbers and the arithmetic of their logarithms (e.g. $x\cdot y = \exp\left(\log\left(x\right)+\log\left(y\right) \right)$), we get (dropping the fraction with the non-factorials)

{\displaystyle{ \begin{aligned} \phantom{\Leftrightarrow\;} &2P_n - \left( 2P_k + 2P_{n-k} \right) \\ \Leftrightarrow\; &2P_n - 2P_k - 2P_{n-k} \\ \Leftrightarrow\; &2\left(P_n - P_k - P_{n-k} \right) \end{aligned} }}

We have to be careful with the term $n \left(n-k+1\right)$, for it can overflow, so we should divide the prime factorization step by step instead of multiplying $n \left(n-k+1\right)$ out and dividing by it all at once.

The actual implementation is left as an exercise for the reader.
Or just wait until I do it myself in the next days 😉
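Until then, a small native-integer sketch of the first formula $N(n,k) = \tfrac{1}{n}\binom{n}{k}\binom{n}{k-1}$, usable as a reference (my own helper, not the factored-prime version discussed above):

```c
/* binomial coefficient via the multiplicative formula;
   the division is exact at every step */
static unsigned long long binom(unsigned n, unsigned k)
{
    if (k > n) {
        return 0;
    }
    if (k > n - k) {
        k = n - k;  /* use the symmetry C(n,k) = C(n,n-k) */
    }
    unsigned long long r = 1;
    for (unsigned i = 1; i <= k; i++) {
        r = r * (n - k + i) / i;
    }
    return r;
}

/* Narayana number N(n,k) = (1/n) * C(n,k) * C(n,k-1) */
unsigned long long narayana(unsigned n, unsigned k)
{
    return binom(n, k) * binom(n, k - 1) / n;
}
```
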

Adventures of a Programmer: Parser Writing Peril XII

Last time we talked about and implemented a fast algorithm for Fibonacci numbers based on an optimized version of the matrix exponentiation algorithm: optimized only in the sense that it stripped the redundant parts of the matrix algorithm, taking advantage of the fact that the entry $a_{2,2}$ is of no use at all and the entries $a_{1,2}$ and $a_{2,1}$ are equal, so half of the computation can be skipped. That means the original algorithm should need about twice the running time, all other things being equal. Let’s see.

The family of functions with such a recursive definition as the Fibonacci and the Lucas numbers is small but not that small. It includes the Pell numbers $P_n$, too, with their near relatives the Pell-Lucas numbers $Q_n$ and the Modified Pell numbers (PDF) $q_n$, the former also known as companion Pell numbers with the latter two having the relation $Q_n = 2q_n$.

The algorithm itself is very simple: exponentiate by repeated squaring, just like the ordinary binary exponentiation algorithm for integer exponents.

\begin{aligned} & \text{\textbf{function}}\;\text{\textit{Binary Exponentiation}}\;x^n\\ & \qquad \text{\textbf{ensure: }}\;\neg\left(x = 0 \wedge n = 0 \right) \\ & \qquad \text{\textbf{ensure: }}\;n\in\mathbb{N}\\ & \qquad r \gets 1 \\ & \qquad \text{\textbf{while}}\; n \not= 0 \\ & \qquad \qquad \text{\textbf{if}}\; \mathop{\mathrm{odd}} n \\ & \qquad \qquad \qquad r \gets r\cdot x \\ & \qquad \qquad \qquad n \gets n -1 \\ & \qquad \qquad \text{\textbf{endif}} \\ & \qquad \qquad x \gets x^2 \\ & \qquad \qquad n \gets \biggl\lfloor\frac{n}{2}\biggr\rfloor \\ & \qquad \text{\textbf{end while}} \\& \qquad \text{\textbf{return}}\; r \\ & \text{\textbf{end function}} \end{aligned}
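For native integers the pseudocode above boils down to a few lines; the big-number version follows the same skeleton:

```c
/* binary exponentiation: x^n by repeated squaring */
unsigned long long ipow(unsigned long long x, unsigned n)
{
    unsigned long long r = 1;
    while (n != 0) {
        if (n & 1) {
            r *= x;   /* odd exponent: pull one factor out */
        }
        x *= x;       /* square */
        n >>= 1;      /* halve the exponent */
    }
    return r;
}
```
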

We should put the starting points in an extra matrix and use a temporary matrix to keep things simple and legible, too. The last thing we need to know is how to multiply two square 2×2 matrices, and before you try to remember in which bonfire you put your high-school math book, or wonder whether it makes any sense at all to start searching Wikipedia, I'll show you.

${\displaystyle{ \begin{bmatrix}a & b\\ c & d\end{bmatrix} \begin{bmatrix}e & f\\ g & h\end{bmatrix} = \begin{bmatrix}ae+bg &af+bh\\ ce+dg& cf+dh\end{bmatrix} }}$

Squaring is the same method and results in two elements getting squared, a place for later optimization.

${\displaystyle{ \begin{bmatrix}a & b\\ c & d\end{bmatrix}^2 = \begin{bmatrix}b c+a^2&b d+a\,b\\ c d+a c&d^2+b c\end{bmatrix} }}$

We don’t do any optimization, we just loop over the elements of the matrices in the following way:

int mp_pell(unsigned long n, mp_int *r){
   int i, j, k;
   mp_int a,b,c,d,e,f,g,h,t1,t2,t3,t4,t5;
   mp_init_multi(&a,&b,&c,&d,&e,&f,&g,&h,&t1,&t2,&t3,&t4,&t5,NULL);

   mp_set(&a,2); /* set to {1,1,1,0} to get Fibonacci/Lucas numbers */
   mp_set(&b,1);
   mp_set(&c,1);
   mp_set(&d,0);

   mp_set(&e,1); /* set to {1,2,2,1} to get Lucas numbers */
   mp_set(&f,0); /* set to {2,2,2,2} to get Pell-Lucas numbers */
   mp_set(&g,0); /* set to {1,1,1,1} to get Modified Pell numbers */
   mp_set(&h,1);

   /* Careful: the initializers copy the mp_int structs, so the matrix
      elements share their digit arrays with a..h and t1..t4 */
   mp_int pell[2][2] = { { a , b } , { c , d } };
   mp_int ret[2][2]  = { { e , f } , { g , h } };
   mp_int tmp[2][2]  = { { t1,t2 } , { t3,t4 } };

   while(n){
      if(n&1){
         /* ret = ret * pell */
         for(i=0; i<2; i++)
            for(j=0; j<2; j++){
               mp_set(&tmp[i][j],0);
            }
         for(i=0; i<2; i++)
            for(j=0; j<2; j++)
               for(k=0; k<2; k++){
                  mp_mul(&ret[i][k],&pell[k][j],&t5);
                  mp_add(&tmp[i][j],&t5,&tmp[i][j]);
               }
         for(i=0; i<2; i++)
            for(j=0; j<2; j++){
               mp_copy(&tmp[i][j],&ret[i][j]);
            }
      }
      /* pell = pell^2 */
      for(i=0; i<2; i++)
         for(j=0; j<2; j++){
            mp_set(&tmp[i][j],0);
         }
      for(i=0; i<2; i++)
         for(j=0; j<2; j++)
            for(k=0; k<2; k++){
               mp_mul(&pell[i][k],&pell[k][j],&t5);
               mp_add(&tmp[i][j],&t5,&tmp[i][j]);
            }
      for(i=0; i<2; i++)
         for(j=0; j<2; j++){
            mp_copy(&tmp[i][j],&pell[i][j]);
         }
      n/=2;
   }
   mp_copy(&ret[0][1],r);
   /*
      Because the matrix initializers copied the structs, a..h and t1..t4
      alias the matrix elements; clearing both sets with mp_clear_multi()
      double-frees and aborts with "munmap_chunk(): invalid pointer", so
      we clear the matrix elements (and the lone t5) only.
   */
   for(i=0; i<2; i++)
      for(j=0; j<2; j++){
         mp_clear(&tmp[i][j]);
         mp_clear(&ret[i][j]);
         mp_clear(&pell[i][j]);
      }
   mp_clear(&t5);
   return MP_OKAY;
}


I tried hard (no, that's a lie ;-) ) to make it simpler but to no avail.

The necessary formulas for optimization are:

{\displaystyle { \begin{aligned} P_{2n+1} &= P_{n + 1}^2 + P_n^2 \\ P_{2n} & = P_n \left(P_{n + 1} + P_{n - 1}\right) = 2 P_n \left(P_{n+1} - P_n\right) \end{aligned} }}

I put some notes in the code showing how to make it compute the Fibonacci and Lucas numbers respectively. I tried it with the Fibonacci numbers and found that this code has a runtime slightly more than twice as long as the runtime of the optimized version.
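The doubling identities $P_{2n} = 2P_n(P_{n+1} - P_n)$ and $P_{2n+1} = P_n^2 + P_{n+1}^2$ also admit a recursive "fast doubling" formulation without any matrix at all. A rough machine-integer sketch (pell_pair and pell are names invented for this example; a big-integer version would use mp_int throughout):

```c
#include <assert.h>

/* Fast doubling for Pell numbers: computes the pair (P(n), P(n+1))
   from (P(n/2), P(n/2 + 1)) with the identities
   P(2m)   = 2*P(m)*(P(m+1) - P(m))
   P(2m+1) = P(m)^2 + P(m+1)^2                                     */
static void pell_pair(unsigned long n, unsigned long *pn, unsigned long *pn1)
{
    unsigned long a, b, even, odd;
    if (n == 0) {
        *pn  = 0;   /* P(0) */
        *pn1 = 1;   /* P(1) */
        return;
    }
    pell_pair(n / 2, &a, &b);      /* a = P(n/2), b = P(n/2 + 1) */
    even = 2 * a * (b - a);        /* P(2*(n/2))                 */
    odd  = a * a + b * b;          /* P(2*(n/2) + 1)             */
    if (n & 1) {
        *pn  = odd;
        *pn1 = 2 * odd + even;     /* P(n+1) = 2*P(n) + P(n-1)   */
    } else {
        *pn  = even;
        *pn1 = odd;
    }
}

unsigned long pell(unsigned long n)
{
    unsigned long pn, pn1;
    pell_pair(n, &pn, &pn1);
    return pn;
}
```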

The above code has been generalized and merged as mp_fpl_matrix into my fork of libtommath. The functions mp_pell, mp_pell_lucas, and mp_pell_modified respectively call the function mp_fpl_matrix with the correct values.

The pseudocode has been typeset with the following LaTeX code (you have to strip the newlines and be gracious with whitespace to make it work in WordPress). Don't forget to put it between the WordPress parentheses for LaTeX: a dollar sign followed immediately by the word latex, followed by whitespace, the actual LaTeX code, whitespace, and another dollar sign to end it.

\begin{aligned}
& \text{\textbf{function}}\;\text{\textit{Binary Exponentiation}}\;x^n    \\
& \qquad \text{\textbf{ensure: }}\;\neg\left(x = 0 \wedge n = 0 \right)   \\
& \qquad \text{\textbf{ensure: }}\;n\in\mathbb{N}                         \\
& \qquad r \gets 1                                                        \\
& \qquad \text{\textbf{while}}\; n \not= 0                                \\
& \qquad \qquad \text{\textbf{if}}\; \mathop{\mathrm{odd}} n              \\
& \qquad \qquad \qquad r \gets r\cdot x                                   \\
& \qquad \qquad \qquad n \gets n -1                                       \\
& \qquad \qquad \text{\textbf{endif}}                                     \\
& \qquad \qquad x \gets x^2                                               \\
& \qquad \qquad n \gets \biggl\lfloor\frac{n}{2}\biggr\rfloor             \\
& \qquad \text{\textbf{end while}}                                        \\
& \qquad \text{\textbf{return}}\; r                                       \\
& \text{\textbf{end function}}
\end{aligned}


On the numerical evaluation of factorials IX

Number nine, who would have thought?!

As is well known and laid down elsewhere, even by myself in a different post, it is quite easy to multiply and divide integers when their prime factorization is known. Finding that factorization is infeasible for almost all integers, but it can be done very fast for factorials.
We have implemented algorithms to compute the binomial coefficient and the Catalan numbers, but we did all of it inline and not as a generally applicable function. Such a generally applicable function needs to be able to handle negative exponents, for example, and we are going to implement one now.
Why now? Because of the rule "one, two, many", and we have used it three times already. The usage in the factorial function is excused for reasons of speed, so we are back down to two if we ignore the double factorial. The next function I want to implement computes the so-called Narayana numbers:

${\displaystyle{ N(n,k) = \frac{1}{n}{ \binom nk}{\binom n{k-1}} }}$

Which can be written with factorials as

${\displaystyle{ \frac{1}{n} \; {{n!}\over{k!\,\left(n-k\right)!}} \;{{n!}\over{\left(k-1\right)!\,\left(n-k+1\right)!}} }}$

Shoving all together

${\displaystyle{ {{n!^2}\over{\left(k-1\right)!\,k!\,n\,\left(n-k\right)!\,\left(n-k +1\right)!}} }}$

It is most probably faster to use the first formula with the binomials, but it is a good example with many factorials to multiply and divide, and it even uses exponentiation (with an integer).
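For testing whatever big-integer routine comes out of this, a small plain-integer reference is handy; binom and narayana below are hypothetical helpers that are exact only while everything fits into an unsigned long:

```c
#include <assert.h>

/* Binomial coefficient; the division in every step is exact because the
   running product after step i is C(n-k+i, i), an integer */
unsigned long binom(unsigned long n, unsigned long k)
{
    unsigned long r = 1, i;
    if (k > n)
        return 0;
    for (i = 1; i <= k; i++) {
        r = r * (n - k + i) / i;
    }
    return r;
}

/* N(n,k) = (1/n) * C(n,k) * C(n,k-1); the division by n is exact */
unsigned long narayana(unsigned long n, unsigned long k)
{
    return binom(n, k) * binom(n, k - 1) / n;
}
```

For example $N(4,2) = \frac{1}{4}\binom{4}{2}\binom{4}{1} = \frac{6 \cdot 4}{4} = 6$.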

The first function is the one we need to decompose the factorial. It is not much different from the one in the factorial or binomial function but I took the chance to clean it up a little bit.

#include <tommath.h>
#include <math.h>
#include <limits.h>
#include <stdlib.h>
#ifndef LN_113
#define LN_113 1.25505871293247979696870747618124469168920275806274
#endif
int mp_factor_factorial(unsigned long n, unsigned long start,
                        long **prime_list, unsigned long *length)
{
   unsigned long pix = 0, prime, K;
   long *tmp;
   mp_bitset_t *bst;

   /* To be able to handle negative values in subtraction and to be sure
      that n fits into a long */
   if (n > LONG_MAX) {
      return MP_VAL;
   }

   bst = malloc(sizeof(mp_bitset_t));
   if (bst == NULL) {
      return MP_MEM;
   }
   mp_bitset_alloc(bst, n + 1);
   mp_eratosthenes(bst);
   /* upper bound for pi(n), the number of primes below n */
   pix = (unsigned long)(LN_113 * n / log(n));

   *prime_list = malloc(sizeof(long) * pix * 2);
   if (*prime_list == NULL) {
      return MP_MEM;
   }
   prime = (start <= 2) ? 2 : start;
   K = 0;
   do {
      (*prime_list)[K]     = (long)prime;
      (*prime_list)[K + 1] = (long)mp_prime_divisors(n, prime);
      prime                = mp_bitset_nextset(bst, prime + 1);
      K += 2;
   } while (prime <= n);
   /* shrink the list to the size actually used; keep the old pointer
      around in case realloc() fails */
   tmp = realloc(*prime_list, sizeof(long) * K);
   if (tmp == NULL) {
      free(*prime_list);
      return MP_MEM;
   }
   *prime_list = tmp;
   free(bst);
   *length = K;
   return MP_OKAY;
}


The library we use is libtommath with my extensions available at my Github account and spread over this blog.
The function itself has n’t changed: make list of primes, walk over the primes and calculate for every prime the factorization. The result is put in a list in order and with the exponent following the prime, starting with the prime 2.
The mess of different types is due to a desperate try to keep all of the stuff compatible.

What next? Subtraction, OK.

These lists of primes and their exponents are nothing but simple polynomials, and subtraction can be done likewise. One little exception: the polynomials are multivariate. That means that we need all primes in a consecutive row, and we need as many as the larger of subtrahend and minuend has.
It is a good idea to save some memory and skip any primes whose exponent is zero in the result of the subtraction, but only as long as the result doesn't get squeezed through the function another time. The rule of consecutiveness is broken then, so we have to check for such blank spots and fill them in if necessary.
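For what it's worth, the gap bookkeeping disappears entirely if the subtraction is organized like the merge step of merge sort: walk both sorted lists with independent indices and treat a missing prime as exponent zero. A plain-C sketch (factored_sub is a made-up helper, independent of the libtommath code; it keeps zero exponents, so the output stays consecutive):

```c
#include <assert.h>

/* Merge-style subtraction of two factored lists: both inputs are sorted
   {prime, exponent, prime, exponent, ...} arrays; a prime missing from
   one list counts as exponent zero. Returns the number of longs written
   to diff, which must be large enough to hold both inputs combined. */
unsigned long factored_sub(const long *s, unsigned long ls,
                           const long *m, unsigned long lm,
                           long *diff)
{
    unsigned long i = 0, j = 0, c = 0;
    while (i < ls || j < lm) {
        if (j >= lm || (i < ls && s[i] < m[j])) {
            diff[c++] = s[i];             /* prime only in subtrahend */
            diff[c++] = s[i + 1];
            i += 2;
        } else if (i >= ls || s[i] > m[j]) {
            diff[c++] = m[j];             /* prime only in minuend */
            diff[c++] = -m[j + 1];
            j += 2;
        } else {                          /* same prime in both */
            diff[c++] = s[i];
            diff[c++] = s[i + 1] - m[j + 1];
            i += 2;
            j += 2;
        }
    }
    return c;
}
```

The overflow checks from the functions below would slot into the last two branches; they are left out here to keep the merge structure visible.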

The version without this checks first:

int mp_subtract_factored_factorials(long *subtrahend,
                                    unsigned long l_subtrahend,
                                    long *minuend,
                                    unsigned long l_minuend,
                                    long **difference,
                                    unsigned long /* Vive */ *l_difference /*!*/)
{
   unsigned long k, counter = 0;
   unsigned long max_length = MAX(l_subtrahend, l_minuend);
   unsigned long min_length = MIN(l_subtrahend, l_minuend);
   long d, p, d_1, d_2;
   /* TODO: check for sizes > 0 here */

   /* Things work a bit differently from ordinary arithmetic from now on */
   *difference = malloc(sizeof(long) * max_length);
   if (*difference == NULL) {
      return MP_MEM;
   }

   /* Loop over the smaller chunk first */
   for (k = 0; k < min_length; k += 2) {
      /* both lists hold the same primes here, so take it from either */
      p = subtrahend[k];
      /* d = subtrahend[k+1] - minuend[k+1]; */
      d_1 = subtrahend[k + 1];
      d_2 = minuend[k + 1];
      /* handle over/underflow */
      if ((d_2 > 0) && (d_1 < LONG_MIN + d_2)) {
         free(*difference);
         return MP_VAL;
      }
      if ((d_2 < 0) && (d_1 > LONG_MAX + d_2)) {
         free(*difference);
         return MP_VAL;
      }
      d = d_1 - d_2;
      /* We only keep results that are different from zero */
      if (d != 0) {
         (*difference)[counter]     = p;
         (*difference)[counter + 1] = d;
         counter += 2;
      }
   }
   /* We need to find out which one is the smaller array, and we have
      basically two ways to approach the problem:
         a) complicated and error-prone pointer juggling
         b) just two loops
      Here I have chosen b.
   */

   /* If both arrays have the same length we can just stop here */
   if (max_length == min_length) {
      /* We made nothing dirty, so there's nothing to clean up; let's just
         grab our stuff and run */
      return MP_OKAY;
   }

   /* If the subtrahend is bigger we subtract zeros, so simply copy */
   if (l_subtrahend >= l_minuend) {
      for (k = min_length; k < max_length; k += 2) {
         p = subtrahend[k];
         d = subtrahend[k + 1];
         /* We still check, one never knows where things came from */
         if (d != 0) {
            (*difference)[counter]     = p;
            (*difference)[counter + 1] = d;
            counter += 2;
         }
      }
   }
   /* If the minuend is bigger we subtract from zeros, so negate before copying */
   else {
      for (k = min_length; k < max_length; k += 2) {
         p = minuend[k];
         /* Yes, even negation can overflow */
         d_1 = minuend[k + 1];
         if (d_1 == LONG_MIN) {
            free(*difference);
            return MP_VAL;
         }
         d = -d_1;
         if (d != 0) {
            (*difference)[counter]     = p;
            (*difference)[counter + 1] = d;
            counter += 2;
         }
      }
   }
   /* Take care of the little difference the length might have */
   *l_difference = counter;
   return MP_OKAY;
}


Yes, I even checked for under/overflow this time ;-)
To implement the test for consecutiveness we have to check at every step. Let's pick out the first loop and put a comment where it has to change:

  for (k = 0; k < min_length; k += 2) {
     /* Both have the same length, so we can take the value of the prime
        from either one? No, that is not true: we have to check and insert
        if none is there. We don't have to insert every prime, just enough
        that both have the same amount.
     */
     p_1 = subtrahend[k];
     p_2 = minuend[k];
     if (p_1 != p_2) {
        /* where is the gap? */
        if (p_1 < p_2) {
           /* the gap is in the minuend, fill up the subtrahend */
        }
        if (p_1 > p_2) {
           /* the gap is in the subtrahend, fill up the minuend */
        }
     }
     /* handle the difference in the indices */
     /* d = subtrahend[k+1] - minuend[k+1]; */
     d_1 = subtrahend[k + 1];
     d_2 = minuend[k + 1];
     /* handle over/underflow */
     if ((d_2 > 0) && (d_1 < LONG_MIN + d_2)) {
        return MP_VAL;
     }
     if ((d_2 < 0) && (d_1 > LONG_MAX + d_2)) {
        return MP_VAL;
     }
     d = d_1 - d_2;
     /* We only keep results that are different from zero */
     if (d != 0) {
        (*difference)[counter]     = p_1;
        (*difference)[counter + 1] = d;
        counter += 2;
     }
  }


As you might have guessed already, it is not very straightforward. We cannot simply fill up the minuend because we have to subtract the exponents, so we have to fill up with negated exponents. To keep track of the difference in the indices we have to know which one, subtrahend or minuend, differs and by how much. Which results in this (untested! I tend to be off by one in these situations the first time, so please check with a debugger before you try it) loop:

  unsigned long gap_in_subtrahend = 0;
  unsigned long gap_in_minuend    = 0;
  for (k = 0; k < min_length; k += 2) {
     p_1 = subtrahend[k + gap_in_minuend];
     p_2 = minuend[k + gap_in_subtrahend];
     if (p_1 != p_2) {
        if (p_1 < p_2) {
           /* the gap is in the minuend: pad the result with neutral
              entries until the primes line up again */
           for (i = 0; p_1 < p_2; i += 2) {
              p_1 = subtrahend[k + i + gap_in_minuend];
              (*difference)[counter]     = 1;
              (*difference)[counter + 1] = 0;
              counter += 2;
           }
           gap_in_minuend += i;
        }
        else {
           /* the gap is in the subtrahend: copy the negated minuend
              entries */
           for (i = 0; p_1 > p_2; i += 2) {
              p_2 = minuend[k + i + gap_in_subtrahend];
              d_2 = minuend[k + i + 1 + gap_in_subtrahend];
              if (d_2 == LONG_MIN) {
                 return MP_VAL;
              }
              d = -d_2;
              (*difference)[counter]     = p_2;
              (*difference)[counter + 1] = d;
              counter += 2;
           }
           gap_in_subtrahend += i;
        }
     }
     /* d = subtrahend[k+1] - minuend[k+1]; */
     d_1 = subtrahend[k + 1 + gap_in_minuend];
     d_2 = minuend[k + 1 + gap_in_subtrahend];
     /* handle over/underflow */
     if ((d_2 > 0) && (d_1 < LONG_MIN + d_2)) {
        return MP_VAL;
     }
     if ((d_2 < 0) && (d_1 > LONG_MAX + d_2)) {
        return MP_VAL;
     }
     d = d_1 - d_2;
     /* We only keep results that are different from zero */
     if (d != 0) {
        (*difference)[counter]     = p_1;
        (*difference)[counter + 1] = d;
        counter += 2;
     }
  }


A check for the bounds of the participating arrays is still missing. And all we would gain is some saved memory. And how often does it happen? It happens only if you manipulate the arrays by hand, e.g.: divide by a small integer, which means subtracting the prime factorization of a small integer, or after a couple of divisions by factorials.
Two small examples:

${\displaystyle{ \frac{\left(\frac{20!}{10!}\right)}{10!} = 2^2 \cdot 3^0 \cdot 5^0 \cdot 7^0 \cdot 11^1 \cdot 13^1 \cdot 17^1 \cdot 19^1 = 184\,756 }}$

And

${\displaystyle{ \frac{10!}{25} = 2^8 \cdot 3^4 \cdot 5^0 \cdot 7^1 = 145\,152 }}$

So it does happen, and not rarely; we cannot ignore it.
Another one of the basic rules of programming is: "Make it work first, optimize later!", where "works unusably slowly" is to be considered non-working, too. (One example might be the too optimistic starting value for the nth-root function in libtommath, which makes it unusably slow.)
Skip the array compression? At what cost? Two examples:

${\displaystyle{ \frac{20!}{19!} = 2^2 \cdot 3^0 \cdot 5^1 \cdot 7^0 \cdot 11^0 \cdot 13^0 \cdot 17^0 \cdot 19^0 = 20 }}$

${\displaystyle{ \frac{20!}{18!} = 2^2 \cdot 3^0 \cdot 5^1 \cdot 7^0 \cdot 11^0 \cdot 13^0 \cdot 17^0 \cdot 19^1 = 380 }}$

If we divide two factorials $\frac{m!}{n!}$, the gaps start to build if $n>\frac{m}{2}$ (proof omitted), and if we divide by a small integer, $\frac{m!}{k}$ with $k < \frac{m}{2}$, gaps can appear as well. Handling that does not strictly need the compression; it just needs a place to keep the original, un-evaluated factorials. Maybe at the end of the array with the prime part set to one, which does not change the behavior of the final multiplication ($1^x = 1$). The case should be treated nevertheless, because otherwise the final multiplication loops several times for nothing (at most 32 or 64 times, depending on the bit size of the unsigned long type, but it loops over the whole array every time).
So the check for $n>\frac{m}{2}$ is what we can fall back to if the result without it is too slow (unlikely, since the heaviest computational burden is the multiplication and squaring at the end, but cache misses can hurt; it is to be tested) or uses too much memory (more likely, because we don't do this for small factorials but for large ones, too large to be handled otherwise, something in the range of $n > 10^6$ for $n!$).
So (I am such a so-so man ;-) ), no memory optimizations for now. That makes the code look a bit different. Not much, but, well, a bit. We just strip the checks for whether the difference is zero and leave the rest untouched.

int mp_subtract_factored_factorials(long *subtrahend,
                                    unsigned long l_subtrahend,
                                    long *minuend,
                                    unsigned long l_minuend,
                                    long **difference,
                                    unsigned long /* Vive */ *l_difference /*!*/)
{
   unsigned long k, counter = 0;
   unsigned long max_length = MAX(l_subtrahend, l_minuend);
   unsigned long min_length = MIN(l_subtrahend, l_minuend);
   long d, p, d_1, d_2;
   /* TODO: check for sizes > 0 here */

   /* Things work a bit differently from ordinary arithmetic from now on */
   *difference = malloc(sizeof(long) * max_length);
   if (*difference == NULL) {
      return MP_MEM;
   }

   /* Loop over the smaller chunk first */
   for (k = 0; k < min_length; k += 2) {
      /* both lists hold the same primes here, so take it from either */
      p = subtrahend[k];
      /* d = subtrahend[k+1] - minuend[k+1]; */
      d_1 = subtrahend[k + 1];
      d_2 = minuend[k + 1];
      /* handle over/underflow */
      if ((d_2 > 0) && (d_1 < LONG_MIN + d_2)) {
         free(*difference);
         return MP_VAL;
      }
      if ((d_2 < 0) && (d_1 > LONG_MAX + d_2)) {
         free(*difference);
         return MP_VAL;
      }
      d = d_1 - d_2;
      (*difference)[counter]     = p;
      (*difference)[counter + 1] = d;
      counter += 2;
   }
   /* We need to find out which one is the smaller array, and we have
      basically two ways to approach the problem:
         a) complicated and error-prone pointer juggling
         b) just two loops
      Here I have chosen b.
   */

   /* If both arrays have the same length we can just stop here */
   if (max_length == min_length) {
      /* We made nothing dirty, so there's nothing to clean up; let's just
         grab our stuff and run */
      return MP_OKAY;
   }

   /* If the subtrahend is bigger we subtract zeros, so simply copy */
   if (l_subtrahend >= l_minuend) {
      for (k = min_length; k < max_length; k += 2) {
         (*difference)[counter]     = subtrahend[k];
         (*difference)[counter + 1] = subtrahend[k + 1];
         counter += 2;
      }
   }
   /* If the minuend is bigger we subtract from zeros, so negate before copying */
   else {
      for (k = min_length; k < max_length; k += 2) {
         p = minuend[k];
         /* Yes, even negation can overflow */
         d_1 = minuend[k + 1];
         if (d_1 == LONG_MIN) {
            free(*difference);
            return MP_VAL;
         }
         d = -d_1;
         (*difference)[counter]     = p;
         (*difference)[counter + 1] = d;
         counter += 2;
      }
   }
   /* this is now the same as max_length */
   *l_difference = counter;
   return MP_OKAY;
}


Addition follows the example of subtraction

int mp_add_factored_factorials(long *summand_1,
                               unsigned long l_summand_1,
                               long *summand_2,
                               unsigned long l_summand_2,
                               long **sum,
                               unsigned long *l_sum)
{
   unsigned long k, counter = 0;
   unsigned long max_length = MAX(l_summand_1, l_summand_2);
   unsigned long min_length = MIN(l_summand_1, l_summand_2);
   long s, p, s_1, s_2;
   /* For more detailed comments see mp_subtract_factored_factorials() */
   *sum = malloc(sizeof(long) * max_length);
   if (*sum == NULL) {
      return MP_MEM;
   }
   for (k = 0; k < min_length; k += 2) {
      p = summand_1[k];
      /* Over/underflow possible! */
      /* s = summand_1[k+1] + summand_2[k+1]; */
      s_1 = summand_1[k + 1];
      s_2 = summand_2[k + 1];

      if ((s_2 > 0) && (s_1 > LONG_MAX - s_2)) {
         /* overflow */
         free(*sum);
         return MP_VAL;
      }
      if ((s_2 < 0) && (s_1 < LONG_MIN - s_2)) {
         /* underflow */
         free(*sum);
         return MP_VAL;
      }
      s = s_1 + s_2;
      (*sum)[counter]     = p;
      (*sum)[counter + 1] = s;
      counter += 2;
   }
   /* copy the tail of the longer list */
   if (l_summand_1 >= l_summand_2) {
      for (k = min_length; k < max_length; k += 2) {
         (*sum)[counter]     = summand_1[k];
         (*sum)[counter + 1] = summand_1[k + 1];
         counter += 2;
      }
   }
   else {
      for (k = min_length; k < max_length; k += 2) {
         (*sum)[counter]     = summand_2[k];
         (*sum)[counter + 1] = summand_2[k + 1];
         counter += 2;
      }
   }
   *l_sum = counter;
   return MP_OKAY;
}


Multiplication and division of these polynomials do not make much sense because they are multivariate, but multiplying all exponents by a small integer, i.e. raising the number to a small power, does. Example:

{\displaystyle{ \begin{aligned}20! &= 2^{18} \cdot 3^8 \cdot 5^4 \cdot 7^2 \cdot 11^1 \cdot 13^1 \cdot 17^1 \cdot 19^1 \\ (20!)^2 &= 2^{36} \cdot 3^{16} \cdot 5^8 \cdot 7^4 \cdot 11^2 \cdot 13^2 \cdot 17^2 \cdot 19^2 \end{aligned} }}

We need to check for over/underflow at the multiplications here, where we have two alternatives: use a data type that has twice the size of the data type of the multiplicands, or do it the hard way.
The hard, but safe and portable, way is to check every possible sign combination with a division against the type's limits before multiplying.

if (a > 0) {
   if (b > 0) {
      if (a > (SIGNED_TYPE_MAX / b)) {
         return ERROR;
      }
   }
   else {
      if (b < (SIGNED_TYPE_MIN / a)) {
         return ERROR;
      }
   }
}
else {
   if (b > 0) {
      if (a < (SIGNED_TYPE_MIN / b)) {
         return ERROR;
      }
   }
   else {
      if ((a != 0) && (b < (SIGNED_TYPE_MAX / a))) {
         return ERROR;
      }
   }
}
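Wrapped into a predicate, the pattern becomes testable; mul_overflows is a made-up name for this sketch:

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if a*b would overflow a long, 0 otherwise; same case
   analysis as above, just with concrete types */
int mul_overflows(long a, long b)
{
    if (a > 0) {
        if (b > 0)
            return a > LONG_MAX / b;
        return b < LONG_MIN / a;            /* b <= 0 */
    }
    if (b > 0)
        return a < LONG_MIN / b;
    return a != 0 && b < LONG_MAX / a;      /* a <= 0, b <= 0 */
}
```

Note the asymmetric corner case: a product of two negative numbers can overflow, too, e.g. $(-1) \cdot$ LONG_MIN on two's-complement machines.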


The other alternative is to use long long and hope for the best or use the explicit data types in stdint.h throughout.
We use libtommath here, where a data type mp_word is defined that can hold the product of two mp_digit values, but mp_digit can be as small as a char.

int mp_power_factored_factorials(long *input,
                                 unsigned long l_input,
                                 long multiplicator,
                                 long **product,
                                 unsigned long *l_product)
{
   unsigned long k, counter = 0;
   long prod, p, p_1, p_2;
#ifdef MP_USE_LONG_LONG_AND_HOPE_FOR_THE_BEST
   long long temp;
#endif
   *product = malloc(sizeof(long) * (l_input + 1));
   if (*product == NULL) {
      return MP_MEM;
   }
   p_2 = multiplicator;
   for (k = 0; k < l_input; k += 2) {
      p = input[k];
      /* Over/underflow possible! */
      /* prod = input[k+1] * multiplicator; */
      p_1 = input[k + 1];
      /* Two alternatives: use "long long" and hope for the best, or do it
         the hard way */
#ifdef MP_USE_LONG_LONG_AND_HOPE_FOR_THE_BEST
      temp = p_2 * (long long)p_1;
      if ((temp > LONG_MAX) || (temp < LONG_MIN)) {
         return MP_VAL;
      }
      prod = (long)temp;
#else
      if (p_2 > 0) {
         if (p_1 > 0) {
            if (p_2 > (LONG_MAX / p_1)) {
               return MP_VAL;
            }
         }
         else {
            if (p_1 < (LONG_MIN / p_2)) {
               return MP_VAL;
            }
         }
      }
      else {
         if (p_1 > 0) {
            if (p_2 < (LONG_MIN / p_1)) {
               return MP_VAL;
            }
         }
         else {
            if ((p_2 != 0) && (p_1 < (LONG_MAX / p_2))) {
               return MP_VAL;
            }
         }
      }
      prod = p_2 * p_1;
#endif
      (*product)[counter]     = p;     /* the prime stays */
      (*product)[counter + 1] = prod;  /* the exponent gets multiplied */
      counter += 2;
   }

   *l_product = counter;
   return MP_OKAY;
}


We can produce the reciprocal of a number by negating all exponents of its prime factorization.

int mp_negate_factored_factorials(long *input,
                                  unsigned long l_input,
                                  long **output,
                                  unsigned long *l_output)
{
   unsigned long k, counter = 0;
   long neg;

   *output = malloc(sizeof(long) * (l_input + 1));
   if (*output == NULL) {
      return MP_MEM;
   }
   for (k = 0; k < l_input; k += 2) {
      neg = input[k + 1];
      /* negating LONG_MIN would overflow */
      if (neg == LONG_MIN) {
         free(*output);
         return MP_VAL;
      }
      (*output)[counter]     = input[k];
      (*output)[counter + 1] = -neg;
      counter += 2;
   }
   *l_output = counter;
   return MP_OKAY;
}


One can do the above inline, of course, and save a copy, but doing things inline is always a rich source of unexpected side effects and an abundantly gushing spring of feet with gunshot injuries, so pleeease be careful.

The final step, computing the value from the polynomial, differs from the function mp_compute_factored_factorials() in that we have to be able to handle negative exponents, too.
One simple way to handle negative exponents is to keep them in a second array with the absolute values of the exponents and divide at the end. Another alternative might be to return both arrays and let the user decide. Either way, we have to check, and if there is at least one negative exponent, decide what to do.

It is also the place where we have to decide what to do with the fact that the useful function mp_mul_d() accepts only mp_digit data types as the small integers. The largest exponent in the factorization of $n!$ is the exponent of two, with a magnitude of roughly $n$, and each following exponent is roughly half the magnitude of its predecessor. So the biggest possible exponent for the smallest possible mp_digit data type is 255, which makes the largest processable factorial $256!$ if I calculated it correctly (the biggest prime smaller than 255 is 251). That is not very much.

From here on I assume DIGIT_BIT == 28 and check for it. We will handle the exponent of two separately, which will allow us to compute $n!$ for $n<2^{29} = 536\,870\,912$. The factorial of half a billion is about $1.6631e4\,453\,653\,131$, a number with nearly four and a half billion digits. Close to the age of the Earth in years (4.54 ± 0.05 billion according to Wikipedia), not so close to the age in minutes (3.15 billion, according to one James Ussher).
What, me and digressing? Ow, c'mon!

static long local_highbit(unsigned  long n){
long r=0;
while (n >>= 1) {
r++;
}
return r;
}
/*
To compute the resulting polynomials we need an approach different from
mp_compute_factored_factorials() because we have to be able to handle negative
exponents, too.
*/
int mp_multiply_factored_factorials(long *f, unsigned long f_length,
mp_int *c, mp_int *r){
unsigned long start=0,i,count=0,neg_count = 0,k=0;
long shift = 0;
long bit=0,neg_bit=0,hb=0,neg_hb=0,*p=NULL;
mp_int temp;
int e;

if(f[0] == 2){
shift = f[1];
if(f_length == 2){
if(shift > 0){
if( (e = mp_set_int(c,1LU<<(unsigned long)f[1]) ) != MP_OKAY){
return e;
}
}
return MP_OKAY;
}
}
if(shift){
start+=2;
k=2;
}
/*
This loop is expensive and should be done inline instead but for now...
*/
for(k=start;k<f_length;k+=2){
if(f[k+1] < 0){
/* count how many to save some memory */
count++;
}
/* Only primes up to 2^DIGIT_BIT because of mp_mul_d() */
if(f[k] >= 1<<DIGIT_BIT){
return MP_VAL;
}
}
/* all are negative */
//if(count && count == f_length/2){
/* if the alternative with two outputs has been chosen just skip the
computation of the polynomial with all positive exponents and set
*/
/* goto: negative_exponents; */

/* The other alternative would compute 1/x that gives always 0 in integer
divisions if x>1. But it would need exactly one exponent with zero which
has been filtered out already. It is debatable now if that was a good
decision.*/
//return MP_OKAY;
//}

if(count){
p = malloc( sizeof(long) * (count * 2) );
if(p == NULL){
return MP_MEM;
}
}

for(k=start;k<f_length;k+=2){
/* won't branch if count == 0 */
if(f[k+1] < 0){
p[neg_count]   = f[k];
p[neg_count+1] = labs(f[k+1]);
neg_count +=2 ;
/* And while we are already here ... */
neg_hb = local_highbit(labs(f[k+1]));
if(neg_bit < neg_hb)neg_bit = neg_hb;
/* This allows for some optimization, mainly w.r.t memory. Not done yet */
f[k]   = 1;
/* Set to zero to let the main loop skip it */
f[k+1] = 0;
}
else{
hb = local_highbit(f[k+1]);
if(bit < hb)bit = hb;
}
}

/* DIGIT_BIT can be as small as 8 */
if(bit >(long) DIGIT_BIT || neg_bit >(long) DIGIT_BIT){
return MP_VAL;
}
/* Anything times zero is zero, so set the result to one in advance */
if( (e = mp_set_int(c,1) ) != MP_OKAY){
return e;
}

/* Everything is in place, so let's start by computing with the positive
exponents first */

/* The other function mp_compute_factored_factorials() has a branch
but the cases where it makes sense are quite rare. Feel free to add it
yourself */
for( ; bit>=0; bit--) {
if( (e = mp_sqr(c, c) ) != MP_OKAY){
return e;
}
for(i=start; i<f_length; i+=2) {
if ((f[i+1] & (1LU << (unsigned long)bit)) != 0) {
/* This is problematic if DIGIT_BIT is too small. checked above */
if( (e = mp_mul_d(c,(mp_digit)f[i], c) ) != MP_OKAY){
return e;
}
}
}
}

/* Compute c * 2^n if n!=0 */
if(shift && shift > 0){
if(shift < 1<<DIGIT_BIT){
/* The choice of types is a bit questionable. Sometimes. */
if( (e = mp_mul_2d (c, (mp_digit) shift, c) ) != MP_OKAY){
return e;
}
}
else{
long multiplicator = 0;
another_round:
while(shift > 1<<DIGIT_BIT){
shift >>= 1;
multiplicator++;
}
if( (e = mp_mul_2d (c, (mp_digit) shift, c) ) != MP_OKAY){
return e;
}
if(multiplicator< DIGIT_BIT){
if( (e = mp_mul_2d (c, (mp_digit)(1<<multiplicator), c) ) != MP_OKAY){
return e;
}
}
else{
shift = 1<<multiplicator;
multiplicator = 0;
goto another_round;
}
}
}

/* None are negative*/
if(count == 0){
/* Clean up */
/* Nothing to clean up */
return MP_OKAY;
}

/* Compute with the negative exponents */
if( (e = mp_init(&temp) ) != MP_OKAY){
return e;
}
if( (e = mp_set_int(&temp,1) ) != MP_OKAY){
return e;
}
for( ; neg_bit>=0; neg_bit--) {
if( (e = mp_sqr(&temp, &temp) ) != MP_OKAY){
return e;
}
/* The exponents of 2 have been stripped already so we start at zero.
"count" counts only the occurrences, the size itself is twice as large.
*/
for(i=0; i<count*2; i+=2) {
if ((p[i+1] & (1LU << (unsigned long)neg_bit)) != 0) {
if( (e = mp_mul_d(&temp,(mp_digit)p[i], &temp) ) != MP_OKAY){
return e;
}
}
}
}

/* Compute c * 2^n if n!=0  for negative n*/
if(shift && shift < 0){
/* check for overflow */
if(shift == LONG_MIN){
return MP_VAL;
}
shift = -shift;
if(shift < 1<<DIGIT_BIT){
/* The choice of types is a bit questionable. Sometimes. */
if( (e = mp_mul_2d (&temp, (mp_digit) shift, &temp) ) != MP_OKAY){
return e;
}
}
else{
long multiplicator = 0;
another_round_neg: /* labels are function-scoped in C, so this one needs its own name */
while(shift > 1<<DIGIT_BIT){
shift >>= 1;
multiplicator++;
}
if( (e = mp_mul_2d (&temp, (mp_digit) shift, &temp) ) != MP_OKAY){
return e;
}
if(multiplicator < DIGIT_BIT){
if( (e = mp_mul_2d (&temp, (mp_digit)(1<<multiplicator), &temp) ) != MP_OKAY){
return e;
}
}
else{
shift = 1<<multiplicator;
multiplicator = 0;
goto another_round_neg;
}
}
}
#ifdef BN_MP_DO_LAST_DIVISION
if( (e = mp_div(c,&temp,c,r) ) != MP_OKAY){
return e;
}
#else
if(r != NULL){
if( (e = mp_copy(&temp,r) ) != MP_OKAY){
return e;
}
}
#endif
free(p);
mp_clear(&temp);
return MP_OKAY;
}


It compiles and works for small inputs but it cannot be called well tested. Use with care and drop me a note if you find a bug.

How to compute Narayana numbers? As I said in the beginning: computing them with the formula

${\displaystyle{ N(n,k) = \frac{1}{n}{ \binom nk}{\binom n{k-1}} }}$

would most probably be the fastest way and if not, it is the easiest to implement at least.

int mp_narayana(unsigned long n, unsigned long k, mp_int *c){
mp_int temp;
int e;
if( (e = mp_init(&temp) ) != MP_OKAY){
return e;
}
if( (e = mp_binomial(n,k,c) ) != MP_OKAY){
return e;
}
if( (e = mp_binomial(n,k-1,&temp) ) != MP_OKAY){
return e;
}
if( (e = mp_mul(c,&temp,c) ) != MP_OKAY){
return e;
}
if( (e = mp_set_int(&temp,n) ) != MP_OKAY){
return e;
}
if( (e = mp_div(c,&temp,c,NULL) ) != MP_OKAY){
return e;
}
mp_clear(&temp);
return MP_OKAY;
}


The above code will be merged as mp_narayana() into my fork of libtommath in the next days. The rest only after some thorough tests.