Nobody believed it but finally: FFT multiplication for my little big-integer library. It is quite general and can also be used for Tom Wu's bigint library, provided the digit size is not larger than 28 bits (the default for Mozilla, but check the start of Tom Wu's file).

I won't go into the mathematical details; there is more than enough to be found online, but feel free to ask if you have any questions.

The code-base I used is my own implementation of FFT multiplication for Libtommath, so there are no license problems.


# Adventures of a Programmer: Parser Writing Peril XVI

Back to the experiment with balancing the multiplicands, which failed despite the advantage that theory says it must have. If the margin of this blog post were wider, I could even prove that it is better!

*stomps foot*

And it *is* better! All I had done was comment out the wrong line, so instead of getting two distinct benchmarks, the second benchmark was the sum of both, which made me think that balancing needs about twice the time of normal multiplication.

Moral of the story: if some results look suspicious, they mostly are. Or would you buy a machine that generates energy for free?

So, how does the real benchmark look?

Well, differently 😉

The method is as described before but let me talk a bit about the numbers to be tested.

It should be obvious that the balancing will make sense only for numbers large enough to pass the cut-off point of the Toom-Cook algorithms (Toom-Cook 2, better known as Karatsuba, and Toom-Cook 3 are implemented in Libtommath); otherwise it would slow the multiplication down: costly overhead without any effect. The cut-off points differ from architecture to architecture; mine are (in mp_digits): TC2 = 48, TC3 = 190, and for the very big numbers FFT = 4000 (which is oversimplified, but I'm working on it).

The tests for the small numbers run 1,000 times each. Time is in seconds.


Number Pair | Normal Multiplication | Balanced Multiplication |
---|---|---|
50 * 100 | 0.04 | 0.07 |
100 * 150 | 0.14 | 0.15 |
100 * 200 | 0.18 | 0.19 |
150 * 300 | 0.35 | 0.34 |
100 * 400 | 0.37 | 0.44 |
200 * 400 | 0.79 | 0.61 |
300 * 400 | 0.98 | 0.91 |
150 * 500 | 0.62 | 0.63 |
250 * 500 | 1.13 | 0.81 |
350 * 500 | 1.38 | 1.19 |
400 * 500 | 1.35 | 1.28 |
450 * 500 | 1.30 | 1.30 |
50 * 600 | 0.53 | 0.54 |
100 * 600 | 1.00 | 0.60 |
150 * 600 | 1.42 | 0.75 |
200 * 600 | 1.31 | 1.12 |
250 * 600 | 1.44 | 1.14 |
300 * 600 | 1.78 | 1.37 |
350 * 600 | 1.81 | 1.45 |
400 * 600 | 1.83 | 1.78 |
450 * 600 | 1.86 | 1.57 |
500 * 600 | 1.74 | 1.68 |
550 * 600 | 1.62 | 1.88 |
50 * 700 | 0.61 | 0.63 |
100 * 700 | 1.18 | 1.21 |
150 * 700 | 1.69 | 0.97 |
200 * 700 | 2.83 | 1.39 |
250 * 700 | 3.23 | 1.55 |
300 * 700 | 2.16 | 1.91 |
350 * 700 | 2.22 | 1.44 |
400 * 700 | 2.34 | 1.87 |
450 * 700 | 2.35 | 2.06 |
500 * 700 | 2.32 | 2.27 |
550 * 700 | 2.26 | 2.09 |
600 * 700 | 2.68 | 2.76 |
650 * 700 | 2.39 | 2.53 |
50 * 800 | 0.69 | 0.74 |
100 * 800 | 1.36 | 1.42 |
150 * 800 | 1.95 | 1.91 |
200 * 800 | 3.34 | 1.70 |
250 * 800 | 3.84 | 1.90 |
300 * 800 | 4.46 | 2.35 |
350 * 800 | 3.18 | 2.25 |
400 * 800 | 2.85 | 1.70 |
450 * 800 | 2.91 | 2.22 |
500 * 800 | 2.93 | 2.60 |
550 * 800 | 2.88 | 2.70 |
600 * 800 | 3.73 | 3.07 |
650 * 800 | 3.51 | 3.51 |
700 * 800 | 3.16 | 3.37 |
750 * 800 | 2.82 | 2.96 |
50 * 900 | 0.78 | 0.83 |
100 * 900 | 1.54 | 1.57 |
150 * 900 | 2.21 | 2.15 |
200 * 900 | 3.94 | 3.43 |
250 * 900 | 4.48 | 2.20 |
300 * 900 | 5.29 | 2.64 |
350 * 900 | 5.72 | 2.42 |
400 * 900 | 6.00 | 2.29 |
450 * 900 | 6.20 | 2.01 |
500 * 900 | 3.58 | 2.64 |
550 * 900 | 3.55 | 2.99 |
600 * 900 | 4.91 | 3.56 |
650 * 900 | 4.79 | 3.76 |
700 * 900 | 4.33 | 5.07 |
750 * 900 | 4.07 | 4.21 |
800 * 900 | 3.73 | 4.07 |
850 * 900 | 3.68 | 3.89 |
50 * 1000 | 0.87 | 0.93 |
100 * 1000 | 1.72 | 1.78 |
150 * 1000 | 2.50 | 2.49 |
200 * 1000 | 4.35 | 4.01 |
250 * 1000 | 5.12 | 4.38 |
300 * 1000 | 5.99 | 3.03 |
350 * 1000 | 6.55 | 3.07 |
400 * 1000 | 6.96 | 2.80 |
450 * 1000 | 7.25 | 2.62 |
500 * 1000 | 7.44 | 2.32 |
550 * 1000 | 7.57 | 2.96 |
600 * 1000 | 5.86 | 3.64 |
650 * 1000 | 5.74 | 3.99 |
700 * 1000 | 5.55 | 4.40 |
750 * 1000 | 5.35 | 6.00 |
800 * 1000 | 5.08 | 6.12 |
850 * 1000 | 5.08 | 5.26 |
900 * 1000 | 4.87 | 5.01 |
950 * 1000 | 4.28 | 4.51 |

Some points are more off than others; that might have its reason in the actual numbers, which get produced with a cheap PRNG and are reused over the whole loop. Let me repeat the last round ([50, 950] * 1,000) with a different number each time. (Generating thousands of large numbers takes some time, but we can ignore it; it is the same for both.)

Number Pair | Normal Multiplication | Balanced Multiplication |
---|---|---|
50 * 1000 | 5.03 | 3.87 |
100 * 1000 | 5.05 | 3.84 |
150 * 1000 | 4.99 | 3.85 |
200 * 1000 | 4.99 | 3.86 |
250 * 1000 | 5.00 | 3.89 |
300 * 1000 | 5.00 | 3.85 |
350 * 1000 | 5.00 | 3.86 |
400 * 1000 | 5.00 | 3.92 |
450 * 1000 | 4.99 | 3.85 |
500 * 1000 | 5.05 | 3.87 |
550 * 1000 | 4.99 | 3.89 |
600 * 1000 | 5.05 | 3.86 |
650 * 1000 | 4.98 | 3.88 |
700 * 1000 | 5.08 | 3.86 |
750 * 1000 | 5.05 | 3.88 |
800 * 1000 | 4.98 | 3.84 |
850 * 1000 | 5.01 | 3.94 |
900 * 1000 | 5.12 | 3.86 |
950 * 1000 | 5.03 | 3.85 |

Oh?

Let's use libtommath's own function `mp_rand`, too:

Number Pair | Normal Multiplication | Balanced Multiplication |
---|---|---|
50 * 1000 | 8.50 | 7.34 |
100 * 1000 | 8.50 | 7.37 |
150 * 1000 | 8.49 | 7.38 |
200 * 1000 | 8.47 | 7.35 |
250 * 1000 | 8.48 | 7.34 |
300 * 1000 | 8.50 | 7.39 |
350 * 1000 | 8.48 | 7.36 |
400 * 1000 | 8.49 | 7.35 |
450 * 1000 | 8.47 | 7.30 |
500 * 1000 | 8.47 | 7.33 |
550 * 1000 | 8.49 | 7.37 |
600 * 1000 | 8.47 | 7.33 |
650 * 1000 | 8.47 | 7.32 |
700 * 1000 | 8.47 | 7.34 |
750 * 1000 | 8.51 | 7.34 |
800 * 1000 | 8.46 | 7.36 |
850 * 1000 | 8.49 | 7.37 |
900 * 1000 | 8.47 | 7.33 |
950 * 1000 | 8.46 | 7.38 |

The function `mp_rand` is more exact, as it makes sure that the MSD is always different from zero. This gives an interesting effect: the balanced version is even faster when both multiplicands have been ordered to have the same size. So let me get something to read while the following script runs:

```shell
for i in `seq 50 50 1000`; do
    for j in `seq 50 50 $i`; do
        ./testbalancing $i $j
    done
done
```

The last round is the most significant:

Number Pair | Normal Multiplication | Balanced Multiplication |
---|---|---|
1000 * 50 | 1.31 | 1.32 |
1000 * 100 | 1.72 | 1.67 |
1000 * 150 | 2.03 | 1.90 |
1000 * 200 | 2.75 | 2.30 |
1000 * 250 | 2.99 | 2.35 |
1000 * 300 | 3.26 | 2.60 |
1000 * 350 | 3.37 | 2.66 |
1000 * 400 | 3.49 | 2.79 |
1000 * 450 | 3.57 | 2.98 |
1000 * 500 | 3.66 | 3.31 |
1000 * 550 | 3.73 | 3.63 |
1000 * 600 | 4.26 | 4.20 |
1000 * 650 | 4.49 | 4.41 |
1000 * 700 | 4.85 | 4.86 |
1000 * 750 | 5.15 | 5.18 |
1000 * 800 | 5.63 | 5.40 |
1000 * 850 | 6.28 | 5.88 |
1000 * 900 | 6.92 | 6.30 |
1000 * 950 | 7.67 | 6.80 |
1000 * 1000 | 8.47 | 7.36 |

As I said at the start of this post: "If some results look suspicious, they mostly are." So let's do a check:

Mmh…

```c
n = strtoul(argv[1], NULL, 10);
m = strtoul(argv[2], NULL, 10);
/* ... and later, in the benchmark loop: */
for (n = 0; n < 1000; n++) {
    /* ... */
    mp_rand(&a, n);   /* n is the loop counter now, not the requested size! */
    mp_rand(&b, m);
}
```

Ouch!

Now if that's not embarrassing, I don't know what else is 😉

Ok, now on with the real one. The round with one thousand first to see if the results are reasonable now.

Number Pair | Normal Multiplication | Balanced Multiplication |
---|---|---|
50 * 1000 | 0.92 | 1.06 |
100 * 1000 | 1.77 | 1.93 |
150 * 1000 | 2.40 | 2.46 |
200 * 1000 | 4.64 | 4.07 |
250 * 1000 | 5.14 | 4.56 |
300 * 1000 | 6.38 | 2.88 |
350 * 1000 | 7.08 | 2.98 |
400 * 1000 | 7.42 | 2.86 |
450 * 1000 | 7.21 | 2.52 |
500 * 1000 | 7.36 | 2.41 |
550 * 1000 | 7.56 | 2.98 |
600 * 1000 | 5.94 | 3.63 |
650 * 1000 | 5.91 | 3.60 |
700 * 1000 | 5.76 | 4.27 |
750 * 1000 | 5.44 | 6.06 |
800 * 1000 | 5.21 | 5.88 |
850 * 1000 | 5.46 | 5.38 |
900 * 1000 | 5.19 | 5.10 |
950 * 1000 | 4.41 | 4.40 |
1000 * 1000 | 3.54 | 3.53 |

Yepp, that makes more sense; the data supports the theory. There is a jump at about 300 * 1,000, it increases smoothly (more or less) up to about 700 * 1,000 and… oh, I forgot to switch off the shortcuts. Aaaaaand again 😉

Number Pair | Normal Multiplication | Balanced Multiplication |
---|---|---|
50 * 1000 | 1.04 | 0.87 |
100 * 1000 | 1.87 | 2.01 |
150 * 1000 | 2.67 | 2.40 |
200 * 1000 | 4.61 | 3.98 |
250 * 1000 | 5.18 | 4.51 |
300 * 1000 | 6.33 | 3.00 |
350 * 1000 | 6.77 | 2.85 |
400 * 1000 | 7.03 | 2.89 |
450 * 1000 | 7.49 | 2.43 |
500 * 1000 | 7.60 | 2.45 |
550 * 1000 | 7.73 | 2.98 |
600 * 1000 | 5.68 | 3.62 |
650 * 1000 | 6.11 | 3.99 |
700 * 1000 | 5.89 | 4.35 |
750 * 1000 | 5.63 | 5.91 |
800 * 1000 | 5.15 | 6.11 |
850 * 1000 | 5.36 | 5.25 |
900 * 1000 | 5.18 | 5.09 |
950 * 1000 | 4.62 | 4.34 |
1000 * 1000 | 4.16 | 4.29 |

Nearly the same. There are two peaks where the differences are close to the Toom-Cook cut-off point. I'll put the full table after the fold, but the conclusion is that this kind of balancing makes the most sense for size ratios between about 3/10 and 7/10, and both multiplicands should be larger than the Toom-Cook 3 cut-off.


# Adventures of a Programmer: Parser Writing Peril XV

To test the last try of balancing multiplication in libtommath I needed to generate some large numbers. Really large numbers. Numbers tens of millions of decimal digits long. Using e.g. `stdin` as the input needs patience and has limits regarding the length of the argument buffer. It is more elegant to produce them directly, with the conditions that the bits are more or less uniformly distributed, that they can be generated fast, and that there are no unwelcome side effects.

There is a function in libtommath called `mp_rand` which produces a pseudo-random integer of a given size, but it does not meet the above conditions. It uses a slow method involving elementary functions like `add` and `shift`, but that is negligible. It uses `rand()`, which *has* side effects. The first one is that the C standard ISO/IEC 9899:2011 leaves its algorithm completely up to the implementation (POSIX.1-2001 at least documents an example generator), and the second one is that `rand()` might be implemented in a cryptographically secure way, and its usage would then reduce the entropy pool without a good reason when used to generate large numbers for mere testing.

The method I used was to take a simple PRNG (pseudo-random number generator) and copy the generated small 32-bit integers directly into the digits.

```c
int make_large_number(mp_int *c, int size)
{
    int e;

    if ((e = mp_grow(c, size)) != MP_OKAY) {
        return e;
    }
    c->used = size;
    while (size--) {
        c->dp[size] = (mp_digit)(light_random() & MP_MASK);
    }
    mp_clamp(c);
    return MP_OKAY;
}
```

Here `light_random()` is a small Lehmer generator of the form given in [4], based on [3] (see also [2]); 48271 is a primitive root modulo the Mersenne prime 2^31 - 1.

```c
#include <stdint.h>

static unsigned int seed = 1;

int light_random(void)
{
    int32_t result;

    result = seed;
    result = 48271 * (result % 44488) - 3399 * (int32_t)(result / 44488);
    if (result < 0) {
        result += ((1U << 31) - 1);
    }
    seed = result;
    return result;
}

void slight_random(unsigned int grain)
{
    /* The seed of the Lehmer PRNG above must be co-prime to the modulus.
       The modulus 2^31 - 1 is prime (the Mersenne prime M_31), so every
       number in [1, 2^31 - 2] works; only zero must be avoided. */
    seed = (grain) ? grain : 1;
}
```

The method used to compute the remainder without overflow is called Schrage's method [1]. Let me give a short description.

Given the modulus m = 2^31 - 1 and the multiplier a = 48271, write m = aq + r with q = ⌊m/a⌋ = 44488 and r = m mod a = 3399, so that r < q. For any x in [1, m - 1] we then have

    a*x ≡ a*(x mod q) - r*⌊x/q⌋ (mod m)

and the right-hand side already lies in (-m, m), so at most one addition of m brings it back into range.

I'll omit the full proof and point to the paper, but a short sketch: a*(x mod q) stays below m because x mod q ≤ q - 1 and aq ≤ m, and r*⌊x/q⌋ stays below m because r < q and ⌊x/q⌋ ≤ a. Hence neither product overflows a signed 32-bit integer, while the difference of the two terms differs from a*x by an exact multiple of m.

Put in the values and we have the code from above.

The period is 2^31 - 2, which is enough to fill the maximum number of digits, assuming `sizeof(int) == 4`. With 28-bit digits it is good enough for numbers up to 18,100,795,794 decimal digits.

[1] Schrage, L., *A More Portable Fortran Random Number Generator*, ACM Trans. Math. Software 5, 132-138, 1979.

[2] Stephen K. Park and Keith W. Miller and Paul K. Stockmeyer, *Another Test for Randomness: Response*, Communications of the ACM 36.7 (1993): 108-110.

[3] Lehmer, D. H., *Mathematical methods in large-scale computing units*, Proceedings of a Second Symposium on Large-Scale Digital Calculating Machinery, 1949, 141–146. Harvard University Press, Cambridge, Mass., 1951.

[4] Park, Stephen K., and Keith W. Miller, *Random number generators: good ones are hard to find*, Communications of the ACM 31.10 (1988): 1192-1201.

# Adventures of a Programmer: Parser Writing Peril XIV

As I noted in my last post, the big-integer library libtommath lacks a method to balance the multiplicands in size. The method to do it is quite simple and based on the rule

    (a_1 * m^n + a_0) * b = (a_1 * b) * m^n + a_0 * b

where a = a_1 * m^n + a_0 and b are the multiplicands and m is a multiplier of positive integer value. With m = 10, for example, the larger multiplicand is simply cut into blocks of decimal digits, and each block is multiplied by b on its own.

If we use a binary multiplier instead of the decimal one, we can use simple shifting to do the multiplication, and we should use the big-number digits instead of the decimal ones, too, I think.

Here the cut-off point `MP_C1` (see the code below) denotes the minimum size of the smaller multiplicand. This could be as small as 1 (one), but I would take that as a special value where the algorithm used in `mp_*_d` will show better results (and it should be handled in `mp_mul` directly).

The other cut-off point, `MP_C2`, is the ratio of the smaller size to the larger one. It should be smaller than 1 (one), of course, but how much? A half? Or even earlier? Hard to tell without a test, but I think `MP_C1 = 2` and `MP_C2 = 0.5` will do for a start.

A straightforward implementation could be

```c
#define MP_C1 2
#define MP_C2 0.5f

int mp_balance_mult(mp_int *a, mp_int *b, mp_int *c)
{
    int count, len_a, len_b;
    mp_int a_0, a_1;

    /* get digit sizes of a and b */
    len_a = a->used;
    len_b = b->used;

    /* check if the size of the smaller one reaches C1 */
    if (MIN(len_a, len_b) < MP_C1) {
        mp_mul(a, b, c);
        return MP_OKAY;
    }
    /* check if the sizes differ enough (ratio smaller than C2) */
    if (((float)MIN(len_a, len_b) / (float)MAX(len_a, len_b)) >= MP_C2) {
        mp_mul(a, b, c);
        return MP_OKAY;
    }
    /* make sure that a is the larger one */
    if (len_a < len_b) {
        mp_exch(a, b);
    }
    /* cut the larger one into two parts a_1, a_0 with the lower part a_0
       of the same length as the smaller input number b */
    mp_init_size(&a_0, b->used);
    a_0.used = b->used;
    mp_init_size(&a_1, a->used - b->used);
    a_1.used = a->used - b->used;

    /* fill the lower part a_0 */
    for (count = 0; count < b->used; count++) {
        a_0.dp[count] = a->dp[count];
    }
    /* fill the upper part a_1, with the counter already at the right place */
    for (; count < a->used; count++) {
        a_1.dp[count - b->used] = a->dp[count];
    }
    /* Special offer: Zeeland mussels, only 1.11 EUR/kg! */
    mp_clamp(&a_0);
    mp_clamp(&a_1);
    /* a_1 = a_1 * b */
    mp_mul(&a_1, b, &a_1);
    /* a_1 = a_1 * beta^(length(a_0)), a shift by whole digits */
    mp_lshd(&a_1, b->used);
    /* a_0 = a_0 * b */
    mp_mul(&a_0, b, &a_0);
    /* c = a_1 + a_0 */
    mp_add(&a_1, &a_0, c);
    mp_clear_multi(&a_0, &a_1, NULL);
    /* don't mess with the input more than necessary */
    if (len_a < len_b) {
        mp_exch(a, b);
    }
    return MP_OKAY;
}
```

To make it short: it is slower; it needs about twice the time of the native multiplication algorithms on average, tested with two numbers of suitably different sizes, and other size ratios make it even worse.

So we can call it, with good conscience, an utter failure. Back to the blackboard.

# Adventures of a Programmer: Parser Writing Peril XIII

The biggest native integer libtommath allows to be set directly seems to be[1] an `unsigned long`, in the function `mp_set_int`. The biggest native integer used internally, on the other side, is hidden behind `mp_word`, which is the type able to hold twice the size of an `mp_digit` and can be larger than an `unsigned long`.

For my calculator I need some ways to work with native numbers without much ado, where *ado* means a lot of conditionals, preprocessor directives, complicated data structures and all that mess. One of the ways to avoid it is to use the digits of the large integers directly if the large integer has only one. An example? Ok, an example.

Take the partial harmonic series

    H(n) = 1 + 1/2 + 1/3 + ... + 1/n

for example. If you calculate it with the help of the binary-splitting algorithm, a lot of the numbers involved are in the range of native integers and occupy only one digit of the big numbers. The initialization of the big numbers in libtommath sets them to at least 8 digits (the responsible variable is `MP_PREC` in `tommath.h`) and consumes costly heap memory to do so.

Fredrik Johansson proposed in a blog post to postpone the reducing of the fraction to the very end. It is not much, but it is something, so let's follow his advice and do so using my rational library (as usual: without any error checking, for less code-cluttering).

```c
static mp_rat h_temp;

mp_rat *_harmonics(unsigned long a, unsigned long b)
{
    unsigned long m;
    mp_rat ta, tb;
    mp_int p, q, r, s;
    mp_word ps, qr;

    if (b - a == 1) {
        mpq_set_int(&h_temp, (long)1, (long)a);
        return &h_temp;
    }
    m = (a + b) >> 1;
    mpq_init_multi(&ta, &tb, NULL);
    /* This is not necessarily necessary */
    mp_init_multi(&p, &q, &r, &s, NULL);
    mpq_exch(_harmonics(a, m), &ta);
    mpq_exch(_harmonics(m, b), &tb);
    mp_exch(&ta.numerator, &p);
    mp_exch(&ta.denominator, &q);
    mp_exch(&tb.numerator, &r);
    mp_exch(&tb.denominator, &s);
    if (p.used == 1 && s.used == 1) {
        ps = p.dp[0] * (mp_word)s.dp[0];
        mp_set_word(&ta.numerator, ps);
    } else {
        mp_mul(&p, &s, &ta.numerator);
    }
    if (q.used == 1 && r.used == 1) {
        qr = q.dp[0] * (mp_word)r.dp[0];
        mp_set_word(&tb.numerator, qr);
    } else {
        mp_mul(&q, &r, &tb.numerator);
    }
    mp_add(&ta.numerator, &tb.numerator, &h_temp.numerator);
    mp_mul(&q, &s, &h_temp.denominator);
    mp_clear_multi(&p, &q, &r, &s, NULL);
    mpq_clear_multi(&ta, &tb, NULL);
    return &h_temp;
}

int harmonics(unsigned long n, mp_rat *c)
{
    mpq_init(&h_temp);
    mpq_exch(_harmonics(1, n + 1), c);
    mpq_reduce(c);
    mpq_clear(&h_temp);
    return 1;
}
```

The library `libtommath` is not very friendly if used in such a way, but I cannot find a better way to implement the binary-splitting algorithm. This implementation of the partial harmonic series, for example, is much slower than Fredrik's implementation with `gmpy` (but faster than the native `Python` one, at least 😉). It takes about 0.67 seconds for the smaller of my two test sizes but already 193.61 seconds (yes, over 3 minutes!) for the larger one. That is definitely too much.

Funny thing: the normal algorithm is much faster for the larger size, just 40 seconds, but slower for smaller ones, at about 1.09 seconds for the smaller size, with the cut-off point somewhere in between on my machine. And it is *asymptotically* slower; I measured some points in between to find out. It is really time to implement fast multiplication in full in `libtommath`.

Some of the problems with the first algorithm might have their reason in the nonexistent balancing of the multiplicands. There is a balancing function in the pull-request queue, but it seems to be a port from Ruby, which makes it impossible to accept because of Ruby's license (given that the submitter is not the original programmer of the Ruby code, which I haven't checked).

```c
static mp_rat h_temp;

mp_rat *_harmonics(unsigned long a, unsigned long b)
{
    unsigned long m;
    mp_rat ta, tb;

    if (b - a == 1) {
        mpq_set_int(&h_temp, (long)1, (long)a);
        return &h_temp;
    }
    m = (a + b) >> 1;
    mpq_init_multi(&ta, &tb, NULL);
    mpq_exch(_harmonics(a, m), &ta);
    mpq_exch(_harmonics(m, b), &tb);
    mpq_add(&ta, &tb, &h_temp);
    mpq_clear_multi(&ta, &tb, NULL);
    return &h_temp;
}
/* use the same caller as above */
```

However, it was just meant to be used as an example for `mp_set_word`, which I'll present here:

```c
#if (MP_PREC > 1)
int mp_set_word(mp_int *c, mp_word w)
{
    mp_zero(c);
    if (w == 0) {
        return MP_OKAY;
    }
    do {
        c->dp[c->used++] = (mp_digit)(w & MP_MASK);
    } while ((w >>= DIGIT_BIT) > 0 && c->used < MP_PREC);
    if (w != 0) {
        return MP_VAL;
    }
    return MP_OKAY;
}
#else
#warning variable "MP_PREC" should be at least 2
#endif
```

The preprocessor mess is necessary even though the constant `MP_PREC` should be at least 8 (eight) because, as I can tell you from some very bad experience, one never knows.

[1] The reason for the stylistically awkward subjunctive is: I mean it. I really could have overlooked an already implemented function doing exactly what I wanted in the first place, and hence it is not a case of NIH. This time 😉

# Adventures of a Programmer: Parser Writing Peril XII

Last time we talked about and implemented a fast algorithm for Fibonacci numbers, based on an optimized version of the matrix-exponentiation algorithm; optimized only in the sense that it stripped the redundant parts of the matrix algorithm, taking advantage of the fact that one entry of the matrix power is of no use at all and the two off-diagonal entries are equal, so half of the computation can be skipped. That means that the original algorithm should need about twice the running time, all other things being equal. Let's see.

The family of functions with such a recursive definition as the Fibonacci and the Lucas numbers is small but not that small. It includes the Pell numbers P(n) = 2P(n-1) + P(n-2), too, with their near relatives the Pell-Lucas numbers Q(n) and the Modified Pell numbers q(n) (PDF), the former also known as the companion Pell numbers, with the latter two related by Q(n) = 2q(n).

The algorithm itself is very simple: exponentiate by repeated squaring, just like the ordinary binary exponentiation algorithm for integer exponents.

We should put the starting points in an extra matrix and use a temporary matrix to keep things simple and legible, too. The last thing we need to know is how to multiply two square 2×2 matrices, and before you try to remember in which bonfire you put your high-school math book, or whether it makes any sense at all to start searching Wikipedia, I'll show you:

    | a  b |   | e  f |   | ae+bg  af+bh |
    | c  d | * | g  h | = | ce+dg  cf+dh |

Squaring is the same method and results in two elements getting squared, a place for later optimization.

We don’t do any optimization, we just loop over the elements of the matrices in the following way:

```c
int mp_pell(unsigned long n, mp_int *r)
{
    int i, j, k;
    mp_int a, b, c, d, e, f, g, h, t1, t2, t3, t4, t5;

    mp_init_multi(&a, &b, &c, &d, &e, &f, &g, &h,
                  &t1, &t2, &t3, &t4, &t5, NULL);

    mp_set(&a, 2); /* set to {1,1,1,0} to get Fibonacci/Lucas numbers */
    mp_set(&b, 1);
    mp_set(&c, 1);
    mp_set(&d, 0);

    mp_set(&e, 1); /* set to {1,2,2,1} to get Lucas numbers         */
    mp_set(&f, 0); /* set to {2,2,2,2} to get Pell-Lucas numbers    */
    mp_set(&g, 0); /* set to {1,1,1,1} to get Modified Pell numbers */
    mp_set(&h, 1);

    mp_int pell[2][2] = { { a,  b  }, { c,  d  } };
    mp_int ret[2][2]  = { { e,  f  }, { g,  h  } };
    mp_int tmp[2][2]  = { { t1, t2 }, { t3, t4 } };

    while (n) {
        if (n & 1) {
            /* ret = ret * pell */
            for (i = 0; i < 2; i++)
                for (j = 0; j < 2; j++) {
                    mp_set(&tmp[i][j], 0);
                }
            for (i = 0; i < 2; i++)
                for (j = 0; j < 2; j++)
                    for (k = 0; k < 2; k++) {
                        mp_mul(&ret[i][k], &pell[k][j], &t5);
                        mp_add(&t5, &tmp[i][j], &tmp[i][j]);
                    }
            for (i = 0; i < 2; i++)
                for (j = 0; j < 2; j++) {
                    mp_copy(&tmp[i][j], &ret[i][j]);
                }
        }
        /* pell = pell * pell */
        for (i = 0; i < 2; i++)
            for (j = 0; j < 2; j++) {
                mp_set(&tmp[i][j], 0);
            }
        for (i = 0; i < 2; i++)
            for (j = 0; j < 2; j++)
                for (k = 0; k < 2; k++) {
                    mp_mul(&pell[i][k], &pell[k][j], &t5);
                    mp_add(&t5, &tmp[i][j], &tmp[i][j]);
                }
        for (i = 0; i < 2; i++)
            for (j = 0; j < 2; j++) {
                mp_copy(&tmp[i][j], &pell[i][j]);
            }
        n /= 2;
    }
    mp_copy(&ret[0][1], r);
    /* The matrix handling seems to mess up some pointers. If I try it with
       mp_clear_multi() I get a SIGABRT with a
       "munmap_chunk(): invalid pointer". */
    for (i = 0; i < 2; i++)
        for (j = 0; j < 2; j++) {
            mp_clear(&tmp[i][j]);
            mp_clear(&ret[i][j]);
            mp_clear(&pell[i][j]);
        }
    return MP_OKAY;
}
```

I tried hard (no, that's a lie 😉) to make it simpler, but to no avail.

The necessary formulas for the optimization come from the square of a 2×2 matrix, where only two of the entries get fully squared:

    | a  b |^2   | a^2 + bc   b(a+d)   |
    | c  d |   = | c(a+d)     bc + d^2 |

I put some notes in the code on how to make it compute the Fibonacci and Lucas numbers respectively. I tried it with the Fibonacci numbers and found that this code has a runtime a bit more than twice as long as that of the optimized version.

The above code has been generalized and merged as `mp_fpl_matrix` into my fork of libtommath. The functions `mp_pell`, `mp_pell_lucas`, and `mp_pell_modified` call `mp_fpl_matrix` with the respective start values.

The pseudocode has been typeset with the following Latex code (you have to strip the newlines and be gracious with whitespace to make it work in WordPress). Don't forget to put it between the WordPress parentheses for Latex: a dollar sign followed immediately by the word `latex`, whitespace, the actual Latex code, whitespace, and another dollar sign to end it.

```latex
\begin{aligned}
& \text{\textbf{function}}\;\text{\textit{Binary Exponentiation}}\;x^n \\
& \qquad \text{\textbf{Ensure: }}\;\neg\left(x = 0 \wedge n = 0 \right) \\
& \qquad \text{\textbf{Ensure: }}\;n\in\mathbb{N} \\
& \qquad r \gets 1 \\
& \qquad \text{\textbf{while}}\; n \not= 0 \\
& \qquad \qquad \text{\textbf{if}}\; \mathop{\mathrm{odd}} n \\
& \qquad \qquad \qquad r \gets r\cdot x \\
& \qquad \qquad \qquad n \gets n - 1 \\
& \qquad \qquad \text{\textbf{endif}} \\
& \qquad \qquad x \gets x^2 \\
& \qquad \qquad n \gets \biggl\lfloor\frac{n}{2}\biggr\rfloor \\
& \qquad \text{\textbf{end while}} \\
& \qquad \text{\textbf{return}}\; r \\
& \text{\textbf{end function}}
\end{aligned}
```

# Adventures of a Programmer: Parser Writing Peril XI

Now that all is done (yes, with the exception of the grammar, I know, and a lot of other things, too) it struck me that we never did Fibonacci numbers. I'm not fully sure about the pronunciation of his name (although two Italian men with different dialects gave me an example) but quite sure about the fast computation of the individual values.

The normal implementation is by following the definition:

```c
static unsigned long fibonacci(unsigned long n)
{
    unsigned long i = 1, j = 0, k, l;

    for (k = 1; k <= n; k++) {
        l = i + j;
        i = j;
        j = l;
    }
    return j;
}
```

This is good up to F(47) = 2,971,215,073 for a 32-bit `unsigned long`.

We want more, of course. Much more.

One fast algorithm is by matrix exponentiation, which is fast but has a lot of redundancies.

These can be stripped easily, which leaves us with the doubling formulas

    F(2k)   = F(k) * (2*F(k-1) + F(k))
    F(2k-1) = F(k-1)^2 + F(k)^2

which assume that F(k-1) and F(k) are known, which is not the biggest problem, I think. I had to massage it a bit for easier implementation with libtommath, but I hope it is still readable.

```c
int mp_fibonacci(unsigned long n, mp_int *r)
{
    unsigned long i = n - 1;
    mp_int a, b, c, d, t, t1, t2, t3;
    int e;

    if ((e = mp_init_multi(&a, &b, &c, &d, &t, &t1, &t2, &t3, NULL)) != MP_OKAY) {
        return e;
    }
    mp_set(&a, (mp_digit)1);
    mp_set(&b, (mp_digit)0);
    mp_set(&c, (mp_digit)0);
    mp_set(&d, (mp_digit)1);
    while (i > 0) {
        if (i & 0x1) {
            /* t = d*(a + b) + c*b; */
            if ((e = mp_mul(&c, &b, &t1)) != MP_OKAY)  { return e; }
            if ((e = mp_add(&a, &b, &t2)) != MP_OKAY)  { return e; }
            if ((e = mp_mul(&d, &t2, &t3)) != MP_OKAY) { return e; }
            if ((e = mp_add(&t3, &t1, &t)) != MP_OKAY) { return e; }
            /* a = d*b + c*a; */
            if ((e = mp_mul(&d, &b, &t1)) != MP_OKAY)  { return e; }
            if ((e = mp_mul(&c, &a, &t2)) != MP_OKAY)  { return e; }
            if ((e = mp_add(&t1, &t2, &a)) != MP_OKAY) { return e; }
            /* b = t; */
            if ((e = mp_copy(&t, &b)) != MP_OKAY)      { return e; }
        }
        /* t = d*(2*c + d); */
        if ((e = mp_mul_2d(&c, 1, &t1)) != MP_OKAY)    { return e; }
        if ((e = mp_add(&t1, &d, &t2)) != MP_OKAY)     { return e; }
        if ((e = mp_mul(&d, &t2, &t)) != MP_OKAY)      { return e; }
        /* c = c*c + d*d; */
        if ((e = mp_sqr(&c, &t1)) != MP_OKAY)          { return e; }
        if ((e = mp_sqr(&d, &t2)) != MP_OKAY)          { return e; }
        if ((e = mp_add(&t1, &t2, &c)) != MP_OKAY)     { return e; }
        /* d = t; */
        if ((e = mp_copy(&t, &d)) != MP_OKAY)          { return e; }
        i /= 2;
    }
    if ((e = mp_add(&a, &b, r)) != MP_OKAY) {
        return e;
    }
    mp_clear_multi(&a, &b, &c, &d, &t, &t1, &t2, &t3, NULL);
    return MP_OKAY;
}
```

This function is implemented as `bn_mp_fibonacci.c` in my fork of libtommath.

Once we have the Fibonacci numbers, the Lucas numbers fall out quasi-automatically.

We have two ways to solve the problem. One is the obvious and most probably also the faster one, based on the fact that Lucas numbers lie between two consecutive Fibonacci numbers and can be calculated in this way:

    L(n) = F(n-1) + F(n+1)

Another way is by way of the closed form for the Fibonacci numbers, known as Binet's formula

    F(n) = (φ^n - ψ^n) / √5,  with φ = (1 + √5)/2 and ψ = (1 - √5)/2

From this follows for the Lucas numbers

    L(n) = φ^n + ψ^n

Which makes it obvious now[1] that we can have

    L(n)^2 = 5*F(n)^2 + 4*(-1)^n

take the square root of both sides, keep the principal value, drop the rest, and get

    L(n) = sqrt(5*F(n)^2 + 4*(-1)^n)

It is slower, but by less than expected.

The implementation for the Lucas number evaluation with the very first algorithm is

```c
int mp_lucas(unsigned long n, mp_int *c)
{
    mp_int fnm1, fnp1;
    int e;

    if ((e = mp_init_multi(&fnm1, &fnp1, NULL)) != MP_OKAY) {
        return e;
    }
    if ((e = mp_fibonacci(n - 1, &fnm1)) != MP_OKAY) {
        return e;
    }
    if ((e = mp_fibonacci(n + 1, &fnp1)) != MP_OKAY) {
        return e;
    }
    if ((e = mp_add(&fnm1, &fnp1, c)) != MP_OKAY) {
        return e;
    }
    mp_clear_multi(&fnm1, &fnp1, NULL);
    return MP_OKAY;
}
```

The above code has been merged as `bn_mp_lucas.c` into my fork of libtommath.

[1] It took me a couple of sheets of paper and two pencils, but once you get it, it is obviously—obvious.