## The Burnikel-Ziegler Algorithm

A method to divide two numbers in an asymptotically faster way then simple schoolbook-division but also usable below the large limits of Barret reduction and multiplication with the reciprocal generated by some rounds of Newton-Raphson root-finding have been found by Christoph Burnikel and Joachim Ziegler and described in their paper “Recursive Division”[1].

The cutoff-point in my JavaScript Bigint library for Little is about 500 limbs for the numerator and 300 limbs for the denominator, about 4,000 and 2350 decimal digits respectively.

The algorithm itself is a more or less simple divide&conquer algorithm, taken quite literally here. It is described well and clearly in the paper linked to above. The implementation on the other side is not so simple, it took me some time.

The most complicated part was the main function that slices the numerator in pieces of the length of the denominator. The paper has a complicated method to squeeze the very last out of the algorithm which I was not able to implement successfully in JavaScript, something was always wrong; correct at the end but still very slow. So I had the weird idea to take the first sentence in this paragraph literally: cut the numerator in slices of the (bit)length of the denominator. This method leaves some rest most of the times which has to be handled specially. The rest worked—astonishingly!—quite smoothly and was fast at the end. Not as fast as the paper promised (mostly due to JavaScript), but still quite fast.

Without further ado, here is the first function, slicing the input which consists of two positive integers `a`

and `b`

with `a>=b`

:

function divrem(a, b) {
// (max) size of one block
var n = b.highBit() + 1;
var tlen = a.highBit() + 1;
// # of blocks
var nblocks = Math.ceil((a.highBit() + 1) / n);
// # of n-sized blocks
var mblocks = Math.floor((a.highBit() + 1) / n);
var firstblock;
var r, q, qr, t;
var count;
var mask = new Bigint(0);
mask.mask(n);
count = 0;
firstblock = a.rShift(mblocks * n).and(mask);
if (firstblock.cmp(b) != MP_LT) {
r = new Bigint(0);
} else {
r = firstblock;
mblocks--;
nblocks--;
}
q = new Bigint(0);
while (nblocks--) {
t = a.rShift(mblocks * n).and(mask);
qr = div2n1n(r.lShift(n).add(t), b, n);
t = qr[0];
r = qr[1];
q = q.lShift(n).add(t);
mblocks--;
}
return [q, r];
}

The numerator gets sliced MSB first, so the very first block is the one which needs special treatment: if it is equal of greater the the denominator we preset the remainder to zero or to the first block otherwise. This gets done such that the input to `div2n1n`

is always `2n`

large. It is not necessary to use two indices, I have done it for legibility. You might also use a logical `or`

(addition base 2) instead of normal addition but I couldn’t find it faster (YMMV etc. p.p.).

The simplicity of the starting function does come with a small cost: the other two functions are slightly more complicated.

The first function that gets called is `div2n1n`

:

function div2n1n(a, b, n) {
var mask, q1, q2, r, qr, a3, b1, b2;
var half;
if (n <= BURN_ZIEG_CUTOFF) {
return a.divrem(b);
}
var isodd = n & 1;
if (isodd) {
a.lShiftInplace(1);
b = b.lShift(1);
//b.lShiftInplace(1);
n++;
}
half = n >>> 1;
mask = new Bigint(0);
mask.mask(half);
b1 = b.rShift(half);
b2 = b.and(mask);
a3 = a.rShift(half).and(mask);
r = a.rShift(n);
qr = div3n2n(r, a3, b1, b2, half);
q1 = qr[0];
r = qr[1];
a3 = a.and(mask);
qr = div3n2n(r, a3, b1, b2, half);
q2 = qr[0];
r = qr[1];
if (isodd) {
r.rShiftInplace(1);
}
return [q1.lShift(half).or(q2), r];
}

The variable `BURN_ZIEG_CUTOFF`

is set to 3,000 bits in my case. It depends a bit on the size of the denominator but not very much in contrast to the original code where it was taken as the variable `m`

which had to fit an integral times into the denominator which means that the denominator had to be expanded if that was not the case together with the numerator causing a lot of error-prone code writing.

One optimization of the algorithm has been implemented here: the original method does not handle the case of odd `n`

and throws it to the normal division function, too. That is slower than making it even (proof omitted for brevity) for the recursion to consume. The original has the computation of the variables `a12`

, `a3`

, `b1`

, and `b2`

put into the `div3n2n`

but `b1`

and `b2`

is the same in both instances for example.

That makes the function `div3n2n`

quite short:

function div3n2n(a12, a3, b1, b2, n) {
var t = a12.rShift(n), qr, q, r;
if (t.cmp(b1) == MP_EQ) {
q = new Bigint(1);
q = q.lShift(n);
q.decr();
r = a12.sub(b1.lShift(n)).add(b1);
} else {
qr = div2n1n(a12, b1, n);
q = qr[0];
r = qr[1];
}
r = r.lShift(n).or(a3).sub(q.mul(b2));
while (r.sign == MP_NEG) {
q.decr();
r = r.add(b1.lShift(n).or(b2));
}
return [q, r];
}

The differences are not very large (and some were made larger for legibility without harm to the speed)

Compare to the according functions that follow the paper nearly to the letter:

function divide2n1n(a,b){
var n = b.used;
var odd = n&1;
if(n < BURNIKEL_ZIEGLER_THRESHOLD){
return a.divrem(b);
}
if(odd){
a.lShiftInplace(1);
b = b.lShift(1);//b.lShiftInplace(1);
n++;
}
n = n/2;
var c1 = divide3n2n(a.slice(n,a.used), b);
var a4 = a.slice(0,n);
var c2 = divide3n2n(c1[1].dlShift(n).add(a4), b);
if(odd){
c2[1].rShiftInplace(1);
}
return [ c1[0].dlShift(n).add(c2[0]), c2[1] ];
}
function fullmask(n) {
var ret = [];
while (n--) {
ret[n] = MP_MASK;
}
return ret;
}
function divide3n2n( a, b){
var n = b.used >>> 1;
var a1 = a.slice(2 * n, a.used);
//var a2 = a.slice(n,2 * n);
var a3 = a.slice(0, n);
var b1 = b.slice(n, b.used);
var b2 = b.slice(0, n);
var q, r1,d,r,c;
var a12 = a.slice(n, a.used);
if(a1.cmp(b1) == MP_LT){
c = divide2n1n(a12, b1);
q = c[0];
r = c[1];
d = q.mul(b2);
} else {
q = new Bigint(0);
q.dp = fullmask(n);
q.used = q.dp.length;
r = a12.add(b1).sub(b1.dlShift(n));
d = b2.dlShift(n);
d = d.sub(b2);
}
r = r.dlShift(n).add(a3);
while (r.cmp(d) == MP_LT) {
r = r.add(b);
q.decr();
}
r = r.sub(d);
q.clamp();
return [q, r];
}

(I spare you the mess that I made of the slicing function ðŸ˜‰ )

As a side-note: it does not matter which algorithm you use for the final division; it can be schoolbook division or Barret division or even Newton division as long as it is not burnikel-ziegler division itself for obvious reasons. So it might be a good idea to do some branching at that point to chose the right algorithm depending on the size of the input. Or not, to be tested.

In the next post I will talk about Barret reduction and it’s use for an even faster algorithm. The post after that, if nothing comes in the way, is about multiplication with the reciprocal (with the Newton-Raphson root-finding algorithm). All of the algorithms are already implemented in the file `bignum.js`

in the `lib`

subdirectory in Little.

[1] I have a lot of left-over punctuation marks. Anyone interested?