# On the Value of the Initial Value in Root Finding

While playing around with Libtommath, a nearly public domain(1) library for big integers, to see if I can change (parts of) the backend of the planned calculator to use this library instead of a different one with a different license, I did not only find out that I really love to build long-winded sentences but that the implementation of the Newton-Raphson nth-root algorithm is very slow.
One of first programs to port is a nearly straight forward implementation of the quadratic sieve with only some small but necessary optimizations to make ist useful for up to about 40 decimal digit long semiprimes build from two 20 decimal digits long primes. One of the necessary helper function sis a wy to determine if the factor the sieve found is indeed prime with a high probability or composite. State of the art to do it seems to be a combination of several primality testing algorithms called Baillie-Pomerance-Selfridge-Wagstaff primality test or BPSW for short. There are no BPSW-pseudoprimes known below 10^15 so any prime fond inside that limit can be called a quasi certified prime. That means it is exact for factoring composite numbers up to 30 decimal digits. Probable primes with more than 15 decimal digits would need a second run through the sieve or an exact test.

I digress? Yepp, that’s what I do for a living.

One of the disadvantages of all of the primality tests above is the inability to stand perfection—perfect powers get them in trouble. Testing for perfect powers can be done in essentially linear time if you do not need the radicants and exponents if any. They are needed for factorizing, though, but no large amount of root-finding-runs is necessary.

• only roots with a prime exponent need to be checked because:
$4^x = 2^{2x}$ and
$x^4 = x^{2+2} = x^2 \cdot x^2$
• only exponents up to $\lceil\log_2 n\rceil$ need to be checked:
$x^\frac{1}{n} = exp\left(\frac{\log_e x}{n} \right)$
Together with the the fact that the smallest possible integer radicant larger than 1 (one) is, well, 2 (two) the largest integer exponent $x$ such that $2^x \le n$ is $\lceil\log_2 n\rceil$ which is just the index of the highest bit starting from 1 (one). Simple to implement. Also simple to test, no need for a fancy root finding function here.

I already wrote it with the help of another big integer library (not GMP), so porting was not a large problem.

• finding the highest bit set is implemented as mp_count_bits()
• finding the lowest bit set is not implemented, had to be done
• finding the n-th root is implemented as mp_n_root() which returns the root but no remainder.
• a small sieve delivers the primes necessary. Not shown here, it is a very simple Eratosthenes sieve.

The listing without all of the later-to-be-added code-cluttering error checks. Some of the tricks used here were needed for the other big integer library, it is in dire need of a bit of clean-up, I’m afraid.
The low/high-bit functions:

static const int nibbles[16] =
{ -1, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0 };
/* One can do it with some simple bitjuggling w/o a loop but the author
PS: the ffs() function in string.h is not in C-99, so cannot be used here.
*/
int lowbit(mp_digit d)
{
int ret = 0;
int j = 0;

while ((d != 0) && (ret == 0)) {
ret = nibbles[d & 0x0f];
if (ret > 0) {
break;
}
d >>= (mp_digit) (4);
j += 4;
d &= DIGIT_BIT;
}

if (ret != 0) {
ret += j;
}

return ret;
}

int mp_lowbit(mp_int * a)
{
int r, i;
mp_digit q;

if (a->used == 0) {
return 0;
}

r = 0;

for (i = 0; i < a->used; i++) {
q = a->dp[i];
if (q == (mp_digit) (0)) {
r += DIGIT_BIT;
continue;
} else {
r += lowbit(q);
break;
}
}
return r;
}

int mp_highbit(mp_int * a)
{
return mp_count_bits(a);
}


And the isperfpower() code (mp_issquare has been changed to return the actual root if any. Code not given here, see comment in code for the reason, but if you insist: just ask for it):

int mp_isperfpower(mp_int * z, mp_int * rootout, mp_int * exponent)
{
mp_int root, power,tempr;
bitset_t *bst;
unsigned long p, max, highbit, lowbit;
int ret, r;
p = 2;
/* Place some sanity checks here */
/* No, really! Do it! */

/* ceil(log_2(n)) */
max = (unsigned long) mp_highbit(z);
highbit = max;
lowbit = (unsigned long) mp_lowbit(z);
/* Is it N=2^p ?*/
if (highbit == lowbit) {
mp_set_int(exponent, (mp_digit) highbit);
mp_set_int(rootout, (mp_digit) (2));
return true;
}
/* mp_issquare(n) is not much faster*/
r = mp_issquare(z, &ret, rootout);
if (ret != 0) {
mp_set_int(exponent, (mp_digit) (2));
return true;
}

mp_init_multi(&root, &power, &tempr, NULL);

/* find perfection in higher powers */
bst = malloc(sizeof(bitset_t));
bitset_alloc(bst, max);
eratosthenes(bst);

while ((p = bitset_nextset(bst, p + 1)) < max) {
mp_n_root(z, p, &root);
mp_copy(&root, &tempr);
mp_expt_d(&root, p, &power);
if (mp_cmp(z, &power) == MP_EQ) {
mp_copy(&tempr, rootout);
mp_set_int(exponent, (mp_digit) (p));
mp_clear_multi(&tempr, &root, &power);
bitset_free(bst);
return true;
}
}
mp_set_int(rootout, (mp_digit) (0));
mp_set_int(exponent, (mp_digit) (0));
mp_clear_multi(&root, &power);
bitset_free(bst);

return false;
}


Testing, aaaaaand: slower than a snail heading to its tax office.
As the title of this post suggests: the culprit was the initial value of the Newton-Raphson algorithm in the root finding function. It was set to 2 (two), fix. The N-R algorithm is forgiving but not that forgiving, so somebody added an error correction method. The problem is that the error is most of the time so much off the error correction is doing more or less the whole work to calculate the root in a way that is probably the slowest possible: exponentiate the found root r with the given exponent e, compare the result with the radicant and subtract one from the root if the result is too big; rinse and repeat.
The new algorithm is based on the rule ( with $\{n,x\}\in\mathbb{N^+}$)
$\displaystyle{ n^\frac{1}{x} = \exp\left( \frac{\log_e n}{x} \right) }$
Changing the base to 2 (two) gives the same equation
${\displaystyle n^\frac{1}{x} = 2^{\left( \frac{\log_2 n}{x} \right)} }$
which is simpler to compute, especially with integer logarithms. Doing it with integer logarithms gives the inequation
$\displaystyle{ n^\frac{1}{x} \ge 2^{\left( \frac{\log_2 n}{x} \right)} }$
which is close enough for the Newton-Raphson trick to work fast and correct.

int mp_n_root(mp_int * a, mp_digit b, mp_int * c)
{
mp_int  t1, t2, t3;
int     res, neg;
int ilog2;

/* input must be positive if b is even */
if ((b & 1) == 0 && a->sign == MP_NEG) {
return MP_VAL;
}

if ((res = mp_init (&t1)) != MP_OKAY) {
return res;
}

if ((res = mp_init (&t2)) != MP_OKAY) {
goto LBL_T1;
}

if ((res = mp_init (&t3)) != MP_OKAY) {
goto LBL_T2;
}

/* if a is negative fudge the sign but keep track */
neg     = a->sign;
a->sign = MP_ZPOS;

ilog2 = mp_highbit(a);
ilog2 = ilog2/ b;

/* CZ: TODO: if(ilog2 == 0) return 1; ???*/
if ((  res = mp_2expt(&t2,ilog2+1)) != MP_OKAY) {
goto LBL_T2;
}
/* t2 = 2 */
/*mp_set (&t2, 2);*/

do {
/* t1 = t2 */
if ((res = mp_copy (&t2, &t1)) != MP_OKAY) {
goto LBL_T3;
}

/* t2 = t1 - ((t1**b - a) / (b * t1**(b-1))) */

/* t3 = t1**(b-1) */
if ((res = mp_expt_d (&t1, b - 1, &t3)) != MP_OKAY) {
goto LBL_T3;
}

/* numerator */
/* t2 = t1**b */
if ((res = mp_mul (&t3, &t1, &t2)) != MP_OKAY) {
goto LBL_T3;
}

/* t2 = t1**b - a */
if ((res = mp_sub (&t2, a, &t2)) != MP_OKAY) {
goto LBL_T3;
}

/* denominator */
/* t3 = t1**(b-1) * b  */
if ((res = mp_mul_d (&t3, b, &t3)) != MP_OKAY) {
goto LBL_T3;
}

/* t3 = (t1**b - a)/(b * t1**(b-1)) */
if ((res = mp_div (&t2, &t3, &t3, NULL)) != MP_OKAY) {
goto LBL_T3;
}

if ((res = mp_sub (&t1, &t3, &t2)) != MP_OKAY) {
goto LBL_T3;
}
}  while (mp_cmp (&t1, &t2) != MP_EQ);
/*
printf("Value before:");
mp_print(&t1);
puts("");
*/
/* result can be off by a few so check */
/*
CZ:
The value is most probably 1 (one) digit too large, so to save one
exponentiation we subtract 1 (one) in advance.
TODO: check if that is always the case, as the author conjectures
*/
if ((res = mp_sub_d (&t1, 1, &t1)) != MP_OKAY) {
goto LBL_T3;
}
for (;;) {
if ((res = mp_expt_d (&t1, b, &t2)) != MP_OKAY) {
goto LBL_T3;
}

if (mp_cmp (&t2, a) == MP_GT) {
if ((res = mp_sub_d (&t1, 1, &t1)) != MP_OKAY) {
goto LBL_T3;
}
} else {
break;
}
}

/* reset the sign of a first */
a->sign = neg;

/* set the result */
mp_exch (&t1, c);

/* set the sign of the result */
c->sign = neg;

res = MP_OKAY;

LBL_T3:mp_clear (&t3);
LBL_T2:mp_clear (&t2);
LBL_T1:mp_clear (&t1);
return res;
}


Testing (on a slow 1Ghz AMD w. 512 MB RAM)

./nthroot 12345678910111213141516171819202212223242526272829303112345678910\
111213141516171819202212223242526272829303112345678910111213141516171819202\
212223242526272829303112345678910112345678910111213141516171819202212223242\
5262728293031123456789101 200
15
New version: 0.020000 seconds
15
Builtin    : 492.010000 seconds


Ah, yes, that seems to work better now.
It is even more impressing if I tell you that the 0.02 seconds is for 1,000 (one thousand) rounds and the 492.01 seconds is for just one.
But that is not always the case, the original version is sometimes faster. Lets take a very large number, 1015 digits if I counted correctly.

./nthroot 123456789101112131415161718192022122232425262728293031123456789101112\
1314151617181920221222324252627282930311234567891011121314151617181920221222324\
2526272829303112345678910111213141516171819202212223242526272829303112345678910\
1112131415161718192022122232425262728293031123456789101112131415161718192022122\
2324252627282930311234567891011121314151617181920221222324252627282930311234567\
8910111213141516171819202212223242526272829303123456789101112131415161718192022\
1222324252627282930311234567891011121314151617181920221222324252627282930311234\
5678910111213141516171819202212223242526272829303112345678910111213141516171819\
2022122232425262728293031123456789101112131415161718192022122232425262728293031\
1234567891011121314151617181920221222324252627282930311234567891011121314151617\
1819202212223242526272829303112345678910111213141516171819202212223242526272829\
3031123456789101112131415161718192022122232425262728293031123456789101112131415\
1617181920221222324252627282930311234567891011121314151617181920221222324252627\
2829303112345678910111213141516171819202212223242526272829303111234567891011121\
3141516171819202212223242526272829303112345678910111213141516171819202212223242\
5262728293031123456789101112131415161718192022122232425262728293031123456789101\
1121314151617181920221222324252627282930311234567891011121314151617181920221222\
3242526272829303112345678910111213141516171819202212223242526272829303112345678\
9101112131415161718192022122232425262728293031123456789101112131415161718192022\
1222324252627282930312345678910111213141516171819202212223242526272829303112345\
6789101112131415161718192022122232425262728293031123456789101112131415161718192\
0221222324252627282930311234567891011121314151617181920221222324252627282930311\
2345678910111213141516171819202212223242526272829303112345678910111213141516171\
8192022122232425262728293031123456789101112131415161718192022122232425262728293\
0311234567891011121314151617181920221222324252627282930311234567891011121314151\
6171819202212223242526272829303112345678910111213141516171819202212223242526272\
8293031123456789101112131415161718192022122232425262728293031123456789101112131\
4151617181920221222324252627282930311 20
9006908459572274742850091692732473786309400354936204993716414314994690397129088\
10814335572796526904903015291
New version: 3.380000 seconds
## didn't wait for the builtin version


Doing square roots of this number 1,000 times and compare the timing with the builtin square root function (also run 1,000 times) gives

New version: 6.590000 seconds
Builtin    : 4.580000 seconds


So no extra branch for square roots needed almost all of the time.
But I digress again.
The 7173rd root of the above number is the first one rounding to 1:

1
New version: 0.870000 seconds
1
Builtin    : 0.220000 seconds


And we have a case when the builtin function is indeed faster.
Doesn’t hold for long, though, with the 7159th root it starts to get slower again

2
New version: 0.860000 seconds
2
Builtin    : 1.000000 seconds


And with the 7158th root:

2
New version: 0.840000 seconds
2
Builtin    : 2.990000 seconds


Root #7157

2
New version: 0.840000 seconds
2
Builtin    : 8.140000 seconds


root #7156

2
New version: 0.820000 seconds
2
Builtin    : 21.740000 seconds


So the builtin function is faster if the actual root is very near 1 and gets
slower very rapidly when the actual root gets >2. The new version is more dependent on the size of the radicant but slows down, too, when the exponent gets smaller which was expected.
Root #71

2568855334376777830014137923182
New version: 9.560000 seconds


Was too impatient to wait for the builtin version and switched off the loop but even one round stressed my patience too much.

2568855334376777830014137923182
New version: 0.010000 seconds


The extra code to save one round of exponentiation in the result check was added later that is at this point but I had only time left for one test so I checked the square root function another time. Again with 1,000 rounds and the large number from above:

New version: 7.000000 seconds
Builtin    : 6.680000 seconds


Looks negligible. Thought the difference would be larger.

(1) only nearly public domain because legislation had changed, so they had to give that beast a license. The license they chose is as near to public domain as possible. Some older versions are still public domain, that didn’t change retroactively but look in the code to be sure.