On the Principal Branch of the Log-Gamma Function [updated]

[UPDATE at Mo 29. Jul 17:09:30 CEST 2013] fixed a small bug.
We did a fast glimps—albeit with working code—at the implementation of the gamma function with Spouge’s formula lastly. This code implements the logarithm of the gamma function $\log (\Gamma z)$ with the principal branches of the complex logarithm. Another function, named $\log\Gamma z$ uses different principal branches that are more useful. You will find that it is this function that is implemented in most programs that offer arbitrary precision.
To go from $\log (\Gamma z)$ to $\log\Gamma z$ is not easy:
It is equivalent in the strip

$\displaystyle{ \log\Gamma z = \log(\Gamma z) \quad \text{with} \quad 0 < \Re z \le 2 \vee \lvert \Im z\rvert \le \frac{7}{2} }$

The imaginary part of negative values are simply multiples of $\pi$

$\displaystyle{ \log\Gamma x = \log (\Gamma z) + 2 i \pi \biggl\lceil\frac{x}{2}-1\biggr\rceil \quad \text{with} \quad x<0 \vee -x\not\in\mathbb{N}}$

The interesting part comes with the rest of the complex plane:

$\displaystyle{\log\Gamma x = \log (\Gamma z) +2 i \pi k(z)}$

The function $k(z)$ returns an integer. To compute these size of the aforementioned integer the following simple integral has to be solved:

$\displaystyle{k(z) = \int_0^z \theta \bigl( -\Re(\Gamma t) \bigr) \bigl\lvert \Im\bigl(\Gamma(t)\psi(t) \bigr) \bigr\rvert \delta\bigl( \Im(\Gamma t) \bigr) \mathrm{d}t}$

with $\theta (z)$ the unit step function

$\displaystyle{\theta (x) = \begin{cases} 0& x< 0\\ 1&<\ge 0 \end{cases} }$

with $\psi (z)$ the first logarithmic derivative of the Gamma function

$\displaystyle{ \psi (z) = \frac{\mathrm{d}}{\mathrm{d}z}\Gamma z = \frac{\Gamma '(z)}{\Gamma(z)} }$

and finally $\delta (x)$ a kind of a derivative of the Heaviside step function and has the following limit

$\displaystyle{\delta(x)\lim_{\epsilon\to 0}\frac{\sin\frac{x}{\epsilon}}{\pi x} }$

All that information should suffice to handle the integral. Not that it wouldn’t be a good training, but others have solved the problem already, namely D. E. G. Hare in “Computing the Principal Branch of log-Gamma” published in the Journal of Algorithms 25(2):221-236 in 1997.

Attached to a simple Gamma function with the Lanczos coefficients

define kroneckerdelta(i,j){
if(isnull(j)){
if(i==0) return 1;
else return 0;
}
if(i!=j) return 0;
else return 1;
}
define __CZ__lngamma_lanczos(z){
local a k g zghalf lanczos;
mat lanczos[15] = {
9.9999999999999709182e-1,
5.7156235665862923516e1,
-5.9597960355475491248e1,
1.4136097974741747173e1,
-4.9191381609762019978e-1,
3.3994649984811888638e-5,
4.6523628927048576010e-5,
-9.8374475304879566105e-5,
1.5808870322491249322e-4,
-2.1026444172410489480e-4,
2.1743961811521265523e-4,
-1.6431810653676390482e-4,
8.4418223983852751308e-5,
-2.6190838401581411237e-5,
3.6899182659531626821e-6
};
g = 607/128;
z = z-1;
a = 0;
for(k = 12; k >= 1; k--){
a += lanczos[k]/(z+k);
}
a += lanczos[0];
zghalf = z + g + .5;
return ( ln(sqrt(2*pi())) + ln(a)  -zghalf  ) + (z+.5)*ln( zghalf );
}
define lngamma(z){
local ret eps flag corr;
flag = 0;
if(isint(z)){
if(z <=0)
return newerror("gamma(z): z is a negative integer");
else{
/* may hold up accuracy a bit longer, but YMMV */
if(z < 20)
return ln((z-1)!);
else
return __CZ__lngamma_lanczos(z);
}
}
else{
if(re(z)<0){
if(im(z) && im(z)<0){
z = conj(z);
flag = 1;
}

ret = ln( pi() / sin(pi() *z ) ) -__CZ__lngamma_lanczos(1-z);

if(!im(z)){
if(abs(z) <= 1/2)
ret = re(ret) - pi()*1i;
else if( isodd(floor(abs(re(z)))) ){
ret = re(ret) + ( ceil(abs(z))   * pi() * 1i);
}
else if( iseven(floor(abs(re(z)))) ){
/* < n+1/2 */
if(iseven(floor(abs(z)))){
ret = re(ret) + ( ceil(abs(z)-1/2 -1 )   * pi() * 1i);
}
else{
ret = re(ret) + ( ceil(abs(z) -1/2 )   * pi() * 1i);
}
}
}
else{
corr = ceil( re(z)/2 -3/4 - kroneckerdelta(im(z))/4);
ret = ret +2*(corr *pi() )*1i;
}
if(flag == 1)
ret = conj(ret);
return ret;
}
return __CZ__lngamma_lanczos(z);
}
}


My problem is, that this trick does not work with the Spouge coefficients. At least not for me. I didn’t find out if it is a bug in my implementation (most probably) or a more technical problem. The paper mentioned above talks about the Stirling series. It uses Bernoulli numbers which are quite expensive to compute and the series is not convergent which makes it quite tricky to find the right number of terms. D.E.G. Hare showed in section 4.1 a trick to evaluate the error in a simple way for $z\in\mathbb{C}\setminus\mathbb{R}$.

/* Compute error for Stirling series for "z" with "k" terms*/
define R(z,k){
local a b ab stirlingterm;
stirlingterm = bernoulli(2*k)/( 2*k*(2*k-1)*z^(2*k));
a = abs( stirlingterm );
b = (cos(.5*arg(z)^(2*k)));
ab = a/b;
/* a/b might round to zero */
if( abs(ab) < epsilon()){
return digits(1/epsilon());
}
return digits(1/(a/b));
}


On the numerical evaluation of factorials VI

Another addendum to the last post.
After playing around with the implementation of the Borwein algorithm I found that the small multiplication eats a lot of time, most of the all-over running time to be a bit more specific. So I added the splitting of the evaluation between the primes with an exponent larger than one and those with one that is one again. Now the upper half of the primes is now treated as a primorial and evaluated with binary splitting to a good effect.

define factorial(n){
local prime result shift prime_list k k1 k2 expo_list pix cut;
local s1 e1 r1 s2 e2 r2 primorial;

result = 1;
prime      = 2;

if(!isint(n))  return "false"; ## or gamma(n)?
if(n < 0)      return "false";
if(n < 9000)  return n!;

shift = __CZ__prime_divisors(n,prime);
prime = 3;
cut = n//2;
pix   = pix(cut);
prime_list = mat[pix];
expo_list  = mat[pix];
k = 0;
do {
prime_list[k]  = prime;
expo_list[k++] = __CZ__prime_divisors(n,prime);
prime          = nextprime(prime);
}while(prime <= cut);

/* size of the largest exponent in bits  */
k1 = highbit(expo_list[0]);
k2 = size(prime_list)-1;

for(;k1>=0;k1--){
/*
the cut-off for T-C-4 is still to low, using T-C-3 here
TODO: check cutoff of T-C-4
*/
result = toomCook3square(result);
for(k=0; k<=k2; k++) {
if((expo_list[k] & (1 << k1)) != 0) {
result *= prime_list[k];
}
}

}
primorial = primorial( cut, n);
##print digits(result,2), digits(primorial,2);
##result = toomCook3(result,primorial);
result *= primorial;
result <<= shift;
return result;
}


As you can see, the point where the builtin factorial is faster got down from 21,000 to 9,000. I also used fixed size arrays instead of the lists.

A little bug was in the construction of the prime list, too and gets repaired with the follwing replacement:

/* A simple list to keep things...uhm...simple?*/
global __CZ__primelist;
/* Helper for primorial: fill list with primes in range a,b */
define __CZ__fill_prime_list(a,b)
{
local k;
k=a;
/* reset list every time or cache it */
__CZ__primelist = list();
if(isprime(k))k--;
while(1){
k = nextprime(k);
if(k > b) break;
append(__CZ__primelist,k );
}


Next: the double factorial.

On the numerical evaluation of factorials V

It’s been a bit since I started the series and, freely admitted, forgot to add the last part, the actual implementation of Peter B. Borwein’s algorithm[1] for computing factorials.
Some changes from the past: the language changed from Javascript to the script language implemented in Calc. It is easier to read and can be used as runnable pseudo-code. And it is a bit faster than my Javascript implementation of a big-integer library.
A short summary of the algorithm goes:

1. Factorize the factorial
2. Exponentiate the primes according to the factorization
3. Compute the product of the above powers

To factorize a factorial no factorization of the fully computed factorial is needed. Obviously. The number theoretical background is given at Wikipedia, at Wolframs and elsewhere, too.
There are several methods to do that but simple trial division is good enough; these are small numbers: the largest possible exponent is the exponent of the smaller prime, that would be 2 and since $2^x$ cannot exceed $n!$ if $2^x$ is a prime factor so $x$ cannot exceed $n$. Proof omitted, ask Adrien-Marie Legendre if you need one.
The time complexity to do the computation is O(log n) or, in this case, exactly $\lfloor log_p n \rfloor$ integer divisions with $p$ prime the base of the logarithm, therefore only half of the primes have exponents $x>1$. We might treat these separately.

/* A simple list to keep things...uhm...simple?*/
global __CZ__primelist = list();
/* Helper for factorial/primorial: fill list with primes in range a,b */
define __CZ__fill_prime_list(a,b)
{
local k;
k=a;
/*
nextprime(k) means next prime even if k is prime. This is
different from the behaviour of other programs e.g.; PARI/GP
*/
if(isprime(k))k--;
while(1){
k = nextprime(k);
if(k > b) break;
append(__CZ__primelist,k );
}
}
/* Helper for factorial: how often prime p divides the factorial of n */
define __CZ__prime_divisors(n,p)
{
local q,m;
q = n;
m = 0;
if (p > n) return 0;
if (p > n/2) return 1;
while (q >= p) {
q  = q//p; /* double forward slash is integer division */
m += q;
}
return m;
}


To make it more complete and easier to follow we get the primes from a generated list, you might use your own implementation of a sieve or similar methods to get the primes.
Now that we have that ready we can go on, power up, and exponentiate the primes according to the results from above.

/* Helper for factorial: exponentiate p^k */
define __CZ__exponentiate(prime,exponent) {
local bit result temp;

result = 1;
bit = digits(exponent,2);
while((exponent & (1 << bit)) == 0) bit--;
for( ; bit>=0 ; bit--){
result = result^2;
/* square if bit is zero, multiply otherwise */
if((exponent & (1 << bit)) != 0){
result *=  prime;
}
}
return result;
}


The method here is the so called binary exponentiation and has a bit (ha!) of optimization. A simpler and more readable piece of code is probably

define exponentiate(base,exponent)
{
local result e b;

result = 1;
b      = base;
e      = exponent;

while(e != 0){
if(isodd(e)){
result = result * b;
}
e = e>>1;
if(e != 0){
b = b^2;
}
}

return result;
}


This method allows for less multiplications and even better: more squaring which can be done faster than multiplication.

Apropos faster: we already have the number of times 2 divides $n!$ and because shifting is cheap in modern computers it is advisable to use that fact and drop $2^x$ from the product and just shift by $x$ at the end. The second half of the primes have exponent 1 (one) and it is a better idea to treat the separately as a fraction of two primorials $\frac{P_{(n/2)\to n}}{P_{1\to (n/2)}}$ and compute the product with the help of binary splitting.
Now, lets get it all together.

  local prime result prime_raised shift cut;

result = 1;
prime      = 2;
/* checks and balances omitted */
shift = __CZ__prime_divisors(n,prime);

prime = 3;
cut   = n//2;
do {
prime_raised = __CZ__exponentiate(prime,__CZ__prime_divisors(n,prime));
result *= prime_raised;
prime = nextprime(prime);
} while(prime <= cut);
result *= primorial( cut, n);
result <<= shift;
return result


What? Oh, sorry, the primorial, yes.

define primorial( a, b)
{
local C1;
/*
Checks and balances omitted bt keep in mind that the number of primes you
get from functions like nextprime() might be limited. Here the limit is
2^32 so the last prime you can get with nextprime() is 4294967291.
*/
__CZ__fill_prime_list(a,b);
C1 = size(__CZ__primelist)-1;
return __CZ__primorial__lowlevel( 0, C1,1)
}


The actual product gets, as promised, calculated with a simple binary splitting algorithm:

/*
Helper for primorial: do the product with binary splitting
TODO: do it without the intermediate list
*/
define __CZ__primorial__lowlevel( a, b ,p)
{
local  c;
if( b == a) return p ;
if( b-a > 1){
c= (b + a) >> 1;
return  __CZ__primorial__lowlevel( a   , c , __CZ__primelist[a] )
*  __CZ__primorial__lowlevel( c+1 , b , __CZ__primelist[b] ) ;
}
return  __CZ__primelist[a] * __CZ__primelist[b];
}


As we can see it is the very same algorithm as the binary splitting algorithm for the factorial itself with the single difference that a,b are not to be multiplied themselves but act as indices to the ordered(!) list of primes.
This method would lower the time complexity of the complete algorithm if we exponentiate the primes in-place and run the above algorithm over the whole list instead of the upper half only.
We should grab the chance and make some things more general, for example the product of the list of primes can easily be the product of any ordered list. Please be carefull if is possible to persuade your language to pass arguments by pointer and not by copying the whole thingamabob every f’ing time.

define product_list( a, b ,p,list)
{
local  c;
if( b == a) return p ;
if( b-a > 1){
c= (b + a) >> 1;
return  product_list( a   , c , list[a],list )
*  product_list( c+1 , b , list[b],list ) ;
}
return  list[a] * list[b];
}


It is also possible now to skip the global variable which is in most cases a bad idea to have; only in rare cases it is a very bad idea.

define factorial2(n){
local prime result  shift cut prime_list k;

result = 1;
prime      = 2;
prime_list = list();

shift = __CZ__prime_divisors(n,prime);

prime = 3;
cut   = n//2;
/* first half with exponentiation */
do {
append(prime_list,__CZ__exponentiate(prime,__CZ__prime_divisors(n,prime)));
prime = nextprime(prime);
} while(prime <= cut);
/* first half rare primes only */
do{
append(prime_list,prime );
prime = nextprime(prime);
}while(prime <= n);

k = size(prime_list)-1;

result =  product_list( 0, k,1,prime_list);
result <<= shift;

return result;
}


One final note: the respective sizes of the factors in the multiplication is not very equal. That means that the use of the common methods for fast multiplication (e.g.: FFT) which depends on equal size of the multiplicands can be tricky. Simple padding is, uhm, simple but raises the n in M(n).
Well, a theme for a later posting, I guess 😉
UPDATE just an hour later.
Ok, I’ll do it now and here.
I admit: I withhold the essential trick that makes it faster than everything known before: nested squaring.
If you imaging the whole list of primes with their respective exponents and lay it flat, exponents down, primes up, and you’ll get the following:

 2 3 5 7 … 0 1 1 1 … 1 0 1 0 … … … … …

What we do now is reading it row-wise. Multiply the result with the respective prime if the bit is set, do nothing otherwise and when you come at the end of the row square the result, rinse and repeat.
Now we have only fast squaring and multiplication with very small primes which can be done in O(n) if the prime is smaller than one LIMB which has in most cases 32 bit if not 64.

define factorial3(n){
local prime result shift prime_list expo_list k k1 k2;

result = 1;
prime      = 2;
prime_list = list();
expo_list  = list();
/* Checks and balances omitted */
shift = __CZ__prime_divisors(n,prime);
prime = 3;
do {
append(prime_list,prime);
append(expo_list,__CZ__prime_divisors(n,prime));
prime = nextprime(prime);
}while(prime <= n);

/* size of the largest exponent in bits  */
k1 = highbit(expo_list[0]);
k2 = size(prime_list)-1;

for(;k1>=0;k1--){
/*
Square the result every row.
One might skip the first time because 1^2 = 1.
*/
result *= result;
for(k=0; k<=k2; k++) {
/*
check every exponent for a set bit and
multiply the result with the respective prime
*/
if((expo_list[k] & (1 << k1)) != 0) {
result *= prime_list[k];
}
}
}
result <<= shift;
return result;
}


That was it! Really! Promised!
Well …

[1]Peter B. Borwein, On the Complexity of Calculating Factorials, Journal of Algorithms, 3, 1985

In the file tommath.h of the multiple precision library LibTomMath version 0.42.0 on line 281 is this little gem:
/* I Love Earth! */`