[UPDATE]: bugfix.

The Little programming language is in need of arbitrary precision floating point numbers, too. The necessary basics at least: addition; multiplication; division; roots; logarithm; an exponential function; and the three basic trigonometric function sin, cos, and tan; not to forget the constant computed to the current precision. More to come, of course, but these are the minimum for a comfortable environment.

Well, that’s all fine and dandy but how do we get the numbers in and out? Yes, that’s not *that* simple, indeed.

Let’s start at the beginning:

Arbitrary precision floating point numbers are floating point numbers but not of arbitrary precision. It is not possible because some numbers are representable in a finite number of characters but others are not.

(1)

That means the number of digits of the fractional part need a limit which is commonly called the precision of the floating point number and has to be set in advance. Together with all of the other necessities of computerization the definition of a floating point real number is as follows:

- Number of digits allowed: the precision
- Integer part
- Fractional part
- A sign

An optimization you won’t even recognise as one is missing: the exponent. It is a scaling factor that compresses the possibly long number of leading or trailing zeros to a single number, the count of the zeros. A negative exponent numbers leading zeros:

(2)

And a positive exponent counts trailing zeros.

(3)

We can see that the rule is to move the decimal point to the right[1] of the first non-zero digit and count the steps. This allows for moving the radix point quite freely which makes the division between an integer and a rational part not obsolete but easy to handle. We can, for example, fix the radix point and move the number around.

The conversion of floats between different radices gets done in the following way:

- get a number of radix
*r*_{1} - convert it to radix
*r*_{2} - round it to precision
*p*

There are a couple of steps in between step one and step two, we come to them later in this post but point three is the most interesting for now.

IEEE-754 knows and defines four different rounding modes:

- round towards nearest, the default
- round towards zero, which is basically truncating
- round towards negative infinity (or “down”), the floor function
- round towards positive infinity (or “up”), the ceil function

The default function, round to nearest, has in IEEE-754 the rule “half to even” if the number is an exact half integer

(4)

Although this method is the most complicated, it is also the default option in IEEE-754 and as I try to follow this standard as much as possible (but not even the smallest step further!) it is the method of choice for my own implementation.

There is also one extra condition: the number , once converted to , but only then, must never change if converted back and forth. That is

(5)

Meaning: if you, say convert the decimal number to a binary number with the precision and behaviour of and IEEE-754 `double`

you get but when you take and feed that back to the conversion routine you must get back if you used the very same rounding mode: .

Now, how to *do* radix conversion?

The most used radix conversion is probably the conversion between 2 and 10, between decimal and binary numbers. We can assume that we have a method to work with arbitrary long integers (that’s what I hope, at least), do arithmetic with them and have the radix conversion already build in. Nevertheless a short reminder of radix conversion for integers:

- get the most significant digit of the number a
_{1}of radix r_{1} - Add it to the number a
_{2}that has radix r_{2} - Multiply a
_{2}by r_{1} - rinse and repeat

That is the way how it is done in my implementation of a big integer library for now. It is very slow. It would be faster if we do it in a slightly different way by observing that we repeatedly multiply the same r_{1}, we could do that at once or at least in larger chunks.

We also have bigint digits that are most probably much larger than the digits of radix r_{1}. So, if we take *n* digits of *r*_{1} such that we can speed it up a little.

This works well for e.g.: converting decimal strings to the 26-bit big-digits of my big integer implementation but not vice versa. We would need a big integer with a decimal radix for that.

It is soem time ago but I have written a big integer library with a decimal base, and tried it instead of the build-in conversion routine of Tom Wu’s big integer library. It was faster, as expected. Not much but enough that I plan it to do for my radix 2^{26} big integers, too. In the future.

Now back to floating point radix conversions.

To get integers, we need to make the floating point number *x* (we assume finite precision here and anywhere else) of radix *r* a rational number just by moving the radix point while keeping the count *n* such that the resulting rational has the form

(6)

with *n* the smallest possible power of *r* such that the equation (6) holds

Example:

(7)

We can stop here and do rational arithmetic with it to stay exacty. But rational arithmetic can result in very large integers and we wanted to know how to convert radices in the first place, so we need to go on from here[2].

One notation before we go on: let be the number of digits for the representation of the number in radix and let be the *fixed* number of digits of radix of the target, the precision.

We need to scale the rational (constructed as described above) to fill up the target number exactly up to the brim, so we can do the rounding. The target is filled up exactly when (working with logarithms here, hence the simple addition)

(8)

This sum can get negative and with

(9)

we get ( be the radix of the target)

(10)

With our example…oh, wait, we need a precision and a radix! Let’s take and . Then , , and . With we get the scale factor . The scale factor is positive, so we need to

(11)

Which results in a numerator of *995,610,453,248,924,265,350,259,524,281,934,194,034,081,792*

Now we are prepared to do the rounding by doing an integer division with remainder.

(12)

The result still doesn’t fit into the margin of this blog. Maybe I should have chosen a smaller ? It is and .

Thusly prepared the round-to-nearest with half-to-even of the quotient goes

(13)

If it happens that then set and . If you thought that we don’t need anymore and dumped it already…well…

Half of the quotient in our example is bigger than the remainder, that means rounding down, or truncate the result, but because of the integer result of an integer division, we do not need to do anything.

The raw, but correctly rounded result ( for “raw” not for “radix”) is now

(14)

With our example

Good enough for many but not for IEEE-754. They want the result normalized to , too

(15)

With our example makes respectively because you don’t scale the mantissa () only the exponent.

The sign of the original numbers has no significance for the calculations, we can add it at the end.

[1] This is correct even in scripts that write from right to left like the arabic. They took the decimal system from Indian mathematicians who wrote them left to right, although in some of the earliest documents found in India they wrote them right to left.

[2] I will do this way in Little up to a certain amount and/or configurable.

Pingback: Conversion of Binary Floating Point Numbers—Second Try | De Amentiae Mundi

Pingback: Multi-Precision Floats: Radix Conversion with Large Exponents | De Amentiae Mundi