Adventures of a Programmer: Parser Writing Peril XX


The program Little is all about numerical mathematics and, as the expression suggests, contains numbers. That means, in all consequences that we must be able to handle numbers. The ability to parse a number is based on a mathematical rigorous description of the term Number and not in a “I know when I see it!” way as Justice Potter Stewart described his own ability to detect obscenities so succinctly in Jacobellis v. Ohio (1964)—that is not sufficient.

Such a need for an exact description of Number is quite common, hence a committee…oh no! Not again! A committee! But all hand-wringing is moot and of no help, we have to go through all of it, 地獄へ行.

The standard for numbers in the case of floating point numbers (how do I know that we need floating point numbers and nothing else? Don’t know, call it intuition.) is the international standard ISO/IEC/IEEE 60559:2011, mainly based, if not identical to IEEE 754-2008. There is no need to buy a copy (the ISO standard comes down to CHF 178 for the 58 pages), I have a copy of IEEE 754-2008.
To blame for the content of IEEE 754-2008 are: Robert M. Grow, Chair; Thomas Prevost, Vice Chair; Steve M. Mills, Past Chair; Judith Gorman, Secretary and 23 others not including the non-voting IEEE-SA Standards Board liaisons as they liked to call them. Only three women in the committee (but three of four in the liasons). Who would have guessed that?

Back to the main theme.
The format IEEE-754 defines is a finite subset of real numbers(3.1.1). The basic formats have a fixed bit-length (e.g.: 64 bit) which we can not use, but this inability is only theoretical, in praxi we use a fixed bit-length, too it is just presentable in a variable way—the bit-length itself is only limited by hardware. But the generall format, described in section 3.2, table 3.1 as level 3, is applicable for any base and precision not zero:

\left(\mathrm{sign}, \mathrm{exponent}, \mathrm{significand}\right) \cup \{ -\infty , +\infty \} \cup \mathrm{qNaN} \cup \mathrm{sNaN}

That means that there exists a signed zero (+0 and -0 are distinct), two different kind of infinities which contradicts the finite part of the standard (but I already mentioned that it was a work of a committee, did I not?) and two different kinds of errors: qNaN which is a silent NaN (may include information) and sNaN (must cry for help).

Ok, not that bad.

In section 3.3 comes the real meat. As it seems.
We need the integer parameters

  • b = the radix
  • p = the number of digits in the significand (precision)
  • emax = the maximum exponent e
  • emin = the minimum exponent e where emin shall be 1 − emax for all formats

The full format is the well known (-1)^s \times b^e \times m where s\in \{0,1\}, the exponent e_\mathrm{min} \le e < e_\mathrm{max} and the mantissa m with m \in\mathbb{N} the sum m = \sum_i^p d_i^{b^i} with d < b  . The floating point numbers the standard describes is for 0 \le m < b where the mantissa is a normalized real number 0 \le m < b .
It is more useful for our program to use an integer as the mantissa. That changes the rule to (-1)^s \times b^q \times c with q \in\mathbb{Z} and e_{\mathrm{min}} \le q + p - 1 \le e_{\mathrm{max}} where p \in\mathbb{N}\setminus\{0\} is the precision. The mantissa c \in\mathbb{N} is now defined as 0 \le c < b^p.

The always finite precision of the numbers mean that any number smaller than b^{e_\mathrm{min}} has less precision than wanted and is called a subnormal number hence.

We are able to use arbitrary long mantissas in our program thanks to the bigint code but we cannot, without good conscience use a bigint as an exponent[1]. That reduces the size of the exponent to the maximum integer the native number allows. It is still 252 in JavaScript which is a very large number, no doubt about that but might be too small for some calculations. I don’t have the faintest idea what that could be (approximating Graham’s number? No, that is too large, even for bigint loaded exponents) but tell me freely if you have one.
So far for the mathematics, I hope I got them right.

Section 5.12.3 describes the format for an external hexadecimal sequence representing the number. The grammar (translated to ABNF without the shortcuts) looks promising.

hex-sequence     = *1sign hex-indicator hex-significand 
decimal-exponent = hex-exponent-indicator *1sign 1*digit
hex-significand  = (*hex-digit "." 1*hex-digit) /
                   (1*hex-digit ".") /
hex-indicator          = "0" ("x" /%x58 ) ; 0[xX]
hex-exponent-indicator = "p" / %x50 ; [Pp]
hex-digit     = digit / hexl-owercase / hex-uppercase ; [0-9a-fA-F]
hex-uppercase = %x41-46 ; [A-F] ABNF is case insensitive
hex-lowercase = "a" / "b" / "c" / "d" / "e" / "f"
digit         = "0" / "1" / "2" / "3" / "4" / 
                "5" / "6" / "7" / "8" / "9"
sign          = "+" / "-"

This can be used without much ado for purely decimal sequences, too.

dec-sequence     = *1sign dec-significand decimal-exponent
dec-exponent     = dec-exponent-indicator *1sign 1*digit
dec-significand  = (*digit "." 1*digit) /
                   (1*digit ".") /
dec-exponent-indicator = "e" / %x45 ; [Ee]
digit         = "0" / "1" / "2" / "3" / "4" / 
                "5" / "6" / "7" / "8" / "9"
sign          = "+" / "-"

To make the input of binary numbers easier a sequence can be derived from the hexadecimal sequence by exchanging the indicator. We keep the decimal exponent.

bin-sequence     = *1sign bin-indicator bin-significand 
decimal-exponent = bin-exponent-indicator *1sign 1*digit
bin-significand  = (*bin-digit "." 1*bin-digit) /
                   (1*bin-digit ".") /
bin-indicator          = "0" ("b" /%x42 ) ; 0[bB]
bin-exponent-indicator = "e" / %x45 ; [Ee]
bin-digit     = "0" / "1"
digit         = "0" / "1" / "2" / "3" / "4" / 
                "5" / "6" / "7" / "8" / "9"
sign          = "+" / "-"

The standard identifier for octal numbers is a leading zero. That means that any number starting with a zero gets taken as an octal number. But what is with e.g.: 0.01e1?
On way to solve this ambiguity is to insist on at least one digit before the radix point:

oct-sequence     = *1sign dec-significand decimal-exponent
decimal-exponent = oct-exponent-indicator *1sign  1*digit
oct-significand  = (1*digit "." 1*digit) /
                   (1*digit ".") /
bin-indicator          = "0" ; leading zero
oct-exponent-indicator = "e" / %x45 ; [Ee]
oct-digit     = "0" / "1" / "2" / "3" / "4" / 
                "5" / "6" / "7"
digit         = "0" / "1" / "2" / "3" / "4" / 
                "5" / "6" / "7" / "8" / "9"
sign          = "+" / "-"

The largest base representable in ASCII is not the mere 64 characters of [0-9a-zA-Z#$] but all visible characters. The latter list includes many characters used elsewhere and such a number would be hard to parse amidst other code. Even the simple [0-9a-zA-Z#$] has the problem of how to indicate the exact base—all letters and also all other usable signs are already in use! The biggest usable base is therefore 58 with the digits [0-9a-xA-X].
The letter “z” is pronounced like “zet” in Great Britain and “zee” in the USA and, in honour of Prof. Dr. P. Z. “Peezee” Myers, I take “zee” and call that base the zee-base[2].

zee-sequence     = *1sign zee-indicator zee-significand decimal-exponent
decimal-exponent = zee-exponent-indicator *1sign 1*digit
zee-significand  = (*zee-digit "." 1*zee-digit) /
                   (1*zee-digit ".") /
zee-indicator          = "0" ("z" / %x58 ) ; 0[zZ]
zee-exponent-indicator = "y" / %x59 ; [Yy]
zee-digit     = digit / zee-lowercase / zee-uppercase ; [0-9a-yA-Y]
zee-uppercase = %x41-58 ; [A-X] ABNF is case insensitive
zee-lowercase = %x61-78 ; [a-x]
digit         = "0" / "1" / "2" / "3" / "4" / 
                "5" / "6" / "7" / "8" / "9"
sign          = "+" / "-"

In the case that two more characters are safe to use (e.g.: the characters “$” and “#” from the 64er base example above) these can be used for the indicators but I refuse unwaveringly to do any of the cheap shots related to it! Especially not the one where
the numerical representation of the national dept of the USA shall start with “0$”. Nope, that’s too cheap a shot, even for me.

Some programming-languages, e.g: Perl allow for spaces in between the digits as a thousands separator (which is correct according to ISO 80000-1:2009). That is no problem if the programming language is not dependant on white-space as others are, that is, if you strip the code of all whitespace (with the exception of the content of the strings, of course) it still runs without error.
Another problem can be localization, even if we restrict the whole thing to ASCII. The radix point, decimal point or how do you call it, is not necessary a period (ASCII 0x2e) but sometimes a comma (ASCII 0x2c). The thousands delimiters are sometimes periods if the radix point is a comma and vice versa.

Next in this loose series: Strings. Don’t worry, will be quite short if I ignore Unicode.
I think I won’t.

[1] You can turn it around and use a native number (a double in JavaScript) as the mantissa and a bigint as the exponent. Normalizing would mean here to put the exponent of the native float into the exponent. That is good for handling extremely large numbers with only small precision. But that’s what logarithms are for, so such a number is only good for pedagogical reasons.
[2] but never again.


One thought on “Adventures of a Programmer: Parser Writing Peril XX

  1. Pingback: Adventures of a Programmer: Parser Writing Peril XXI | De Amentiae Mundi

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s