Adventures of a Programmer: Parser Writing Peril XXXII

Building Numbers

While the parser slowly idles, another thing gets on the table: the numbers and how to handle them.

We have two basic types, the real and the imaginary numbers which are easy to tell apart, the others are all the different representations and the precision. The precision is too complicated to do it life and gets preset by a user-defined value as everywhere else. The default value is double-precision, simply because it is the default value in ECMAScript.

The lexer will give us, hopefully, clean and validated numbers—no, not numbers, strings, we have to make numbers out of them! But because of the cleanliness we can take some liberties with the checks&balances part at least.
To do the actual parsing, that is, cutting the string representation of the number in its IEEE conforming parts. To make it easy we don’t write our own parser but use regular expressions instead. ECMAScript allows for some sophisticated regular expressions. We won’t use the sophisticated part, but regular expressions we will use.

As said above, the lexer has done a lot of the work; we can keep the regular expressions simple. Planed were four representations: binary, octal, decimal and hexadecimal and I thin k that is more than enough.
Any shortcuts available?
Yes, for example it is possible to do the decimals directly in JavaScript, although for the default precision only.
It would be nice if new Number(string) or parseFloat(string) would work just the opposite way of Number.toString(base), but it is sadly not possible to feed floats in other than decimal representation directly to new Number(string) or parseFloat(string).

Interestingly, IEEE-745 has defined hexadecimal representation but only as as optional, if I understand it correctly. We will use the definition there nevertheless. The single difference from decimals is the mark for the beginning of the exponential part: [pP] instead of [eE] for decimals, so it is all the same anyways.

The regular expressions for the individual representations are:

  switch(base){
    case 2:
      regexp = /([+-]{0,1})([01]*)([.]{0,1})([01]*)([eE]{0,1})([+-]{0,1})([0-9]*)/;
      break;
    case 8:
      regexp = /([+-]{0,1})([0-7]*)([.]{0,1})([0-7]*)([eE]{0,1})([+-]{0,1})([0-9]*)/;
      break;
    case 16:
      regexp = /([+-]{0,1})([0-9a-fA-F]*)([.]{0,1})([0-9a-fA-F]*)([pP]{0,1})([+-]{0,1})([0-9]*)/;
      break;
    default: /* base 10 */
      regexp = /([+-]{0,1})([0-9]*)([.]{0,1})([0-9]*)([eE]{0,1})([+-]{0,1})([0-9]*)/;
      break;
  }

Sorry for the long lines, but I cannot break them easily.
These expressions are all the same with some small but essential differences: the number of digits and the exponent mark.

The parentheses denote the groups which we can get afterwards. The JavaScript regex engine returns an array with the length of the number of groups and it does it always. Otherwise it would be easier to write a parser manually.

The groups are:

  1. parsed string (the whole match so to say). Not used
  2. sign of mantissa if any
  3. integer part if any
  4. radix point if any
  5. fractional part if any
  6. exponent mark if any. [pP] for hexadecimal number, [eE] otherwise (IEEE-754)
  7. exponent sign if any
  8. exponent if any. The exponent is always a decimal integer (IEEE-754)

The obvious problem is the many places where it says “if any”, so some flags might be useful. On the more positive side: parseInt(string,base) but that still means that we have to do the “float” in “floating point” by hand: normalize the number by shifting the radix point to the right and subtract that amount from the exponent if any—sorry, couldn’t resist—stuff the integer (without the exponent) into parseInt(string,base) and multiply by the base set to the power of the exponent.

The code:

function parseNumber(string, base){

  var sign,          has_sign, 
      integer_part,  has_integer_part,
      radix_part,    has_radix_part,
      fraction_part, has_fraction_part,
      exponent_mark, has_exponent_mark,
      exponent_sign, has_exponent_sign,
      exponent,      has_exponent,
      regex_array, regexp, the_number;

  if(arguments.length == 0){
    return Number.NaN;
  }
  if (typeof string !== "string"){
    return Number.NaN;
  }
  if (typeof base === "undefined"){
    base = 10;
  }
  /* shortcut for now to avoid rounding errors */
  if(base == 10 && (typeof Bigfloat === 'undefined') ){
    return parseFloat(string);
  }

The last branch needs to be at the beginning together with a check for precision to speed some things up but for now it is ok. You may set the check for base == 10 to base == 11 for debugging purposes.

  switch(base){
    case 2:
      regexp = /([+-]{0,1})([01]*)([.]{0,1})([01]*)([eE]{0,1})([+-]{0,1})([0-9]*)/;
      break;
    case 8:
      regexp = /([+-]{0,1})([0-7]*)([.]{0,1})([0-7]*)([eE]{0,1})([+-]{0,1})([0-9]*)/;
      break;
    case 16:
      regexp = /([+-]{0,1})([0-9a-fA-F]*)([.]{0,1})([0-9a-fA-F]*)([pP]{0,1})([+-]{0,1})([0-9]*)/;
      break;
    default: /* base 10 */
      regexp = /([+-]{0,1})([0-9]*)([.]{0,1})([0-9]*)([eE]{0,1})([+-]{0,1})([0-9]*)/;
      break;
  }

These regular expressions do not check for validity, e.g.: “-.e” would slip through, we really depend on a good lexer job!

  regex_array = regexp.exec(string);

  has_sign = (regex_array[1] !== 'undefined') ? true : false ;
      sign = (regex_array[1] === "-") ? -1 : 1;

  has_integer_part = (regex_array[2] !== 'undefined') ? true : false ;
      integer_part = (has_integer_part) ? regex_array[2] : 0 ;

  has_radix_part = (regex_array[3] !== 'undefined') ? true : false ;

  has_fraction_part = (regex_array[4] !== 'undefined') ? true : false ;
      fraction_part = (has_fraction_part) ? regex_array[4] : 0 ;

  has_exponent_mark = (regex_array[5] !== 'undefined') ? true : false ;

  has_exponent_sign = (regex_array[6] !== 'undefined') ? true : false ;
      exponent_sign = (regex_array[6] === "-") ? -1 : 1 ;

  has_exponent = (regex_array[7] !== 'undefined') ? true : false ;
      exponent = (has_exponent) ? regex_array[7] : 0 ;

Some, if not many of these flags are not really necessary but some are.
It needs at least one digit anywhere. E.g.: “0.” and “.0” are correct while “.” is not.

  if(!has_integer_part && !has_fraction_part){
    return Number.NaN;
  }

An exponent mark without an exponent is too lonely to be socially acceptable.

  if(has_exponent_mark && !has_exponent){
    return Number.NaN;
  }

All ok and proper now? Can we make a whole number out of the parts?

  var ipl = integer_part.length;
  var fpl = fraction_part.length

If the fractional part does not exist it is, to almost all intents and purposes, an integer so we branch on this premise.

  if(has_radix_part && (fpl != 0) ){

The algorithm described above is carefully hidden but does its work.

    /* we shift the radix point fpl-times to the right */
    if(ipl && fpl){
      integer_part = integer_part + fraction_part;
    } else if(!ipl && fpl){
      integer_part = fraction_part;
    }
    if(!has_exponent){
      has_exponent = true;
      exponent = "0";
    }
    /*update exponent, check for over/under-run*/
    exponent = parseInt(exponent);
    exponent *= exponent_sign;
    exponent -= fpl;

The IEEE-754 exponent is not only always decimal it is, too, the exponent of the base building the power which is the multiplier, so a simple subtraction will suffice.

The bounds checking is quite primitive here, just the example for double precision.

    if(exponent < -307){
      /* It should be taken as zero instead. No, really! */
      return Number.NEGATIVE_INFINITY;
    } 
    if(exponent > 308){
      return Number.POSITIVE_INFINITY;
    }
  } 

Oh, a duplicate! Leave it there for now.

  if(!has_exponent){
    has_exponent = true;
    exponent = "0";
  }

An finally build the number.

  integer_part = parseInt(integer_part,base);
  exponent     = parseInt(exponent,10);
  exponent    *= sign;
  /* integer_part is now a number, so just reshift it*/
  integer_part *= Math.pow(base,exponent);
  the_number = integer_part;
  the_number *= sign;

  return the_number;
}

And I thought: use JavaScript, its simpler! Well…

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s