Adventures of a Programmer: Parser Writing Peril I

As menacedpromised in the last post: a description of the long path
to a well determined end.

Goal: Writing a small parser for a simple but complete language as a front-end for a for a calculator back-end. With emphasis on “complete”. A simple task one might think, a very simple one. One that could have been described in full in some book I mentioned in another post.

So in medias res, as Caesar wouldn’t have said (he spoke Greek) and of to the very beginning: The Plan. Not that I ever saw it but it is said that The Plan cometh at every beginning.
What is needed for the already briefly described…uhm…purpose, to avoid the repetition of the infamous two words, that are unbeknownst to many programmers, what does the parser need to parse? What do we need to do to enable the poor thing to do so?

Define the language and define it well!

At least before I use up my highly restricted pool of exclamation marks.
An example for the code that needs to be parsed:

/*
 * load definition for input()
 * also holds constant woop_foogle
 */
include "input_file";

global gvar=0;
static svar,zvar="";

function barfoo(var1=0){
  return massage(var1);
}
/* Ellipsis means "arbitrary number of symbols" and is not a symbol itself */
function foobar(var1,...,varN){
  local a1,a2,...,aN;
  a2 = cuddle(var2);
  for(a1 = 2;a1<length(var1);a1++,--a2){
     var1[a2] =  arguments[a1];
  }
  return var1;
}

function mcduff(a,b,fp,stop){
  while(stop--){
    a=fp(b);
  }
}
function paddington(var1){
  lol i=10;
  for(;i<100;i++){
    var1 *=.5;
  }
  return var1;
}


while(svar = input()){
  gvar++;
  if(gvar > 11)
    zvar = barfoo(svar);
  else if(strcmp(zvar,"blark") && zvar != 42 ){
    svar = input();
    zvar = barfoo(svar);
    exit(0);
  }
  else{
isfrownedupon:
    zvar = barfoo(zvar);
    zvar = mcduff(gvar,zvar,"paddington",45);
  }
}

do{
  zvar[gvar] = '\0';
  if(gvar & 0x1 > 0771)
    break;
  else
    continue;
}while(--gvar > 0);

switch(svar){
  case STEP_1:
  case STEP_2: print(svar);
               break;
  case STEP_3: make_f(zvar);
               Make_f(zvar);
               break;
  case STEP_4: goto isfrownedupon;
               break; /*NOTREACHED*/
  default:     drop_stuff();
               break;
}
/* cast to float with default precision */
print(100!*1.0);
/* 158 decimal digits need a big integer */
print(100!);
print( svar * 23.3e-2 + 444i );

/* holds the code of marble_grinder() */
include "another_file";

marble_grinder(start=23);
print(
  2*3,
  3/2,
  3.2%3, 6-2,8+8,-7e-43i,sqrt(-1),3^5
);

Looks quite like C but with strict typing.

  • 3 different definable scopes: local, static, and global
  • 3 different loops: for, while, do..while
  • 3 jumps: break,continue,goto
  • a way to make own functions, introduced with the keyword: function
  • functions, own and built-in can take multiple arguments, the number of the arguments can be fixed or be variable
  • Some logic: if, else, switch
  • A way to load (and parse, of course) external scripts
  • A kind of a function pointer (a thing you only miss when you do not have it)
  • Strings to parse
  • Numbers to parse
    • integers
    • reals
    • complex numbers
    • all above in arbitrary precision if needed
  • everything that is still amiss

We need to implement all above which gets parsed into an AST (abstract syntax tree) and add the following logic:

  • Validate semantically (missing semicolons, doubled declarations, etc.)
  • Some very high-level optimizations, e.g.: precalculate the terms in the last print() statement in the example code above but only if it makes sense

The next step would be the actual code generation but it’s a script language with immediate execution. On the other side; a bit of logic to transform the tree into something compilable would be something nice-to-have. Or something in the line.

Some of my principles while programming:

  • Make it work, optimize later
  • Counting is simple. It is: One,two,many
  • Take a lot of care to do things not in a too complicated way, just kKeep it simple!
  • Test the stuff
  • Do not publish immediately, sleep over it
  • Release early and often
  • Test the stuff
  • Run a spelling program over your comments and your code
  • Do not stick to stupid principles
  • Do not fall prey to NIH, use what already exists, works well, and is (legally!) usable

To keep it as portable as possible, but not too much, the whole thing will be written in ISO-C 99 with the help of some POSIX functions. To keep things simple the tools Flex and Bison will be used. Here “simple” is meant as “use a stable and well documented tool instead of writing something yourself which will most possible fail in some harsh and very unexpected ways”.
It wants to be a calculator so we need to parse numbers and operators at least. To be able to test what we wrote as early as possible we restrict ourselves to integers and floats as numbers. By way of the “one, two , many” principle it will be easy too add more if we do not hard-code the number “2”.
The operators include operations on two variables “+,-,*,/,%,^”, operations with one variable in a prefixed “+,-” and a postfixed “!” version, That will be easily extendable, e.g.: “–/++” in a pre- and postfix position.
Which we will do, extending I mean, and not just write that is easily extendable and “leave the rest to the reader”!

The most complicated example in the Bison manual is something called “mfcalc” which has a bit more in the guts and is explained well enough. Why don’t we just steal it, add the “factorial” operator and call it a day?
It is not that simple, as you already might have guessed. Changing mfcalc to something more elaborated involves quite some work. More work, actually, than doing it from scratch, but some things of it are re-usable and will be re-used.

Now that I wrote that i see that I forgot to mention a couple of things.

  • we have symbols, we need something to keep them: we need a symbol table
  • we have an argument list of variable length, we need something like a stack
  • we have data of different types. The types are not known in advance (“normal” integers or “big” integers? Is it a C99-double or do we need a “bigfloat”?), so we need something to handle it
  • We have names (variables, function names, strings in general), that are made of characters, so we need to handle that, too, but let’s restrict the domain to ASCII to keep it simple.
  • We have different scopes, we need something able to handle it. A dictionary might be fit for the task.

So, where to start? What’s next?

The symbol table is a binary tree (not a binary search tree, but we can make it one, if we want), which is, together with the stack and the dictionary a commonly used data structure; programming 101 so to say. So let’s refresh that long forgotten course in the next three posts. Or so.

To keep the license as weak as possible I`ll bite the sour apple, ignore my very own strict advice against the abscess in the programming culture that is the NIH abomination and write my own.
Thankfully, it is not necessary for the big number libraries, I’ll use libtommath for the big integers and probably libtomfloat, too for the floats.
The latter has a bug in its inverse square root implementation: a bad initial value. I found out while I tried to implement some basic I/O, namely reading a number from a given string and printing a big-float. One of the operations involved is a division by a larger integer and division is implemented as multiplication with a reciprocal and finding the reciprocal is implemented by finding the inverse square root and squaring that. Drove me mad.
My first approach is a very simple one.

/* LibTomFloat, multiple-precision floating-point library
 *
 * LibTomFloat is a library that provides multiple-precision
 * floating-point artihmetic as well as trigonometric functionality.
 *
 * This library requires the public domain LibTomMath to be installed.
 * 
 * This library is free for all purposes without any express
 * gurantee it works
 *
 * Tom St Denis, tomstdenis@iahu.ca, http://float.libtomcrypt.org
 */
#include <tomfloat.h>

/* using newtons method we have 1/sqrt(x) = Y_{n+1} = y_n * ((3 - xy^2_n)/2) */
int  mpf_invsqrt(mp_float *a, mp_float *b)
{
   mp_float oldval, tmp1, tmp2, const_3;
   int err, itts;
   long expo, nbits;

   /* ensure a is not zero or negative */
   if ((mp_iszero(&(a->mantissa)) == MP_YES) || (a->mantissa.sign == MP_NEG)) {
      return MP_VAL;
   }

   /* get number of iterations */
   itts = mpf_iterations(b);
 
   /* init temps */
   if ((err = mpf_init_multi(b->radix, &oldval, &tmp1, &tmp2, &const_3, NULL)) !
= MP_OKAY) {
      return err;
   }
   /* const_3 */
   if ((err = mpf_const_d(&const_3, 3)) != MP_OKAY)                             
  { goto __ERR; }

   /* tmp1 == reasonable guess at sqrt */
   if ((err = mpf_copy(a, &tmp1)) != MP_OKAY)                                     { goto __ERR; }

   /* 
     Finding an initial value, new approach.
   */
   expo = tmp1.exp;
   if (expo < -tmp1.radix) {
       nbits = -tmp1.radix - expo;
       tmp1.exp = (expo + ( nbits + (nbits/2) ) );
   }
   else if (expo > -tmp1.radix) {
       nbits = expo - -tmp1.radix ;
       tmp1.exp = (expo - ( nbits + (nbits/2) ) );
   }
   else {
       /* do nothing, near enough? */
       tmp1.exp = expo;
   }

   while (itts-- > 0) {
       /* grap copy of tmp1 for early out */
       if ((err = mpf_copy(&tmp1, &oldval)) != MP_OKAY)                           { goto __ERR; }

       /* first tmp2 = y^2 == tmp1^2 */
       if ((err = mpf_sqr(&tmp1, &tmp2)) != MP_OKAY)                              { goto __ERR; }
       /* multiply by x, tmp1 * a */
       if ((err = mpf_mul(&tmp2, a, &tmp2)) != MP_OKAY)                           { goto __ERR; }
       /* 3 - xy^2_n == 3 - tmp1 */
       if ((err = mpf_sub(&const_3, &tmp2, &tmp2)) != MP_OKAY)                    { goto __ERR; }
       /* halve it */
       if ((err = mpf_div_2(&tmp2, &tmp2)) != MP_OKAY)                            { goto __ERR; }
       /* multiply by y_n and feedback */
       if ((err = mpf_mul(&tmp1, &tmp2, &tmp1)) != MP_OKAY)                       { goto __ERR; }

       /* early out if stable */
       if (mpf_cmp(&oldval, &tmp1) == MP_EQ) {
          break;
       }
   }
   mpf_exch(&tmp1, b);
__ERR: mpf_clear_multi(&oldval, &tmp1, &tmp2, &const_3, NULL);
   return err;
}

(The above is a good example of why it is good idea to run a spell checker over your code and comments, especially if English is not your first language)
I leave the mantissa untouched and massage the exponent only. The Newton method used here is quite forgiving, so it should work but I haven’t made any effort to prove it.
What follows is the function to read a string into a bigfloat (and where the bug I describe above showed up). Still not the very best solution and also not well tested.

void mpf_get_str(const char *str, mp_float *a){
   size_t slen;
   long exponent, expol,num_dec_man_dig, num_dec_dec_dig;
   char *sp, *endp;
   mp_float ten, dig, num;
   int msign, esign;

   /* Do not manipulate other people's pointer, it might irritate them */
   sp = (char *)str;

   /* Just in case */
   while (isspace(*sp)){ sp++;}

   /* If nothing is left, set a to zero */
   slen = strlen(sp);
   if(slen == 0){
       errno = EDOM;
       mpf_const_d(a, 0);
       return;
   }

   /* We always set a to zero, whatever error happens */
   if(slen == 1 && !isdigit(*sp)){
       errno = EDOM;
       mpf_const_d(a, 0);
       return;
   }

   /* Gather the ingredients, take the easiest to get first. */

   /*
    * Sign of the mantissa/integer if any.
    * No sign at all defaults to positive.
    */
   msign = 1;
   if(!isdigit(*sp)){
     if(*sp == '-'){
       sp++;
       msign = -1;
     }
     else if(*sp == '+'){
       sp++;
       msign = 1;
     }
     else {
        errno = EDOM;
        mpf_const_d(a, 0);
        return;
     }
   }
   slen = strlen(sp);
   if(slen == 0){
       errno = EDOM;
       mpf_const_d(a, 0);
       return;
   }

   /* initialize the necessary big-floats, take radix from a */

   /* The constant ten, multiplier for the digits*/
    mpf_init(&ten,a->radix);
    mpf_const_d(&ten,10);

   /* Temporary number, holding the individual digits parsed */
    mpf_init(&dig,a->radix);

   /* Temporary number, placeholder for mp_float a */
    mpf_init(&num,a->radix);

   /* Start the parsing */

   /* Skip leading zeros */
   while (*sp == '0') sp++;
   slen = strlen(sp);
   if(slen == 0){
       mpf_const_d(a, 0);
       return;
   }
   /* we could skip trailing zeros but that would be quite complicated */

   /* Parse all digits up to [Ee.] if any */

   /* The overall number of decimal digits */
   num_dec_man_dig = 0;
   while (isdigit(*sp)) {
     /*num = num * 10 + (*sp - '0');*/
     mpf_const_d(&dig,(*sp - '0'));
     mpf_add(&num,&dig,&num);
     mpf_mul(&num,&ten,&num);
     sp++;
     num_dec_man_dig++;
   }
   exponent = 0;
   /* The non-digit may be the period */
   num_dec_dec_dig = 1;
   if (*sp == '.') {
     sp++;
     /* Same spiel as above but keep the number of decimals */
     while (isdigit(*sp)) {
       mpf_const_d(&dig, (*sp - '0'));
       mpf_add(&num,&dig,&num);
       mpf_mul(&num,&ten,&num);
       sp++;
       num_dec_man_dig++;
       num_dec_dec_dig++;
     }
   }
   exponent -= num_dec_dec_dig;

   /* 
    * There might have been no digits at all. Could have been checked above but 
    * it is simpler this way.
    */
   if (num_dec_man_dig == 0) {
     errno = EDOM;
     return;
   }

   /* There might be one or more spaces between */
   while (isspace(*sp)){ sp++;}

   /* Parse exponent if any */
   if (*sp == 'e' || *sp == 'E') {
     sp++;
     if (*sp == '0') {
       errno = EDOM;
       return;
     }
     /* deconstruct completely to make it more simple to follow */
     errno = 0;
     /* The exponent of mp_float is of a "long" data type */
     expol = strtol(sp,&endp,10);
     if ((errno == ERANGE && (expol == LONG_MAX || expol == LONG_MIN))
         || (errno != 0 && expol == 0)) {
       return;
     }
     exponent += expol; 
   } 
   esign = (exponent < 0)?-1:1;

   /* Scale the number to the correct magnitude */

   /* Work on absolute values  */
   mp_abs(&(num).mantissa,&(num).mantissa);
   exponent = abs(exponent);
   /* Binary exponentiation */
   while (exponent) {
     if (exponent & 1) {
       if (esign < 0) {
         mpf_div(&num,&ten,&num);
       } else {
         mpf_mul(&num,&ten,&num);
       }
     }
     exponent >>= 1;
     mpf_sqr(&ten,&ten);
   }
   if(msign <0){
     mp_neg(&(num).mantissa,&(num).mantissa);
   }
   mpf_copy(&num,a);
   mpf_clear(&num);
   mpf_clear(&ten);
   mpf_clear(&dig);
}

The code th print the string is still incomplete and buggy, but nevertheless and for watching only:

char *mpf_put_str(mp_float *a){
   char *buf, *temp, *zerot,*zerob;
   unsigned long multiplicator;
   size_t decimal_point,i, accuracy, buflen,offset;
   long expo;
   int sign;

   mp_int num;mp_int denom;mp_int tenpow;
   mp_init(&denom);
   mp_init(&num);
   mp_init(&tenpow);

   /* The number of decimal digits the radix can hold */
   multiplicator = (unsigned long)((double)(a->radix)/(double)(MPF_LOG2))+1;

   /* Allocate a buffer of sufficient size */
   buflen = (multiplicator < (size_t)4096)?(size_t)4096:multiplicator+100;
   buf = malloc(buflen);

   /* Sign gets handled last */
   sign = (a->mantissa.sign)?-1:1;
   mp_abs(&(a->mantissa),&(a->mantissa));

   /* Convert the base-2 mantissa to base-10 and put the result in buf */
   mp_toradix(&(a->mantissa), buf, 10);
#ifdef DEBUG_MPF_IO   
   printf("IN_PUT_STR %s * 2^%ld | %ld | %zu | %lu\n",
       buf, a->exp,a->radix, strlen(buf), multiplicator);
#endif
   /* The easiest way out */
   if(a->exp == 0){
     /* mp_toradix(&(a->mantissa), buf, 10); */
     buf[multiplicator] = '\0';
     return buf;
   }
   else{
     accuracy = (unsigned long)((double)(abs(a->exp))/(double)(MPF_LOG2))+1;
     if(a->exp < 0){
       mp_set_int(&denom, 1);
       mp_set_int(&tenpow, 10); 
       /* calculate 2^exp as 1<<exp*/
       mp_2expt(&denom,abs(a->exp));
       /* calculate 10^(dec. dig. in mantissa) */
       mp_expt_d (&tenpow,accuracy,&tenpow);
       /* Make a copy for further use */
       mp_copy(&(a->mantissa),&num);
       /* Now multiply by 10^(dec. dig. in mantissa) */
       mp_mul(&num,&tenpow,&num);
       mp_clear(&tenpow);
       /* and divide by 2^exp */
       mp_div(&num,&denom,&num,0);
       mp_clear(&denom);
       /* Result is the number without the decimal point*/
       mp_toradix(&num, buf, 10);
       mp_clear(&num);

       if(a->exp < -a->radix){
         offset = (sign < 0)?1:0;
         for(i=multiplicator;i>0;i--){
           if(buf[i] == '0'){
             multiplicator--;
             continue;
           }
           buf[i+1+offset] = buf[i];
         }
         buf[1+offset] = '.';
         buf[multiplicator+1+offset] = '\0';
         if(sign < 0){
           buf[1] = buf[0];
           buf[0] = '-';
         }
         expo = (unsigned long)
                   ((double) (abs(a->exp) - a->radix)/(double)(MPF_LOG2))+1;
         sprintf(buf,"%s E%ld", buf,-expo);
         return buf;
       }
       decimal_point = (unsigned long)
                   ((double)( a->radix - abs(a->exp) )/(double)(MPF_LOG2))+1;
       temp = malloc(strlen(buf)+10);
       zerot = temp;
       zerob = buf;
#ifdef DEBUG_MPF_IO
       printf("buf = %s, dp = %zu, mul = %lu\n"
                 ,buf, decimal_point,multiplicator);
#endif
       if(sign < 0){
         *temp++ = '-';
       }
       for(i=0;*buf;i++){
         if(i == multiplicator){
           break;
         }
         if(i == decimal_point){
           *temp++ = '.';
           continue;
         }
         *temp++ = *buf++;
       }
       *++temp = '\0';
       free(zerob);
       return zerot;
     }
     else{
       mp_set_int(&denom, 1);
       /* calculate 2^exp as 1<<exp*/
       mp_2expt(&denom,a->exp);
       /* Make a copy for further use */
       mp_copy(&(a->mantissa),&num);
       /* Now multiply by 2^exp */
       mp_mul(&num,&denom,&num);
       mp_clear(&denom);

       buflen = (size_t)(multiplicator 
              + ((unsigned long)((double)(a->exp)/(double)(MPF_LOG2))+1));
       buflen = (buflen < (size_t)8192)?(size_t)8192:buflen +100;
       buf = realloc(buf, buflen);
       mp_toradix(&num, buf, 10);
       mp_clear(&num);
       buflen = strlen(buf);
       offset = (sign < 0)?1:0;
       for(i=multiplicator;i>0;i--){
         buf[i+1+offset] = buf[i];
       }
       buf[1+offset] = '.';
       buf[multiplicator+1+offset] = '\0';
       if(sign < 0){
         buf[1] = buf[0];
         buf[0] = '-';
       }
       sprintf(buf,"%s E%zu\n", buf,buflen-1);
       mp_clear(&tenpow);
       return buf;
     }
   }
   mp_clear(&tenpow);
   mp_clear(&denom);
   mp_clear(&num);
   return NULL;
}

The header says “Define the language and define it well!” and the language ist still not defined, save well. That is left as an exercise for the…no, just kidding, I just had no time yet to write the grammar and to prove it to be correct. But implementing the complete grammar can be done at a later point and there is a lot to do in the meantime to prepare for it, so I have some short period of grace. I hope 😉

Advertisements

One thought on “Adventures of a Programmer: Parser Writing Peril I

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s