Adventures of a Programmer: Parser Writing Peril XXXIII

Some Helpers

While I was sampling all the bits and pieces needed to make the whole thing not only function but useful I stumbled over some code I once wrote that I didn’t understand anymore at the first glance—where the first reaction was something in the line of WTF or worse. Continue reading

Adventures of a Programmer: Parser Writing Peril XXX

Mmh… which of the sillier censoringfiltering programs will stumble over the title of this blog and block the whole WordPress domain from pure vengefulness? ;-)

But serious (actually, the above is serious but too much off of the theme):
I’m sampling for a library for Small out of my Pragmath project and found so much things that need an array, I think I’ll bite and implement a simple version of an array (no elisions etc.). Nothing more, just wanted to keep my idea in writ such that I cannot deny it later.
Yes, I know myself quite good ;-)
They offer webpages over at GIThub now—let’s see what one can do there and how, if not. It would be nice to be able to point to something that’s already moving, as little as it might be at the start.

Adventures of a Programmer: Parser Writing Peril XXIX

Small, the lesser evil

We want to have a language with a lot of specialities. OK, but some specialities we want! These extras are additions to some basic language for doing arithmetics. No changes to the underlying grammar, only additions. It is still a large grammar but better manageable than the final grammar of Little.
And the final product will be usable! No silly toy language, barely able to be taken as an example you can learn something off, no, something you can work with. And if you have a license (yes I know, but it is for all practical purposes just that: a license) you can slap some nice UI on top and sell it for one-ninety-nine at the Apple-shop. The restriction “Apple-shop” holds because Android got GP/PARI ported, so it is of absolutely no use to try it there ;-)

But I digress, as usual, so down to the meat-halls now.

The Small language is a proper subset of the Little language which itself is nearly a subset of ECMAScript. It needs to be usable, si it needs free variables, functions (including function declarations, that is, building your own), loops (while, do…while,for), jumps (break, continue, switch), something to work with (operators), and something to work on (numbers, but only the double-precision numbers ECMAScript offers).
We need something to sample and massage the AST-nodes.
We need a lexer. Jison has a built-in Flex-like lexer, we use it.

Parser first. The Jison parser syntax is available in two flavours: JSON and Bison. We take the latter but we need to be aware of same small differences to the original.

The parser parses its own script bottom down, the parser itself bottom up (LARL to be exact but some other types are available, too), That means that we need to give it a start point. NB: I will never use abbreviations if it is avoidable. Especial when these abbreviations came for a time that measured available memory in kilobytes.

Somebody remembers the famous “640 kb is enough for all!”? It is said that Bill Gates said it. And now try to find a version of Windows that runs and is usable with 640 MB of memory.

I digress again? Yeah, get used to it, I do it quite often ;-)

Uhm…where have I been…ah, yes, the starting point. It needs a name and we will name it, with the help of all our wit and phantasy, program. Please admire the choice of the American spelling which will save us a full two characters!

The parser script can be parted in segments, divided by a doubled percent sign %% at the start of the line with nothing else. The directive of the starting point goes into the segment before the actual parser. Because of the direction of the parsing we should start with a node and go up in direction of the precedence according to the ECMAScript 5.1 standard. The easiest starting point for the parser script (not the parser, that starting point is already chosen to be program) based on the described path is a leaf. The out-most leaf is a so called literal, here a number. You don’t do anything with it (ok, we need to parse the number but we can use JavaScript for that here), you take it literally. Now we have a start of our parser-script:

/* root node */
%start program
%%
literal
    : "TRUETOKEN"          { $$ = new boolean_node(true,location(@1,@1)); }
    | "FALSETOKEN"         { $$ = new boolean_node(false,location(@1,@1)); }
    | "NUMBER_LITERAL"     { $$ = new number_node($1,location(@1,@1)); }          
    | "IMAGINARY_LITERAL"  { $$ = new imaginary_number_node($1,location(@1,@1)); }
    ;

Here we have a C++-like multiline comment, the definition of the starting point (can be abbreviated to %s program) the divisor and the first production. The all-uppercase identifiers are variable names from the lexer, they have to be put in quotes in Jison.
I have described the meaning of the rest in my last post but see your textbook for details (e.g.: O’Reilly’s Flex&Bison $29.99 for the print. Or ask Google for some information.)
I’ll leave out the cluttering location() function from now on.

The next production is one that actually does something instead of a mere collecting.

primary_expression
    : literal
    | "IDENTIFIER"         { $$ = new resolve_node($1));}
    | '(' expression ')'   { $$ = $2; }
    ;

It consists of the variable literal, the name of our first production which is already defined at that point, and the parenthesized group with the production name expression in it which isn’t defined yet. We use JavaScript directly here for the grouping by copying the content into the AST.
We can group primary_expression and a function call together as a left-hand expression. It makes not much sense now but is a nice place for additions to…uhm…add (e.g.: the details for the matrix-type).

left_hand_expression
    : primary_expression            { $$ = new primary_expression_node($1); }
    | "IDENTIFIER" arguments        { $$ = new function_call_node($1, $2); }
    ;

Arguments for a function can be none or many in number:

arguments:
      '(' ')'                                  { $$ = []; }
    | '(' argument_list ')'                    { $$ = $2; }
    ;

argument_list
    : assignment_expression                   { $$ = [$1]; }
    | argument_list ',' assignment_expression { $$ = $1.concat($3); }

Here we use JavaScript directly again and put the arguments in an easy to parse Array. We end this branch here and start a new one, but production assignment_expression still needs a definition, we must not forget it.
We have the left-hand expressions, we need something on the right, too and that is aptly called postfix-expression:

postfix_expression
    : left_hand_expression
    | left_hand_expression "!"          { $$ = new postfix_node($1, "!"); }
    | left_hand_expression "PLUSPLUS"   { $$ = new postfix_node($1, "++"); }
    | left_hand_expression "MINUSMINUS" { $$ = new postfix_node($1, "--"); }

A parser for C++ cannot use the tokens as strings directly. Either use an enum (common way) or pass a pointer to it new postfix_node($1, *$2);. The latter is not quite useful here but there are some useful cases e.g.: the identifier where you need the actual string.
The exclamation-mark is for the combinatorial function factorial.
The exclamation-mark gets used at another place as a logical negation, too. It is possible to do it if you do it carefully like here:

unary_expression
    : postfix_expression
    | "PLUSPLUS" unary_expression     { $$ = new prefix_node("++", $2); }
    | "MINUSMINUS" unary_expression   { $$ = new prefix_node("--", $2); }
    | '+' unary_expression            { $$ = new unary_plus_node($2); }
    | '-' unary_expression            { $$ = new negate_node($2); }
    | '~' unary_expression            { $$ = new bitwise_not_node($2); }
    | '!' unary_expression            { $$ = new logical_not_node($2); }
    | '#' unary_expression            { $$ = new cardinality_node($2); }
    ;

The hash-character # acts like its node-name suggests and returns the cardinality of the following set. It is rarely a not-yet-implemented set but it can for example return the length of a not-yet-implemented array or the degree of a not-yet-implemented polynomial expression. If it is of only little use or too complicated to implement it gets dropped without hesitation.

The above shows the implementation of the order of precedence, too. The prefix’d incrementor comes before the postfix’d incrementor in both precedence and parser-script.
The next in precedence would be exponentiation. The common symbol for exponentiation is the caret ^ but that will be taken as the xor-operator symbol.
Some scripts use the doubled asterix ** as the power-symbol and so do we.
I repeat myself? Yeah, I’m old, I’m supposed to ;-)

power_expression
  : unary_expression
  | power_expression POWER unary_expression { $$ = new power_node($1, $3); }
  ;

I find it very useful to have two kinds of division operators, one for floating point division and one for integer division. You might call it syntactic sugar but I call it useful.
Most languages I know implement it as a doubled division operator // and so do we.

multiplicative_expression
    : power_expression
    | multiplicative_expression '*'  power_expression      { $$ = new mult_node($1, $3, "*"); }
    | multiplicative_expression '/'  power_expression      { $$ = new mult_node($1, $3, "/"); }
    | multiplicative_expression "INTDIV"  power_expression { $$ = new mult_node($1, $3, "//"); }
    | multiplicative_expression '%'  power_expression      { $$ = new mult_node($1, $3, "%"); }
    ;
additive_expression
    : multiplicative_expression
    | additive_expression '+' multiplicative_expression  { $$ = new add_node($1, $3, '+'); }
    | additive_expression '-' multiplicative_expression  { $$ = new add_node($1, $3, '-'); }
    ;

The lines get quite long, sorry but I failed in my choice of the WordPress theme and doing some line-breaks makes in less readable in my opinion.
Addition is quite boring and really the same, so I added it to the same snippet.

The multiplications by powers of two aka. shifting makes use of the difference between signed and unsigned shifting like in Java and JavaScript and probably some other languages, too.

shift_expression
    : additive_expression
    | shift_expression LSHIFT additive_expression  { $$ = new shift_node($1, "<<", $3); }
    | shift_expression RSHIFT additive_expression  { $$ = new shift_node($1, ">>", $3); }
    | shift_expression URSHIFT additive_expression { $$ = new shift_node($1, ">>>", $3); }
    ;

Now we are at the relational logic in the precedence order.

relational_expression
    : shift_expression
    | relational_expression '<' shift_expression { $$ = new relational_node($1, "<", $3); }
    | relational_expression '>' shift_expression { $$ = new relational_node($1, ">", $3); }
    | relational_expression LE shift_expression  { $$ = new relational_node($1, "<=", $3); }
    | relational_expression GE shift_expression  { $$ = new relational_node($1, ">=",$3);}
    ;

equality_expression
    : relational_expression
    | equality_expression EQEQ relational_expression   
                                      { $$ = new equal_node($1, "==", $3); }
    | equality_expression NE relational_expression     
                                      { $$ = new equal_node($1, "!=", $3); }
    | equality_expression STREQ relational_expression  
                                      { $$ = new equal_node($1, "===", $3); }
    | equality_expression STRNEQ relational_expression 
                                      { $$ = new equal_node($1, "!==", $3);}
    ;

Yes, that formatting is with some line-breaks. Hmm…

We need some bit manipulations.

bitwise_and_expression
    : equality_expression
    | bitwise_and_expression '&' equality_expression
                                     { $$ = new bit_oper_node($1, "&", $3); }
    ;
bitwise_xor_expression
    : bitwise_and_expression
    | bitwise_xor_expression '^' bitwise_and_expression
                                     { $$ = new bit_oper_node($1, "^", $3); }
    ;
bitwise_or_expression
    : bitwise_xor_expression
    | bitwise_or_expression '|' bitwise_xor_expression
                                     { $$ = new bit_oper_node($1, "|", $3); }
    ;

And some more logic.

logical_and_expression
    : bitwise_or_expression
    | logical_and_expression AND bitwise_or_expression
                                     { $$ = new binary_logical_node($1, "&&", $3); }
    ;
logical_or_expression
    : logical_and_expression
    | logical_or_expression OR logical_and_expression
                                     { $$ = new binary_logical_node($1,"||", $3); }
    ;

The infamous ternary expression.

conditional_expression
    : logical_or_expression
    | logical_or_expression '?' assignment_expression ':' assignment_expression
                                     { $$ = new conditional_node($1, $3, $5); }
    ;

And finally we need the assignement-expression defined.

assignment_expression
    : conditional_expression
    | left_hand_expression assignment_operator assignment_expression
                                     { $$ = new assign_node($1, $2, $3);}
    ;
assignment_operator
    : '='            { $$ = "=" ; }
    | "PLUSEQUAL"    { $$ = "+=" ; }
    | "MINUSEQUAL"   { $$ = "-=" ; }
    | "POWEREQ"      { $$ = "**="; }
    | "MULTEQUAL"    { $$ = "*="; }
    | "INTDIVEQUAL"  { $$ = "//="; }
    | "DIVEQUAL"     { $$ = "/="; }
    | "LSHIFTEQUAL"  { $$ = "<<="; }
    | "RSHIFTEQUAL"  { $$ = ">>="; }
    | "URSHIFTEQUAL" { $$ = ">>>="; }
    | "ANDEQUAL"     { $$ = "&="; }
    | "XOREQUAL"     { $$ = "^="; }
    | "OREQUAL"      { $$ = "|="; }
    | "MODEQUAL"     { $$ = "%="; }
    ;

The assignement-expression looks curious as if it cannot work at all and loops infinitely but the first expression goes to conditional_expression which itself the first expression logical_or_expression which itself…got it? If not, ask somebody who did ;-)

The production expression is defined as a comma separated list. Not necessarily needed but useful.

The thing is closed here, you can implement it without much additional work and have something like the basic calculator from the Bison/Jison examples with variables and functions (if you implement variables and functions) but it isn’t even Turing-complete—you cannot move the tape freely, you cannot jump.

Let’s take a jump to the left, and then a step to the ri-i-i-i-ight. Sorry.

So we need some if’s and else’s, and some loops, too; we need statements, with some content, like every good politician pretends to utter.

Apropos politicians: it is midnight here in western Germany so lets take a look in the northern direction, where the storms come from and see if a storm is there.
The New York Times says “don’t know anything before Friday morning CEST” although needing several pages to do so and the Daily Mail…no, I don’t want to know what the Daily Mail says or more exactly: I don’t care what the Daily Mail says.

What? Hey, I said I digress often, didn’t I?

And now I lost the thread…yes, thank you, statements it was.

statement
    : block
    | variable_statement
    | empty_statement
    | expression_statement
    | if_statement
    | iteration_statement
    | continue_statement
    | break_statement
    | return_statement
    | switch_statement
    | labelled_statement
    ;

The naming of the individual statements should be telling enough but I’ll comment in details if necessary.

block
    : "OPENBRACE" source_elements "CLOSEBRACE"   { $$ = new block_node($2); }
    ;

The block-statement is special as it does nothing on itself, it is just a syntactically neutral grouping and source_elements can be anything including the block-statement.
What it can be used for is to manipulate the scope if you increment the scope after every opening brace and decrement after every closing brace (it is a bit more complicated than that but basically…).

The following one is a bit more interesting, I think, the declaration of the variables.

variable_statement
    : "LET" variable_declaration_list ';' { $$ = new var_statement_node($2); }
    ;
variable_declaration_list
    : variable_declaration                 { $$ = [$1]; }
    | variable_declaration_list ',' _declaration 
                                           { $$ = $1.concat($3); }
    ;
variable_declaration
    : "IDENTIFIER"                      { $$ = new var_decl_node($1, 0) }
    | "IDENTIFIER" initializer          { $$ = new var_decl_node($1, $2) }
    ;
initializer
    : '=' assignment_expression       { $$ = $2; }
    ;

The keyword for the variable declaration is let.
We could do without a keyword but it is simpler and cleaner that way.
Example let a=0,b=c=d=gammaln(e),f,g,h;

We need a list of statements (later for the case/default in switch ), an empty statement and and expression statement

statement_list
  : statement                  { $$ = [$1]; }
  | statement_list statement   { $$ = $1.concat($2); }
  ;
empty_statement
    : ';'                      { $$ = new empty_statement_node(); }
    ;
expression_statement:
    expression ';'             { $$ = new expr_statement_node($1); }
    ;

The next statements are related to branching.

if_statement
    : "IF" '(' expression ')' "OPENBRACE" statement "CLOSEBRACE" 
                               { $$ = new if_node($3,$5); }
    | "IF" '(' expression ')' "OPENBRACE" statement "CLOSEBRACE" 
                         ELSE "OPENBRACE" statement "CLOSEBRACE"
                               { $$ = new if_node($3,$5,$7);}
    ;

No dangling else problem here because of the embracing. Some people will not like it but it is easy to change and I like it the way I defined it. One of the advantages if you write it yourself. One of the advantages of Open-Source: you can change it if you want and don’t have to ask me.
It is cleaner, too.
The iteration statement is named that way because it consist of a group of iterating statements.
Clever, hu? ;-)

iteration_statement
    : "DO" "OPENBRACE" statement "CLOSEBRACE" "WHILE" '(' expression ')' ';'
                                       { $$=new do_while_node($3,$7);}
    | "WHILE" '(' expression ')' "OPENBRACE" statement "CLOSEBRACE"
                                       { $$ = new while_node($3,$5); }
    | "FOR" '(' expression_optional ';' expression_optional ';' expression_optional ')'
        "OPENBRACE" statement "CLOSEBRACE" 
                                       { $$ = new for_node($3,$5,$7,$10);}
    ;
expression_optional
    :                                  { $$ = " "; }
    | expression                       { $$ = $1; }
    ;

The for-loop is build like most other for-loops, especially ECMAScript’s for-loop, hence the space-character put into the AST, such that the assembler has not much work to do here.
The embracing of the following statements is not optional here as it was not optional in the if-else-branching. It had been done there to avoid a conflict, here it is done mainly for simplicity and a clean syntax. Yes, I know that I repeat myself.

Next come the jump-statements.

continue_statement
    : "CONTINUE" ';'                     { $$ = new continue_node();}
    | "CONTINUE" "IDENTIFIER" ';'        { $$ = new continue_node($2); }
    ;
break_statement
    : "BREAK" ';'                        { $$ = new break_node();}
    | "BREAK" "IDENTIFIER" ';'           { $$ = new break_node($2);}
    ;
return_statement
    : "RETURN" ';'                       { $$ = new return_node();}
    | "RETURN" expression ';'            { $$ = new return_node($2); }
    ;

The jumps can only go out (break, return) or backwards (continue). Jumping forward (goto) is not implemented in ECMAScript (yet?) so for the reason of simplicity we don’t do it either.

The switch statement is as it is implemented in JavaScript with the addition of the forced embracing.

switch_statement
    : "SWITCH" '(' expression ')' case_block { $$ = new switch_node($3, $5);}
    ;
case_block
    : "OPENBRACE" case_clauses "CLOSEBRACE"
                              { $$ = $2; }
    | "OPENBRACE" case_clauses default_clause case_clauses "CLOSEBRACE"
                              { $$ = $2.concat($3).concat($4); }
    ;
case_clauses
    :      /* nothing */       { $$ = []; }
    | case_clauses case_clause { $$ = $1.concat($2); }
    ;
case_clause
    : "CASE" expression ':'                   { $$ = new case_clause_node($2, " "); }
    | "CASE" expression ':' statement_list    { $$ = new case_clause_node($2, $4); }
    ;
default_clause
    : "DEFAULT" ':'                           { $$ = new case_clause_node(" ", " "); }
    | "DEFAULT" ':' statement_list            { $$ = new case_clause_node(" ", $3); }
    ;

The jumps can jump into something and that something is called a label

labelled_statement
    : "IDENTIFIER" ':' statement              { $$ = new label_node(*$1, $3);}
    ;

We need some function declarations, with and without arguments

function_declaration
    : "DEFINE" "IDENTIFIER" '(' ')' "OPENBRACE"
           function_body "CLOSEBRACE"        /* check for [].length to differ */
                                           { $$ = new func_decl_node($2,$3,[],$6); }
    | "DEFINE" "IDENTIFIER" '(' parameter_list ')' "OPENBRACE"
           function_body "CLOSEBRACE"    
                                           { $$ = new func_decl_node($2,$3,$5, $7); }
    ;
parameter_list
    : "IDENTIFIER"                          { $$ = new parameter_list_node($1); }
    | parameter_list ','  "IDENTIFIER"      { $$ = new parameter_list_node($1, $3);}
    ;
function_body
    : source_elements
    ;

The keyword for the declaration of a function is define and the syntax is a subset of the ECMAScript variant with the addition of the forced embracement.

The whole program consists of source-elements, a list of the catch-all statement followed by the end-of-file marker.

program
    : source_elements EOF               { $$ = new program_node($1); return $$;}
    ;

The source-elements list consists of source-elements or nothing.

source_elements
    :                                   { $$ = []; }
    | source_elements source_element    { $$ = $1.concat($2)}
    ;

A single source-element consists of statements and function_declarations which are already defined and that means: we are done.

source_element
    : statement
    | function_declaration
    ;
%% /* AST handling code comes after this line */

Done with the parser-script that is ;-)

If you want to go on yourself (I caught a cold and will need some aspirin and a bed now, it get to be a long day tomorrowday), here are some little helpers:

perl -nle 'print "function $1\{\n\n\}" if /new\s+(\w+\([0-9,$ ]*\))/' small.jison

A shell one-liner to grab all of the node-objects and make functions (constructors actually) out of them:
If you want to use it like C. Ihrig the following prints the necessary code:

perl -nle 'print "parser.ast.$1 = $1;" if /new\s+(\w+\([0-9,$ ]*\))/' small.jison

The code to make a IEEE-754 number out of the string representation (I did that already, didn’t I?) is (without hexadecimal or other representations to keep it simple):

function parseNumericLiteral(literal) {
  return parseNumber(literal);
}
function parseImaginaryLiteral(literal) {
  return parseNumericLiteral(literal.
                    subst(0,literal.length-1));
}
function parseNumber(string){
  return Number(string);
}

It is a bit more complicated than necessary, but that way you can add other representations more easily. To give a hint:

function parseNumericLiteral(literal) {
  if (literal.charAt(0) === "0") {
    switch(literal.charAt(1).toLowerCase()) {
      case "x": /* hexadecimal */
        return parseNumber(literal.substr(2), 16);
        break;
      case "b": /* binary */
        return parseNumber(literal.substr(2), 2);
        break;
      default: /* octal */
       return parseNumber(literal.substr(1), 8);
      break;
    }
  } else {    /* decimal */
    return parseNumber(literal,10);
  }
}

I haven’t implemented the actual parsing code but it can be done by simple shifting, aka. normalizing but the other way around (error and bounds-checking assumed):

  • check for a radix marker (period). If we have one
    • move the radix marker to the right until it is gone, keep the count
    • subtract the count from the exponent by shifting right by
      • the count for base 2
      • the count/8 for base 8
      • the count/4 for base 16
    • subtract the remainder
  • If we have none we have an integer and can use JavaScript’s parseInt(from_string,from_base)

You might have to treat the exponent differently depending on the base but the number representation of the exponent is always as a signed decimal integer. I have left that out in the pseudo-code above but you might have guessed that already ;-)

Adventures of a Programmer: Parser Writing Peril XXVIII

The JISON Parser

Printing the AST

The Jison parser is not a rewrite of Bison and I am fully aware of that fact, but it is quite close, dangerously close! That’s one of the reasons it took me so long ;-)

The parser stands now and JISON gulps it without protest and while I’m writing this line it creeps slowly in my mind that I forgot to implement something to include another file. OK, should be simple, but tomorrow is another day, as the saying goes.

Nevertheless… Continue reading

Adventures of a Programmer: Parser Writing Peril XXVII

The Infamous Dangling Else

The Wikipedia-page describes it well enough, so no description of the problem is laid down in this post, only a short one of the solution chosen.

There are basically two different solutions:

  1. Follow Eelco Visser’s proposals (PDF!) and do some “magick”
  2. Write an unambiguous language

We choose b. Continue reading