Tag Archives: grammar

Adventures of a Programmer: Parser Writing Peril XXXX

The final[1] grammar together with the lexer. The whole file with a lot of comments is on Github. It is currently in one file for convenience but needs to be split into a lexer (Flex) and a parser (Bison) later.

The file has a lot of comments already, not much to say here, but nevertheless… Continue reading

Adventures of a Programmer: Parser Writing Peril XXIX

Small, the lesser evil

We want to have a language with a lot of specialities. OK, but some specialities we want! These extras are additions to some basic language for doing arithmetics. No changes to the underlying grammar, only additions. It is still a large grammar but better manageable than the final grammar of Little.
And the final product will be usable! No silly toy language, barely able to be taken as an example you can learn something off, no, something you can work with. And if you have a license (yes I know, but it is for all practical purposes just that: a license) you can slap some nice UI on top and sell it for one-ninety-nine at the Apple-shop. The restriction “Apple-shop” holds because Android got GP/PARI ported, so it is of absolutely no use to try it there πŸ˜‰

But I digress, as usual, so down to the meat-halls now.

The Small language is a proper subset of the Little language which itself is nearly a subset of ECMAScript. It needs to be usable, si it needs free variables, functions (including function declarations, that is, building your own), loops (while, do…while,for), jumps (break, continue, switch), something to work with (operators), and something to work on (numbers, but only the double-precision numbers ECMAScript offers).
We need something to sample and massage the AST-nodes.
We need a lexer. Jison has a built-in Flex-like lexer, we use it.

Parser first. The Jison parser syntax is available in two flavours: JSON and Bison. We take the latter but we need to be aware of same small differences to the original.

The parser parses its own script bottom down, the parser itself bottom up (LARL to be exact but some other types are available, too), That means that we need to give it a start point. NB: I will never use abbreviations if it is avoidable. Especial when these abbreviations came for a time that measured available memory in kilobytes.

Somebody remembers the famous “640 kb is enough for all!”? It is said that Bill Gates said it. And now try to find a version of Windows that runs and is usable with 640 MB of memory.

I digress again? Yeah, get used to it, I do it quite often πŸ˜‰

Uhm…where have I been…ah, yes, the starting point. It needs a name and we will name it, with the help of all our wit and phantasy, program. Please admire the choice of the American spelling which will save us a full two characters!

The parser script can be parted in segments, divided by a doubled percent sign %% at the start of the line with nothing else. The directive of the starting point goes into the segment before the actual parser. Because of the direction of the parsing we should start with a node and go up in direction of the precedence according to the ECMAScript 5.1 standard. The easiest starting point for the parser script (not the parser, that starting point is already chosen to be program) based on the described path is a leaf. The out-most leaf is a so called literal, here a number. You don’t do anything with it (ok, we need to parse the number but we can use JavaScript for that here), you take it literally. Now we have a start of our parser-script:

/* root node */
%start program
    : "TRUETOKEN"          { $$ = new boolean_node(true,location(@1,@1)); }
    | "FALSETOKEN"         { $$ = new boolean_node(false,location(@1,@1)); }
    | "NUMBER_LITERAL"     { $$ = new number_node($1,location(@1,@1)); }          
    | "IMAGINARY_LITERAL"  { $$ = new imaginary_number_node($1,location(@1,@1)); }

Here we have a C++-like multiline comment, the definition of the starting point (can be abbreviated to %s program) the divisor and the first production. The all-uppercase identifiers are variable names from the lexer, they have to be put in quotes in Jison.
I have described the meaning of the rest in my last post but see your textbook for details (e.g.: O’Reilly’s Flex&Bison $29.99 for the print. Or ask Google for some information.)
I’ll leave out the cluttering location() function from now on.

The next production is one that actually does something instead of a mere collecting.

    : literal
    | "IDENTIFIER"         { $$ = new resolve_node($1));}
    | '(' expression ')'   { $$ = $2; }

It consists of the variable literal, the name of our first production which is already defined at that point, and the parenthesized group with the production name expression in it which isn’t defined yet. We use JavaScript directly here for the grouping by copying the content into the AST.
We can group primary_expression and a function call together as a left-hand expression. It makes not much sense now but is a nice place for additions to…uhm…add (e.g.: the details for the matrix-type).

    : primary_expression            { $$ = new primary_expression_node($1); }
    | "IDENTIFIER" arguments        { $$ = new function_call_node($1, $2); }

Arguments for a function can be none or many in number:

      '(' ')'                                  { $$ = []; }
    | '(' argument_list ')'                    { $$ = $2; }

    : assignment_expression                   { $$ = [$1]; }
    | argument_list ',' assignment_expression { $$ = $1.concat($3); }

Here we use JavaScript directly again and put the arguments in an easy to parse Array. We end this branch here and start a new one, but production assignment_expression still needs a definition, we must not forget it.
We have the left-hand expressions, we need something on the right, too and that is aptly called postfix-expression:

    : left_hand_expression
    | left_hand_expression "!"          { $$ = new postfix_node($1, "!"); }
    | left_hand_expression "PLUSPLUS"   { $$ = new postfix_node($1, "++"); }
    | left_hand_expression "MINUSMINUS" { $$ = new postfix_node($1, "--"); }

A parser for C++ cannot use the tokens as strings directly. Either use an enum (common way) or pass a pointer to it new postfix_node($1, *$2);. The latter is not quite useful here but there are some useful cases e.g.: the identifier where you need the actual string.
The exclamation-mark is for the combinatorial function factorial.
The exclamation-mark gets used at another place as a logical negation, too. It is possible to do it if you do it carefully like here:

    : postfix_expression
    | "PLUSPLUS" unary_expression     { $$ = new prefix_node("++", $2); }
    | "MINUSMINUS" unary_expression   { $$ = new prefix_node("--", $2); }
    | '+' unary_expression            { $$ = new unary_plus_node($2); }
    | '-' unary_expression            { $$ = new negate_node($2); }
    | '~' unary_expression            { $$ = new bitwise_not_node($2); }
    | '!' unary_expression            { $$ = new logical_not_node($2); }
    | '#' unary_expression            { $$ = new cardinality_node($2); }

The hash-character # acts like its node-name suggests and returns the cardinality of the following set. It is rarely a not-yet-implemented set but it can for example return the length of a not-yet-implemented array or the degree of a not-yet-implemented polynomial expression. If it is of only little use or too complicated to implement it gets dropped without hesitation.

The above shows the implementation of the order of precedence, too. The prefix’d incrementor comes before the postfix’d incrementor in both precedence and parser-script.
The next in precedence would be exponentiation. The common symbol for exponentiation is the caret ^ but that will be taken as the xor-operator symbol.
Some scripts use the doubled asterix ** as the power-symbol and so do we.
I repeat myself? Yeah, I’m old, I’m supposed to πŸ˜‰

  : unary_expression
  | power_expression POWER unary_expression { $$ = new power_node($1, $3); }

I find it very useful to have two kinds of division operators, one for floating point division and one for integer division. You might call it syntactic sugar but I call it useful.
Most languages I know implement it as a doubled division operator // and so do we.

    : power_expression
    | multiplicative_expression '*'  power_expression      { $$ = new mult_node($1, $3, "*"); }
    | multiplicative_expression '/'  power_expression      { $$ = new mult_node($1, $3, "/"); }
    | multiplicative_expression "INTDIV"  power_expression { $$ = new mult_node($1, $3, "//"); }
    | multiplicative_expression '%'  power_expression      { $$ = new mult_node($1, $3, "%"); }
    : multiplicative_expression
    | additive_expression '+' multiplicative_expression  { $$ = new add_node($1, $3, '+'); }
    | additive_expression '-' multiplicative_expression  { $$ = new add_node($1, $3, '-'); }

The lines get quite long, sorry but I failed in my choice of the WordPress theme and doing some line-breaks makes in less readable in my opinion.
Addition is quite boring and really the same, so I added it to the same snippet.

The multiplications by powers of two aka. shifting makes use of the difference between signed and unsigned shifting like in Java and JavaScript and probably some other languages, too.

    : additive_expression
    | shift_expression LSHIFT additive_expression  { $$ = new shift_node($1, "<<", $3); }
    | shift_expression RSHIFT additive_expression  { $$ = new shift_node($1, ">>", $3); }
    | shift_expression URSHIFT additive_expression { $$ = new shift_node($1, ">>>", $3); }

Now we are at the relational logic in the precedence order.

    : shift_expression
    | relational_expression '<' shift_expression { $$ = new relational_node($1, "<", $3); }
    | relational_expression '>' shift_expression { $$ = new relational_node($1, ">", $3); }
    | relational_expression LE shift_expression  { $$ = new relational_node($1, "<=", $3); }
    | relational_expression GE shift_expression  { $$ = new relational_node($1, ">=",$3);}

    : relational_expression
    | equality_expression EQEQ relational_expression   
                                      { $$ = new equal_node($1, "==", $3); }
    | equality_expression NE relational_expression     
                                      { $$ = new equal_node($1, "!=", $3); }
    | equality_expression STREQ relational_expression  
                                      { $$ = new equal_node($1, "===", $3); }
    | equality_expression STRNEQ relational_expression 
                                      { $$ = new equal_node($1, "!==", $3);}

Yes, that formatting is with some line-breaks. Hmm…

We need some bit manipulations.

    : equality_expression
    | bitwise_and_expression '&' equality_expression
                                     { $$ = new bit_oper_node($1, "&", $3); }
    : bitwise_and_expression
    | bitwise_xor_expression '^' bitwise_and_expression
                                     { $$ = new bit_oper_node($1, "^", $3); }
    : bitwise_xor_expression
    | bitwise_or_expression '|' bitwise_xor_expression
                                     { $$ = new bit_oper_node($1, "|", $3); }

And some more logic.

    : bitwise_or_expression
    | logical_and_expression AND bitwise_or_expression
                                     { $$ = new binary_logical_node($1, "&&", $3); }
    : logical_and_expression
    | logical_or_expression OR logical_and_expression
                                     { $$ = new binary_logical_node($1,"||", $3); }

The infamous ternary expression.

    : logical_or_expression
    | logical_or_expression '?' assignment_expression ':' assignment_expression
                                     { $$ = new conditional_node($1, $3, $5); }

And finally we need the assignement-expression defined.

    : conditional_expression
    | left_hand_expression assignment_operator assignment_expression
                                     { $$ = new assign_node($1, $2, $3);}
    : '='            { $$ = "=" ; }
    | "PLUSEQUAL"    { $$ = "+=" ; }
    | "MINUSEQUAL"   { $$ = "-=" ; }
    | "POWEREQ"      { $$ = "**="; }
    | "MULTEQUAL"    { $$ = "*="; }
    | "INTDIVEQUAL"  { $$ = "//="; }
    | "DIVEQUAL"     { $$ = "/="; }
    | "LSHIFTEQUAL"  { $$ = "<<="; }
    | "RSHIFTEQUAL"  { $$ = ">>="; }
    | "URSHIFTEQUAL" { $$ = ">>>="; }
    | "ANDEQUAL"     { $$ = "&="; }
    | "XOREQUAL"     { $$ = "^="; }
    | "OREQUAL"      { $$ = "|="; }
    | "MODEQUAL"     { $$ = "%="; }

The assignement-expression looks curious as if it cannot work at all and loops infinitely but the first expression goes to conditional_expression which itself the first expression logical_or_expression which itself…got it? If not, ask somebody who did πŸ˜‰

The production expression is defined as a comma separated list. Not necessarily needed but useful.

The thing is closed here, you can implement it without much additional work and have something like the basic calculator from the Bison/Jison examples with variables and functions (if you implement variables and functions) but it isn’t even Turing-complete—you cannot move the tape freely, you cannot jump.

Let’s take a jump to the left, and then a step to the ri-i-i-i-ight. Sorry.

So we need some if’s and else’s, and some loops, too; we need statements, with some content, like every good politician pretends to utter.

Apropos politicians: it is midnight here in western Germany so lets take a look in the northern direction, where the storms come from and see if a storm is there.
The New York Times says “don’t know anything before Friday morning CEST” although needing several pages to do so and the Daily Mail…no, I don’t want to know what the Daily Mail says or more exactly: I don’t care what the Daily Mail says.

What? Hey, I said I digress often, didn’t I?

And now I lost the thread…yes, thank you, statements it was.

    : block
    | variable_statement
    | empty_statement
    | expression_statement
    | if_statement
    | iteration_statement
    | continue_statement
    | break_statement
    | return_statement
    | switch_statement
    | labelled_statement

The naming of the individual statements should be telling enough but I’ll comment in details if necessary.

    : "OPENBRACE" source_elements "CLOSEBRACE"   { $$ = new block_node($2); }

The block-statement is special as it does nothing on itself, it is just a syntactically neutral grouping and source_elements can be anything including the block-statement.
What it can be used for is to manipulate the scope if you increment the scope after every opening brace and decrement after every closing brace (it is a bit more complicated than that but basically…).

The following one is a bit more interesting, I think, the declaration of the variables.

    : "LET" variable_declaration_list ';' { $$ = new var_statement_node($2); }
    : variable_declaration                 { $$ = [$1]; }
    | variable_declaration_list ',' _declaration 
                                           { $$ = $1.concat($3); }
    : "IDENTIFIER"                      { $$ = new var_decl_node($1, 0) }
    | "IDENTIFIER" initializer          { $$ = new var_decl_node($1, $2) }
    : '=' assignment_expression       { $$ = $2; }

The keyword for the variable declaration is let.
We could do without a keyword but it is simpler and cleaner that way.
Example let a=0,b=c=d=gammaln(e),f,g,h;

We need a list of statements (later for the case/default in switch ), an empty statement and and expression statement

  : statement                  { $$ = [$1]; }
  | statement_list statement   { $$ = $1.concat($2); }
    : ';'                      { $$ = new empty_statement_node(); }
    expression ';'             { $$ = new expr_statement_node($1); }

The next statements are related to branching.

    : "IF" '(' expression ')' "OPENBRACE" statement "CLOSEBRACE" 
                               { $$ = new if_node($3,$5); }
    | "IF" '(' expression ')' "OPENBRACE" statement "CLOSEBRACE" 
                         ELSE "OPENBRACE" statement "CLOSEBRACE"
                               { $$ = new if_node($3,$5,$7);}

No dangling else problem here because of the embracing. Some people will not like it but it is easy to change and I like it the way I defined it. One of the advantages if you write it yourself. One of the advantages of Open-Source: you can change it if you want and don’t have to ask me.
It is cleaner, too.
The iteration statement is named that way because it consist of a group of iterating statements.
Clever, hu? πŸ˜‰

    : "DO" "OPENBRACE" statement "CLOSEBRACE" "WHILE" '(' expression ')' ';'
                                       { $$=new do_while_node($3,$7);}
    | "WHILE" '(' expression ')' "OPENBRACE" statement "CLOSEBRACE"
                                       { $$ = new while_node($3,$5); }
    | "FOR" '(' expression_optional ';' expression_optional ';' expression_optional ')'
        "OPENBRACE" statement "CLOSEBRACE" 
                                       { $$ = new for_node($3,$5,$7,$10);}
    :                                  { $$ = " "; }
    | expression                       { $$ = $1; }

The for-loop is build like most other for-loops, especially ECMAScript’s for-loop, hence the space-character put into the AST, such that the assembler has not much work to do here.
The embracing of the following statements is not optional here as it was not optional in the if-else-branching. It had been done there to avoid a conflict, here it is done mainly for simplicity and a clean syntax. Yes, I know that I repeat myself.

Next come the jump-statements.

    : "CONTINUE" ';'                     { $$ = new continue_node();}
    | "CONTINUE" "IDENTIFIER" ';'        { $$ = new continue_node($2); }
    : "BREAK" ';'                        { $$ = new break_node();}
    | "BREAK" "IDENTIFIER" ';'           { $$ = new break_node($2);}
    : "RETURN" ';'                       { $$ = new return_node();}
    | "RETURN" expression ';'            { $$ = new return_node($2); }

The jumps can only go out (break, return) or backwards (continue). Jumping forward (goto) is not implemented in ECMAScript (yet?) so for the reason of simplicity we don’t do it either.

The switch statement is as it is implemented in JavaScript with the addition of the forced embracing.

    : "SWITCH" '(' expression ')' case_block { $$ = new switch_node($3, $5);}
    : "OPENBRACE" case_clauses "CLOSEBRACE"
                              { $$ = $2; }
    | "OPENBRACE" case_clauses default_clause case_clauses "CLOSEBRACE"
                              { $$ = $2.concat($3).concat($4); }
    :      /* nothing */       { $$ = []; }
    | case_clauses case_clause { $$ = $1.concat($2); }
    : "CASE" expression ':'                   { $$ = new case_clause_node($2, " "); }
    | "CASE" expression ':' statement_list    { $$ = new case_clause_node($2, $4); }
    : "DEFAULT" ':'                           { $$ = new case_clause_node(" ", " "); }
    | "DEFAULT" ':' statement_list            { $$ = new case_clause_node(" ", $3); }

The jumps can jump into something and that something is called a label

    : "IDENTIFIER" ':' statement              { $$ = new label_node(*$1, $3);}

We need some function declarations, with and without arguments

           function_body "CLOSEBRACE"        /* check for [].length to differ */
                                           { $$ = new func_decl_node($2,$3,[],$6); }
    | "DEFINE" "IDENTIFIER" '(' parameter_list ')' "OPENBRACE"
           function_body "CLOSEBRACE"    
                                           { $$ = new func_decl_node($2,$3,$5, $7); }
    : "IDENTIFIER"                          { $$ = new parameter_list_node($1); }
    | parameter_list ','  "IDENTIFIER"      { $$ = new parameter_list_node($1, $3);}
    : source_elements

The keyword for the declaration of a function is define and the syntax is a subset of the ECMAScript variant with the addition of the forced embracement.

The whole program consists of source-elements, a list of the catch-all statement followed by the end-of-file marker.

    : source_elements EOF               { $$ = new program_node($1); return $$;}

The source-elements list consists of source-elements or nothing.

    :                                   { $$ = []; }
    | source_elements source_element    { $$ = $1.concat($2)}

A single source-element consists of statements and function_declarations which are already defined and that means: we are done.

    : statement
    | function_declaration
%% /* AST handling code comes after this line */

Done with the parser-script that is πŸ˜‰

If you want to go on yourself (I caught a cold and will need some aspirin and a bed now, it get to be a long day tomorrowday), here are some little helpers:

perl -nle 'print "function $1\{\n\n\}" if /new\s+(\w+\([0-9,$ ]*\))/' small.jison

A shell one-liner to grab all of the node-objects and make functions (constructors actually) out of them:
If you want to use it like C. Ihrig the following prints the necessary code:

perl -nle 'print "parser.ast.$1 = $1;" if /new\s+(\w+\([0-9,$ ]*\))/' small.jison

The code to make a IEEE-754 number out of the string representation (I did that already, didn’t I?) is (without hexadecimal or other representations to keep it simple):

function parseNumericLiteral(literal) {
  return parseNumber(literal);
function parseImaginaryLiteral(literal) {
  return parseNumericLiteral(literal.
function parseNumber(string){
  return Number(string);

It is a bit more complicated than necessary, but that way you can add other representations more easily. To give a hint:

function parseNumericLiteral(literal) {
  if (literal.charAt(0) === "0") {
    switch(literal.charAt(1).toLowerCase()) {
      case "x": /* hexadecimal */
        return parseNumber(literal.substr(2), 16);
      case "b": /* binary */
        return parseNumber(literal.substr(2), 2);
      default: /* octal */
       return parseNumber(literal.substr(1), 8);
  } else {    /* decimal */
    return parseNumber(literal,10);

I haven’t implemented the actual parsing code but it can be done by simple shifting, aka. normalizing but the other way around (error and bounds-checking assumed):

  • check for a radix marker (period). If we have one
    • move the radix marker to the right until it is gone, keep the count
    • subtract the count from the exponent by shifting right by
      • the count for base 2
      • the count/8 for base 8
      • the count/4 for base 16
    • subtract the remainder
  • If we have none we have an integer and can use JavaScript’s parseInt(from_string,from_base)

You might have to treat the exponent differently depending on the base but the number representation of the exponent is always as a signed decimal integer. I have left that out in the pseudo-code above but you might have guessed that already πŸ˜‰

Adventures of a Programmer: Parser Writing Peril XXVIII

The JISON Parser

Printing the AST

The Jison parser is not a rewrite of Bison and I am fully aware of that fact, but it is quite close, dangerously close! That’s one of the reasons it took me so long πŸ˜‰

The parser stands now and JISON gulps it without protest and while I’m writing this line it creeps slowly in my mind that I forgot to implement something to include another file. OK, should be simple, but tomorrow is another day, as the saying goes.

Nevertheless… Continue reading

Adventures of a Programmer: Parser Writing Peril XXVII

The Infamous Dangling Else

The Wikipedia-page describes it well enough, so no description of the problem is laid down in this post, only a short one of the solution chosen.

There are basically two different solutions:

  1. Follow Eelco Visser’s proposals (PDF!) and do some “magick”
  2. Write an unambiguous language

We choose b. Continue reading

Adventures of a Programmer: Parser Writing Peril XXVI

Refinement of the Grammar

Ok, some repair was also involved, I have to admit it but I am not that kind of genius who has it right the first time and always the first time! Are there any out there?
But back to the point—as if I ever had one—and the details:

The new grammar is available at

  • Addition of the Object because I forgot it
  • Changed the function-definition and put it into the statement list and order
  • Ripped the cf-literal out, was to much trouble to get it differ from the matrix but if you have any idea…
  • added some syntactic sugar (shamelessly stolen from GP/PARI), namely
    • forprime
    • fordiv
    • sumdiv
    • sum
    • prod

    They have all the same syntax, so it is easy to add more (e.g.: intnum, intcirc, sumalt etc.)

      "(" variable-declaration  ";"
           ( *1expression )


    let primorial = 1;
      primorial = primorial * start;

    Must print the primorial of 100, which is 2305567963945518424753102147331756070. Please, do not compute a primorial this way, there are faster methods to do it. There will be a large library, too, with a lot of these functions if a better method than the most simple is available.

  • Addition of a keyword for variable and function declaration. More than half of the syntax was stolen from ECMAScript already, so why not take a little bit more? But to my very disappointment that more made it necessary to add these keywords
    • If I do not find any other problems this grammar is the final grammar until somebody finds bugs and other inconsistencies and we need a new one. The latter is the most likely future to expect.

      Next: the actual JISON-parser with code to grab and print the resulting AST. At least that’s planned for.

Adventures of a Programmer: Parser Writing Peril XXIV

If you are wondering about the numbering scheme: I forgot to use the 14 and hate to waste things.

The grammar was finished in the last posting? No, it was not, of course not πŸ˜‰
I added the Matrix data type but forgot to add something to read it back in. That is hopefully repaired now. The Object thingie had been changed back from being hardcoded into the identifier to being parsed regularly.
The grammar validates with abnf and abnfgen which is also a generator for tests. The latter does not alway work for our grammar (goes into a very large, maybe infinite loop) which shows something. I don’ know what exactly but it shows something ;-). The program called Bill’s ABNF Parser, a tool over at the IETF webpage is quite nit-picky but all over OK with it.

Uploaded to Little

Next: use of the grammar to write a parser that prints the AST. Once we have the AST, the rest are mere details. No, of course not, that would be nice, but it is a very good start.

Adventures of a Programmer: Parser Writing Peril XXV


Expressions are quite expressive, I am not and start right on πŸ˜‰

Expressions are all things done to everything literal. These are numbers, strings, tests (comparing), logic, bit operations, arithmetic operations, casting, a non-expression (null), identifier, and then some.

And now that I’m talking about it suddenly crossed my mind: there is a lack of fundamental logic in my grammar!

bool-true-literal  = %x74.72.75.65    ; true
bool-false-literal = %x66.61.6c.73.65 ; false
null-literal       = %x6e.75.6c.6c    ; null

Oops! πŸ˜‰
Ok, but now: Continue reading