Adventures of a Programmer: Parser Writing Peril XXV

Expressions

Expressions are quite expressive, I am not and start right on 😉

Expressions are all things done to everything literal. These are numbers, strings, tests (comparing), logic, bit operations, arithmetic operations, casting, a non-expression (null), identifier, and then some.

And now that I’m talking about it suddenly crossed my mind: there is a lack of fundamental logic in my grammar!

bool-true-literal  = %x74.72.75.65    ; true
bool-false-literal = %x66.61.6c.73.65 ; false
null-literal       = %x6e.75.6c.6c    ; null

Oops! 😉
Ok, but now:

literal  = string-literal     /
           decimal-literal    /
           imaginary-literal  /
           bool-true-literal  /
           bool-false-literal /
           null-literal

There are only two states for a bool literal, so we simplify here.

But let’s start at the start. We also need not to forget to sort the whole thing for precedence; there is not only precedence for arithmetic operators but for all things, we cannot leave anything out!

An example can be the simple arithmetic operation: 1 + 2 * 3. The precedence here is known well from primary school: “Multiplication before Addition”.

OK, and what exactly do you add and multiply, if you pardon the really bad pun? The numbers you say? But there are no numbers in the rule “Multiplication before Addition”! So you need to parse the numbers. When do you parse the numbers, before or after the operations? If you parse them after the operations you’ll loose the order (not really but let’s pretend we do), so you do it before, hence “Numbers before Operations”. Only after that the rule “Multiplication before Addition” applies.

Another examples are the logical operators e.g.: 3 - 2 && 2 - 2. You need to parse the numbers (variables in real life) first, then the arithmetic operands and then the logic. If you parse the logic first, then the numbers and then the operators we would get 3 - (2 && 2) - 2 instead of (3 - 2) && (2 - 2).

And on and on it goes; we need the precedence for every f…reaking thing we are going to implement.

Ooookay.

We had the most basic precedence level for the parser with the literals. The others, down to the atomic level (byte level) are for the lexer.
The next level form the basic data types, lets call it primary-expression (shamelessly stolen from somewhere. I think from the grammar of the old KDE JavaScript engine and/or the ECMA-standard and/or the examples from JISON and/or elsewhere).

primary-expression = identifier                  /
                     literal                     /
                     matrix-literal              /
                     (identifier "." identifier) /
                     "(" expr ")"

The first two should be clear, the third is our Matrix data type, the fourth one our Object, and the last one the grouping parentheses e.g.: (2 + 3) * 4
Some kind of parser algorithms might choke on it. You may look up “dangling else” with your prefered web-search-engine to find out everything about all of the gory details.
I might just drop it and do it in a completely different way because it boils down to being a simple variable, so redefining the identifier to something in the line of the following is much easier.

identifier       = identifier_base *("." identifier-base)
identifier-base  = identifier_start *identifier_part 
identifier_part  = identifier_start / (%x30-39) ; [_a-zA-Z][_a-zA-Z0-9]*
identifier_start = "_" *(%x41-5a / %x61-7a) ; [_a-zA-Z]

That means that _._ is a proper Object but actually using it gets frowned upon, ‘k?
The new primary-expression is now:

primary-expression = identifier     /
                     literal        /
                     matrix-literal /
                     "(" expr ")"

Calling a function should be next. We might put something in between but simplicity should win here.

call-expression = identifier arguments
arguments       = "(" ")"               /
                  "(" argument-list ")"
argument-list   = assignment-expression *( "," assignment-expression)

Let’s put it all together and call it…pheeeeouuuuu…left-hand-expression because it is on the leftmost side for operators.

left-hand-expression = primary-expression /
                       call-expression

One use for it is for the postfix incrementation. Some may be tempted to put it into the group of unary expressions but I have choosen the ECMAScript precedence table for simplicity.

postfix-expression = left-hand-expression         /
                     left-hand-expression "+" "+" /
                     left-hand-expression "-" "-"

Apropos unary expressions.

unary-expression = postfix-expression           /
                   "+" "+" unary-expression     /
                   "-" "-" unary-expression     /
                   "+" unary-expression         /
                   "-" unary-expression         / ; minus
                   "~" unary-expression         / ; tilde
                   "#" unary-expression         / ; cardinality (higher?)

And that’s how we handle the precedence rules here: by using the higher precedence as the start of the list.

Next in the order of PEMDAS precedence is exponentiation.

power-expression = unary-expression /
                   power-expression '^' unary-expression

No, wait, what with XOR? Ok, it’s doable, but waaay to complicated, so we use the common alternative “**”.

power-expression = unary-expression /
                   power-expression ("*" "*") unary-expression

Next are multiplication, division, and remainder, if I remember it correctly.

multiplicative-expression = power-expression                               /
                            multiplicative-expression "*" power-expression /
                            multiplicative-expression "/" power-expression /
                            multiplicative-expression ("/" "/") power-expression /
                            multiplicative-expression "%" power-expression 

The “//” is the operator for integer division. This kills this symbol for marking single line comments. The other common symbols for such comments are “#” which has already been used for cardinality and “;” which is used as a terminator symbol. Let’s see what’s left at the end but I do not want to waste precious symbols for shortcuts for mere comments.

Addition.

additive-expression = multiplicative-expression                         /
                      additive-expression "+" multiplicative-expression /
                      additive-expression "-" multiplicative-expression

Number field operators.
Pushing bits around wildly is allowed here but not elsewhere!

shift-expression = additive-expression                                /
                   shift-expression ("<" "<") additive-expression     /
                   shift-expression (">" ">" ">") additive-expression /
                   shift-expression (">" ">") additive-expression

Yes, size matters!

relational-expression = shift-expression                                 /
                        relational-expression "<" shift-expression       /
                        relational-expression ">" shift-expression       /
                        relational-expression ("<" "=") shift-expression /
                        relational-expression (">" "=") shift-expression

Seeking equality is a noble goal.

equality-expression = relational-expression                                   /
                      equality-expression ("=" "=") relational-expression     /
                      equality-expression ("!" "=") relational-expression     /
                      equality-expression ("=" "=" "=") relational-expression /
                      equality-expression ("!" "=" "=") relational-expression

With several closely related data-types and almost-but-not-quite-entirely-loose types a stricter measurement of equality is necessary.

Uhm, where have I been&hellip:yes, next round of number fields. This time bit-wise operations which is not really correct, these operation treat whole numbers but who cares, nomenclature is almost always arbitrary.

bitwise-and-expression = equality-expression                            /
                         bitwise-and-expression "&" equality-expression

Yes, you guessed right: these bit-wise operations have all their own level of precedence!

bitwise-xor-expression = bitwise-and-expression                         /
                         bitwise-and-expression "^" bitwise-and-expression

bitwise-or-expression = bitwise-xor-expression                         /
                        bitwise-or-expression "|" bitwise-xor-expression

The two logical expressions have their very own, individual precedence levels, too.

logical-and-expression = bitwise-or-expression                            /
                         logical-and-expression "|" bitwise-or-expression
logical-or-expression =  logical-and-expression                           /
                         logical-or-expression "|" logical-and-expression

The trinity, pardon, ternary operator? Ah, why not. Increases readability iff used right.

conditional-expression = logical-or-expression          /
                         ( logical-or-expression '?' 
                               assignment-expression ':'
                               assignment-expression )

Now that we met the term assignment-expression so often we should start to define it.

assignment-expression = conditional-expression /
                        ( left-hand-expression 
                            assignment-operator assignment-expression )

We need some assignment operators. We have many, with more to come, all of the same precedence level, hence the extra production for the list.

assignment-operator = "=" /
                       ("+" "=") / ("-" "=") /
                       ("*" "=") / ("/" "=") / ("%" "=") /
                       ("<" "<" "=") / (">" ">" "=") / (">" ">" ">" "=") /
                       ("&" "=") / ("^" "=") / ("|" "=")

Do we have all together now? I hope so 😉

expression = assignment-expression *("," assignment-expression)

We can do a lot with expressions alone but what we need are some really concrete statements.

The most basic of these statements is the block, a rectangular rock we can build everything on; the variable statement, flexible as a politician; the empty statement, as dead as my bank account; the expression needs some terminal, too, dontya think; if ifs and ands were pots and pans there’d be no need for people to invent the deep-drawing-press and get rich; all that goes on might even go on, too; and if not, just continue and pretend nothing had ever happened; take a break, take a nap; before you come back;the last one switches the lights off; the washing advice is printed on this scratchy thingy inside the shirt.

statement =  block /  variable-statement / empty-statement /
             expression-statement /  if-statement / iteration-statement /
             continue-statement / break-statement / return-statement /
             with-statement / switch-statement / labelled-statement

The production block is the catch-all container and triggers the automatic variable scoping.

var_a = x;     // scope n
{
  var_b = y;     // scope n+1
  {
    var_c = z;   // scope n+2
  }
  print(var_c);  // -> undefined
}
print(var_b);  // -> undefined
block = "{" "}"                 / ; empty block does not hurt but is useful
        "{" source-elements "}"

Statements may come in multiples, so let’s build a stable for the horde.

statement-list = statement *statement

JavaScript has a label for variable declarations but our language lacks all the whistle and bells of ECMAScript, we don’t need no stinkin’ label!

variable-statement        = variable-declaration-list ";"
variable-declaration-list = variable-declaration
                                *( "," variable-declaration)
variable-declaration      = *1( *1modifier / *1type-specifier) identifier 1*initializer
initializer               = "=" assignment-expression

A single semicolon shall not cause any trouble, sayeth Don Knuth!

empty-statement = ";"

Finalize the expression, shall we?

expression-statement = expression ";"

If wishes were fishes we were not so short of seafood.

if-statement   = ("i" "f") "(" expression ")" statement 1*else-statement
else-statement = ("e" "l" "s" "e") statement

Most parse algorithms need some extra massage to swallow the above.

I hope I don’t repeat myself here, if I say that all keywords and other reserved words need to be handled by the lexer and send to the parser as symbols!

iteration-statement = ("d" "o") statement ("w" "h" "i" "l" "e") 
                         "(" expression ")"  ";"                /

                      ("w" "h" "i" "l" "e") "(" expression ")"
                          statement                             /

                      ("f" "o" "r")
                         "("
                          ( *1expression / variable-declaration-list )
                            ";"
                          ( *1expression )
                            ";"
                          ( *1expression )
                         ")"
                         statement                              /

                      ("f" "o" "r")
                         "("
                          identifier ("i" "n") expression
                         ")"
                         statement

We shouldn’t jump around too much but sometimes it is necessary.

continue-statement = ("c" "o" "n" "t" "i" "n" "u" "e") ";"            /
                     ("c" "o" "n" "t" "i" "n" "u" "e") identifier ";"

break-statement = ("b" "r" "e" "a" "k") ";"           /
                  ("b" "r" "e" "a" "k") identifier ";"

I don’t know yet if I will support jumps in every direction and scope but let’s come back to the grammar.

return-statement = ("r" "e" "t" "u" "r" "n") ";" /
                   ("r" "e" "t" "u" "r" "n") expression ";"

The switch statements are a great opportunity for premature optimization!

switch-statement = ("s" "w" "i" "t" "c" "h") "(" expression ")" case-block
case-block   = "{" *1case-clauses "}"                              /
               "{" *1case-clauses default-clause *1case-clauses "}"
case-clauses = case-clause *(case-clause)
case-clause  = ("c" "a" "s" "e") expression ":" / ; fall through
               ("c" "a" "s" "e") expression ":" statement-list
default-clause = (""d" "e" "f" "a" "u" "l" "t") ":" / ; fall through
                 (""d" "e" "f" "a" "u" "l" "t") ":" statement-list

A sand-pit, with nails buried shallowly in it.

labelled-statement = identifier ":" statement

As it was for variable declarations, it is for functions declarations, too: not necessary to tell anybody that it is a function declaration, everybody with eyes can see that it is a function declaration, even the parser who has no eyes on its own sees that!!!!111!eleven!!

function-declaration = *1( *1modifier / *1type-specifier) 
                            (
                             identifier "(" ")" function-body               /
                             identifier "(" parameter-list ")" function-body
                            )
parameter-list = identifier *( "," identifier ) ; no defaults for now
function-body  = "{" "}"                 /
                 "{" source-elements "}"

The production program will be the starting point for the parser.

program = *1source-elements ; program can be empty

The last thing left to do is the definition of source-elements

source-elements = 1*source-element
source-element  = statement / function-declaration

Aaaand we are done!
Ok, it’s only the first, very small step to a ready and running implementation, but, at least the grammar stands. It is now the time to smooth out creases and mend fences and translate it into something JISON can handle. The format will be the one Bison understands, too, because I also want to built a C-version based on libtom* when time allows.
But definitely not this year, maybe next year? 😉

The next posting is about all the errors I found.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s