Tag Archives: Parser

Adventures of a Programmer: Parser Writing Peril XXXX

The final[1] grammar together with the lexer. The whole file with a lot of comments is on Github. It is currently in one file for convenience but needs to be split into a lexer (Flex) and a parser (Bison) later.

The file has a lot of comments already, not much to say here, but nevertheless… Continue reading

Advertisements

Adventures of a Programmer: Parser Writing Peril XXIX

Small, the lesser evil

We want to have a language with a lot of specialities. OK, but some specialities we want! These extras are additions to some basic language for doing arithmetics. No changes to the underlying grammar, only additions. It is still a large grammar but better manageable than the final grammar of Little.
And the final product will be usable! No silly toy language, barely able to be taken as an example you can learn something off, no, something you can work with. And if you have a license (yes I know, but it is for all practical purposes just that: a license) you can slap some nice UI on top and sell it for one-ninety-nine at the Apple-shop. The restriction “Apple-shop” holds because Android got GP/PARI ported, so it is of absolutely no use to try it there πŸ˜‰

But I digress, as usual, so down to the meat-halls now.

The Small language is a proper subset of the Little language which itself is nearly a subset of ECMAScript. It needs to be usable, si it needs free variables, functions (including function declarations, that is, building your own), loops (while, do…while,for), jumps (break, continue, switch), something to work with (operators), and something to work on (numbers, but only the double-precision numbers ECMAScript offers).
We need something to sample and massage the AST-nodes.
We need a lexer. Jison has a built-in Flex-like lexer, we use it.

Parser first. The Jison parser syntax is available in two flavours: JSON and Bison. We take the latter but we need to be aware of same small differences to the original.

The parser parses its own script bottom down, the parser itself bottom up (LARL to be exact but some other types are available, too), That means that we need to give it a start point. NB: I will never use abbreviations if it is avoidable. Especial when these abbreviations came for a time that measured available memory in kilobytes.

Somebody remembers the famous “640 kb is enough for all!”? It is said that Bill Gates said it. And now try to find a version of Windows that runs and is usable with 640 MB of memory.

I digress again? Yeah, get used to it, I do it quite often πŸ˜‰

Uhm…where have I been…ah, yes, the starting point. It needs a name and we will name it, with the help of all our wit and phantasy, program. Please admire the choice of the American spelling which will save us a full two characters!

The parser script can be parted in segments, divided by a doubled percent sign %% at the start of the line with nothing else. The directive of the starting point goes into the segment before the actual parser. Because of the direction of the parsing we should start with a node and go up in direction of the precedence according to the ECMAScript 5.1 standard. The easiest starting point for the parser script (not the parser, that starting point is already chosen to be program) based on the described path is a leaf. The out-most leaf is a so called literal, here a number. You don’t do anything with it (ok, we need to parse the number but we can use JavaScript for that here), you take it literally. Now we have a start of our parser-script:

/* root node */
%start program
%%
literal
    : "TRUETOKEN"          { $$ = new boolean_node(true,location(@1,@1)); }
    | "FALSETOKEN"         { $$ = new boolean_node(false,location(@1,@1)); }
    | "NUMBER_LITERAL"     { $$ = new number_node($1,location(@1,@1)); }          
    | "IMAGINARY_LITERAL"  { $$ = new imaginary_number_node($1,location(@1,@1)); }
    ;

Here we have a C++-like multiline comment, the definition of the starting point (can be abbreviated to %s program) the divisor and the first production. The all-uppercase identifiers are variable names from the lexer, they have to be put in quotes in Jison.
I have described the meaning of the rest in my last post but see your textbook for details (e.g.: O’Reilly’s Flex&Bison $29.99 for the print. Or ask Google for some information.)
I’ll leave out the cluttering location() function from now on.

The next production is one that actually does something instead of a mere collecting.

primary_expression
    : literal
    | "IDENTIFIER"         { $$ = new resolve_node($1));}
    | '(' expression ')'   { $$ = $2; }
    ;

It consists of the variable literal, the name of our first production which is already defined at that point, and the parenthesized group with the production name expression in it which isn’t defined yet. We use JavaScript directly here for the grouping by copying the content into the AST.
We can group primary_expression and a function call together as a left-hand expression. It makes not much sense now but is a nice place for additions to…uhm…add (e.g.: the details for the matrix-type).

left_hand_expression
    : primary_expression            { $$ = new primary_expression_node($1); }
    | "IDENTIFIER" arguments        { $$ = new function_call_node($1, $2); }
    ;

Arguments for a function can be none or many in number:

arguments:
      '(' ')'                                  { $$ = []; }
    | '(' argument_list ')'                    { $$ = $2; }
    ;

argument_list
    : assignment_expression                   { $$ = [$1]; }
    | argument_list ',' assignment_expression { $$ = $1.concat($3); }

Here we use JavaScript directly again and put the arguments in an easy to parse Array. We end this branch here and start a new one, but production assignment_expression still needs a definition, we must not forget it.
We have the left-hand expressions, we need something on the right, too and that is aptly called postfix-expression:

postfix_expression
    : left_hand_expression
    | left_hand_expression "!"          { $$ = new postfix_node($1, "!"); }
    | left_hand_expression "PLUSPLUS"   { $$ = new postfix_node($1, "++"); }
    | left_hand_expression "MINUSMINUS" { $$ = new postfix_node($1, "--"); }

A parser for C++ cannot use the tokens as strings directly. Either use an enum (common way) or pass a pointer to it new postfix_node($1, *$2);. The latter is not quite useful here but there are some useful cases e.g.: the identifier where you need the actual string.
The exclamation-mark is for the combinatorial function factorial.
The exclamation-mark gets used at another place as a logical negation, too. It is possible to do it if you do it carefully like here:

unary_expression
    : postfix_expression
    | "PLUSPLUS" unary_expression     { $$ = new prefix_node("++", $2); }
    | "MINUSMINUS" unary_expression   { $$ = new prefix_node("--", $2); }
    | '+' unary_expression            { $$ = new unary_plus_node($2); }
    | '-' unary_expression            { $$ = new negate_node($2); }
    | '~' unary_expression            { $$ = new bitwise_not_node($2); }
    | '!' unary_expression            { $$ = new logical_not_node($2); }
    | '#' unary_expression            { $$ = new cardinality_node($2); }
    ;

The hash-character # acts like its node-name suggests and returns the cardinality of the following set. It is rarely a not-yet-implemented set but it can for example return the length of a not-yet-implemented array or the degree of a not-yet-implemented polynomial expression. If it is of only little use or too complicated to implement it gets dropped without hesitation.

The above shows the implementation of the order of precedence, too. The prefix’d incrementor comes before the postfix’d incrementor in both precedence and parser-script.
The next in precedence would be exponentiation. The common symbol for exponentiation is the caret ^ but that will be taken as the xor-operator symbol.
Some scripts use the doubled asterix ** as the power-symbol and so do we.
I repeat myself? Yeah, I’m old, I’m supposed to πŸ˜‰

power_expression
  : unary_expression
  | power_expression POWER unary_expression { $$ = new power_node($1, $3); }
  ;

I find it very useful to have two kinds of division operators, one for floating point division and one for integer division. You might call it syntactic sugar but I call it useful.
Most languages I know implement it as a doubled division operator // and so do we.

multiplicative_expression
    : power_expression
    | multiplicative_expression '*'  power_expression      { $$ = new mult_node($1, $3, "*"); }
    | multiplicative_expression '/'  power_expression      { $$ = new mult_node($1, $3, "/"); }
    | multiplicative_expression "INTDIV"  power_expression { $$ = new mult_node($1, $3, "//"); }
    | multiplicative_expression '%'  power_expression      { $$ = new mult_node($1, $3, "%"); }
    ;
additive_expression
    : multiplicative_expression
    | additive_expression '+' multiplicative_expression  { $$ = new add_node($1, $3, '+'); }
    | additive_expression '-' multiplicative_expression  { $$ = new add_node($1, $3, '-'); }
    ;

The lines get quite long, sorry but I failed in my choice of the WordPress theme and doing some line-breaks makes in less readable in my opinion.
Addition is quite boring and really the same, so I added it to the same snippet.

The multiplications by powers of two aka. shifting makes use of the difference between signed and unsigned shifting like in Java and JavaScript and probably some other languages, too.

shift_expression
    : additive_expression
    | shift_expression LSHIFT additive_expression  { $$ = new shift_node($1, "<<", $3); }
    | shift_expression RSHIFT additive_expression  { $$ = new shift_node($1, ">>", $3); }
    | shift_expression URSHIFT additive_expression { $$ = new shift_node($1, ">>>", $3); }
    ;

Now we are at the relational logic in the precedence order.

relational_expression
    : shift_expression
    | relational_expression '<' shift_expression { $$ = new relational_node($1, "<", $3); }
    | relational_expression '>' shift_expression { $$ = new relational_node($1, ">", $3); }
    | relational_expression LE shift_expression  { $$ = new relational_node($1, "<=", $3); }
    | relational_expression GE shift_expression  { $$ = new relational_node($1, ">=",$3);}
    ;

equality_expression
    : relational_expression
    | equality_expression EQEQ relational_expression   
                                      { $$ = new equal_node($1, "==", $3); }
    | equality_expression NE relational_expression     
                                      { $$ = new equal_node($1, "!=", $3); }
    | equality_expression STREQ relational_expression  
                                      { $$ = new equal_node($1, "===", $3); }
    | equality_expression STRNEQ relational_expression 
                                      { $$ = new equal_node($1, "!==", $3);}
    ;

Yes, that formatting is with some line-breaks. Hmm…

We need some bit manipulations.

bitwise_and_expression
    : equality_expression
    | bitwise_and_expression '&' equality_expression
                                     { $$ = new bit_oper_node($1, "&", $3); }
    ;
bitwise_xor_expression
    : bitwise_and_expression
    | bitwise_xor_expression '^' bitwise_and_expression
                                     { $$ = new bit_oper_node($1, "^", $3); }
    ;
bitwise_or_expression
    : bitwise_xor_expression
    | bitwise_or_expression '|' bitwise_xor_expression
                                     { $$ = new bit_oper_node($1, "|", $3); }
    ;

And some more logic.

logical_and_expression
    : bitwise_or_expression
    | logical_and_expression AND bitwise_or_expression
                                     { $$ = new binary_logical_node($1, "&&", $3); }
    ;
logical_or_expression
    : logical_and_expression
    | logical_or_expression OR logical_and_expression
                                     { $$ = new binary_logical_node($1,"||", $3); }
    ;

The infamous ternary expression.

conditional_expression
    : logical_or_expression
    | logical_or_expression '?' assignment_expression ':' assignment_expression
                                     { $$ = new conditional_node($1, $3, $5); }
    ;

And finally we need the assignement-expression defined.

assignment_expression
    : conditional_expression
    | left_hand_expression assignment_operator assignment_expression
                                     { $$ = new assign_node($1, $2, $3);}
    ;
assignment_operator
    : '='            { $$ = "=" ; }
    | "PLUSEQUAL"    { $$ = "+=" ; }
    | "MINUSEQUAL"   { $$ = "-=" ; }
    | "POWEREQ"      { $$ = "**="; }
    | "MULTEQUAL"    { $$ = "*="; }
    | "INTDIVEQUAL"  { $$ = "//="; }
    | "DIVEQUAL"     { $$ = "/="; }
    | "LSHIFTEQUAL"  { $$ = "<<="; }
    | "RSHIFTEQUAL"  { $$ = ">>="; }
    | "URSHIFTEQUAL" { $$ = ">>>="; }
    | "ANDEQUAL"     { $$ = "&="; }
    | "XOREQUAL"     { $$ = "^="; }
    | "OREQUAL"      { $$ = "|="; }
    | "MODEQUAL"     { $$ = "%="; }
    ;

The assignement-expression looks curious as if it cannot work at all and loops infinitely but the first expression goes to conditional_expression which itself the first expression logical_or_expression which itself…got it? If not, ask somebody who did πŸ˜‰

The production expression is defined as a comma separated list. Not necessarily needed but useful.

The thing is closed here, you can implement it without much additional work and have something like the basic calculator from the Bison/Jison examples with variables and functions (if you implement variables and functions) but it isn’t even Turing-complete—you cannot move the tape freely, you cannot jump.

Let’s take a jump to the left, and then a step to the ri-i-i-i-ight. Sorry.

So we need some if’s and else’s, and some loops, too; we need statements, with some content, like every good politician pretends to utter.

Apropos politicians: it is midnight here in western Germany so lets take a look in the northern direction, where the storms come from and see if a storm is there.
The New York Times says “don’t know anything before Friday morning CEST” although needing several pages to do so and the Daily Mail…no, I don’t want to know what the Daily Mail says or more exactly: I don’t care what the Daily Mail says.

What? Hey, I said I digress often, didn’t I?

And now I lost the thread…yes, thank you, statements it was.

statement
    : block
    | variable_statement
    | empty_statement
    | expression_statement
    | if_statement
    | iteration_statement
    | continue_statement
    | break_statement
    | return_statement
    | switch_statement
    | labelled_statement
    ;

The naming of the individual statements should be telling enough but I’ll comment in details if necessary.

block
    : "OPENBRACE" source_elements "CLOSEBRACE"   { $$ = new block_node($2); }
    ;

The block-statement is special as it does nothing on itself, it is just a syntactically neutral grouping and source_elements can be anything including the block-statement.
What it can be used for is to manipulate the scope if you increment the scope after every opening brace and decrement after every closing brace (it is a bit more complicated than that but basically…).

The following one is a bit more interesting, I think, the declaration of the variables.

variable_statement
    : "LET" variable_declaration_list ';' { $$ = new var_statement_node($2); }
    ;
variable_declaration_list
    : variable_declaration                 { $$ = [$1]; }
    | variable_declaration_list ',' _declaration 
                                           { $$ = $1.concat($3); }
    ;
variable_declaration
    : "IDENTIFIER"                      { $$ = new var_decl_node($1, 0) }
    | "IDENTIFIER" initializer          { $$ = new var_decl_node($1, $2) }
    ;
initializer
    : '=' assignment_expression       { $$ = $2; }
    ;

The keyword for the variable declaration is let.
We could do without a keyword but it is simpler and cleaner that way.
Example let a=0,b=c=d=gammaln(e),f,g,h;

We need a list of statements (later for the case/default in switch ), an empty statement and and expression statement

statement_list
  : statement                  { $$ = [$1]; }
  | statement_list statement   { $$ = $1.concat($2); }
  ;
empty_statement
    : ';'                      { $$ = new empty_statement_node(); }
    ;
expression_statement:
    expression ';'             { $$ = new expr_statement_node($1); }
    ;

The next statements are related to branching.

if_statement
    : "IF" '(' expression ')' "OPENBRACE" statement "CLOSEBRACE" 
                               { $$ = new if_node($3,$5); }
    | "IF" '(' expression ')' "OPENBRACE" statement "CLOSEBRACE" 
                         ELSE "OPENBRACE" statement "CLOSEBRACE"
                               { $$ = new if_node($3,$5,$7);}
    ;

No dangling else problem here because of the embracing. Some people will not like it but it is easy to change and I like it the way I defined it. One of the advantages if you write it yourself. One of the advantages of Open-Source: you can change it if you want and don’t have to ask me.
It is cleaner, too.
The iteration statement is named that way because it consist of a group of iterating statements.
Clever, hu? πŸ˜‰

iteration_statement
    : "DO" "OPENBRACE" statement "CLOSEBRACE" "WHILE" '(' expression ')' ';'
                                       { $$=new do_while_node($3,$7);}
    | "WHILE" '(' expression ')' "OPENBRACE" statement "CLOSEBRACE"
                                       { $$ = new while_node($3,$5); }
    | "FOR" '(' expression_optional ';' expression_optional ';' expression_optional ')'
        "OPENBRACE" statement "CLOSEBRACE" 
                                       { $$ = new for_node($3,$5,$7,$10);}
    ;
expression_optional
    :                                  { $$ = " "; }
    | expression                       { $$ = $1; }
    ;

The for-loop is build like most other for-loops, especially ECMAScript’s for-loop, hence the space-character put into the AST, such that the assembler has not much work to do here.
The embracing of the following statements is not optional here as it was not optional in the if-else-branching. It had been done there to avoid a conflict, here it is done mainly for simplicity and a clean syntax. Yes, I know that I repeat myself.

Next come the jump-statements.

continue_statement
    : "CONTINUE" ';'                     { $$ = new continue_node();}
    | "CONTINUE" "IDENTIFIER" ';'        { $$ = new continue_node($2); }
    ;
break_statement
    : "BREAK" ';'                        { $$ = new break_node();}
    | "BREAK" "IDENTIFIER" ';'           { $$ = new break_node($2);}
    ;
return_statement
    : "RETURN" ';'                       { $$ = new return_node();}
    | "RETURN" expression ';'            { $$ = new return_node($2); }
    ;

The jumps can only go out (break, return) or backwards (continue). Jumping forward (goto) is not implemented in ECMAScript (yet?) so for the reason of simplicity we don’t do it either.

The switch statement is as it is implemented in JavaScript with the addition of the forced embracing.

switch_statement
    : "SWITCH" '(' expression ')' case_block { $$ = new switch_node($3, $5);}
    ;
case_block
    : "OPENBRACE" case_clauses "CLOSEBRACE"
                              { $$ = $2; }
    | "OPENBRACE" case_clauses default_clause case_clauses "CLOSEBRACE"
                              { $$ = $2.concat($3).concat($4); }
    ;
case_clauses
    :      /* nothing */       { $$ = []; }
    | case_clauses case_clause { $$ = $1.concat($2); }
    ;
case_clause
    : "CASE" expression ':'                   { $$ = new case_clause_node($2, " "); }
    | "CASE" expression ':' statement_list    { $$ = new case_clause_node($2, $4); }
    ;
default_clause
    : "DEFAULT" ':'                           { $$ = new case_clause_node(" ", " "); }
    | "DEFAULT" ':' statement_list            { $$ = new case_clause_node(" ", $3); }
    ;

The jumps can jump into something and that something is called a label

labelled_statement
    : "IDENTIFIER" ':' statement              { $$ = new label_node(*$1, $3);}
    ;

We need some function declarations, with and without arguments

function_declaration
    : "DEFINE" "IDENTIFIER" '(' ')' "OPENBRACE"
           function_body "CLOSEBRACE"        /* check for [].length to differ */
                                           { $$ = new func_decl_node($2,$3,[],$6); }
    | "DEFINE" "IDENTIFIER" '(' parameter_list ')' "OPENBRACE"
           function_body "CLOSEBRACE"    
                                           { $$ = new func_decl_node($2,$3,$5, $7); }
    ;
parameter_list
    : "IDENTIFIER"                          { $$ = new parameter_list_node($1); }
    | parameter_list ','  "IDENTIFIER"      { $$ = new parameter_list_node($1, $3);}
    ;
function_body
    : source_elements
    ;

The keyword for the declaration of a function is define and the syntax is a subset of the ECMAScript variant with the addition of the forced embracement.

The whole program consists of source-elements, a list of the catch-all statement followed by the end-of-file marker.

program
    : source_elements EOF               { $$ = new program_node($1); return $$;}
    ;

The source-elements list consists of source-elements or nothing.

source_elements
    :                                   { $$ = []; }
    | source_elements source_element    { $$ = $1.concat($2)}
    ;

A single source-element consists of statements and function_declarations which are already defined and that means: we are done.

source_element
    : statement
    | function_declaration
    ;
%% /* AST handling code comes after this line */

Done with the parser-script that is πŸ˜‰

If you want to go on yourself (I caught a cold and will need some aspirin and a bed now, it get to be a long day tomorrowday), here are some little helpers:

perl -nle 'print "function $1\{\n\n\}" if /new\s+(\w+\([0-9,$ ]*\))/' small.jison

A shell one-liner to grab all of the node-objects and make functions (constructors actually) out of them:
If you want to use it like C. Ihrig the following prints the necessary code:

perl -nle 'print "parser.ast.$1 = $1;" if /new\s+(\w+\([0-9,$ ]*\))/' small.jison

The code to make a IEEE-754 number out of the string representation (I did that already, didn’t I?) is (without hexadecimal or other representations to keep it simple):

function parseNumericLiteral(literal) {
  return parseNumber(literal);
}
function parseImaginaryLiteral(literal) {
  return parseNumericLiteral(literal.
                    subst(0,literal.length-1));
}
function parseNumber(string){
  return Number(string);
}

It is a bit more complicated than necessary, but that way you can add other representations more easily. To give a hint:

function parseNumericLiteral(literal) {
  if (literal.charAt(0) === "0") {
    switch(literal.charAt(1).toLowerCase()) {
      case "x": /* hexadecimal */
        return parseNumber(literal.substr(2), 16);
        break;
      case "b": /* binary */
        return parseNumber(literal.substr(2), 2);
        break;
      default: /* octal */
       return parseNumber(literal.substr(1), 8);
      break;
    }
  } else {    /* decimal */
    return parseNumber(literal,10);
  }
}

I haven’t implemented the actual parsing code but it can be done by simple shifting, aka. normalizing but the other way around (error and bounds-checking assumed):

  • check for a radix marker (period). If we have one
    • move the radix marker to the right until it is gone, keep the count
    • subtract the count from the exponent by shifting right by
      • the count for base 2
      • the count/8 for base 8
      • the count/4 for base 16
    • subtract the remainder
  • If we have none we have an integer and can use JavaScript’s parseInt(from_string,from_base)

You might have to treat the exponent differently depending on the base but the number representation of the exponent is always as a signed decimal integer. I have left that out in the pseudo-code above but you might have guessed that already πŸ˜‰

Adventures of a Programmer: Parser Writing Peril XXVII

The Infamous Dangling Else

The Wikipedia-page describes it well enough, so no description of the problem is laid down in this post, only a short one of the solution chosen.

There are basically two different solutions:

  1. Follow Eelco Visser’s proposals (PDF!) and do some “magick”
  2. Write an unambiguous language

We choose b. Continue reading

Adventures of a Programmer: Parser Writing Peril XXV

Expressions

Expressions are quite expressive, I am not and start right on πŸ˜‰

Expressions are all things done to everything literal. These are numbers, strings, tests (comparing), logic, bit operations, arithmetic operations, casting, a non-expression (null), identifier, and then some.

And now that I’m talking about it suddenly crossed my mind: there is a lack of fundamental logic in my grammar!

bool-true-literal  = %x74.72.75.65    ; true
bool-false-literal = %x66.61.6c.73.65 ; false
null-literal       = %x6e.75.6c.6c    ; null

Oops! πŸ˜‰
Ok, but now: Continue reading

Adventures of a Programmer: Parser Writing Peril XXI

Constants

Before we can go on to the wild world of chaotic character-concatenations also known as strings we shall not ignore the small but rich village of constants.

The two most used constants in mathematics are \pi = \frac{C}{d} = 4\arctan 1 = |\sqrt{6\zeta\left(2\right)}|, i^2 = -1 and e = \lim_{n\to\infty} \left( 1 + \frac{1}{n} \right)^n = \lim_{n\to\infty} \frac{n}{\sqrt[n]{n!}} . Two of them are mere constants but i is almost always found in multiples.

There are different ways to handle constants syntactically:

  • As a reserved word e.g.: “Pi” in GP/PARI by Henri Cohen and FranΓ§ois Dress
  • With a prefix e.g.: “%pi” in Maxima, based on Macsyma by William Schelter which was itself based on something governmentally influenced
  • With a postfix e.g.: “pi#” (but it also allows “pi”) in Mathomatic™ written by the late George John Gesslein II

It is slightly different with i. Singular ones are hard to find, they often come in more than just a single one. This might lead one to come up with the idea of prefixing i with a real number, for example 12.34e56i like in Octave for example. Now, what is ideal to be implemented in Little?
If we inspect the short list above a bit closer we see that the programs that lean more on the side of symbolic computation have an affix and those that tend to do more numerical computations have none. GP/PARI, although mainly numerical offers some basic symbolic functions, too. They have i without an affix and define it as a constant with the value 0 + 1i; an extra operation is necessary to get more of it.
Octave on the other side defines whole imaginary numbers and marks them as such with the suffix “i”, so 0i is equivalent to 0. It also makes the input equal to the output:

octave:1> 123i
ans =    0 + 123i
octave:2> 0i
ans = 0

Little is more or less purely numerical and the way Octave handles it seems fit for us, too. If the parser itself handles it we can save us some headache but have to define Imaginary exactly.

In my last post the definitions for several kinds of number representation have been given. We can put them all together now.

; Imaginary number
imaginary = number "i"
; General number
number    = hex-sequence / dec-sequence / bin-sequence

; Hexadecimal number
hex-sequence     = *1sign hex-indicator hex-significand 
                    hex-exponent
hex-exponent     = hex-exponent-indicator *1sign 1*digit
hex-significand  = (*hex-digit "." 1*hex-digit) /
                   (1*hex-digit ".") /
                   (1*hex-digit)
hex-indicator          = "0" ("x" /%x58 ) ; 0[xX]
hex-exponent-indicator = "p" / %x50 ; [Pp]
hex-digit     = digit / hexl-owercase / hex-uppercase ; [0-9a-fA-F]
hex-uppercase = %x41-46 ; [A-F] ABNF is case insensitive
hex-lowercase = "a" / "b" / "c" / "d" / "e" / "f"

; Decimal number
dec-sequence     = *1sign dec-significand decimal-exponent
dec-significand  = (*digit "." 1*digit) /
                   (1*digit ".") /
                   (1*digit)
; Binary number
bin-sequence     = *1sign bin-indicator bin-significand 
                    decimal-exponent
decimal-exponent = bin-exponent-indicator *1sign 1*digit
bin-significand  = (*bin-digit "." 1*bin-digit) /
                   (1*bin-digit ".") /
                   (1*bin-digit)
bin-indicator          = "0" ("b" /%x42 ) ; 0[bB]
bin-exponent-indicator = "e" / %x45 ; [Ee]
bin-digit     = "0" / "1"
digit         = "0" / "1" / "2" / "3" / "4" / 
                "5" / "6" / "7" / "8" / "9"
sign          = "+" / "-"

I think that was that about defining the term Number.
My next post in this series shall be about strings. Really! Promised! πŸ˜‰

Adventures of a Programmer: Parser Writing Peril XX

Number

The program Little is all about numerical mathematics and, as the expression suggests, contains numbers. That means, in all consequences that we must be able to handle numbers. The ability to parse a number is based on a mathematical rigorous description of the term Number and not in a “I know when I see it!” way as Justice Potter Stewart described his own ability to detect obscenities so succinctly in Jacobellis v. Ohio (1964)—that is not sufficient. Continue reading

Adventures of a Programmer: Parser Writing Peril XVIII

[UPDATE]
The regular expression nagged me quite a bit until I got rid of it. updated this post to mirror my reflections, pardon, the changes (added at the end).

Last post I talked about using Tom Wu’s bigint library JSBN and the need to add some faster algorithms especially for multiplication. I just found out that I already did it πŸ˜‰ (needs some testing, of course, I haven’t even checked if it works at all).
But serious:
One of the thing needed for the front-end operator overloading is the backend operator overloading, because there is no way to do a useful operator overloading beside some fiddling with valueOf in ECMA-script. Continue reading