# Adventures of a Programmer: Parser Writing Peril XXII

## String

Ah, finally: strings!

### Some Mathematics

Now, what is a string? It is—and we need to get formal here, sorry—a finite sequence of symbols from an alphabet $\Sigma$ (a set of such strings, in turn, is called a language). The alphabet does not need to be finite, but it is in our case. The alphabet can be empty, but that is of not much use, hence we define $\Sigma \neq \varnothing$ for us. The monoid $\Sigma^* = \bigcup_{i \in \mathbb{N} }\Sigma^i = \{\varepsilon\} \cup \Sigma^1 \cup \Sigma^2 \cup \Sigma^3 \cup \ldots$ with $\Sigma^0 = \{\varepsilon\}$, the language containing only the empty string, is also known as the Kleene star of the alphabet. The subsemigroup $\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$ is still quite large, by the way. You can read more about them, and in finer detail, too, on the Wikipedia pages linked at the end of this post.

As mentioned above, even $\Sigma^+$ is way too large; we need strings to be of bounded length: $\Sigma^{\leq n} = \bigcup_{i=0}^{n} \Sigma^i = \{\varepsilon\} \cup \Sigma^1 \cup \Sigma^2 \cup \Sigma^3 \cup \ldots \cup \Sigma^n$. In the good old times of yore and ASCII, with $\Sigma_b = \{0,1\}$, the exponent $n$ was just the number of bits the CPU was able to address, which reduces the number of possible strings of maximum length to a mere $2^{2^{32}}$ (about $3.10328 \times 10^{1292913986}$, and about $1.906974 \times 10^{5553023288523357132}$ for 64-bit CPUs). For strings in the times of ASCII (or EBCDIC or whatever encoding the different companies decided to implement) the alphabet in use was only a subset of $\Sigma_b^7$. The most common lexer in those times was aptly called lex and was only able to handle 7-bit characters. Full 8-bit support was added later, and there is still a switch for it in today's modern, open-source version flex. Unicode, at least Unicode wider than 8 bits, seems to be a larger problem. Other lexers do not have this problem; Quex, ANTLR, and JISON, for example, support it out of the box.

JISON is written in ECMAScript, which currently supports Unicode in full. Sort of. Section 2, second paragraph of ECMA 5.1 says:

> A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it is presumed to be the UTF-16 encoding form.

The Devil is in the details here, but I will not go into them—they cause too much of a headache. I'll ignore the terms "if not otherwise specified" and choose UTF-16 as the encoding form.

## Strings: The Definition

Because we will handle the input as a JavaScript String object, the exact definition in ECMA 5.1 is relevant:

> **4.3.16 String value**
>
> primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integer
>
> NOTE: A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.
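This "sequence of 16-bit units, not of characters" detail is easy to demonstrate; a quick sketch of what it means in practice:

```javascript
// A JavaScript String is a sequence of 16-bit code units, not Unicode
// code points. Characters outside the BMP occupy two units (a surrogate
// pair), and even lone surrogates are perfectly legal values.
var s = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair

console.log(s.length);                     // 2 -- two units, one "character"
console.log(s.charCodeAt(0).toString(16)); // "d83d" -- high surrogate
console.log(s.charCodeAt(1).toString(16)); // "de00" -- low surrogate

var lone = "\uD800"; // a lone high surrogate: no restriction applies
console.log(lone.length);                  // 1
```

This is exactly the gap between UCS-2 and UTF-16 the conformance clause quoted earlier leaves open.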

That is the general definition, but we need the definition for string literals, which is in section 7.8.4:

> A string literal is zero or more characters enclosed in single or double quotes. Each character may be represented by an escape sequence. All characters may appear literally in a string literal except for the closing quote character, backslash, carriage return, line separator, paragraph separator, and line feed. Any character may appear in the form of an escape sequence.

A lengthy grammar follows, describing the above text formally. There is one small problem in that grammar, though: the use of set subtraction:

```
NonEscapeCharacter ::
    SourceCharacter but not one of EscapeCharacter or LineTerminator
```


ABNF has no set subtraction, so we need to spell it all out. That was no problem up to this point, but we have to handle Unicode now, and the set SourceCharacter ($S$) is very large while the sets EscapeCharacter ($E$) and LineTerminator ($L$) are very small, so the difference set $N = S\setminus\left(E\cup L\right)$ is still very, very large.
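In code, of course, the set subtraction never needs to be enumerated; a membership test does the job. A minimal sketch (the function name and the exact contents of the two sets are my illustration, not part of the grammar):

```javascript
// Emulate NonEscapeCharacter = SourceCharacter \ (EscapeCharacter u
// LineTerminator) with a predicate instead of enumerating the huge
// difference set. Set contents are an approximation for illustration.
var ESCAPE_CHARS = "'\"\\bfnrtvxuU0123456789"; // EscapeCharacter (approx.)
var LINE_TERMINATORS = "\n\r\u2028\u2029";     // LineTerminator

function isNonEscapeCharacter(ch) {
  // "SourceCharacter but not one of EscapeCharacter or LineTerminator"
  return ESCAPE_CHARS.indexOf(ch) < 0 &&
         LINE_TERMINATORS.indexOf(ch) < 0;
}

console.log(isNonEscapeCharacter("a"));  // true
console.log(isNonEscapeCharacter("n"));  // false: n follows a backslash in \n
console.log(isNonEscapeCharacter("\n")); // false: a line terminator
```

A generated lexer cannot take this shortcut, which is why the ABNF below has to spell the ranges out.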
The term SourceCharacter is discussed in ECMA 5.1 section 6 if you want to get a hint of my headache.

Long-term use of painkillers can cause a lot of nasty things, so Little will restrict single characters in strings to 8 bits, plus the common escape sequences for Unicode and hexadecimal input. That results in the following ABNF grammar:

```
string_literal            = ( DQUOTE *double_string_character DQUOTE ) /
                            ( "'" *single_string_character "'" )
single_string_character   = not_single_quote       /
                            ( "\" escape_sequence ) /
                            line_continuation
double_string_character   = not_double_quote       /
                            ( "\" escape_sequence ) /
                            line_continuation
; printable characters sans the double quote %x22 and the backslash %x5c
not_double_quote          = %x20-21 / %x23-5b / %x5d-7e /
                            %xa0-ff
; printable characters sans the single quote %x27 and the backslash %x5c
not_single_quote          = %x20-26 / %x28-5b / %x5d-7e /
                            %xa0-ff
escape_sequence           = character_escape_sequence /
                            octal_escape_sequence     /
                            hex_escape_sequence       /
                            uni_escape_sequence
character_escape_sequence = single_escape_character /
                            non_escape_character
single_escape_character   = "'" / DQUOTE / SLASH / "b" / "f" /
                            "n" /   "r"  /  "t"  / "v"
SLASH  = %x5c ; the backslash
DQUOTE = %x22
; source characters (8 bits here) sans LineTerminator and sans
; EscapeCharacter, i.e. sans %x22, %x27, %x5c, the digits, "b", "f",
; "n", "r", "t", "v", "x", "u", and "U"
non_escape_character  = %x20-21 / %x23-26 / %x28-2f / %x3a-54 /
                        %x56-5b / %x5d-61 / %x63-65 / %x67-6d /
                        %x6f-71 / %x73    / %x77    / %x79-7e /
                        %xa0-ff
uni_escape_sequence   = ("u" 4HEXDIG) / ("U" 6HEXDIG)
hex_escape_sequence   = "x" 2HEXDIG
octal_escape_sequence = "0" ( %x30-33 ) 2OCTAL ; \0000-\0377
line_continuation     = "\" (CRLF / CR / LF) ; escape EOL
HEXDIG = DIGIT / %x41-46 / %x61-66
DIGIT  = %x30-39
OCTAL  = %x30-37
CRLF   = CR LF
LF     = %x0a
CR     = %x0d
```
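The whole string_literal rule collapses nicely into a single ECMAScript regular expression. The transcription below is mine, hand-rolled for illustration and only a rough approximation (it blacklists the forbidden characters instead of whitelisting the printable ranges); a generated lexer would look different:

```javascript
// Rough regex transcription of string_literal above: double- or
// single-quoted, body of plain characters, escape sequences, and
// line continuations (the \r\n | [\r\n] alternatives at the end).
var ESCAPE = "\\\\(?:['\"\\\\bfnrtv]" +     // single_escape_character
             "|x[0-9A-Fa-f]{2}" +           // hex_escape_sequence
             "|u[0-9A-Fa-f]{4}" +           // uni_escape_sequence \uXXXX
             "|U[0-9A-Fa-f]{6}" +           // uni_escape_sequence \UXXXXXX
             "|0[0-3][0-7]{2}" +            // octal_escape_sequence
             "|\\r\\n|[\\r\\n])";           // line_continuation
var STRING_LITERAL = new RegExp(
  '^(?:"(?:[^"\\\\\\r\\n]|' + ESCAPE + ')*"' +
  "|'(?:[^'\\\\\\r\\n]|" + ESCAPE + ")*')$"
);

console.log(STRING_LITERAL.test('"foo\\n"')); // true
console.log(STRING_LITERAL.test("'it\\'s'")); // true
console.log(STRING_LITERAL.test('"broken'));  // false: unterminated
```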


I forget whether I have mentioned the term line continuation before, so I might repeat myself here. Line continuation is a fancy term for escaping the character(s) marking the end-of-line position. E.g.:

```
print ("foo\
bar")
```

will result in print ("foobar"), no matter what, or rather how many, characters the end-of-line marker consists of. The latter is not always true, but the exceptions are few and far between.
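A scanner might implement this by simply deleting the escaped line ends while copying the literal's body; a small sketch (the helper name is made up):

```javascript
// Sketch: collapse line continuations in a string literal's body.
// A backslash followed by CRLF, CR, or LF vanishes entirely; note the
// CRLF alternative comes first so both characters are consumed at once.
function collapseLineContinuations(body) {
  return body.replace(/\\(?:\r\n|\r|\n)/g, "");
}

console.log(collapseLineContinuations("foo\\\nbar"));   // "foobar"
console.log(collapseLineContinuations("foo\\\r\nbar")); // "foobar" (CRLF)
```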

### Identifiers

Closely related to string literals are identifiers, used for naming variables, functions, and labels. To avoid any trouble, we restrict the allowed characters to strict ASCII:

```
; [_a-zA-Z] [_a-zA-Z0-9]*
identifier       = identifier_start *identifier_part
; [_a-zA-Z0-9]
identifier_part  = identifier_start / %x30-39
; [_a-zA-Z]
identifier_start = "_" / %x41-5a / %x61-7a
```


Although “_” is a regular identifier on its own, it is not recommended to use it outside of an “obfuscated code contest”.
The alphabet is quite restricted, maybe too restricted? It is possible to add some more characters later without much hassle, so I think I'll leave it at that for now.
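The identifier rule is simple enough that the regex from the grammar's comment is the whole implementation; a quick check that it behaves as intended:

```javascript
// The identifier grammar above as the equivalent ECMAScript regex,
// taken straight from the comment: [_a-zA-Z] [_a-zA-Z0-9]*
var IDENTIFIER = /^[_a-zA-Z][_a-zA-Z0-9]*$/;

console.log(IDENTIFIER.test("_"));      // true -- legal, if ill-advised
console.log(IDENTIFIER.test("foo_42")); // true
console.log(IDENTIFIER.test("42foo")); // false: may not start with a digit
console.log(IDENTIFIER.test("naïve")); // false: strict ASCII only
```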

Next in this series: Types and Modifiers.

## Note

This post contains content from the Wikipedia pages http://en.wikipedia.org/wiki/Kleene_star, http://en.wikipedia.org/wiki/Sequence, and http://en.wikipedia.org/wiki/String_%28computer_science%29, and I want to thank everybody who worked hard on these pages!