Parsing for PHP developers – Part II

In part 1 I introduced and demonstrated the parsing concept using a very simple date parser. In this part I am going to talk about the important role of tokenizing. If you haven’t read part 1 this may not make much sense, so read it now if you haven’t already.

Syntactical vs Lexical

Looking again at the simple grammar of part 1. You may notice that the rule: <D_DIGIT> ::= "0" | "1" ... "9" is a bit different to all the others. It does not really contribute to the syntax of our language, it merely describes the legal characters that make up a single digit. It is convenient to view this aspect of the language as a subset of the grammar; one that is concerned only with what input ‘looks like’ rather than where it appears. This can be called the lexical grammar. The rest of the language which is concerned with syntax can be called the syntactical grammar.

If we were to run our date string through a lexical parser first, it could output a more organized set of symbols that we could then feed into our syntactical parser as input. The syntactical grammar could then safely take for granted that a symbol D_DIGIT is a perfectly valid terminal symbol and wouldn’t need to worry about what characters it might have contained. So the input for our simple date parser could be transformed from:

"1976-11-03 18:15:00"

– to something like:

[ D_DIGIT D_DIGIT D_DIGIT D_DIGIT "-" D_DIGIT D_DIGIT "-" D_DIGIT " " D_DIGIT D_DIGIT ":" D_DIGIT D_DIGIT ":" D_DIGIT D_DIGIT ]

And because the non-terminal symbol <D_DIGIT> has been scrapped in favour of the terminal symbol D_DIGIT, a grammar rule like :

<D_MONTH> ::= <D_DIGIT> <D_DIGIT>

– could become

<D_MONTH> ::= D_DIGIT D_DIGIT

This might seem a bit pointless at this stage. Having two grammars (and two parsers) for our simple date language is probably not necessary. One of the factors that makes our original test grammar so simple is that all of its terminal symbols are single characters, and only a restricted set of characters at that. This made it fairly easy to embed the lexical information within a single grammar. However, as we consider more complex languages this practice soon becomes a problem and slows things down immensely. Consequently, this separation of the two grammars becomes increasingly useful. Categorizing symbols together like this doesn’t just make our grammar neater, it also makes the parser hugely more efficient.

Tokenization

The lexical parsing process (called ‘lexing’, or ‘tokenizing’) can often be much simpler and easier to achieve than the full, syntactical parse. If we wanted to parse the PHP language itself we don’t have to worry about this step at all, because we have a built in tokenizer at our disposal – the PHP Tokenizer. There are other benefits to having a separate tokenizer too; Perhaps you want to parse a string, but you don’t need a full parse tree or even need to check the syntax. A good example of this is code highlighting.

It looks like our vocabulary is getting bigger, so let’s have a terminology catch-up as I start to talk about tokenizing in the context of PHP.

A symbol, whether terminal or not, is always scalar. As with our date grammar a terminal symbol could represent a piece of input literally, like “2”, or as with PHP, it could stand for a type of input such as T_LNUMBER who’s value may be any numeric string. The latter type of symbol is represented internally as an integer constant. Non-terminal symbols never represent actual input, and so will always be integer constants.

A token represents a single piece of [pre-processed] input. It is identified by a terminal symbol, but it may contain more information. For a symbol like T_LNUMBER, an input token needs to contain both the symbol and a literal value. We use an array as the simplest way to express this:
array ( T_LNUMBER, "123" ); The literal value at offset 1 is not actually consulted during syntactical parsing but if you’re going to do anything useful with your parser, other than just check syntax, you’ll almost definitely need this data later on. It is of course also useful for debugging and error messages during parsing.

It made sense for me to base my parser’s notion of a token on the PHP tokenizer’s model, but really there are many ways to represent such concepts and it is largely a programming design issue. Hopefully by now you get the basic premise that tokenizing raw input into more manageable data makes life easier in performing a full, syntactical parse.

My parser package does not currently contain the kind of automated tokenizer that you’ll find in the world of [ahem] proper programming languages, (e.g. Lex for C), but this is in the pipeline. If you have to write your own tokenizer, you can go about it any way you like as long as what you end up with is an array of tokens supported by the parser. In theory you could build a lexical grammar with rules like <T_FUNCTION> ::= "f" "u" "n" "c" "t" "i" "o" "n", and use the same parser that your syntactical grammar will use, but this is a pretty bad idea. I have found it a hell of a lot easier and infinitely more efficient to write a bespoke function. It doesn’t have to be complex – here is an example of a possible tokenizer for our humble date grammar: The source and the output. PHP also provides the strtok function, but I have not found it useful.

Parsing PHP

In the case of PHP of course, we don’t even need to bother with all this, we just call token_get_all, and we’re ready to parse, so before this article gets too long, let’s create a mini grammar that can use the PHP Tokenizer, and then my parser as before. The following language merely adds up numbers, that’s it – that’s all it does.

1. <NT_PROGRAM> ::=
2.    T_OPEN_TAG <NT_SUM> |
3.    T_OPEN_TAG <NT_SUM> T_CLOSE_TAG;
4. <NT_SUM> ::=
5.    <NT_NUMBER> |
6.    <NT_NUMBER> "+" <NT_SUM>;
7. <NT_NUMBER> ::=
8.    T_LNUMBER |
9.    T_DNUMBER

This tiny subset of PHP syntax must start with <?php simply for the tokenizer to work, but the closing tag is optional. This is defined in the first rule on lines 1-3. The second rule on lines 4-6 is recursive. It defines that sums may be added together with further sums until ultimately a number is produced. The third rule on lines 7-9 defines that both integers and doubles are valid numbers in this language.

If we parse this source: "<?php 100 + 2.2 ?>", we get the following parse tree:

[009] <NT_PROGRAM> :
[000] .  "T_OPEN_TAG" = '<?php'
[007] .  <NT_SUM> :
[002] .  .  <NT_NUMBER> :
[001] .  .  .  "T_LNUMBER" = '100'
[002] .  .  </NT_NUMBER>
[003] .  .  "+"
[006] .  .  <NT_SUM> :
[005] .  .  .  <NT_NUMBER> :
[004] .  .  .  .  "T_DNUMBER" = '2.2'
[005] .  .  .  </NT_NUMBER>
[006] .  .  </NT_SUM>
[007] .  </NT_SUM>
[008] .  "T_CLOSE_TAG" = '?>'
[009] </NT_PROGRAM>

Click here to try the interactive version.

The structure of this parse tree shows that the <NT_SUM> nodes are nested recursively. This is a natural occurrence of the grammar rule on lines 4-6. It is simple enough to ‘flatten’ the tree if you needed, but ultimately this structure reflects the inherent nature of the language which might be essential for whatever you’re going to do with it next.

You’ll see that the interactive demo actually gives the result of the sum. This is achieved by evaluating the parse tree, which is when things start to get a bit more useful, and a lot more interesting. I’ll get to that topic in a other post to follow soon.