Parsing is a fairly common word in the web developer’s vocabulary. We do it all the time. One immediately thinks of XML as something we parse regularly without batting an eyelid. As a PHP developer you might also parse an ini file with parse_ini_file, or parse a date string with strtotime. Whatever language you write, these tasks are easily achieved using either built-in functions or by installing other code libraries or extensions. Sometimes you may find yourself needing to parse something more bespoke, like say a postcode – you’ll either write a routine yourself, or do some googling for a neat algorithm someone out there has decided to share. – no problem.
A rod for my back
But what if you want to parse something really complex, like say – an entire JavaScript program. What if you can’t find a third party library that works for you? Well I tried to find one. I found some very promising projects. But they ranged from abandoned projects, to dodgy alpha releases, to ones that just plain didn’t work and with no documentation to help. The most serious looking projects were so sophisticated that I didn’t even have the knowledge to start using them. I decided, as I often do, that I need empowering with the knowledge to write my own parser should I need one for – well, whatever.
What transpired as the Googling began was that I had bitten off rather more than I have ever attempted to chew. The fact is that we take the concept of parsing a little for granted and we use the term quite casually. I have recently learned quite what a huge and fundamental topic of computer science this really is – It has been around longer than computers, which perhaps says it all.
Before I go any further with my ramblings I’d like to point out that everything I now know about this topic I owe to one book; Parsing Techniques – A Practical Guide. Entirely as result of studying this book I have managed to implement a powerful LR(1) parser and parse table generator in PHP. If you don’t know what that means, I suggest you read the book, because I will not attempt to provide a tutorial on these deep topics. I will however whet your appetite with a little introduction, some demos, (aka showing off), and eventually some code.
Our notion of parsing
Let’s go back for a moment and look at our understanding of what parsing is. Suppose you want to parse a UK date (by hand so to speak) – in PHP you could do something like this:
- $sDate = ’03 / 11 / 1976′;
- $aDate = preg_split( ‘/D/’, $sDate, –1, PREG_SPLIT_NO_EMPTY );
- $timestamp = mktime(0, 0, 0, $aDate[1], $aDate[0], $aDate[2], false );
Easy peasy – You have parsed the date, in so far as it has been transformed from a string of bytes to something tangible within your code. You can get meaningful handles on the data and do stuff with it. Great! – but not powerful or versatile; This is an entirely bespoke process that works for this purpose alone. Proper parsing techniques allow a problem like this to be standardized greatly, and will allow you to use the same code base to implement parsers for as many different things as your imagination and time allows you. Furthermore, it is possible to fully harness the expressive nature of a language rather than just chopping a sentence up and grabbing bits of it, like I have shown above.
As a proof of concept I’m going to use a humble date format to demonstrate the parser I have developed for PHP. In reality using such a powerful technique would be total overkill, but it makes for a pretty good first example.
Proof of concept
Let’s take a date format as the language for which we wish to create a parser. A legitimate expression (or sentence) in this language could be as follows: 1976-11-03 18:15:00
Starting with the top-most, and largest component of our language, let’s examine what it consists of and define ourselves a grammar which expresses its rules and syntax.
The complete sentence is quite simply a date – This entity is what we call the “goal symbol“, because it is our ultimate result. Let’s create a sensibly named symbol with an appropriate namespace.
<D_DATE>
We can split the D_DATE
symbol into two separate components for date (1976-03-11) and time (18:15:00), Let’s call these symbols D_DATE_COMPONENT
& D_TIME_COMPONENT
. They are separated by a space, which we write literally.
<D_DATE> ::= <D_DATE_COMPONENT> " " <D_TIME_COMPONENT>
The literal space symbol is called a terminal symbol because it is an actual character in the date we are parsing and consequently can not be divided into any further symbols. The other symbols we have defined so far are called non-terminal because they are made up of further symbols, and are not physically present in the original date string. They are, to all intent and purpose, imaginary symbols created by us to provide an understanding of what the sentence means.
You get the gist – hopefully – So here’s the whole grammar.
1. <D_DATE> ::= 2. <D_DATE_COMPONENT> " " <D_TIME_COMPONENT>; 3. 4. <D_DATE_COMPONENT> ::= 5. <D_YEAR> "-" <D_MONTH> "-" <D_DAY>; 6. 7. <D_TIME_COMPONENT> ::= 8. <D_HOUR> ":" <D_MIN> ":" <D_SEC> ; 9. 10. <D_YEAR> ::= 11. <D_DIGIT> <D_DIGIT> <D_DIGIT> <D_DIGIT> ; 12. 13. <D_MONTH> ::= 14. <D_DIGIT> <D_DIGIT> ; 15. 16. <D_DAY> ::= 17. <D_DIGIT> <D_DIGIT> ; 18. 19. <D_HOUR> ::= 20. <D_DIGIT> <D_DIGIT> ; 21. 22. <D_HOUR> ::= 23. <D_DIGIT> <D_DIGIT> ; 24. 25. <D_MIN> ::= 26. <D_DIGIT> <D_DIGIT> ; 27. 28. <D_SEC> ::= 29. <D_DIGIT> <D_DIGIT> ; 30. 31. <D_DIGIT> ::= 32. "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
This is written in a notational format called Backus Naur form (BNF). Non terminal symbols are written inside angled brackets and the terminal symbols are all written inside quotes as they are [in this case] literal. As you can see, the only terminal symbols of the language are the punctuation and numeric characters that a legal date string in this language will consist of. The last non-terminal D_DIGIT
illustrates a choice of terminal symbol – the 10 decimal characters, (lines 31-32).
The parse tree
As we drill down from our goal symbol into its child symbols you may realise that we are seeing a tree-like structure. This tree has the goal symbol at it’s root, and along each branch we eventually find a leaf – a terminal symbol. This is important, because it is precisely the parser’s purpose in life to take a stream of terminal symbols and build them into a parse tree. Without further ado, here is the parse tree of our example date string.
[041] <D_DATE> : [021] . <D_DATE_COMPONENT> : [008] . . <D_YEAR> : [001] . . . <D_DIGIT> : [000] . . . . "1" [001] . . . </D_DIGIT> [003] . . . <D_DIGIT> : [002] . . . . "9" [003] . . . </D_DIGIT> [005] . . . <D_DIGIT> : [004] . . . . "7" [005] . . . </D_DIGIT> [007] . . . <D_DIGIT> : [006] . . . . "6" [007] . . . </D_DIGIT> [008] . . </D_YEAR> [009] . . "-" [014] . . <D_MONTH> : [011] . . . <D_DIGIT> : [010] . . . . "1" [011] . . . </D_DIGIT> [013] . . . <D_DIGIT> : [012] . . . . "1" [013] . . . </D_DIGIT> [014] . . </D_MONTH> [015] . . "-" [020] . . <D_DAY> : [017] . . . <D_DIGIT> : [016] . . . . "0" [017] . . . </D_DIGIT> [019] . . . <D_DIGIT> : [018] . . . . "3" [019] . . . </D_DIGIT> [020] . . </D_DAY> [021] . </D_DATE_COMPONENT> [022] . " " [040] . <D_TIME_COMPONENT> : [027] . . <D_HOUR> : [024] . . . <D_DIGIT> : [023] . . . . "1" [024] . . . </D_DIGIT> [026] . . . <D_DIGIT> : [025] . . . . "8" [026] . . . </D_DIGIT> [027] . . </D_HOUR> [028] . . ":" [033] . . <D_MIN> : [030] . . . <D_DIGIT> : [029] . . . . "1" [030] . . . </D_DIGIT> [032] . . . <D_DIGIT> : [031] . . . . "5" [032] . . . </D_DIGIT> [033] . . </D_MIN> [034] . . ":" [039] . . <D_SEC> : [036] . . . <D_DIGIT> : [035] . . . . "0" [036] . . . </D_DIGIT> [038] . . . <D_DIGIT> : [037] . . . . "0" [038] . . . </D_DIGIT> [039] . . </D_SEC> [040] . </D_TIME_COMPONENT> [041] </D_DATE>
Click here to play with the interactive version
You will see that the parser throws a descriptive error if you try to parse an invalid date.
That’s the concept in a nutshell. Hopefully you can see that this tree structure gives us a powerful and flexible way to describe the anatomy of the input we have parsed. Compare that with simply grabbing the linear sequence of numbers that make up the date string and imagine the potential.
I shall leave this introduction for now. In part 2, I go on to discuss tokenizing and demonstrate the parsing of PHP source.