I’ve updated my JavaScript parser to include full Unicode support.
Check out the test interfaces for:
» Full parser;
» Code highlighting.
Code highlighting does not require the full syntactic parser; it uses only the tokenizer, and it does not break when a bad character is found.
What’s in?
When I say full Unicode support, what I mean is this: Unicode characters inside string literals and comments were always implicitly supported, but the parser can now cope with Unicode characters in identifiers too, as per the ECMAScript standard. Support also includes Unicode escape sequences, although for the sake of speed their validity is not checked. That check could be done at a later stage, when the parse tree is employed to do something useful. The full range of whitespace and line-terminating characters has also been added, although it has not been tested.
Unicode support makes the tokenizing process slower, so it can be switched off when it is not needed. Most programmers with English as their first language are unlikely to use Unicode characters in ‘hand-written’ identifiers, and I have to wonder if anyone has ever done so with an escape sequence.
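To make the identifier handling and the on/off switch concrete, here is a minimal sketch of how a tokenizer might scan an identifier. It is not the parser's actual code: the function name scanIdentifier, its unicode parameter and the regular expressions are all hypothetical, and it assumes a modern engine that supports Unicode property escapes in regular expressions. As described above, a \uXXXX escape is accepted on its shape alone and the character it decodes to is not checked.

```javascript
// Hypothetical sketch, not the parser's real implementation.
// Assumes an engine with Unicode property escapes (the /u flag and \p{...}).
const ID_START = /[$_\p{ID_Start}]/u;
const ID_PART = /[$_\u200C\u200D\p{ID_Continue}]/u;
const ASCII_START = /[$_A-Za-z]/;
const ASCII_PART = /[$_A-Za-z0-9]/;
const UNICODE_ESCAPE = /^\\u[0-9a-fA-F]{4}/;

// Returns the identifier text starting at `pos`, or null if none starts there.
// Pass unicode = false to use the cheaper ASCII-only checks.
function scanIdentifier(source, pos, unicode = true) {
  let i = pos;
  while (i < source.length) {
    const first = i === pos;
    if (unicode && UNICODE_ESCAPE.test(source.slice(i, i + 6))) {
      // Accepted on shape alone: the decoded character is not validated here.
      i += 6;
      continue;
    }
    // Read one whole code point so characters outside the BMP are handled.
    const ch = String.fromCodePoint(source.codePointAt(i));
    const pattern = unicode
      ? (first ? ID_START : ID_PART)
      : (first ? ASCII_START : ASCII_PART);
    if (!pattern.test(ch)) break;
    i += ch.length;
  }
  return i > pos ? source.slice(pos, i) : null;
}
```

With Unicode switched on, scanning "caf\u00E9 = 1" from position 0 yields the whole word including the accented letter; with it switched off, the scan stops before that letter. A later pass over the parse tree could decode each escape and reject those that do not map to a valid identifier character.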