Improving ANTLR3 - Generating UTF-8 and UTF-32 lexers.

Nickolas Pohilets

May 7, 2015, 6:21:55 PM

to antlr-di...@googlegroups.com

Right now ANTLR3 always assumes that the lexer will parse input as UTF-16 code units, which may not be true and is not true for my project. That's OK as long as you handle all code points outside Latin-1 transparently, but recently I wanted my lexer to parse the infinity symbol as a valid floating-point literal, and that didn't work. I tricked ANTLR by encoding it in the grammar as '\u00E2\u0088\u009E' (the three UTF-8 bytes of U+221E), but I have been meaning to spend some time modifying ANTLR so that it does this kind of transformation out of the box.
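To make the workaround concrete, here is a small sketch that derives the escaped form mechanically. The class and method names are made up for illustration and are not part of ANTLR; the code only shows that each UTF-8 byte of the symbol ends up as its own \u00XX escape.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ANTLR.
final class Utf8EscapeSketch {
    // Spell every UTF-8 byte of the text as a separate four-hex-digit escape,
    // which is what the workaround in the grammar does by hand.
    static String toByteEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 because Java bytes are signed.
            sb.append(String.format("\\u%04X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+221E is the infinity symbol.
        System.out.println(toByteEscapes("\u221E"));
    }
}
```

For the infinity symbol this yields the same three escapes used in the grammar above.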

Recently I was finally able to debug ANTLR in a GUI debugger (a serious achievement for a guy with zero Java background :) ) and spent some time digging around. With some limitations (restricting range limits and set elements to symbols that fit into a single code unit), this seems doable.

Requirements Analysis:

The assumption is that the text we are parsing is valid Unicode text, regardless of its transformation format. Neither binary data nor non-Unicode encodings are handled by ANTLR now anyway, so allowing it to parse at least UTF-8 and UTF-32 with re-encoding is already an improvement.

An ANTLR grammar file is Unicode text encoded in UTF-8. It is read into JVM memory and stored there as Unicode text in UTF-16. Some parts of this text are character and string literals, which DESCRIBE Unicode strings. These descriptions feature C-like escape sequences and UTF-16 escape sequences \uXXXX, and I'm going to add support for C++11-like UTF-32 escape sequences \UXXXXXXXX (8 hex digits). The latter may be represented by a UTF-16 surrogate pair in ANTLR's memory.
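As an illustration of the last point, the sketch below decodes the eight hex digits of such an escape into the UTF-16 code units the JVM would hold. The class and method names are hypothetical; Character.toChars is the standard JDK call that produces a surrogate pair for code points above U+FFFF.

```java
// Hypothetical helper, not part of ANTLR.
final class Utf32EscapeSketch {
    // Takes the 8 hex digits that follow the escape prefix and returns
    // one UTF-16 code unit, or a surrogate pair for supplementary code points.
    static char[] decode(String eightHexDigits) {
        if (!eightHexDigits.matches("[0-9A-Fa-f]{8}")) {
            throw new IllegalArgumentException("expected 8 hex digits: " + eightHexDigits);
        }
        // Parse as long so that values above Integer.MAX_VALUE are rejected cleanly.
        long value = Long.parseLong(eightHexDigits, 16);
        if (value > Character.MAX_CODE_POINT) {
            throw new IllegalArgumentException("not a Unicode code point: " + eightHexDigits);
        }
        return Character.toChars((int) value);
    }

    public static void main(String[] args) {
        for (char c : decode("0001D11E")) {
            System.out.printf("%04X%n", (int) c);
        }
    }
}
```

A code point in the Basic Multilingual Plane, such as 0000221E, comes back as a single code unit, while 0001D11E comes back as the pair D834 DD1E.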

These Unicode strings should be recognized by the generated lexer in an input stream containing Unicode text in some representation. The numbers that the lexer receives from the IntStream are code units, and matching should be done against code units.

The idea is to add a new grammar option, 'encoding', with three possible values: 'UTF8', 'UTF16', and 'UTF32'. The default is determined by the target and is 'UTF16' for Java and the other existing targets (except Ruby, see below). A target must support at least its default encoding but is not required to support all of them, so the default setup does not break existing ANTLR grammars in any language. Targets are asked whether they support the specified encoding, and if not, an error is generated.
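A rough sketch of how that option resolution might work is below. Everything in it (the Encoding enum, the TargetInfo class, the resolve method) is an assumption made for illustration; none of these names exist in ANTLR3 today.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch, not existing ANTLR code.
final class EncodingOptionSketch {
    enum Encoding { UTF8, UTF16, UTF32 }

    // Stand-in for the encoding-related part of a Target.
    static final class TargetInfo {
        final String name;
        final Encoding defaultEncoding;
        final Set<Encoding> supported;

        TargetInfo(String name, Encoding defaultEncoding, Set<Encoding> extraSupported) {
            this.name = name;
            this.defaultEncoding = defaultEncoding;
            // A target always supports at least its default encoding.
            EnumSet<Encoding> all = EnumSet.of(defaultEncoding);
            all.addAll(extraSupported);
            this.supported = all;
        }
    }

    // optionValue is the raw text of the 'encoding' option, or null if absent.
    static Encoding resolve(TargetInfo target, String optionValue) {
        if (optionValue == null) {
            return target.defaultEncoding; // existing grammars keep working
        }
        Encoding requested;
        try {
            requested = Encoding.valueOf(optionValue);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("unknown encoding: " + optionValue);
        }
        if (!target.supported.contains(requested)) {
            throw new IllegalArgumentException(
                "target " + target.name + " does not support encoding " + requested);
        }
        return requested;
    }

    public static void main(String[] args) {
        TargetInfo cpp = new TargetInfo("Cpp", Encoding.UTF16, EnumSet.of(Encoding.UTF8));
        System.out.println(resolve(cpp, null));
        System.out.println(resolve(cpp, "UTF8"));
    }
}
```

In a real implementation the failure would presumably go through ANTLR's own error reporting rather than an exception; the exception here only keeps the sketch short.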

ANTLR then generates a lexer that recognizes text in the specified encoding. It is not possible to configure the generated lexer at runtime, but in theory one could generate three versions of the lexer and pick the appropriate one at runtime.

ANTLR generates program sources in the default JVM text encoding, most likely UTF-8. The actual encoding should be irrelevant to the compiler of the target language. In C++, both byte sequences 75 38 22 E2 88 9E 22 (7 bytes, UTF-8) and 00 75 00 38 00 22 22 1E 00 22 (10 bytes, 5 code units, UTF-16) represent the string literal u8"∞", which encodes a null-terminated sequence of 4 UTF-8 code units, while both byte sequences 75 22 E2 88 9E 22 (6 bytes, UTF-8) and 00 75 00 22 22 1E 00 22 (8 bytes, 4 code units, UTF-16) represent the string literal u"∞", which encodes a null-terminated sequence of 2 UTF-16 code units.

Implementation details:

The encoding affects how CHAR_LITERAL and STRING_LITERAL are distinguished, so it should be set before the grammar is read into a Grammar object. My current plan is to use GrammarSpelunker to read the grammar options, and to use those options to instantiate the Target object and configure the encoding. Currently these options are discarded after the grammar files are sorted.

AFAICS, the entire tree of composite grammars is processed using the options of the root grammar, and the code in CompositeGrammarTree.getOption() that forwards option resolution to the parent grammar is unreachable. I want to make this explicit and remove the redundant code. My current plan is to read the options using GrammarSpelunker, pass them to CompositeGrammar, and store them there rather than in the individual Grammar objects.

The Grammar object instantiates a Target object in its constructor, but currently this always happens before the options are parsed, so Grammar always instantiates JavaTarget. This looks like a bug to me.

Since the options are the same for the entire grammar tree, the entire tree also has the same target language. So my first thought was to create one Target instance per tree and store it in CompositeGrammar. But then I discovered that the C and Cpp targets are stateful, so I'd better keep one Target instance per grammar. In that case, there are several options for asking the Target about the encoding:

1. Create dummy instance in CompositeGrammar - feels a bit awkward.

2. Create instance in CompositeGrammar and share it with root Grammar - overcomplicated?

3. Make encoding methods static and call them through reflection

4. Use a single instance of Target, but add methods like beginGrammar()/endGrammar() and let the C and Cpp targets manage their state there.

Any advice from Java gurus?
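To make option 4 easier to discuss, here is a minimal sketch of what the hooks could look like. The Target class below is a simplified stand-in rather than ANTLR's real Target, and the per-grammar state (a list of recorded strings) is purely hypothetical; it only demonstrates that state created in beginGrammar() does not leak into the next grammar.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not existing ANTLR code.
final class SharedTargetSketch {
    // Stateless targets simply inherit the no-op hooks.
    static class Target {
        void beginGrammar(String grammarName) { }
        void endGrammar(String grammarName) { }
    }

    // A stateful target keeps its per-grammar state only between the two hooks.
    static class StatefulTarget extends Target {
        private List<String> recorded; // null while no grammar is being processed

        @Override
        void beginGrammar(String grammarName) {
            recorded = new ArrayList<>();
        }

        void record(String item) {
            if (recorded == null) {
                throw new IllegalStateException("record() called outside beginGrammar/endGrammar");
            }
            recorded.add(item);
        }

        int recordedCount() {
            return recorded == null ? 0 : recorded.size();
        }

        @Override
        void endGrammar(String grammarName) {
            recorded = null; // drop state so the next grammar starts clean
        }
    }

    public static void main(String[] args) {
        StatefulTarget shared = new StatefulTarget();
        shared.beginGrammar("Root");
        shared.record("from Root");
        shared.endGrammar("Root");
        shared.beginGrammar("Imported");
        System.out.println(shared.recordedCount()); // state from Root is gone
        shared.endGrammar("Imported");
    }
}
```

The trade-off is that every caller has to bracket its work with the two hooks, which is easy to forget; options 1 to 3 avoid that at the cost of extra instances or reflection.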

Encoding-related handling is encapsulated in the class TextEncoder and its 3 subclasses: UTF8TextEncoder, UTF16TextEncoder and UTF32TextEncoder. An instance of the appropriate subclass is created and stored in CompositeGrammar. ANTLRLexer uses TextEncoder to check whether a string/character literal is small enough: if it fits in a single code unit, it is parsed as a CHAR_LITERAL, otherwise as a STRING_LITERAL.
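A minimal sketch of that hierarchy follows. The class names come from the description above, but the method names and the wrapper class are assumptions, and the real classes will likely carry more responsibilities. It also shows the code unit range per encoding, which is not the same thing as the largest code point that fits in one unit (for UTF-8 these are 0xFF and 0x7F respectively).

```java
// Sketch only; method names are hypothetical.
final class TextEncoderSketch {
    static abstract class TextEncoder {
        // Largest code point that encodes to exactly one code unit.
        abstract int maxSingleUnitCodePoint();

        // Largest value a code unit can take in this encoding.
        abstract int maxCodeUnitValue();

        // True if the decoded literal is one code point that needs one code unit.
        boolean fitsInSingleCodeUnit(String literal) {
            return literal.codePointCount(0, literal.length()) == 1
                    && literal.codePointAt(0) <= maxSingleUnitCodePoint();
        }

        String literalKind(String literal) {
            return fitsInSingleCodeUnit(literal) ? "CHAR_LITERAL" : "STRING_LITERAL";
        }
    }

    static final class UTF8TextEncoder extends TextEncoder {
        int maxSingleUnitCodePoint() { return 0x7F; }
        int maxCodeUnitValue() { return 0xFF; }
    }

    static final class UTF16TextEncoder extends TextEncoder {
        int maxSingleUnitCodePoint() { return 0xFFFF; }
        int maxCodeUnitValue() { return 0xFFFF; }
    }

    static final class UTF32TextEncoder extends TextEncoder {
        int maxSingleUnitCodePoint() { return 0x10FFFF; }
        int maxCodeUnitValue() { return 0x10FFFF; }
    }

    public static void main(String[] args) {
        String infinity = "\u221E";
        System.out.println(new UTF8TextEncoder().literalKind(infinity));
        System.out.println(new UTF16TextEncoder().literalKind(infinity));
    }
}
```

With these definitions the infinity symbol is a STRING_LITERAL under UTF-8 (three code units) and a CHAR_LITERAL under UTF-16 and UTF-32.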

NFAFactory uses TextEncoder to break a Unicode string literal into the list of code units that are used to build state transitions.
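The splitting step could be sketched as follows. The class, enum and method names are assumptions for illustration, and the literal is assumed to be well-formed UTF-16 in memory (String.getBytes replaces unpaired surrogates with a substitute byte). Each returned int would label one transition in the chain of NFA states.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not existing ANTLR code.
final class CodeUnitSplitSketch {
    enum Encoding { UTF8, UTF16, UTF32 }

    // Break an in-memory (UTF-16) literal into code units of the target encoding.
    static int[] toCodeUnits(String literal, Encoding encoding) {
        switch (encoding) {
            case UTF8: {
                byte[] bytes = literal.getBytes(StandardCharsets.UTF_8);
                int[] units = new int[bytes.length];
                for (int i = 0; i < bytes.length; i++) {
                    units[i] = bytes[i] & 0xFF; // Java bytes are signed
                }
                return units;
            }
            case UTF16:
                // The JVM already stores the literal as UTF-16 code units.
                return literal.chars().toArray();
            case UTF32:
                // One code unit per code point; surrogate pairs are combined.
                return literal.codePoints().toArray();
            default:
                throw new AssertionError(encoding);
        }
    }

    public static void main(String[] args) {
        for (int unit : toCodeUnits("\u221E", Encoding.UTF8)) {
            System.out.printf("%02X ", unit);
        }
        System.out.println();
    }
}
```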

Grammar.getMaxCharValue() should use TextEncoder to get the code unit range instead of Target, and the corresponding method should be removed from the Target class. Currently the Ruby target is the only one that overrides it. I need to check with Kyle Yetter what to do about this.

Those are all the required code changes I'm aware of for now. If I succeed with this, it will be possible to move on and enable arbitrary code points in ranges and sets. That will also remove the dependency of ANTLRLexer on TextEncoder.