Same token has different meanings in different contexts

susa...@gmail.com

Apr 25, 2018, 10:37:09 PM
to antlr-discussion
I am trying to parse English print in a way that will make it easier to translate to contracted English braille. (I'm very knowledgeable about braille; this is an ANTLR question.)

The problem occurs when the correct braille output depends on context.  Most isolated braille letters represent entire print words. For example, the braille letter "b" represents the word "but" when isolated but just the letter "b" itself in the word "butter" and other words containing that sequence.

However, when I use "BUT: 'but';" as a lexer token, the lexer identifies every 'but' sequence in the input as a BUT token, so I have to include that token in grammar phrases for longer words. I could use different phrase names to tell the isolated and longer-word contexts apart and then use the phrase name to signal which hash table has the appropriate translation, e.g. 'b' or 'but'. But I'm wondering if there is a better solution.

Mike Lischke

Apr 26, 2018, 3:21:18 AM
to antlr-di...@googlegroups.com
Hi Susan,

> The problem occurs when the correct braille output depends on context. Most isolated braille letters represent entire print words. For example, the braille letter "b" represents the word "but" when isolated but just the letter "b" itself in the word "butter" and other words containing that sequence.

This is a semantic problem and hence cannot be decided by ANTLR4 (which does only syntactic processing). You will first have to parse the input as is and then, in your semantic phase, determine what it *means* (and act accordingly). Offloading your semantic processing to the parsing step will cause more grief than it solves.

>
> However, when I use "BUT: 'but';" as a lexer token, the lexer identifies every 'but' sequence in the input as a BUT token so I have to include that token in grammar phrases for longer words.

That’s exactly as it should be. One input sequence is mapped to a single lexer token. The input as such doesn’t know about the semantics you apply to a specific symbol.

> I could use different phrase names to tell the isolated and longer word contexts apart and then use the phrase name to signal which hash table has the appropriate translation, e.g. 'b' or 'but'. But I'm wondering if there is a better solution.

How can that help? In both cases your input is a 'b' and lexed as such. What you have to adjust is your semantic processing of the input. Compare the incoming token with its surroundings to know what it is about. Note: since you didn’t mention it, I assume you don’t have a keyword that parses 'but' in addition to the isolated 'b' letter.
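
To illustrate what comparing a token with its surroundings could look like, here is a minimal sketch in plain Python (not ANTLR). It assumes the print-to-braille direction from the original question, and that a word counts as "isolated" when it is a whole run of letters bounded by non-letters. The names `WHOLE_WORD_CONTRACTIONS` and `contract` are hypothetical, and the table holds only the one mapping mentioned in this thread.

```python
import re

# Hypothetical lookup table: whole print word -> braille letter.
# Only "but" -> "b" comes from this thread; a real table would hold more.
WHOLE_WORD_CONTRACTIONS = {"but": "b"}

# Split the input into runs of letters (words) and runs of everything else.
WORD_OR_OTHER = re.compile(r"[A-Za-z]+|[^A-Za-z]+")


def contract(text: str) -> str:
    """Replace isolated whole words with their contraction.

    The decision is made per complete word, after tokenizing, so "but"
    inside "butter" is never seen on its own and is left unchanged.
    Capitalization of a contracted word is not preserved in this sketch.
    """
    out = []
    for chunk in WORD_OR_OTHER.findall(text):
        # Non-letter chunks and unknown words fall through unchanged.
        out.append(WHOLE_WORD_CONTRACTIONS.get(chunk.lower(), chunk))
    return "".join(out)


if __name__ == "__main__":
    print(contract("but the butter"))
```

The point of the sketch is only that the contraction decision happens after tokenizing, on whole words, rather than inside the lexer rules.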

Mike
--
www.soft-gems.net
