[Caml-list] The lexer hack

Dario Teixeira

unread,

Nov 10, 2009, 9:43:04 AM11/10/09

to caml...@yquem.inria.fr

Hi,

I'm creating a parser for a LaTeX-ish language that features verbatim blocks.
To handle them I want to switch lexers on-the-fly, depending on the parsing
context. Therefore, I need the state from the (Menhir generated) parser
to influence the lexing process (I believe this is called the "lexer hack"
in compiler lore).

Presently I am doing this by placing a module between the lexer and the
parser, listening in on the flow of tokens, and using a crude state machine
to figure out the parsing context. This solution is however error-prone
and a bit wasteful, since I'm reimplementing by hand stuff that should be
the sole competence of the parser generator.

Anyway, since I'm sure this problem pops up often, does someone have any
alternative suggestions? I would preferably keep Menhir, but I'll switch
if some other generator offers a better approach(*).

Thanks + best regards,
Dario Teixeira

(*) I've looked into Dypgen, and its partial actions may offer a way out.
Does someone have any experience with those and with real-world usage
of Dypgen in general? (In other words, is it stable enough for
production use?)

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Jeff Shaw

unread,

Nov 10, 2009, 10:26:38 AM11/10/09

to caml...@inria.fr

Dario,
You could write your lexers in Menhir and make them part of your
grammar. I know this isn't a terribly easy solution but it would be
elegant IMO.

Dario Teixeira

unread,

Nov 10, 2009, 11:26:41 AM11/10/09

to caml...@inria.fr, Jeff Shaw

Hi,

> You could write your lexers in Menhir and make them part of
> your grammar. I know this isn't a terribly easy solution but
> it would be elegant IMO.

I thought of scannerless parsing, but the input is UTF8 encoded,
which makes using Ulex all the more convenient and less error-prone.

Anyway, I was looking at Dypgen's early actions, and an idea
occurred to me: I can create dummy empty actions that simply
change a global "parsing context" variable:

inline:
| (...)
| BEGIN_VERB enter_verb RAW END_VERB exit_verb {Ast.Verbatim $3}
| (...)

enter_verb: /* empty */ {Global.context := Global.Verbatim}
exit_verb: /* empty */ {Global.context := Global.General}

Still hackish, but better than creating a state machine...

Cheers,
Dario Teixeira

Francois Pottier

unread,

Nov 10, 2009, 11:33:41 AM11/10/09

to Dario Teixeira, Jeff Shaw, caml...@inria.fr

Hello,

On Tue, Nov 10, 2009 at 08:26:23AM -0800, Dario Teixeira wrote:
> Anyway, I was looking at Dypgen's early actions, and an idea
> occurred to me: I can create dummy empty actions that simply
> change a global "parsing context" variable:
>
> inline:
> | (...)
> | BEGIN_VERB enter_verb RAW END_VERB exit_verb {Ast.Verbatim $3}
> | (...)
>
> enter_verb: /* empty */ {Global.context := Global.Verbatim}
> exit_verb: /* empty */ {Global.context := Global.General}
>
>
> Still hackish, but better than creating a state machine...

Interesting. Have you confirmed that this works? I am slightly worried by the
fact that an LR parser reads one token ahead, i.e. one token past BEGIN_VERB
might already have been read before the enter_verb semantic action is
executed. If that is so, then this token would be read while the lexer is
still in the wrong mode.

--
Fran�ois Pottier
Francois...@inria.fr
http://gallium.inria.fr/~fpottier/

Dario Teixeira

unread,

Nov 10, 2009, 11:48:13 AM11/10/09

to Francois...@inria.fr, Jeff Shaw, caml...@inria.fr

Hi,

> Interesting. Have you confirmed that this works? I am slightly
> worried by the fact that an LR parser reads one token ahead,
> i.e. one token past BEGIN_VERB might already have been read
> before the enter_verb semantic action is executed. If that is
> so, then this token would be read while the lexer is still in
> the wrong mode.

Yes, I was just thinking about that as well... :-)
I think I can pile another hack on top of the dummy action:
dummy tokens to take care of the readahead issue. Though
this has the potential to get comically silly pretty quickly!

I'll report later...

cheers,
Dario Teixeira

David Allsopp

unread,

Nov 10, 2009, 12:36:15 PM11/10/09

to Dario Teixeira, caml...@yquem.inria.fr

> I'm creating a parser for a LaTeX-ish language that features verbatim blocks.

Out of interest, how LaTeX-ish do you mean? I would hazard a guess that it's impossible to parse an unrestricted TeX file using an LR grammar (or at least no more clear than a hand-coded automaton) because you have to execute the macro expander in order to parse the file *completely* correctly. However, if you only mean LaTeX-ish in syntax (i.e. the files aren't actually TeX files) then you don't have to worry about TeX's elegant (by which I mean terrifying) \catcode mechanism and macro language!

David

Dario Teixeira

unread,

Nov 10, 2009, 3:03:06 PM11/10/09

to caml...@yquem.inria.fr, David Allsopp

Hi,

> Out of interest, how LaTeX-ish do you mean? I would hazard
> a guess that it's impossible to parse an unrestricted TeX
> file using an LR grammar (or at least no more clear than a
> hand-coded automaton) because you have to execute the macro
> expander in order to parse the file *completely* correctly.
> However, if you only mean LaTeX-ish in syntax (i.e. the
> files aren't actually TeX files) then you don't have to
> worry about TeX's elegant (by which I mean terrifying)
> \catcode mechanism and macro language!

I developed the language's syntax in tandem with the parser/lexer
so I made sure it was LR-friendly and Ulex-friendly (the verbatim
environments are the only parsing-unfriendly features). The language
looks and feels like LaTeX, but without the hairy stuff...

Incidentally, the dummy token/action trick seems to be working
fine with Menhir. Since the parser will look ahead one token,
I just have a tokenizer sitting between the lexer and the parser,
and inserting a DUMMY token into the stream after any token that
precedes a dummy action:

inline:
| (...)
| BEGIN_VERBATIM enter_verb DUMMY RAW exit_verb END_VERBATIM {...}
| (...)

enter_verb: /*empty*/ {Global.context := Global.Verbatim}
exit_verb: /*empty*/ {Global.context := Global.General}

It's not the prettiest thing in the world (and I suspect I might
still find some problem with it), but as far as lexer hacks go
it's not bad and a lot better than building a state machine.

Cheers,
Dario Teixeira

Martin Jambon

unread,

Nov 11, 2009, 6:04:33 AM11/11/09

to Dario Teixeira, Jeff Shaw, caml...@inria.fr

Dario Teixeira wrote:
> Hi,
>
>> Interesting. Have you confirmed that this works? I am slightly
>> worried by the fact that an LR parser reads one token ahead,
>> i.e. one token past BEGIN_VERB might already have been read
>> before the enter_verb semantic action is executed. If that is
>> so, then this token would be read while the lexer is still in
>> the wrong mode.
>
> Yes, I was just thinking about that as well... :-)
> I think I can pile another hack on top of the dummy action:
> dummy tokens to take care of the readahead issue. Though
> this has the potential to get comically silly pretty quickly!
>
> I'll report later...

If the lexer to use can be determined by only one token (BEGIN_VERB), I think
you can change the state in the lexer like this:

rule token state = parse
"" { match !state with
`Normal -> normal_token state lexbuf
| `Verbatim -> verbatim_token state lexbuf
}

and normal_token state = parse
...
| "\\begin{verbatim}" { state := `Verbatim; BEGIN_VERB }

and verbatim_token state = parse
... { RAW (...) }
| "\\end{verbatim}" { state := `Normal; END_VERB }

An even simpler option, if possible in your case, is to use a single token for
the whole verbatim section:

rule token = parse
...
| "\\begin{verbatim}" { finish_verbatim lexbuf }

and finish_verbatim = shortest
_* as s "\\end{verbatim}" { RAW s }

Martin

--
http://mjambon.com/

Micha

unread,

Nov 14, 2009, 11:20:25 AM11/14/09

to caml...@yquem.inria.fr

On Tuesday, 10. November 2009 15:42:52 Dario Teixeira wrote:
> Hi,
>
> I'm creating a parser for a LaTeX-ish language that features verbatim
> blocks. To handle them I want to switch lexers on-the-fly, depending on the
> parsing context. Therefore, I need the state from the (Menhir generated)
> parser to influence the lexing process (I believe this is called the "lexer
> hack" in compiler lore).

if the lexer cannot decide it on the tokens seen, a packrat parser (like
Aurochs) may be a better choice, since in a PEG there is no seperate lexer,
it's all one grammar, so you don't have this problem.

Michael

Dario Teixeira

unread,

Nov 14, 2009, 1:09:05 PM11/14/09

to caml...@yquem.inria.fr, Micha

Hi,

> if the lexer cannot decide it on the tokens seen, a packrat
> parser (like Aurochs) may be a better choice, since in a PEG
> there is no seperate lexer, it's all one grammar, so you don't
> have this problem.

But does Aurochs also handle UTF8 streams?

In the meantime I've implemented the parser using Ulex/Menhir
with the "dummy action" trick I mentioned before. It allowed
me to simplify the tokenizer tremendously, though it's still
present:

https://forge.ocamlcore.org/plugins/scmsvn/viewcvs.php/trunk/lambdoc/src/lib/lambdoc_read_lambtex/?root=lambdoc

Cheers,
Dario Teixeira

Dario Teixeira

unread,

Nov 14, 2009, 1:19:41 PM11/14/09

to Martin Jambon, caml...@inria.fr

Hi,

> If the lexer to use can be determined by only one token
> (BEGIN_VERB), I think you can change the state in the lexer
> like this:

Unfortunately the language features some verbatim-like environments
where choosing the right scanner entails knowing more about the
context than what is available to the lexer. As an example, in
the command \link{url}{text}, the "url" should be interpreted
verbatim, whereas the "text" portion uses the general scanner.
The parser is fully aware of the context, of course, which is
why the "dummy action" solution worked fine.

(And yes, I thought of having different delimiters for different
scanning contexts, but in many ways it would make the language
more cumbersome for the user).

Cheers,
Dario Teixeira

Goswin von Brederlow

unread,

Nov 14, 2009, 2:09:25 PM11/14/09

to Micha, caml...@yquem.inria.fr

Micha <mic...@fantasymail.de> writes:

> On Tuesday, 10. November 2009 15:42:52 Dario Teixeira wrote:
>> Hi,
>>
>> I'm creating a parser for a LaTeX-ish language that features verbatim
>> blocks. To handle them I want to switch lexers on-the-fly, depending on the
>> parsing context. Therefore, I need the state from the (Menhir generated)
>> parser to influence the lexing process (I believe this is called the "lexer
>> hack" in compiler lore).
>
> if the lexer cannot decide it on the tokens seen, a packrat parser (like
> Aurochs) may be a better choice, since in a PEG there is no seperate lexer,
> it's all one grammar, so you don't have this problem.
>
> Michael

There usualy must be something present to support '(* <vebatim
text> *)' and '"<verbatim text>"', i.e. comments and strings.
Find out what is recommended for those and adapt it.

MfG
Goswin