
re2c-1.0 released!


Ulya Trofimovich

Aug 27, 2017, 1:45:49 PM
Hello everybody,


I'm glad to announce the new major release of the re2c lexer generator,
version 1.0. re2c specializes in generating fast and flexible lexers; it
was written by Peter Bumbulis around 1994 [0] and has since been
maintained by many people. See the release notes for the full story [1].

The main breakthrough of this release is the addition of fast submatch
extraction: re2c now supports capturing groups with POSIX longest-match
semantics, as well as standalone capturing "tags" with leftmost greedy
semantics. The implementation is based on a novel, efficient algorithm [2].

The challenge of adding submatch extraction to a lexer generator is not
immediately obvious: this feature is provided by many regular expression
libraries and command-line tools like grep and sed. It may seem that
only a lack of effort prevents developers of lexer generators like Flex
from implementing it (as well as fixing the ever-broken trailing
contexts [3]).

The truth is, libraries and tools process regular expressions at run
time and use NFA-based algorithms when they need submatch extraction.
Lexer generators, on the other hand, compile regular expressions to DFA.
As it turns out, submatch extraction is inherently more complex than
recognition: it can be solved on NFA, but not on (ordinary) DFA.

The new algorithm [2] is based on the Tagged DFA invented by Ville
Laurikari in 2000 [4]; it also incorporates the POSIX disambiguation
algorithm informally described by Kuklewicz in 2007 [5]. Any feedback on
the new algorithm would be much appreciated.


Ulya Trofimovich


[0] https://compilers.iecc.com/comparch/article/94-04-115

[1] http://re2c.org/news/release_notes/1_0.html

[2] http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf

[3] https://ftp.gnu.org/old-gnu/Manuals/flex-2.5.4/html_node/flex_23.html

[4] https://laurikari.net/ville/spire2000-tnfa.pdf

[5] https://wiki.haskell.org/index.php?title=Regular_expressions/Bounded_space_proposal&oldid=11475

Kaz Kylheku

Sep 2, 2017, 11:18:33 AM
On 2017-08-27, Ulya Trofimovich <skva...@gmail.com> wrote:
> Hello everybody,
>
>
> I'm glad to announce the new major release of the re2c lexer generator,
> version 1.0. re2c specializes in generating fast and flexible lexers; it
> was written by Peter Bumbulis around 1994 [0] and has since been
> maintained by many people. See the release notes for the full story [1].
>
> The main breakthrough of this release is the addition of fast submatch
> extraction: re2c now supports capturing groups with POSIX longest-match
> semantics, as well as standalone capturing "tags" with leftmost greedy
> semantics. The implementation is based on a novel, efficient algorithm [2].
>
> The challenge of adding submatch extraction to a lexer generator is not
> immediately obvious: this feature is provided by many regular expression

Neither is the advantage/utility of this obvious.

> libraries and command-line tools like grep and sed.

They exist in support of very flimsy, ad hoc approaches for
dealing with some text processing problems.

Briefly, why would you do some hacky regex thing in lex with \1, \2,
\3, when in the level immediately above yylex() you have proper phrase
recognition, with $1, $2, $3.

(What is that, if not a better grade of backreferencing?)

> It may seem that
> only a lack of effort prevents developers of lexer generators like Flex
> from implementing it (as well as fixing the ever-broken trailing
> contexts [3]).

Lack of justification due to, I suspect, lack of demand from
sophisticated users of the tools, who use them for their intended
purpose.

Kaz Kylheku

Sep 2, 2017, 11:18:58 AM
On 2017-08-27, Ulya Trofimovich <skva...@gmail.com> wrote:
> libraries and command-line tools like grep and sed. It may seem that
> only a lack of effort prevents developers of lexer generators like Flex
> from implementing [submatch extraction] (as well as fixing the
> ever-broken trailing contexts [3]).

> [3] https://ftp.gnu.org/old-gnu/Manuals/flex-2.5.4/html_node/flex_23.html

Firstly, your [3] references the documentation of Flex 2.5.4, a version
dating back to, I think, late 1996, making it 21 years old! The
trailing context feature works well enough; it has quirks and
restrictions in its use (which Flex does a decent job of diagnosing, in
my experience).

With that out of the way: the likely reason Lex doesn't feature submatch
extraction (like with \1, \2 and so on) is that this isn't normally
required for lexical analysis. Submatch extraction is useful for
hacky text processing with regexes in the absence of a parser.

This feature will be of the greatest use when using a lexical analyzer
generator *by itself* to optimize a text filtering task (e.g. taking
some "sed logic" and making it faster by transformation to C), rather
than when scanning a programming or data definition language.

If Lex is being used with a parser (like by the denizens of
comp.compilers), there is already an excellent form of backreferencing
available: namely the references to the grammar nodes from the grammar
rule action, such as the $1, $2, $3 ... variables in Yacc, or
the equivalent under some other parser generator.

For instance, someone processing, say, a 2017-08-28 style date using
regexes in a POSIX text processing tool might have something like
([0-9]+)-([0-9]+)-([0-9]+), and get the pieces as \1 \2 \3. This
approach is hardly needed by someone treating the situation more
formally with a grammar phrase structure like INTEGER '-' INTEGER '-'
INTEGER, where the integers can be pulled out as $1, $3 and $5.
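
In Yacc terms, the fragment might look roughly like this (a sketch;
make_date is a hypothetical constructor for the semantic value):

date : INTEGER '-' INTEGER '-' INTEGER
          { $$ = make_date($1, $3, $5); }
     ;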

However, it's easy to see that a date in that form can be handled as a
single token, and then backreferencing can provide convenient access to
the pieces, helpful to code which produces a three-element semantic
value for the token, supplying integer values to the parser. So there is
some nonzero utility to a language writer; it's just not pressing. In
the few places where this sort of thing could be of benefit, you can
code up a small, local workaround.

It's even easier to see that it's a significant burden to deal with
formal syntax above the lexer for someone who just wants a fast text
processing hack that does everything it needs inside the lexer;
that's the use case that sees the greatest benefit.

Even purer opinion follows:

It was a big, big mistake in Unix regexes to endow the grouping
parentheses with semantics by making them delimit numbered capture
registers.

This is such a problem that derivatives of POSIX regex syntax, such as
PCRE, found it necessary to add back a "non-capturing" form of
parenthesis (expressed using annoying verbiage) just for doing pure grouping.

If you add capturing, at least use a special syntax for it, leaving
the ordinary parenthesis alone.

For instance:

(?1R)

could mean "match regex R, and capture into register 1", and

(R)

just means "parse R as an indivisible unit with regard to all
surroundings" without burdening it with any additional semantics.

Anton Ertl

Sep 2, 2017, 11:50:16 PM
Kaz Kylheku <398-81...@kylheku.com> writes:
>Briefly, why would you do some hacky regex thing in lex with \1, \2,
>\3, when in the level immediately above yylex() you have proper phrase
>recognition, with $1, $2, $3.

These don't do the same thing. Taking your example from the other
posting, do you want to recognize

2017
-08
/* bla bla */ -
28

as a date? If so, do it at the parser level; if not, in the scanner.

I dimly remember a case of a lexeme type with internal structure that
was reflected in the regexp, where I then had to write C code manually
repeating some of the work that the scanner had already done, and wished
for a feature like the one provided by the OP. Unfortunately, that was
long ago and I do not remember the details.

>> It may seem that
>> only a lack of effort prevents developers of lexer generators like Flex
>> from implementing it (as well as fixing the ever-broken trailing
>> contexts [3]).
>
>Lack of justification due to, I suspect, lack of demand from
>sophisticated users of the tools, who use them for their intended
>purpose.

Or maybe just accepting the limitations of the tool and working around
them, without sending a feature request to the maintainer (I didn't,
at the time).

- anton
--
M. Anton Ertl
an...@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

George Neuner

Sep 2, 2017, 11:50:32 PM
On Sun, 27 Aug 2017 18:25:41 +0100, Ulya Trofimovich
<skva...@gmail.com> wrote:


>As it turns out, submatch extraction is inherently more complex than
>recognition: it can be solved on NFA, but not on (ordinary) DFA.

I don't know what this means ... every NFA has an equivalent DFA, so
if a problem can be solved with NFA, it can be solved with DFA.

George

Ulya Trofimovich

Sep 3, 2017, 11:55:10 AM
> I don't know what this means ... every NFA has an equivalent DFA, so
> if a problem can be solved with NFA, it can be solved with DFA.

My wording is bad, sorry. I should have put it this way: regular
expression parsing (and submatch extraction as a special case of it) can
be solved on nondeterministic finite state transducers (NFST). And NFST
are not equivalent to DFST, despite the fact that NFA are equivalent to DFA.

Traditional NFA can be trivially converted to NFST by augmenting their
transitions with output symbols and extending the simulation procedure
to write these output symbols. In the case of submatch extraction this
means equipping each path through the NFA with a set of registers -- a
pair per capturing group -- that are updated during simulation.
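
A rough sketch of the state such a simulation might carry per NFA path
(the names and NGROUPS are illustrative, not re2c's actual data structures):

struct thread {
    int  state;             /* NFA state this thread is in */
    long reg[2 * NGROUPS];  /* reg[2k]   = where group k started,
                               reg[2k+1] = where group k ended */
};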

However, NFST determinization is non-trivial. That was my point.

>> It may seem that only a lack of effort prevents developers of lexer
>> generators like Flex from implementing it (as well as fixing the
>> ever-broken trailing contexts [3]).
>
>Lack of justification due to, I suspect, lack of demand from
>sophisticated users of the tools, who use them for their intended
>purpose.

Well, the simple use case that I had in mind was re2c's own lexer
(yes, re2c is self-hosted): it has a lexeme of the form
"{[0-9]+,[0-9]+}" that is used to denote bounded repetition of the given
regular expression. Without submatch extraction the rule looks like this:

"{" [0-9]+ "," [0-9]+ "}" { ... strchr(token, ',') ... }

It is necessary to call 'strchr' to retrieve the position of the comma.
It would be nice if the lexer stored the position in a variable as it goes:

"{" [0-9]+ @p "," [0-9]+ "}" { ... p /* points to ',' */ ... }

But there are other, more complex use cases, like parsing URIs or HTTP
messages:

http://re2c.org/examples/example_10.html
http://re2c.org/examples/example_11.html

This technique is already applied by re2c users, so at least someone
somewhere finds it handy. ;)


Ulya

Ben Hanson

Sep 3, 2017, 12:02:23 PM
> >As it turns out, submatch extraction is inherently more complex than
> >recognition: it can be solved on NFA, but not on (ordinary) DFA.
>
> I don't know what this means ... every NFA has an equivalent DFA, so
> if a problem can be solved with NFA, it can be solved with DFA.

This is true for regular expressions, but not for capturing regular
expressions (irregular expressions?). This is why NFA is traditionally
used for regexes with captures.

Ben Hanson

Sep 3, 2017, 12:03:45 PM
Thanks for this. I will take a look at the implementation when I get the time.

I'm most interested in your refinements and bug fixes to the original
TDFA algorithm, and I would like to see this used as a runtime regex
library more than as a lexer generator.

Regards,

Ben

Kaz Kylheku

Sep 3, 2017, 12:04:35 PM
On 2017-09-02, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Kaz Kylheku <398-81...@kylheku.com> writes:
>>Briefly, why would you do some hacky regex thing in lex with \1, \2,
>>\3, when in the level immediately above yylex() you have proper phrase
>>recognition, with $1, $2, $3.
>
> These don't do the same thing. Taking your example from the other
> posting, do you want to recognize
>
> 2017
> -08
> /* bla bla */ -
> 28
>
> as date? If so, do it at the parser level, if not, the scanner.

We face this choice only because we put a hack into the lexer: when it
scans a whitespace or comment token, it throws the token away, so the
parser doesn't see it.

That hack is very easy to implement and considerably simplifies the
grammar.

We can very easily implement another hack to allow us to recognize
2017-08-28 as three tokens without comments or whitespace.

In fact we can do it all in standard lex.

We can set up a custom YY_INPUT which allows the entire yytext to be
pushed back into the stream. The rest is done with orchestration of one
or more start states.

The steps to recognizing a date would then be:

1. First NNNN-NN-NN is scanned as a token.
2. push_back_string(yytext) is called
3. BEGIN(DATE) is invoked
4. The rule doesn't return so the lexer re-scans the pushed input
in the DATE state.
5. In the DATE state, an integer token is recognized; everything
else is an error. Dash tokens can be recognized and returned,
or consumed in the lexer.
6. Successful recognition of an integer token in the DATE state
returns not an INTEGER to the parser but a DINTEGER.
7. The parser's phrase structure for matching dates refers to DINTEGER
nonterminals and not INTEGER.

Pseudo-code, including a mechanism for returning to the INITIAL state
without parser involvement:

%x DATE

%%

<INITIAL,DATE>[0-9]+ {
    yylval.value = str_to_int(yytext); /* our function */
    return (YYSTATE == DATE) ? DINTEGER : INTEGER;
}

<INITIAL,DATE>- {
    return '-';
}

[0-9]+-[0-9]+-[0-9]+ {
    unput('!');            /* end signal */
    unput_string(yytext);  /* our function; works with our YY_INPUT */
    BEGIN(DATE);
}

<DATE>! { /* our end signal: include other states here */
    BEGIN(INITIAL);
}

<DATE>. {
    /* internal error: how did we get here? */
}
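
For completeness, our hypothetical unput_string() could be implemented
along these lines (a sketch; note that flex's unput() clobbers yytext,
hence the copy):

static void unput_string(const char *s)
{
    char *copy = strdup(s);   /* unput() may clobber the source buffer */
    size_t n = strlen(copy);
    while (n > 0)
        unput(copy[--n]);     /* push back in reverse, one char at a time */
    free(copy);
}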

Ulya Trofimovich

Sep 3, 2017, 1:43:49 PM
> In fact we can do it all in standard lex.
> [...]
> Pseudo-code, including a mechanism for returning to the INITIAL
> state without parser involvement:
> [...]

The thing is, the new algorithm is both more *efficient* and more
*elegant*. Consider how parsing a date in the format 'YYYY-MM-DD' would
look in re2c:


@y [0-9]{4} [-] @m [0-9]{2} [-] @d [0-9]{2} {
    unsigned year
        = (y[0] - '0') * 1000
        + (y[1] - '0') * 100
        + (y[2] - '0') * 10
        + (y[3] - '0');
    unsigned month
        = (m[0] - '0') * 10
        + (m[1] - '0');
    unsigned day
        = (d[0] - '0') * 10
        + (d[1] - '0');
}


Here re2c will optimize out all dynamic variables, because @d is always
2 characters behind the end of input, @m is 3 characters behind @d, and
@y is 5 characters behind @m. So all three variables will be calculated
using static offsets from the cursor when it points to the end of the
input.
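
To illustrate the idea (a sketch, not re2c's literal output; 'cur' is a
hypothetical cursor pointing just past the matched lexeme):

/* "YYYY-MM-DD" is always exactly 10 characters, so the tags
   degenerate into constant offsets from the cursor: */
const char *y = cur - 10;  /* start of YYYY */
const char *m = cur - 5;   /* start of MM   */
const char *d = cur - 2;   /* start of DD   */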


If comments and whitespace are allowed between date components, we could
change the above example like this (now re2c will use variables to track
submatch positions, since components have variable length):


com = "/*" ([^*] | ("*" [^/]))* "*""/";
wsp = [ \t\n\r]* | com;
hyp = wsp [-] wsp;

@y [0-9]{4} hyp @m [0-9]{2} hyp @d [0-9]{2} {
unsigned year
= (y[0] - '0') * 1000
+ (y[1] - '0') * 100
+ (y[2] - '0') * 10
+ (y[3] - '0');
unsigned month
= (m[0] - '0') * 10
+ (m[1] - '0');
unsigned day
= (m[0] - '0') * 10
+ (m[1] - '0');
}

Ulya Trofimovich

Sep 3, 2017, 1:44:57 PM
Using TDFA in a runtime library has certain downsides:

1. It incurs runtime overhead for (lazy) determinization: it is only
useful if the regexp is compiled once and used many times. Libraries
such as RE2 (by Russ Cox) or TRE (by Ville Laurikari) mostly use NFA for
submatching.

2. Also, thanks to ahead-of-time compilation, lexer generators are able
to perform extensive optimization of the generated code, including
minimization of the number of submatch variables and operations on them.
A runtime library cannot afford that.

However, if the TDFA algorithm is to be used in a library, then my paper
contains two improvements worth considering:

1. TDFA(1), with one-symbol lookahead, instead of the default TDFA(0)

2. A formal algorithm for POSIX disambiguation semantics.

I'd be glad to answer any questions anyway.


Ulya

Ben Hanson

Sep 4, 2017, 10:59:02 AM
On Sunday, 3 September 2017 18:44:57 UTC+1, Ulya Trofimovich wrote:
> Using TDFA in a runtime library has certain downsides:
>
> 1. It incurs runtime overhead for (lazy) determinization: it is only
> useful if the regexp is compiled once and used many times. Libraries
> such as RE2 (by Russ Cox) or TRE (by Ville Laurikari) mostly use NFA for
> submatching.
>
> 2. Also, thanks to ahead-of-time compilation, lexer generators are able
> to perform extensive optimization of the generated code, including
> minimization of the number of submatch variables and operations on them.
> A runtime library cannot afford that.
>
> However, if the TDFA algorithm is to be used in a library, then my paper
> contains two improvements worth considering:
>
> 1. TDFA(1), with one-symbol lookahead, instead of the default TDFA(0)
>
> 2. A formal algorithm for POSIX disambiguation semantics.


Ah, overheads. I'm guessing you're being a bit gloomy there, as "DFA
is too slow!" has been the battle cry of NFA advocates for decades
now. I didn't believe it in the '90s and I don't believe it now.

However, you raise an interesting point. There are plenty of regexes I
use at work that are fixed, so they could indeed just use generated
code to save the construction cost at runtime. The only reason I
haven't done this up until now is that it makes maintenance easier to
just have a string you can tweak directly in the source. The holy
grail (I believe) is sufficiently powerful meta-programming to do the
whole thing at compile time, although you could of course do it today
with custom build steps to generate code via re2c or whatever other
tool (that kind of thing tends to scare developers where I work).

When I want lookup speed I use my own (runtime) lexer generator with a
single rule of interest, and then a single skip rule for anything
else. This works very well and I'm sure re2c could easily be used the
same way.
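
In lex terms the setup is roughly this (a sketch, where INTERESTING
stands in for the one pattern of interest and process() for whatever
the handler does):

%%
{INTERESTING}  { process(yytext); }
.|\n           ;  /* skip rule: ignore everything else */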

I'm currently playing with searching using LALR(1) grammars and have
written a search tool gram_grep which I will be extending soon. I
would like to put together an LR(*) algorithm to make defining rules
easier for people, but I don't know when I'll have the time to really
get into that.

I believe there are many more possibilities for text processing than
the traditional tools everyone normally turns to and I applaud your
efforts in bringing TDFA to re2c!

Regards,

Ben

Ulya Trofimovich

Sep 9, 2017, 3:26:16 PM
> However, you raise an interesting point. There are plenty of regexes I
> use at work that are fixed, so they could indeed just use generated
> code to save the construction cost at runtime. The only reason I
> haven't done this up until now is that it makes maintenance easier to
> just have a string you can tweak directly in the source. The holy
> grail (I believe) is sufficiently powerful meta-programming to do the
> whole thing at compile time, although you could of course do it today
> with custom build steps to generate code via re2c or whatever other
> tool (that kind of thing tends to scare developers where I work).

I think that in the future C++ will add compiled regular expressions as
a language extension. It seems logical in a compiled language, and it
would definitely simplify the toolchain.


> I'm currently playing with searching using LALR(1) grammars and have
> written a search tool gram_grep which I will be extending soon. I
> would like to put together an LR(*) algorithm to make defining rules
> easier for people, but I don't know when I'll have the time to really
> get into that.

At some point I also wrote a LALR(1) parser generator whose main
feature is generating direct-code parsers (as opposed to the
table-based parsers generated e.g. by bison). In my case (parsing
JavaScript) it resulted in about a 2x speedup over bison's parser
(minus one dereference on each iteration of the parser loop). Maybe you
would also like to experiment with this approach (a.k.a. 'recursive
ascent').
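
To sketch the idea with a toy hand-written example (not my generator's
actual output): each LR state becomes a C function, and the host call
stack doubles as the parse stack.

#include <stdio.h>
#include <stdlib.h>

static const char *in;  /* input cursor */

/* Recursive-ascent parser for the toy LR(0) grammar
 *     S -> 'a' S | 'b'
 * A reducing state returns rhs_length - 1 (returning already pops its
 * own frame); a state that receives 0 from a call has the freshly
 * reduced S arriving on top of it and performs the corresponding goto. */

static int die(void) { fprintf(stderr, "syntax error\n"); exit(1); }

static int state2(void) { return 0; }  /* S -> b .    reduce, |rhs| = 1 */
static int state4(void) { return 1; }  /* S -> a S .  reduce, |rhs| = 2 */

static int state1(void)  /* S -> a . S ;  S -> . a S ;  S -> . b */
{
    int r;
    if      (*in == 'a') { in++; r = state1(); }
    else if (*in == 'b') { in++; r = state2(); }
    else                 r = die();
    while (r == 0)
        r = state4();    /* goto on S from state 1 */
    return r - 1;
}

static void state0(void)  /* S' -> . S ;  S -> . a S ;  S -> . b */
{
    int r;
    if      (*in == 'a') { in++; r = state1(); }
    else if (*in == 'b') { in++; r = state2(); }
    else                 r = die();
    if (r == 0 && *in == '\0')  /* goto on S from state 0: accept */
        puts("accepted");
    else
        die();
}

int main(int argc, char **argv)
{
    in = argc > 1 ? argv[1] : "aab";
    state0();
    return 0;
}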


Ulya
[If you want C++ extended with RE's, that's essentially what flex is. -John]