Henry Spencer's "Tcl" Regex Library

Chris F Clark

Oct 1, 2007, 3:15:42 PM
Michela Becchi, a friend and PhD student I am working with on regex
related matters and who is publishing a paper on Hybrid NFA-DFAs, got
an interesting review comment about the above-mentioned library--that
it in fact implements hybrid NFA-DFAs. (I have copied Henry Spencer
via email, so that he can respond if not too busy. However, I'd also
like other insights if they are available.)

In any case, Michela's work has been in devising ways of splitting a
regex problem into sub-problems where some of the problem is solved by
a DFA and some of the problem is solved by an NFA. Her work results
in a very clever partitioning of the problem such that the resulting
machine runs at roughly DFA speeds while taking roughly NFA space. (I
believe the paper on that aspect has already been published, this is
follow-on work solving an interesting related problem.)

The question I'm curious about is whether the above library does the
same thing, i.e. whether it partitions a regex problem into a DFA
portion and an NFA portion, and whether it does this statically or
dynamically. Her work does it statically, based upon the
characteristics of the problem.

However, I believe there are regex libraries that incrementally
generate a DFA from an NFA dynamically as the program is running. (I
believe this is the essence of Thompson's NFA algorithm, which is
distinct from his subset construction algorithm.) That method would
have similar effects in general, but the approach and rationale are
different. For example, the dynamic version only generates the
portion of the DFA it actually uses. However, if it doesn't take
advantage of the static insights that Michela's approach does, it
could potentially generate an exponentially large machine. Moreover,
the two approaches are complementary: one could generate, on the fly,
a hybrid machine with the interesting characteristics that she
outlined.
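
To make concrete what I mean by generating the DFA incrementally,
here is a minimal sketch (my own illustration, not any particular
library's code; the toy NFA encoding and the names eps_closure/run
are made up): treat each reachable set of NFA states as one DFA
state, and compute transitions only as the input demands them,
optionally remembering them in a cache.

# Sketch of incremental ("lazy") DFA construction from an NFA.
# The NFA encoding is a toy: nfa[state][char] is the set of next
# states, with epsilon moves stored under the key "".

def eps_closure(nfa, states):
    """Return every state reachable from `states` via epsilon moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get("", ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def run(nfa, start, accepting, text, cache=None):
    """Simulate the NFA on `text`.

    Each frozenset of NFA states plays the role of one DFA state.
    With cache=None this is a plain subset simulation; with cache={}
    every transition discovered is remembered, so only the portion
    of the DFA that the input actually touches ever gets built."""
    state = eps_closure(nfa, {start})
    for c in text:
        if cache is not None and (state, c) in cache:
            state = cache[(state, c)]   # reuse a cached DFA transition
            continue
        nxt = set()
        for s in state:                 # step every live NFA state on c
            nxt |= nfa.get(s, {}).get(c, set())
        nxt = eps_closure(nfa, nxt)
        if cache is not None:
            cache[(state, c)] = nxt     # remember the new DFA transition
        state = nxt
    return bool(state & accepting)

# Example: an NFA for /ab*/ -- state 0 --a--> 1, state 1 --b--> 1.
nfa = {0: {"a": {1}}, 1: {"b": {1}}}
print(run(nfa, 0, {1}, "abbb", cache={}))   # True
print(run(nfa, 0, {1}, "ba", cache={}))     # False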

In any case, if there are hybrid NFA-DFA solutions out there (or in
progress), we would like to know what they are and how they work (and
what problems they solve).

Thanks for any input,
-Chris

*****************************************************************************
Chris Clark                    Internet : com...@world.std.com
Compiler Resources, Inc.       Web Site : http://world.std.com/~compres
23 Bailey Rd                   voice    : (508) 435-5016
Berlin, MA 01503 USA           fax      : (978) 838-0263 (24 hours)

Russ Cox

Oct 2, 2007, 12:12:18 PM
There are (at least) three different issues you've brought up --
the use of the term Hybrid NFA/DFA, whether Tcl does what
Michela does, and on-the-fly DFA constructions.


* Hybrid NFA/DFA

The first problem is your use of the term "hybrid NFA/DFA", which
unfortunately has been defined in Friedl's _Mastering Regular
Expressions_ to mean something different from what you mean.

Friedl doesn't ever define what a DFA or NFA is. He only says
that NFAs provide submatch information and backreferences, and
DFAs don't but run faster. The closest he comes to admitting that
these aren't really definitions is in the sidebar on p. 180:

    As a programmer, if you have a true (mathematically speaking)
    NFA regex engine, it is a relatively small task to add support
    for backreferences. A DFA's engine's design precludes the
    adding of this support, but an NFA's common implementation
    makes it trivial. In doing so, you create a more powerful
    tool, but you also make it decidedly nonregular (mathematically
    speaking). What does this mean? At most, that you should
    probably stop calling it an NFA, and start using the phrase
    ``nonregular expressions,'' since that describes (mathematically
    speaking) the new situation. No one has actually done this,
    so the name ``NFA'' has lingered, even though the implementation
    is no longer (mathematically speaking) an NFA.

    What does all this mean to you, as a user? Absolutely
    nothing. As a user, you don't care if it's regular, nonregular,
    unregular, irregular, or incontinent. So long as you know
    what you can expect from it (something this chapter will
    show you), you know all you need to care about.

http://regex.info/blog/2006-09-15/248

This "quacks like a duck" approach to calling things NFAs or
DFAs cannot accommodate implementations that run a DFA pass first
and then an NFA pass only to find submatches, or that manage to
use nothing but DFAs to find submatches. So Friedl calls them
"hybrid NFA/DFAs", because they kind of quack like both.
Table 4-1 (p. 145), which classifies programs by "engine type",
lists GNU awk, GNU grep/egrep, and Tcl as "Hybrid NFA/DFA".

The sidebar on p. 182 expands on this classification:

    I've said several times that a DFA can't provide capturing
    parentheses or backreferences. This is quite true, but it
    certainly doesn't preclude hybrid approaches that mix
    technologies in an attempt to reach regex nirvana. The
    sidebar on page 180 told how NFAs have diverged from the
    theoretical straight and narrow in search of more power, and
    it's only natural that the same happens with DFAs. A DFA's
    construction makes it more difficult, but that doesn't mean
    impossible.

    GNU grep takes a simple but effective approach. It uses a
    DFA when possible, reverting to an NFA when backreferences
    are used. GNU awk does something similar--it uses GNU grep's
    fast shortest-leftmost DFA engine for simple "does it match"
    checks, and reverts to a different engine for checks where
    the actual extent of the match must be known. Since that
    other engine is an NFA, GNU awk can conveniently offer
    capturing parentheses, and it does via its special gensub
    function.

    Tcl's regex engine is a true hybrid, custom built by Henry
    Spencer (whom you may remember having played an important
    part in the early development and popularization of regular
    expressions [p. 88]). The Tcl engine sometimes appears to
    be an NFA--it has lookaround, capturing parentheses,
    back-references, and lazy quantifiers. Yet, it has true
    POSIX longest-leftmost match ([p. 177]), and doesn't suffer
    from some of the NFA problems that we'll see in Chapter 6.
    It really seems quite wonderful.

Thus the use of "hybrid NFA-DFA" to describe Michela's work
unfortunately makes it sound, to those who know these terms only
through Friedl's book, like it's no different than GNU grep or Tcl.

Your one-sentence description makes it sound, to me, like it
*is* different from both the awk/grep hybrid and from Tcl; if
it's not too late, you might try to change the name to something
other than "hybrid".


* What does Tcl's regex do?

Let's assume a regexp with capturing parentheses but not
backreferences. (That is, /the (cat|dog) ran/ is okay, but
/(cat|dog) eat \1/ is not.) Let's also assume POSIX-style longest
match semantics are the goal.

Tcl's regex first runs an unanchored DFA search to find p, the
ending position of the shortest match (meaning the match with
leftmost end). If there is no match, stop, returning failure.

If the caller doesn't care about any match position info, stop,
returning success. Otherwise, it's time to figure out the exact
position of the leftmost longest match.

The leftmost longest match must start at or before p. Tcl's
regex loops over every character position from the beginning of
the string to p, trying an anchored DFA search to find the length
of the longest match starting at the current position. When it
finds a match, it stops the loop: that's the leftmost longest
match.

If the caller only cares about the position of the overall match
($0) but not any of the capturing parentheses ($1, $2, ...),
then stop, returning success. Otherwise, it's time to figure
out the positions of the capturing parentheses.

The algorithm up to this point is:

match(re, text, match, nmatch)
{
    etext = text + strlen(text);
    p = shortest(/.*re/, text, etext);
    if(p == NULL)
        return false;
    if(nmatch == 0)
        return true;
    for(q = text; q <= p; q++){
        eq = longest(/re/, q, etext);
        if(eq != NULL){
            match[0].start = q;
            match[0].end = eq;
            break;
        }
    }
    if(nmatch == 1)
        return true;

    dissect(re, match[0].start, match[0].end, match);
    return true;
}

The function dissect determines the position of every subexpression
in re, saving the ones corresponding to capturing parentheses.
It does this by recursing over the structure of the regexp and
searching using DFAs corresponding to the subpieces.

To dissect a concatenation, Tcl figures out the longest possible
match for the first half and checks whether the remainder matches
the second half. If so, done. If not, find the next longest
possible match for the first half, and so on:

dissect(re, text, etext, match)
{
    if(re is concat: left right){
        for(middle = etext; middle >= text; middle = p-1){
            p = longest(/left/, text, middle);
            if(p == NULL)
                break;
            if(longest(/right/, p, etext) == etext){
                dissect(left, text, p, match);
                dissect(right, p, etext, match);
                return;
            }
        }
        panic("cannot dissect concat");
    }

To dissect an alternation, Tcl tries each choice looking for the
one that uses all the text:

    if(re is alt: left|right){
        if(longest(/left/, text, etext) == etext)
            dissect(left, text, etext, match);
        else
            dissect(right, text, etext, match);
    }

If dissect comes across a capturing parenthesis, then it can
record the match.

    if(re is capture n: (subexpr)){
        match[n].start = text;
        match[n].end = etext;
        dissect(subexpr, text, etext, match);
    }
}

(The regexp is actually a directed graph that might have cycles,
not a tree; regexps like a* are rewritten into (aa*|).)

Note that there is no backtracking here: each step determines
the subdissection and then moves on. There is no "and if that
doesn't work, ...".

An obvious optimization would be to make dissect do nothing if
the re being considered contained no capturing parens. The Tcl
code I'm looking at appears not to do that.

The run-time of this algorithm is a little tricky to analyze.
On a regexp of length m on a text of length n, and ignoring the
time to construct DFAs/NFAs from regexps...

If no submatch info is required, it takes O(n) time.

If only match[0] is required, it takes O(n^2) time.

The top-level concatenation dissection requires O(n^2) time. In
the worst case you cut off just one character, making the next
levels O(1^2) and O((n-1)^2). Repeating that gives O(n^3) time.
Alternations might force you to do this multiple times, but no
more than m, which would be O(m * n^3) time.
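
(Spelled out as a recurrence, with D(n) as shorthand for the cost of
dissecting a span of length n -- my own notation, added only to make
the bound explicit -- the worst case is roughly

    D(n) <= c*n^2 + D(1) + D(n-1)

which unrolls to about c*(n^2 + (n-1)^2 + ... + 1) = O(n^3).)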

I'm not sure how things like repetition or empty strings affect
the run-time. The fact that a* is rewritten into an effectively
infinite-length regular expression can probably bump the m up
to n, so that you could force O(n^4) behavior if you really tried.

Anyway, at least the run time is polynomial!


* On-the-fly DFA constructions

Thompson's 1968 CACM paper is hard to classify, since it doesn't
mention the terms NFA or DFA. If you're willing to read between
the lines then the compiled code is really an encoding of an NFA
that runs a subset simulation as it reads the input. It's kind
of building the DFA on the fly, if you view each successive CLIST
as a DFA state, but it's not caching any of the DFA states that
it builds, so it's more like an NFA subset simulation. Since
it doesn't cache at all, it's slower, but it can't possibly build
an exponentially-large machine.

Spencer's Tcl regex builds the DFA on the fly from the NFA
representation, using a cache to speed the DFA simulation. The
cache has a fixed maximum size, though, so it can't end up
building an exponentially-large machine either.
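
A tiny variation on the lazy-DFA sketch added earlier in this thread
(again only an illustration, not the actual Tcl code; it reuses the
eps_closure() helper from that sketch) shows the effect of such a
fixed-size cache: when the cache fills, throw it away and keep going,
so the machine built on the fly can never exceed a chosen size.

def run_bounded(nfa, start, accepting, text, max_cache=256):
    """Lazy-DFA simulation with a transition cache of bounded size."""
    cache = {}
    state = eps_closure(nfa, {start})
    for c in text:
        key = (state, c)
        if key not in cache:
            nxt = set()
            for s in state:
                nxt |= nfa.get(s, {}).get(c, set())
            if len(cache) >= max_cache:
                cache.clear()    # crude eviction: discard the partial DFA
            cache[key] = eps_closure(nfa, nxt)
        state = cache[key]
    return bool(state & accepting)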


Russ

Chris F Clark

Oct 2, 2007, 3:02:46 PM
Thank you for your very informative reply.

I don't like the term Hybrid NFA/DFA either. However, it is her and
her professor's choice of terms and they have several papers that use
the term, so I think they will probably continue using it. For what
they are doing, I would use a different term, "mostly deterministic
finite-state automata", because their machine is mostly like a DFA,
but with a few NFA portions (and I mean a true NFA, not nonregular/regexp
extensions).

For what Friedl calls "nonregular expressions", I am of the "regexp"
naming camp. I use regexp when I mean something that looks like a
regular expression but doesn't describe a regular language. I reserve
regular expression for when I mean something that describes a regular
language. I hadn't thought about the issue of what to call an
automaton that can process a regexp. It isn't a true NFA, since it
has the auxiliary data structure that handles the captures and
backreferences.

They do mean something quite different from what Friedl means. They
are squarely within the standard theoretical framework (e.g. no
capturing subexpressions or backreferences). Their goal is to combat
the state explosion that normally occurs in NFA->DFA translation.
However, they only handle things which "could" be translated into a
DFA (if it weren't for the state explosion problem). For the problems
they are attacking, the regexp extension problem isn't an issue,
because the prior approaches were all restricted to true regular
expressions (or even simpler sets-of-strings problems).

They probably use the hybrid term because the machine executes a DFA
until certain conditions are met; when they are, the machine can fork
off an NFA thread to investigate the part of the pattern which, if
compiled to a DFA, would have caused a state explosion, while
continuing to run the DFA for the rest of the "main" pattern. Thus,
her machine is a hybrid of a DFA and an NFA. In fact, when you look
at the results of her compilation process, there are sections of her
machine that are DFAs and sections that are NFAs, with "border"
states in between them. The border states are where the forks occur.

Thank you also for the detailed description of what the regexp
package in Tcl does. It is very informative. In particular, the
description of how boundaries are set by using an NFA was relevant.
One of the problems Michela and I worked on was finding the boundaries
of where a match was found. (While the problem set of interest does
not admit subpattern capture, the start and end of the actual match
are relevant for the downstream uses.) That turns out to be a more
difficult problem for DFAs than NFAs, because any final state in a DFA
may represent a set of different patterns with different start
positions. She worked out several interesting solutions to that
problem--several of which I'm certain she will publish, so I won't
give spoilers.

Thanks again,
