234 views

Skip to first unread message

Oct 1, 2007, 3:15:42 PM10/1/07

to

Michela Becchi, a friend and PhD student I am working with on regex

related matters and who is publishing a paper on Hybrid NFA-DFAs, got

an interesting review comment about the above mentioned library--that

it in fact implements hybrid NFA-DFAs. (I have copied Henry Spencer

via email, so that he can respond if not too busy. However, I'd also

like other insights if they are available.)

related matters and who is publishing a paper on Hybrid NFA-DFAs, got

an interesting review comment about the above mentioned library--that

it in fact implements hybrid NFA-DFAs. (I have copied Henry Spencer

via email, so that he can respond if not too busy. However, I'd also

like other insights if they are available.)

In any case, Michela's work has been in devising ways of splitting a

regex problem into sub-problems where some of the problem is solved by

a DFA and some of the problem is solved by an NFA. Her work results

in a very clever partitioning of the problem such that the resulting

machine runs at roughly DFA speeds while taking roughly NFA space. (I

believe the paper on that aspect has already been published, this is

follow-on work solving an interesting related problem.)

The question I'm curious to is whether the above library does the same

thing, i.e. partition a regex problem into a DFA portion and an NFA

portion and whether it does this statically or dynamically. Her work

does it statically, based upon the characteristics of the problem.

However, I believe there are regex libraries that incrementally

generate a DFA from an NFA dynamically as the program is running. (I

believe this is the essence of Thompson's NFA algorithm, which is

distinct from his subset construction algorithm.) That method would

have similar effects in general, but the approach and rationale is

different. For example, the dynamic version only generated the

portion of the DFA it uses. However, if it doesn't take advantage of

the static insights that Michela's approach does, it could potentially

generate an exponentially large machine. Moreover, the two approaches

are complimentary, one could generate a Hybrid machine with the

interesting characteristics that she outlined on-the-fly.

In any case, if there are hybrid NFA-DFA solutions out there (or in

progress), we would like to know what they are and how they work (and

what problems they solve).

Thanks for any input,

-Chris

*****************************************************************************

Chris Clark Internet : com...@world.std.com

Compiler Resources, Inc. Web Site : http://world.std.com/~compres

23 Bailey Rd voice : (508) 435-5016

Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)

Oct 2, 2007, 12:12:18 PM10/2/07

to

There are (at least) three different issues you've brought up --

the use of the term Hybrid NFA/DFA, whether Tcl does what

Michela does, and on-the-fly DFA constructions.

the use of the term Hybrid NFA/DFA, whether Tcl does what

Michela does, and on-the-fly DFA constructions.

* Hybrid NFA/DFA

The first problem is your use of the term "hybrid NFA/DFA", which

unfortunately has been defined in Friedl's _Mastering Regular

Expressions_ to mean something different from what you mean.

Friedl doesn't ever define what a DFA or NFA is. He only says

that NFAs provide submatch information and backreferences, and

DFAs don't but run faster. The closest he comes to admitting that

these aren't really definitions is in the sidebar on p. 180:

As a programmer, if you have a true (mathematically speaking)

NFA regex engine, it is a relatively small task to add support

for backreferences. A DFA's engine's design precludes the

adding of this support, but an NFA's common implementation

makes it trivial. In doing so, you create a more powerful

tool, but you also make it decidedly nonregular (mathematically

speaking). What does this mean? At most, that you should

probably stop calling it an NFA, and start using the phrase

``nonregular expressions,'' since that describes (mathematically

speaking) the new situation. No one has actually done this,

so the name ``NFA'' has lingered, even though the implementation

is no longer (mathematically speaking) an NFA.

What does all this mean to you, as a user? Absolutely

nothing. As a user, you don't care if it's regular, nonregular,

unregular, irregular, or incontinent. So long as you know

what you can expect from it (something this chapter will

show you), you know all you need to care about.

http://regex.info/blog/2006-09-15/248

This "quacks like a duck" approach to calling things NFAs or

DFAs cannot accomodate implementations that run a DFA pass first

and then an NFA pass only to find submatches, or that manage to

use nothing but DFAs to find submatches. So Friedl calls them

"hybrid NFA/DFAs", because they kind of quack like both.

Table 4-1 (p. 145), which classifies programs by "engine type",

lists GNU awk, GNU grep/egrep, and Tcl as "Hybrid NFA/DFA".

The sidebar on p. 182 expands on this classification:

I've said several times that a DFA can't provide capturing

parentheses or backreferences. This is quite true, but it

certainly doesn't preclude hybrid approaches that mix

technologies in an attempt to reach regex nirvana. The

sidebar on page 180 told how NFAs have diverged from the

theoretical straight and narrow in search of more power, and

it's only natural that the same happens with DFAs. A DFA's

construction makes it more difficult, but that doesn't mean

impossible.

GNU grep takes a simple but effective approach. It uses a

DFA when possible, reverting to an NFA when backreferences

are used. GNU awk does something similar--it uses GNU grep's

fast shortest-leftmost DFA engine for simple "does it match"

checks, and reverts to a different engine for checks where

the actual extent of the match must be known. Since that

other engine is an NFA, GNU awk can conveniently offer

capturing parentheses, and it does via its special gensub

function.

Tcl's regex engine is a true hybrid, custom built by Henry

Spencer (whom you may remember having played an important

part in the early development and popularization of regular

expression [p. 88]). The Tcl engine someitmes appears to

be an NFA--it has lookaround, capturing parentheses,

back-references, and lazy quantifiers. Yet, it has true

POSIX longest-leftmost match ([p. 177]), and doesn't suffer

from some of the NFA problems that we'll see in Chapter 6.

It really seems quite wonderful.

Thus the use of "hybrid NFA-DFA" to describe Michela's work

unfortunately makes it sound, to those who know these terms only

through Friedl's book, like it's no different than GNU grep or Tcl.

Your one-sentence description makes it sound, to me, like it

*is* different from both the awk/grep hybrid and from Tcl; if

it's not too late, you might try to change the name to something

other than "hybrid".

* What does Tcl's regex do?

Let's assume a regexp with capturing parentheses but not

backreferences. (That is, /the (cat|dog) ran/ is okay, but

/(cat|dog) eat \1/ is not.) Let's also assume POSIX-style longest

match semantics are the goal.

Tcl's regex first runs an unanchored DFA search to find p, the

ending position of the shortest match (meaning the match with

leftmost end). If there is no match, stop, returning failure.

If the caller doesn't care about any match position info, stop,

returning success. Otherwise, it's time to figure out the exact

position of the leftmost longest match.

The leftmost longest match must start at or before p. Tcl's

regex loops over every character position from the beginning of

the string to p, trying an anchored DFA search to find the length

of the longest match starting at the current position. When it

finds a match, it stops the loop: that's the leftmost longest

match.

If the caller only cares about the position of the overall match

($0) but not any of the capturing parentheses ($1, $2, ...),

then stop, returning success. Otherwise, it's time to figure

out the positions of the capturing parentheses.

The algorithm up to this point is:

match(re, text, match, nmatch)

{

etext = text + strlen(text);

p = shortest(/.*re/, text, etext);

if(p == NULL)

return false;

if(nmatch == 0)

return true;

for(q = text; q <= p; q++){

eq = longest(/re/, q, etext);

if(eq != NULL){

match[0].start = q;

match[0].end = eq;

break;

}

}

if(nmatch == 1)

return true;

dissect(re, match[0].start, match[0].end, match);

}

The function dissect determines the position of every subexpression

in re, saving the ones corresponding to capturing parentheses.

It does this by recursing over the structure of the regexp and

searching using DFAs corresponding to the subpieces.

To dissect a concatenation, Tcl figures out the longest possible

match for the first half and checks whether the remainder matches

the second half. If so, done. If not, find the next longest

possible match for the first half, and so on:

dissect(re, text, etext, match)

{

if(re is concat: left right){

for(middle = etext; middle >= text; middle = p-1){

p = longest(/left/, text, middle);

if(p == NULL)

break;

if(longest(/right/, p, etext) == etext){

dissect(left, text, p, match);

dissect(right, p, etext, match);

}

}

panic("cannot dissect concat");

}

To dissect an alternation, Tcl tries each choice looking for the

one that uses all the text:

if(re is alt: left|right){

if(longest(/left/, text, etext) == etext)

dissect(left, text, etext, match);

else

dissect(right, text, etext, match);

}

If dissect comes across a capturing parenthesis, then it can

record the match.

if(re is capture n: (subexpr)){

match[n].start = text;

match[n].end = etext;

dissect(subexpr, text, etext, match);

}

}

(The regexp is actually a directed graph that might have cycles,

not a tree; regexps like a* are rewritten into (aa*|).)

Note that there is no backtracking here: each step determines

the subdissection and then moves on. There is no "and if that

doesn't work, ...".

An obvious optimization would be to make dissect do nothing if

the re being considered contained no capturing parens. The Tcl

code I'm looking at appears not to do that.

The run-time of this algorithm is a little tricky to analyze.

On a regexp of length m on a text of length n, and ignoring the

time to construct DFAs/NFAs from regexps...

If no submatch info is required it takes O(n) time.

If only match[0] is required this it O(n^2) time.

The top-level concatenation dissection requires O(n^2) time. In

the worst case you cut off just one character, making the next

levels O(1^2) and O((n-1)^2). Repeating that gives O(n^3) time.

Alternations might force you to do this multiple times, but no

more than m, which would be O(m * n^3) time.

I'm not sure how things like repetition or empty strings affect

the run-time. The fact that a* is rewritten into an effectively

infinite-length regular expression can probably bump the m up

to n, so that you could force O(n^4) behavior if you really tried.

Anyway, at least the run time is polynomial!

* On-the-fly DFA constructions

Thompson's 1968 CACM paper is hard to classify, since it doesn't

mention the terms NFA or DFA. If you're willing to read between

the lines then the compiled code is really an encoding of an NFA

that runs a subset simulation as it reads the input. It's kind

of building the DFA on the fly, if you view each successive CLIST

as a DFA state, but it's not caching any of the DFA states that

it builds, so it's more like an NFA subset simulation. Since

it doesn't cache at all, it's slower, but it can't possibly build

an exponentially-large machine.

Spencer's Tcl regex builds the DFA on the fly from the NFA

representation, using a cache to speed the DFA simulation. The

cache has a fixed maximum size, though, so it can't end up

building an exponentially-large machine either.

Russ

Oct 2, 2007, 3:02:46 PM10/2/07

to

Thank you for your very informative reply.

I don't like the term Hybrid NFA/DFA either. However, it is her and

her professor's choice of terms and they have several papers that use

the term, so I think they will probably continue using it. For what

they are doing, I would use a different term, "mostly deterministic

finite-state automata", because their machine is mostly like a DFA,

but with a few NFA portions (and I mean true NFA not nonregular/regexp

extensions).

For what Friedl calls "nonregular expressions", I am of the "regexp"

naming camp. I use regexp when I mean something that looks like a

regular expression but doesn't describe a regular language. I reserve

regular expression when I mean something that describes a regular

language. I hadn't thought about the issue of what to call an

automaton that can process a regexp. It isn't a true NFA, since it

has the auxillary data structure that handles the captures and

backreferences.

They do mean something quite different from Friedl means. They are

squarely within the standard theoretical framework (e.g. no capturing

subexpressions or backreferences). Their goal is to combat the state

explosion that normally occurs in NFA->DFA translation. However, they

only handle things which "could" be translated into a DFA (if it

weren't for the state explosion problem). For the problems they are

attacking the regexp extension problem isn't an issue because the

prior approaches were all restricted to true regular expressions (or

even simpler set of strings problems).

They probably use the hybrid term because the machine executes a DFA

until certain conditions are met and then when they are met, the

machine can fork and send an NFA thread off to investigate the pattern

which if compiled to a DFA would have caused a state explosion, but

continue running the DFA for the rest of the "main" pattern. Thus,

her machine is a hybrid of a DFA and an NFA. In fact, when you look

at the results of her compilation process, there are sections of her

machine that are DFAs and sections that are NFAs, with "border" states

in between them. The border states are where the forks occur.

Thank you for the detailed description of what the regexp package in

Tcl does also. It is very informative. In particular, the

description of how boundaries are set by using an NFA was relevant.

One of the problems Michela and I worked on was finding the boundaries

of where a match was found. (While the problem set of interest does

not admit subpattern capture, the start and end of the actual match

are relevant for the downstream uses.) That turns out to be a more

difficult problem for DFAs than NFAs, because any final state in a DFA

may represent a set of different patterns with different start

positions. She worked out several interesting solutions to that

problem--several of which I'm certain she will publish, so I won't

give spoilers for.

Thanks again,

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu