Leftmost longest match with DFA search

Stefan Monnier

unread,

May 10, 2008, 4:36:57 AM5/10/08

to

Can someone point me to articles that discuss various ways to get the
leftmost longest match when implementing regexp search using a DFA?

The "obvious" solution of turning the problem "search for RE" into the
problem "match .*RE" (where I use "match" here to mean "anchored
search") only gives you the leftmost shortest match.

I've already seen http://compilers.iecc.com/comparch/article/07-10-026
where Tcl's approach to the problem is described, which is sadly O(N^2).

I'm also interested in similar approaches using non-backtracking NFAs
(I'm familiar for example with the algorithm used in Plan9's regexp
library).

Stefan

Daniel Villeneuve

unread,

May 11, 2008, 11:42:22 AM5/11/08

to

Stefan Monnier wrote:
> Can someone point me to articles that discuss various ways to get the
> leftmost longest match when implementing regexp search using a DFA?
>
> The "obvious" solution of turning the problem "search for RE" into the
> problem "match .*RE" (where I use "match" here to mean "anchored
> search") only gives you the leftmost shortest match.

[snip]
> Stefan

I've used the approach to compile a DFA for the reverse RE, say ER, and
first match .*ER on the reverse text to find the leftmost anchor point.
Then match RE from that point to find the longest span.
--
Daniel Villeneuve
Kronos

Russ Cox

unread,

May 12, 2008, 8:09:04 AM5/12/08

to comp...@iecc.com

Stefan Monnier wrote:
> Can someone point me to articles that discuss various ways to get the
> leftmost longest match when implementing regexp search using a DFA?
>

> The "obvious" solution ...

Actually, the "obvious" solution is to implement unanchored search
using repeated anchored searches. Running an anchored DFA search and
remembering the last position that produces a matching state will give
you the longest match at that point in the string. You can create an
unanchored search by trying an anchored search at position 0, then
position 1, and so on. If you stop the DFA when it gets to a dead
state, then if you are lucky most of the searches will process very
little of the string, and the whole thing will run in something like
linear time in many cases. In the worst case, it is quadratic in the
length of the text just to find one match.

This method--implementing unanchored search via repeated anchored
search--is what Bell Labs awk, GNU libc (and gawk), and PCRE all do when
using DFAs. (In fact the backtracking implementations all do this too,
and the DFAs picked up the bad habit from them.)

> ... of turning the problem "search for RE" into the

> problem "match .*RE" (where I use "match" here to mean "anchored
> search") only gives you the leftmost shortest match.

Turning RE into .*RE gives you an unanchored search, but the particular
match it finds depends on other implementation details, specifically
whether you stop and return the position of the first matching state or
continue on and return the last matching state (as above).

If you are searching for RE and you stop at the first matching state,
you do get the shortest match, but what you've really found is the
earliest rightmost endpoint among all (anchored) matches. Since this
is an anchored search, that's the shortest match.

If you are searching for .*RE and you stop at the first matching state,
you get the earliest rightmost endpoint among all (unanchored) matches.
You don't know where the (RE part of the) match begins, and it is not
guaranteed to be a leftmost match. For example, if you are searching
for a*bb|b in abb, you will stop at the position between the two b's,
marking the right endpoint of the "b" match, but the leftmost match is
the entire string "abb".

What you actually find this way is the right endpoint of the leftmost
shortest non-overlapping match. To find the left endpoint, you can,
adapting Daniel Villeneuve's suggestion, match the reverse expression
ER starting at that right endpoint and stop at the earliest match.
You can repeat this operation to find a set containing all the shortest
non-overlapping matches.

Tradtionally, when you ask for all matches for a RE in a text, you get
a set of leftmost-longest non-overlapping matches. Clarke and Cormack
make a case for using the shortest non-overlapping matches instead in
their paper ``On the use of Regular Expressions for Searching Text''
(TOPLAS 1995). http://www.cs.uwaterloo.ca/research/tr/1995/07/regexp.pdf

Russ

Stefan Monnier

unread,

May 13, 2008, 4:51:39 AM5/13/08

to

Interesting. But doesn't it basically force you to scan the complete text?
That can be impractical.

Stefan

Stefan Monnier

unread,

May 13, 2008, 5:02:56 AM5/13/08

to

>> Can someone point me to articles that discuss various ways to get the
>> leftmost longest match when implementing regexp search using a DFA?
>> The "obvious" solution ...

> Actually, the "obvious" solution is to implement unanchored search
> using repeated anchored searches. Running an anchored DFA search and

Yes, that's the method used in Tcl, as mentioned in your article that
I referred to. But it's O(N^2), which is disappointing. The methods
used in TRE or Plan9's regexp lib are O(N), tho using an NFA.

> Tradtionally, when you ask for all matches for a RE in a text, you get
> a set of leftmost-longest non-overlapping matches.

Right. I'm not looking to find all the matches, just the first one.
I'd even be content with the first leftmost match (not necessarily
longest).

In any case, thank you both for proposing to match backwards. It seems
to open up new ways to attack the problem.

In any case, I do know some good-enough solutions, but I'm really
looking for literature about the problem and discussions comparing the
various solutions.

Stefan

Danny Dubé

unread,

May 13, 2008, 1:58:56 PM5/13/08

to

Stefan Monnier <mon...@iro.umontreal.ca> writes:

I'm not sure whether there is confusion between "leftmost longest
match" and "longest leftmost match". For me, "leftmost longest match"
refers to the leftmost of the longest matches while "longest leftmost
match" refers to the longest of the leftmost matches, provided we
define what it means to be the "leftmost".

The "leftmost longest match" is defined this way:

1. Let { (p1, q1), ..., (pK, qK) } be the matches for RE in text W,
where pi is the beginning position (inclusive) and qi is the
ending position (exclusive) of the ith match. This set might be
empty.

2. If the set is non-empty, keep only the longest matches,
i.e. those with maximal qi - pi.

3. Finally, keep the leftmost one, i.e. the one with minimal pi.

The "longest leftmost match" means one of two things depending on the
point that you use to indicate the "position" of the match. The
position can be given by the beginning of the match or by its end.
The "longest begin-leftmost match" is defined this way:

1. Let { (p1, q1), ..., (pK, qK) } be the matches for RE in text W.

2. If the set is non-empty, keep only the begin-leftmost matches,
i.e. those with minimal pi.

3. Finally, keep the longest one.

and the "longest end-leftmost match" is defined this way:

1. Let { (p1, q1), ..., (pK, qK) } be the matches for RE in text W.

2. If the set is non-empty, keep only the end-leftmost matches,
i.e. those with minimal qi.

3. Finally, keep the longest one.

Can you be more precise about what you want? I think it's possible to
get a linear-time algorithm for any of these searches (i.e. in O(|W|)
time for a fixed RE).

Danny

Russ Cox

unread,

May 14, 2008, 4:30:00 PM5/14/08

to

> I'm not sure whether there is confusion between "leftmost longest
> match" and "longest leftmost match". For me, "leftmost longest match"
> refers to the leftmost of the longest matches while "longest leftmost
> match" refers to the longest of the leftmost matches

In the context of regular expression search, "leftmost longest"
means the longest of the leftmost matches, not the leftmost
of the longest matches. As you point out, it doesn't stand up
to syntactic scrutiny, but it's the standard term.

Rarely, one sees it written with a comma--leftmost, longest--to
encourage interpreting "longest" as the tiebreaker.

Russ

Danny Dubé

unread,

May 15, 2008, 11:58:45 AM5/15/08

to

Stefan Monnier <mon...@iro.umontreal.ca> writes:

[snip]

> >> The "obvious" solution of turning the problem "search for RE" into the
> >> problem "match .*RE" (where I use "match" here to mean "anchored
> >> search") only gives you the leftmost shortest match.
> > [snip]
> >> Stefan
>
> > I've used the approach to compile a DFA for the reverse RE, say ER, and
> > first match .*ER on the reverse text to find the leftmost anchor point.
> > Then match RE from that point to find the longest span.
>
> Interesting. But doesn't it basically force you to scan the complete text?
> That can be impractical.

You don't (always) need to scan the complete text. You can perform
that reversed search on increasingly longer prefixes of the text. If
you increase the lengths of the prefixes by a constant factor, you
work for a time in O(K) to find a leftmost match when the latter
completely lies in the first K characters.

Danny