Dear Russ & All,
I'm a postgrad student studying theoretical computer science. I was
going through Cox's article on vm approach to regular expression
matching [1] and noted this particular issue of infinite loops in
ordinary back-tracking algorithms (as in the case of "(a*)*b"). Russ
mentions a couple of approaches to work-around this problem at the end
of the article:
<quote>
Backtracking loops. At the beginning of the article, we noted that a
naive backtracking algorithm could go into an infinite loop searching
for (a*)*, because of the repeated empty match. A simple way to avoid
these loops during backtracking is to introduce a progress
instruction, which says that each time the machine executes that
instruction it must have moved forward since the last time it was
there.
Alternate program structures for backtracking. Another way to avoid
loops is to change the instruction set to introduce instructions for
the repetitions. The implementation of those instructions can call the
subpieces as subroutines, which avoids the infinite loop problem and
makes it more efficient to implement features like counted repetition
and assertions. Even so, these constructs make the non-recursive
implementation much harder (more state to keep around like the "old"
pointer) and preclude the automata techniques behind Pike's VM
implementation. It's also easy for the implementation to get out of
control though: look at Perl or PCRE or really any other “full-
featured” regexp implementation.
</quote>
I naively went through RE2 code base (with my noobish C++ skills)
thinking I might be able to find out one of these tactics implemented,
only to realize RE2 uses DFA and NFA simulation for regular expression
matching (as it should be).
Now, I'm still looking for a real-world back-tracking implementation
that employs those techniques to avoid infinite loops. This is to
understand how these work-arounds really work. I'm going to look into
Java regular expressions implementation hoping I might find a similar
work-around. If you have any pointers, please kindly let me know.
Thanks a bunch.
- Asiri
[1]
http://swtch.com/~rsc/regexp/regexp2.html