Regarding infinite loops in backtracking regular expression matchers

265 views
Skip to first unread message

Asiri Rathnayake

unread,
Mar 2, 2011, 6:05:10 AM3/2/11
to re2-dev
Dear Russ & All,

I'm a postgrad student studying theoretical computer science. I was
going through Cox's article on vm approach to regular expression
matching [1] and noted this particular issue of infinite loops in
ordinary back-tracking algorithms (as in the case of "(a*)*b"). Russ
mentions a couple of approaches to work-around this problem at the end
of the article:

<quote>
Backtracking loops. At the beginning of the article, we noted that a
naive backtracking algorithm could go into an infinite loop searching
for (a*)*, because of the repeated empty match. A simple way to avoid
these loops during backtracking is to introduce a progress
instruction, which says that each time the machine executes that
instruction it must have moved forward since the last time it was
there.

Alternate program structures for backtracking. Another way to avoid
loops is to change the instruction set to introduce instructions for
the repetitions. The implementation of those instructions can call the
subpieces as subroutines, which avoids the infinite loop problem and
makes it more efficient to implement features like counted repetition
and assertions. Even so, these constructs make the non-recursive
implementation much harder (more state to keep around like the "old"
pointer) and preclude the automata techniques behind Pike's VM
implementation. It's also easy for the implementation to get out of
control though: look at Perl or PCRE or really any other “full-
featured” regexp implementation.
</quote>

I naively went through RE2 code base (with my noobish C++ skills)
thinking I might be able to find out one of these tactics implemented,
only to realize RE2 uses DFA and NFA simulation for regular expression
matching (as it should be).

Now, I'm still looking for a real-world back-tracking implementation
that employs those techniques to avoid infinite loops. This is to
understand how these work-arounds really work. I'm going to look into
Java regular expressions implementation hoping I might find a similar
work-around. If you have any pointers, please kindly let me know.

Thanks a bunch.

- Asiri


[1] http://swtch.com/~rsc/regexp/regexp2.html

Asiri Rathnayake

unread,
Mar 2, 2011, 7:11:22 AM3/2/11
to re2-dev
Hi Again,

Minor correction: I thought the NFA algorithm used in RE2 performs
subset construction on the fly but it's actually thompson's lock-step
algorithm that is implemented, sorry about this, I wrote the above
email without going through all the code.

- Asiri

Russ Cox

unread,
Mar 2, 2011, 9:51:31 AM3/2/11
to Asiri Rathnayake, re2-dev

> Now, I'm still looking for a real-world back-tracking implementation
> that employs those techniques to avoid infinite loops. This is to
> understand how these work-arounds really work. I'm going to look into
> Java regular expressions implementation hoping I might find a similar
> work-around. If you have any pointers, please kindly let me know.

Try PCRE.

Russ

Asiri Rathnayake

unread,
Mar 8, 2011, 5:39:09 AM3/8/11
to Russ Cox, re2-dev
Hi Russ,


I checked PCRE source and it seems they are specifically checking for execution paths the could match the empty string and depending on that the code for repetition is altered. I think this is what you have described in having special instructions for implementing repetition. So, just to summarize I found a couple of ways to work around the infinite loops:

- Clean up at syntax level (Thompson 1968)
- Identify execution paths that could match the empty string and act accordingly (Thompson 1968)
- Introduce some sort of a progress instruction (Russ)(we analysed this and made sure it works)
- Lock-step with elimination of redundant threads (Thompson 1968)

I hope I have understood correctly :)

Thank for your quick reply.

- Asiri

Russ


Asiri Rathnayake

unread,
Mar 8, 2011, 5:41:30 AM3/8/11
to Russ Cox, re2-dev

Thank you for the quick reply* :)
 

- Asiri

Russ



Russ Cox

unread,
Mar 8, 2011, 10:33:31 AM3/8/11
to Asiri Rathnayake, re2-dev
> I checked PCRE source and it seems they are specifically checking for
> execution paths the could match the empty string and depending on that the
> code for repetition is altered. I think this is what you have described in
> having special instructions for implementing repetition. So, just to
> summarize I found a couple of ways to work around the infinite loops:
>
> - Clean up at syntax level (Thompson 1968)
> - Identify execution paths that could match the empty string and act
> accordingly (Thompson 1968)
> - Introduce some sort of a progress instruction (Russ)(we analysed this and
> made sure it works)
> - Lock-step with elimination of redundant threads (Thompson 1968)
>
> I hope I have understood correctly :)

Sounds good to me.

Russ

Reply all
Reply to author
Forward
0 new messages