Dear RE2 developers,
I am considering to adapt RE2 as the standard regex engine for
SciTECO [1]. PCRE's missing guarantees on runtime and stack use
have long been nagging in the back of my mind. IMHO for an interactive
text editor it shouldn't be possible to crash the program if the
user merely enters some pathological regex (or TECO pattern in my case).
With PCRE it all depends on the concrete configuration, but I did
manage to crash it (causing a stack overflow) with some very simple
expression even.
Recently, I became aware of another problem: backwards searches.
I am currently implementing them very naively: By producing all
matches of the pattern over the *entire* document up until the cursor
and reporting the last one (or n-th last one).
While this ensures consistent semantics with forward searches,
it can easily be too slow on large documents (e.g. a 100mb log file).
I am considering several algorithms to improve on this,
but so far I am not really satisfied:
1) Doing an anchored forward search one character at a time
and progressing backwards until a match is produced.
But not only does this thrash certain optimizations in
comparison to forward streaming searches, it can easily
result in unexpected results. Consider backwards-searching
`[a-z]+`. With such naive algorithm you will match the last
letter (length 1). It cannot guarantee finding the (leftmost)
longest match or any longest match.
Even if I backtracked to extend the match - it's easy
to construct patterns and subject strings where this would fail.
The resulting semantics would be hard to describe and document.
2) Forward searching in blocks/windows of some size before
the search cursor. We examing the previous block when
failing to produce a match.
This would require special handling for lots of corner cases
and still cannot guarantee finding the longest matches.
3) Analyzing the pattern. If it starts with a plain letter,
we find candidate starts and forward-match from these candidates.
If I am not mistaken gnulib's and Emacs' engine has "fastmaps" that work
similarily and are filled outomatically by regexp compilation.
Would help a lot - but for patterns, where this is not possible,
you must still have a fallback.
Also, producing the longest match would still be problematic.
Line-wise processing of the document (disallow matching across
EOLs) unfortunately isn't possible in my case - this would simplify
a lot of course. At least in practice with reasonably-sized lines.
I believe that with an engine written for forward-matching it
will be impossible to guarantee longest matches without scanning
the entire subject string.
What about RE2 - are there any plans to implement native backwards
matching? (Yes I know, this also wouldn't be entirely symmetrical
to forward matches, but that would seem to be an acceptable
and documentable compromise.)
Are there any internal hints that could be used to find match
candidates when searching backwards?
Best regards,
Robin Haberkorn
[1]:
https://sciteco.fmsbw.de