Fast vs Slow regexps

Daniel Ajoy

unread,

Jul 21, 2016, 12:26:08 AM7/21/16

to

I was reading through this:

https://swtch.com/%7Ersc/regexp/regexp1.html

and it mentions that:

======================

Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.

======================

I know that with GENSUB you can use backreferences.

Is it the case that GENSUB uses the "slower" regexps, and SUB and GSUB use "fast" regexps?

Kaz Kylheku

unread,

Jul 21, 2016, 1:16:24 AM7/21/16

to

Read that claim again; it says that the finite-automata approach beats
the backtracking implementations. Except for backreferences.

It doesn't say that a slow, backtracking regex has to be used if
backreferences are to be supported.

Section 4 of this paper <http://www.arl.wustl.edu/~pcrowley/a25-becchi.pdf>
discusses ways of retrofitting backreferencing support into NFA-based
regex processing. Another possibility is that the regex compiler can
choose the pure automaton strategy when the regex doesn't actually
use backreferencing, so you don't have to "pay for what you don't use".

But anyway, none of that matters, because I suspect you may be confusing
register references in the edit replacement sequence with
regex backreferences!

Backreference support means that \1, \2, ... may occur *in the regex
itself*. For instance (.*)X\1 means "match some stuff until an X,
followed by that same stuff".

When the replacement sequence is being processed, the regex matching is
done. Then the register references \1, \2, ... just refer to the matched
material; they are not involved in the pattern matching itself.

--
TXR Programming Lanuage: http://nongnu.org/txr
Music DIY Mailing List: http://www.kylheku.com/diy
ADA MP-1 Mailing List: http://www.kylheku.com/mp1

Janis Papanagnou

unread,

Jul 21, 2016, 6:18:44 AM7/21/16

to

As an supplement/clarification to Kaz's reply just two points; backreferences
are *not* part of the _formal language_ of Regular Expressions, thus they can
not be implemented by plain Deterministic Finite Automaton. And gawk's gensub
does *not* support backrefs. That's the reason for fast regexp processing and
not so fast processings of (non-regexp) "regexp-extensions" like backrefs.

Janis

Andrew Schorr

unread,

Jul 21, 2016, 9:02:38 AM7/21/16

to

On Thursday, July 21, 2016 at 12:26:08 AM UTC-4, Daniel Ajoy wrote:
> Is it the case that GENSUB uses the "slower" regexps, and SUB and GSUB use "fast" regexps?

No. They all use the same code, and it's fast. :-)

Aharon Robbins

unread,

Jul 21, 2016, 11:43:18 AM7/21/16

to

In article <b74def93-1549-4ec2...@googlegroups.com>,

Daniel Ajoy <da....@gmail.com> wrote:
>I know that with GENSUB you can use backreferences.

As Kaz mentioned, you've confused back references within the regexp
(which gawk does not allow) with references to subparts of the matched text.

>Is it the case that GENSUB uses the "slower" regexps, and SUB and GSUB
>use "fast" regexps?

The answer is involved. Gawk actually contains within it two regular
expression matching engines that (generally) produce the same results.

1. The 'dfa' engine from GNU grep, which is fast. It is used for any
"does it match?" kind of test, such as ~, !~, plain patterns and so on.

2. The 'regex' engine from GLIBC, which is used when it's necessary to
find out "where does it match?". This includes sub, gsub, and gensub,
as well as other places within gawk. This engine is generally slower.

However, since gawk does not allow backreferences in regexps, it is still
generally faster than perl etc.

Hope this helps,

Arnold
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com

Daniel Ajoy

unread,

Jul 21, 2016, 12:36:08 PM7/21/16

to

Thanks for all the responses, and the clarifications.

Daniel