regexp package needs tuning?

Gary Capell

Nov 15, 2009, 9:44:43 PM
to golang-nuts
I wanted to try Go out on the widefinder benchmark, but ran into
what looks like a performance issue with the regexp library.

I took a step back and wrote a little grep-alike, the guts of which is
this:

func grep(fp io.Reader, name string, matcher *regexp.Regexp) {
	r := bufio.NewReader(fp);
	lineNum := 0;
	for {
		line, err := r.ReadString('\n');
		lineNum++;
		if err != nil {
			if err != os.EOF {
				log.Stderr(err);
			}
			break;
		}
		if matcher.MatchString(line) {
			fmt.Printf("%s:%d %s", name, lineNum, line);
		}
	}
}

This seems to run a _lot_ slower than /bin/grep, or a roughly
equivalent program in Python.

I'm searching for a simple string in a 100k-line text file. /bin/grep
takes .03s, a rough Python script takes .2s, and my Go program takes
12.3s. If I strip out the MatchString (i.e. just read all the lines
in the file), my Go program comes down to .7s. I'm only using 8g (not
gccgo), but still, am I doing something crazy, or does the regexp
library need some attention, or are my expectations too high?
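
One way to separate the I/O cost from the matching cost is to read all the lines up front and time only the MatchString loop. A minimal sketch of that measurement, written in present-day Go syntax rather than the 2009 syntax above; the file name and pattern are placeholders:

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"time"
)

func main() {
	f, err := os.Open("testdata.txt") // placeholder input file
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	// Read every line first so the timing below covers only matching.
	var lines []string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}

	re := regexp.MustCompile("somepattern") // placeholder pattern
	start := time.Now()
	matches := 0
	for _, line := range lines {
		if re.MatchString(line) {
			matches++
		}
	}
	fmt.Printf("%d matches, matching took %v\n", matches, time.Since(start))
}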

Jeff Hill

Nov 15, 2009, 11:30:12 PM
to golang-nuts
Hi, Gary.

I've done a little testing on my system (MacBook Pro, 2.53 GHz Core 2
Duo). 0.026s to read 189K using io.ReadFile, 0.029s using
bufio.ReadString('\n'), and 0.176s when doing a simple literal search
using the regexp library on each line. My test data is the Project
Gutenberg text of Hamlet.

I'm curious what regular expression you're passing in to get those
numbers. It could be that the content of the expression itself is
what triggers the performance problems. Another theory: I'm using 6g,
so perhaps there's a big performance difference between 6g and 8g?
Apparently the 6-series is the most polished of the three platforms.

Here's my test code, with the timing info I got back from running
time ./fstest for each test on my system: http://gist.github.com/235718

/Jeff

smosher

Nov 16, 2009, 12:00:36 AM
to golang-nuts
If you look at the benchmarks on http://shootout.alioth.debian.org/,
it's obvious that regex support in Go is rather immature.

Rob 'Commander' Pike

Nov 16, 2009, 1:13:41 AM
to Gary Capell, golang-nuts
The regular expression code is just an early library I wrote to have an interesting problem to play with. It's very simple code (although from a complexity point of view it's good; see http://swtch.com/~rsc/regexp/regexp1.html; it's just that the constant is big) and has absolutely nothing to make it fast. It doesn't even recognize the simple case of a regexp starting with a literal string. Plus it generates garbage like crazy.
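
For context, the literal-prefix optimization Rob says the library lacks looks roughly like this: scan for the literal with strings.Index, and only run the regexp engine at the candidate positions. A toy sketch in present-day Go; the prefix is supplied by hand and the pattern is made up, whereas a real engine would extract the prefix from the compiled program:

package main

import (
	"fmt"
	"regexp"
	"strings"
)

// findWithPrefix looks for the pattern's literal prefix with strings.Index,
// which is much cheaper than running the full matcher at every position,
// and only invokes the regexp engine where the prefix actually occurs.
// The regexp is compiled with a leading ^ so each attempt is anchored at
// the candidate position.
func findWithPrefix(anchored *regexp.Regexp, prefix, s string) int {
	for i := 0; ; i++ {
		j := strings.Index(s[i:], prefix)
		if j < 0 {
			return -1 // prefix not present, so no match is possible
		}
		i += j
		if anchored.MatchString(s[i:]) {
			return i
		}
	}
}

func main() {
	re := regexp.MustCompile(`^Jeff[0-9]+`) // made-up pattern with literal prefix "Jeff"
	fmt.Println(findWithPrefix(re, "Jeff", "xxJeffxxJeff42yy")) // prints 8
}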

Those other programs are using PCRE, which is usually incredibly fast but can go exponential sometimes (see link above). The Go regexp library is slow, but never slower. We have plans to do something about this, but until then don't use it as an indication of Go's native performance. (For that matter, since pretty much everything except Go is using PCRE, it's not really a fair test of anything.)

Go's I/O libraries need work too. All the libraries need work.

-rob

Gary Capell

Nov 16, 2009, 1:32:43 AM
to golang-nuts


On Nov 16, 5:13 pm, "Rob 'Commander' Pike" <r...@google.com> wrote:
> Go's I/O libraries need work too.  All the libraries need work.

I thought it might be something like that, but wanted to check that I
wasn't doing something crazy.

In answer to Jeff's question, the r.e. I was searching on was a simple
literal, e.g. 'Jeff'.
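
For a pattern that really is just a literal, the regexp engine can be sidestepped entirely. A minimal sketch in present-day Go (the file name below is a placeholder):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// grepLiteral mirrors the grep loop above, but uses a plain substring
// search (strings.Contains) instead of the regexp engine.
func grepLiteral(f *os.File, name, literal string) {
	sc := bufio.NewScanner(f)
	lineNum := 0
	for sc.Scan() {
		lineNum++
		if strings.Contains(sc.Text(), literal) {
			fmt.Printf("%s:%d %s\n", name, lineNum, sc.Text())
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}

func main() {
	f, err := os.Open("testdata.txt") // placeholder input file
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()
	grepLiteral(f, "testdata.txt", "Jeff")
}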

I imagine someone will port PCRE at some point (and that the standard
regexp library will get faster).

Ben Tilly

Nov 16, 2009, 3:58:21 AM
to Rob 'Commander' Pike, Gary Capell, golang-nuts
On Sun, Nov 15, 2009 at 10:13 PM, Rob 'Commander' Pike <r...@google.com> wrote:
> The regular expression code is just an early library I wrote to have
> an interesting problem to play with.  It's very simple code (although
> from a complexity point of view it's good; see
> http://swtch.com/~rsc/regexp/regexp1.html; it's just that constant is
> big)  and has absolutely nothing to make it fast.  It doesn't even
> recognize the simple case of a regexp starting with a literal string.
> Plus it generates garbage like crazy.

I read that article before and liked it a lot. Doubly so since I had
figured it out on my own (though I didn't implement it), and then
thought I had figured out something original when Jeffrey Friedl said
it hadn't been done and that he thought it wouldn't work. Which
confirms the comment about him in the article...

It does, however, gloss over one important point. In pathological
cases the NFA simulation can wind up needing to handle states that are
an arbitrary set of states in the original NFA in an arbitrary order.
Therefore if you cache the NFA to build a DFA, the resulting DFA can
be super-exponential in the size of the original regular expression.

However in the real world this pathological case happens far less
often than the pathological performance case, and when it does happen
it tends to be less of an issue (because the strings we match tend to
be longer than the regular expressions we match against).

And a random note backing up the paper's conclusion. Before Perl
implemented the memoization strategy to resolve the most common cases
of the performance problem, there were periodic bug reports where
someone's program broke because of it. A typical example was the
person who validated that an input file looked something like
/^($foo\s*)*$/ (with $foo replaced by a particular data pattern with a
timestamp).
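
A minimal sketch of that failure mode, with a made-up numeric pattern standing in for $foo: on the near-miss input below, a classic backtracking engine without memoization or backtracking limits can take exponential time, while an NFA-based matcher like Go's answers in time linear in the input.

package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

func main() {
	// Nested, ambiguous quantifiers: each "12345" token can be split
	// between the inner + and the outer * in many ways, which is what
	// makes a backtracking matcher explode when the overall match fails.
	re := regexp.MustCompile(`^([0-9]+\s*)*$`)

	// 10,000 valid tokens followed by one character that forces failure.
	input := strings.Repeat("12345 ", 10000) + "x"

	start := time.Now()
	fmt.Println(re.MatchString(input), time.Since(start)) // false, quickly
}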

> Those other programs are using PCRE, which is usually incredibly fast
> but can go exponential sometimes (see link above).  The Go regexp
> library is slow, but never slower.  We have plans to do something
> about this, but until then don't use it as an indication of Go's
> native performance.  (For that matter, since pretty much everything
> except Go is using PCRE, it's not really a fair test of anything.)

Many things use PCRE, but emacs, vim, Perl and Java are pretty
important exceptions to that rule.

Cheers,
Ben

Russ Cox

Nov 16, 2009, 1:09:34 PM
to Ben Tilly, golang-nuts
> It does, however, gloss over one important point.  In pathological
> cases the NFA simulation can wind up needing to handle states that are
> an arbitrary set of states in the original NFA in an arbitrary order.
> Therefore if you cache the NFA to build a DFA, the resulting DFA can
> be super-exponential in the size of the original regular expression.

Yes but you never need to build the entire DFA.
It's true that some implementations do, but
that doesn't imply they have to.

Russ

Ben Tilly

Nov 16, 2009, 2:19:26 PM
to r...@golang.org, golang-nuts
Doesn't avoiding building the entire DFA change the run time performance?

Let n be the length of the string and m the length of the regular
expression. My understanding is that compiling the whole DFA is
usually O(m), but can be O(m!) in pathological cases. The run-time
performance is O(n). Memory usage is proportional to compilation
time.

By contrast not building the DFA up front and working it out at run
time in a straightforward fashion means an O(m) compilation step, and an
O(n) (with a worse constant) average run-time performance, with a potential
worst-case run-time of O(n*m). Memory usage would be O(m).

Thinking about it, one could combine strategies by expanding out the
DFA lazily at run-time with a memoization step. This should be O(m)
compilation and O(n) run-time but memory usage would go up to
O(min(m!, n)).

Am I missing anything important?

Ben

Russ Cox

Nov 16, 2009, 4:47:51 PM
to Ben Tilly, golang-nuts
> Let n be the length of the string and m the length of the regular
> expression.  My understanding is that compiling the whole DFA is
> usually O(m), but can be O(m!) in pathological cases.  The run-time
> performance is O(n).  Memory usage is proportional to compilation
> time.

Ok.

> By contrast not building the DFA up front and working it out at run
> time in a straightforward fashion means a O(m) compilation step, and a
> O(n) (with worse constant) average run-time performance with potential
> worst case run-time of O(n*m).  Memory usage would be O(m).

Ok.

> Thinking about it, one could combine strategies by expanding out the
> DFA lazily at run-time with a memoization step.  This should be O(m)
> compilation and O(n) run-time but memory usage would go up to
> O(min(m!, n)).

You still have O(m*n) (but more commonly only O(n)) runtime.
The space is at least O(m) to hold the regexp but then only
whatever you choose as your cache size. The search only
needs one DFA state at a time (two during a transition) so the
lower bound on the cache size is very small. There's certainly
no need to remember every state during the run, which is
what the n in O(min(m!, n)) seems to imply.

Russ
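
A minimal sketch of the lazy construction being discussed, with the bounded cache Russ describes, over a tiny hand-built NFA for (a|b)*abb. This is only an illustration of the idea; it is not how the Go regexp package represents or runs programs, and the cache budget is an arbitrary made-up constant.

package main

import (
	"fmt"
	"sort"
	"strings"
)

// Hand-built, epsilon-free NFA for (a|b)*abb:
// nfaTrans[state][byte] is the set of next states.
var (
	nfaStart  = 0
	nfaAccept = 3
	nfaTrans  = map[int]map[byte][]int{
		0: {'a': {0, 1}, 'b': {0}},
		1: {'b': {2}},
		2: {'b': {3}},
		3: {},
	}
)

// key canonicalises a set of NFA states so it can index the cache.
func key(states []int) string {
	sort.Ints(states)
	parts := make([]string, len(states))
	for i, s := range states {
		parts[i] = fmt.Sprint(s)
	}
	return strings.Join(parts, ",")
}

// match simulates the DFA that subset construction would build, but
// computes each DFA transition lazily the first time it is needed and
// memoises it. Correctness never depends on the cache: if it fills up,
// it is simply thrown away, because each step only needs the current
// state set.
func match(input string) bool {
	const maxCacheEntries = 1 << 16 // arbitrary memory budget
	cache := map[string][]int{}     // (state set, byte) -> next state set
	cur := []int{nfaStart}
	for i := 0; i < len(input); i++ {
		k := fmt.Sprintf("%s/%c", key(cur), input[i])
		next, ok := cache[k]
		if !ok {
			if len(cache) >= maxCacheEntries {
				cache = map[string][]int{} // flush and rebuild lazily
			}
			// Unseen transition: compute the successor set from the NFA.
			set := map[int]bool{}
			for _, s := range cur {
				for _, t := range nfaTrans[s][input[i]] {
					set[t] = true
				}
			}
			for t := range set {
				next = append(next, t)
			}
			cache[k] = next
		}
		cur = next
	}
	for _, s := range cur {
		if s == nfaAccept {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(match("ababb")) // true
	fmt.Println(match("abab"))  // false
}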

Ben Tilly

Nov 16, 2009, 5:11:49 PM
to r...@golang.org, golang-nuts
On Mon, Nov 16, 2009 at 1:47 PM, Russ Cox <r...@golang.org> wrote:
[...]
>> Thinking about it, one could combine strategies by expanding out the
>> DFA lazily at run-time with a memoization step.  This should be O(m)
>> compilation and O(n) run-time but memory usage would go up to
>> O(min(m!, n)).
>
> You still have O(m*n) (but more commonly only O(n)) runtime.

Oops, you're obviously right. For a fixed regular expression and
arbitrarily long strings, runtime performance is O(n). But when your
regular expression and strings are both long, runtime performance is
O(n*m), because you don't necessarily benefit from memoizing, on the
fly, the ordered sets of possible states you could be in as single
DFA states.

> The space is at least O(m) to hold the regexp but then only
> whatever you choose as your cache size.  The search only
> needs one DFA state at a time (two during a transition) so the
> lower bound on the cache size is very small.  There's certainly
> no need to remember every state during the run, which is
> what the n in O(min(m!, n)) seems to imply.

The idea of having a fixed cache size is a twist I hadn't considered.

Here is how I got O(min(m!, n)). I was thinking in terms of patching
in new DFA states that map back to an ordered set of NFA states you
could be matching at this point in the string. In the long run you
could theoretically wind up expanding the whole DFA out incrementally,
which means you're bounded above by a set of O(m!) states. But you
only discover the DFA states you actually encounter while trying to
match the string, which gives you an upper limit of O(n) states that
you expand out. Satisfying both limits gives O(min(m!, n)).

Limiting the cache size seems like it would lead to different
behaviors depending on how it is done. For instance, one approach is
to only expand out some DFA states, then work out the rest at run time
using states that are a combination of a common DFA state and
several NFA states. This would lessen how often you head off towards
the worst-case O(n*m) performance issues. An alternate approach would
be to expand out some DFA states, then switch to backtracking when the
state set gets too big, on the theory that you'll usually match one of
the top 5 states. This gives you exponential worst-case performance,
but removes many common cases where it could happen.

The latter approach may seem to defeat the whole goal of a good
algorithm. But that kind of "fall back on backtracking" approach can
be used to match most regular expressions with guaranteed performance,
while still letting you match backreferences.

Cheers,
Ben

Russ Cox

Nov 16, 2009, 5:19:24 PM
to Ben Tilly, golang-nuts
> The latter approach may seem to defeat the whole goal of a good
> algorithm.  But that kind of "fall back on backtracking" approach can
> be used to match most regular expressions with guaranteed performance,
> while still letting you match backreferences.

Sorry, but "most" and "guaranteed" together mean "not guaranteed".
It would be far better to disallow backreferences.

Russ

Ben Tilly

Nov 16, 2009, 6:23:06 PM
to r...@golang.org, golang-nuts
Extending the same reasoning you can conclude that it is better to not
allow people to use Turing complete languages. And indeed in some
cases this is appropriate. (SQL was my favorite example for a long
time, but now SQL is Turing complete.) But in general that is
ridiculous. And so virtually every language designer allows people to
do some potentially bad things, because they can do useful things
using those same features.

I take that attitude back to regular expressions. My preference is
that if a construct could be efficient, then it should be efficient.
But useful features should not be removed just because they are
potentially inefficient.

This is, however, a matter of opinion.

Cheers,
Ben