golang's regexp and re2

3,134 views
Skip to first unread message

guot...@gmail.com

unread,
Sep 18, 2013, 5:49:59 AM9/18/13
to golan...@googlegroups.com
Hi,

I have 2 question about regexp:

first, 
I have read source code of re2,  it has thousands line, and go's regexp code just 1,000 line, has go's regexp implement all re2's function?

second,
regexp's written by Go, and re2 is written by C++, so regexp is much slower than re2, can be optimized as fast as re2,  or less, 90% of re2

David Symonds

unread,
Sep 18, 2013, 6:22:32 AM9/18/13
to guot...@gmail.com, golang-nuts
On 18 September 2013 19:49, <guot...@gmail.com> wrote:

> first,
> I have read source code of re2, it has thousands line, and go's regexp code
> just 1,000 line, has go's regexp implement all re2's function?

No. There's lots of things that RE2 does that Go's regexp package does not do.

> second,
> regexp's written by Go, and re2 is written by C++, so regexp is much slower
> than re2, can be optimized as fast as re2, or less, 90% of re2

Most of RE2's speed actually comes from its selection of different
matching modes based on the pattern. That side of things isn't done in
Go's regexp package, but could be. I suspect Go's regexp could get
reasonably close to RE2's speed.

Java Pro

unread,
Sep 18, 2013, 3:21:02 PM9/18/13
to golan...@googlegroups.com, guot...@gmail.com
Has anyone run https://code.google.com/p/sre2/ (Go's implementation of RE2)

1. to benchmark against Go's regexp
2. to benchmark in the regex-dna shootout 
3. to benchmark against C++ RE2

and compare the results?

Regards,
JPro

Jan Mercl

unread,
Sep 18, 2013, 3:26:03 PM9/18/13
to Java Pro, golang-nuts, guot...@gmail.com
On Wed, Sep 18, 2013 at 9:21 PM, Java Pro <jserv...@gmail.com> wrote:
> Has anyone run https://code.google.com/p/sre2/ (Go's implementation of RE2)
>
> 1. to benchmark against Go's regexp
> 2. to benchmark in the regex-dna shootout
> 3. to benchmark against C++ RE2
>
> and compare the results?

Umm, can you please instead state what's the specific/particular/real
problem you're perhaps triyng to solve?

-j

Java Pro

unread,
Sep 18, 2013, 3:52:02 PM9/18/13
to golan...@googlegroups.com, Java Pro, guot...@gmail.com
I have not run into a particular bottleneck due to regexp yet. But knowing Go's regex could be near metal speed gives one *confidence* in applying it to whatever regex jobs down the road. Just like the new Tesla is adequate on daily usage, but getting into a race, one would have to double check if it is the best option. 

Java Pro

unread,
Sep 18, 2013, 4:04:02 PM9/18/13
to golan...@googlegroups.com, Java Pro, guot...@gmail.com
The shootout http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=regexdna&sort=elapsed

is showing that C++ using RE2 (ranked No. 2) is 14 times faster than Go's regex performance. I just hope Go can do better since my understanding is that both Go's regexp and RE2 are using FSA (not backtracking). So why the performance gap is so big. 

Jan Mercl

unread,
Sep 18, 2013, 4:10:13 PM9/18/13
to Java Pro, golang-nuts, guot...@gmail.com
On Wed, Sep 18, 2013 at 10:04 PM, Java Pro <jserv...@gmail.com> wrote:
> The shootout
> http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=regexdna&sort=elapsed
>
> is showing that C++ using RE2 (ranked No. 2) is 14 times faster than Go's
> regex performance. I just hope Go can do better since my understanding is
> that both Go's regexp and RE2 are using FSA (not backtracking). So why the
> performance gap is so big.

Correct. But the shootout is using PCRE for some, if not most other
implementations: http://swtch.com/~rsc/regexp/regexp1.html

IOW, being slower in some easy cases is a tradeoff for being endlessly
faster in the problematic cases.

And once again, what are trying, in particular, to solve?

-j

Java Pro

unread,
Sep 18, 2013, 4:23:16 PM9/18/13
to golan...@googlegroups.com, Java Pro, guot...@gmail.com
As suggested by Ross Cox, PCRE is already out of the game. We can ignore the participants using PCRE in the shootout. 

I am comparing (non backtracking) implementations only. e.g. RE2 and Go's regexp. Just think near 14 times difference is too big.

http://swtch.com/~rsc/regexp/regexp3.html (we basically want what RE2 offers, and wondering why Go can not do as good. I think 3 times slower is acceptable)

Again, I have not run into an issue yet. It's purely a confidence and curiosity thing.  

Thanks,
JPro

Kamil Kisiel

unread,
Sep 18, 2013, 4:33:21 PM9/18/13
to golan...@googlegroups.com, Java Pro, guot...@gmail.com
I think David Symonds already answered the question as to why that is.

Java Pro

unread,
Sep 18, 2013, 4:41:30 PM9/18/13
to golan...@googlegroups.com, Java Pro, guot...@gmail.com
Kamil,

Thanks for your response. I came across David Symonds response and then thought about the "feature complete" implementation of RE2 in native GO which is SRE2 https://code.google.com/p/sre2/ and wondered how it could do in the following benchmarks. 

1. to benchmark against Go's regexp
2. to benchmark in the regex-dna shootout 
3. to benchmark against C++ RE2

I am thinking the author of SRE2 or someone using it may have already done such benchmarks and can share the results with us.

Best regards,
JPro

Kamil Kisiel

unread,
Sep 18, 2013, 5:00:35 PM9/18/13
to golan...@googlegroups.com, Java Pro, guot...@gmail.com
You could probably submit code to the benchmark shootout that's a modification of the existing Go implementation and uses SRE2. That would answer these questions.

Java Pro

unread,
Sep 18, 2013, 5:03:04 PM9/18/13
to Kamil Kisiel, golan...@googlegroups.com, guot...@gmail.com
Thanks for the advice. I will do it.

Java Pro

unread,
Sep 20, 2013, 12:06:09 AM9/20/13
to Kamil Kisiel, sam.thorogood, golan...@googlegroups.com, guot...@gmail.com
Just a follow up. SRE2 was written before Go 1 was released. I contacted Sam Thorogood, the author of SRE2, yesterday. He was very responsive and said he would spend a fun weekend to upgrade the code to Go 1.x sometime. 

In the meanwhile, I read again the paper at http://swtch.com/~rsc/regexp/regexp[1-3].html and the source of Go's "regexp". My understanding is that "regexp" implements Thompson's algorithm to match all machine threads in lock steps "char by char" without backtracking. After reading the code, it seems to me that not much can be done to the code according to what the paper says. I am wondering if someone already has ideas to improve "regexp"'s performance. Is adding a scheduler for the threads a good idea? What simple things could be done to make it perform closer to C++ RE2?

I downloaded Go's benchmark code from http://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna&lang=go&id=1 ran it and profiled it using pprof. I saw a lot of time was spent on runtime.memmove and runtime.copy. (slice resizing?) I am thinking that we might cut 30~40% of the execution time by removing these memory movements. Please see the pprof info and let me know if these movements were intended.  


(pprof) top10
Total: 8604 samples
    3062  35.6%  35.6%     5178  60.2% regexp.(*machine).add
    2474  28.8%  64.3%     2474  28.8% runtime.memmove
    1537  17.9%  82.2%     2195  25.5% regexp.(*machine).step
     438   5.1%  87.3%     7469  86.8% regexp.(*machine).match
     311   3.6%  90.9%      311   3.6% regexp/syntax.EmptyOpContext
     241   2.8%  93.7%     1970  22.9% runtime.copy
     173   2.0%  95.7%      173   2.0% regexp.(*machine).alloc
     122   1.4%  97.1%      122   1.4% regexp.(*inputBytes).step
     118   1.4%  98.5%      118   1.4% regexp/syntax.(*Inst).MatchRune
      55   0.6%  99.2%       63   0.7% MHeap_AllocLocked


go/src/pkg/runtime/memmove_amd64.s

2474   2474  28.8% runtime.memmove


go/src/pkg/runtime/slice.c
     .      .        236: }
     .      .        237: 
    35     35    238: if(ret == 1 && width == 1) { // common case worth about 2x to do here
     .      .        239: *to.array = *fm.array; // known to be a byte pointer
     .      .        240: } else {
    59   1788  241: runtime·memmove(to.array, fm.array, ret*width);
     .      .        242: }
     .      .        243: 

Sam Thorogood

unread,
Oct 7, 2013, 6:00:56 AM10/7/13
to golan...@googlegroups.com, Kamil Kisiel, sam.thorogood, guot...@gmail.com
I've updated SRE2 to support Go 1.2. The changes were mostly mechanical, the code hasn't practically changed in the last 2 years :)

When I don't care about submatches, it's about equivalent to the current regexp implementation. When I do - it's a lot slower. Output of my test script below.


RUNS=100000 RE=.*(a|(b))+(#*).+ STR=aba#hello
CMD=go run *.go -runs=100000 -re=.*(a|(b))+(#*).+ -s=aba#hello

==sre2 simple (fast, no submatches and uses bitset for states)
result true alt []
real 1.05
user 0.98
sys 0.09

==sre2 submatch (slow, abuse of gc for states)
result true alt [0 9 2 3 -1 -1 3 4]
real 1.87
user 1.78
sys 0.16

==go regexp (probably very fast)
result true alt [0 9 2 3 -1 -1 3 4]
real 1.06
user 0.95
sys 0.09
Reply all
Reply to author
Forward
0 new messages