Re: [go-nuts] Understanding regex performance between Python and Go

987 views
Skip to first unread message

Ian Lance Taylor

unread,
Sep 28, 2012, 10:38:54 AM9/28/12
to Ben Hill, golan...@googlegroups.com
On Fri, Sep 28, 2012 at 7:15 AM, Ben Hill <benh...@gmail.com> wrote:
>
> I noticed that for a particular regex, Python is at least five to six times
> faster, sub 5 ms vs 30 ms, than Go for each line of the file. This
> really adds up, ending with a total runtime of Python 3secs vs Go 24 secs.
>
> Python regex:
> '\d{4}-\d\d-\d\d\s\d\d:\d\d:\d\d,\d{3}\s\w{4,5}\s.+?\s-\s(.*)'
>
> Go regex:
> "\\d{4}-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d:\\d\\d,\\d{3}\\s\\w{4,5}\\s.+?\\s-\\s(.*)"

The Go regexp package uses a version of RE2, which avoids taking
exponential time in regexp matching at the cost of making some regexp
matching slower.

http://code.google.com/p/re2/

There have been various discussions about this on the mailing list.

By the way, I recommend using raw strings when writing regexpts in Go,
so you don't have all the doubled backslashes.

Ian

gta

unread,
Sep 28, 2012, 10:39:14 AM9/28/12
to golan...@googlegroups.com
Hi,
have you tried Tip?

A possibly relevant discussion here.

Ciao.

Giacomo


On Friday, 28 September 2012 16:15:01 UTC+2, Ben Hill wrote:
A couple of years ago I wrote a Python script to parse Java thread dumps. I have been converting that script into Go.


I noticed that for a particular regex, Python is at least five to six times faster, sub 5 ms vs 30 ms, than Go for each line of the file. This
really adds up, ending with a total runtime of Python 3secs vs Go 24 secs.

Python regex:
'\d{4}-\d\d-\d\d\s\d\d:\d\d:\d\d,\d{3}\s\w{4,5}\s.+?\s-\s(.*)'

Go regex:
"\\d{4}-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d:\\d\\d,\\d{3}\\s\\w{4,5}\\s.+?\\s-\\s(.*)"

Sample data:
2012-06-10 12:54:27,162 INFO CORBA.Daemon.App.out - java.lang.Thread.State: WAITING (on object monitor)

What I want is the text to the right of that dash, java.lang.Thread.State: WAITING (on object monitor).

Both are pre-compiled, so that's not the issue.

I'm using this call on the compiled pattern.
logPattern.FindStringSubmatch(line)

I know I could use strings.Split, I have started using split with a simpler regex match check, because of the performance difference. That's not the point of this post,
I want to understand why, so I can either fix my regex or avoid a pitfall I don't understand yet.

Any suggestions? 

Reinhard Wobst

unread,
Sep 28, 2012, 3:11:04 PM9/28/12
to golan...@googlegroups.com
Python regex:
'\d{4}-\d\d-\d\d\s\d\d:\d\d:\d\d,\d{3}\s\w{4,5}\s.+?\s-\s(.*)'

Though we are speaking about regexp performance, let me add a remark I tell all people in Python courses (Python or Go, this does not matter here): I think that people use regular expressions too often, and your example fits well here. Regexp's are "write-only" - hard to understand if somebody else wrote it (except it is multi-line and commented :-), and often they are a performance trap. In your case I would do it like secret services filter messages ;-), e.g.

- test on ")" as last character; if not, discard;
- then look for the last "(" and if a "-" is placed before; if not, discard;
- and so on.

The principle is: Do small and fast tests, and each test selects only a small part of the messages (if they are rare). It will be more code, but readable by everybody (especially if you comment it :-) and much, much faster!

Some 12 years ago, my home-grown bookholding became too slow since it was written in ksh and used a regexp (in egrep) with 7 variable parts. Then I switched to Python and split() etc., and the performance raised so strong that I could not measure the time.

You have many ways to establish such a series of tests, the result will be encouraging I think. More "pythonic" or "go-istic", anyway!

Reinhard

DisposaBoy

unread,
Sep 28, 2012, 3:30:36 PM9/28/12
to golan...@googlegroups.com


On Friday, September 28, 2012 3:15:01 PM UTC+1, Ben Hill wrote:
A couple of years ago I wrote a Python script to parse Java thread dumps. I have been converting that script into Go.

I noticed that for a particular regex, Python is at least five to six times faster, sub 5 ms vs 30 ms, than Go for each line of the file. This
really adds up, ending with a total runtime of Python 3secs vs Go 24 secs.

Python regex:
'\d{4}-\d\d-\d\d\s\d\d:\d\d:\d\d,\d{3}\s\w{4,5}\s.+?\s-\s(.*)'

Go regex:

"\\d{4}-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d:\\d\\d,\\d{3}\\s\\w{4,5}\\s.+?\\s-\\s(.*)"

Sample data:
2012-06-10 12:54:27,162 INFO CORBA.Daemon.App.out - java.lang.Thread.State: WAITING (on object monitor)

What I want is the text to the right of that dash, java.lang.Thread.State: WAITING (on object monitor).

Both are pre-compiled, so that's not the issue.

I'm using this call on the compiled pattern.
logPattern.FindStringSubmatch(line)

I know I could use strings.Split, I have started using split with a simpler regex match check, because of the performance difference. That's not the point of this post,
I want to understand why, so I can either fix my regex or avoid a pitfall I don't understand yet.

Any suggestions?



Is it possible to share the code and data and how you're doing these measurements? If I'm to assume that your data always looks like your sample, then 1ms doesn't sound reasonable, not even in Python so 5ms is suggesting there might be something else at play here.


 

Ben Hill

unread,
Sep 30, 2012, 8:30:32 PM9/30/12
to golan...@googlegroups.com, Ben Hill
Thanks for link and the mention of raw strings. I figured that feature was somewhere in the language.

Ben Hill

unread,
Sep 30, 2012, 8:38:05 PM9/30/12
to golan...@googlegroups.com
I hadn't reviewed the code a couple of years. You're right that some of the regex was overkill, though not all the lines start that pattern, so if the match fails, that's a relevant piece of information.

Due to comments on this thread I have started using Split and fixed length checks where it was appropriate. The program is now running in the same time of the Python script.

I'll post the code when I'm done, it's less than three hundred lines. Then I can point out where the performance hits were occurring.

Patrick Mylund Nielsen

unread,
Sep 30, 2012, 8:58:21 PM9/30/12
to Ian Lance Taylor, Ben Hill, golan...@googlegroups.com
Reply all
Reply to author
Forward
0 new messages