Wondering if there's anything I've done wrong?
Here is the Python code:
$ cat loader.py
import re

lexicon = {}
unique_words = 0
pnct_ptn = re.compile(r'([\.,\\/\'"\?!=_\)\(\]\[\{\}:;]+|http://[^ ]+)')

def tokenize(s):
    # replace punctuation and URLs with spaces, keep tokens of 3+ chars
    s = pnct_ptn.sub(' ', s)
    return [t for t in s.split() if len(t) >= 3]

for line in open("result.csv", encoding="utf8", errors="ignore"):
    parts = line.split(',', 3)
    if len(parts) != 4:
        continue
    msg = parts[3]
    for word in tokenize(msg.lower()):
        if word not in lexicon:
            unique_words += 1
            lexicon[word] = unique_words
Here is the Go code:
$ cat loader.go
package main

import (
    "bufio"
    "os"
    "regexp"
    "strings"
)

// pattern for removal: URLs and punctuation are replaced with spaces
var pr = regexp.MustCompile(`(http://[^ ]+|['".\\,=()*:;?!/]|-)`)

func tokenize(s string) []string {
    ms := pr.ReplaceAllString(strings.ToLower(s), " ")
    return strings.Split(ms, " ")
}

func main() {
    lex := make(map[string]int) // lexicon: word -> id
    dic := make(map[int]string) // lookup: id -> word
    tw := 0                     // total unique words

    r, err := os.Open("result.csv")
    if err != nil {
        return
    }
    defer r.Close()

    in := bufio.NewReader(r)
    for {
        line, err := in.ReadString('\n')
        if err != nil {
            break
        }
        parts := strings.SplitN(line, ",", 4)
        if len(parts) != 4 {
            continue
        }
        for _, w := range tokenize(parts[3]) {
            if len(w) < 3 {
                continue
            }
            if _, present := lex[w]; !present {
                lex[w] = tw
                dic[tw] = w
                tw++
            }
        }
    }
}
Cheers,
Alex
This has come up a few times, and the explanation is that Go's
garbage collector and regexp library are both currently slow.
Go isn't stable enough yet to spend lots of time on optimisations,
because things will likely change. Premature optimisation, etc.
The optimisations can be done later.
- jessta
=====================
http://jessta.id.au
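
In the meantime, one way to sidestep the regexp cost in a loader like
the one above is to avoid the regexp engine entirely. A rough sketch
using strings.FieldsFunc; note that it only approximates the original
punctuation pattern (for instance, it also splits URLs into pieces):

package main

import (
    "fmt"
    "strings"
    "unicode"
)

// tokenizeFields splits on any rune that is not a letter or digit,
// so no regexp machinery runs at all.
func tokenizeFields(s string) []string {
    return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
        return !unicode.IsLetter(r) && !unicode.IsDigit(r)
    })
}

func main() {
    fmt.Println(tokenizeFields(`Hello, world! See http://example.com/x`))
    // prints: [hello world see http example com x]
}

Unlike strings.Split, FieldsFunc also drops the empty strings between
consecutive separators, so the caller never sees empty tokens.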
I hope to see a much more efficient regular expression package for Go before long. Wrapping PCRE is another option but one I shy away from because that "(usually)" hides a scary story. See http://swtch.com/~rsc/regexp/regexp1.html for some of it. The Go code is an inefficient implementation of the computationally more dependable algorithm. An efficient implementation is a lot of work but at least the inefficient implementation won't ever blow up.
-rob
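
To see the blow-up Rob alludes to, consider a pattern like (a+)+$, a
classic pathological case for backtracking engines such as PCRE. A
small demonstration, which Go's automaton-based matcher handles in
time linear in the input:

package main

import (
    "fmt"
    "regexp"
    "strings"
    "time"
)

func main() {
    // Pathological for backtracking matchers: on a non-matching
    // input they try exponentially many ways to split the a's.
    re := regexp.MustCompile(`(a+)+$`)
    input := strings.Repeat("a", 10000) + "b"
    start := time.Now()
    fmt.Println(re.MatchString(input), time.Since(start))
    // Go answers false in time proportional to the input length.
}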
> Ah, you're talking about CS regular expressions, not "practical"
> regular expressions. Are there any plans on implementing
> backreferences in Go's Regexp package (e.g. using two implementations,
> one for patterns w/backreferences and one for patterns w/o
> backreferences)?
No. We hope to have a fast matcher for true regular expressions,
though.
-rob
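
(One can check this directly: backreference syntax is rejected at
compile time rather than handled by a backtracking fallback.)

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // \1 is not a true regular expression construct, so Compile
    // returns a parse error instead of a matcher.
    _, err := regexp.Compile(`(a+)\1`)
    fmt.Println(err != nil) // true
}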
On Mar 10, 6:01 am, "Rob 'Commander' Pike" <r...@google.com> wrote:
>
> No. We hope to have a fast matcher for true regular expressions,
> though.
So should we expect a Thompson NFA-based regex matcher for Go?
regards
Vivek
That's what package regexp already is.
The submatch tracking slows things a bit
but guarantees not to take exponential time.
Eventually we hope to have something more
like RE2 (http://code.google.com/p/re2/),
which picks off many important common cases
and handles them with the same asymptotic
efficiency but smaller constant factors.
Russ
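
(Concretely, the submatch tracking Russ mentions is the extra
bookkeeping needed to report where each capture group matched, rather
than just whether the pattern matched. A small sketch:)

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`(\w+)@(\w+)`)
    s := "mail gopher@example please"
    // A plain match only answers yes or no.
    fmt.Println(re.MatchString(s)) // true
    // Extracting submatches must also track group boundaries,
    // which is what slows the matcher down a bit.
    fmt.Println(re.FindStringSubmatch(s)) // [gopher@example gopher example]
}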