Wondering if there's anything I've done wrong?
Here is the Python code:
$ cat loader.py
import re

lexicon = {}
unique_words = 0
pnct_ptn = re.compile(r'([\.,\\/\'"\?!=_\)\(\]\[\{\}:;]+|http://[^ ]+)')

def tokenize(s):
    # replace punctuation and URLs with spaces, keep tokens of 3+ chars
    s = pnct_ptn.sub(' ', s)
    return [t for t in s.split() if len(t) >= 3]

for line in open("result.csv", encoding="utf8", errors="ignore"):
    parts = line.split(',', 3)
    if len(parts) != 4:
        continue
    msg = parts[3]
    for word in tokenize(msg.lower()):
        if word not in lexicon:
            unique_words += 1
            lexicon[word] = unique_words
Here is the Go code:
$ cat loader.go
package main

import (
    "bufio"
    "os"
    "regexp"
    "strings"
)

// pattern for removal: URLs and punctuation are replaced with spaces
var pr = regexp.MustCompile(`(http://[^ ]+|['".\\,=()*:;?!/]|-)`)

func tokenize(s string) []string {
    ms := pr.ReplaceAllString(strings.ToLower(s), " ")
    return strings.Split(ms, " ")
}

func main() {
    lex := make(map[string]int) // lexicon: word -> id
    dic := make(map[int]string) // lookup: id -> word
    tw := 0                     // total unique words

    r, err := os.Open("result.csv")
    if err != nil {
        return
    }
    defer r.Close()

    in := bufio.NewReader(r)
    for {
        line, err := in.ReadString('\n')
        if err != nil {
            break
        }
        parts := strings.SplitN(line, ",", 4)
        if len(parts) != 4 {
            continue
        }
        for _, w := range tokenize(parts[3]) {
            if len(w) < 3 {
                continue
            }
            if _, present := lex[w]; !present {
                lex[w] = tw
                dic[tw] = w
                tw++
            }
        }
    }
}
Cheers,
Alex
This has come up a few times, and the explanation is that Go's
garbage collector and regexp library are both currently slow.
Go isn't stable enough yet to spend lots of time on optimisations,
because things will likely change. Premature optimisation, etc.
The optimisations can be done later.
- jessta
=====================
http://jessta.id.au
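
In the meantime, one way to sidestep the regexp cost in a loader like
the one above is to avoid the regexp engine entirely. A rough sketch
using strings.FieldsFunc; note that it only approximates the original
punctuation pattern (for instance, it also splits URLs into pieces):

package main

import (
    "fmt"
    "strings"
    "unicode"
)

// tokenizeFields splits on any rune that is not a letter or digit,
// so no regexp machinery runs at all.
func tokenizeFields(s string) []string {
    return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
        return !unicode.IsLetter(r) && !unicode.IsDigit(r)
    })
}

func main() {
    fmt.Println(tokenizeFields(`Hello, world! See http://example.com/x`))
    // prints: [hello world see http example com x]
}

Unlike strings.Split, FieldsFunc also drops the empty strings between
consecutive separators, so the caller never sees empty tokens.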
I hope to see a much more efficient regular expression package for Go before long. Wrapping PCRE is another option but one I shy away from because that "(usually)" hides a scary story. See http://swtch.com/~rsc/regexp/regexp1.html for some of it. The Go code is an inefficient implementation of the computationally more dependable algorithm. An efficient implementation is a lot of work but at least the inefficient implementation won't ever blow up.
-rob
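
To see the blow-up Rob alludes to, consider a pattern like (a+)+$, a
classic pathological case for backtracking engines such as PCRE. A
small demonstration, which Go's automaton-based matcher handles in
time linear in the input:

package main

import (
    "fmt"
    "regexp"
    "strings"
    "time"
)

func main() {
    // Pathological for backtracking matchers: on a non-matching
    // input they try exponentially many ways to split the a's.
    re := regexp.MustCompile(`(a+)+$`)
    input := strings.Repeat("a", 10000) + "b"
    start := time.Now()
    fmt.Println(re.MatchString(input), time.Since(start))
    // Go answers false in time proportional to the input length.
}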
> Ah, you're talking about CS regular expressions, not "practical"
> regular expressions. Are there any plans on implementing
> backreferences in Go's Regexp package (e.g. using two implementations,
> one for patterns w/backreferences and one for patterns w/o
> backreferences)?
No. We hope to have a fast matcher for true regular expressions,
though.
-rob
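
(One can check this directly: backreference syntax is rejected at
compile time rather than handled by a backtracking fallback.)

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // \1 is not a true regular expression construct, so Compile
    // returns a parse error instead of a matcher.
    _, err := regexp.Compile(`(a+)\1`)
    fmt.Println(err != nil) // true
}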
On Mar 10, 6:01 am, "Rob 'Commander' Pike" <r...@google.com> wrote:
>
> No. We hope to have a fast matcher for true regular expressions,
> though.
So should we expect a Thompson NFA-based regex matcher for Go?
regards
Vivek
That's what package regexp already is.
The submatch tracking slows things a bit
but guarantees not to take exponential time.
Eventually we hope to have something more
like RE2 (http://code.google.com/p/re2/),
which picks off many important common cases
and handles them with the same asymptotic
efficiency but smaller constant factors.
Russ
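
(Concretely, the submatch tracking Russ mentions is the extra
bookkeeping needed to report where each capture group matched, rather
than just whether the pattern matched. A small sketch:)

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`(\w+)@(\w+)`)
    s := "mail gopher@example please"
    // A plain match only answers yes or no.
    fmt.Println(re.MatchString(s)) // true
    // Extracting submatches must also track group boundaries,
    // which is what slows the matcher down a bit.
    fmt.Println(re.FindStringSubmatch(s)) // [gopher@example gopher example]
}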