regexp.FindReaderIndex() on a bufio.Reader{} extremely slow


Lukas Lueg

Jan 9, 2015, 5:16:41 AM
to golan...@googlegroups.com
Hi,

I have a (possibly) large io.Reader and need to find some regexp-matching regions in it. The simplest approach is to wrap it in a bufio.Reader{} to get an io.RuneReader on top of the io.Reader and pass that to regexp.FindReaderIndex(). I've noticed, however, that this approach is *extremely* slow. I'm now buffering parts of the original io.Reader myself, slicing strings out of the buffer, sending them through regexp.FindStringIndex(), and patching the results together myself to take care of overlapping regions. It's ugly, but the result is the same and is achieved around ten times faster than with regexp.FindReaderIndex(), despite all the GC overhead caused by the sub-strings that are sliced and then dropped.

Any insight into why a simple regexp.FindReaderIndex(bufio.NewReader(io.Reader)) is so slow?

Best regards

Ian Lance Taylor

Jan 9, 2015, 9:36:18 AM
to Lukas Lueg, golang-nuts
It's going to depend on the regexp. Some operations on strings are
naturally faster because, for example, the regexp code can do a fast
scan, partially written in optimized assembly, for a fixed prefix.
That is not available with an io.Reader, where the regexp code has to
proceed byte by byte.
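
To check whether a given pattern has such a fixed prefix, the regexp package exposes (*Regexp).LiteralPrefix. The sketch below only reports the prefix; whether and how the matcher uses it is an implementation detail that this code does not demonstrate. The helper name literalPrefix and the two patterns are made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// literalPrefix (hypothetical helper) returns the literal string that every
// match of pattern must begin with; it is empty if there is no such prefix.
func literalPrefix(pattern string) string {
	prefix, _ := regexp.MustCompile(pattern).LiteralPrefix()
	return prefix
}

func main() {
	// The first pattern starts with a fixed literal; the second does not.
	for _, p := range []string{`foo[0-9]+`, `[0-9]+foo`} {
		fmt.Printf("%q has literal prefix %q\n", p, literalPrefix(p))
	}
}
```

A pattern with an empty literal prefix is less likely to benefit from the string fast path, so the gap between the string and reader variants may differ from pattern to pattern.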

Ian

Brad Fitzpatrick

Jan 9, 2015, 1:26:05 PM
to Lukas Lueg, golang-nuts
On Fri, Jan 9, 2015 at 2:16 AM, Lukas Lueg <lukas...@gmail.com> wrote:
Hi,

I have a (possibly) large io.Reader and need to find some regexp-matching regions in it. The simplest approach is to wrap it in a bufio.Reader{} to get an io.RuneReader on top of the io.Reader and pass that to regexp.FindReaderIndex(). I've noticed, however, that this approach is *extremely* slow. I'm now buffering parts of the original io.Reader myself, slicing strings out of the buffer, sending them through regexp.FindStringIndex()

Why strings? Use []byte and you shouldn't get the garbage you mention.
 
and patching the results together myself to take care of overlapping regions. It's ugly, but the result is the same and is achieved around ten times faster than with regexp.FindReaderIndex(), despite all the GC overhead caused by the sub-strings that are sliced and then dropped.

Any insight into why a simple regexp.FindReaderIndex(bufio.NewReader(io.Reader)) is so slow?

Best regards
