How does regexp.FindReaderIndex work?


Toni Cárdenas

Sep 12, 2012, 7:30:37 PM
to golan...@googlegroups.com
I'm trying to parse a file through a regexp, iterating over it using regexp.FindReaderIndex, but I get unexpected behaviour.

file.txt:

"abc
abc
abc
"

Go code:

source := bufio.NewReader(sourceFile)
re := regexp.MustCompile(`abc\n`)
fmt.Println(re.FindReaderIndex(source))
fmt.Println(re.FindReaderIndex(source))
fmt.Println(re.FindReaderIndex(source))

Now, I would expect that to output [0 4][0 4][0 4] but instead I get [0 4][1 5] and that's it. It seems that FindReaderIndex doesn't consume that last \n character from the source stream. Why is that?

Jesse McNelis

Sep 12, 2012, 7:58:29 PM
to Toni Cárdenas, golan...@googlegroups.com
On Thu, Sep 13, 2012 at 9:30 AM, Toni Cárdenas <to...@tcardenas.me> wrote:
> source := bufio.NewReader(sourceFile)
> re := regexp.MustCompile(`abc\n`)
> fmt.Println(re.FindReaderIndex(source))
> fmt.Println(re.FindReaderIndex(source))
> fmt.Println(re.FindReaderIndex(source))
>
> Now, I would expect that to output [0 4][0 4][0 4] but instead I get [0 4][1
> 5] and that's it. It seems that FindReaderIndex doesn't consume that last \n
> character from the source stream. Why is that?

That's two characters: a '\' and an 'n'.
You probably want:
re := regexp.MustCompile("abc\n")



--
=====================
http://jessta.id.au

Peter S

Sep 12, 2012, 8:54:38 PM
to Toni Cárdenas, golan...@googlegroups.com
I can't solve your problem, but it seems to me that the issue is that it consumes more runes, rather than less (otherwise it should give three matches). (Backquote for regexp doesn't seem to be the problem, `\n` is legal RE2 syntax, and changing it to double quotes doesn't help either.)

I put it on the Playground for easier testing:
http://play.golang.org/p/TFcpAVfy-1

Peter
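
A quick way to check the point about backquotes is to compile both spellings and match them against the same text. This is only a sketch; the helper name matchesNewline is made up for illustration. In the raw-string form the regexp engine sees a backslash and an 'n' and interprets the escape itself, while in the double-quoted form the Go compiler has already produced a newline byte.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesNewline (hypothetical helper) reports whether pattern,
// compiled as a regexp, matches the four-byte text "abc\n".
func matchesNewline(pattern string) bool {
	return regexp.MustCompile(pattern).MatchString("abc\n")
}

func main() {
	raw := `abc\n`         // 5 bytes: the regexp parser handles the \n escape
	interpreted := "abc\n" // 4 bytes: the Go compiler already made a newline

	fmt.Println(len(raw), matchesNewline(raw))                 // 5 true
	fmt.Println(len(interpreted), matchesNewline(interpreted)) // 4 true
}
```

Both patterns match, so the quoting style is not what makes the later FindReaderIndex calls fail.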



Toni Cárdenas

Sep 13, 2012, 8:04:58 AM
to golan...@googlegroups.com, Toni Cárdenas
Here is a more illustrative test: http://play.golang.org/p/toGyzf5toG

Examining the output, FindReaderIndex seems to take a certain amount of runes from the buffer and to consume from it until the pattern is found, but then it doesn't put back the remaining taken runes onto the buffer. I don't know if this behaviour is to be expected, but if it is, which would be a more proper way of parsing a file?

I can remake the buffer on each iteration like this: http://play.golang.org/p/fYkIYBlPdx , but certainly it doesn't look nice.

roger peppe

Sep 13, 2012, 9:27:28 AM
to Toni Cárdenas, golan...@googlegroups.com
On 13 September 2012 13:04, Toni Cárdenas <to...@tcardenas.me> wrote:
> Here is a more illustrative test: http://play.golang.org/p/toGyzf5toG
>
> Examining the output, FindReaderIndex seems to take a certain amount of
> runes from the buffer and to consume from it until the pattern is found, but
> then it doesn't put back the remaining taken runes onto the buffer. I don't
> know if this behaviour is to be expected, but if it is, which would be a
> more proper way of parsing a file?
>
> I can remake the buffer on each iteration like this:
> http://play.golang.org/p/fYkIYBlPdx , but certainly it doesn't look nice.

This is an interesting problem that I made a little toy solution for
some time ago:

http://play.golang.org/p/1pFFrhYKXW

I think it's about as good as you can get - there is the inherent
problem that you have to buffer all input until a match has
occurred, because the match might actually happen
right at the beginning.

So searching for a string that's never found in a large stream
is inefficient. If you were doing it for real, you might use
a temporary file.

Russ Cox

Sep 13, 2012, 3:36:32 PM
to Peter S, Toni Cárdenas, golan...@googlegroups.com
On Wed, Sep 12, 2012 at 8:54 PM, Peter S <spete...@gmail.com> wrote:
> I can't solve your problem, but it seems to me that the issue is that it
> consumes more runes, rather than less (otherwise it should give three
> matches).

$ godoc regexp
...
There is also a subset of the methods that can be applied to text read
from a RuneReader:

MatchReader, FindReaderIndex, FindReaderSubmatchIndex

This set may grow. Note that regular expression matches may need to
examine text beyond the text returned by a match, so the methods that
match text from a RuneReader may read arbitrarily far into the input
before returning.
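
One way to observe the behaviour the documentation describes is to wrap the input in a reader that counts what it hands out. The sketch below uses a made-up countingReader type and a made-up consumed helper. How far past the match the engine reads is an implementation detail and may differ between Go versions, so the example prints the number rather than assuming a value.

```go
package main

import (
	"fmt"
	"io"
	"regexp"
	"strings"
)

// countingReader (hypothetical) counts the bytes returned through ReadRune.
type countingReader struct {
	src io.RuneReader
	n   int
}

func (c *countingReader) ReadRune() (rune, int, error) {
	r, size, err := c.src.ReadRune()
	c.n += size
	return r, size, err
}

// consumed returns the match location and how many bytes the regexp
// engine pulled from the reader before returning.
func consumed(pattern, text string) (loc []int, read int) {
	cr := &countingReader{src: strings.NewReader(text)}
	loc = regexp.MustCompile(pattern).FindReaderIndex(cr)
	return loc, cr.n
}

func main() {
	loc, read := consumed(`abc\n`, "abc\nabc\nabc\n")
	// read is at least loc[1]; anything beyond that is gone from the reader.
	fmt.Println("match:", loc, "bytes read:", read)
}
```

Any bytes counted beyond the end of the match have been taken from the reader and are not seen by the next FindReaderIndex call on it.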

Toni Cárdenas

Sep 13, 2012, 6:58:53 PM
to golan...@googlegroups.com, Toni Cárdenas
That's nice, but a little overkill for me because my pattern starts with ^. I've just ended up seeking to the previous offset of the file after running FindReaderIndex, consuming from there just the matched string and making a new buffer each time, in a similar way to my previous post.

> $ godoc regexp
> ...
>     There is also a subset of the methods that can be applied to text read
>     from a RuneReader:
>         MatchReader, FindReaderIndex, FindReaderSubmatchIndex
>     This set may grow. Note that regular expression matches may need to
>     examine text beyond the text returned by a match, so the methods that
>     match text from a RuneReader may read arbitrarily far into the input
>     before returning.

Yeah, that would've helped if I had read it in time. Thanks!

roger peppe

Sep 14, 2012, 4:49:06 AM
to Toni Cárdenas, golan...@googlegroups.com
If you've got a seekable file, that's definitely the way forward.
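
A rough sketch of the seek-based approach, for anyone landing on this thread later. It is not the code from the earlier playground links. The function names findAllSeek and seekOffsets are invented here, and a strings.Reader stands in for the file because it also implements io.ReadSeeker. The offsets returned by FindReaderIndex are byte offsets relative to where the reader started, so the loop keeps an absolute base and builds a fresh bufio.Reader at that position each time.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// findAllSeek (hypothetical) returns absolute byte offsets of successive
// matches, re-seeking to the end of each match before searching again.
func findAllSeek(rs io.ReadSeeker, re *regexp.Regexp) ([][2]int64, error) {
	var matches [][2]int64
	var base int64
	for {
		if _, err := rs.Seek(base, io.SeekStart); err != nil {
			return matches, err
		}
		// A new buffer each round discards whatever the last search read ahead.
		loc := re.FindReaderIndex(bufio.NewReader(rs))
		if loc == nil {
			return matches, nil
		}
		matches = append(matches, [2]int64{base + int64(loc[0]), base + int64(loc[1])})
		if loc[1] == 0 {
			return matches, nil // empty match at the start: stop instead of looping forever
		}
		base += int64(loc[1])
	}
}

// seekOffsets is a convenience wrapper over an in-memory "file".
func seekOffsets(pattern, text string) [][2]int64 {
	m, _ := findAllSeek(strings.NewReader(text), regexp.MustCompile(pattern))
	return m
}

func main() {
	fmt.Println(seekOffsets(`^abc\n`, "abc\nabc\nabc\n")) // [[0 4] [4 8] [8 12]]
}
```

Because every search starts from a fresh reader, a pattern anchored with ^ matches at the current base offset each round.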

roger peppe

Sep 14, 2012, 4:51:36 AM
to Toni Cárdenas, golan...@googlegroups.com
BTW, if your regexp can't span line boundaries, you could just
read line by line...
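
A minimal sketch of the line-by-line variant, assuming the pattern never needs to see a newline. It uses bufio.Scanner, which was added to the standard library after this thread was written. The names matchingLines and linesIn are made up. Scanner strips the line terminator, so the pattern is written as ^abc$ rather than abc\n.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// matchingLines (hypothetical) returns the 1-based numbers of the lines
// in r that match re. Each line is matched without its trailing newline.
func matchingLines(r io.Reader, re *regexp.Regexp) ([]int, error) {
	var lines []int
	sc := bufio.NewScanner(r)
	for n := 1; sc.Scan(); n++ {
		if re.MatchString(sc.Text()) {
			lines = append(lines, n)
		}
	}
	return lines, sc.Err()
}

// linesIn is a convenience wrapper for in-memory text.
func linesIn(pattern, text string) []int {
	lines, _ := matchingLines(strings.NewReader(text), regexp.MustCompile(pattern))
	return lines
}

func main() {
	fmt.Println(linesIn(`^abc$`, "abc\nabc\nabc\n")) // [1 2 3]
}
```

Since each line is a complete string, the read-ahead problem with RuneReader does not come up at all here.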

Kevin Gillette

Sep 16, 2012, 12:14:56 AM
to golan...@googlegroups.com, Toni Cárdenas
That method takes an io.RuneReader and therefore has no way to "push back" unread runes. Since, by construction, we know that the index in the reader is already "gone" by the time it's found, unless you have something seekable or fully buffered, the index can't be usefully reused with respect to that same "reader".

Also keep in mind that Go doesn't treat regexps as the "magic bullet" that many other languages do. For example, in interpreted languages like Perl, Python, or Ruby, it's generally going to be much faster to use regexps for anything more complex than exact substring searching, whereas in Go, it's usually faster (and sometimes even simpler) to use competently written custom code, even for tasks that regexps are "good at." Therefore, aside from cases where a user supplies search patterns at runtime, regexps in Go are just a "convenience," not a "necessity."
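
As an illustration of the hand-written alternative for the fixed string in this thread, here is a small sketch built on strings.Index. The function name indexAll is made up, and no benchmark is implied. It only shows that a literal pattern such as "abc\n" needs no regexp at all.

```go
package main

import (
	"fmt"
	"strings"
)

// indexAll (hypothetical) returns the byte offsets of every
// non-overlapping occurrence of sep in text.
func indexAll(text, sep string) []int {
	if sep == "" {
		return nil // an empty separator would never advance
	}
	var starts []int
	for off := 0; ; {
		i := strings.Index(text[off:], sep)
		if i < 0 {
			return starts
		}
		starts = append(starts, off+i)
		off += i + len(sep)
	}
}

func main() {
	fmt.Println(indexAll("abc\nabc\nabc\n", "abc\n")) // [0 4 8]
}
```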