bufio.Scanner SplitFunc for all possible line endings


Donovan Hide

May 24, 2013, 6:33:57 AM
to golang-nuts
Hi,

I'm working on a replay-able event stream client and server library and am working through the specs:

http://www.whatwg.org/specs/web-apps/current-work/multipage/comms.html#parsing-an-event-stream

End of line is defined as (crlf/cr/lf)

The standard library bufio.ScanLines works for \r?\n but the event stream spec, I believe, is the equivalent of [\r\n]+

Before I start writing a custom SplitFunc, I just wanted to ask if anyone had a smarter idea than switching from bytes.Index to the less optimised bytes.IndexAny in:

http://tip.golang.org/src/pkg/bufio/scan.go line 278

Cheers,
Donovan.

roger peppe

May 24, 2013, 8:08:26 AM
to Donovan Hide, golang-nuts
On 24 May 2013 11:33, Donovan Hide <donov...@gmail.com> wrote:
> Hi,
>
> I'm working on a replay-able event stream client and server library and am
> working through the specs:
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/comms.html#parsing-an-event-stream
>
> End of line is defined as (crlf/cr/lf)
>
> The standard library bufio.ScanLines works for \r?\n but the event stream
> spec, I believe, is the equivalent of [\r\n]+

You mean \r\n|[\r\n] I presume?

> Before I start writing a custom SplitFunc just wanted to ask if anyone had a
> smarter idea than switching from bytes.Index to the less optimised
> bytes.IndexAny in:
>
> http://tip.golang.org/src/pkg/bufio/scan.go line 278

You might be better off just writing the loop by hand
rather than using IndexAny.

// Look for the first CR or LF; end records where it was found.
found, end := false, 0
for i, b := range data {
	if b == '\r' || b == '\n' {
		found, end = true, i
		break
	}
}
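
For anyone following along, here is a minimal sketch of how a loop like this could grow into a complete SplitFunc for the \r\n|[\r\n] case. The names ScanAnyLine and splitLines are made up for illustration; they are not part of the standard library and not taken from anyone's code in this thread. The one subtle case is a CR that happens to be the last byte in the buffer: the function asks for more data before deciding whether an LF follows.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// ScanAnyLine is a hypothetical bufio.SplitFunc that treats CRLF, a lone CR,
// or a lone LF as one line terminator. The terminator is not part of the token.
func ScanAnyLine(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	for i, b := range data {
		switch b {
		case '\n':
			return i + 1, data[:i], nil
		case '\r':
			if i+1 < len(data) {
				if data[i+1] == '\n' {
					return i + 2, data[:i], nil // CRLF
				}
				return i + 1, data[:i], nil // lone CR
			}
			if atEOF {
				return i + 1, data[:i], nil // CR is the final byte of the input
			}
			// CR is the last byte read so far; request more data to see
			// whether an LF follows.
			return 0, nil, nil
		}
	}
	if atEOF {
		// Final line without a terminator.
		return len(data), data, nil
	}
	return 0, nil, nil
}

// splitLines is a small helper for the example: it scans input with ScanAnyLine.
func splitLines(input string) []string {
	s := bufio.NewScanner(strings.NewReader(input))
	s.Split(ScanAnyLine)
	var lines []string
	for s.Scan() {
		lines = append(lines, s.Text())
	}
	return lines
}

func main() {
	fmt.Printf("%q\n", splitLines("a\r\nb\rc\n\nd")) // ["a" "b" "c" "" "d"]
}
```

On a live event stream the "request more data" branch means a line ending in a bare CR is only delivered once the next byte arrives, which may or may not be acceptable for a given client.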

adnaan badr

May 24, 2013, 8:40:25 AM
to golan...@googlegroups.com
This could work for you: http://tip.golang.org/src/pkg/bufio/bufio.go#420

reader := bufio.NewReader(os.Stdin)
// delimiter must be a single byte, for example '\n'.
row, err := reader.ReadString(delimiter)
if err != nil {
	panic(err)
}
fmt.Println(row)

Donovan Hide

May 25, 2013, 7:35:11 AM
to golang-nuts
Thanks for the suggestions! Ended up following Roger's advice and just looping over the bytes with a bit of switch logic involved. Seems to work, but probably need some more tests :-)




Donovan Hide

May 26, 2013, 1:50:02 PM
to golang-nuts
Well, it seemed like a good idea to use bufio.Scanner for parsing EventSource streams until I hit the MaxTokenSize limit:

bufio.Scanner: token too long

Which is a great shame! There seemed to be discussion of the issue during the design of the Scanner type:


An "in the wild" example of a large item greater than 64k to be parsed can be seen here:


when a depth_shot event appears every 30 seconds or so.... 

Would the recommendation be to write a version of bufio.ReadString that deals with the \r\n|[\r\n] case, or to submit a CL/bug for a configurable MaxTokenSize on bufio.Scanner?

Cheers,
Donovan.
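
A note for later readers: Go 1.6, released well after this thread, added a Scanner.Buffer method that lets the caller supply the buffer and the maximum token size. The sketch below first reproduces the "token too long" error with the default limit and then raises the limit. The helpers scanAll and longLine and the chosen sizes are illustrative assumptions only.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// longLine returns a single line of n bytes followed by a newline.
func longLine(n int) string {
	return strings.Repeat("x", n) + "\n"
}

// scanAll scans input line by line and returns the number of lines seen and
// any scanner error. If max > 0 it is passed to Scanner.Buffer (Go 1.6+) as
// the maximum token size; otherwise the default bufio.MaxScanTokenSize applies.
func scanAll(input string, max int) (int, error) {
	s := bufio.NewScanner(strings.NewReader(input))
	if max > 0 {
		s.Buffer(make([]byte, 0, 4096), max)
	}
	n := 0
	for s.Scan() {
		n++
	}
	return n, s.Err()
}

func main() {
	line := longLine(100 * 1024) // larger than the 64 KiB default limit

	_, err := scanAll(line, 0)
	fmt.Println(err) // bufio.Scanner: token too long

	n, err := scanAll(line, 1024*1024)
	fmt.Println(n, err) // 1 <nil>
}
```

This only moves the limit; a stream with unbounded events still needs a reader-based approach.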

Rob Pike

May 26, 2013, 2:09:29 PM
to Donovan Hide, golang-nuts
The bufio.Scan interface is intended to be for simple cases and not to
be cluttered with configuration parameters and other complexities. It
is designed for convenience not generality. If it doesn't do the job
for you, the package has other facilities that operate without
restrictions.

-rob

Donovan Hide

May 26, 2013, 4:30:52 PM
to Rob Pike, golang-nuts
Thanks for the reply! I'd be interested to learn how you would solve the same problem using the other functions in bufio. The best I could come up with was a normalising (I'm from the UK, hence the 's'!) reader which feeds into a bufio.Reader. I'm fairly sure the shift-left copy is a bit inefficient though...


It's interesting to compare with Python which does have standard library support for all possible line-endings, albeit not in buffered reader form:


Of course, the cheeky and lazy solution would have been to copy and paste:


change the package name and alter the const MaxScanTokenSize, but that would have been cheating :-)

Cheers,
Donovan.

Donovan Hide

May 26, 2013, 5:36:41 PM
to Rob Pike, golang-nuts
If anyone is interested, the normalising reader in the last code snippet didn't work for consecutive CRLFCRLF sequences and other situations. Updated here:


Kind of proves that rolling two-byte sequences are hard to catch :-)
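
Since the linked snippets are not reproduced here, the following is an independent sketch of the normalising-reader idea, not Donovan's code. It assumes a hypothetical normalizer type that rewrites CRLF and lone CR to LF in place, remembering across Read calls whether the previous byte was a CR, so a CRLF pair split over two reads is still collapsed. The result feeds a bufio.Reader, whose ReadString has no token size limit. The example wraps the input in iotest.OneByteReader so that every CRLF straddles a read boundary.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
	"testing/iotest"
)

// normalizer is a hypothetical io.Reader wrapper that converts CRLF and lone
// CR line endings to LF.
type normalizer struct {
	r      io.Reader
	lastCR bool // the previous input byte was a CR (already emitted as LF)
}

func (n *normalizer) Read(p []byte) (int, error) {
	for {
		m, err := n.r.Read(p)
		w := 0 // write index; never overtakes the read index
		for _, b := range p[:m] {
			switch {
			case b == '\r':
				p[w] = '\n'
				w++
				n.lastCR = true
			case b == '\n' && n.lastCR:
				// Second half of a CRLF pair: the LF was already emitted.
				n.lastCR = false
			default:
				p[w] = b
				w++
				n.lastCR = false
			}
		}
		if w > 0 || err != nil || m == 0 {
			return w, err
		}
		// Everything read was dropped (an LF completing a CRLF); read again
		// rather than returning 0 bytes with a nil error.
	}
}

// readLines reads input one byte at a time through the normalizer and
// returns the lines without their terminators.
func readLines(input string) []string {
	src := iotest.OneByteReader(strings.NewReader(input))
	br := bufio.NewReader(&normalizer{r: src})
	var lines []string
	for {
		line, err := br.ReadString('\n')
		if err != nil {
			if line != "" {
				lines = append(lines, line) // final unterminated line
			}
			return lines
		}
		lines = append(lines, strings.TrimSuffix(line, "\n"))
	}
}

func main() {
	fmt.Printf("%q\n", readLines("a\r\nb\rc\n\r\n\r\nd")) // ["a" "b" "c" "" "" "d"]
}
```

Emitting the LF as soon as the CR is seen avoids waiting for the next byte, at the cost of carrying one bit of state between reads.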

