proposal: new interface for scanning lines in bufio


Rob Pike

Feb 13, 2013, 9:36:37 PM
to golan...@googlegroups.com
The existing bufio support for line-at-a-time I/O is cumbersome. Here
is a proposal for an all-new design that should be easier to use,
designed with help from Brad Fitzpatrick. This API would be an add-on,
not a replacement for the existing ReadSlice etc.

It's part of the bufio package, not being worth another package. In
any case it can use some existing bufio internals to provide an
efficient implementation.

We add a new type, called Scanner, to capture the new functionality.
Its constructor takes an io.ReadCloser, for reasons that will become
clear. (The caller can promote a Reader to a ReadCloser using
ioutil.NopCloser.) If the argument is not already a bufio.Reader, one
is created to wrap the argument.

This gives us:

package bufio

type Scanner struct { /* hidden */ }

func NewScanner(r io.ReadCloser) *Scanner

The model for the scanner is to "tokenize" the input into text to be
processed, separated by delimiters that are discarded. In the default
case, this means lines of text separated by `\r?\n`. It is not
possible in this design to discover whether, for instance, the last
line of the input ends with a newline. This is OK; the point of this
API is to make I/O easier and discovering such details about the input
complicates existing designs.

To scan the input, use the Next method as the loop condition, the
Bytes or Line methods as the "getters", and Close at the end. Here are
the method signatures:

func (s *Scanner) Next() bool

func (s *Scanner) Close() error

func (s *Scanner) Bytes() []byte // Does not copy; data is volatile.

func (s *Scanner) Text() string

The last name is Text not String so we don't accidentally create a
fmt.Stringer out of a Scanner.

I/O works by calling Next to load the next "token". It returns false
at EOF or error. The Close method returns:
nil if there was no error; or
nil if the only error was EOF; or
whatever non-EOF error triggered the scan to stop, including line-too-long; or
the return value of the reader's Close method.

Note that a line-too-long terminates the scan. If you need to deal
with crazy-long lines, you'll need to use ReadSlice or just Read etc.

Here is code to print a file line-by-line, with line numbers:

s := bufio.NewScanner(os.Stdin)
for i := 1; s.Next(); i++ {
    line := s.Bytes()
    fmt.Printf("%3d\t%s\n", i, line)
}
if err := s.Close(); err != nil {
    log.Fatal(err)
}

This is the basic outline; it seems clean and easy to use. The
deferral of error until Close (thanks, Brad) is the key insight to
having the code be simple.

We could generalize a little on top of this. One easy step is to allow
options, such as to control the maximum token length. These are done
with a chaining API so they don't need to be in the constructor. There
should be very few of them. If it can be done efficiently, we could
provide:

func (s *Scanner) MaxLength(length int) *Scanner // default 4k

To allow the user to specify the token-splitting algorithm, we add a
function option. It's not easy to use, but it won't be used much.
Still, a word-breaking splitter, for instance, would be nice, as would
a byte-at-a-time and rune-at-a-time scanner, and we could provide
those in bufio itself. The function is called for each byte and has
this signature:

type SplitFunc func(char byte, atEof bool) []byte

The atEof argument is true only at EOF, giving the function a chance
to terminate the last token. A nil return means 'nothing' and can be
discriminated from an empty return. (Another design would be to
provide a separate boolean to indicate whether there is a value
returned; either should work well.) We set up a custom splitter with
an option method:

func (s *Scanner) Split(SplitFunc) *Scanner // default: split on line breaks; cr ignored

For example, if we provided a rune splitter in the package, you'd scan
runes like this:

s := bufio.NewScanner(os.Stdin).Split(bufio.SplitRune)
for s.Next() {
    fmt.Printf("rune: %s\n", s.Bytes())
}
if err := s.Close(); err != nil {
    log.Fatal(err)
}

Comments welcome. I have no implementation.

-rob

jimmy frasche

Feb 13, 2013, 10:00:48 PM
to Rob Pike, golang-nuts
This would simplify most of the line handling code I've written so
far. The API looks solid and easy to use.

My only concern is the limitation of the splitter to a func. I think
having something like net/http's Handler/HandlerFunc would be better.
Sometimes you have to store state like "last char was an escape
character". Perhaps that's exotic enough to leave to the more
cumbersome techniques, however.

David Symonds

Feb 13, 2013, 11:04:51 PM
to jimmy frasche, Rob Pike, golang-nuts
On Thu, Feb 14, 2013 at 2:00 PM, jimmy frasche <soapbo...@gmail.com> wrote:

> My only concern is the limitation of the splitter to a func. I think
> having something like net/http's Handler/HandlerFunc would be better.
> Sometimes you have to store state like "last char was an escape
> character". Perhaps that's exotic enough to leave to the more
> cumbersome techniques, however.

You could always store state by using a closure. The dominant case is
a stateless function, though, so passing a func seems like a better
fit.

Andrew Gerrand

Feb 14, 2013, 1:35:46 AM
to Rob Pike, golang-nuts
On 14 February 2013 13:36, Rob Pike <r...@golang.org> wrote:
The last name is Text not String so we don't accidentally create a
fmt.Stringer out of a Scanner.

Why not let it be a Stringer?

s := bufio.NewScanner(r)
for s.Next() {
  fmt.Println(s)
}
// etc

I would be more concerned if the String/Text method had some kind of side effect, apart from an allocation. Since you must always advance with Next, I don't see the problem with Scanner being a Stringer.

What happens if you call the Bytes or String/Text methods before calling Next (or after calling Close)? Panic? If that's the case, I could see the fmt.Stringer thing being an issue.

Andrew
 

Anthony Martin

Feb 14, 2013, 1:48:50 AM
to Rob Pike, golan...@googlegroups.com
What's the rationale for the names "Next" and "Close"?

The only precedent I can see is database/sql.(*Rows).Next.
All other Next methods in the standard library actually
return the "next" of something.

Anthony

Kamil Kisiel

Feb 14, 2013, 2:47:39 AM
to golan...@googlegroups.com


On Wednesday, February 13, 2013 6:36:37 PM UTC-8, Rob Pike wrote:
To scan the input, use the Next method as the loop condition, the
Bytes or Line methods as the "getters", and Close at the end.

By "Line" did you mean "Text"? or is there some other method?

The API looks good to me, probably as close to Python's "for line in f:" as is possible with Go constructs.

Is the choice of line ending fixed? I remember there were some people on the mailing list asking for convenient handling of \r some time back, for dealing with old mac files. 

In the Python world there's something called "Universal Newlines" mode that treats any of \n, \r\n or \r as a newline if you open a file with a special flag. The splitlines() function does the same thing.

Dan Kortschak

Feb 14, 2013, 2:50:34 AM
to Andrew Gerrand, Rob Pike, golang-nuts
I was thinking the same thing. It gives access to one of the nicer perl/python idioms.

Also, as someone who regularly deals with crazy long lines (there is nothing in the fasta specification that insists on any line breaks except after seqs and ids) I'd like a possibility of continuing on from an overlong line in some way.

Very nice though.

Dan

On 14/02/2013, at 5:06 PM, "Andrew Gerrand" <a...@golang.org> wrote:

On 14 February 2013 13:36, Rob Pike <r...@golang.org> wrote:
The last name is Text not String so we don't accidentally create a
fmt.Stringer out of a Scanner.

Why not let it be a Stringer?

s := bufio.NewScanner(r)
for s.Next() {
  fmt.Println(s)
}
// etc

I would be more concerned if the String/Text method had some kind of side effect, apart from an allocation. Since you must always advance with Next, I don't see the problem with Scanner being a Stringer.

Donovan Hide

Feb 14, 2013, 2:59:24 AM
to Dan Kortschak, Andrew Gerrand, Rob Pike, golang-nuts
Would there be any future possibility of hooking this into range? If there is some thought going into a future implementation of iterators which work with range, then this is a prime candidate!

s := bufio.NewScanner(r)
for i,line := range s {
    fmt.Println(i,line)
}


Kevin Gillette

Feb 14, 2013, 3:36:19 AM
to golan...@googlegroups.com
The semantics of SplitFunc are unclear to me. That func type returning []byte while accepting a byte at a time suggests (to me) that it must have internal state, such as with a closure, accumulating bytes until it can return a token (returning nil until the token is constructed). If that interpretation is correct, then contrary to what others have said, almost every use case would need state, and thus soapboxcicero's suggestion of employing an interface would be appropriate. On the other hand, if the return value is only ever expected to be nil or of length <= 1 (which is usually all a stateless func can do with a single byte as input) in order to achieve an in-band signalling effect, I believe a multi-valued return signature would be clearer.

yy

Feb 14, 2013, 3:45:27 AM
to golang-nuts
(sorry Dan, I sent this reply just to you, fwding to the list)

On 14 February 2013 08:50, Dan Kortschak <dan.ko...@adelaide.edu.au> wrote:
 I'd like a possibility of continuing on from an overlong ling in some way.

If there were a (s *Scanner) LineTooLong() bool method (or, more generally, an s.Error() method), wouldn't something like this work?

    func longerNext(s *bufio.Scanner) bool {
        if s.LineTooLong() {
            return s.MaxLength(2*len(s.Bytes())).Next() || longerNext(s)
        }
        return false
    }

and, in the "scanning loop":

    for s.Next() || longerNext(s) {
 
This may be dangerous, I don't think it should be part of the library, but you could easily implement it as long as there's some way to know the error without closing the reader. In any case, I think it would be interesting having such functionality, because you may also want to discard a long line but keep reading.


--
- yiyus || JGL .

Dan Kortschak

Feb 14, 2013, 3:53:12 AM
to yy, golang-nuts
I was thinking more of a way to continue rather than grow. I guess sort of how you would if a line were too long when reading with bufio.ReadLine. You would need to have a way of checking errors without the call to Close(), which may not be wanted. If the underlying Reader does read past where the buffer would overflow, I guess you could create a new Scanner on that Reader, but it seems wasteful.

Gustavo Niemeyer

Feb 14, 2013, 4:49:08 AM
to Rob Pike, golan...@googlegroups.com
On Thu, Feb 14, 2013 at 12:36 AM, Rob Pike <r...@golang.org> wrote:
> This is the basic outline; it seems clean and easy to use. The
> deferral of error until Close (thanks, Brad) is the key insight to
> having the code be simple.

That's the design of mgo iterators, except there's an explicit method
iter.Err() instead. It works quite well. I was just tempted to suggest
not bundling this with Close, but I guess the common case is files
opened just for parsing, so the NopCloser should be fine for the rest.

> type SplitFunc func(char byte, atEof bool) []byte

I suggest this instead:

type Splitter interface {
    Split(buf []byte) [][]byte
}

The last call to Split is done with nil. This avoids both atEof and
the bool return, and is also more friendly to splitters that need to
take state.

> to terminate the last token. A nil return means 'nothing' and can be
> discriminated from an empty return. (Another design would be to

Given we encourage people to handle nil slices as if they were empty,
might be best to avoid purposefully differentiating a nil return from
an empty one.


gustavo @ http://niemeyer.net

Aston Motes

Feb 14, 2013, 5:23:50 AM
to Anthony Martin, Rob Pike, golan...@googlegroups.com
It seems like "Scan" works just as well for the name of the "function to do the next thing" loop condition and avoids the confusion of "Next" not returning the next line.


Gustavo Niemeyer

Feb 14, 2013, 5:38:21 AM
to David Symonds, jimmy frasche, Rob Pike, golang-nuts
On Thu, Feb 14, 2013 at 2:04 AM, David Symonds <dsym...@golang.org> wrote:
> You could always store state by using a closure. The dominant case is
> a stateless function, though, so passing a func seems like a better
> fit.

Is it? I don't think the current behavior of ReadLine can be made stateless.


gustavo @ http://niemeyer.net

roger peppe

Feb 14, 2013, 6:57:34 AM
to Rob Pike, golang-nuts
On 14 February 2013 02:36, Rob Pike <r...@golang.org> wrote:
> The existing bufio support for line-at-a-time I/O is cumbersome. Here
> is a proposal for an all-new design that should be easier to use,
> designed with help from Brad Fitzpatrick. This API would be an add-on,
> not a replacement for the existing ReadSlice etc.

I like this proposal in general. I think it's neat and should be easy to use,
and making it general is great too.
I have one or two comments though.

I'm not sure about the Close thing. I very often read lines from
a Reader, and using ioutil.NopCloser adds to the weight of this
common case. Also, it means that the common idiom of
putting a defer f.Close() after opening f is not so applicable
(you usually want to see line-too-long errors, but reader-close
errors are usually boring).

Having the scanner close the underlying file on error
means we can't carry on after an error, even if we
want to.

How about using Err rather than close?

s := bufio.NewScanner(os.Stdin)
for i := 1; s.Next(); i++ {
    line := s.Bytes()
    fmt.Printf("%3d\t%s\n", i, line)
}
if err := s.Err(); err != nil {
    log.Fatal(err)
}

> We could generalize a little on top of this. One easy step is to allow
> options, such as to control the maximum token length. These are done
> with a chaining API so they don't need to be in the constructor. There
> should be very few of them. If it can be done efficiently, we could
> provide:
>
> func (s *Scanner) MaxLength(length int) *Scanner // default 4k

I'm not keen on the chaining API. It means that no-one
else can implement types that satisfy the same interface
as Scanner.

Even though it might add a couple of lines to the source code,
I'd prefer to see:

func (s *Scanner) SetMaxLength(length int)

(or MaxLength as above, but without the *Scanner return)

> To allow the user to specify the token-splitting algorithm, we add a
> function option. It's not easy to use, but it won't be used much.
> Still, a word-breaking splitter, for instance, would be nice, as would
> a byte-at-a-time and rune-at-a-time scanner, and we could provide
> those in bufio itself. The function is called for each byte and has
> this signature:
>
> type SplitFunc func(char byte, atEof bool) []byte

I'm not sure about the use of a function type here.
Usually when we provide a function type, we expect
it to be side-effect free (I'm thinking in particular of the *Func
functions in strings and bytes here). Here, on the other hand,
we *require* the function to have state, otherwise it
cannot function correctly.

The other thing that concerns me here is efficiency.
Currently ReadSlice can use IndexByte to scan very quickly
through a buffer - we're making that into many function
calls here.

Given that this interface is about *splitting*, not tokenization,
how about an interface something like this?

type Splitter interface {
    // Split scans the given byte slice for a token, where the first
    // seen bytes have been inspected by a previous call to Split.
    // It returns the number of bytes it has inspected and, if a
    // token was found, a non-nil slice containing the token.
    // The length of b may only be zero when atEOF is true.
    Split(b []byte, atEOF bool, seen int) (n int, token []byte)
}

I think this interface can make it relatively straightforward and efficient
to implement common splitting idioms without needing to
maintain state, while being flexible enough to implement more
interesting splitters.

One significant limitation is that it wouldn't allow for delimiters
that are longer than the maximum token length, but I'm
not sure that's a big issue (if it is, Split could be changed to
return whether the input should be discarded).

For instance, here's the usual \r\n line splitter (utterly untested, of course):

func (lineSplitter) Split(b []byte, atEOF bool, seen int) (int, []byte) {
    n := len(b)
    if atEOF {
        if n > 0 {
            return n, b
        }
        return n, nil
    }
    i := bytes.IndexByte(b[seen:], '\n')
    if i < 0 {
        if b[n-1] == '\r' {
            // Save unresolved \r for the next call.
            n--
        }
        return n, nil
    }
    i += seen // position of the newline in b
    t := b[:i]
    if len(t) > 0 && t[len(t)-1] == '\r' {
        t = t[:len(t)-1]
    }
    return i + 1, t
}

roger peppe

Feb 14, 2013, 7:42:18 AM
to Gustavo Niemeyer, Rob Pike, golang-nuts
On 14 February 2013 09:49, Gustavo Niemeyer <gus...@niemeyer.net> wrote:
>> type SplitFunc func(char byte, atEof bool) []byte
>
> I suggest this instead:
>
> type Splitter interface {
> Split(buf []byte) [][]byte
> }
>
> The last call to Split is done with nil. This avoids both atEof and
> the bool return, and is also more friendly to splitters that need to
> take state.

This interface would make it difficult to enforce buffer size limits,
I think, and even if Split was changed to return an error,
I think it's nicer to have buffer size limits enforced at the Scanner
level rather than requiring a custom splitter.

>> to terminate the last token. A nil return means 'nothing' and can be
>> discriminated from an empty return. (Another design would be to
>
> Given we encourage people to handle nil slices as if they were empty,
> might be best to avoid purposefully differentiating a nil return from
> an empty one.

I don't mind, personally. The distinction's got to be good for something.

Ingo Oeser

Feb 14, 2013, 8:25:46 AM
to golan...@googlegroups.com
First: Great work! That really makes it simple!


On Thursday, February 14, 2013 3:36:37 AM UTC+1, Rob Pike wrote:
We add a new type, called Scanner, to capture the new functionality.
Its constructor takes an io.ReadCloser, for reasons that will become
clear. (The caller can promote a Reader to a ReadCloser using
ioutil.NopCloser.) If the argument is not already a bufio.Reader, one
is created to wrap the argument.

 Could this be done by the constructor itself, like we do for automatic buffering?
e.g.

rc, ok := r.(io.ReadCloser)
if !ok {
    rc = ioutil.NopCloser(r)
}

Or is this considered dangerous, as the caller might not expect that the Scanner closes its stream?

        func (s *Scanner) Text() string

The last name is Text not String so we don't accidentally create a
fmt.Stringer out of a Scanner.


Conforming to Stringer is actually a very useful feature. I wish every text stream analysis API would return their hold space this way. Makes usage very simple and idiomatic. Andrew presented already a good example.


Best Regards

Ingo

Gerard

Feb 14, 2013, 9:20:30 AM
to golan...@googlegroups.com
Nice initiative! FWIW here are my € 0.02

The thing I missed in the pseudo code was the delimiter string. I suppose it will be defined in the constructor like this: 

   func bufio.NewScanner(r io.ReadCloser, delim string) *Scanner

or in the Next function, like : 
   
   Next(delim string) bool      // This makes the delimiter string modifiable while reading, generating lots of possibilities (at what cost?)


Gerard

Steve McCoy

Feb 14, 2013, 9:52:10 AM
to golan...@googlegroups.com
I think this type is pretty much perfect as-is. I didn't like the Next method at first, but I realized that pushing the error-handling to Close is really nice.

Here's another alternative for SplitFunc, which would allow for them to be very simple and stateless (by virtue of putting the Scanner in charge of the state):

     type SplitFunc func(chunk []byte, atEof bool) (ok, prefix bool)

My thought is that, when scanning for splitting tokens, the Scanner will feed a chunk to the SplitFunc, and the SplitFunc will return whether the chunk matches a splitting token ("ok") and/or if it is a prefix of a splitting token. For example, say the Scanner has an internal buffer and calls SplitFunc(buffer[n:n+1], false). As long as the chunk is a prefix, the Scanner calls SplitFunc with buffer[n:n+m] ("growing" the chunk). Eventually, the SplitFunc will indicate that the chunk is a token or a non-token-non-prefix, so the Scanner moves on to the next chunk with SplitFunc(buffer[n+m:n+m+1], false) (if it isn't at EOF).

There are a few downsides: The Scanner is more complex internally, I left out several edge cases in the above example, and this is potentially *very* inefficient for long or complex splitters. I think the first two are made up for by simplifying client code. As for efficiency, I'm inclined to think that maybe this is too extreme of a simplification because it'd be very easy to make something pathological, but on the other hand, anything this would be insufficient for would probably be just as much effort to implement via a stateful SplitFunc as with the existing tools in bufio.

Nate Finch

Feb 14, 2013, 10:02:12 AM
to golan...@googlegroups.com
This sounds great.  It's very similar to the helper functions I've already written to do this.

One question - is there a reason we can't toggle whether or not to return the line ending? In gocog I use bufio's Reader.ReadLine to read and rewrite lines from the original file. If we don't have a way to retain the line ending, we lose information from the original reader. I understand that in the base case you don't want the line endings, but an option to keep them would be pretty handy for people who want to keep the original data unchanged.

I agree that the fluent API style is not a good practice for Go, as it removes the ability to implement interfaces for that API, and gains you very little. Saving one line, or being able to do construct and set in the arguments of a function call is not worth the tradeoff.

I don't like the use of Close() to get the error. It means you can't get the error without also closing the reader. What if you can handle the error and then continue? I think making the function Err() and instead letting the caller close is a better interface.

I like renaming Next() to Scan(). I think it better reflects the action being performed.

I also think making it a Stringer is just fine. Why even have these interfaces if we're not going to implement them? Unless you have some technical reason you didn't mention.


Here's your base case, modified with the above suggestions (also not printing line numbers for clarity):

func printLines(file string) {
    in, err := os.Open(file)
    if err != nil {
        log.Fatal(err)
    }
    defer in.Close() // dropping possible error on the ground for clarity
    s := bufio.NewScanner(in)
    for s.Scan() {
        fmt.Println(s)
    }
    if err := s.Err(); err != nil {
        log.Fatal(err)
    }
}

roger peppe

Feb 14, 2013, 10:44:52 AM
to Nate Finch, golang-nuts
On 14 February 2013 15:02, Nate Finch <nate....@gmail.com> wrote:
> This sounds great. It's very similar to the helper functions I've already
> written to do this.
>
> One question - is there a reason we can't toggle whether or not to return
> the line ending? In gocog I use bufio's Reader.Readline to read and rewrite
> lines from the original file. If we don't have a way to retain the line
> ending, we lose information from the original reader. I understand that in
> the base case you don't want the line endings, but an option to keep them
> would be pretty handy for people that want to keep the original data
> unchanged.

It shouldn't be hard to write a Splitter that returns lines with their
endings intact.

> I don't like the use of Close() to get the error. It means you can't get the
> error without also closing the reader. What if you can handle the error and
> then continue? I think making the function Err() and instead letting the
> caller close is a better interface.

+1

> I like renaming Next() to Scan(). I think it better reflects the action
> being performed.

-1 (I think it's worth matching the usage of database/sql)

> I also think making it a Stringer is just fine.

+1

Nate Finch

Feb 14, 2013, 11:05:13 AM
to golan...@googlegroups.com, Nate Finch
Now that I think of it, I think you're right. Using the same method signature in as many places as possible is always a good idea in Go. It makes it possible to pass them to a function that takes an (as of yet unimplemented) iterator interface:

type Iterator interface {
    Next() bool
    Err() error
}

Maybe with future support for popping these puppies in a range statement? 0:)

Michael Jones

Feb 14, 2013, 11:08:15 AM
to roger peppe, Nate Finch, golang-nuts
Rob,

Why not keep the last separating delimiter string in the object? No name in mind, but for the discussion, let's call it "Separator."

func (s *Scanner) Separator() []byte

It would almost never change, could be inquired by a method on s, and in typical cases would have this character:

        s := bufio.NewScanner(os.Stdin)

...if you call s.Separator() here you get an empty slice

        for i := 1; s.Next(); i++ {
                line := s.Bytes()

...if you call s.Separator() here you get a "\n" or "\r" or "\r\n" or whatever.

                fmt.Printf("%3d\t%s\n", i, line)
        }

...if you call s.Separator() here you get an empty slice if the file ended with no separator string, or the final "\n" or "\r" or "\r\n" or whatever.

        if err := s.Close(); err != nil {
                log.Fatal(err)
        }

...if you call s.Separator() here I think you should get an empty slice meaning a benign "no separation state remains after close."

--
Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1 650-335-5765

Kyle Lemons

Feb 14, 2013, 12:02:48 PM
to roger peppe, Rob Pike, golang-nuts
+1.  Not sure about precisely what the interface should be, but I think this looks close.  This would make it useful for more than just line and word splitting, for instance: it could be used as a general tokenizer.  Of course, that may be a non-goal.

I also agree with allowing it to be a Stringer, and with not requiring a Closer (Err is fine with me).



Kevin Gillette

Feb 14, 2013, 1:18:37 PM
to golan...@googlegroups.com
@rog: who says you can't call Close twice? I believe os.File takes steps to ensure that it's idempotent (and so it doesn't close a reopened fd), while being safe with other ReadCloser use cases I've seen.

Some contradictory opinions: Steve McCoy's proposed func seems close, but would be much faster if it returned a slice (otherwise the func would continually have to rescan the same input until it was given a whole token). The stdlib side could use address checking to determine where the token resides (and panic if the input and output slices have no overlap). I suggest:

func (data []byte, prevState int) (token []byte, state int)

Here, the func should scan data for at most one token. state is user defined, except for zero, which means the token return value, if not nil, represents a complete token. This allows the function to continue processing a token without nesting the previous data. If the returned state is negative, it indicates an invalid token (the contents of which should be represented by the returned token value); the abs of the negative state will be stuffed with the full token (over however many calls had consecutively returned a state > 0) into an error; the caller of Close can type assert to use the error state to lookup a useful error message.

Though for simple cases, what's wrong with the func type that bytes.FieldsFunc takes, or something similar, like `func (byte) bool`? This is less flexible and less efficient, yet sufficient for many cases, and an adaptor that wraps this into the de facto func type would be useful.

Kamil Kisiel

Feb 14, 2013, 1:27:31 PM
to golan...@googlegroups.com
I disagree there. Comparing to the API of existing packages like database/sql that's backwards.

Kyle Lemons

Feb 14, 2013, 1:41:18 PM
to Kevin Gillette, golang-nuts
On Thu, Feb 14, 2013 at 10:18 AM, Kevin Gillette <extempor...@gmail.com> wrote:
@rog: who says you can't call Close twice? I believe os.File takes steps to ensure that its idempotent (and so it didn't close a reopened fd), while being safe with other ReadCloser use cases I've seen.

Some contradictory opinions: Steve McCoy's proposed func seems close, but would be much faster if it returned a slice (otherwise the func would continually have to rescan the same input until it was given a whole token). The stdlib side could use address checking to determine where the token resides (and panic if the input and output slices have no overlap). I suggest:

func (data []byte, prevState int) (token []byte, state int)

Here, the func should scan data for at most one token. state is user defined, except for zero, which means the token return value, if not nil, represents a complete token. This allows the function to continue processing a token without nesting the previous data. If the returned state is negative, it indicates an invalid token (the contents of which should be represented by the returned token value); the abs of the negative state will be stuffed with the full token (over however many calls had consecutively returned a state > 0) into an error; the caller of Close can type assert to use the error state to lookup a useful error message.

Eep.  That seems complicated and easy to get wrong.  Explicitly stating what the ints mean (how many bytes have been considered by the function) is quite clear; any other state can be done with a closure.
 
Though for simple cases, what's wrong with the func type that bytes.FieldsFunc takes, or something similar, like `func (byte) bool`. This is less flexible and less efficient, yet sufficient for many cases, and an adaptor that wraps this into the defacto func type would be useful.

roger peppe

Feb 14, 2013, 1:43:58 PM
to Kevin Gillette, golang-nuts
On 14 February 2013 18:18, Kevin Gillette <extempor...@gmail.com> wrote:
> @rog: who says you can't call Close twice? I believe os.File takes steps to ensure that its idempotent (and so it didn't close a reopened fd), while being safe with other ReadCloser use cases I've seen.

yeah, you can, but i'm not sure it's defined to be safe always.

I just don't think that conflating Close and Err is that helpful
(and it might be actively unhelpful if you want to pass a
bufio.Reader to the scanner - currently bufio.NewReader checks
whether its argument is already a bufio.Reader and doesn't add
another layer if so, but forcing it to be an ioutil.NopCloser
would prevent this).

> Some contradictory opinions: Steve McCoy's proposed func seems close, but would be much faster if it returned a slice (otherwise the func would continually have to rescan the same input until it was given a whole token). The stdlib side could use address checking to determine where the token resides (and panic if the input and output slices have no overlap). I suggest:
>
> func (data []byte, prevState int) (token []byte, state int)
>
> Here, the func should scan data for at most one token. state is user defined, except for zero, which means the token return value, if not nil, represents a complete token. This allows the function to continue processing a token without nesting the previous data. If the returned state is negative, it indicates an invalid token (the contents of which should be represented by the returned token value); the abs of the negative state will be stuffed with the full token (over however many calls had consecutively returned a state > 0) into an error; the caller of Close can type assert to use the error state to lookup a useful error message.

Assume the data argument contains several lines. How does the above function
indicate where the token finished?

roger peppe

unread,
Feb 14, 2013, 1:44:40 PM2/14/13
to Kyle Lemons, Kevin Gillette, golang-nuts
On 14 February 2013 18:41, Kyle Lemons <kev...@google.com> wrote:
>> Here, the func should scan data for at most one token. state is user
>> defined, except for zero, which means the token return value, if not nil,
>> represents a complete token. This allows the function to continue processing
>> a token without nesting the previous data. If the returned state is
>> negative, it indicates an invalid token (the contents of which should be
>> represented by the returned token value); the abs of the negative state will
>> be stuffed with the full token (over however many calls had consecutively
>> returned a state > 0) into an error; the caller of Close can type assert to
>> use the error state to lookup a useful error message.
>
>
> Eep. That seems complicated and easy to get wrong. Explicitly stating what
> the ints mean (how many bytes have been considered by the function) is quite
> clear; any other state can be done with a closure.

+1

roger peppe

unread,
Feb 14, 2013, 1:46:57 PM2/14/13
to Kyle Lemons, Rob Pike, golang-nuts
On 14 February 2013 17:02, Kyle Lemons <kev...@google.com> wrote:
> +1. Not sure about precisely what the interface should be, but I think this
> looks close. This would make it useful for more than just line and word
> splitting, for instance: it could be used as a general tokenizer. Of
> course, that may be a non-goal.

yeah. my first thought was to make it return indexes into the passed
slice, but returning a slice was neater (and, as you point out, allows
more generality).

Kevin Gillette

unread,
Feb 14, 2013, 2:06:15 PM2/14/13
to golan...@googlegroups.com
@rog: the token return val would only contain the first token. The next call to that func would pass the remaining buffer (starting on the first byte after the token's last byte), similar to the logic used with Read or Write in a loop, except instead of n, addresses would be used by bufio to determine the offset.

Kyle Lemons

unread,
Feb 14, 2013, 2:12:41 PM2/14/13
to Kevin Gillette, golang-nuts
On Thu, Feb 14, 2013 at 11:06 AM, Kevin Gillette <extempor...@gmail.com> wrote:
@rog: the token return val would only contain the first token. The next call to that func would pass the remaining buffer (starting on the first byte after the token's last byte), similar to the logic used with Read or Write in a loop, except instead of n, addresses would be used by bufio to determine the offset.

That is very strange.  Keep in mind, there is no pointer arithmetic in Go.  That would be difficult (if not impossible) to implement without unsafe.

Kevin Gillette

unread,
Feb 14, 2013, 4:16:41 PM2/14/13
to golan...@googlegroups.com, Kevin Gillette
It shouldn't be problematic in that respect -- there wouldn't be any pointer arithmetic, since it's similar to the (safe) approach used to determine if two slices overlap:

token, newState := thefunc(data, prevState)
offset := uintptr(&token[0]) - uintptr(&data[0])
// now do slicing or other (safe) determinations with offset -- it _won't_ be converted into an unsafe.Pointer

A signature like this would support just about any kind of typical lexer, such as a (comparatively inefficient) tokenizer for the Go language. The func used in bytes.FieldsFunc isn't as powerful, since it couldn't distinguish between `:=` and `=:` without unclean use of closure-wrapped state (which also makes assumptions about the algorithm that calls the func).

Rob Pike

unread,
Feb 14, 2013, 4:28:56 PM2/14/13
to golan...@googlegroups.com
Updated proposal. Changes, comments, and counterarguments to proposed
changes precede it here. Thanks for all your input; it was very
helpful.

The argument to the constructor is now an io.Reader, and Close closes
the scanner but not the underlying Reader.

NewScanner is otherwise unchanged. The design should be simple; no
extra folderol in the constructor please.

The method to recover the string-valued token is still Text, not
String. Making it String, and hence a fmt.Stringer, feels like a
misuse of the purpose of both the Scanner and the fmt interface. If
Scanner is to have a String method, it should be about the Scanner,
not a magic trick to access temporary state just to avoid one method
call. I also like that making the string accessor explicit always
means the user must think about whether to pay the price for the
allocation the operation requires.

There is still no way to grab the space between the tokens. The
purpose of this design is to be simple, to do the common case by
default and easily. If you want more control, use the old methods. A
similar argument applies to things like defining line endings.

The maximum token length has been enlarged to 64kB. I might even make
it 1MB. If it's long enough, the need to continue the discussion about
continuing after long lines becomes vanishingly small. See the
previous paragraph.

I left the chaining design in place. It's cheap and helpful for
initialization but is not necessary: one may always use a separate
line for the option setup if desired. This is a trivial decision to
reverse, of course, and it's not set in stone yet.

The Split function interface is all new. It was the least
thought-through part of the previous iteration, and may still require
refinement as implementation proceeds, but the design in this round
seems pretty good to me.

---

The existing bufio support for line-at-a-time I/O is cumbersome. Here
is a proposal for an all-new design that should be easier to use,
designed with help from Brad Fitzpatrick. This API would be an add-on,
not a replacement for the existing ReadSlice etc.

It's part of the bufio package, not being worth another package. In
any case it can use some existing bufio internals to provide an
efficient implementation.

We add a new type, called Scanner, that is used to capture the new
functionality. Its constructor takes an io.Reader. If the argument is
not already a bufio.Reader, one is created to wrap the argument.

This gives us:

package bufio

type Scanner struct { /* hidden */ }

func NewScanner(r io.Reader) *Scanner

The model for the scanner is to "tokenize" the input into text to be
processed, separated by delimiters that are discarded. In the default
case, this means lines of text separated by `\r?\n`. It is not
possible in this design to discover whether, for instance, the last
line of the input ends with a newline. This is OK; the point of this
API is to make I/O easier and discovering such details about the input
complicates existing designs.

To scan the input, use the Next method as the loop condition, the
Bytes or Text methods as the "getters", and Close at the end. Here are
the method signatures:

func (s *Scanner) Next() bool

func (s *Scanner) Close() error

func (s *Scanner) Bytes() []byte // Does not copy; data is volatile.

func (s *Scanner) Text() string

The last name is Text, not String, so we don't accidentally create a
fmt.Stringer out of a Scanner. Some have suggested doing that anyway,
but making String be an accessor rather than a formatter is an abuse
of the model.

Close does not close the Reader (it can't; the argument is not a
ReadCloser); it just shuts down the scanning operation, terminates the
scan, and reports any accumulated error. Because of the internal use
of bufio, in general there can be no guarantee that, for early calls to
Close, all data after the last returned token is available to be read
afterwards.

I/O works by calling Next to load the next "token". It returns false
at EOF or error. Close() returns:

nil if there was no error; or
nil if the only I/O error was EOF; or
whatever error from the Reader caused the scan to stop; or
whatever scan error caused the scan to stop, such as line-too-long

Here is code to print a file line-by-line, with line numbers:

s := bufio.NewScanner(os.Stdin)
for i := 1; s.Next(); i++ {
        line := s.Text()
        fmt.Printf("%3d\t%s\n", i, line)
}
if err := s.Close(); err != nil {
        log.Fatal(err)
}

This is the basic outline; it seems clean and easy to use. The use of
Close to report error (thanks, Brad) is the key insight to having the
code be simple.

We could generalize a little on top of this. One easy step is to allow
options, such as to control the maximum token length. These are done
with a chaining API so they don't need to be in the constructor and
work well in initialization expressions. There should be very few of
them.

func (s *Scanner) MaxLength(length int) *Scanner // default 64kB or maybe larger; anyway pretty big

To allow the user to specify the token-splitting algorithm, we add a
function option. It's not easy to use, but it won't be used much.
Still, a word-breaking splitter, for instance, would be nice, as would
a byte-at-a-time and rune-at-a-time scanner, and we could provide those
in bufio itself. The function is called by Next and has this
signature:

type SplitFunc func(data []byte, EOF bool) (advance int, token []byte)

The EOF argument is true only at EOF, giving the function a chance to
terminate the last token.

The incoming data is a slice of unconsumed data. Each call to
SplitFunc occurs at the previous location, plus the returned 'advance'
value from the previous call. Thus by returning advance==0, SplitFunc
can ask the Scanner to accumulate data until there is a full token to
return. If the required storage becomes too large while accumulating,
the Scanner will terminate with a line-too-long error. Once a token is
delivered, SplitFunc would typically return advance=len(token) plus
perhaps len(separator).
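As a concreteness check, the default `\r?\n` splitter might look like this under the proposed contract (a sketch; scanLines is our name, not part of the proposal):

```go
package main

import (
	"bytes"
	"fmt"
)

// scanLines is a sketch of the default splitter under the proposed
// contract: advance==0 asks the Scanner to accumulate more data, and
// the returned token excludes the `\r?\n` separator.
func scanLines(data []byte, atEOF bool) (advance int, token []byte) {
	if i := bytes.IndexByte(data, '\n'); i >= 0 {
		// Full line found: advance past the '\n', drop a final '\r'.
		return i + 1, bytes.TrimSuffix(data[:i], []byte("\r"))
	}
	if atEOF && len(data) > 0 {
		// Final line without a trailing newline.
		return len(data), data
	}
	// Request more data.
	return 0, nil
}

func main() {
	adv, tok := scanLines([]byte("one\r\ntwo\n"), false)
	fmt.Printf("%q %d\n", tok, adv) // "one" 5
	adv, tok = scanLines([]byte("partial"), false)
	fmt.Printf("%q %d\n", tok, adv) // "" 0
}
```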

The token returned by SplitFunc is the next token to deliver to the
client; there is no requirement that it correspond to any actual input
data. For instance, it might be upper-cased or lower-cased or
something completely arbitrary. A nil token signals to return nothing
to the client yet.

We set up a custom splitter with an option method:

func (s *Scanner) Split(SplitFunc) *Scanner // default: split on line breaks of the form `\r?\n`

For example, if we provided a rune splitter in the package, you'd scan
runes like this:

s := bufio.NewScanner(os.Stdin).Split(bufio.SplitRune)
for s.Next() {
        fmt.Printf("rune: %s\n", s.Bytes())
}
if err := s.Close(); err != nil {
        log.Fatal(err)
}

Comments welcome.

jimmy frasche

unread,
Feb 14, 2013, 4:43:05 PM2/14/13
to Rob Pike, golang-nuts
Doesn't the argument that Scanner not be a fmt.Stringer also apply to
Scanner not being an io.Closer, if the Scanner doesn't actually Close
the reader? That seems like it would be a common source of error.
Maybe rename Close to Stop?

Nigel Tao

unread,
Feb 14, 2013, 6:15:56 PM2/14/13
to Gustavo Niemeyer, Rob Pike, golang-nuts
On Thu, Feb 14, 2013 at 8:49 PM, Gustavo Niemeyer <gus...@niemeyer.net> wrote:
> On Thu, Feb 14, 2013 at 12:36 AM, Rob Pike <r...@golang.org> wrote:
>> This is the basic outline; it seems clean and easy to use. The
>> deferral of error until Close (thanks, Brad) is the key insight to
>> having the code be simple.
>
> That's the design of mgo iterators, except there's an explicit method
> iter.Err() instead.

FWIW, leveldb-go's db.Iterator works like Rob's proposal (Close
returns the accumulated error).
https://code.google.com/p/leveldb-go/source/browse/leveldb/db/db.go#53

Rob Pike

unread,
Feb 14, 2013, 6:49:58 PM2/14/13
to jimmy frasche, golang-nuts
The io.Closer interface comes up very rarely and such an error is
unlikely. I'm not worried about it and, as Nigel says, there is
precedent for Close.

The fmt.Stringer argument was being made because people wanted to
encourage the use of Scanner as a Stringer, whereas I want to prevent
it.

-rob

Kyle Lemons

unread,
Feb 14, 2013, 7:16:55 PM2/14/13
to Rob Pike, golang-nuts
Cool
 
The EOF argument is true only at EOF, giving the function a chance to
terminate the last token.

The incoming data is a slice of unconsumed data. Each call to
SplitFunc occurs at the previous location, plus the returned 'advance'
value from the previous call. Thus by returning advance==0, SplitFunc
can ask the Scanner to accumulate data until there is a full token to
return. If the required storage becomes too large while accumulating,
the Scanner will terminate with a line-too-long error. Once a token is
delivered, SplitFunc would typically return advance=len(token) plus
perhaps len(separator).

The token returned by SplitFunc is the next token to deliver to the
client; there is no requirement that it correspond to any actual input
data.

Ooo, shiny.
 
For instance, it might be upper-cased or lower-cased or
something completely arbitrary. A nil token signals to return nothing
to the client yet.

We set up a custom splitter with an option method:

        func (s *Scanner) Split(SplitFunc) *Scanner // default: split on line
breaks of the form `\r?\n`

For example, if we provided a rune splitter in the package, you'd scan
runes like this:

        s := bufio.NewScanner(io.Stdin).Split(bufio.SplitRune)
        for s.Next() {
                fmt.Printf("rune: %s\n", s.Bytes())
        }
        if err := s.Close(); err != nil {
                log.Fatal(err)
        }

Comments welcome.

LGTM.  While I would've chosen the opposite for String/Text and Close/Err, I like this API and I think it is a dramatic improvement.

Dan Cross

unread,
Feb 14, 2013, 7:23:23 PM2/14/13
to jimmy frasche, Rob Pike, golang-nuts
On Thu, Feb 14, 2013 at 4:43 PM, jimmy frasche <soapbo...@gmail.com> wrote:
[...]
Maybe rename Close Stop?

I agree, despite the counterarguments and precedent (and in a database context, 'Close' has double precedent from cursors).

This whole design seems clean, simple and elegant, but I don't like the names: I prefer the suggestions of 'Scan' instead of 'Next' and 'Stop' instead of 'Close'.

My justification is that these better match the semantics of what's actually happening.  Next isn't returning a value, but rather an indicator of whether it did anything.  Instead, just refer to the thing that was actually done (the scanning).  Similarly, 'Close' doesn't close anything (what does it mean to close a scanner?), whereas 'Stop' is stating explicitly that one is stopping the Scanner.

        - Dan C.


Nate Finch

unread,
Feb 14, 2013, 8:59:45 PM2/14/13
to golan...@googlegroups.com
I hate the chaining API. It's a cheap trick just to avoid writing one line of code, which to me is a very un-Go-like goal. MaxLength does not return a new Scanner, so it shouldn't return *Scanner; it's an unnecessary complication of the interface. I haven't found anything else in the standard library that does that, and I don't think starting now is a good idea.

Also, this may have been assumed, but it wasn't explicitly stated - you should export the default SplitFunc for splitting on '\r?\n'.

Kyle Lemons

unread,
Feb 14, 2013, 9:15:49 PM2/14/13
to Nate Finch, golang-nuts
On Thu, Feb 14, 2013 at 5:59 PM, Nate Finch <nate....@gmail.com> wrote:
I hate the chaining API. It's a cheap trick just to avoid writing one line of code, which to me is a very un-Go-like goal. MaxLength does not return a new Scanner, so it shouldn't return *Scanner; it's an unnecessary complication of the interface. I haven't found anything else in the standard library that does that, and I don't think starting now is a good idea.

I don't really like chaining either, but it is used in e.g. the template libraries.  template.New("blah").Funcs(funcmap).Parse("text") etc, so it does have precedent.
 
Also, this may have been assumed, but it wasn't explicitly stated - you should export the default SplitFunc for splitting on '\r?\n'.

On Thursday, February 14, 2013 4:28:56 PM UTC-5, Rob Pike wrote:
I left the chaining design in place. It's cheap and helpful for
initialization but is not necessary: one may always use a separate
line for the option setup if desired. This is a trivial decision to
reverse, of course, and it's not set in stone yet.

Nate Finch

unread,
Feb 14, 2013, 9:24:03 PM2/14/13
to golan...@googlegroups.com
I figured someone else would prove me wrong. I scanned through likely packages to see if I'd forgotten anything, and missed those ones.  I've even used them, though not recently, pleh.  I'll blame it on lack of sleep :)  I still don't like it, even there :)  

On Thursday, February 14, 2013 9:18:26 PM UTC-5, gary b wrote:
On Thursday, February 14, 2013 5:59:45 PM UTC-8, Nate Finch wrote:
 I haven't found anything else in the standard library that does that,

The text/template and html/template packages use the chaining design (see Funcs and Delims methods on Template type).

Steve McCoy

unread,
Feb 14, 2013, 9:34:28 PM2/14/13
to golan...@googlegroups.com
On Thursday, February 14, 2013 4:28:56 PM UTC-5, Rob Pike wrote:
Updated proposal.

Looks great to me and — despite your warning — this version of Split seems very easy to use.

roger peppe

unread,
Feb 15, 2013, 3:38:11 AM2/15/13
to gary b, golang-nuts
On 15 February 2013 02:18, gary b <gary...@gmail.com> wrote:
> On Thursday, February 14, 2013 5:59:45 PM UTC-8, Nate Finch wrote:
>>
>> I haven't found anything else in the standard library that does that,
>
>
> The text/template and html/template packages use the chaining design (see
> Funcs and Delims methods on Template type).

That's true, and it's a right pain (and the reason I suggested it
might not be a good idea for the splitter interface) - it means you can't have
a single interface that satisfies both the Template in text/template
and in html/template.

roger peppe

unread,
Feb 15, 2013, 4:52:14 AM2/15/13
to Rob Pike, golang-nuts
On 14 February 2013 21:28, Rob Pike <r...@golang.org> wrote:
> Close does not close the Reader (it can't; the argument is not a
> ReadCloser); it just shuts down the scanning operation, terminates the
> scan, and reports any accumulated error. Because of the internal use
> of bufio, in general there can no guarantee that, for early calls to
> Close, all data after the last returned token is available to be read
> afterwards.

Presumably if the reader was already a bufio.Reader, we *could*
make that guarantee, and it might be useful to do so.

> To allow the user to specify the token-splitting algorithm, we add a
> function option. It's not easy to use, but it won't be used much.
> Still, a word-breaking splitter, for instance, would be nice, as would
> a byte-at-atime and rune-at-a-time scanner, and we could provide those
> in bufio itself. The function is called by Next and has this
> signature:
>
> type SplitFunc func(data []byte, EOF bool) (advance int, token []byte)
>
> The EOF argument is true only at EOF, giving the function a chance to
> terminate the last token.
>
> The incoming data is a slice of unconsumed data. Each call to
> SplitFunc occurs at the previous location, plus the returned 'advance'
> value from the previous call. Thus by returning advance==0, SplitFunc
> can ask the Scanner to accumulate data until there is a full token to
> return. If the required storage becomes too large while accumulating,
> the Scanner will terminate with a line-too-long error. Once a token is
> delivered, SplitFunc would typically return advance=len(token) plus
> perhaps len(separator).

This seems fine. One small thing for consideration though: by
making the existence of a token predicated on advance>0,
we preclude the possibility of zero-length tokens with no delimiter.
For example, we couldn't have a splitter similar to strings.Split,
because the last (or the first) token has no delimiter and may be empty.
On the other hand, it has the nice property that the splitter is bound
to make progress regardless of what the split func returns.
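To make the limitation concrete, here is a hypothetical comma splitter in this style: an empty field followed by a delimiter can be returned, but a trailing empty field at EOF cannot, because advance==0 with a nil token already means "accumulate more data".

```go
package main

import (
	"bytes"
	"fmt"
)

// scanCommas splits on ',' under the proposed contract. Because
// advance==0 is reserved for "accumulate more data", an empty token
// with no following delimiter (e.g. after a trailing comma at EOF)
// cannot be expressed: we must return advance==0 and token==nil,
// and the empty final field is silently lost.
func scanCommas(data []byte, atEOF bool) (advance int, token []byte) {
	if i := bytes.IndexByte(data, ','); i >= 0 {
		return i + 1, data[:i] // may be empty, but has a delimiter
	}
	if atEOF && len(data) > 0 {
		return len(data), data
	}
	return 0, nil // at EOF with no data left: no way to emit ""
}

func main() {
	// "a,b," at EOF: tokens "a" and "b" are delivered, but the final
	// empty field that strings.Split would report is dropped.
	data := []byte("a,b,")
	for len(data) > 0 {
		adv, tok := scanCommas(data, true)
		if adv == 0 {
			break
		}
		fmt.Printf("%q\n", tok)
		data = data[adv:]
	}
}
```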

> The token returned by SplitFunc is the next token to deliver to the
> client; there is no requirement that it correspond to any actual input
> data. For instance, it might be upper-cased or lower-cased or
> something completely arbitrary. A nil token signals to return nothing
> to the client yet.
>
> We set up a custom splitter with an option method:
>
> func (s *Scanner) Split(SplitFunc) *Scanner // default: split on line
> breaks of the form `\r?\n`
>
> For example, if we provided a rune splitter in the package, you'd scan
> runes like this:
>
> s := bufio.NewScanner(io.Stdin).Split(bufio.SplitRune)

I'm still a little uncomfortable with using a function type for the
splitter. It's fine for splitting runes (there's no efficiency to be
gained from scanning where the last call left off, so we
can use a static function), but in general one would need to create
a new function for each new scanner, because it's necessary to
have closure state.

For example, if we provide a splitter to split on \n only,
it would probably be something like:

func NewlineSplitter() SplitFunc {
        seen := 0
        return func(b []byte, atEOF bool) (int, []byte) {
                i := bytes.IndexByte(b[seen:], '\n')
                if i < 0 {
                        seen = len(b)
                        return 0, nil
                }
                // etc
        }
}

That looks quite like a New function, and in general
I think it's nicer to be carrying state around in values
rather than closures.

That said, it'll work fine either way.
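The value-carrying alternative might look like this (a sketch; the type and method names are ours): the position state lives in a struct, and the method value s.split has the right signature to pass wherever a plain split func is expected.

```go
package main

import (
	"bytes"
	"fmt"
)

// newlineSplitter carries its scan position in a value rather than a
// closure, per the suggestion above. Its split method has the proposed
// splitter signature, so s.split can be passed anywhere a plain func
// is expected.
type newlineSplitter struct {
	seen int // bytes already examined on previous calls
}

func (s *newlineSplitter) split(data []byte, atEOF bool) (int, []byte) {
	if i := bytes.IndexByte(data[s.seen:], '\n'); i >= 0 {
		tokenEnd := s.seen + i
		s.seen = 0 // reset for the next token
		return tokenEnd + 1, data[:tokenEnd]
	}
	if atEOF && len(data) > 0 {
		s.seen = 0
		return len(data), data
	}
	s.seen = len(data) // avoid re-scanning these bytes next time
	return 0, nil
}

func main() {
	s := &newlineSplitter{}
	var split func([]byte, bool) (int, []byte) = s.split // method value
	adv, tok := split([]byte("hi\n"), false)
	fmt.Printf("%q %d\n", tok, adv) // "hi" 3
}
```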

Luis Alfonso Vega Garcia

unread,
Feb 15, 2013, 8:33:23 AM2/15/13
to Rob Pike, golang-nuts
Cool, a (simple) Scanner has been an important missing feature for Go.
I'm looking forward to using it.

>We set up a custom splitter with an option method:
> func (s *Scanner) Split(SplitFunc) *Scanner // default: split
on line breaks of the form `\r?\n`

I wonder what would be the implementation of the SplitFunc for the mac
case ('\r').
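For what it's worth, a classic-Mac ('\r'-terminated) splitter would be nearly identical to the default one; here is a sketch under the proposed signature (the name is ours):

```go
package main

import (
	"bytes"
	"fmt"
)

// scanCRLines is a sketch of a splitter for classic-Mac line endings,
// where lines are terminated by a bare '\r'. It follows the proposed
// SplitFunc shape: advance==0 asks for more data.
func scanCRLines(data []byte, atEOF bool) (advance int, token []byte) {
	if i := bytes.IndexByte(data, '\r'); i >= 0 {
		return i + 1, data[:i]
	}
	if atEOF && len(data) > 0 {
		return len(data), data
	}
	return 0, nil
}

func main() {
	adv, tok := scanCRLines([]byte("one\rtwo\r"), false)
	fmt.Printf("%q %d\n", tok, adv) // "one" 4
}
```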


-- Alfonso

Alfonso Vega-Garcia | Software Engineer | vegacom at gmail.com

Michael Jones

unread,
Feb 15, 2013, 11:09:17 AM2/15/13
to Luis Alfonso Vega Garcia, Rob Pike, golang-nuts
Luis (and Rob),

Why not change the default from Rob's:

    In the default case, this means lines of text separated by `\r?\n`.

to be "[end of line stuff]" as in:

   (\r\n?) | (\n)

Brad Fitzpatrick

unread,
Feb 15, 2013, 11:53:02 AM2/15/13
to Rob Pike, golan...@googlegroups.com
(*Scanner).Close in the updated proposal is a weird name, considering there are no resources to free.   If there isn't a ReadCloser to close, naming it Close makes it sound like FreeSomeBuffersAndBreakUpSomeGarbageMaybe().

There's precedent for naming it Err() error instead; database/sql's *Rows does the same thing:

// Err returns the error, if any, that was encountered during iteration.
func (rs *Rows) Err() error {
        if rs.lasterr == io.EOF {
                return nil
        }
        return rs.lasterr
}



On Thu, Feb 14, 2013 at 1:28 PM, Rob Pike <r...@golang.org> wrote:

Rob Pike

unread,
Feb 15, 2013, 12:29:53 PM2/15/13
to Brad Fitzpatrick, golan...@googlegroups.com
I'm leaning towards Scan and Stop as the methods at this point. I'm
close to blowing the whistle on the bikeshedding session.

I don't like Err or Error because they suggest something about the
error interface. The point is, in fact, to shut down the scanner and
terminate the scan; the error is a side effect, not the driver, and
calling it Err for instance will encourage lazy users to skip that
stage. That's partly why I liked Close, but now think that Close
indicates closure of the underlying resource. Stop indeed makes sense.
When you're done, you Stop. "When you're done, you Err" doesn't sound
right.

I'm going to write some code.

-rob

Brad Fitzpatrick

unread,
Feb 15, 2013, 12:45:06 PM2/15/13
to Rob Pike, golan...@googlegroups.com
On Fri, Feb 15, 2013 at 9:29 AM, Rob Pike <r...@golang.org> wrote:
I'm leaning towards Scan and Stop as the methods at this point. I'm
close to blowing the whistle on the bikeshedding session.

I don't like Err or Error because they suggest something about the
error interface.

What do they suggest?  Error(), sure, makes it look like an error interface.

But Err is used in both:

$ grep Err\(\) api/go1.txt
pkg database/sql, method (*Rows) Err() error
pkg go/scanner, method (ErrorList) Err() error

And in compress/bzip2's private bitReader.

And 7 places inside Google's codebase.

If that's not the convention for a sticky error accessor, what should it be?  "GetLastError"?
 
The point is, in fact, to shut down the scanner and
terminate the scan; the error is a side effect, not the driver, and
calling it Err for instance will encourage lazy users to skip that
stage. That's partly why I liked Close, but now think that Close
indicates closure of the underlying resource. Stop indeed makes sense.
When you're done, you Stop. "When you're done, you Err" doesn't sound
right.

What is there to Stop?  What needs to be freed and shut down?  The state and buffers are garbage-collected.

I don't Stop reading from an io.Reader if I'm done early before io.EOF.

What else do I Stop?

The only thing I see to Stop is a *time.Ticker.  But that makes sense:  if I don't stop it, it keeps going.  A bufio.Scanner has no inertia.

I'm going to write some code.

Okay.  I can bring this up again later during code review.

Kevin Gillette

unread,
Feb 15, 2013, 5:26:21 PM2/15/13
to golan...@googlegroups.com, Brad Fitzpatrick
I agree on blowing that whistle -- I'd have been happy if your initial proposal had used any of the suggested names, and wouldn't have put much thought towards that. The nomenclature may be a learning/readability concern, but it certainly has no effect on practical ease-of-use or flexibility; those are the issues I'm concerned with.

I'm still interested in more details on SplitFunc: I believe it'd still be useful to have some kind of error signalling. Just as filepath.Walk can break out early with an error, the interface here is close enough to supporting general tokenizing tasks (which need to be able to signal syntax errors and such) that error signalling is an important consideration. I'm not making a specific suggestion, and indeed if the expected use cases are such that closures or panics could satisfy this need, then so be it.

Regarding chaining vs text/template: I see the usefulness of chaining in the template case, since it's often desirable to fully initialize global variables with Template values -- I don't see the usefulness of fully initializing a global with a bufio.Scanner. In this case, I think a cleaner option would be the exp/cookiejar approach, along the lines of:

NewScanner(io.Reader, *ScannerCfg) *Scanner

Where the cfg arg, a struct pointer, can be nil to indicate defaults. Alternatively, that signature could be a separate function perhaps called NewScannerConfig, alongside Rob's proposed NewScanner. Importantly, if options continually get added, there won't be a large chaining method set to maintain -- it'd instead consist of just adding extra fields to the config struct type.

I think 32kb is a reasonable default buffer size since, one way or another, that size will be configurable on a per-scanner basis. I, for one, would want to use this for parsing files which wouldn't even be expected to total 32kb in size (much less have lines that long). There remain cases where lines can be arbitrarily, extremely long (such that the "human readable" concept of a "line" long since ceased to apply), especially with files whose typical production contents are not intended to be managed or viewed by humans except in dire circumstances.

With due deference and respect to kortschak, I doubt that 1mb, or any fixed buffer size, would be sufficient to handle some of those line-as-record cases which can contain arbitrary internal data. But in those cases as well, if I chose bufio.Scanner as the solution, I'd write a custom SplitFunc to handle low-level tokenization and also include the newline itself as a separate token, rather than scan line-at-a-time and then subprocess each line (bufio.Reader's ReadLine as is would excel particularly well at doing that anyway).

My suggestion is that the default be whatever buffer size is reasonable and efficient for throwaway tasks with an emphasis on expected home-brew formats, since complex formats (or formats designed for machine efficiency) are more likely to use a custom scanner, or may even be forced to use a custom scanner (if processing is truly line oriented yet lines may be of any length). By not setting it too high, it can also serve as an early canary for some tasks that bufio.Scanner may not be appropriate -- I myself have worked with formats where the common case has lines that fit on an 80 character display, but with corner cases that can be hundreds of kilobytes to tens of megs long -- exceeding a 32 or 64kb buffer may occur after a couple days of processing production data, while it may take 6 months to come across a case that will exceed a 1mb buffer.

Dan Kortschak

unread,
Feb 15, 2013, 6:07:11 PM2/15/13
to Kevin Gillette, golan...@googlegroups.com
Yes, that's what I currently do for FASTA files - that is, use ReadLine. I suspect that the advent of the proposed Scanner will not change my existing codebase.

It occurred to me that I had already written a Scanner type, though for the specialised purpose of conditioning ragel state variables; this again would not change, as it needs control of its buffers.

Tarmigan

unread,
Feb 17, 2013, 12:43:49 AM2/17/13
to roger peppe, Rob Pike, golang-nuts
On Fri, Feb 15, 2013 at 1:52 AM, roger peppe <rogp...@gmail.com> wrote:
> I'm still a little uncomfortable with using a function type for the
> splitter. It's fine for splitting runes (there's no efficiency to be
> gained from scanning where the last call left off, so we
> can use a static function), but in general one would need to create
> a new function for each new scanner, because it's necessary to
> have closure state.
>
> For example, if we provide a splitter to split on \n only,
> it would probably be something like:
>
> func NewlineSplitter() SplitFunc {
> seen := 0
> return func(b []byte, atEOF bool) (int, []byte) {
> i := bytes.IndexByte(b[seen:], '\n')
> if i < 0 {
> seen = len(b)
> return 0, nil
> }
> etc
> }
> }
>
> That looks quite like a New function, and in general
> I think it's nicer to be carrying state around in values
> rather than closures.
>
> That said, it'll work fine either way.

My initial reading was that b would advance one byte per call, so I
think you could just check the last byte?

func NewlineSplitter(b []byte, atEOF bool) (int, []byte) {
        if atEOF {
                return len(b), b
        }
        if b[len(b)-1] == '\n' {
                if len(b) == 1 {
                        return 1, nil
                }
                return len(b), b[:len(b)-1]
        }
        return 0, nil
}

But rereading the proposal, it looks unspecified whether the slice
could advance by more than one byte per call.

-Tarmigan

Tamás Gulácsi

unread,
Apr 1, 2013, 8:40:09 AM4/1/13
to golan...@googlegroups.com, Brad Fitzpatrick
Hi, 

I just wanted to thank everyone for this nice and usable solution - I'm using it to write converting readers (stripping =\r?\n, converting unheard-of semi-base64 quoted-printable encodings in mails), and it is easy, simple and performant (at least memory-wise, compared to ReadFull + bytes.Replace).

Thanks again!
Tamás Gulácsi

Tamás Gulácsi

unread,
Apr 1, 2013, 3:35:37 PM4/1/13
to golan...@googlegroups.com, Brad Fitzpatrick
On Monday, April 1, 2013 at 2:40:09 PM UTC+2, Tamás Gulácsi wrote:
Hi, 

I just wanted to thank everyone for this nice and usable solution - I'm using it to write converting readers (stripping =\r?\n, converting unheard-of semi-base64 quoted-printable encodings in mails), and it is easy, simple and performant (at least memory-wise, compared to ReadFull + bytes.Replace).

Thanks again!
Tamás Gulácsi

And here is a Scanner -> Reader "converter" (just joins the tokens returned by the scanner):

type ScannerReader struct {
	s    *bufio.Scanner
	part []byte // unread remainder of the last token
}

// NewScannerReader turns a bufio.Scanner into an io.Reader.
// Note that the separators consumed by the scanner are not restored;
// the tokens are simply concatenated.
func NewScannerReader(s *bufio.Scanner) io.Reader {
	return &ScannerReader{s: s}
}

// Read implements io.Reader: it reads at most len(p) bytes into p and
// returns the number of bytes read and/or the error encountered.
func (sr *ScannerReader) Read(p []byte) (n int, err error) {
	if len(sr.part) > 0 {
		// Serve the leftover from the previous call first. This is
		// safe even though Bytes() is volatile: the scanner's buffer
		// is only invalidated by the next Scan, which we haven't
		// called yet.
		n = copy(p, sr.part)
		sr.part = sr.part[n:]
		return n, nil
	}
	for n < len(p) {
		if !sr.s.Scan() {
			if err = sr.s.Err(); err == nil {
				err = io.EOF
			}
			return
		}
		sr.part = sr.s.Bytes()
		m := copy(p[n:], sr.part)
		n += m
		sr.part = sr.part[m:] // save whatever didn't fit for next time
	}
	return
}
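As a quick, self-contained illustration of what "just joins the tokens" means: feeding line-split input through such a converter yields the lines concatenated with the newlines dropped, equivalent to this loop (`joined` is an illustrative helper, not part of the converter):

```go
package main

import (
	"bufio"
	"bytes"
	"strings"
)

// joined mimics reading everything out of the converter: it collects
// the scanner's tokens back to back, with the separators gone, so
// joined("one\ntwo\nthree\n") yields "onetwothree".
func joined(input string) string {
	s := bufio.NewScanner(strings.NewReader(input))
	var buf bytes.Buffer
	for s.Scan() {
		buf.Write(s.Bytes())
	}
	return buf.String()
}
```

If the separators matter downstream, the split function would have to include them in the token (or the converter re-insert them), since the default line splitter discards them.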