Count of actually "processed" bytes from csv.Reader


Severyn Lisovsky

Oct 30, 2020, 10:17:26 PM
to golang-nuts
Hi,

I'm having difficulty counting the bytes processed by csv.Reader, because it reads from an internally created bufio.Reader. If I pass a counting reader to csv.NewReader, it reports not the number of bytes csv.Reader actually "processed" to produce the records returned by csv.Reader.Read, but the number of bytes copied into the bufio.Reader's internal buffer (some of those bytes are only consumed from the buffer on later csv.Reader.Read calls).

Is there a way to deal with this issue without forking the encoding/csv package?

To give you a higher-level picture: I want to split a remote CSV file into chunks. Each chunk should be a standalone CSV file, starting at the actual beginning of a line and ending with a newline byte. So I'm trying to do the following: divide the file size by the number of chunks, and for each chunk skip the first bytes up to a newline, then read up to offset+chunkSize+[number of bytes to the next newline].

peterGo

Oct 31, 2020, 10:51:30 AM
to golang-nuts
Severyn,

The best way to deal with this issue is to redefine the issue. Use CSV lines, not bytes, as the measure.

For example,


Peter

Robert Engels

Oct 31, 2020, 12:18:17 PM
to peterGo, golang-nuts
If you want to do this, create a ByteCount reader that wraps the underlying reader and pass that to the csv reader. 

Tamás Gulácsi

Oct 31, 2020, 1:02:32 PM
to golang-nuts
Give csv.NewReader your own *bufio.Reader.
According to the bufio.NewReaderSize documentation (https://pkg.go.dev/pkg/bufio/#NewReaderSize), if the underlying io.Reader is already a *bufio.Reader with a large enough buffer (and csv.NewReader uses the default 4k),
then the underlying reader is used and no new wrapping is introduced.

This way, if you use
  cr := &countingReader{Reader: r} // pointer, so the counter is updated on every Read
  br := bufio.NewReader(cr)
  csvR := csv.NewReader(br)

then cr.N - br.Buffered() is the number of bytes consumed by csv.Reader so far, i.e. the offset of the end of the last line read.

Hope this helps.

Severyn Lisovsky

Oct 31, 2020, 1:31:34 PM
to golang-nuts
Tamás Gulácsi, this was basically my initial idea, but unfortunately there is no access to the internal bufio.Reader. See:
https://golang.org/src/encoding/csv/reader.go#L170

peterGo, my file is ~100GB, so downloading it just for the sake of splitting doesn't make sense to me. I want each worker to use the NewRangeReader method to download only its own piece of the file.

ren...@ix.netcom.com, a ByteCount reader that wraps the underlying reader wouldn't help, because csv.Reader doesn't read from the underlying reader synchronously; it reads from a bufio.Reader, which buffers the bytes. For example, if you read 1 row from the CSV (e.g. 10 bytes), 4096 bytes will be read from the underlying io.Reader. On the next csv.Reader.Read() call, no bytes will be read from the underlying io.Reader, because the next row is taken from the buffer.

Tamás Gulácsi

Oct 31, 2020, 1:50:18 PM
to golang-nuts
Why do you need access to the internal bufio.Reader?

If you provide a *bufio.Reader to bufio.NewReader, it will NOT create a new reader, but give back your reader.
So if you keep your *bufio.Reader and give it to csv.NewReader, then you will hold the same *bufio.Reader
as the csv.Reader's inner r!
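
This behaviour is easy to check with a short sketch; `sameReader` is a hypothetical helper written for this illustration. It also shows the size condition: a reader whose buffer is smaller than the default gets wrapped in a new one.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sameReader reports whether bufio.NewReader hands back the very same
// *bufio.Reader when given one whose buffer has the requested size.
func sameReader(size int) bool {
	br := bufio.NewReaderSize(strings.NewReader("a,b\n"), size)
	return bufio.NewReader(br) == br
}

func main() {
	fmt.Println("4096-byte buffer reused:", sameReader(4096))
	fmt.Println("16-byte buffer reused:  ", sameReader(16))
}
```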

Severyn Lisovsky

Oct 31, 2020, 2:01:08 PM
to golang-nuts
Tamás Gulácsi, wow didn't know that providing bufio.Reader to bufio.NewReader doesn't wrap your reader. Looks like this is the solution I've been looking for. Thanks!

Severyn Lisovsky

Oct 31, 2020, 2:16:12 PM
to golang-nuts
for everyone interested this is the solution in Go Playground:

Victor Giordano

Oct 31, 2020, 7:40:21 PM
to golang-nuts
nice! a good old decorator example! thanks