Writing chunks of data to a single gob file and later reading each chunk


datachunk

Jun 21, 2014, 4:59:43 AM
to golan...@googlegroups.com
Hi everybody,

I am currently working on some data-analysis tools in Go. The basic idea is to do plain old statistics in a parallel/concurrent fashion with goroutines, which works like a charm.
As the datasets can be very large, they absolutely have to be partitioned, processed in chunks, and stored in a file as chunks. This is where the problem starts. I am trying to do this with encoding/gob, because I want to store my data in a binary format and perhaps compress it later. The encoding is done by repeatedly calling gobEncoder.Encode on a [][]float64, which presumably appends each chunk to a single gob file; at least the size of the file looks right. I have not delved into the minutiae of how gob stores its data, so I cannot say exactly what gob is storing.

Now, when I repeatedly call gobDecoder.Decode with a pointer to a [][]float64, I always get the same chunk of data back. It seems as if gob encoding cannot handle multiple chunks of data in one file.
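
Roughly, this is the pattern I am aiming for (a trimmed sketch with made-up names, not my actual code):

package main

import (
	"encoding/gob"
	"fmt"
	"io"
	"os"
)

// writeChunks encodes every chunk with the same encoder, so all chunks
// end up appended to one gob stream in one file.
func writeChunks(path string, chunks [][][]float64) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	enc := gob.NewEncoder(f)
	for _, chunk := range chunks {
		if err := enc.Encode(chunk); err != nil {
			return err
		}
	}
	return nil
}

// readChunks decodes chunk after chunk with the same decoder until EOF.
func readChunks(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	dec := gob.NewDecoder(f)
	for {
		var chunk [][]float64
		if err := dec.Decode(&chunk); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}
		fmt.Println("read a chunk with", len(chunk), "rows")
	}
}

func main() {
	chunks := [][][]float64{{{1, 2}, {3, 4}}, {{5, 6}}}
	if err := writeChunks("chunks.gob", chunks); err != nil {
		panic(err)
	}
	if err := readChunks("chunks.gob"); err != nil {
		panic(err)
	}
}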

Any help, leads, or hints are much appreciated.

Bernard

chris dollin

Jun 22, 2014, 11:37:49 AM
to datachunk, golang-nuts
Show code (minimal complete example preferred).

Chris

--
Chris "allusive" Dollin

bernhard...@gmail.com

Jun 24, 2014, 3:25:19 PM
to golan...@googlegroups.com, bernhard...@gmail.com
Here is some sample code:
// In this first main function, I parse the ASCII data file and write float64 values to a dat2bin.gob file. I have left out the float conversion etc.
func main() {
	r, _ := regexp.Compile(";")
	file, _ := os.Open("/path/to/datafile/household_power_consumption.txt")
	scanner := bufio.NewScanner(file)
	var basetime time.Time
	var bulk [][]float64
	var pkgCounter int

	f, err := os.Create("/path/to/datafile/prepareData/dat2bin.gob")
	defer f.Close()

	gobEncoder := gob.NewEncoder(f)
	pkgCounter = 0
	for j := 0; scanner.Scan(); j++ {
		if j%10000 == 0 && j != 0 {
			pkgCounter++
			err = gobEncoder.Encode(bulk)
			bulk = nil
		}
		res := r.Split(scanner.Text(), -1)
		bulk = append(bulk, res) // float conversion of res left out, as noted above
	}
	pkgCounter++
	err = gobEncoder.Encode(bulk)
	bulk = nil
}




// In this second main function, I am trying to pull the data out of dat2bin.gob package by package. It is just experimental.
// I first read into stuffList, then into stuffList1, and then convert back to a CSV/ASCII format to check my data
// in a human-readable format. And here is the problem: these two calls,
//   err = gobDecoder.Decode(&stuffList)
//   err = gobDecoder.Decode(&stuffList1)
// give me the exact same data 'package'.


func main() {
	gobFile, err := os.Open("/path/to/datafile/prepareData/dat2bin.gob")
	defer gobFile.Close()
	gobDecoder := gob.NewDecoder(gobFile)

	var stuffList, stuffList1 [][]float64
	err = gobDecoder.Decode(&stuffList)
	err = gobDecoder.Decode(&stuffList1)
	write1, err := os.Create("stuffList.txt")
	write2, err := os.Create("stuffList1.txt")
	enc1 := csv.NewWriter(write1)
	enc2 := csv.NewWriter(write2)
	var strWrite [][]string = make([][]string, 0, 10000)
	var strWrite1 [][]string = make([][]string, 0, 10000)
	var rows, rows1 int
	for i, oneline := range stuffList {
		strWrite = append(strWrite, make([]string, 9))
		rows++
		for j, onecell := range oneline {
			strWrite[i][j] = strconv.FormatFloat(onecell, 'f', 6, 64)
		}
	}

	for i, oneline := range stuffList1 {
		strWrite1 = append(strWrite, make([]string, 9))
		rows1++
		for j, onecell := range oneline {
			strWrite1[i][j] = strconv.FormatFloat(onecell, 'f', 6, 64)
		}
	}
	fmt.Println(rows)
	fmt.Println(rows1)
	for i, _ := range strWrite {
		enc1.Write(strWrite[i])
		enc2.Write(strWrite1[i])
	}
	enc1.Flush()
	enc2.Flush()
}

Nigel Tao

Jun 24, 2014, 7:59:37 PM
to bernhard...@gmail.com, golang-nuts
On Wed, Jun 25, 2014 at 5:25 AM, <bernhard...@gmail.com> wrote:
> err = gobEncoder.Encode(bulk)

You assign the result to err, but don't otherwise seem to inspect the
error returned.


> err = gobDecoder.Decode(&stuffList)
> err = gobDecoder.Decode(&stuffList1)

Similarly here.

For example, if your I/O failed (and returned a non-nil error that you
ignored), you might be reading data written by an older version of
your program without knowing it.
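
Even in throwaway code, something along these lines (just a sketch) keeps a failure from slipping by unnoticed:

	if err := gobEncoder.Encode(bulk); err != nil {
		panic(err)
	}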

bernhard...@gmail.com

Jun 28, 2014, 12:20:48 PM
to golan...@googlegroups.com, bernhard...@gmail.com
Nigel, true. But this is only sample code. In my 'real' code I panic on non-nil errors, so, to be absolutely sure, I am warned whenever something goes wrong. And I do not get any panic.

Any other idea?

Nigel Tao

Jun 28, 2014, 10:52:45 PM
to Bernhard Spanyar, golang-nuts
On Sun, Jun 29, 2014 at 2:20 AM, <bernhard...@gmail.com> wrote:
> Any other idea?

I don't see any obvious programming mistake, so I'd try reducing this
to a minimal example that still exhibits the wrong behavior. For
example, if you drop the "10000" to just "10" or even less, then do
you still see the bug? If so, you should be able to just print out and
eyeball the first few entries both before encoding to and after
decoding from gob, and between decoding from gob and encoding to CSV.
That would narrow down in which part of the two programs the problem
is first seen.

Another thing to try is to take out the file I/O and use a
bytes.Buffer instead, a la
http://golang.org/pkg/encoding/gob/#example__basic

If you did that, and were still able to reproduce the problem with
only a few lines of the original "household_power_consumption.txt"
data, then you could move the reproduction case to the Go playground
at http://play.golang.org/ once you hard-code those few lines of input
into a strings.NewReader. Once you have a (minimal) reproducible case
in the Go playground, it will be much easier for the community to help
find the bug.
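
For example, something along these lines (a rough, untested sketch with made-up data) would exercise two Encode calls and two Decode calls against a bytes.Buffer:

package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"log"
)

func main() {
	var buf bytes.Buffer
	enc := gob.NewEncoder(&buf)
	dec := gob.NewDecoder(&buf)

	// Two chunks written back to back to the same stream.
	chunk1 := [][]float64{{1, 2, 3}, {4, 5, 6}}
	chunk2 := [][]float64{{7, 8, 9}}
	if err := enc.Encode(chunk1); err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(chunk2); err != nil {
		log.Fatal(err)
	}

	// Two decodes from the same stream should yield the two chunks in order.
	var got1, got2 [][]float64
	if err := dec.Decode(&got1); err != nil {
		log.Fatal(err)
	}
	if err := dec.Decode(&got2); err != nil {
		log.Fatal(err)
	}
	fmt.Println(got1)
	fmt.Println(got2)
}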

Gary Scarr

Jun 29, 2014, 4:11:32 PM
to golan...@googlegroups.com, bernhard...@gmail.com
strWrite1 = append(strWrite, make([]string, 9)) looks suspicious but, as others have said, a minimal complete example, including data, that reproduces the problem is the only way to reasonably expect people to help you. That includes putting in error handling, so that they don't have to waste time wondering where the error is. You may well find the problem yourself as you simplify the code.
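
Presumably that line was meant to append to strWrite1 itself, i.e. something like:

	strWrite1 = append(strWrite1, make([]string, 9))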

Gary

bernhard...@gmail.com

Jul 1, 2014, 2:46:42 PM
to golan...@googlegroups.com, bernhard...@gmail.com
Hello,

OK, here is some working code. It does not reproduce my problem, but it still shows some unexpected behavior. Note that the Go playground also allows using a file system. The first example writes two chunks of data to a gob encoder backed by a file on a file system: it returns empty decoded chunks.

Contrast this with the second example, backed by an in-memory byte buffer, which encodes and decodes flawlessly.

Not exactly my problem, but odd nevertheless. I would have expected gob encoding to work equally well no matter what it is backed by, a byte buffer or a file.
Bernhard

bernhard...@gmail.com

Jul 1, 2014, 2:52:50 PM
to golan...@googlegroups.com, bernhard...@gmail.com
Here is a version where I check for failed file creation:

Nigel Tao

Jul 1, 2014, 7:53:07 PM
to Bernhard Spanyar, golang-nuts
On Wed, Jul 2, 2014 at 4:52 AM, <bernhard...@gmail.com> wrote:
> Here is a version where I check for failed file creation:
> http://play.golang.org/p/gi7VJloDjC

Your code does this:

gobFile, err := os.Create("gobFile.gob")
if err != nil {
	panic(err)
}
enc := gob.NewEncoder(gobFile)
dec := gob.NewDecoder(gobFile)

which is sharing the same file descriptor for the writing and the
reading. That means that, after you have written 984 bytes of
gob-encoded data, the file descriptor position is 984: it is
positioned at the end of the file, not the beginning. Thus, when you
next read from that file, you're reading nothing but EOF, since you're
already at the end. Instead, open the file twice, once for writing,
once for reading.

wFile, err := os.Create("gobFile.gob")
if err != nil {
	panic(err)
}
rFile, err := os.Open("gobFile.gob")
if err != nil {
	panic(err)
}
enc := gob.NewEncoder(wFile)
dec := gob.NewDecoder(rFile)

http://play.golang.org/p/8aN3ZoSE_L

Gary Scarr

Jul 2, 2014, 12:08:29 AM
to golan...@googlegroups.com
And if you uncomment your err check on line 30, you will see that you get EOF. Why are you asking for help and then ignoring our advice to check every err?

Carlos Castillo

Jul 2, 2014, 12:51:16 AM
to golan...@googlegroups.com, bernhard...@gmail.com
Alternatively, if you must keep using a single file handle (a somewhat unwise situation), you can use os.File.Seek between the writes and reads to put the read location back at the beginning of the now written data: http://play.golang.org/p/8MQzcw0b5J
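
The core of that approach is roughly the following (a sketch, not the exact code behind the playground link):

package main

import (
	"encoding/gob"
	"fmt"
	"os"
)

func main() {
	// One *os.File used for both writing and reading.
	f, err := os.Create("gobFile.gob")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	enc := gob.NewEncoder(f)
	if err := enc.Encode([][]float64{{1, 2}, {3, 4}}); err != nil {
		panic(err)
	}

	// Rewind the shared handle to the start of the written data before decoding.
	if _, err := f.Seek(0, os.SEEK_SET); err != nil {
		panic(err)
	}

	dec := gob.NewDecoder(f)
	var got [][]float64
	if err := dec.Decode(&got); err != nil {
		panic(err)
	}
	fmt.Println(got)
}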

bernhard...@gmail.com

Jul 2, 2014, 1:52:03 PM
to golan...@googlegroups.com, bernhard...@gmail.com
Big thanks to both of you, Carlos and Nigel. While it has not solved my problem yet, Carlos's Seek trick gave me a very nice diagnostic.
For completeness, and for posterity :-), here is how to instrument your Go code to figure out where your file pointer currently is:

To get the current file pointer position, I call Seek(0, os.SEEK_CUR) on the *os.File, i.e. I seek zero bytes from the current position, which returns the current position. The trick is from unwind's answer here: http://stackoverflow.com/a/10901436
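
A minimal sketch of that instrumentation (the helper name is mine):

package main

import (
	"fmt"
	"os"
)

// currentOffset reports where the file pointer of f currently is, by
// seeking zero bytes from the current position.
func currentOffset(f *os.File) int64 {
	pos, err := f.Seek(0, os.SEEK_CUR)
	if err != nil {
		panic(err)
	}
	return pos
}

func main() {
	f, err := os.Create("probe.gob")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	fmt.Println("offset before writing:", currentOffset(f))
	f.WriteString("hello")
	fmt.Println("offset after writing:", currentOffset(f))
}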

I will now use this to find out what gob is doing while encoding data packages of size 10000.

Bernhard

bernhard...@gmail.com

Jul 2, 2014, 2:06:50 PM
to golan...@googlegroups.com, bernhard...@gmail.com
Packages of size 10000 work equally well in a file with gob encoding, of course. I had a typo.
Problem solved.