Dumping/reading data structures to/from a file


Miguel Pignatelli Moreno

May 12, 2011, 7:00:05 PM
to golang-nuts
Hi all,

I have the following map:

/////////
type tids struct {
    taxid int
    name  string
}

type taxon struct {
    *tids
    taxon string
}

type Taxnode struct {
    this     taxon
    parentId int
}

tree := make(map[int]*Taxnode)
//////

It is populated by reading and processing a lot of records from a file (1.4M).
The whole process takes some time. What is the best way to "dump" the data structure (the map) to a file and read it back (as the same data structure) efficiently?

Thanks in advance!

M;

eksp...@gmail.com

May 12, 2011, 7:04:58 PM
to golan...@googlegroups.com
Only binary.Read and binary.Write come to mind.

http://golang.org/pkg/encoding/binary/
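
A note on this suggestion: binary.Read and binary.Write only accept fixed-size values, so the strings and pointers in the structs above could not be written directly. The sketch below assumes a hypothetical flattened record type (flatNode, with int32 fields in place of int) purely to show the round trip; it uses today's import path, and it is not a drop-in solution for the posted map.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// flatNode is a hypothetical fixed-size record: no strings, maps, or
// pointers, because binary.Write rejects values that are not fixed-size.
type flatNode struct {
	TaxID    int32
	ParentID int32
}

// roundTrip writes the records to a buffer and reads them back.
func roundTrip(in []flatNode) ([]flatNode, error) {
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, in); err != nil {
		return nil, err
	}
	out := make([]flatNode, len(in))
	if err := binary.Read(&buf, binary.LittleEndian, out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	out, err := roundTrip([]flatNode{{TaxID: 2, ParentID: 1}, {TaxID: 3, ParentID: 1}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

The reader has to know the record count in advance (here it reuses len(in)); a real file format would need to store it.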

Russ Cox

May 12, 2011, 7:30:01 PM
to Miguel Pignatelli Moreno, golang-nuts

Kyle Consalus

May 12, 2011, 7:59:33 PM
to Miguel Pignatelli Moreno, golang-nuts
I'd use Gob; see http://golang.org/pkg/gob/

On my machine, I wrote a little program that generates a 2 meg slice of struct { int, int }, writes it to a file, then reads it back.
Timing:
Encode: 121.316ms
Decode: 109.776ms

Not too shabby, and couldn't be much easier.
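
The program itself was not posted. A rough sketch of that kind of measurement might look like the following; it assumes an in-memory buffer instead of a file, today's encoding/gob import path, and an arbitrary element count, and the names pair, encodePairs and decodePairs are made up for illustration. Timings will differ from the figures quoted above.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"time"
)

// pair mirrors the struct { int, int } described above; gob only encodes
// exported fields, so the field names are capitalised.
type pair struct {
	A, B int
}

// encodePairs gob-encodes the slice into a byte slice.
func encodePairs(ps []pair) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(ps); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodePairs reverses encodePairs.
func decodePairs(data []byte) ([]pair, error) {
	var ps []pair
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&ps)
	return ps, err
}

func main() {
	ps := make([]pair, 1<<20) // element count is an arbitrary assumption
	for i := range ps {
		ps[i] = pair{A: i, B: -i}
	}

	start := time.Now()
	data, err := encodePairs(ps)
	if err != nil {
		fmt.Println("encode:", err)
		return
	}
	fmt.Printf("Encode: %v (%d bytes)\n", time.Since(start), len(data))

	start = time.Now()
	back, err := decodePairs(data)
	if err != nil {
		fmt.Println("decode:", err)
		return
	}
	fmt.Printf("Decode: %v (%d elements)\n", time.Since(start), len(back))
}
```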

Miguel Pignatelli

May 13, 2011, 6:56:58 AM
to golang-nuts
Would it also be possible to use the "unsafe" package to "map" the data
structure to the file and read it back again?
I understand that this would be very fast.

Trying to figure out if this is possible I have found a post [1] that
does this with structs:

func struct2bytes(x *myStruct) []byte {
    return (*[unsafe.Sizeof(*x)]byte)(unsafe.Pointer(x))[0:]
}

but I am not able to convert it back:

func bytes2struct(s []byte) *field {
    f := &field{}
    f = (*field)(unsafe.Pointer(&s))
    return f
}

This causes a segfault.

In case I can solve this: would it also be possible to use the same
strategy with a map[int]*MyStruct?

Thanks in advance,

M;


[1]
http://groups.google.com/group/golang-nuts/browse_thread/thread/85a3306f24a6ebb2

Russ Cox

May 13, 2011, 9:57:00 AM
to Miguel Pignatelli, golang-nuts
On Fri, May 13, 2011 at 06:56, Miguel Pignatelli
<miguel.p...@uv.es> wrote:
> Would it also be possible to use the "unsafe" package to "map" the data
> structure to the file and read it back again?
> I understand that this would be very fast.

Please try gob and let us know if it's too slow.

You could use unsafe if your data structure had no
pointer in it, but once pointers are involved you can't
just write the memory out to disk and read it back in.
That's a recipe for disaster.

Russ

Ian Lance Taylor

May 13, 2011, 10:33:35 AM
to Miguel Pignatelli, golang-nuts
Miguel Pignatelli <miguel.p...@uv.es> writes:

> Would it also be possible to use the "unsafe" package to "map" the
> data structure to the file and read it back again?
> I understand that this would be very fast.
>
> Trying to figure out if this is possible I have found a post [1] that
> does this with structs:
>
> func struct2bytes(x *myStruct) []byte {
> return (*[unsafe.Sizeof(*x)]byte)(unsafe.Pointer(x))[0:]
> }

Note that this views the object as an array and then slices the array.


> but I am not able to convert it back:
>
> func bytes2struct (s []byte) *field {
> f := &field{}
> f = (*field)(unsafe.Pointer(&s))
> return f
> }

Here you are trying to view a slice, not an array, as though it were the
object. You need to get to the underlying array, not the slice. In
other words, do this instead:

func bytes2struct(s []byte) *field {
    return (*field)(unsafe.Pointer(&s[0]))
}


Not that I recommend this kind of thing. It works until something,
anything, changes, and then you can't read any of your old data files.
Doing proper serialization and unserialization takes longer, but it
means that you won't lose access to your data.
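
For readers on a current toolchain, the struct-to-bytes view above can be sketched with unsafe.Slice (available since Go 1.17). This is an illustration under assumptions, not the code from the thread: the field type here is a hypothetical pointer-free struct, and the reverse direction copies the bytes into a fresh value instead of casting &s[0], which sidesteps alignment concerns at the cost of a copy. The layout caveat above still applies in full.

```go
package main

import (
	"fmt"
	"unsafe"
)

// field is a hypothetical pointer-free struct; the trick is only meaningful
// for types with no pointers, strings, slices, or maps inside.
type field struct {
	ID     int32
	Parent int32
}

// structBytes returns a byte view that aliases the memory of *f.
func structBytes(f *field) []byte {
	return unsafe.Slice((*byte)(unsafe.Pointer(f)), unsafe.Sizeof(*f))
}

// bytesToStruct copies raw bytes into a fresh field value after checking
// that the length matches the in-memory size of the struct.
func bytesToStruct(b []byte) (field, error) {
	var f field
	if len(b) != int(unsafe.Sizeof(f)) {
		return f, fmt.Errorf("got %d bytes, want %d", len(b), unsafe.Sizeof(f))
	}
	copy(structBytes(&f), b)
	return f, nil
}

func main() {
	in := field{ID: 9606, Parent: 9605}
	raw := append([]byte(nil), structBytes(&in)...) // detach from in's memory
	out, err := bytesToStruct(raw)
	fmt.Println(out, err)
}
```

The bytes produced this way depend on the machine's byte order and the struct's padding, which is one concrete reason old data files can become unreadable.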

You said it took "some time" to read 1.4M. It shouldn't take all that
long. How are you reading it? What does "some time" mean to you?

Ian

André Moraes

May 13, 2011, 10:40:11 AM
to golang-nuts
Protocol Buffers maybe?

Go has support for those.

--
André Moraes
http://andredevchannel.blogspot.com/

Rob 'Commander' Pike

May 13, 2011, 10:45:03 AM
to André Moraes, golang-nuts

On May 13, 2011, at 7:40 AM, André Moraes wrote:

> Protocol Buffers maybe?
>
> Go has support for those.

Protocol buffers require you to write a data specification and can only handle structs; on the other hand, they allow interoperability between Go and Java, C++, and Python.

Gobs do not require a data specification and can handle anything except a channel or function, including slices, arrays, and basic types; on the other hand, they are Go-only. They're much easier to use and comparable in performance.

For what you're asking, I think gobs are the better choice but either would work.

-rob
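
To illustrate the "no data specification" point, here is a small sketch that sends a map of pointers, a slice, and a plain int through a single gob stream. The node type and the function name roundTripMixed are hypothetical, and the import path is the current encoding/gob. One caveat worth keeping in mind: gob only transmits exported struct fields.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// node is a hypothetical record type; only exported fields are transmitted.
type node struct {
	Name     string
	ParentID int
}

// roundTripMixed encodes three values of different kinds into one stream
// and decodes them in the same order.
func roundTripMixed(tree map[int]*node, ids []int, count int) (map[int]*node, []int, int, error) {
	var buf bytes.Buffer
	enc := gob.NewEncoder(&buf)
	for _, v := range []interface{}{tree, ids, count} {
		if err := enc.Encode(v); err != nil {
			return nil, nil, 0, err
		}
	}

	var (
		outTree  map[int]*node
		outIDs   []int
		outCount int
	)
	dec := gob.NewDecoder(&buf)
	for _, p := range []interface{}{&outTree, &outIDs, &outCount} {
		if err := dec.Decode(p); err != nil {
			return nil, nil, 0, err
		}
	}
	return outTree, outIDs, outCount, nil
}

func main() {
	tree := map[int]*node{
		1: {Name: "root", ParentID: 1},
		2: {Name: "Bacteria", ParentID: 1},
	}
	outTree, ids, count, err := roundTripMixed(tree, []int{1, 2}, len(tree))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(outTree[2].Name, ids, count)
}
```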


Miguel Pignatelli

May 13, 2011, 11:07:22 AM
to Ian Lance Taylor, golang-nuts
On 13/05/11 15:33, Ian Lance Taylor wrote:
> Miguel Pignatelli<miguel.p...@uv.es> writes:
>
>> Would it also be possible to use the "unsafe" package to "map" the
>> data structure to the file and read it back again?
>> I understand that this would be very fast.
>>
>> Trying to figure out if this is possible I have found a post [1] that
>> does this with structs:
>>
>> func struct2bytes(x *myStruct) []byte {
>> return (*[unsafe.Sizeof(*x)]byte)(unsafe.Pointer(x))[0:]
>> }
>
> Note that this views the object as an array and then slices the array.
>
>
>> but I am not able to convert it back:
>>
>> func bytes2struct (s []byte) *field {
>> f :=&field{}
>> f = (*field)(unsafe.Pointer(&s))
>> return f
>> }
>
> Here you are trying to view a slice, not an array, as though it were the
> object. You need to get to the underlying array, not the slice. In
> other words, do this instead:
>
> func bytes2struct (s []byte) *field {
> return (*field)(unsafe.Pointer(&s[0]))
> }
>

Oops... I see, thanks!

>
> Not that I recommend this kind of thing. It works until something,
> anything, changes, and then you can't read any of your old data files.
> Doing proper serialization and unserialization takes longer, but it
> means that you won't lose access to your data.
>

Yes, that's true. Speed is important in this particular case. I will try
with gob and compare the implementations.


> You said it took "some time" to read 1.4M. It shouldn't take all that
> long. How are you reading it? What does "some time" mean to you?
>

It is 1.4M records, not 1.4 MB of data.

Thanks for the help!

M;


> Ian
>

zhai

May 13, 2011, 11:11:25 AM
to Rob 'Commander' Pike, André Moraes, golang-nuts

2011/5/13 Rob 'Commander' Pike <r...@google.com>

> Gobs do not require a data specification and can handle anything except a channel or function, including slices, arrays, and basic types; on the other hand, they are Go-only. They're much easier to use and comparable in performance.

I hope gob will handle cyclic pointers someday without slowing down its performance.

From the gob documentation:
"Recursive types work fine, but recursive values (data with cycles) are problematic. This may change."


package main

import (
    "bytes"
    "gob"
)

type T struct {
    Item int
    Next *T
}

func main() {
    b := new(bytes.Buffer)
    var t T = T{1, nil}
    t.Next = &t
    enc := gob.NewEncoder(b)
    enc.Encode(t)
}
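
One common workaround, sketched here as an assumption and not as anything gob provides itself, is to replace the pointers with indices before encoding and rebuild them after decoding. The flatT type and the flatten, rebuild and roundTripList helpers are hypothetical names; the import path is the current encoding/gob.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// T is the cyclic type from the post above.
type T struct {
	Item int
	Next *T
}

// flatT is a hypothetical acyclic stand-in: Next is an index into the
// slice, or -1 where the original pointer was nil.
type flatT struct {
	Item int
	Next int
}

// flatten walks the list from head, giving each distinct node an index.
func flatten(head *T) []flatT {
	index := map[*T]int{}
	var order []*T
	for n := head; n != nil; n = n.Next {
		if _, seen := index[n]; seen {
			break // reached a node already visited: the cycle is closed
		}
		index[n] = len(order)
		order = append(order, n)
	}
	flat := make([]flatT, len(order))
	for i, n := range order {
		flat[i] = flatT{Item: n.Item, Next: -1}
		if n.Next != nil {
			flat[i].Next = index[n.Next]
		}
	}
	return flat
}

// rebuild restores the pointer structure; element 0 is the head.
func rebuild(flat []flatT) *T {
	if len(flat) == 0 {
		return nil
	}
	nodes := make([]T, len(flat))
	for i, f := range flat {
		nodes[i].Item = f.Item
		if f.Next >= 0 {
			nodes[i].Next = &nodes[f.Next]
		}
	}
	return &nodes[0]
}

// roundTripList gob-encodes the flattened form and rebuilds the list.
func roundTripList(head *T) (*T, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(flatten(head)); err != nil {
		return nil, err
	}
	var flat []flatT
	if err := gob.NewDecoder(&buf).Decode(&flat); err != nil {
		return nil, err
	}
	return rebuild(flat), nil
}

func main() {
	t := &T{Item: 1}
	t.Next = t // a one-element cycle, as in the post
	back, err := roundTripList(t)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(back.Item, back.Next == back)
}
```

The taxonomy map in this thread already follows the same idea: each node stores its parent's integer id instead of a pointer to the parent, so the encoded value contains no cycles.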

Miguel Pignatelli Moreno

May 13, 2011, 5:23:11 PM
to r...@golang.org, golang-nuts

El 13/05/2011, a las 14:57, Russ Cox escribió:

> On Fri, May 13, 2011 at 06:56, Miguel Pignatelli
> <miguel.p...@uv.es> wrote:
>> Would it also be possible to use the "unsafe" package to "map" the data
>> structure to the file and read it back again?
>> I understand that this would be very fast.
>
> Please try gob and let us know if it's too slow.

Using gob I am able to speed up the process by ~2.6x.
A good improvement without much work.

Cheers

M;

Steven

May 13, 2011, 9:45:58 PM
to Miguel Pignatelli Moreno, r...@golang.org, golang-nuts
On Friday, May 13, 2011, Miguel Pignatelli Moreno <eme...@gmail.com> wrote:
> Using gob I am able to speed the process ~2.6x
> A good improvement without much work involvement.

What is this compared to?

Miguel Pignatelli Moreno

May 14, 2011, 5:33:13 PM
to Steven, r...@golang.org, golang-nuts


The original idea was to speed up the process of building a tree (based on a map) from data stored in 2 text files (80 MB of data). So the idea is to build the map (tree) once and dump the data structure to a file, then load the generated file/tree when it is needed (without preprocessing).

These are the numbers (some code below):

$ ./test_IOtree
Parsed the tree in 8.07656 seconds
30216971 bytes successfully written to file
tree written to file in 3.47415 seconds
tree read from file in 3.06392 seconds

I sped up the process by a factor of ~2.6x (based on several runs).
I know this is only useful in this particular context.
More generally, dumping 29 MB of data (using gob) in more than 3 seconds doesn't seem very fast to me. I am aware that the Go developers are not currently optimizing the code, so: i) for now I am happy with this 2.6x speedup; ii) I expect the IO to become more efficient with time.

The code....

type tids struct {
    Taxid int
    Name  string
}

type taxon struct {
    Tids  *tids
    Taxon string
}

type Taxnode struct {
    This     taxon
    ParentId int
}

type taxonomy map[int]*Taxnode

func New(nodes, names string) taxonomy {
    /////// Function to build the tree from the files nodes and names
}

func (t taxonomy) Store(fname string) os.Error {
    b := new(bytes.Buffer)
    enc := gob.NewEncoder(b)
    err := enc.Encode(t)
    if err != nil {
        return err
    }

    // O_TRUNC discards stale bytes if the file already exists and is longer.
    fh, eopen := os.OpenFile(fname, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0666)
    if eopen != nil {
        return eopen
    }
    // Defer the close only once the open is known to have succeeded.
    defer fh.Close()
    n, e := fh.Write(b.Bytes())
    if e != nil {
        return e
    }
    fmt.Fprintf(os.Stderr, "%d bytes successfully written to file\n", n)
    return nil
}

func Load(fname string) (taxonomy, os.Error) {
    fh, err := os.Open(fname)
    if err != nil {
        return nil, err
    }
    defer fh.Close() // release the file handle once decoding is done
    t := make(taxonomy)
    dec := gob.NewDecoder(fh)
    err = dec.Decode(&t)
    if err != nil {
        return nil, err
    }
    return t, nil
}

func main() {
    nodes := "/Users/pignatelli/src/tests/nodes.dmp"
    names := "/Users/pignatelli/src/tests/names.dmp"

    s1 := time.Nanoseconds()
    newtax := New(nodes, names)
    s2 := time.Nanoseconds()
    if newtax == nil {
        fmt.Fprintf(os.Stderr, "the map is empty\n")
    }
    fmt.Printf("Parsed the tree in %.5f seconds\n", float32(s2-s1)/1e9)

    s1 = time.Nanoseconds()
    e := newtax.Store("file.bin")
    s2 = time.Nanoseconds()
    if e != nil {
        fmt.Println(e)
        os.Exit(1)
    }
    fmt.Printf("tree written to file in %.5f seconds\n", float32(s2-s1)/1e9)

    s1 = time.Nanoseconds()
    _, err := Load("file.bin")
    s2 = time.Nanoseconds()
    if err != nil {
        fmt.Println(err)
        os.Exit(1)
    }
    fmt.Printf("tree read from file in %.5f seconds\n", float32(s2-s1)/1e9)
}
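
The code above targets the pre-Go 1 API (os.Error, time.Nanoseconds, and the "gob" import path). A rough equivalent for current Go might look like the sketch below. It keeps the type names from the post, streams the encoding through a bufio.Writer instead of building the whole output in a bytes.Buffer first, and adds a hypothetical helper, roundTripFile, that writes to a temporary directory so the example is self-contained. Whether the buffered variant is faster on the real 1.4M-record data has not been measured here.

```go
package main

import (
	"bufio"
	"encoding/gob"
	"fmt"
	"os"
	"path/filepath"
)

type tids struct {
	Taxid int
	Name  string
}

type taxon struct {
	Tids  *tids
	Taxon string
}

type Taxnode struct {
	This     taxon
	ParentId int
}

type taxonomy map[int]*Taxnode

// Store streams the gob encoding to fname through a buffered writer.
func (t taxonomy) Store(fname string) (err error) {
	fh, err := os.Create(fname) // truncates any existing file
	if err != nil {
		return err
	}
	defer func() {
		// Report a failed close unless an earlier error is already set.
		if cerr := fh.Close(); err == nil {
			err = cerr
		}
	}()
	w := bufio.NewWriter(fh)
	if err = gob.NewEncoder(w).Encode(t); err != nil {
		return err
	}
	return w.Flush()
}

// Load decodes a taxonomy previously written by Store.
func Load(fname string) (taxonomy, error) {
	fh, err := os.Open(fname)
	if err != nil {
		return nil, err
	}
	defer fh.Close()
	var t taxonomy
	if err := gob.NewDecoder(bufio.NewReader(fh)).Decode(&t); err != nil {
		return nil, err
	}
	return t, nil
}

// roundTripFile (hypothetical helper) stores t in a temporary file and
// loads it back, removing the temporary directory afterwards.
func roundTripFile(t taxonomy) (taxonomy, error) {
	dir, err := os.MkdirTemp("", "taxonomy")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	fname := filepath.Join(dir, "file.bin")
	if err := t.Store(fname); err != nil {
		return nil, err
	}
	return Load(fname)
}

func main() {
	tree := taxonomy{
		1: {This: taxon{Tids: &tids{Taxid: 1, Name: "root"}, Taxon: "no rank"}, ParentId: 1},
		2: {This: taxon{Tids: &tids{Taxid: 2, Name: "Bacteria"}, Taxon: "superkingdom"}, ParentId: 1},
	}
	back, err := roundTripFile(tree)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(back), back[2].This.Tids.Name, back[2].ParentId)
}
```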
