using gob to serialise very large objects


Dan Kortschak

Jan 25, 2014, 8:59:23 PM
to golang-nuts
First, I'm quite happy to accept the answer, "Don't do this."

I am wondering about using gob to serialise very large data objects
(e.g. machine learning data sets as matrices [0.5GB], Burrows-Wheeler
indexes for searching mammalian genomes [2-4GB], etc.). In most cases I
would want to keep the internals of these types unexported (we do this
with the provided gonum float64 matrix type, and I plan to do the same
in the BW search), which means we must implement a GobEncode/GobDecode
pair. The problem here is that these methods go through a []byte rather
than an io.Reader/io.Writer (looking at the gob code, it's clear why
this approach is used). This is a pretty significant overhead for
serialisation.

I'm wondering what alternatives there are. One I can think of is to
provide a serialisation method that converts to a type with exported
fields, which can then be encoded/decoded directly with gob behind the
scenes.
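
A minimal sketch of that second approach, assuming a hypothetical
matrix type (this is not gonum's actual API): the unexported internals
are copied into an exported mirror type that gob can handle without a
GobEncode/GobDecode pair.

package matrixgob

import (
	"encoding/gob"
	"io"
)

// matrix keeps its internals unexported.
type matrix struct {
	rows, cols int
	data       []float64
}

// matrixData mirrors matrix with exported fields so gob can encode it
// directly.
type matrixData struct {
	Rows, Cols int
	Data       []float64
}

// encodeGob writes m to w via the exported mirror type.
func (m *matrix) encodeGob(w io.Writer) error {
	return gob.NewEncoder(w).Encode(matrixData{m.rows, m.cols, m.data})
}

// decodeGob reads a matrix back from r.
func decodeGob(r io.Reader) (*matrix, error) {
	var d matrixData
	if err := gob.NewDecoder(r).Decode(&d); err != nil {
		return nil, err
	}
	return &matrix{rows: d.Rows, cols: d.Cols, data: d.Data}, nil
}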

Dan

Carlos Castillo

Jan 26, 2014, 8:21:37 PM
to golan...@googlegroups.com
The problem you are hitting is that all the common encoding packages provide an "all or nothing" interface, even when using gob.NewEncoder(). You hand over your full dataset as a single value, it is turned into bytes, and only if no error occurred during encoding is the object written to disk. All the data needs to exist twice in memory (original data and encoded bytes) before a single byte is written to disk.

You could instead add a layer on top of an existing format to cut the problem into manageable pieces. For example, to encode your matrices, encoder.Encode() a "Header" type that contains some metadata and a count of matrices, then encoder.Encode() each matrix independently. You now have a single file with N objects in it, and you have only used extra memory comparable to the size of the largest piece. If something went wrong during the encode and you stopped writing matrices to disk, you find out in the decode, since you hit io.EOF before all values are read. A sketch of the idea follows.
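
Sketched concretely (a hedged example; the piece type is a made-up
stand-in for a matrix):

package chunked

import (
	"encoding/gob"
	"io"
)

// piece stands in for one independently encodable chunk (e.g. a matrix).
type piece struct {
	Data []float64
}

// encodeAll writes a count followed by each piece, so only one piece
// at a time exists in encoded form.
func encodeAll(w io.Writer, ps []piece) error {
	enc := gob.NewEncoder(w)
	if err := enc.Encode(len(ps)); err != nil {
		return err
	}
	for _, p := range ps {
		if err := enc.Encode(p); err != nil {
			return err
		}
	}
	return nil
}

// decodeAll reads the count, then that many pieces; io.EOF before the
// count is reached indicates an interrupted encode.
func decodeAll(r io.Reader) ([]piece, error) {
	dec := gob.NewDecoder(r)
	var n int
	if err := dec.Decode(&n); err != nil {
		return nil, err
	}
	ps := make([]piece, 0, n)
	for i := 0; i < n; i++ {
		var p piece
		if err := dec.Decode(&p); err != nil {
			return ps, err
		}
		ps = append(ps, p)
	}
	return ps, nil
}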


Note: I wrote Encode/DecodeMyBigData using the Encoder/Decoder interfaces I defined, so it's easy to substitute JSON or XML encoding instead (see the commented-out lines in main). Also, instead of a header type I use a raw int, which works just as well, since the only metadata outside the matrices is the count.

Dan Kortschak

Jan 26, 2014, 8:28:27 PM
to Carlos Castillo, golan...@googlegroups.com
Yes, that's exactly the problem. With the gonum matrix type it's pretty easy, since there's a fully public type that can be serialised. With the BW index I have not yet completed the design, but I could just return the transform and let the client deal with it however they want.

thanks
Dan

Carlos Castillo

Jan 26, 2014, 10:15:44 PM
to golan...@googlegroups.com, Carlos Castillo
The types being exported/visible is not an issue; I've revised the example to make the types unexported (using json was just for flavor): http://play.golang.org/p/45ogZePUgs

AFAIK, you can encode values of unexported types (only unexported fields matter to reflect); since the myBigData type is not handled directly by gob/json/xml, its private fields don't matter in the equation. Matrix being exported or not doesn't matter, only its fields (and not their types) do: http://play.golang.org/p/xbNvzz-b4H
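
As a self-contained stand-in for the playground example, in case the
link rots (the point type here is hypothetical):

package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// point is unexported; gob only cares that the fields are exported.
type point struct {
	X, Y int
}

func main() {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(point{1, 2}); err != nil {
		panic(err)
	}
	var p point
	if err := gob.NewDecoder(&buf).Decode(&p); err != nil {
		panic(err)
	}
	fmt.Println(p) // {1 2}
}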

Dan Kortschak

Jan 26, 2014, 10:45:23 PM
to Carlos Castillo, golan...@googlegroups.com, Carlos Castillo
Yes, sorry, I should have been clearer. I'm talking about the fields of the data object.

Dan Kortschak

Jan 28, 2014, 1:00:06 PM
to Dan Kortschak, Carlos Castillo, golan...@googlegroups.com
On a related note, the same issue (though worse; the size differential between the data and the byte slice is up to about 9x) exists with JSON, since the interface creates a []byte rather than writing directly to an io.Writer, for which I can see no reason from my reading of the code. The situation is similar for the unmarshaling interface.

Is this something that could be changed for Go 2? I'm thinking of filing an issue but would like opinions first.

Carlos Castillo

Jan 28, 2014, 1:53:57 PM
to Dan Kortschak, golan...@googlegroups.com
The original purpose of JSON is to send relatively small messages between clients (e.g. web browsers) and servers. The message is usually a single object that is parsed in real time on the user's computer by a JavaScript interpreter, or a facsimile of one. Its strength lies in the fact that it can represent arbitrary data arrangements easily, in a human-readable manner that is similar (read: identical) to a language the programmer already knows (and has a parser/interpreter for). It was never meant to store massively large data items, so it performs badly in that case.

Similar arguments can be made about gob. It differs from JSON in that it uses a binary format (instead of a text one) for speed and space efficiency, and is much more strongly integrated with Go's native types, but again, it processes an object all at once, because that is easier and faster when the individual items are relatively small, which is the usual case.

In both these cases, despite being designed as wire formats, they work relatively well as file formats for small and medium-sized data items (no more than a few MB or so). For large data items, you should probably use a more appropriate format, such as a database, or cut your items into a sequence of smaller ones as I suggested earlier.

It might be nice if encoding/json's Encoder and Decoder types weren't just functionally wrappers around json.Marshal and json.Unmarshal, but that would require two different implementations of the same code. Since the common case is the small one, the implementation that favours that situation is the one used in both cases. I don't see a compelling argument that the extra effort is needed.

I shudder as I write this, but XML might be useful to you here, since it's ridiculously easy to dump valid XML yourself to avoid the memory explosion, and Go does provide a stream parser in encoding/xml.
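
For instance, a hedged sketch (the row type and element names are made
up): hand-write the enclosing element, let xml.Encoder stream each
record straight to the writer, and read it back with xml.Decoder's
token loop.

package xmlstream

import (
	"encoding/xml"
	"io"
)

// row is a made-up record type; one is encoded per <row> element.
type row struct {
	XMLName xml.Name  `xml:"row"`
	V       []float64 `xml:"v"`
}

// dumpXML hand-writes the enclosing element and streams each record;
// each Encode call writes through to w, so no whole-document buffer
// is ever built.
func dumpXML(w io.Writer, rows []row) error {
	if _, err := io.WriteString(w, "<matrix>"); err != nil {
		return err
	}
	enc := xml.NewEncoder(w)
	for _, r := range rows {
		if err := enc.Encode(r); err != nil {
			return err
		}
	}
	_, err := io.WriteString(w, "</matrix>")
	return err
}

// loadXML streams the document back, decoding one <row> at a time.
func loadXML(r io.Reader) ([]row, error) {
	dec := xml.NewDecoder(r)
	var rows []row
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return rows, nil
		}
		if err != nil {
			return nil, err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "row" {
			var rw row
			if err := dec.DecodeElement(&rw, &se); err != nil {
				return nil, err
			}
			rows = append(rows, rw)
		}
	}
}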



Dan Kortschak

Jan 28, 2014, 6:11:43 PM
to Carlos Castillo, golan...@googlegroups.com
Yeah, I understand where JSON has come from, but it has also proved to
be a very versatile data storage format, and one that I would like to
use more generally in our domain, which is notorious for format
proliferation[1].

I should point out that in most cases this 'problem' is not an issue
for me, since objects with private fields are small enough not to
result in large allocations, but I am wondering about the design
aspects to see if it is worth reconsidering for Go 2.

There is no real in-principle reason not to store very large data sets
in JSON if one accepts that you can do the same with XML (and people
do, both in my field and elsewhere, e.g. Wikipedia).

There's a clear difference between the implementation of JSON encoding
in Go and gob encoding: what makes handing an encoded slice of bytes
to the gob encoder sensible does not make nearly as much sense for
JSON. That is, there is no reason that I can see preventing the
json.Encoder from handing an io.Writer to an EncodeJSON method, pretty
much as is done for xml (yes, this is not strictly true, but certainly
close enough).

I don't see how the interaction between Marshal and Encoder impacts
the signature of MarshalJSON. If the signature were changed to
MarshalJSON(io.Writer) error there would be minimal change to the
package, mainly in {,addr}MarshalerEncoder[2], but also in compact,
which flows on to one other place (Compact) with trivial changes:
either shim (wrap the provided []byte with a bytes.Reader) or alter
the signature to Compact(io.Writer, io.Reader) error.
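
For concreteness, the change being proposed would look something like
this (a hypothetical interface, not the actual encoding/json API):

// Hypothetical streaming replacement for json.Marshaler.
type Marshaler interface {
	MarshalJSON(w io.Writer) error
}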

Finally, yeah. No. XML is not an option.

[1] http://www.biostars.org/p/7126/#7136
[2] http://golang.org/src/pkg/encoding/json/encode.go?s=12503:13250#L416

Brendan Tracey

Jan 28, 2014, 6:41:37 PM
to golan...@googlegroups.com, Carlos Castillo
Couldn't this issue be solved before Go 2 by the definition of an "encodable" interface:

type Encoder interface {
	Encode(v interface{}) error
}

type Encodable interface {
	EncodeInto(e Encoder) error
}

For my cases (at least), my MarshalJSON methods usually create a temporary struct with the private fields made public, and then call the relevant Marshal functions. The Encodable interface would have the same functionality, except that I could call the Encoder's Encode method. It would fix the problem Dan brings up (and one I may share in the near future) since, as far as I'm aware, the encoders are backed by an io.Writer, so the writes could happen as needed (without needing all of the temporary memory). This interface also seems to allow the creation of new custom encoders that types do not need to be aware of in order to satisfy. At the moment, if I want to support encoding into gob, JSON, and XML, I need three different encoding methods. Am I missing something?
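
To illustrate the proposal, a sketch of a type satisfying Encodable
once and being encoded with either gob or json (myBigData and
bigDataWire are made-up names):

package main

import (
	"encoding/gob"
	"encoding/json"
	"log"
	"os"
)

// Encoder restates the proposed interface; both *gob.Encoder and
// *json.Encoder already satisfy it.
type Encoder interface {
	Encode(v interface{}) error
}

// myBigData is a made-up type with unexported internals.
type myBigData struct {
	rows, cols int
	data       []float64
}

// bigDataWire is the exported mirror actually handed to the encoder.
type bigDataWire struct {
	Rows, Cols int
	Data       []float64
}

// EncodeInto satisfies the proposed Encodable interface.
func (m *myBigData) EncodeInto(e Encoder) error {
	return e.Encode(bigDataWire{m.rows, m.cols, m.data})
}

func main() {
	m := myBigData{rows: 1, cols: 2, data: []float64{3, 4}}
	// The same method serves both formats.
	if err := m.EncodeInto(gob.NewEncoder(os.Stdout)); err != nil {
		log.Fatal(err)
	}
	if err := m.EncodeInto(json.NewEncoder(os.Stdout)); err != nil {
		log.Fatal(err)
	}
}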

Carlos Castillo

Jan 28, 2014, 8:10:56 PM
to Brendan Tracey, golang-nuts
Brendan, your solution does make it more natural to express "when you encode MyType, you should actually encode myOtherType", and it doesn't have the limitation of requiring the user to generate a []byte, so it could handle large types, except that the way the encoders do the encoding is still a problem. Also, although the encoders are backed by an io.Writer, that writer is always a bytes.Buffer, so the data is still going to be stored in memory.

No matter how you write your custom marshaler, if you are using encoding/json or encoding/gob, an entire data item (the argument to Encode/Marshal) will be converted to one large []byte. In the case of Marshal this makes sense, since you are returned that byte slice (and an error). In the case of json/gob.Encoder.Encode, which writes its output to the io.Writer it was constructed with, it makes less sense. As currently written, Encode uses Marshal (or more precisely its underlying bytes.Buffer implementation) to create the large []byte, and then io.Write()s it.

I can see one advantage (other than code reuse) to the current behaviour of Encode: nothing is written if there is an error during encoding. If that behaviour were changed, existing programs would write (incomplete) data on error where previously they did not. If you don't want the entire object in memory twice (once encoded, once not), you will have this problem, as those encoded bytes have to live somewhere. Theoretically, Encode could write to a temporary file, so the bytes are not in memory, and then read them back on success, but that would be excessive for the stdlib, especially as the default behaviour of all programs.

One way to deal with this change in behaviour would be to leave Encode alone and create a new method, EncodeDirect, which explicitly doesn't guarantee that nothing will be written on error, but otherwise operates identically to Encode. It would, however, also add a wrinkle to the proposed Encoder interface: which method, Encode or EncodeDirect, should be used?

Otherwise, there is still my initial suggestion of a method/function that splits the large item into small items and encodes them individually. It solves both the visibility issue and the memory explosion. Since it's your code, the fact that data could be written on error doesn't affect existing programs, and you can work around it yourself. It also doesn't touch the standard library code at all.