gob size limit

Dan Kortschak

Apr 19, 2015, 9:34:44 PM
to golan...@googlegroups.com, Rob Pike, David Symonds
Resurrecting an old thread.

https://groups.google.com/d/topic/golang-nuts/4aFAxBEjjBY/discussion

I'm using gob to serialise very large data sets for storage rather than
message passing (maybe this is not an intended use - please suggest
another approach if this is the case - it seems like the best option at
this stage). Some cases leave me with gobs that are on the order of
10GB.

This is not for external consumption, so I'm happy to modify the
tooBig constant in encoding/gob (at least in the short term), but I was
wondering if there was still a view that having a method on gob.Decoder
to set this limit is appropriate... and whether a CL adding this would
be accepted at this point.

Egon

Apr 20, 2015, 2:25:32 AM
to golan...@googlegroups.com, r...@golang.org, dsym...@golang.org
On Monday, 20 April 2015 04:34:44 UTC+3, kortschak wrote:
> Resurrecting an old thread.
>
> https://groups.google.com/d/topic/golang-nuts/4aFAxBEjjBY/discussion
>
> I'm using gob to serialise very large data sets for storage rather than
> message passing (maybe this is not an intended use - please suggest
> other approach if this is the case - it seems like the best option at
> this stage). Some cases leave me with gobs that are on the order of
> 10GB.

What data?
Can you give the data-structure(s)?
How often do you read/write the whole data?
Do you need random access to data?

I would say use some other format.

Dan Kortschak

Apr 20, 2015, 2:49:59 AM
to Egon, golan...@googlegroups.com, r...@golang.org, dsym...@golang.org
On Sun, 2015-04-19 at 23:25 -0700, Egon wrote:
> What data?
> Can you give the data-structure(s)?
> How often do you read/write the whole data?
> Do you need random access to data?

Slices of named 4-tuples (struct of ints). Infrequent. Sequential.

> I would say use some other format.

I was considering using plain binary encoding and compressing, but that
feels too brittle.

Egon

Apr 20, 2015, 2:55:47 AM
to golan...@googlegroups.com, egon...@gmail.com, r...@golang.org, dsym...@golang.org
On Monday, 20 April 2015 09:49:59 UTC+3, kortschak wrote:
> On Sun, 2015-04-19 at 23:25 -0700, Egon wrote:
> > What data?
> > Can you give the data-structure(s)?
> > How often do you read/write the whole data?
> > Do you need random access to data?
>
> Slices of named 4-tuples (struct of ints). Infrequent. Sequential.

As in

type Data [][]Tuple
type Tuple struct { X, Y, Z, W int }

Or do the names vary between the tuples as:

type Data [][]Tuple
type Tuple struct { Xname string; X int; ... }

Or do you mean that the slices are named?

type Data []struct { Name string; Values []Tuple }

Also, how long are the names?

Dan Kortschak

Apr 20, 2015, 2:57:48 AM
to Egon, golan...@googlegroups.com, r...@golang.org, dsym...@golang.org
On Sun, 2015-04-19 at 23:55 -0700, Egon wrote:
> As in
>
> type Data [][]Tuple
> type Tuple struct { X, Y, Z, W int }
>
> Or do the names vary between the tuples as:
>
> type Data [][]Tuple
> type Tuple struct { Xname string, X int ... }
>
> Or do you mean that the slices are named?
>
> type Data []struct { Name string; Values []Tuple }
>
> Also, how long are the names?
>
Sorry, I was unclear, I should just write code:

type Trap struct {
    Top, Bottom int
    Left, Right int
}

type Traps []Trap

We are keeping Traps.

Sebastien Binet

Apr 20, 2015, 3:05:06 AM
to Egon, golang-nuts, Rob Pike, David Symonds
using encoding/binary isn't that brittle and seems to have rather nice
performance (size, speed):
http://ugorji.net/blog/benchmarking-serialization-in-go

I went with encoding/binary for my disk format too.

and with go-generate, one can get very nice performance and handling
of slices or maps.

-s

Egon

Apr 20, 2015, 3:46:50 AM
to golan...@googlegroups.com, dsym...@golang.org, r...@golang.org, egon...@gmail.com
Does the order matter?

First I would change the Top/Bottom ... fields to int8, int16, int32 or int64, depending on what is needed.
Then I would just mmap the whole structure to a file.

If I need a little packing, then probably encoding/binary + compress/*.
If I need more packing, then custom encoding + custom packing + compress/*, e.g. delta encoding, arithmetic encoding, RLE... It depends on the data set.

+ Egon
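
A minimal sketch of the "encoding/binary + compress/*" option, for illustration only. It assumes the Trap fields are changed to int64 (encoding/binary does not accept plain int), picks little-endian byte order and gzip arbitrarily, and uses hypothetical helper names (writeTraps, readTraps, roundTrip). binary.Write builds the whole slice in memory before writing, so a real 10GB data set would need to be written in chunks.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// Trap mirrors the struct from the thread, but with fixed-width fields
// (an assumption made so that encoding/binary can handle it).
type Trap struct {
	Top, Bottom int64
	Left, Right int64
}

// writeTraps writes a record count followed by the records, gzip-compressed.
func writeTraps(w io.Writer, traps []Trap) error {
	zw := gzip.NewWriter(w)
	if err := binary.Write(zw, binary.LittleEndian, int64(len(traps))); err != nil {
		return err
	}
	if err := binary.Write(zw, binary.LittleEndian, traps); err != nil {
		return err
	}
	return zw.Close() // flushes the compressed stream
}

// readTraps reverses writeTraps.
func readTraps(r io.Reader) ([]Trap, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	var n int64
	if err := binary.Read(zr, binary.LittleEndian, &n); err != nil {
		return nil, err
	}
	if n < 0 {
		return nil, errors.New("negative record count")
	}
	traps := make([]Trap, n)
	if err := binary.Read(zr, binary.LittleEndian, traps); err != nil {
		return nil, err
	}
	return traps, nil
}

// roundTrip writes traps to an in-memory buffer and reads them back.
func roundTrip(traps []Trap) ([]Trap, error) {
	var buf bytes.Buffer
	if err := writeTraps(&buf, traps); err != nil {
		return nil, err
	}
	return readTraps(&buf)
}

func main() {
	out, err := roundTrip([]Trap{{1, 2, 3, 4}, {5, 6, 7, 8}})
	fmt.Println(out, err)
}
```

The fixed layout is what makes this approach brittle when the struct changes, which is the concern raised earlier in the thread.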

Dan Kortschak

Apr 20, 2015, 4:04:28 AM
to Egon, golan...@googlegroups.com, dsym...@golang.org, r...@golang.org
On Mon, 2015-04-20 at 00:46 -0700, Egon wrote:
> Does the order matter?

No, not really.

> First I would change the Top/Bottom ... to int8, int16, int32, int64
> depending on what is needed.
> Then I would just mmap the whole structure to a file.

This is something I'd rather avoid. This is research software and is
subject to change. This is the reason I wanted to avoid encoding/binary.

The cost of reading the data is minor cf. the cost of actually performing
the analysis on the data (these are seeds for genome alignment) which
takes on the order of weeks.

> If I need a little packing then probably encoding/binary +
> compression/*.

Yes, compress was going to be part of the solution.

> If I need more packing then custom encoding + custom packing +
> compression/*. i.e. delta encoding, arithmetic encoding, RLE...
> depends on the data-set.


thanks

Egon

Apr 20, 2015, 4:48:18 AM
to golan...@googlegroups.com, dsym...@golang.org, egon...@gmail.com, r...@golang.org


On Monday, 20 April 2015 11:04:28 UTC+3, kortschak wrote:
> On Mon, 2015-04-20 at 00:46 -0700, Egon wrote:
> > Does the order matter?
>
> No, not really.
>
> > First I would change the Top/Bottom ... to int8, int16, int32, int64
> > depending on what is needed.
> > Then I would just mmap the whole structure to a file.
>
> This is something I'd rather avoid. This is research software and is
> subject to change. This is the reason I wanted to avoid encoding/binary.

Do you mean that you need backwards compatibility, or that you need to be able to easily convert from an old format to a newer format?
Or is it simply that experimentation is easier if you don't have to worry about serialization?

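Basically creating a reader/writer for encoding is quite trivial:
https://github.com/egonelbre/exp/blob/master/physicscompress/physics/state.go#L65
https://github.com/egonelbre/exp/blob/master/bit/reflect.go#L8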

Writing a tool that generates that code should be trivial.

For backward compatibility you can keep a map of formats and use either format promotion or backward-compatible readers.
Also, if you write the values as int64 and send them through compression, it should get rid of all the extra zeros in the bit stream.

Also with gob, are you calling Encode once or multiple times? Just in case, you should call Encode only once on the array instead of each value separately.

To avoid the tooBig error with gob, you can serialize as blocks... i.e.

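// Maximum number of Traps per gob message.
const maxBlock = 16 << 20

for len(traps) > 0 {
    // A fresh encoder per block repeats the type header each time.
    enc := gob.NewEncoder(out)
    n := maxBlock
    if n > len(traps) {
        n = len(traps)
    }
    // Assumes the enclosing function returns an error.
    if err := enc.Encode(traps[:n]); err != nil {
        return err
    }
    traps = traps[n:]
}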

You'll get multiple headers - but compared to the whole dataset it isn't a big deal.

+ Egon
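
For completeness, a sketch of the block approach with the reading side included. This is a variation rather than the exact code above: it uses a single encoder and a single decoder for all blocks, so the type information is sent only once, and the reader simply calls Decode until io.EOF. The function names and the blockLen parameter are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"errors"
	"fmt"
	"io"
)

type Trap struct {
	Top, Bottom int
	Left, Right int
}

// writeBlocks encodes traps as a sequence of gob messages holding at most
// blockLen elements each, so no single message approaches the size limit.
func writeBlocks(w io.Writer, traps []Trap, blockLen int) error {
	if blockLen <= 0 {
		return errors.New("blockLen must be positive")
	}
	enc := gob.NewEncoder(w) // one encoder: type info is sent once
	for len(traps) > 0 {
		n := blockLen
		if n > len(traps) {
			n = len(traps)
		}
		if err := enc.Encode(traps[:n]); err != nil {
			return err
		}
		traps = traps[n:]
	}
	return nil
}

// readBlocks decodes messages until the stream ends and concatenates them.
func readBlocks(r io.Reader) ([]Trap, error) {
	dec := gob.NewDecoder(r)
	var all []Trap
	for {
		var block []Trap // fresh slice per message
		err := dec.Decode(&block)
		if err == io.EOF {
			return all, nil
		}
		if err != nil {
			return nil, err
		}
		all = append(all, block...)
	}
}

// roundTrip builds n sample traps, writes them in blocks and reads them back.
func roundTrip(n, blockLen int) ([]Trap, error) {
	traps := make([]Trap, n)
	for i := range traps {
		traps[i] = Trap{Top: i, Bottom: i + 1, Left: i + 2, Right: i + 3}
	}
	var buf bytes.Buffer
	if err := writeBlocks(&buf, traps, blockLen); err != nil {
		return nil, err
	}
	return readBlocks(&buf)
}

func main() {
	out, err := roundTrip(10, 3)
	fmt.Println(len(out), err)
}
```

If a new encoder is created per block, as in the loop above, the reader would presumably need a matching new decoder per block over a reader that implements io.ByteReader, since gob.NewDecoder otherwise adds its own buffering.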

Dan Kortschak

Apr 20, 2015, 5:06:31 AM
to Egon, golan...@googlegroups.com, dsym...@golang.org, r...@golang.org
On Mon, 2015-04-20 at 01:48 -0700, Egon wrote:
> > This is something I'd rather avoid. This is research software and is
> > subject to change. This is the reason I wanted to avoid encoding/binary.
> >
>
> Do you mean that you need backwards-compatibility or you need to be easily
> convert from old format to a newer format?
> Or simply for experimentation it's easier if you don't have to worry about
> serialization?

A little of both, mainly the former though.

> Basically creating a reader/writer for encoding is quite trivial:
> https://github.com/egonelbre/exp/blob/master/physicscompress/physics/state.go#L65
> https://github.com/egonelbre/exp/blob/master/bit/reflect.go#L8
>
> Writing a tool that generates that code should be trivial.
>
> For backcomp you can use a map to formats and either use format promotion
> or backcomp readers.

Yes, this was something that I was considering.

> Also, if you write the values as int64 and send it through compression it
> should get rid of all the extra zeros in the bit stream.
>
> Also with gob, are you calling Encode once or multiple times? Just in case,
> you should call Encode only once on the array instead of each value
> separately.

Once.

> To avoid the tooBig error with gob, you can serialize as blocks... i.e.
>
> const maxBlock = 16<<20
>
> for len(traps) > 0 {
> enc := gob.NewEncoder(out)
> n := maxBlock
> if n > len(traps) {
> n = len(traps)
> }
> err := enc.Encode(traps[:n])
> traps = traps[n:]
> }
>
> You'll get multiple headers - but compared to the whole dataset it isn't a
> big deal.

Thank you for all of this.

I have looked over the gob code to add the proposed method and it was
not hard to do, so I'm just waiting on r or dsymonds to say yes or no
since the change also has the possibility of tightening the restriction
for cases where that may be useful for reducing DoS targets where
expected messages are known to be small.

Rob Pike

Apr 20, 2015, 10:56:54 AM
to Dan Kortschak, Egon, golan...@googlegroups.com, David Symonds
The underlying issue is that the gob protocol requires that the whole message be in memory both for encoding and decoding. An arbitrarily large message cannot work.

I am uncomfortable just setting the limit. Although in your case it seems reasonable, there are other constraints, like the need for a gob written on one machine to be readable on another, which is why the limit is what it is.

In your case I would either just write a custom codec or just raise the maximum or disable the check.

-rob

Dan Kortschak

Apr 20, 2015, 2:20:03 PM
to Rob Pike, Egon, golan...@googlegroups.com, David Symonds
Thanks. Comments in line.

On 21/04/2015, at 12:26 AM, "Rob Pike" <r...@golang.org> wrote:

> The underlying issue is that the gob protocol requires that the whole message be in memory both for encoding and decoding. An arbitrarily large message cannot work.

The change I have doesn't make it arbitrarily large; it just allows the client to specify the limit within an upper (currently MaxInt, but could be smaller) and lower (0) bound.

> I am uncomfortable just setting the limit. Although in your case it seems reasonable, there are other constraints, like the need for a gob written on one machine to be readable on another, which is why the limit is what it is.

This is currently not true. You can write arbitrarily large gobs without any problem; it is only when it comes time to read them that it fails. My change keeps this behaviour but allows the client to be more or less restrictive about the size of messages received.

> In your case I would either just write a custom codec or just raise the maximum or disable the check.

That's probably what I will do.

Dan Kortschak

Apr 20, 2015, 2:31:16 PM
to Rob Pike, Egon, golan...@googlegroups.com, David Symonds
To clarify, the default limit is still tooBig (1<<30), but can be altered within this range.

Rob Pike

Apr 20, 2015, 2:36:07 PM
to Dan Kortschak, Egon, golan...@googlegroups.com, David Symonds
The entire message is built in memory before being transmitted, so the length can be written as a prefix.

-rob
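
To illustrate the point about length prefixes, here is a generic framing sketch. It is not gob's wire format or API: it uses a uvarint prefix from encoding/binary, and writeFrame, readFrame and the limit parameter are hypothetical names. The writer needs the complete payload before it can emit the length, and the reader checks the declared length against a limit before allocating, which is the role tooBig plays in the gob decoder.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeFrame writes len(payload) as a uvarint prefix, then the payload.
// The payload must be fully built in memory before the prefix is known.
func writeFrame(w io.Writer, payload []byte) error {
	var hdr [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(hdr[:], uint64(len(payload)))
	if _, err := w.Write(hdr[:n]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readFrame reads one frame, rejecting any declared size above limit
// before allocating a buffer for it.
func readFrame(r *bufio.Reader, limit uint64) ([]byte, error) {
	size, err := binary.ReadUvarint(r)
	if err != nil {
		return nil, err
	}
	if size > limit {
		return nil, fmt.Errorf("frame of %d bytes exceeds limit of %d", size, limit)
	}
	buf := make([]byte, size)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

// roundTrip frames payload into a buffer and reads it back under limit.
func roundTrip(payload string, limit uint64) (string, error) {
	var buf bytes.Buffer
	if err := writeFrame(&buf, []byte(payload)); err != nil {
		return "", err
	}
	got, err := readFrame(bufio.NewReader(&buf), limit)
	return string(got), err
}

func main() {
	fmt.Println(roundTrip("hello", 16))
	fmt.Println(roundTrip("hello", 2))
}
```

In this sketch the limit is chosen by the reader, so a service expecting only small messages could pass a small value; whether gob should expose such a knob is the question under discussion.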

Dan Kortschak

Apr 20, 2015, 2:40:27 PM
to Rob Pike, Egon, golan...@googlegroups.com, David Symonds
I don't see how making the limit per-decoder harms that.

Rob Pike

Apr 20, 2015, 3:31:32 PM
to Dan Kortschak, Egon, golan...@googlegroups.com, David Symonds
I was just rebutting your point that it is possible to write arbitrarily large messages. You are limited by available memory as well as the static check.

-rob

Dan Kortschak

Apr 20, 2015, 4:54:30 PM
to Rob Pike, Egon, golan...@googlegroups.com, David Symonds
OK, so it's not really a rebuttal then; I can write 10GB gobs right now - I can't read them. So it is already the case that for some (probably not small) set of machines it is possible to write gobs that cannot be read on another machine (or even that same machine).

The current approach seems fine for what it was intended for, but is blunt; it prevents broader use, and also prevents a fine tuning of the DoS guarding characteristics for cases where only very small messages are expected.

Is there any particular reason why a client should not be able to set a per-decoder limit by a method call, with the default behaviour being what it is now?

Rob Pike

Apr 20, 2015, 5:28:06 PM
to Dan Kortschak, Egon, golan...@googlegroups.com, David Symonds
I am resisting because APIs that let you choose parameters like this enable bad designs.  The constraint is there for a reason: bad data can cause bad things to happen to the heap, and the easiest way to get bad data is to have a bad size in the input.

-rob

Dan Kortschak

Apr 20, 2015, 5:52:58 PM
to Rob Pike, Egon, golan...@googlegroups.com, David Symonds
Thanks. So are you saying that if someone were to set the limit too high for their machine to cope with the input that will lead to heap corruption?

I'm really struggling to see how this would happen in the new situation where it could not already happen today given the diversity of hardware that runs Go. Can you explain this?

The bad cases I see are: 1) the limit is too large for the machine and gobs may be accepted that cannot be properly handled, or 2) the limit is too small for legitimate messages to be accepted. The first case already potentially exists for arm devices like the pi and the second case just fails with an error.

It was my understanding that the limit was there to prevent DoS of a message receiver. Having a limit that can be tuned down potentially helps there when you have a service that only expects small messages where currently a DoS can keep the messages at the 1GB limit to maximise cost to the service.

I also don't really understand what you mean by "bad size in the input". I'm not proposing any changes to input (no changes to Encoder), just a change that allows longer (or shorter) message limits.

Rob Pike

Apr 20, 2015, 6:24:30 PM
to Dan Kortschak, Egon, golan...@googlegroups.com, David Symonds
I am saying that I'd prefer not to change the API, which implies, as I said before, that your best solution is to change it yourself or perhaps write a custom codec.

I really dislike knobs in APIs.

The code should probably complain in the encoder if it's too big; I'll file an issue for that.


-rob

Dan Kortschak

Apr 20, 2015, 6:27:30 PM
to Rob Pike, Egon, golan...@googlegroups.com, David Symonds
On Mon, 2015-04-20 at 15:23 -0700, Rob Pike wrote:
> I am saying that I'd prefer not to change the API, which implies, as I
> said before, that your best solution is to change it yourself or
> perhaps write a custom codec.
>
> I really dislike knobs in APIs.
>
> The code should probably complain in the encoder if it's too big; I'll
> file an issue for that.

Thanks. Maybe remove the TODO on tooBig then as well.

Ugorji Nwoke

Apr 20, 2015, 7:26:00 PM
to golan...@googlegroups.com, dan.ko...@adelaide.edu.au, dsym...@golang.org, egon...@gmail.com
Following up on what Rob said, you can write a custom codec or use one which supports explicit delimiters at the end of lists (like JSON's "]"), as opposed to length-prefixing. CBOR is a really good encoding format and supports this; the feature is called indefinite-length arrays.

The go library fully supports this: http://ugorji.net/blog/go-codec-primer ; go get github.com/ugorji/go/codec
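
The benefit of delimiter-terminated lists can be shown with the standard library alone. The sketch below uses encoding/json rather than CBOR (an assumption made to keep it dependency-free): because the array ends with "]" instead of starting with a length, the decoder can process one element at a time without holding the whole list in memory. The function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

type Trap struct{ Top, Bottom, Left, Right int }

// sumTops streams a JSON array of Traps, decoding one element at a time.
func sumTops(r io.Reader) (int, error) {
	dec := json.NewDecoder(r)
	tok, err := dec.Token() // consume the opening '['
	if err != nil {
		return 0, err
	}
	if tok != json.Delim('[') {
		return 0, fmt.Errorf("expected start of array, got %v", tok)
	}
	sum := 0
	for dec.More() { // true until the closing ']' is next
		var t Trap
		if err := dec.Decode(&t); err != nil {
			return 0, err
		}
		sum += t.Top
	}
	if _, err := dec.Token(); err != nil { // consume the closing ']'
		return 0, err
	}
	return sum, nil
}

// sumTopsString is a convenience wrapper for string input.
func sumTopsString(s string) (int, error) {
	return sumTops(strings.NewReader(s))
}

func main() {
	fmt.Println(sumTopsString(`[{"Top":1},{"Top":2},{"Top":3}]`))
}
```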

Dan Kortschak

Apr 20, 2015, 7:36:03 PM
to Ugorji Nwoke, golan...@googlegroups.com
Thanks Ugorji, I have been occasionally looking for a cbor
implementation and it probably is what I want here and more generally.
(Seb's post reminded me of that).

Peter Vessenes

Apr 20, 2015, 8:30:09 PM
to golan...@googlegroups.com, ugo...@gmail.com
Just to pipe in, in our production code we have increased the size limit in the gob decoding routine by a few orders of magnitude with no troubles by twiddling that constant.

It took a while to figure out why encodes worked and decodes didn't from current error messages, so a bit more specificity in the message would be nice. As a user of your package, I'd also recommend enforcing the size constraint on write: what is the purpose of an asymmetric API like this? It would be simpler to remember the "gob limit" once and for all if you are trying to keep the mental overhead of using the package as low as possible.

In the end, I went with binary encoding, gob reads are just too slow. Kortschak, you mentioned the program will run for weeks and so load times don't matter, but if you are going to be recompiling/reloading/testing, you may find the wait times annoying.

We got roughly a 6-10x speedup using binary.Read, and then a further 2x+ speedup using binary.ReadUint32 (or whatever) in place of binary.Read. It's really not much trouble for your datamodel to implement, and you may find you enjoy the benefits.

Best,

Peter
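
A sketch of the two decoding styles being compared. encoding/binary has no function named ReadUint32; the assumption here is that the faster variant refers to the ByteOrder methods such as binary.LittleEndian.Uint32, which decode directly from a byte slice, whereas binary.Read falls back to reflection for struct values. The Trap layout with int32 fields and the function names are hypothetical, and no timings are claimed.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// Trap uses fixed-width fields so that its encoded size is known.
type Trap struct{ Top, Bottom, Left, Right int32 }

const trapSize = 16 // four 4-byte fields

// readReflect uses binary.Read, which handles struct values via reflection.
func readReflect(r io.Reader) (Trap, error) {
	var t Trap
	err := binary.Read(r, binary.LittleEndian, &t)
	return t, err
}

// readManual reads the raw bytes once and decodes each field directly.
func readManual(r io.Reader) (Trap, error) {
	var buf [trapSize]byte
	if _, err := io.ReadFull(r, buf[:]); err != nil {
		return Trap{}, err
	}
	le := binary.LittleEndian
	return Trap{
		Top:    int32(le.Uint32(buf[0:])),
		Bottom: int32(le.Uint32(buf[4:])),
		Left:   int32(le.Uint32(buf[8:])),
		Right:  int32(le.Uint32(buf[12:])),
	}, nil
}

// decodeBoth writes the same record twice and reads it back once with
// each approach, so the results can be compared.
func decodeBoth() (Trap, Trap, error) {
	var buf bytes.Buffer
	in := Trap{1, -2, 3, -4}
	for i := 0; i < 2; i++ {
		if err := binary.Write(&buf, binary.LittleEndian, in); err != nil {
			return Trap{}, Trap{}, err
		}
	}
	a, err := readReflect(&buf)
	if err != nil {
		return Trap{}, Trap{}, err
	}
	b, err := readManual(&buf)
	return a, b, err
}

func main() {
	fmt.Println(decodeBoth())
}
```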

Dan Kortschak

Apr 20, 2015, 9:19:23 PM
to Peter Vessenes, golan...@googlegroups.com, ugo...@gmail.com
On Mon, 2015-04-20 at 17:30 -0700, Peter Vessenes wrote:
> In the end, I went with binary encoding, gob reads are just too slow.
> Kortschak, you mentioned the program will run for weeks and so load
> times don't matter, but if you are going to be
> recompiling/reloading/testing, you may find the wait times annoying.

When I run tests, I use small data sets. The loads aren't noticeable.
When I run analytical jobs, I just let the machine do what it needs to
and walk away.

> We got roughly a 6-10x speedup using binary.Read, and then a further
> 2x+ speedup using binary.ReadUint32 (or whatever) in place of
> binary.Read. It's really not much trouble for your datamodel to
> implement, and you may find you enjoy the benefits.

Yeah. I may do that.

Dan Kortschak

Apr 20, 2015, 9:51:55 PM
to Ugorji Nwoke, golan...@googlegroups.com
On Mon, 2015-04-20 at 16:26 -0700, Ugorji Nwoke wrote:
> The go library fully supports this:
> http://ugorji.net/blog/go-codec-primer
> ; go get github.com/ugorji/go/codec

Can you explain why the codecgen generator doesn't allow unexported
types to be generated? That doesn't match other codec types* (or the
non-generated behaviour) and the generator doesn't warn in cases where
these types are skipped.

* It's perfectly reasonable to do this:

type t struct {
    Name string `name:"name"`
    Field int `json:"field"`
}

and have encoding/json or go-codec encode/decode this type.

Ugorji Nwoke

Apr 20, 2015, 9:55:21 PM
to Dan Kortschak, golan...@googlegroups.com

Hmm. No reason - sounds like a bug/oversight. I will look into it in the next couple of days.

Sebastien Binet

Apr 21, 2015, 12:38:51 PM
to Ugorji Nwoke, Dan Kortschak, golang-nuts

By the way, it would be great if you could expose a simpler CBOR encode/decode API.
Something more akin to, say, the json or gob ones.

i.e.:
err := cbor.NewEncoder(w).Encode(v)

I understand why your go/codec package exposes such a sophisticated API, but being able to just have a json/gob-style drop-in one would be quite neat.

-s


Ugorji Nwoke

Apr 21, 2015, 9:59:57 PM
to golan...@googlegroups.com, dan.ko...@adelaide.edu.au
Filed and fixed. 


Please re-test and verify and update the above-referenced bug if things still are not working well.

Ugorji Nwoke

Apr 21, 2015, 10:03:57 PM
to golan...@googlegroups.com, dan.ko...@adelaide.edu.au, ugo...@gmail.com
The equivalent codec one-liner API is: 
    err := codec.NewEncoder(w, new(codec.CborHandle)).Encode(v)
compared to stdlib encoding/json
    err := json.NewEncoder(w).Encode(v)

The only difference is that you have to pass a handle, since codec supports multiple formats with options, but the zero-value works by default.

I think there is parity between them.

Sebastien Binet

Apr 23, 2015, 7:56:22 AM
to Ugorji Nwoke, golang-nuts, Dan Kortschak
On Wed, Apr 22, 2015 at 4:03 AM, Ugorji Nwoke <ugo...@gmail.com> wrote:
> The equivalent codec one-liner API is:
> err := codec.NewEncoder(w, new(codec.CborHandle)).Encode(v)
> compared to stdlib encoding/json
> err := json.NewEncoder(w).Encode(v)
>
> The only difference is that you have to pass a handle, since codec supports
> multiple formats with options, but the zero-value works by default.
>
> I think there is parity in between them.

right.
I still think there is value to expose a simple API.

so I wrote this thin wrapper around go/codec.CborXYZ:
https://github.com/gonuts/cbor
(the archive clocks in at around 150k)

-s

(thanks again for providing CBOR encoding/decoding!)