Compressed flatbuffers larger than compressed JSON


jtr...@twitter.com

Feb 6, 2016, 1:44:28 PM
to FlatBuffers, John Schulz
Hello --

So this might just be my ignorance about gzip and compression in general, but I'm kind of confused by what I'm seeing, which is:

Starting from a JSON API, we've generated an equivalent FlatBuffers schema. Using flatc, we generate the FlatBuffers binary from a sample JSON payload. The FlatBuffers binary is (not surprisingly) far smaller than the JSON data that was used to generate it.
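For reference, the conversion step amounts to roughly the following (a minimal C++ sketch using the FlatBuffers IDL parser rather than the flatc command line we actually ran; the file names are placeholders):

    // Rough equivalent of "flatc -b schema.fbs payload.json", using the
    // FlatBuffers C++ parser. File names below are placeholders.
    #include <cstdio>
    #include <string>
    #include "flatbuffers/idl.h"
    #include "flatbuffers/util.h"

    int main() {
      std::string schema, json;
      if (!flatbuffers::LoadFile("schema.fbs", /*binary=*/false, &schema) ||
          !flatbuffers::LoadFile("payload.json", /*binary=*/false, &json)) {
        std::fprintf(stderr, "could not load input files\n");
        return 1;
      }
      flatbuffers::Parser parser;
      // Parse the schema first, then the JSON payload against that schema.
      if (!parser.Parse(schema.c_str()) || !parser.Parse(json.c_str())) {
        std::fprintf(stderr, "parse error: %s\n", parser.error_.c_str());
        return 1;
      }
      // The finished binary lives in the parser's builder; this size is what
      // we then compare against the original JSON, before and after gzip.
      std::printf("FlatBuffer binary size: %u bytes\n",
                  (unsigned)parser.builder_.GetSize());
      return 0;
    }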

However, if we gzip the binary and the JSON, the JSON ends up being significantly smaller. 

I understand that, as a percentage of the original size, the JSON is expected to compress much more, since it has so much redundancy relative to the FlatBuffers binary representation. But I wouldn't have guessed that gzip would be able to make the JSON smaller than the FlatBuffers representation.

So is this surprising? Is something about the FlatBuffers format challenging for gzip? Are there best practices for organizing schemas that make the binary representation more compressible?

(I'll post my schema, JSON, and steps to this thread later today.)

Thanks!

Justin

Wouter van Oortmerssen

Feb 8, 2016, 2:41:13 PM
to jtr...@twitter.com, FlatBuffers, John Schulz
We did test gzipped payloads in the past, and for us, gzipped FlatBuffer binaries came out significantly smaller than gzipped JSON: https://google.github.io/flatbuffers/flatbuffers_benchmarks.html

However, these were quite small payloads, and for larger ones it is possible that the compressor is able to deal more efficiently with the JSON redundancy.

One thing that is challenging about a FlatBuffer binary is all the offsets it contains. In a small binary those tend to be small values with lots of zero bytes (which compress well), but in larger binaries you end up with large, "random"-looking numbers that don't compress well.
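A quick way to get a feel for this on your own buffers is to count the zero bytes; here is a throwaway diagnostic (not part of FlatBuffers, just an illustration):

    // Hypothetical diagnostic: count zero bytes in a file to see roughly how
    // much "easy" redundancy a compressor like gzip has to work with.
    #include <cstdio>
    #include <fstream>
    #include <iterator>
    #include <vector>

    int main(int argc, char **argv) {
      if (argc != 2) {
        std::fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
      }
      std::ifstream in(argv[1], std::ios::binary);
      std::vector<unsigned char> buf((std::istreambuf_iterator<char>(in)),
                                     std::istreambuf_iterator<char>());
      size_t zeros = 0;
      for (unsigned char b : buf) {
        if (b == 0) ++zeros;
      }
      std::printf("%zu of %zu bytes are zero (%.1f%%)\n", zeros, buf.size(),
                  buf.empty() ? 0.0 : 100.0 * zeros / buf.size());
      return 0;
    }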

It would be good to see your data, to check if there's anything that can be improved.


jsc...@twitter.com

Feb 12, 2016, 5:11:28 PM
to FlatBuffers, jtr...@twitter.com, jsc...@twitter.com
I work with the OP and have attached our data.

Here are the sizes before and after running gzip --best

425K  typeahead_prefetch_1000_users.bin
126K  typeahead_prefetch_1000_users.bin.gz

841K  typeahead_prefetch_1000_users.json 
 87K  typeahead_prefetch_1000_users.json.gz

This is my first attempt at a FlatBuffer schema, so it's likely sub-optimal. 

Thanks for the help,
John
(Attachment: Typeahead FBS.zip)

Wouter van Oortmerssen

Feb 12, 2016, 5:32:31 PM
to John Schulz, FlatBuffers, jtr...@twitter.com
There's nothing particularly wrong with your schema or data. It confirms what I suspected: at such large data sizes, the compressor is very efficient at removing the JSON redundancy, but not at the relatively random-looking offsets you'd find in large FlatBuffer files.

The data is string-heavy; anything that turns strings or tables into something cheaper would help. Some smaller optimisations you could do:
  • Omit the id_str field, it can be generated from id.
  • Make social_context a struct instead of a table, if you don't intend to add to it later.
  • Make tokens a vector of strings, instead of a vector of tables of strings.
  • Omit connecting_user_count since you can query the size of connecting_user_ids.
  • Omit the default prefix of those urls.
  • (This one is more elaborate:) you could try string-pooling strings that you expect to occur more than once, e.g. location values. FlatBuffers allows sharing of strings in the binary (you simply reuse the offset; see the sketch after this list). Of course, if the strings form a limited set, enums are even better.
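For the string-pooling point, the C++ side is just a matter of caching and reusing the offset returned by CreateString, something like this minimal sketch (newer FlatBuffers versions may also provide a CreateSharedString helper that does the same thing internally):

    // Sketch: store each distinct string once by caching its offset and
    // reusing it for later occurrences (e.g. repeated location values).
    #include <map>
    #include <string>
    #include "flatbuffers/flatbuffers.h"

    flatbuffers::Offset<flatbuffers::String> PooledString(
        flatbuffers::FlatBufferBuilder &fbb,
        std::map<std::string, flatbuffers::Offset<flatbuffers::String>> &pool,
        const std::string &s) {
      auto it = pool.find(s);
      if (it != pool.end()) return it->second;  // reuse the existing offset
      auto off = fbb.CreateString(s);           // write the string once
      pool[s] = off;
      return off;
    }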

mikkelfj

Feb 13, 2016, 7:17:37 AM
to FlatBuffers, jsc...@twitter.com, jtr...@twitter.com


On Friday, February 12, 2016 at 11:32:31 PM UTC+1, Wouter van Oortmerssen wrote:
There's nothing particularly wrong with your schema or data. It confirms what I suspected: at such large data sizes, the compressor is very efficient at removing the JSON redundancy, but not at the relatively random-looking offsets you'd find in large FlatBuffer files.

FlatBuffers is sort of indexed data, so it is perhaps not that surprising. But it is interesting to think that serializing FlatBuffers to JSON before compression could be a viable storage and transmission strategy when you have the time and space for the processing, given that FlatBuffers-driven JSON parsing and printing can be much faster than the compression overhead (short of LZ4). As an added benefit you get increased portability for consumers that do not speak FlatBuffers.

BTW, LZ4 compresses the .bin to about 200K and the .json to about 150K. That is roughly double the size of the best compression achievable with 7zip (.bin) and bzip2 (.json).

The fastest transmission option is probably FlatBuffers + LZ4. That is still about 4 times smaller than the raw JSON, and very fast.
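Something along these lines, assuming liblz4 is available (a sketch of the compression side only; the uncompressed size has to travel alongside the blob so the receiver can call LZ4_decompress_safe):

    // Sketch: compress a finished FlatBuffer as an opaque byte blob with LZ4
    // before sending it. Assumes liblz4 (link with -llz4).
    #include <cstdio>
    #include <vector>
    #include "lz4.h"

    std::vector<char> CompressBuffer(const char *data, int size) {
      std::vector<char> out(LZ4_compressBound(size));
      int written = LZ4_compress_default(data, out.data(), size,
                                         static_cast<int>(out.size()));
      if (written <= 0) {
        std::fprintf(stderr, "LZ4 compression failed\n");
        out.clear();
      } else {
        out.resize(written);
      }
      return out;
    }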

gzip is surprisingly faster on the JSON (14 ms) than on the FlatBuffers binary (24 ms), but it is not fast enough to saturate a 1 GB/s link using a 2 GHz core, so the compression gains require additional traffic to be beneficial in such an environment.