JSON dictionary encoding


ChrisLu

Jan 4, 2021, 2:09:44 PM1/4/21
to golang-nuts
Hi,

For a list of JSON objects, the key names are usually repeated.
e.g., {"aaaa": 1, "bbbb": 2}, {"aaaa": 2, "bbbb": 3}, ...

The key names, "aaaa" and "bbbb" in the above example, could be very long.
Is there an existing library that already encodes JSON objects via a dictionary?

This is a JSON-specific compression. It would be good to see the compression ratio compared to gzip, which uses general-purpose dictionary encoding.
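Something hand-rolled like the sketch below is roughly what I mean (purely illustrative, not an existing library; the encodeWithDict helper and the "k0"/"k1" tokens are made up for the example): each long key is replaced by a short token, and a token-to-key dictionary is shipped alongside the rows.

package main

import (
	"encoding/json"
	"fmt"
)

// encodeWithDict rewrites each object's keys as "k0", "k1", ... and
// returns the rewritten objects plus the token -> original-key dictionary.
func encodeWithDict(rows []map[string]interface{}) ([]map[string]interface{}, map[string]string) {
	tokens := map[string]string{} // original key -> token
	dict := map[string]string{}   // token -> original key
	out := make([]map[string]interface{}, 0, len(rows))
	for _, row := range rows {
		enc := map[string]interface{}{}
		for k, v := range row {
			t, ok := tokens[k]
			if !ok {
				t = fmt.Sprintf("k%d", len(tokens))
				tokens[k] = t
				dict[t] = k
			}
			enc[t] = v
		}
		out = append(out, enc)
	}
	return out, dict
}

func main() {
	rows := []map[string]interface{}{
		{"aaaa": 1, "bbbb": 2},
		{"aaaa": 2, "bbbb": 3},
	}
	enc, dict := encodeWithDict(rows)
	b, _ := json.Marshal(map[string]interface{}{"dict": dict, "rows": enc})
	fmt.Println(string(b))
	// e.g. {"dict":{"k0":"aaaa","k1":"bbbb"},"rows":[{"k0":1,"k1":2},{"k0":2,"k1":3}]}
}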

Howard C. Shaw III

Jan 4, 2021, 2:56:32 PM1/4/21
to golang-nuts
It is not JSON at that point. If you want a JSON-compatible format that compresses, there are dozens of them: some just elide the key names, while others compress the data elements as well. Here is an article discussing many of them: https://www.lucidchart.com/techblog/2019/12/06/json-compression-alternative-binary-formats-and-compression-methods/

Many of them have Go-language implementations.

Jesper Louis Andersen

Jan 4, 2021, 3:18:03 PM1/4/21
to ChrisLu, golang-nuts
On Mon, Jan 4, 2021 at 8:09 PM ChrisLu <chri...@gmail.com> wrote:
Hi,

For a list of JSON objects, the key names are usually repeated.
e.g., {"aaaa": 1, "bbbb": 2}, {"aaaa": 2, "bbbb": 3}, ...

The key names, "aaaa" and "bbbb" in the above example, could be very long.
Is there an existing library that already encodes JSON objects via a dictionary?

This is a JSON-specific compression. It would be good to see the compression ratio compared to gzip, which uses general-purpose dictionary encoding.


It is a bit old, but take a look at Transit. It is a JSON-to-JSON encoding format with a rolling implicit cache on the objects, much like the dictionary method you would see in a compression system. One particularly interesting property is that decoding Transit into a language representation can often be faster than decoding the exact same gzipped JSON: once un-gzipped, the parser has to work on more bytes than in the Transit case.
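Roughly, the rolling-cache idea looks like the sketch below (a simplified illustration only, not Transit's actual wire format or API; the cacheWriter/cacheReader types and the "^0"-style codes are placeholders): the first occurrence of a key goes out in full, later occurrences go out as a short cache code, and the reader rebuilds the same cache as it decodes.

package main

import "fmt"

// cacheWriter hands out short codes ("^0", "^1", ...) for keys it has
// already seen; the first occurrence is written in full.
type cacheWriter struct {
	codes map[string]string // key -> cache code
}

func (w *cacheWriter) encodeKey(k string) string {
	if code, ok := w.codes[k]; ok {
		return code // repeated key: emit the short code
	}
	w.codes[k] = fmt.Sprintf("^%d", len(w.codes))
	return k // first occurrence: emit the key itself and remember it
}

// cacheReader rebuilds the same cache on the decoding side.
// (A real key starting with '^' would need escaping; the sketch ignores that.)
type cacheReader struct {
	keys []string // cache index -> original key
}

func (r *cacheReader) decodeKey(s string) string {
	if len(s) > 1 && s[0] == '^' {
		var i int
		fmt.Sscanf(s[1:], "%d", &i)
		return r.keys[i]
	}
	r.keys = append(r.keys, s)
	return s
}

func main() {
	w := &cacheWriter{codes: map[string]string{}}
	r := &cacheReader{}
	for _, k := range []string{"aaaa", "bbbb", "aaaa", "bbbb"} {
		enc := w.encodeKey(k)
		fmt.Printf("%s -> wire %q -> %s\n", k, enc, r.decodeKey(enc))
	}
}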

My usual approach nowadays is often just to use Protobuf 3, especially if you have a good grasp of the structure of the data and it is not very volatile. If you have very large amounts of highly tabular data, you could look into ORC (Optimized Row Columnar) or Parquet, since the columnar formats tend to win out there. The usual key trick in those formats is two-fold: decouple the description from the data, and store identical data close together by looking at columns rather than rows.
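As a crude sketch of the columnar re-layout (illustrative only, assuming every row has the same keys; toColumns is a made-up helper, not ORC or Parquet):

package main

import (
	"encoding/json"
	"fmt"
)

// toColumns turns a list of row objects into one array per key, so each
// key name is stored only once no matter how many rows there are.
func toColumns(rows []map[string]interface{}) map[string][]interface{} {
	cols := map[string][]interface{}{}
	for _, row := range rows {
		for k, v := range row {
			cols[k] = append(cols[k], v)
		}
	}
	return cols
}

func main() {
	rows := []map[string]interface{}{
		{"aaaa": 1, "bbbb": 2},
		{"aaaa": 2, "bbbb": 3},
	}
	b, _ := json.Marshal(toColumns(rows))
	fmt.Println(string(b)) // {"aaaa":[1,2],"bbbb":[2,3]}
}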

For a completely different approach, there is Joe Armstrong's UBF[0]. Data is encoded as a program that can be executed by a simple stack machine with registers for caching. Decoding amounts to executing said stack machine to produce the language-specific representation. For instance,

'person'>p # {p "Joe" 123} & {p 'fred' 3~abc~} & $
will store the string 'person' in the p register and then proceed to use it in the following structure definitions. Formats like these are interesting hybrids: they are self-describing (like JSON), but they also allow for compression to avoid the repetition you get with JSON. So they fall somewhere between Protobuf and the columnar formats.
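As a toy illustration of the register-caching idea only (greatly simplified, not actual UBF syntax or semantics; the run function and its token scheme are invented for the example):

package main

import (
	"fmt"
	"strings"
)

// run executes a tiny token "program": plain tokens are pushed onto a
// stack, ">x" pops the top of the stack into register x, and a later
// bare "x" pushes the cached value back instead of repeating it.
func run(program string) []string {
	regs := map[string]string{}
	var stack []string
	for _, tok := range strings.Fields(program) {
		switch {
		case strings.HasPrefix(tok, ">"): // store top of stack in a register
			regs[tok[1:]] = stack[len(stack)-1]
			stack = stack[:len(stack)-1]
		case regs[tok] != "": // known register: push its cached value
			stack = append(stack, regs[tok])
		default: // literal token: push as-is
			stack = append(stack, tok)
		}
	}
	return stack
}

func main() {
	// "person" is spelled out once, cached in register p, and reused twice.
	fmt.Println(run("person >p p Joe p fred"))
	// [person Joe person fred]
}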

[0] Universal Binary Format: http://ubf.github.io/ubf/ubf-user-guide.en.html 



--
J.