S2: High speed compression with upgrade path from Snappy.


Klaus Post

Aug 26, 2019, 6:29:29 AM
to golang-nuts

Hi!


S2 is an extension of Snappy. It can decode Snappy content, but S2 content cannot be decoded by Snappy. S2 is aimed at high throughput, so it features concurrent compression for bigger payloads (streams).

Benefits over Snappy:

  • Better compression

  • Concurrent stream compression - several GB/sec.

  • Faster decompression

  • Ability to quickly skip forward in compressed stream

  • Compatible with Snappy compressed content

  • Smaller max block size overhead on incompressible blocks.

  • An alternative, more efficient, but slightly slower compression mode is available.

  • Block concatenation.

  • Automatic stream size padding.

Drawbacks over Go Snappy:

  • Not optimized for 32-bit systems.

  • No AMD64 assembler compression implementation yet, meaning slightly slower compression speed on a single-core CPU.

  • Uses slightly more memory due to larger blocks and concurrency (configurable).


The decompression compatibility means S2 can be used as a one-way upgrade from Snappy without recompressing existing data.

This package is aimed at replacing Snappy as a high-speed compression package. If you are mainly looking for a better compression ratio, zstandard compresses better, but typically at speeds slightly below the "better" mode in this package.
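
For illustration, here is a minimal sketch of wiring S2 into a stream pipeline. The file names are made up and the details are best checked against the godoc; the reader side also accepts Snappy-framed streams:

```go
// Minimal sketch: compressing a file with S2 and reading it back.
// NewWriter/NewReader are the streaming entry points; the file names
// here are only illustrative.
package main

import (
    "io"
    "log"
    "os"

    "github.com/klauspost/compress/s2"
)

func main() {
    src, err := os.Open("payload.bin") // hypothetical input
    if err != nil {
        log.Fatal(err)
    }
    defer src.Close()

    dst, err := os.Create("payload.bin.s2")
    if err != nil {
        log.Fatal(err)
    }
    defer dst.Close()

    enc := s2.NewWriter(dst) // streams are compressed concurrently by default
    if _, err := io.Copy(enc, src); err != nil {
        log.Fatal(err)
    }
    if err := enc.Close(); err != nil { // flush any buffered data
        log.Fatal(err)
    }

    // Decompression side: the reader handles both S2 and Snappy streams.
    in, err := os.Open("payload.bin.s2")
    if err != nil {
        log.Fatal(err)
    }
    defer in.Close()

    dec := s2.NewReader(in)
    if _, err := io.Copy(io.Discard, dec); err != nil {
        log.Fatal(err)
    }
}
```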

Github, with more details and benchmarks: https://github.com/klauspost/compress/tree/master/s2#s2-compression


Jason E. Aten

Aug 26, 2019, 7:49:23 AM
to golang-nuts
Excellent! Thank you, Klaus!

It sounds like it does, but just to be clear: does S2 also replace/upgrade the snappy *streaming* format (the outer format that is different from snappy itself; e.g. https://github.com/glycerine/go-unsnap-stream )? 

Nigel Tao

Aug 27, 2019, 4:23:06 AM
to Jason E. Aten, golang-nuts
On Mon, Aug 26, 2019 at 9:49 PM Jason E. Aten <j.e....@gmail.com> wrote:
> It sounds like it does, but just to be clear: does S2 also replace/upgrade the snappy *streaming* format (the outer format that is different from snappy itself; e.g. https://github.com/glycerine/go-unsnap-stream )?

Tangential, but the https://github.com/golang/snappy package can also
read the streaming format. As
https://godoc.org/github.com/golang/snappy says, "There are actually
two Snappy formats... The block format is the Decode and Encode
functions and the stream format is the Reader and Writer types."

Nigel Tao

Aug 27, 2019, 4:38:36 AM
to Klaus Post, golang-nuts
Nice work, Klaus!


On Mon, Aug 26, 2019 at 8:29 PM Klaus Post <klau...@gmail.com> wrote:
> Concurrent stream compression - several GB/sec.
> Faster decompression

A number of modern compression formats and implementations allow for
concurrent *compression*.

Coincidentally, I've been working on a format that allows for
concurrent *decompression*.

Well, actually, my primary motivation is random access
(de)compression: being able to reconstruct part of the decompressed
file, starting at a given offset into the decompressed file, without
always having to first decompress all of the preceding data. But being
able to decompress parts of a compressed file independently means that
we can decompress in parallel. In the example below, RAC + ZLIB
decompressed about 8x faster (wall clock time) than ZIP, even though
(1) they're both based on DEFLATE, and (2) the Go DEFLATE decoder
isn't quite as fast as the C one.

----
$ ls -l
-rw-r--r-- 1 tao tao 100000000 Jun 2 2011 enwik8
-rw-r--r-- 1 tao tao 36445475 Sep 2 2011 enwik8.zip
-rw-r--r-- 1 tao tao 37320896 Aug 9 19:16 shared.rac

$ cat enwik8 | sha256sum
2b49720ec4d78c3c9fabaee6e4179a5e997302b3a70029f30f2d582218c024a8 -
$ unzip -p enwik8.zip | sha256sum
2b49720ec4d78c3c9fabaee6e4179a5e997302b3a70029f30f2d582218c024a8 -
$ ractool -decode shared.rac | sha256sum
2b49720ec4d78c3c9fabaee6e4179a5e997302b3a70029f30f2d582218c024a8 -

$ time unzip -p enwik8.zip > /dev/null
real 0m0.737s
user 0m0.713s
sys 0m0.025s
$ time ractool -decode shared.rac > /dev/null
real 0m0.095s
user 0m1.316s
sys 0m0.069s
----

More details are at https://godoc.org/github.com/google/wuffs/cmd/ractool

In particular, RAC is a 'container format', not tied to any one
underlying compression codec like DEFLATE / GZIP / ZLIB. We could
easily have a "RAC + Snappy(2)" flavor.

Klaus Post

Aug 27, 2019, 4:52:12 AM
to golang-nuts


On Monday, 26 August 2019 13:49:23 UTC+2, Jason E. Aten wrote:
Excellent! Thank you, Klaus!

It sounds like it does, but just to be clear: does S2 also replace/upgrade the snappy *streaming* format (the outer format that is different from snappy itself; e.g. https://github.com/glycerine/go-unsnap-stream )? 

Yes. The main change was to allow bigger blocks (up to 4MB) in the streaming format.

The streaming format has new header 'magic bytes', so S2 streams can be recognized - and will be rejected by Snappy decoders.

The only change to the block format is that it can have matches at offset '0', which indicates that the previously used offset should be repeated.
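
As a rough illustration of the one-way compatibility at the block level - a sketch using the block-level Encode/Decode functions of both packages; whether Snappy actually fails on a given S2 block depends on whether any extensions were used for that input:

```go
// Sketch: S2 decodes Snappy blocks, but Snappy cannot be relied on to
// decode S2 blocks (it may fail when S2 extensions such as repeat
// offsets were used).
package main

import (
    "fmt"

    "github.com/golang/snappy"
    "github.com/klauspost/compress/s2"
)

func main() {
    data := []byte("a reasonably repetitive payload payload payload payload")

    // Snappy block -> S2 decoder: always fine.
    fromSnappy := snappy.Encode(nil, data)
    back, err := s2.Decode(nil, fromSnappy)
    fmt.Println(string(back), err)

    // S2 block -> Snappy decoder: not supported in general.
    fromS2 := s2.Encode(nil, data)
    _, err = snappy.Decode(nil, fromS2)
    fmt.Println(err) // may report corruption, depending on the input
}
```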


/Klaus
 

Klaus Post

Aug 27, 2019, 5:24:56 AM
to golang-nuts

On Tuesday, 27 August 2019 10:38:36 UTC+2, Nigel Tao wrote:
Nice work, Klaus!


On Mon, Aug 26, 2019 at 8:29 PM Klaus Post <klau...@gmail.com> wrote:
> Concurrent stream compression - several GB/sec.
> Faster decompression

A number of modern compression formats and implementations allow for
concurrent *compression*.

Coincidentally, I've been working on a format that allows for
concurrent *decompression*.

The streaming format of Snappy (and S2 by inheritance) does allow for both concurrent decompression and fast forward seeking since the compressed block size is known.

I have implemented the skip method here https://godoc.org/github.com/klauspost/compress/s2#Reader.Skip  - it will skip blocks until it actually needs to decode, so everything skipped is basically read at IO speed. It will not allow arbitrary seeking. That would require an index to be stored at the end... and then you are not really dealing with streams any more.
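
A rough sketch of what that looks like in use - the file name and offsets are only illustrative:

```go
// Sketch: fast-forwarding within an S2/Snappy stream using Reader.Skip.
package main

import (
    "io"
    "log"
    "os"

    "github.com/klauspost/compress/s2"
)

func main() {
    f, err := os.Open("large-stream.s2") // hypothetical input
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    dec := s2.NewReader(f)

    // Skip the first GiB of *decompressed* output. Skipped blocks are
    // read but not decoded, so this runs at roughly I/O speed.
    if err := dec.Skip(1 << 30); err != nil {
        log.Fatal(err)
    }

    buf := make([]byte, 4<<20)
    n, err := dec.Read(buf) // decoding resumes from the new position
    if err != nil && err != io.EOF {
        log.Fatal(err)
    }
    log.Printf("read %d bytes after skipping", n)
}
```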

I have not done a concurrent decompressor yet, since I suspect the gain will be fairly minimal given the already high decompression speed. It would be fairly trivial to split an incoming stream into blocks and decode them separately, along the lines of the sketch below.
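
Something like this - a conceptual sketch only, with CRC checks, uncompressed chunks and goroutine limits left out:

```go
// Conceptual sketch: decode a framed Snappy/S2 stream concurrently by
// chunk. Framing per the Snappy framing format: 1 chunk-type byte plus a
// 3-byte little-endian length; compressed-data chunks (type 0x00) carry a
// 4-byte CRC followed by one compressed block.
package main

import (
    "encoding/binary"
    "io"
    "log"
    "os"

    "github.com/klauspost/compress/s2"
)

// decodeConcurrently sends decoded blocks on out in stream order.
func decodeConcurrently(r io.Reader, out chan<- []byte) error {
    ordered := make(chan chan []byte, 16) // per-chunk result slots, in order
    go func() {
        for slot := range ordered {
            out <- <-slot
        }
        close(out)
    }()
    defer close(ordered)

    var hdr [4]byte
    for {
        if _, err := io.ReadFull(r, hdr[:]); err != nil {
            if err == io.EOF {
                return nil
            }
            return err
        }
        chunkType := hdr[0]
        chunkLen := int(binary.LittleEndian.Uint32(hdr[:]) >> 8) // low 3 bytes
        body := make([]byte, chunkLen)
        if _, err := io.ReadFull(r, body); err != nil {
            return err
        }
        if chunkType != 0x00 {
            continue // stream identifier, padding, etc. (uncompressed chunks omitted)
        }
        slot := make(chan []byte, 1)
        ordered <- slot
        go func(block []byte) {
            dec, _ := s2.Decode(nil, block[4:]) // skip the 4-byte CRC; errors ignored here
            slot <- dec
        }(body)
    }
}

func main() {
    out := make(chan []byte, 16)
    done := make(chan int)
    go func() {
        var total int
        for block := range out {
            total += len(block)
        }
        done <- total
    }()
    if err := decodeConcurrently(os.Stdin, out); err != nil {
        log.Fatal(err)
    }
    log.Printf("decoded %d bytes", <-done)
}
```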
Nice work on the RAC format!

There is definitely a good use case for compression where you have random access to the data. Using Snappy/S2 blocks would be basically the same as the streaming format. Actually, you could make a Snappy/S2 stream -> RAC converter that reuses the already compressed stream.

The shared dictionary is a nice touch!

Can the concurrent decompression run on a pure stream, or does it need to read the index first?

Oh yeah, now that I have you here: I have sent a few PRs with some speedups for Snappy that I found while writing this: https://github.com/golang/snappy/pulls - they are fairly trivial.

 /Klaus

Nigel Tao

Aug 27, 2019, 8:12:26 AM
to Klaus Post, golang-nuts
On Tue, Aug 27, 2019 at 7:25 PM Klaus Post <klau...@gmail.com> wrote:
> Can the concurrent decompression run on a pure stream, or does it need to read the index first?

It cannot run on a pure stream. As the
https://github.com/google/wuffs/blob/master/doc/spec/rac-spec.md
Overview section says, an explicit non-goal is supporting streaming
decompression.

Nigel Tao

Aug 27, 2019, 8:37:25 PM
to Klaus Post, golang-nuts
On Mon, Aug 26, 2019 at 8:29 PM Klaus Post <klau...@gmail.com> wrote:
> This package is aimed at replacing Snappy as a high speed compression package. If you are mainly looking for better compression zstandard gives better compression, but typically at speeds slightly below "better" mode in this package.

Do you know how S2 compares with LZ4? https://github.com/lz4/lz4
suggests that LZ4 is faster (for both compression and decompression)
than Snappy, for a comparable compression ratio.

Klaus Post

Aug 28, 2019, 5:11:37 AM
to golang-nuts
To be fair to LZ4, here is a single-core comparison. Only the fastest mode of LZ4 is really 'competitive' - this is level 0 in the table below. S2 level 1 is the default; level 2 is "better".

It is against master of `github.com/pierrec/lz4` to which I also sent in some nice speedups.

| file                        | out | level | insize      | outsize    | millis | mb/s    |
|-----------------------------|-----|-------|-------------|------------|--------|---------|
| rawstudio-mint14.tar        | lz4 | 0     | 8558382592  | 4553246910 | 17942  | 454.90  |
| rawstudio-mint14.tar        | s2  | 1     | 8558382592  | 4461314089 | 12935  | 630.95  |
| rawstudio-mint14.tar        | s2  | 2     | 8558382592  | 4285178801 | 25292  | 322.7   |
| github-june-2days-2019.json | lz4 | 0     | 6273951764  | 1274316041 | 9441   | 633.75  |
| github-june-2days-2019.json | s2  | 1     | 6273951764  | 1086090092 | 7425   | 805.76  |
| github-june-2days-2019.json | s2  | 2     | 6273951764  | 1055384443 | 10936  | 547.1   |
| github-ranks-backup.bin     | lz4 | 0     | 1862623243  | 618233345  | 3655   | 485.89  |
| github-ranks-backup.bin     | s2  | 1     | 1862623243  | 624632852  | 3002   | 591.58  |
| github-ranks-backup.bin     | s2  | 2     | 1862623243  | 564891142  | 4990   | 355.95  |
| consensus.db.10gb           | lz4 | 0     | 10737418240 | 5065051629 | 23608  | 433.75  |
| consensus.db.10gb           | s2  | 1     | 10737418240 | 4610183941 | 14913  | 686.63  |
| consensus.db.10gb           | s2  | 2     | 10737418240 | 4612959122 | 32878  | 311.45  |
| adresser.json               | lz4 | 0     | 7983034785  | 470399106  | 5572   | 1366.27 |
| adresser.json               | s2  | 1     | 7983034785  | 457342320  | 4546   | 1674.7  |
| adresser.json               | s2  | 2     | 7983034785  | 438421980  | 5320   | 1431    |
| gob-stream                  | lz4 | 0     | 1911399616  | 377393110  | 2904   | 627.56  |
| gob-stream                  | s2  | 1     | 1911399616  | 356137611  | 2439   | 747.21  |
| gob-stream                  | s2  | 2     | 1911399616  | 345184546  | 3392   | 537.28  |
| 10gb.tar                    | lz4 | 0     | 10065157632 | 5852284245 | 22916  | 418.87  |
| 10gb.tar                    | s2  | 1     | 10065157632 | 5936329234 | 20340  | 471.91  |
| 10gb.tar                    | s2  | 2     | 10065157632 | 5725266005 | 29463  | 325.79  |
| sharnd.out.2gb              | lz4 | 0     | 2147483647  | 2147485710 | 442    | 4632.35 |
| sharnd.out.2gb              | s2  | 1     | 2147483647  | 2147500041 | 246    | 8323.35 |
| sharnd.out.2gb              | s2  | 2     | 2147483647  | 2147500041 | 272    | 7527.72 |
| enwik9                      | lz4 | 0     | 1000000000  | 469015635  | 3681   | 259.02  |
| enwik9                      | s2  | 1     | 1000000000  | 489432671  | 2973   | 320.71  |
| enwik9                      | s2  | 2     | 1000000000  | 435069175  | 4733   | 201.49  |
| silesia.tar                 | lz4 | 0     | 211947520   | 95265393   | 600    | 336.75  |
| silesia.tar                 | s2  | 1     | 211947520   | 97205820   | 484    | 417.52  |
| silesia.tar                 | s2  | 2     | 211947520   | 90759543   | 794    | 254.51  |

On a single core, it seems like LZ4 on its fastest setting lands between the two S2 modes. Some cases (the two JSON files, the VM image and the gob stream) are strictly worse with LZ4.

TL;DR: LZ4 is typically between the default and "better" modes of S2.
 

Nigel Tao

Aug 28, 2019, 6:57:33 AM
to Klaus Post, golang-nuts
On Wed, Aug 28, 2019 at 7:11 PM Klaus Post <klau...@gmail.com> wrote:
> TLDR; LZ4 is typically between the default and "better" mode of s2.

Nice!

Just a suggestion: rename "better" to either "betterSize / smaller"
(i.e. better compression ratio, worse throughput) or "betterSpeed /
faster", otherwise it's not immediately obvious which axis "better"
is better on. (Or, if it's better on both axes, why not just always
turn it on?)


Also, from https://github.com/klauspost/compress/tree/master/s2#format-extensions:
> Format Extensions
> ...
> Framed compressed blocks can be up to 4MB (up from 64KB).

Do you know how much of the size or speed gains come from bumping 64K
to 4M? Put another way, do you still see good size/speed gains if you
do everything else other than bump this number? One of the original
design goals was to limit the amount of memory needed for
decompressing an arbitrary snappy stream.

https://github.com/google/snappy/blob/master/framing_format.txt says:
"the uncompressed data in a chunk must be no longer than 65536 bytes.
This allows consumers to easily use small fixed-size buffers".

Klaus Post

Aug 28, 2019, 8:02:35 AM
to golang-nuts

On Wednesday, 28 August 2019 12:57:33 UTC+2, Nigel Tao wrote:
On Wed, Aug 28, 2019 at 7:11 PM Klaus Post <klau...@gmail.com> wrote:
> TLDR; LZ4 is typically between the default and "better" mode of s2.

Nice!

Just a suggestion: rename "better" to either "betterSize / smaller"
(i.e. better compression ratio, worse throughput) or "betterSpeed /
faster", otherwise it's not immediately obvious on which axis "better"
is better in. (Or if it's better in both axes, why not just always
turn it on).


Also, from https://github.com/klauspost/compress/tree/master/s2#format-extensions:
> Format Extensions
> ...
> Framed compressed blocks can be up to 4MB (up from 64KB).

Do you know how much of the size or speed gains come from bumping 64K
to 4M? Put another way, do you still see good size/speed gains if you
do everything else other than bump this number? One of the original
design goals was to limit the amount of memory needed for
decompressing an arbitrary snappy stream.

Then it depends more on the content. Machine-generated content in particular suffers a lot.

Here is the compression size/speed with a 64KB block size restriction on S2:


| file                        | out    | level | insize      | outsize    | millis | mb/s    |
|-----------------------------|--------|-------|-------------|------------|--------|---------|
| rawstudio-mint14.tar        | snappy | 1     | 8558382592  | 4796307031 | 17814  | 458.17  |
| rawstudio-mint14.tar        | s2     | 1     | 8558382592  | 4741757590 | 13157  | 620.3   |
| rawstudio-mint14.tar        | s2     | 2     | 8558382592  | 4568241215 | 25962  | 314.37  |
| github-june-2days-2019.json | snappy | 1     | 6273951764  | 1525176492 | 12085  | 495.07  |
| github-june-2days-2019.json | s2     | 1     | 6273951764  | 1416959675 | 8496   | 704.17  |
| github-june-2days-2019.json | s2     | 2     | 6273951764  | 1344106237 | 13165  | 454.45  |
| github-ranks-backup.bin     | snappy | 1     | 1862623243  | 589837541  | 3785   | 469.21  |
| github-ranks-backup.bin     | s2     | 1     | 1862623243  | 637075877  | 2869   | 619.01  |
| github-ranks-backup.bin     | s2     | 2     | 1862623243  | 579284488  | 4411   | 402.61  |
| consensus.db.10gb           | snappy | 1     | 10737418240 | 5412897703 | 22273  | 459.75  |
| consensus.db.10gb           | s2     | 1     | 10737418240 | 5303306037 | 15025  | 681.5   |
| consensus.db.10gb           | s2     | 2     | 10737418240 | 5377169131 | 71860  | 142.5   |
| adresser.json               | snappy | 1     | 7983034785  | 809697891  | 7997   | 951.91  |
| adresser.json               | s2     | 1     | 7983034785  | 568975289  | 4944   | 1539.86 |
| adresser.json               | s2     | 2     | 7983034785  | 527205659  | 9675   | 786.88  |
| gob-stream                  | snappy | 1     | 1911399616  | 457984585  | 3668   | 496.85  |
| gob-stream                  | s2     | 1     | 1911399616  | 440972884  | 2528   | 720.91  |
| gob-stream                  | s2     | 2     | 1911399616  | 426735950  | 3982   | 457.67  |
| 10gb.tar                    | snappy | 1     | 10065157632 | 6056946612 | 23905  | 401.54  |
| 10gb.tar                    | s2     | 1     | 10065157632 | 6170962298 | 22430  | 427.95  |
| 10gb.tar                    | s2     | 2     | 10065157632 | 5775224369 | 31923  | 300.69  |
| sharnd.out.2gb              | snappy | 1     | 2147483647  | 2147745801 | 327    | 6261.61 |
| sharnd.out.2gb              | s2     | 1     | 2147483647  | 2147745801 | 356    | 5751.51 |
| sharnd.out.2gb              | s2     | 2     | 2147483647  | 2147745801 | 485    | 4221.73 |
| enwik9                      | snappy | 1     | 1000000000  | 508028601  | 4098   | 232.66  |
| enwik9                      | s2     | 1     | 1000000000  | 523651446  | 2736   | 348.49  |
| enwik9                      | s2     | 2     | 1000000000  | 461872591  | 4689   | 203.38  |
| silesia.tar                 | snappy | 1     | 211947520   | 103008711  | 634    | 318.74  |
| silesia.tar                 | s2     | 1     | 211947520   | 103019781  | 463    | 436.47  |
| silesia.tar                 | s2     | 2     | 211947520   | 95522535   | 792    | 255.16  |


S2 run with GOMAXPROCS=1; Snappy without the assembler implementation.

S2 is faster in 9/10, and compression is better in 7/10 cases.

Using 'better' always compresses better than Snappy, but it is also slower.


https://github.com/google/snappy/blob/master/framing_format.txt says:
"the uncompressed data in a chunk must be no longer than 65536 bytes.
This allows consumers to easily use small fixed-size buffers".

Yeah. The size impact is quite huge, so it doesn't make sense to have that restriction by default IMO. But it is configurable in S2, so you can just set it to what you want.

There is only really one place in the decoder where it affects allocations - the allocation of the input buffer, and that could easily be adjusted to only allocate space as it needs it, at the cost of a few more allocations. I considered adding the max block size to the stream, but eventually dropped it since it had so little practical impact and I wanted to keep the changes to an absolute minimum.

In most environments, allocating 4MB isn't a big deal for reading a stream, especially since the decoder can be reused. Except maybe for embedded systems, I don't see this mattering much in practice.

Blocks already allocate the full space needed for ~2x the input, so there is no difference to Snappy there - except that the max block size overhead is smaller for S2.
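
For example, something along these lines - assuming options in the spirit of WriterBlockSize / ReaderMaxBlockSize; the exact option names should be checked against the current godoc:

```go
// Sketch: capping the block size to limit memory use on both sides.
// The option names here are assumptions; verify against the package docs.
package main

import (
    "io"
    "log"
    "os"

    "github.com/klauspost/compress/s2"
)

func main() {
    // Writer limited to 64KB blocks, matching the original Snappy framing cap.
    enc := s2.NewWriter(os.Stdout, s2.WriterBlockSize(64<<10))
    if _, err := io.Copy(enc, os.Stdin); err != nil {
        log.Fatal(err)
    }
    if err := enc.Close(); err != nil {
        log.Fatal(err)
    }

    // A reader can likewise refuse streams with blocks larger than it is
    // willing to buffer:
    //
    //    dec := s2.NewReader(r, s2.ReaderMaxBlockSize(64 << 10))
}
```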


/Klaus

Klaus Post

Aug 28, 2019, 9:04:12 AM
to golang-nuts

Also, while there were other opportunities for making it more effective (3-byte offsets, even shorter repeat codes), I chose to keep it as close to Snappy as feasible so the same decoder works for both formats.

Re "better (compression)": Yeah, it does require to read the docs. Oh well. 


/Klaus