[ANN] Optimized gzip/zip packages, 30-50% faster.


Klaus Post

Jul 28, 2015, 7:06:43 AM
to golang-nuts
Hi!

I have been working on optimizing the standard library deflate implementation, and I am happy to announce revised gzip/zip packages that, on x64, are about 30-50% faster with slightly improved compression. They contain no cgo.



All packages are drop-in replacements for the standard libraries, so you can use them by simply changing imports.

The biggest gains are on machines with SSE 4.2 instructions, available since Intel Nehalem (2009) and AMD Bulldozer (2011). The main optimizations are:

* Minimum matches are now 4 bytes, which leads to fewer searches and better compression.
* Stronger hash (iSCSI CRC32) for matches on x64 with SSE 4.2 support. This leads to fewer hash collisions.
* Literal byte matching using SSE 4.2 for faster long-match comparisons.
* Bulk hashing on matches.
* Much faster dictionary indexing with NewWriterDict()/Reset().
* Faster bit coder, which assumes a 64-bit CPU.
* CRC32 optimized for 10x speedup on SSE 4.2. Available separately: https://github.com/klauspost/crc32

For benchmarks see the project page.

In short, there will be better compression at levels 1 to 4 and about 1.5 times the throughput at higher compression levels.


Furthermore "pgzip" (multi-CPU gzip for longer streams) has also been updated to the new deflate/crc32, so if you update the repo you will also get a "free" speed boost there. See https://github.com/klauspost/pgzip


Comments, questions and other feedback are very welcome!

/Klaus

Sebastien Binet

Jul 28, 2015, 7:33:01 AM
to Klaus Post, golang-nuts
here is your feedback: great work!

do you have an idea of the minimum payload size at which point it's
beneficial to use pgzip wrt gzip? (I surmise there is some overhead
going from the latter to the former)

-s

Klaus Post

Jul 28, 2015, 7:39:31 AM
to golang-nuts, seb....@gmail.com
On Tuesday, 28 July 2015 13:33:01 UTC+2, Sebastien Binet wrote:
here is your feedback: great work!

you are welcome :)

do you have an idea of the minimum payload size at which point it's
beneficial to use pgzip wrt gzip? (I surmise there is some overhead
going from the latter to the former)

If the uncompressed input is below 1MB, just use regular gzip. The default block size is 250k bytes, so if you send 1MB, it will be split into 4 parallel encodes. Below that, any gains will mostly be eaten by communication overhead, and 1MB is pretty fast on 1 core anyway.

 



/Klaus 

Klaus Post

Jul 28, 2015, 10:53:56 AM
to golang-nuts, seb....@gmail.com, klau...@gmail.com
Hi again!

Carlos Cobo just notified me that some fixes for inflate had been sent in and merged to tip. Mostly they relate to https://github.com/golang/go/issues/11030

I have merged these changes, so you get that as a free bonus if you are using Go 1.3 or 1.4.

/Klaus

joe...@google.com

Jul 28, 2015, 11:41:53 PM
to golang-nuts, seb....@gmail.com, klau...@gmail.com
This is awesome and was something I was thinking about doing myself eventually. I may still play around with optimizing flate at the higher level with tweaks to the algorithm itself.

Also, thanks for merging in #11030.

JT

Arne Hormann

Jul 29, 2015, 6:35:38 AM
to golang-nuts, klau...@gmail.com
Hi, thank you so much for this, your library is amazing.
I checked it with https://gist.github.com/arnehormann/65421048f56ac108f6b5 and love it so far!

Klaus Post

Jul 29, 2015, 7:32:43 AM
to golang-nuts, arneh...@gmail.com
On Wednesday, 29 July 2015 12:35:38 UTC+2, Arne Hormann wrote:
Hi, thank you so much for this, your library is amazing.
I checked it with https://gist.github.com/arnehormann/65421048f56ac108f6b5 and love it so far!

Very cool! If you want to improve it further, you could look into real-world samples like JSON/HTML/XML/CSS/JS, which are closer to real-world scenarios.

Artificial sources have a tendency to skew benchmarks; real sources show the tradeoffs best and give the most usable results. Just so you know what makes sense to compare.


/Klaus

Arne Hormann

Jul 29, 2015, 7:42:18 AM
to Klaus Post, golang-nuts
That's entirely possible with the program in that gist.
Just use -r=raw for the input and pipe a file into it.
I tried it with a tarred directory and compared the result with diff -q to check that the unpacked output matches the input.

Klaus Post

Jul 29, 2015, 9:06:20 AM
to golang-nuts, arneh...@gmail.com
On Wednesday, 29 July 2015 13:42:18 UTC+2, Arne Hormann wrote:
That's entirely possible with the program in that gist.
Just use -r=raw for the input and pipe a file into it.
I tried it with a tarred directory and compared the result with diff -q to check that the unpacked output matches the input.

Even cooler! I took the liberty of adding an in/out parameter (since pipes perform quite badly on Windows), as well as pgzip support, CSV output of stats, and CPU allocation (for pgzip). That makes it perfect for my testing.


/Klaus

Arne Hormann

Jul 29, 2015, 9:25:39 AM
to golang-nuts, klau...@gmail.com
Glad I could help - especially if this allows you to tune it better!

demetri...@gmail.com

Jul 30, 2015, 1:41:59 AM
to golang-nuts
How does performance compare to calling zlib/libzip via cgo?

Klaus Post

Jul 30, 2015, 7:43:03 AM
to golang-nuts, demetri...@gmail.com
On Thursday, 30 July 2015 07:41:59 UTC+2, demetri...@gmail.com wrote:
How does performance compare to calling zlib/libzip via cgo?
Never ask that again - getting it to compile under Windows was a nightmare ;)

I have benchmarked cgzip along with the Go standard library, my revised gzip, and pgzip, as well as the 7zip and gzip executables:


At level 1, cgzip is about 20% slower than the new gzip and compresses worse.
At level 9, cgzip is 20% faster, but compresses worse.

The test file is a 7GB highly compressible JSON file. I might add Matt Mahoney's 10GB corpus - http://mattmahoney.net/dc/10gb.html - as a formal test, although JSON and similar formats are probably closer to real-world scenarios.
 
/Klaus

thebroke...@gmail.com

Jul 30, 2015, 6:28:55 PM
to golang-nuts, demetri...@gmail.com, klau...@gmail.com
Just wondering, does gzkp allocate more memory resources than gzstd?

Klaus Post

Jul 30, 2015, 6:42:57 PM
to golang-nuts, demetri...@gmail.com, klau...@gmail.com, thebroke...@gmail.com


On Friday, 31 July 2015 00:28:55 UTC+2, thebroke...@gmail.com wrote:
Just wondering, does gzkp allocate more memory resources than gzstd?

It has an additional 1-2KB array (depending on the size of an int) used for bulk hashing. It is allocated when you create a new Writer, but otherwise memory use should be the same.

/Klaus


PS. I have added benchmarks for highly compressible input, and for levels 1-7 on medium-compressible input, on separate sheets. It looks as if something could be gained by being able to quickly skip input that is hard to compress at the lower compression levels.

Darren Hoo

Jul 30, 2015, 10:25:59 PM
to golang-nuts, demetri...@gmail.com, klau...@gmail.com
how about compressing small chunks of data, like 4KB for each message?

I have used your library, but it cannot keep up with the speed messages flow in, so I switched back to cgzip

Klaus Post

Aug 1, 2015, 4:44:07 AM
to golang-nuts, demetri...@gmail.com, klau...@gmail.com, darre...@gmail.com

On Friday, 31 July 2015 04:25:59 UTC+2, Darren Hoo wrote:
how about compressing small chunks of data, like 4KB for each message?

I have used your library, but it cannot keep up with the speed messages flow in, so I switched back to cgzip

I haven't tested small workloads yet. I will put that next on my to-do list. What are your average payload size and payloads per second?

Meanwhile I have summarized my findings on high-throughput workloads: http://blog.klauspost.com/go-gzipdeflate-benchmarks/

/Klaus

Klaus Post

Aug 2, 2015, 3:26:50 PM
to golang-nuts, demetri...@gmail.com, klau...@gmail.com
On Friday, 31 July 2015 04:25:59 UTC+2, Darren Hoo wrote:
how about compressing small chunks of data, like 4KB for each message?

I looked through it, and even got 5-10% additional performance in standard deflate.

The most important thing is to use the Reset() function rather than creating a NewWriter for every payload. This close to doubles the number of compressed files per second, with both the standard library and this package. Use a sync.Pool if you need to share writers across goroutines.

I created a test set with 548 files containing JSON, HTML, JavaScript, SVG and CSS. The average size is a little more than 8KB/file. All are compressed separately to simulate an HTTP server. ~500 of the files are small JSON files.

With the modified gzip library I get 22 MB/sec (2656 files/sec), using a Reset() between each.
With cgzip I get about 28 MB/sec (3391 files/sec), so it has about a 25% performance advantage.

For comparison, the standard library achieves 13 MB/sec (1574 files/sec) with Reset(). Without Reset(), the speed of the modified library is 16 MB/sec (1926 files/sec).


I will do more tests and see if I can squeeze out more speed.

/Klaus

Klaus Post

Aug 3, 2015, 10:41:59 AM
to golang-nuts, demetri...@gmail.com, klau...@gmail.com, thebroke...@gmail.com
On Friday, 31 July 2015 00:28:55 UTC+2, thebroke...@gmail.com wrote:
Just wondering, does gzkp allocate more memory resources than gzstd?

Just finished eliminating allocations in deflate. The aggregate results when using gzip, which appears to do a few allocations by itself: 

* "BenchmarkOld" below are the allocations for the standard library.

BenchmarkGzipL1      20  77454430 ns/op  64.06 MB/s   40386 B/op       0 allocs/op
BenchmarkGzipL2      20  84054810 ns/op  59.03 MB/s   40386 B/op       0 allocs/op
BenchmarkGzipL3      20  86904970 ns/op  57.09 MB/s   40386 B/op       0 allocs/op
BenchmarkGzipL4      10 118906800 ns/op  41.73 MB/s   80772 B/op       1 allocs/op
BenchmarkGzipL5      10 148208480 ns/op  33.48 MB/s   80772 B/op       1 allocs/op
BenchmarkGzipL6      10 148608500 ns/op  33.39 MB/s   80772 B/op       1 allocs/op
BenchmarkGzipL7       5 200011440 ns/op  24.81 MB/s  161545 B/op       3 allocs/op
BenchmarkGzipL8       3 396022666 ns/op  12.53 MB/s  269242 B/op       6 allocs/op
BenchmarkGzipL9       3 403356400 ns/op  12.30 MB/s  269242 B/op       6 allocs/op
BenchmarkOldGzipL1      10 104305970 ns/op  47.57 MB/s  396857 B/op    4333 allocs/op
BenchmarkOldGzipL2      10 123907090 ns/op  40.04 MB/s  386457 B/op    4249 allocs/op
BenchmarkOldGzipL3      10 137007830 ns/op  36.22 MB/s  379177 B/op    4180 allocs/op
BenchmarkOldGzipL4       5 200011440 ns/op  24.81 MB/s  503955 B/op    3615 allocs/op
BenchmarkOldGzipL5       5 216612400 ns/op  22.91 MB/s  499715 B/op    3558 allocs/op
BenchmarkOldGzipL6       5 254814560 ns/op  19.47 MB/s  499475 B/op    3651 allocs/op
BenchmarkOldGzipL7       5 299817140 ns/op  16.55 MB/s  500675 B/op    3696 allocs/op
BenchmarkOldGzipL8       2 529030250 ns/op   9.38 MB/s  936048 B/op    3678 allocs/op
BenchmarkOldGzipL9       2 577033000 ns/op   8.60 MB/s  936128 B/op    3681 allocs/op

Luna Duclos

Aug 3, 2015, 10:43:42 AM
to Klaus Post, golang-nuts, demetri...@gmail.com, thebroke...@gmail.com
Is there any convincing reason not to integrate most of these improvements into the std lib once they're entirely finished?

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Klaus Post

Aug 3, 2015, 10:49:52 AM
to golang-nuts, klau...@gmail.com, demetri...@gmail.com, thebroke...@gmail.com
On Monday, 3 August 2015 16:43:42 UTC+2, Luna Duclos wrote:
is there any convincing reason to not integrate most of these improvements in the std lib once they're entirely finished ?

No. Initially I thought it would be x64-specific, but a lot of it should make things generally better. I am still regression testing to see if there are any cases where it is significantly worse.

Either way, it will not be in before Go 1.6.


/Klaus

Benoît Amiaux

Aug 3, 2015, 1:55:34 PM
to Klaus Post, golang-nuts


Klaus Post

Aug 4, 2015, 6:57:19 AM
to golang-nuts, klau...@gmail.com
On Monday, 3 August 2015 19:55:34 UTC+2, Benoît Amiaux wrote:

No, I hadn't seen it, but funnily enough, I was just experimenting with Huffman-only compression as an alternative that avoids the big variance in compression speed.

I have a prototype working, although the speed isn't particularly impressive yet - though it is usually faster than level 1.
The compression hit is rather big, though.


My few initial tests (one CPU used):

enwiki9: 
* Level 1: 40.17MB/s, 365,776,800 bytes
* Huffman only: 70.82 MB/s, 641,017,571 bytes

10GB Matt Mahoney corpus:
* Level 1: 34.59 MB/s, 5,105,308,274 bytes
* Huffman only: 82.15 MB/s, 6,485,492,430 bytes

Pure Huffman should give a more predictable speed, meaning that different input doesn't make the compression speed "tank" as it can to some degree with ordinary deflate.

My aim is to get the Huffman-only above 150MB/s to justify the size tradeoff and to make it a "well, it cannot hurt to add" option.

Regarding "FSE" described in the article, we cannot use that, since we must remain compatible with the deflate format.

/Klaus

Klaus Post

Aug 5, 2015, 3:11:04 PM
to golang-nuts, klau...@gmail.com
Hi!

I wrote up a summary of my findings when using Gzip for small payloads - specifically for web server style loads:



/Klaus

Paul Graydon

Aug 6, 2015, 1:25:48 AM
to golang-nuts
If you're building webservers/websites from scratch, you might want to consider precompressing all your static content in advance.

Nginx will happily host the content directly http://nginx.org/en/docs/http/ngx_http_gzip_static_module.html and even supports the alternative approach of serving only gzip content and decompressing for those clients that don't support compression.

With such easy wins available, it somewhat surprises me that more static site generators don't provide it as a native setting.
If you really want to get fancy you can use zopfli, Google's painfully slow, high-compression library; it produces gzipped content at better compression ratios than gzip. You'd never use it to compress on the fly, but it's ideal for precompressing.

Frits van Bommel

Aug 6, 2015, 3:28:25 AM
to golang-nuts, klau...@gmail.com
Klaus,

About your example code:

		// Get a Writer from the Pool
		gz := zippers.Get().(*gzip.Writer)

		// We use Reset to set the writer we want to use.
		gz.Reset(w)
		defer gz.Close()

		// When done, put the Writer back in to the Pool
		defer zippers.Put(gz)

Please note that defers run in reverse order, so you want to defer the Put() before the Close() to avoid calling the latter on a writer that has already been Put() back into the pool and is potentially already being used by another goroutine.
I'd suggest putting the Put() right after the Get() unless there's some way for Reset() to fail that would make it unsuitable for reuse (in that case it should be between Reset() and the deferred Close()).


Klaus Post

Aug 6, 2015, 7:21:13 AM
to golang-nuts, klau...@gmail.com
On Thursday, 6 August 2015 09:28:25 UTC+2, Frits van Bommel wrote:
Klaus,
[...]


Please note that defers run in reverse order, so you want to defer the Put() before the Close() to avoid calling the latter on a writer that has already been Put() back into the pool and is potentially already being used by another goroutine.
I'd suggest putting the Put() right after the Get() unless there's some way for Reset() to fail that would make it unsuitable for reuse (in that case it should be between Reset() and the deferred Close()).

Thanks for that!

I updated the sample code!

/Klaus