compress/flate: replace level 1 to 3 with specialized versions (Discussion)


klau...@gmail.com

Mar 22, 2016, 6:53:33 AM
to golang-dev, Joe Tsai, Nigel Tao, Russ Cox
Hi!

This is discussion for CL 21021: https://go-review.googlesource.com/#/c/21021/

What I am proposing in the CL is to replace levels 1 to 3 with specialized versions that are significantly faster. Usually throughput is around 1.7x that of the standard encoder. This does come at a compression cost; usually compression is between 2 and 3 percent worse than the original levels 1, 2 and 3.

This affects the following other packages: "compress/gzip", "compress/zlib", "archive/zip", "image/png". My main target has been web servers, but I have also had talks with the Docker crew, who would greatly appreciate faster deflate for disk image creation. Russ Cox has also proposed zip for object file archives, which would likewise be affected by this change.

My rationale for this is that levels 1 to 3 indicate that the user wants the best speed. Typically the user has done this by selecting flate.BestSpeed (or similar), so they have already indicated they are willing to sacrifice compression efficiency for better speed.
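
For reference, opting in is just a matter of passing flate.BestSpeed (or the equivalent gzip/zlib constant) when constructing the writer. A minimal sketch, not tied to this CL:

package main

import (
	"compress/gzip"
	"log"
	"os"
)

func main() {
	// BestSpeed (level 1) is one of the levels this CL would replace;
	// compress/flate is used the same way via flate.NewWriter(w, flate.BestSpeed).
	zw, err := gzip.NewWriterLevel(os.Stdout, gzip.BestSpeed)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := zw.Write([]byte("hello, BestSpeed\n")); err != nil {
		log.Fatal(err)
	}
	if err := zw.Close(); err != nil {
		log.Fatal(err)
	}
}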

My main discussion points: 
* Is it worth the additional code complexity for a 1.7x speed up?
* Is the 2-3% compression loss worth the speedup?
* Do you see any other downsides, or do you see any places where this would benefit you?


Other discussion points:
* In the future the remaining levels 4-8 should be re-adjusted, but that isn't strictly needed now.


DETAILED TEST SET DESCRIPTION:

---> Web content. This benchmark compresses a selection of individual HTML, JS, CSS, SVG and small individual JSON files. This was selected to give an indication of typical web server performance on content that is likely to be gzip encoded.

Web content, level 1: 2.5% less compression reduction. 2.6x speedup.
Web content, level 2: 2.1% less compression reduction. 2.3x speedup.
Web content, level 3: 2.1% less compression reduction. 2.4x speedup.



---> A huge JSON stream, containing highly repetitive JSON. Typically compresses around 95%. This content could represent a database dump, a network stream or similar content.

JSON, level 1: 1.3% less compression reduction. 2.6x speedup.
JSON, level 2: 0.1% more compression reduction. 1.9x speedup.
JSON, level 3: 0.1% more compression reduction. 1.9x speedup.



---> enwik9 is a standard corpus containing the first 10^9 bytes of the XML text dump of the English version of Wikipedia on Mar. 3, 2006, used by the "Large Text Compression Benchmark". See more here: http://mattmahoney.net/dc/text.html

enwik9, level 1: 2.7% less compression reduction. 1.8x speedup.
enwik9, level 2: 2.8% less compression reduction. 1.9x speedup.
enwik9, level 3: 2.6% less compression reduction. 2.0x speedup.



---> 10GB is a standard test set that represents a typical backup scenario. The test data is designed to test archivers in realistic backup scenarios with lots of already-compressed or hard-to-compress files and lots of duplicate or nearly identical files. It consists of exactly 10 GB (10^10 bytes) in 79,431 files in 4,006 directories. For this test, a TAR file has been created. See http://mattmahoney.net/dc/10gb.html

10gb, level 1: 3.1% less compression reduction. 3.0x speedup.
10gb, level 2: 3.1% less compression reduction. 3.2x speedup.
10gb, level 3: 3.4% less compression reduction. 3.4x speedup.



---> Random data, should be uncompressible. Could be JPG, MP3, MP4, MKV files in the "real world".

random, level 1: 0.0% less compression reduction. 19.5x speedup.
random, level 2: 0.0% less compression reduction. 19.2x speedup.
random, level 3: 0.0% less compression reduction. 19.4x speedup.

Test set: http://mattmahoney.net/dc/#sharnd - 200000000 bytes


Your input is appreciated.

/Klaus

antoine...@gmail.com

Mar 22, 2016, 1:39:28 PM
to golang-dev, joe...@digital-static.net, nige...@golang.org, r...@golang.org, klau...@gmail.com
The compromise makes a lot of sense to me. I had to use the `cgzip` bindings in many projects, adding complexity to builds and preventing the use of many tools like gorename, so that I could get a faster "BestSpeed" story.

If the default `compress/gzip` implementation is ~1.7x faster for at most 3% less compression, that's a huge win in all the cases I have that require `BestSpeed`.

Russ Cox

Mar 22, 2016, 1:54:14 PM
to antoine...@gmail.com, golang-dev, joe...@digital-static.net, nige...@golang.org, klau...@gmail.com
Hi Klaus,

Can you post a link to the graphs you showed me a month or two ago? (Maybe it was a blog post, but maybe it was a spreadsheet, I forget.) I thought they made a very compelling case for shifting the levels.

Thanks.
Russ

klau...@gmail.com

Mar 22, 2016, 2:17:53 PM
to golang-dev, joe...@digital-static.net, nige...@golang.org, r...@golang.org, klau...@gmail.com, antoine...@gmail.com
On Tuesday, 22 March 2016 18:39:28 UTC+1, antoine...@gmail.com wrote:
The compromise makes a lot of sense to me. I had to use the `cgzip` bindings in many projects, adding complexity to builds and preventing the use of many tools like gorename, so that I could get a faster "BestSpeed" story.

In regards to cgzip, the new levels 1-3 pretty much match cgzip on easy-to-compress content, that being "web content", "JSON" and "enwik9" - especially with the SSA optimizations coming in. "Mixed content" (10GB) is approximately 2x the speed of cgzip, likely because it is quicker at skipping hard-to-compress sections.

@rsc: I don't have it for this specific revision, but the overall pictures is: http://imgur.com/a/5sWM4

The horizontal axis is compression (note this is a zoomed-in view), and the vertical axis is throughput in MB/s. Since some changes are left out, level 4-9 throughput is probably about 10-20% lower than these numbers.

However, to me they indicate that there is little need for levels 7 and 8, and that level 6 (the current default) is not a good trade-off between speed and compression.

Currently I have hit this target: http://i.imgur.com/dGvgnIf.png (10gb.tar testset) - I have written my considerations for how the compression/speed has been balanced here: https://blog.klauspost.com/rebalancing-deflate-compression-levels/


/Klaus

Mark Kubacki

Mar 22, 2016, 3:31:05 PM
to golang-dev
This mixes 'flavour' into the dimension described by fast–best. In light of zopfli being another potential candidate, perhaps we can find a way to express 'flavour/variant' as an additional, distinct dimension?

Similar to encoding/base64's StdEncoding and URLEncoding; only that there is no guarantee which one is used if none is explicitly asked for. (Think: 1–4 Klaus', 7–9 zopfli or similar; subject to change with fashion and prevalent performance.)

-- 
Mark

minux

Mar 22, 2016, 7:25:04 PM
to klau...@gmail.com, Joe Tsai, Russ Cox, Nigel Tao, golang-dev

Nice speedups indeed. Does this require any assembly code?

Marvin Stenger

Mar 22, 2016, 8:07:20 PM
to golang-dev, klau...@gmail.com, joe...@digital-static.net, r...@golang.org, nige...@golang.org
This is discussion for CL 21021: https://go-review.googlesource.com/#/c/21021/
...no.

klau...@gmail.com

Mar 23, 2016, 4:39:30 AM
to golang-dev
@Mark:

> In the light of zopfli being another potential candidate [...]

I can see your point. However, I think the performance degradation of zopfli would be too big to be acceptable as a replacement for level 9. Luckily we don't have the same restrictions as with the lower levels, so zopfli could be implemented as level 10-12 or something similar. That way we don't explode CPU usage for people using level 9, while still maintaining the option to go to "11" for something like precompressed static files. That said, it probably should be in a 3rd party package.

@minux

>Nice speedups indeed. Does this require any assembly code?

No - I have experimented with replacing the matcher with SSE 4.2 code, but the call overhead usually outweighs the benefits, and only for the revised level 3 does it give a consistent speedup, in the area of 5% - which really doesn't justify the added complexity. The main problem is that SSE 4.2 only really speeds up the cases that are already fast (long matches).

I am mostly interested in speeding up the "slowest" cases, and this CL applies a bunch of tricks that help those cases a lot. Also, the new SSA compiler will help eliminate most bounds checks, so I don't currently plan to add any assembler in the future.

Most of the speedup comes from simpler LZ77 matching based on the main loop from Snappy, some heuristic skipping, and selective Huffman encoding.
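
Roughly, the main loop has this shape - a toy sketch written for this thread, not the CL code, with made-up names and constants:

package main

import (
	"encoding/binary"
	"fmt"
)

const tableBits = 14

func hash4(u uint32) uint32 {
	return (u * 0x1e35a7bd) >> (32 - tableBits)
}

// scan counts the 4-byte matches a single hash table finds and the bytes
// the skipping heuristic jumps over without doing a lookup.
func scan(src []byte) (matches, skipped int) {
	var table [1 << tableBits]int
	for i := range table {
		table[i] = -1
	}
	lit := 0 // start of the current literal run
	for s := 0; s+4 <= len(src); {
		cur := binary.LittleEndian.Uint32(src[s:])
		h := hash4(cur)
		cand := table[h]
		table[h] = s
		if cand >= 0 && binary.LittleEndian.Uint32(src[cand:]) == cur {
			// A real encoder extends the match and emits a copy token here.
			matches++
			s += 4
			lit = s
			continue
		}
		// Miss: the longer the current literal run, the larger the step.
		step := 1 + ((s - lit) >> 4)
		skipped += step - 1
		s += step
	}
	return
}

func main() {
	text := []byte("the quick brown fox jumps over the lazy dog; the quick brown fox")
	noise := make([]byte, 4096)
	for i := range noise {
		noise[i] = byte((i*i + 7*i) >> 3) // filler that rarely matches
	}
	m, sk := scan(append(text, noise...))
	fmt.Println("matches:", m, "bytes skipped:", sk)
}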


/Klaus

Peter Waller

Mar 23, 2016, 5:51:02 AM
to klau...@gmail.com, golang-dev
On 22 March 2016 at 10:53, <klau...@gmail.com> wrote:
What I am proposing in the CL is to replace levels 1 to 3 with specialized versions that are significantly faster. Usually throughput is around 1.7x that of the standard encoder. This does come at a compression cost; usually compression is between 2 and 3 percent worse than the original levels 1, 2 and 3.

This sounds great. I have also noticed the slowness of BestSpeed and have been tempted to use cgzip at times, so a big +1 from me.

One question: you mention SSA "will bring improvements", so it is unclear to me whether you mean future improvements to SSA, or that the benchmarks you quote are without SSA. I assume, from the fact that you mention numbers in the CL, that they are compared against the existing performance with SSA.

From Keith's "SSA performance numbers" mailing list post, he showed Gzip (compress) was ~1/(1 - 0.1397) = 1.16x faster, if I understand correctly. It's unclear to me which speed parameter was used, though.

So, does that mean the performance benefit we will get compared to Go 1.6 may be even greater? (An optimistic ballpark of 1.16 * 1.7 is ~2x, which feels like a big win!)

Nigel Tao

Mar 23, 2016, 7:28:36 AM
to Klaus Post, golang-dev, Joe Tsai, Russ Cox
I'm OK with this in principle, with a few remarks below.

Re the overall pictures at http://imgur.com/a/5sWM4 and at
https://blog.klauspost.com/rebalancing-deflate-compression-levels/
it's not obvious to me whether the graphs are showing the old or new
algorithms for levels 1, 2 and 3. In fact, I think the best graph
would show both.

Flate is normally Huffman + LZ77. In
https://go-review.googlesource.com/#/c/20982/ you are also proposing a
Huffman-only compression level. How does that relate (if at all) to
this proposal? Should we also have a LZ77-only compression level?
Should we have more than one such level? Conversely, when introducing
Snappy style matching, should we replace levels 1, 2 and 3, or just
level 1 aka BestSpeed?

https://go-review.googlesource.com/#/c/21021/ is marked "DO NOT
SUBMIT", but it still has code like this:

// Skip 1 byte for every 16 consecutive misses.
s += 1 + ((s - lit) >> 4)

Please don't do that. We have had problems with this before:
https://github.com/golang/snappy/commit/d1d908a252c22fd7afd36190d5cffb144aa8f777

Also, that CL has three separate copy-pasted-and-specialized methods
for levels 1, 2 and 3. Yes, each method has a sentence or two in
comment, but it would be easier to understand the difference between
the three actual algorithms if there was just a single method, with
some "if e.level > 2" style blocks. Sure, copy-paste-and-specialize
can increase performance (at a cost of having more code), but it can
easily be a follow-up CL.

klau...@gmail.com

Mar 23, 2016, 8:43:19 AM
to golang-dev, klau...@gmail.com, joe...@digital-static.net, r...@golang.org
On Wednesday, 23 March 2016 12:28:36 UTC+1, Nigel Tao wrote:
I'm OK with this in principle, with a few remarks below.

I'm glad to hear that.
 

Re the overall pictures at http://imgur.com/a/5sWM4 and at
https://blog.klauspost.com/rebalancing-deflate-compression-levels/
it's not obvious to me whether the graphs are showing the old or new
algorithms for levels 1, 2 and 3. In fact, I think the best graph
would show both.

I didn't want to go into details about the re-balancing, since we can take that discussion separately, but the point was to show that:

* There is a big gap from the proposed level 3 to level 4, which will be 
* Default level (6) does not represent a "sweet spot" between CPU/compression ratio.
* Level 7-8 provide no significant compression ratio increase over level 6.
 
Actual numbers are from my personal repo. With CL 20929 merged, I expect numbers to be close, but a bit lower because of the slower token slice and combined lazy/no-lazy code. Russ asked for the numbers I had shown him, and here they are.

Once we have the pending things merged, I will submit the re-balance proposal and I will be able to do the "real" numbers.


Flate is normally Huffman + LZ77. In
https://go-review.googlesource.com/#/c/20982/ you are also proposing a
Huffman-only compression level. How does that relate (if at all) to
this proposal?

It is not directly related, so I separated it out into a separate CL and didn't mention it here to keep the discussion a bit more focused. It is more of a special case that definitely has its uses; it doesn't provide a good generic compressor, but it can be used on Snappy output, non-artificial PNG images and content where even this proposed level 1 is too slow.
 
Should we also have a LZ77-only compression level?

Unfortunately that is not possible with the deflate format; LZ77 (matching) data must be Huffman compressed. One thing I will investigate is the possibility of only recalculating the Huffman tree every n deflate blocks (which are up to 64KB each), since tree calculation takes ~50% of the CPU time in this proposed level 1. If we calculate the trees every second block, that would be roughly a 25% speed increase.

It will probably only be added to my personal package, but if it works well, who knows if it will find a place in the standard library.
 
Should we have more than one such level? Conversely, when introducing
Snappy style matching, should we replace levels 1, 2 and 3, or just
level 1 aka BestSpeed?

Levels 2+3 offer a good progression. The "generic" deflater leaves a big gap to level 1 and hits a performance ceiling. This is what you see in my blog post, which was written before I wrote the specialized levels 2+3.


Please don't do that. We have had problems with this before:

Don't worry, I know what I am doing. Each deflate block is never longer than 64KB, so skipping is reset for every 64KB of data processed. This works as intended. If it didn't, 10GB.tar would end up considerably worse.

Btw, now that you mention it, you still haven't fixed Snappy, which will completely stop compressing data once it encounters 1MB of uncompressible data: https://github.com/golang/snappy/pull/23#issuecomment-183889455 - maybe you should add changes as PRs in the future instead of committing directly to master - you are welcome to ping me for review, and maybe others also have some input.

 
Also, that CL has three separate copy-pasted-and-specialized methods
for levels 1, 2 and 3. Yes, each method has a sentence or two in
comment, but it would be easier to understand the difference between
the three actual algorithms if there was just a single method, with
some "if e.level > 2" style blocks. Sure, copy-paste-and-specialize
can increase performance (at a cost of having more code), but it can
easily be a follow-up CL.

Sure - I will probably go the route of trying to factor out the common code into smaller functions. That way the compiler can inline it, and we don't have the branching overhead.


/Klaus

Joe Tsai

Mar 23, 2016, 4:06:52 PM
to golang-dev, klau...@gmail.com, joe...@digital-static.net, r...@golang.org
Personally, I think the adjustments to levels 1-3 are worth doing, although I am worried about the amount of added logic. The loss of compression ratio does not bother me.

In several of the recent CLs you posted, there have been significant amounts of duplicated logic. I am wondering if we could write one generic function that implements the logic of all three levels, with the portions that differ guarded by if constValueX {...} blocks. Then we could write a code generator that simply outputs another source file containing three renamed versions of that generic function, replacing each instance of constValueX with true/false, and let the compiler deal with getting rid of the dead code. Thoughts?
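
Something like this toy illustration of the idea (names are made up, not from the CLs): the generic function takes the flags as ordinary bools, and a generator would emit per-level copies where they become constants, letting the compiler drop the dead branches.

package main

import "fmt"

// The single, maintained implementation.
func encodeGeneric(src []byte, useLazyMatching bool) int {
	work := 0
	for range src {
		if useLazyMatching {
			work += 2 // stand-in for the extra level-3-style work
		} else {
			work++ // stand-in for the level-1 fast path
		}
	}
	return work
}

// What the generator would emit for level 1: the flag becomes a constant,
// so the compiler removes the "lazy" branch entirely.
func encodeL1(src []byte) int {
	const useLazyMatching = false
	work := 0
	for range src {
		if useLazyMatching {
			work += 2
		} else {
			work++
		}
	}
	return work
}

func main() {
	src := []byte("example input")
	fmt.Println(encodeGeneric(src, false) == encodeL1(src)) // true
}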

JT

Nigel Tao

Mar 24, 2016, 8:10:46 AM
to Klaus Post, golang-dev, Joe Tsai, Russ Cox
On Wed, Mar 23, 2016 at 11:43 PM, <klau...@gmail.com> wrote:
>> Re the overall pictures at http://imgur.com/a/5sWM4 and at
>> https://blog.klauspost.com/rebalancing-deflate-compression-levels/
>> it's not obvious to me whether the graphs are showing the old or new
>> algorithms for levels 1, 2 and 3. In fact, I think the best graph
>> would show both.
>
> I didn't want to go into details about the re-balancing, since we can take
> that discussion separately, but the point was to show that:
>
> * There is a big gap from the proposed level 3 to level 4, which will be

Is part of this bullet point missing? The sentence appears incomplete.

In any case, I will repeat myself in case there was a
mis-communication: even if there is or was (or both) a big gap
between level 3 and level 4, it would be great to see both the old and
new algorithms for levels 1-3 on the graph.


>> Flate is normally Huffman + LZ77. In
>> https://go-review.googlesource.com/#/c/20982/ you are also proposing a
>> Huffman-only compression level. How does that relate (if at all) to
>> this proposal?
>
> It is not directly related, so I separated it out in a separate CL, and
> didn't mention it here to have the discussion a bit more focused.

OK. It is an API change (an addition), so I think it's worth at least
mentioning at some point on golang-dev. But it's mentioned now, and
the change LGTM (modulo some minor issues).


>> Should we have more than one such level? Conversely, when introducing
>> Snappy style matching, should we replace levels 1, 2 and 3, or just
>> level 1 aka BestSpeed?
>
> Level 2+3 offer good progression. The "generic" deflater leaves a big gap to
> level 1, and hits a performance ceiling. This is what you see in my blog
> post, which was written before I wrote the specialized level 2+3.

They may offer good progression (again, seeing both the before and the
after for levels 1-3 on the graphs would help inform the discussion),
but in practice, do people use anything other than BestSpeed,
BestCompression, NoCompression or DefaultCompression? I don't know the
answer, but if not, it might not be worth having three copy/pastes of
the new algorithm if nobody uses levels 2 or 3.


>> Please don't do that. We have had problems with this before:
>
> Don't worry, I know what I am doing. Each deflate block is never longer
> than 64KB, so skipping is reset for every 64KB of data processed. This works
> as intended. If it didn't, 10GB.tar would end up considerably worse.

I'm sorry, but I don't find "don't worry, I know what I am doing"
persuasive. That sort of reassurance would not have prevented the bug
that https://github.com/golang/snappy/commit/d1d908a252c22fd7afd36190d5cffb144aa8f777
fixed.

Even if you are now resetting every 64KB, that still means that the "s
+= 1 + ((s - lit) >> 4)" skipping can be very aggressive within each
64KB block. Specifically, it is exponential instead of quadratic (see,
for example, the table after "It prints" on the golang/snappy commit
linked to immediately above), which I think is too aggressive.
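
For a concrete feel of how aggressive that is, here is a small sketch (mine, not from either codebase) counting how many hash lookups each of the two shift amounts mentioned in this thread (>>4 and >>5) makes before traversing a 64KB block, assuming no match is ever found so lit stays at 0:

package main

import "fmt"

func main() {
	for _, shift := range []uint{4, 5} {
		s, attempts := 0, 0
		for s < 1<<16 { // one 64KB block
			s += 1 + (s >> shift) // "s += 1 + ((s - lit) >> shift)" with lit == 0
			attempts++
		}
		fmt.Printf(">>%d: %d lookups to traverse 64KB\n", shift, attempts)
	}
}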

We clearly disagree on where to pick the trade-off for speed vs size,
but I feel that "do more or less what C++ snappy does" is a safe
choice, and using a novel algorithm has some risk: I discovered and
fixed one bug in d1d908a2 but there may be others, on other classes of
input.

I know that the compression experts that made google/snappy test
algorithmic tweaks on large corpora of test data, whether samples of a
web crawl or of RPC traffic, although it's clearly not practical to
check in 100s of megabytes of test data into the google/snappy git
repository. If you can convince the C++ google/snappy maintainers to
change their skipping algorithm to yours, then I'm happy to adopt your
skipping algorithm. But until then, I don't think I've seen enough
evidence to move off the safe choice.

For example, https://go-review.googlesource.com/#/c/20513/ points to a
spreadsheet at https://docs.google.com/spreadsheets/d/1VLxi-ac0BAtf735HyH3c1xRulbkYYUkFecKdLPH7NIQ/edit?usp=sharing
but to repeat what I said earlier on that code review: If I'm reading
the spreadsheet linked to from CL description right, there are only
three test files used to compare the before and after size. I don't
see links to the actual test files, but all three of them ("very
compressible", "random data", "xml") sound like they're either all
compressible or all incompressible. Three isn't a large sample size,
and it doesn't cover things like PDF files which, IIUC, is like a
textual format (i.e. compressible) that can embed binary resources
such as JPEG images (i.e. incompressible) all in the one file.


> Btw, now that you mention it, you still haven't fixed Snappy which will
> completely stop compressing data once it encounters 1MB uncompressible data

This is drifting off-topic, but that was fixed in
https://github.com/golang/snappy/commit/bf2ded9d81f5c22b62cf76363673c6f9765a6703
which was committed a month ago. The test cases added in that commit
include a 2MB input of which the first 1MB is uncompressible.


Finally, I hope that I haven't come across as being too negative in
the various flate discussions. I have made a number of criticisms, but
I aspire to be constructive. Please let me know if you are not finding
that to be the case. I do appreciate the work you've done, and I do
think that it will significantly improve Go 1.7 and beyond.

klau...@gmail.com

Mar 28, 2016, 10:55:03 AM
to golang-dev, klau...@gmail.com, joe...@digital-static.net, r...@golang.org
Hi!

Thanks for the feedback.

I have adjusted the CL, since the benchmark numbers have changed after CL 20929 has been merged.

I have added several more benchmarks to the spreadsheet (A disk image, 10GB.tar, GOB stream, Go Object files, Go source files, web server benchmarks), as well as the revision numbers used for the benchmark. Files for each benchmark can now be downloaded.


It should show the strengths and weaknesses of the revised algorithm, including the cases where it doesn't perform so well compared to the current master. However, the "bad" cases are mostly on content that is already very fast to compress, so this CL should hopefully reduce the "worst case" (in terms of speed) compression scenarios, where it provides a very good speedup. I want to be transparent about this tradeoff.


I have made some small adjustments to the code:
* I have taken out most of the shared code, so each function now only has the main encoding loop.
* I have removed an optimization that may be too aggressive for a standard package, which skipped Huffman compression if no matches were found in a 64KB block. That would have a negative impact on encoding random base64-encoded data, for instance, so to avoid this case the encoder now always attempts a Huffman encode.
* Hash values are now 16 bits instead of 14 bits. This does not impact performance negatively and allows for better compression than the previous version.
* Minor tweaks.


On Thursday, 24 March 2016 13:10:46 UTC+1, Nigel Tao wrote:

In any case, I will repeat myself in case there was a
mis-communication: even if there is or was (or both) a big gap
between level 3 and level 4, it would be great to see both the old and
new algorithms for levels 1-3 on the graph.

I will do that when we get to re-balancing the levels. I have updated the spreadsheet with the new numbers, which you are welcome to grab if you have the time.
 
They may offer good progression (again, seeing both the before and the
after for levels 1-3 on the graphs would help inform the discussion),
but in practice, do people use anything other than BestSpeed,
BestCompression, NoCompression or DefaultCompression? I don't know the
answer, but if not, it might not be worth having three copy/pastes of
the new algorithm if nobody uses levels 2 or 3.

You do have a good point. I cannot be the judge of whether it should be included. I have tried to factor out the common code in the CL now - at least the things that could be done without sacrificing performance in a measurable way. 

If it is still "too much", I think I would go for adding "encodeL2" as BestSpeed. It is often very close to "encodeL1" in terms of speed (within 5-10%), but offers 1% better compression. If we should only replace level 1, I would choose that. Maybe other people have input on whether it should be a choice, or we should just implement 1 level.


Even if you are now resetting every 64KB, that still means that the "s
+= 1 + ((s - lit) >> 4)" skipping can be very aggressive within each
64KB block. Specifically, it is exponential instead of quadratic (see,
for example, the table after "It prints" on the golang/snappy commit
linked to immediately above), which I think is too aggressive.

I have some "mixed content" data, a typical backup set and a virtual disk image. I do not see any significant compression loss from the aggressive skipping. IMO the risk is small, since only up to 64KB will be affected. In my opinion that is a reasonable trade for levels 1-3, where 2 & 3 are set less aggressively.

 
I know that the compression experts that made google/snappy test
algorithmic tweaks on large corpora of test data, whether samples of a
web crawl or of RPC traffic, although it's clearly not practical to
check in 100s of megabytes of test data into the google/snappy git
repository. If you can convince the C++ google/snappy maintainers to
change their skipping algorithm to yours, then I'm happy to adopt your
skipping algorithm. But until then, I don't think I've seen enough
evidence to move off the safe choice.

I have added a bunch more test sets, which you can download and test. I cannot "compete" with a secret test set; plus, we are talking about an algorithm where the user has selected "I want the fastest option". If we were talking "DefaultCompression", I would definitely go for the safe option.

 
compressible or all incompressible. Three isn't a large sample size,
and it doesn't cover things like PDF files which, IIUC, is like a
textual format (i.e. compressible) that can embed binary resources
such as JPEG images (i.e. incompressible) all in the one file.

The 10GB test set is good "mixed content", as is the disk image. If you have any more specific ones, I would love to test them.



Finally, I hope that I haven't come across as being too negative in
the various flate discussions. I have made a number of criticisms, but
I aspire to be constructive. Please let me know if you are not finding
that to be the case. I do appreciate the work you've done, and I do
think that it will significantly improve Go 1.7 and beyond.

And I also love a good discussion. For issues like this it is important to discuss tradeoffs, but I don't like being presented with "facts" that subtly imply I don't know what I am doing. We both do sub-optimal things from time to time, but bringing up the same thing for the 4th or 5th time and implying "you made a mistake there, therefore this is garbage" does not feel like a constructive way of discussing this.

I don't think we have wildly different views on this. I think I try to cater a bit more toward the typical use case. Even though we as "compression experts" know that gzip compressing a JPEG is usually a silly thing, it is something that will happen quite often (think backups, disk images), so the compressor being able to quickly (5x faster) "skip" such content is important in my design. That is a much more likely scenario than the tiny chance of slightly sub-optimal compression in a 64KB block, so I optimize for the former.

The optimization I just removed is another case. Where no matches were found in a 64KB block, the block was always skipped (stored). For *most* content that is fine, but for something like base64-encoded random data, it would then never be able to reduce the entropy. To me this is a clear case of a bad worst case in a realistic usage scenario - yes, people will compress base64-encoded data, and it should be compressed.

We are not dealing with "perfect" compression here, which makes this discussion harder than average. We are dealing with a spectrum of "give me fast" (BestSpeed), "give me reasonable" (Default) and "give me close to optimal" (BestCompression). For the "fast" end you *will* be able to construct worst cases (both in terms of speed and compression), but they should be rare. To me (and hopefully to users) it is obvious that BestSpeed will not deliver the best compression, because it makes certain choices for speed. A worst case where speed is very low is bad when the user has selected a "fast" option, and my choices in the code should reflect that.


/Klaus
 

Nigel Tao

Mar 31, 2016, 8:08:56 AM
to Klaus Post, golang-dev, Joe Tsai, Russ Cox
On Tue, Mar 29, 2016 at 1:54 AM, <klau...@gmail.com> wrote:
>> Even if you are now resetting every 64KB, that still means that the "s
>> += 1 + ((s - lit) >> 4)" skipping can be very aggressive within each
>> 64KB block. Specifically, it is exponential instead of quadratic (see,
>> for example, the table after "It prints" on the golang/snappy commit
>> linked to immediately above), which I think is too aggressive.
>
> I have some "mixed content" data, a typical backup set and a virtual disk
> image. I do not see any significant compression loss from the aggressive
> skipping. IMO the risk is small, since only up to 64KB will be affected. In
> my opinion that is a reasonable trade for levels 1-3, where 2 & 3 are set less
> aggressively.

We're repeating ourselves, but IMO the risk is non-zero, and the
existing snappy algorithm is plenty fast enough.

I asked the snappy-compression mailing list (where the C++ snappy code
is discussed) for their thoughts. I'll crunch some C++ numbers
tomorrow (it's getting late here in Sydney):

https://groups.google.com/forum/#!topic/snappy-compression/3Qa3fASLkNA

Nigel Tao

Apr 2, 2016, 9:45:24 PM
to Klaus Post, golang-dev, Joe Tsai, Russ Cox
A couple of responses below, but note that for
github.com/golang/snappy, I have just updated the Go encoder to have
byte-for-byte identical output to the C++ encoder. There were a few
smaller commits recently to set this up, and further optimization work
remains, but the punchline commit at
https://github.com/golang/snappy/commit/8939696c2214cde5dda896a76f5bf56e80a16855
shows both faster throughput (on GOARCH=amd64) and smaller output.


On Tue, Mar 29, 2016 at 1:54 AM, <klau...@gmail.com> wrote:
>> They may offer good progression (again, seeing both the before and the
>> after for levels 1-3 on the graphs would help inform the discussion),
>> but in practice, do people use anything other than BestSpeed,
>> BestCompression, NoCompression or DefaultCompression? I don't know the
>> answer, but if not, it might not be worth having three copy/pastes of
>> the new algorithm if nobody uses levels 2 or 3.
>
> You do have a good point. I cannot be the judge of whether it should be
> included. I have tried to factor out the common code in the CL now - at
> least the things that could be done without sacrificing performance in a
> measurable way.
>
> If it is still "too much", I think I would go for adding "encodeL2" as
> BestSpeed. It is often very close to "encodeL1" in terms of speed (within
> 5-10%), but offers 1% better compression. If we should only replace level 1,
> I would choose that. Maybe other people have input on whether it should be a
> choice, or we should just implement 1 level.

I'd prefer only replacing level 1, and leaving 2 and 3 alone,
especially since, AFAICT,
https://go-review.googlesource.com/#/c/21021/4/src/compress/flate/deflatefast.go
still has a lot of copy/paste between encodeL1, encodeL2 and encodeL3.

In any case, I suspect that being based on the C++ encoder algorithm
(as mentioned above) will help.


On Thu, Mar 31, 2016 at 11:08 PM, Nigel Tao <nige...@golang.org> wrote:
> I asked the snappy-compression mailing list (where the C++ snappy code
> is discussed) for their thoughts. I'll crunch some C++ numbers
> tomorrow (it's getting late here in Sydney):
>
> https://groups.google.com/forum/#!topic/snappy-compression/3Qa3fASLkNA

FWIW, my reading of that discussion thread is that Steinar Gunderson
(one of the C++ snappy authors) reckons that while it's much better
throughput on binary data and slightly worse on textual data, the "1 +
((s - lit) >> 5)" skipping algorithm is a net negative for the C++
implementation, so I'm less inclined to adopt it for the Go
implementation, whether in golang/snappy or in compress/flate.

klau...@gmail.com

Apr 3, 2016, 6:56:22 AM
to golang-dev, klau...@gmail.com, joe...@digital-static.net, r...@golang.org
On Sunday, 3 April 2016 03:45:24 UTC+2, Nigel Tao wrote:

I'd prefer only replacing level 1, and leaving 2 and 3 alone,
especially since, AFAICT,
https://go-review.googlesource.com/#/c/21021/4/src/compress/flate/deflatefast.go
still has a lot of copy/paste between encodeL1, encodeL2 and encodeL3.

If you prefer that, I will keep the "encodeL2".

encodeL3 does not have any duplicate code, except the first few lines. "encodeL1" and "encodeL2" share some code, and encodeL1 could be achieved by just not copying the buffer on each call and calling encodeL2. However, you would lose all the speed gained, except the memcpy. We could look at the possibility of swapping two encoding buffers in the future, so we can keep a reference to the previous block, and avoid the memcpy.
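
Something along these lines, purely hypothetical and not in the CL: keep two buffers and swap them between blocks, so the previous block stays addressable for back-references without copying it.

package main

import "fmt"

type encoder struct {
	prev, cur []byte
}

// nextBlock makes the current block the new "previous" block and reuses
// the old prev slice's storage for the incoming data.
func (e *encoder) nextBlock(src []byte) {
	e.prev, e.cur = e.cur, append(e.prev[:0], src...)
}

func main() {
	e := &encoder{}
	e.nextBlock([]byte("block one"))
	e.nextBlock([]byte("block two"))
	fmt.Printf("prev=%q cur=%q\n", e.prev, e.cur) // prev="block one" cur="block two"
}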

I will remove all but encodeL2, and make that level 1, and restore the previous level 2+3. I am in the middle of starting a new job, so it may take a little time, but I hope we can get it finalized before the 1.7 lockdown.


> https://groups.google.com/forum/#!topic/snappy-compression/3Qa3fASLkNA

FWIW, my reading of that discussion thread is that Steinar Gunderson
(one of the C++ snappy authors) reckons that while it's much better
throughput on binary data and slightly worse on textual data, the "1 +
((s - lit) >> 5)" skipping algorithm is a net negative for the C++
implementation, so I'm less inclined to adopt it for the Go
implementation, whether in golang/snappy or in compress/flate.

Yes, it is not a clear-cut win/lose. On some archs, I would expect the subtraction to be faster than a separate variable, not so much on others. It really depends on register allocation, etc.

I will add the "conservative" skipping. Another change I have to make is to make writeBlockDynamic able to output raw bytes when storing the block uncompressed takes fewer bits. That should also make random data faster, making up for the conservative skipping.
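
The idea is just to compare the estimated bit cost of the dynamic Huffman block against the cost of storing it raw and emit the smaller one. A toy sketch with made-up numbers, not the actual writeBlockDynamic code:

package main

import "fmt"

// chooseBlock picks the cheaper encoding for one deflate block, given a
// (hypothetical) estimate of the Huffman-coded payload size in bits and a
// rough allowance for transmitting the code-length tables. The 5 extra
// bytes approximate the stored-block header.
func chooseBlock(blockLen, huffPayloadBits, treeDescBits int) string {
	storedBits := (blockLen + 5) * 8
	if huffPayloadBits+treeDescBits < storedBits {
		return "dynamic Huffman block"
	}
	return "stored (raw) block"
}

func main() {
	// Random bytes need ~8 bits per symbol, so the tree overhead pushes the
	// dynamic block past the stored size.
	fmt.Println(chooseBlock(65535, 65535*8, 600)) // stored (raw) block
	// Text-like data averaging ~5 bits per symbol easily wins.
	fmt.Println(chooseBlock(65535, 65535*5, 600)) // dynamic Huffman block
}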

/Klaus

Nigel Tao

Apr 4, 2016, 8:23:50 PM
to Klaus Post, golang-dev, Joe Tsai, Russ Cox
On Sun, Apr 3, 2016 at 11:45 AM, Nigel Tao <nige...@golang.org> wrote:
> On Thu, Mar 31, 2016 at 11:08 PM, Nigel Tao <nige...@golang.org> wrote:
>> I asked the snappy-compression mailing list (where the C++ snappy code
>> is discussed) for their thoughts. I'll crunch some C++ numbers
>> tomorrow (it's getting late here in Sydney):
>>
>> https://groups.google.com/forum/#!topic/snappy-compression/3Qa3fASLkNA
>
> FWIW, my reading of that discussion thread is that Steinar Gunderson
> (one of the C++ snappy authors) reckons that while it's much better
> throughput on binary data and slightly worse on textual data, the "1 +
> ((s - lit) >> 5)" skipping algorithm is a net negative for the C++
> implementation, so I'm less inclined to adopt it for the Go
> implementation, whether in golang/snappy or in compress/flate.

Ah, the sentiment seems to be turning on that thread. As I said
earlier, if the C++ google/snappy maintainers can be convinced to
change their skipping algorithm to yours, then I'm happy to adopt your
skipping algorithm. Kudos to you.

Giovanni Bajo

Apr 5, 2016, 5:51:24 PM
to golang-dev, klau...@gmail.com, joe...@digital-static.net, r...@golang.org
Just for the sake of history records, the original idea comes from the LZO compressor, that I ported to Go a few months ago:
 
Klaus spotted it while doing benchmarks between LZO and Snappy, extracted it, and adapted it, so all credits to him anyway. I just love the fact that this little, uncommented compression trick went through several packages now, to be fully re-analyzed some 20 years after it was first conceived :)

Giovanni

klau...@gmail.com

Apr 10, 2016, 10:06:59 AM
to golang-dev, klau...@gmail.com, joe...@digital-static.net, r...@golang.org

Just for the sake of history records, the original idea comes from the LZO compressor, that I ported to Go a few months ago:
 
Klaus spotted it while doing benchmarks between LZO and Snappy, extracted it, and adapted it, so all credits to him anyway. I just love the fact that this little, uncommented compression trick went through several packages now, to be fully re-analyzed some 20 years after it was first conceived :)

Ah yes - thanks for reminding me! It is a very neat trick indeed.

I have updated the CL, which I now think is ready for review. I have removed two of the modes, so we now only replace "BestSpeed". As I mentioned, I would go for the proposed level 2, which is a tiny bit slower, but offers better compression than the originally proposed level 1. If we are only to replace one, that one has the best trade-off.


I have also submitted a separate change, which also improves compression on some edge cases - mainly base64 encoded content and similar: https://go-review.googlesource.com/#/c/21757/

Thanks for all the feedback and testing!

/Klaus