Alternative to gzip Compression, ~37% smaller

Ben Wills

Mar 15, 2017, 1:33:30 AM
to Common Crawl
Knowing how ubiquitous gzip is, it makes sense to me why it might be preferred.

If you're open to considering another compression method, zstd/Zstandard (https://github.com/facebook/zstd), a Facebook project by Yann Collet (the author of xxHash and LZ4):
  • has matured past version 1.0.
  • is in production use by Facebook
  • is very fast in terms of both compression and decompression
  • compresses at a higher ratio than gzip.
I took a random Common Crawl warc file...
  • Original file size: 4,463,763,277 bytes
  • gzipped size: 941,893,587 bytes
...and ran it through various zstd compression levels. Below are 12 of the 22 compression levels, each followed by the resulting byte size and the time it took to compress. Note that these compression times were recorded using the C library with the file already in memory, so they do not account for reading the file off of disk.
  • zstd[ 1] 707,109,581 10s 287ns
  • zstd[ 2] 683,571,696 12s 558ns
  • zstd[ 3] 627,890,986 14s 636ns
  • zstd[ 4] 626,607,826 20s 319ns
  • zstd[ 5] 599,217,283 34s 113ns *
  • zstd[ 6] 573,733,218 61s 811ns
  • zstd[ 7] 564,341,467 51s 741ns
  • zstd[ 8] 555,965,735 68s 792ns
  • zstd[ 9] 552,469,987 81s 583ns
  • zstd[10] 549,859,495 79s 14ns
  • zstd[11] 539,069,837 87s 443ns
  • zstd[12] 536,512,569 117s 464ns
* Level 5 is the level used for the estimate below. (zstd's own default compression level is 3.)

Assuming that these compression ratios would roughly apply across the entire dataset, the total download size of the February 2017 crawl would go from 55.8 TB to 35.5 TB using level 5 compression:
  • 35.4990466576 = (599,217,283 / 941,893,587) * 55.8
This would make the data far more accessible to a number of folks. Not only would several hundred dollars of disk usage be saved (considering $200/8TB on the cheapest end, that's $500 of disks), but download times would also be significantly reduced. As it is, 55.8 TB takes almost an entire month for me on a 240Mbit/s connection.
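The arithmetic above can be re-checked with a short script. This is only a sketch: the byte counts, the 55.8 TB crawl size, the 240 Mbit/s link and the $200/8TB disk price are the figures from this post, while decimal terabytes and a fully saturated link are assumptions, so the download times are lower bounds rather than measurements.

```python
# Figures quoted in the post above.
GZIP_BYTES = 941_893_587        # gzipped size of the sample WARC
ZSTD_L5_BYTES = 599_217_283     # zstd level 5 size of the same WARC
CRAWL_TB_GZIP = 55.8            # February 2017 crawl, gzip-compressed
LINK_MBIT_PER_S = 240           # the poster's connection
USD_PER_TB = 200 / 8            # "$200/8TB on the cheapest end"

ratio = ZSTD_L5_BYTES / GZIP_BYTES
crawl_tb_zstd = ratio * CRAWL_TB_GZIP
saved_tb = CRAWL_TB_GZIP - crawl_tb_zstd


def download_days(terabytes: float, mbit_per_s: float) -> float:
    """Days to transfer `terabytes` (assumed decimal TB) at a sustained line rate."""
    bits = terabytes * 1e12 * 8
    return bits / (mbit_per_s * 1e6) / 86_400


print(f"zstd/gzip size ratio : {ratio:.4f}")
print(f"projected crawl size : {crawl_tb_zstd:.2f} TB (saves {saved_tb:.2f} TB)")
print(f"disk cost saved      : ${saved_tb * USD_PER_TB:.0f}")
print(f"download, gzip       : {download_days(CRAWL_TB_GZIP, LINK_MBIT_PER_S):.1f} days")
print(f"download, zstd       : {download_days(crawl_tb_zstd, LINK_MBIT_PER_S):.1f} days")
```

Real transfers rarely sustain the full line rate, which would explain why the observed download time is longer than the figure computed here.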

I know zstd might not be installed on everyone's machine, but it might be worth considering in order to make the data more accessible.

Ben

Ben Wills

Mar 15, 2017, 2:08:01 AM
to Common Crawl
I just realized those numbers are wonky (level 6). I ran it again, though level 9 must have had something running in the background, as it's a hair slower than level 10 in this run:
  • zstd[ 1] 707109581 10s 194ns
  • zstd[ 2] 683571696 11s 486ns
  • zstd[ 3] 627890986 14s 181ns
  • zstd[ 4] 626607826 18s 322ns
  • zstd[ 5] 599217283 26s 709ns
  • zstd[ 6] 573733218 40s 786ns
  • zstd[ 7] 564341467 45s 185ns
  • zstd[ 8] 555965735 55s 225ns
  • zstd[ 9] 552469987 69s 465ns
  • zstd[10] 549859495 69s 311ns
  • zstd[11] 539069837 85s 849ns
  • zstd[12] 536512569 109s 319ns
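To make the trade-off in this second run easier to read, the table can be turned into compression ratios, savings relative to gzip, and rough throughput. This is a sketch only: the sizes and whole-second timings are copied from the list above, the sub-second part of each timing is dropped because its unit is unclear, and `summarize` is a hypothetical helper name.

```python
ORIGINAL_BYTES = 4_463_763_277   # uncompressed WARC
GZIP_BYTES = 941_893_587         # gzip size of the same file

# (zstd level, compressed bytes, whole seconds) from the second run above.
RUNS = [
    (1, 707_109_581, 10), (2, 683_571_696, 11), (3, 627_890_986, 14),
    (4, 626_607_826, 18), (5, 599_217_283, 26), (6, 573_733_218, 40),
    (7, 564_341_467, 45), (8, 555_965_735, 55), (9, 552_469_987, 69),
    (10, 549_859_495, 69), (11, 539_069_837, 85), (12, 536_512_569, 109),
]


def summarize(level: int, size: int, seconds: int) -> dict:
    """Derive ratio, saving versus gzip, and input throughput for one run."""
    return {
        "level": level,
        "ratio": ORIGINAL_BYTES / size,          # uncompressed / compressed
        "vs_gzip": 1 - size / GZIP_BYTES,        # fraction smaller than gzip
        "mb_per_s": ORIGINAL_BYTES / 1e6 / seconds,
    }


print(f"gzip ratio: {ORIGINAL_BYTES / GZIP_BYTES:.2f}")
for run in RUNS:
    row = summarize(*run)
    print(f"zstd[{row['level']:2d}] ratio {row['ratio']:.2f}  "
          f"{row['vs_gzip']:.1%} smaller than gzip  {row['mb_per_s']:.0f} MB/s")
```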

Sebastian Nagel

Mar 15, 2017, 5:25:42 AM
to common...@googlegroups.com
Hi Ben,

thanks for sharing these results. Could you share more details,
to make the experiment reproducible, esp.

- which WARC file was used to measure the compression ratio

- the code to compress the WARC

WARC records are each compressed in a separate deflate block.
This is in accordance with the gzip spec and allows single
documents/records to be pulled out by file offset, as required by
wayback machines. Per-record-compressed WARC files are about 10% larger.
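A minimal sketch of the per-record layout described above, using only Python's standard library. Each record is written as its own gzip member, so the file as a whole is still valid gzip while any single record can be decompressed from its byte offset. The function names and the in-memory "file" are hypothetical; a real WARC writer would also keep the offsets in an external index.

```python
import gzip
import zlib


def write_records(records):
    """Compress each record as a separate gzip member; return (blob, offsets)."""
    blob = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(blob))        # where this record's member starts
        blob += gzip.compress(record)
    return bytes(blob), offsets


def read_record(blob, offset):
    """Decompress exactly one gzip member starting at `offset`."""
    # wbits=31 tells zlib to expect a gzip header and trailer. The decoder
    # stops at the end of the first member; later bytes land in unused_data.
    decoder = zlib.decompressobj(wbits=31)
    return decoder.decompress(blob[offset:])


records = [b"record one\r\n\r\n", b"record two\r\n\r\n", b"record three\r\n\r\n"]
blob, offsets = write_records(records)

print(read_record(blob, offsets[1]))                  # only the second record
print(gzip.decompress(blob) == b"".join(records))     # whole file still decodes
```

The size penalty mentioned above comes from each member carrying its own header and trailer and starting with an empty compression window.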

Also the decompression speed is an important point, WARCs are written
once but read often. Of course, a smaller size is always good
and the download time is the bottleneck if data is not processed
in a data center in the AWS us-east-1 region.

> Knowing how ubiquitous gzip is, it makes sense to me why it might be preferred.

> I know zstd might not be installed on everyone's machine, but it might be worth considering in
> order to make the data more accessible.

The main point is (from the perspective of our users): all WARC
libraries support gzipped WARC files. The chance is high that
changing the compression codec will break the data processing
pipelines of several users.

Also: the WARC format is an ISO standard, the first step should be
to get alternative compression codecs into the WARC spec:

https://github.com/iipc/warc-specifications/blob/gh-pages/specifications/warc-format/warc-1.1/index.md#annex-a-informative-compression-recommendations

Version 1.1 is in status "proposed".

> Assuming that these compression ratios would roughly apply across the entire dataset, the total
> download size of the February 2017 crawl would go from 55.88 TB to 35.5 TB using the default
> level 5 compression:

No question, that's nice. But it's not an easy step to make...

> If you're open to considering another compression method,

It would be stupid not to be open, simply because it would allow us to archive more data.


Thanks,
Sebastian



Greg Lindahl

Mar 15, 2017, 2:46:00 PM
to common...@googlegroups.com
On Wed, Mar 15, 2017 at 10:25:38AM +0100, Sebastian Nagel wrote:

> Also the decompression speed is an important point, WARCs are written
> once but read often.

I looked at the zstd github page and it appears that it decompresses
about 2X as fast as zlib, no matter which compression level is used.
So that's good.

There's a big list of bindings for various languages. No mention of
OSes supported, and a glance at cpan-testers output for the Perl
binding says it's only working on Linux, perhaps because the perl
porter didn't bother to hook up other OSes. Presumably there's
a fair number of WARC users on OSX and windoze.

One interesting feature of this compression method is that you can
pre-initialize a dictionary to speed compression of small files
which contain a bunch of known strings. This should be great for
WARC Request and Response records.
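The preset-dictionary idea can be illustrated without installing zstd, because zlib in Python's standard library supports the same concept through its `zdict` parameter. This is a stand-in technique, not zstd's dictionary API, and the dictionary and record below are made-up examples; the sketch only shows the effect on the compressed size of a small record that shares strings with the dictionary.

```python
import zlib

# Hypothetical dictionary: strings expected to recur in WARC response records.
DICTIONARY = (
    b"WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: \r\n"
    b"WARC-Record-ID: <urn:uuid:>\r\n"
    b"Content-Type: application/http; msgtype=response\r\nContent-Length: \r\n"
)


def compress(data, zdict=None):
    """Deflate `data`, optionally primed with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=6)
    else:
        comp = zlib.compressobj(level=6, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress(data, zdict=None):
    """Inflate `data`; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj() if zdict is None else zlib.decompressobj(zdict=zdict)
    return decomp.decompress(data) + decomp.flush()


record = (
    b"WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2017-02-20T00:00:00Z\r\n"
    b"Content-Type: application/http; msgtype=response\r\nContent-Length: 0\r\n\r\n"
)

plain = compress(record)
primed = compress(record, DICTIONARY)
print(len(record), len(plain), len(primed))
print(decompress(primed, DICTIONARY) == record)
```

Note the dependency this creates: a record compressed with a dictionary cannot be decompressed without that exact dictionary.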

Ben, if you're interested in doing additional experiments, it would be
great if you could divide a WARC into records, show the actual
compression ratio when compressed by records, and also experiment with
a pre-initialized dictionary.

-- greg

Ben Wills

Mar 15, 2017, 4:15:13 PM
to Common Crawl

> thanks for sharing these results. Could you share more details,
> to make the experiment reproducible, esp.
>
> - which WARC file was used to measure the compression ratio
>
> - the code to compress the WARC

I can definitely do this. If not tonight, I'll do so this weekend. Greg mentioned below the use of a dictionary file, which I'd considered as well. So I'll run a bunch of use cases across several WARC, WET, and WAT files to see results across the various compression options and methods. I'll also do the same with gzip options to see how performance compares.

> WARC records are compressed each in a separate deflate block.
> This is in accordance with the gzip spec and allows to pull out
> single documents/records by file offset as required by wayback
> machines. Per-record-compressed WARC files are about 10% larger.

I didn't realize you could do document-specific blocks within gzip. I'll look more into that and see if there's something comparable with zstd. If zstd isn't capable of that (I know things like zpaq are able to, and are even able to update specific documents in an existing archive), I can see how that would create problems with a system like archive.org, et al.

> Also the decompression speed is an important point, WARCs are written
> once but read often.  Of course, a smaller size is always good
> and the download time is the bottleneck if data is not processed
> in a data center in the AWS us-east-1 region.

If I understand correctly, benchmarks show zstd as being consistently faster. But I'll pull harder data from testing to confirm with the CC datasets.

> The main point is (from the perspective of our users): all WARC
> libraries support gzipped WARC files. The chance is high that
> changing the compression codec will break the data processing
> pipelines of several users.

I hadn't considered the pipelines that might be affected as well. Makes sense.

> Also: the WARC format is an ISO standard, the first step should be
> to get alternative compression codecs into the WARC spec:
>
> https://github.com/iipc/warc-specifications/blob/gh-pages/specifications/warc-format/warc-1.1/index.md#annex-a-informative-compression-recommendations
>
> Version 1.1 is in status "proposed".

What steps would need to be taken to have zstd considered, presuming it meets the spec?


Thanks for taking the time to respond. It definitely helps to understand more of what's required for a new compression spec. I'll dig into zstd more and should have some benchmarks and data comparing it to gzip some time this weekend.


Greg Lindahl

Mar 15, 2017, 4:19:12 PM
to common...@googlegroups.com
On Wed, Mar 15, 2017 at 01:15:13PM -0700, Ben Wills wrote:

> I didn't realize you could do document-specific blocks within gzip. I'll
> look more into that and see if there's something comparable with zstd.

Zstd does support this, the key thing is that you can concatenate 2
zstd files and the result fully decompresses. The zstd manpage
mentions it:

> Concatenation with .zst files
>
> It is possible to concatenate .zst files as is. zstd will decompress
> such files as if they were a single .zst file.

-- greg

Ben Wills

Mar 15, 2017, 4:19:55 PM
to Common Crawl

> I looked at the zstd github page and it appears that it decompresses
> about 2X as fast as zlib, no matter which compression level is used.
> So that's good.

I'll run some tests against libz over the weekend to confirm. But that does seem to be consistent.

> There's a big list of binding for various languages. No mention of
> OSes supported, and a glance at cpan-testers output for the Perl
> binding says it's only working on Linux, perhaps because the perl
> porter didn't bother to hook up other OSes. Presumably there's
> a fair number of WARC users on OSX and windoze.

I can test on an OSX laptop I've got. I don't have a Windows machine to test on, but I can check the C source to see if there are any Windows-specific definitions.

> One interesting feature of this compression method is that you can
> pre-initialize a dictionary to speed compression of small files
> which contain a bunch of known strings. This should be great for
> WARC Request and Response records.

I'd considered the dictionary as well, especially since I'll be using it for another project. I'll include some dictionary setups in my tests this weekend. Given the repetitive nature of these formats, in addition to both JSON and HTML, I would expect there's the potential for a considerable improvement in compression ratios with a well-trained dictionary.

> Ben, if you're interested in doing additional experiments, it would be
> great if you could divide a WARC into records, show the actual
> compression ratio when compressed by records, and also experiment with
> a pre-initialized dictionary.

Yep. Will do this weekend.

Ben Wills

Mar 15, 2017, 4:22:45 PM
to Common Crawl

> Zstd does support this, the key thing is that you can concatenate 2
> zstd files and the result fully decompresses. The zstd manpage
> mentions it:

Cool. I'll test this as well.

Sebastian Nagel

Mar 16, 2017, 6:51:03 AM
to common...@googlegroups.com
Hi Ben,

> What steps would need to be taken to have zstd considered, presuming it meets the spec?

I don't know anything about how a new WARC standard is approved. There are issues open on GitHub,
so why not open a new one to propose a better compression?

Thanks,
Sebastian

Ben Wills

Mar 20, 2017, 9:08:40 AM
to Common Crawl
I put together a proof-of-concept showing how Zstandard could be used to compress small blocks into a larger file, and to also then retrieve those blocks given an offset and byte length.

I also tested the use of a dictionary file and benchmarked various compression levels, etc.

As you'd mentioned, the compression wasn't as good when compressing specific URI blocks vs the entire file. But it's still a fair improvement over Gzip.
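The per-block penalty is easy to reproduce in miniature. The sketch below uses zlib from Python's standard library as a stand-in for both codecs and synthetic records as a stand-in for real WARC data, so the numbers it prints are illustrative only; with records this small and repetitive, the gap is likely to be far larger than on real crawl data. `make_records`, `per_record_size` and `whole_file_size` are hypothetical helper names.

```python
import zlib


def make_records(count=200):
    """Build small synthetic records that share most of their boilerplate."""
    return [
        (
            "WARC/1.0\r\nWARC-Type: response\r\n"
            f"WARC-Target-URI: http://example.com/page/{i}\r\n\r\n"
            f"<html><head><title>Page {i}</title></head>"
            f"<body><p>Item {i * 7919}</p></body></html>\r\n\r\n"
        ).encode()
        for i in range(count)
    ]


def per_record_size(records, level=6):
    """Total size when every record is compressed independently."""
    return sum(len(zlib.compress(record, level)) for record in records)


def whole_file_size(records, level=6):
    """Size when all records are compressed as one continuous stream."""
    return len(zlib.compress(b"".join(records), level))


records = make_records()
separate = per_record_size(records)
together = whole_file_size(records)
print(f"per-record: {separate} bytes, whole file: {together} bytes")
print(f"per-record output is {separate / together:.1f}x the whole-file output")
```

Independent blocks cannot reference strings seen in earlier records, which is the redundancy a shared dictionary is meant to win back.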

All of my findings and source code are here, along with a detailed Readme: https://github.com/benwills/proposal-warc-to-zstandard

What would you say would be the best next step for me to take from here? I've got a busy two weeks before I can work on this again, but I'm happy to help in any way I can.

Sebastian Nagel

Mar 20, 2017, 10:06:14 AM
to common...@googlegroups.com
Hi Ben,

> I put together a proof-of-concept

I'm impressed, it looks like a lot of work. It'll also take me some time to read it carefully.


> What would you say would be the best next step for me to take from here?

I think that's enough to open a request on iipc/warc-specifications.
It would also be good to (experimentally) extend a Python or Java WARC library, e.g.,
https://github.com/webrecorder/warcio
That makes real-world testing simpler :)

Thanks,
Sebastian

Greg Lindahl

Mar 20, 2017, 3:30:18 PM
to common...@googlegroups.com
Ben, that's a great writeup!

If I put on my archivist hat, I suppose the one thing I'd want to know
is how much benefit the dictionary provides. It's a large enough
file (8.2Mb) that you can't store it in every warc, but if it's not
stored in any warc, you have a chance of losing it.

To put it another way, today, if you've got a collection of warcs, and
somehow 1% of them are damaged, you can always use the other 99%.

Looking at the files you provided, it looks like that is an easy
question to answer myself!

oyeti timileyin

Aug 9, 2021, 8:22:01 PM
to Common Crawl
Hello everyone,
How are you all doing?
My name is Timileyin, I'm a growing web developer.

I would like to use gzip on a particular project. How can I go about it? Is there a YouTube tutorial for that?

Also, can I gzip media files too?
