Hi Ben,
thanks for sharing these results. Could you share a few more details
to make the experiment reproducible, especially:
- which WARC file was used to measure the compression ratio
- the code to compress the WARC
Each WARC record is compressed in a separate deflate block. This is
in accordance with the gzip spec and makes it possible to pull out
single documents/records by file offset, as required by wayback
machines. Per-record-compressed WARC files are about 10% larger.
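To illustrate why this matters: given a record's byte offset (normally
taken from an index such as CDX), a single record can be decompressed
without reading the rest of the file. A minimal sketch, assuming only
Python's standard-library zlib (file name and offset are illustrative):

    import zlib

    def read_record(warc_path, offset, chunk_size=65536):
        # 16 added to MAX_WBITS: expect a gzip header
        decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
        parts = []
        with open(warc_path, "rb") as f:
            f.seek(offset)
            while not decomp.eof:  # stops at the end of this record's gzip member
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                parts.append(decomp.decompress(chunk))
        return b"".join(parts)  # one decompressed WARC record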
Decompression speed is also an important point: WARCs are written
once but read often. Of course, a smaller size is always good,
and download time is the bottleneck whenever the data is not
processed in a data center in the AWS us-east-1 region.
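A rough way to compare the read-side cost, as a sketch (assuming the
Python zstandard bindings are installed; "sample.warc" is a placeholder
for any uncompressed WARC file):

    import time
    import zlib

    import zstandard  # pip install zstandard (assumed available)

    data = open("sample.warc", "rb").read()
    gz = zlib.compress(data)
    zst = zstandard.ZstdCompressor(level=5).compress(data)

    t0 = time.perf_counter()
    zlib.decompress(gz)
    t1 = time.perf_counter()
    zstandard.ZstdDecompressor().decompress(zst)
    t2 = time.perf_counter()
    print(f"gzip: {t1 - t0:.2f}s  zstd: {t2 - t1:.2f}s")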
> Knowing how ubiquitous gzip is, it makes sense to me why it might be preferred.
> I know zstd might not be installed on everyone's machine, but it might be worth considering in
> order to make the data more accessible.
The main point, from the perspective of our users, is that all WARC
libraries support gzipped WARC files. Chances are high that
changing the compression codec would break the data processing
pipelines of several users.
Also, the WARC format is an ISO standard, so the first step should be
to get alternative compression codecs into the WARC spec:
https://github.com/iipc/warc-specifications/blob/gh-pages/specifications/warc-format/warc-1.1/index.md#annex-a-informative-compression-recommendations
Version 1.1 currently has the status "proposed".
> Assuming that these compression ratios would roughly apply across the entire dataset, the total
> download size of the February 2017 crawl would go from 55.88 TB to 35.5 TB using the default
> level 5 compression:
No question, that's nice. But it's not an easy step to take...
> If you're open to considering another compression method,
It would be stupid not to be open to it, simply because a smaller size would allow us to archive more data.
Thanks,
Sebastian
On 03/15/2017 07:08 AM, Ben Wills wrote:
> I just realized those numbers are wonky (level 6). Ran again, though level 9 must have had
> something running in the background, as it's a hair slower than level 10 in this run:
>
> * zstd[ 1]  707,109,581   10s 194ns
> * zstd[ 2]  683,571,696   11s 486ns
> * zstd[ 3]  627,890,986   14s 181ns
> * zstd[ 4]  626,607,826   18s 322ns
> * zstd[ 5]  599,217,283   26s 709ns
> * zstd[ 6]  573,733,218   40s 786ns
> * zstd[ 7]  564,341,467   45s 185ns
> * zstd[ 8]  555,965,735   55s 225ns
> * zstd[ 9]  552,469,987   69s 465ns
> * zstd[10]  549,859,495   69s 311ns
> * zstd[11]  539,069,837   85s 849ns
> * zstd[12]  536,512,569  109s 319ns
>
>
>
> On Tuesday, March 14, 2017 at 11:33:30 PM UTC-6, Ben Wills wrote:
>
> Knowing how ubiquitous gzip is, it makes sense to me why it might be preferred.
>
> If you're open to considering another compression method, zstd/zstandard
> (https://github.com/facebook/zstd), a Facebook project by Yann Collet (the
> author of xxHash and LZ4):
>
> * has matured past version 1.0
> * is in production use by Facebook
> * is very fast in terms of both compression and decompression
> * compresses at a higher ratio than gzip
>
> I took a random Common Crawl WARC file...
>
> * Original file size: 4,463,763,277 bytes
> * gzipped size: 941,893,587 bytes
>
> ...and ran it through various zstd compression levels. Below are 12 of the 22 compression
> levels, followed by the resulting byte size and the time it took to compress. Note that these
> compression times were recorded using the C library with the file already in memory, so they
> do not account for reading the file off of disk (a rough sketch of such a measurement loop
> follows the table).
>
> * zstd[ 1]  707,109,581   10s 287ns
> * zstd[ 2]  683,571,696   12s 558ns
> * zstd[ 3]  627,890,986   14s 636ns
> * zstd[ 4]  626,607,826   20s 319ns
> * zstd[ 5]  599,217,283   34s 113ns *
> * zstd[ 6]  573,733,218   61s 811ns
> * zstd[ 7]  564,341,467   51s 741ns
> * zstd[ 8]  555,965,735   68s 792ns
> * zstd[ 9]  552,469,987   81s 583ns
> * zstd[10]  549,859,495   79s 14ns
> * zstd[11]  539,069,837   87s 443ns
> * zstd[12]  536,512,569  117s 464ns
>
> * Compression level 5 is the default zstd compression level.
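>
> For reference, a rough Python sketch of that measurement loop (the numbers
> above came from the C library; the zstandard bindings and "file.warc" are
> just stand-ins here):
>
>     import time
>
>     import zstandard  # pip install zstandard
>
>     # compress an in-memory buffer at each level and time it
>     data = open("file.warc", "rb").read()
>     for level in range(1, 13):
>         start = time.perf_counter()
>         compressed = zstandard.ZstdCompressor(level=level).compress(data)
>         elapsed = time.perf_counter() - start
>         print(f"zstd[{level:2d}]  {len(compressed):>11,}  {elapsed:.0f}s")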
>
> Assuming that these compression ratios would roughly apply across the entire dataset, the total
> download size of the February 2017 crawl would go from 55.88 TB to 35.5 TB using the default
> level 5 compression:
>
> * 35.55 ≈ (599,217,283 / 941,893,587) * 55.88
>
> This would make the data far more accessible to a number of folks. Not only would several
> hundred dollars of disks be saved (at $200 per 8 TB drive on the cheapest end, roughly 20 TB
> less data is about $500 of disks), but download times would also be significantly reduced. As
> it is, 55.88 TB takes me the better part of a month on a 240 Mbit/s connection
> (55.88 TB * 8 / 240 Mbit/s ≈ 21.5 days of continuous downloading).
>
> I know zstd might not be installed on everyone's machine, but it might be worth considering in
> order to make the data more accessible.
>
> Ben
>