CCNEWS data - File integrity checks

Matthias Petri

May 10, 2017, 8:41:06 PM

to Common Crawl

Hello!

I'm trying to create a reproducible Information Retrieval test collection from the CC-NEWS crawl. To that end, I'm trying to verify the integrity of the CC-NEWS files I downloaded from AWS. According to the AWS documentation found here:

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html

the ETag response header should correspond to the MD5sum of the file. For example:

[mpetri@~]$ wget -vS https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2017/02/CC-NEWS-20170202210704-00000.warc.gz
--2017-05-11 10:26:04-- https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2017/02/CC-NEWS-20170202210704-00000.warc.gz
Resolving commoncrawl.s3.amazonaws.com (commoncrawl.s3.amazonaws.com)... 54.231.49.16
Connecting to commoncrawl.s3.amazonaws.com (commoncrawl.s3.amazonaws.com)|54.231.49.16|:443... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
x-amz-id-2: VS8d6N6khsqAgPUiTtW1vnULOO6GKfqA6I76S5n9Qb1ul3KUt/6QN+4oGIa+mEtE
x-amz-request-id: 706EF34331861D8B
Date: Thu, 11 May 2017 00:26:05 GMT
Last-Modified: Thu, 02 Feb 2017 22:05:07 GMT
ETag: "9195bfa99ef5ecc9abae0d49cd19d169-129"
Accept-Ranges: bytes
Content-Type: binary/octet-stream
Content-Length: 1073748616
Server: AmazonS3
Length: 1073748616 (1.0G) [binary/octet-stream]
Saving to: ‘CC-NEWS-20170202210704-00000.warc.gz’
2017-05-11 10:29:04 (5.71 MB/s) - ‘CC-NEWS-20170202210704-00000.warc.gz’ saved [1073748616/1073748616]

Here I would expect "9195bfa99ef5ecc9abae0d49cd19d169" to be the MD5 sum of the downloaded file. However, this is not the case:

[mpetri@~]$ md5sum CC-NEWS-20170202210704-00000.warc.gz
ff41410b9705af54aaa0e8a10ce7c2c8 CC-NEWS-20170202210704-00000.warc.gz

Is this kind of check not possible, or am I doing something wrong here? I tried multiple files and the MD5 sum never matches the ETag.

Thanks,

Matthias Petri

Sebastian Nagel

May 11, 2017, 3:45:08 AM

to common...@googlegroups.com

Hi Matthias,

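it's not that trivial because the file was stored via a multi-part upload, cf. [1]. In that case the
ETag is not the MD5 sum of the whole file, and the suffix after the dash ("-129") is the number of parts.
But the good news: it's still possible to verify the downloaded WARC against the ETag.
I've tried the pure bash solution [2] using [3]: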

wget https://raw.githubusercontent.com/antespi/s3md5/master/s3md5
chmod +x s3md5

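file=CC-NEWS-20170202210704-00000.warc.gz
etag="9195bfa99ef5ecc9abae0d49cd19d169-129"

# number of upload parts: the suffix after the dash in the ETag
parts=$(echo "$etag" | cut -d- -f2)
# guess the part size in MiB from file size and part count (GNU stat; gives 8 here)
partsize=$((1+$(stat -c '%s' "$file")/($parts*2**20)))

./s3md5 --etag "$etag" "$partsize" "$file"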

s3md5 returns "TRUE"!
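
For readers who prefer Python, the same check can be sketched as below. This is only an illustration of the commonly described multi-part ETag scheme (MD5 over the concatenated binary MD5 digests of the parts, followed by "-" and the part count), not the code of s3md5 itself. The function names are made up for this example, and the part-size guess is the same heuristic as in the shell snippet above: it assumes a part size in whole MiB and can guess wrong, for instance when the file size is an exact multiple of the part size.

```python
import hashlib
import os


def guess_part_size(file_size, etag):
    """Guess the upload part size in bytes (heuristic, whole MiB assumed)."""
    parts = int(etag.strip('"').split("-")[1])
    # Same formula as the shell snippet: 1 + size / (parts * 2**20), in MiB.
    return (1 + file_size // (parts * 2**20)) * 2**20


def multipart_etag(path, part_size):
    """Compute the multi-part style ETag of a local file for a given part size."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            # Binary MD5 digest of each part, in upload order.
            digests.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


def plain_md5(path, block_size=2**20):
    """Plain MD5 of the whole file, read in blocks to keep memory use low."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            md5.update(block)
    return md5.hexdigest()


def verify_etag(path, etag):
    """Return True if the local file is consistent with the given ETag."""
    etag = etag.strip('"')
    if "-" not in etag:
        # No part suffix: assume a single-part upload with a plain MD5 ETag.
        return plain_md5(path) == etag
    part_size = guess_part_size(os.path.getsize(path), etag)
    return multipart_etag(path, part_size) == etag
```

For the file above, guess_part_size(1073748616, etag) yields 8 MiB, the same part size the shell snippet computes, so verify_etag(file, etag) should agree with the s3md5 result as long as the upload really used 8 MiB parts.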

Best,
Sebastian

[1] http://stackoverflow.com/questions/6591047/etag-definition-changed-in-amazon-s3
[2] http://stackoverflow.com/a/19304527/5953351
[3] https://raw.githubusercontent.com/antespi/s3md5/master/s3md5

Matthias Petri

May 14, 2017, 8:36:18 PM

to Common Crawl

Thanks Sebastian!

Your script works well. Do you know by any chance whether the AWS CLI client performs this verification step automatically?

Again, I couldn't find much documentation on this.

I found this sentence in the documentation: "The AWS CLI will attempt to verify the checksum of downloads when possible, based on the ETag header returned from a GetObject request that's performed whenever the AWS CLI downloads objects from S3. If the calculated MD5 checksum does not match the expected checksum, the file is deleted and the download is retried."

But looking at the code, I can't seem to find anything related to MD5 checking that looks like what s3md5 does.

-Matthias

Sebastian Nagel

May 17, 2017, 4:12:12 AM

to common...@googlegroups.com

Hi Matthias,

great! And thanks to the author of s3md5!

> but looking at the code I can't seem to find anything related to md5 checking that looks like what
> s3md5 does.

I would just trust the documentation. If in doubt, you would have to check all imports (esp. boto) or
debug the "aws s3 cp" command.

Best,
Sebastian
