CCNEWS data - File integrity checks

Matthias Petri

May 10, 2017, 8:41:06 PM
to Common Crawl
Hello!

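I'm trying to create a reproducible Information Retrieval test collection from the CC-NEWS crawl. To that end, I'm trying to verify the integrity of the CC-NEWS files I downloaded from AWS. According to the AWS documentation found here:

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html

the ETag response header should correspond to the MD5 sum of the file. For example:

[mpetri@~]$ wget -vS https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2017/02/CC-NEWS-20170202210704-00000.warc.gz
--2017-05-11 10:26:04--  https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2017/02/CC-NEWS-20170202210704-00000.warc.gz
Resolving commoncrawl.s3.amazonaws.com (commoncrawl.s3.amazonaws.com)... 54.231.49.16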
Connecting to commoncrawl.s3.amazonaws.com (commoncrawl.s3.amazonaws.com)|54.231.49.16|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  x-amz-id-2: VS8d6N6khsqAgPUiTtW1vnULOO6GKfqA6I76S5n9Qb1ul3KUt/6QN+4oGIa+mEtE
  x-amz-request-id: 706EF34331861D8B
  Date: Thu, 11 May 2017 00:26:05 GMT
  Last-Modified: Thu, 02 Feb 2017 22:05:07 GMT
  ETag: "9195bfa99ef5ecc9abae0d49cd19d169-129"
  Accept-Ranges: bytes
  Content-Type: binary/octet-stream
  Content-Length: 1073748616
  Server: AmazonS3
Length: 1073748616 (1.0G) [binary/octet-stream]
Saving to: ‘CC-NEWS-20170202210704-00000.warc.gz’
2017-05-11 10:29:04 (5.71 MB/s) - ‘CC-NEWS-20170202210704-00000.warc.gz’ saved [1073748616/1073748616]

Here I would expect "9195bfa99ef5ecc9abae0d49cd19d169" to be the MD5 sum of the downloaded file. However, this is not the case:

[mpetri@~]$ md5sum CC-NEWS-20170202210704-00000.warc.gz
ff41410b9705af54aaa0e8a10ce7c2c8  CC-NEWS-20170202210704-00000.warc.gz

Is this not possible or am I doing something wrong here? I tried multiple files and the md5sum never matches.

Thanks,
Matthias Petri

Sebastian Nagel

May 11, 2017, 3:45:08 AM
to common...@googlegroups.com
Hi Matthias,

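It's not that trivial because of multipart uploads, cf. [1]: the "-129" suffix means the object was uploaded in 129 parts, so the ETag is not the MD5 sum of the whole file.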
But the good news: it's still possible to verify the downloaded WARC by the ETag.
I've tried the pure bash solution [2] using [3]:

wget https://raw.githubusercontent.com/antespi/s3md5/master/s3md5
chmod +x s3md5

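file=CC-NEWS-20170202210704-00000.warc.gz
etag="9195bfa99ef5ecc9abae0d49cd19d169-129"

# number of upload parts = suffix after the dash in the ETag (129 here)
parts=$(echo "$etag" | cut -d- -f2)
# guess the part size in MiB: floor(size / (parts * 1 MiB)) + 1 (8 here); uses GNU stat
partsize=$((1 + $(stat -c '%s' "$file") / (parts * 2**20)))

./s3md5 --etag "$etag" "$partsize" "$file"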

s3md5 returns "TRUE"!
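
For readers who want to see what such a check involves, here is a minimal bash sketch of the commonly described multipart ETag scheme: take the MD5 of each part, concatenate the raw 16-byte digests, take the MD5 of that, and append "-<number of parts>". It is not the s3md5 source. The function name s3_multipart_etag is hypothetical, and the sketch assumes GNU coreutils (md5sum, dd) and that every part except the last has the same size in MiB.

```shell
#!/usr/bin/env bash
# Sketch only: recompute an S3 multipart-style ETag for a local file.
# Assumption: every part except the last is exactly part_mib MiB long.
# s3_multipart_etag is a hypothetical helper name, not part of s3md5.
s3_multipart_etag() {
    local file=$1 part_mib=$2
    local part_bytes=$((part_mib * 1024 * 1024))
    local size parts i hex digests
    size=$(wc -c < "$file")
    # number of parts, rounding up for a shorter final part
    parts=$(( (size + part_bytes - 1) / part_bytes ))
    digests=$(mktemp)
    for ((i = 0; i < parts; i++)); do
        # MD5 of part i as a 32-character hex string
        hex=$(dd if="$file" bs="$part_bytes" skip="$i" count=1 2>/dev/null \
              | md5sum | cut -d' ' -f1)
        # append the 16 raw digest bytes (hex -> \xNN escapes -> bytes)
        printf "$(echo "$hex" | sed 's/../\\x&/g')" >> "$digests"
    done
    # MD5 over the concatenated raw digests, then "-<number of parts>"
    printf '%s-%s\n' "$(md5sum < "$digests" | cut -d' ' -f1)" "$parts"
    rm -f "$digests"
}
```

For the file above, 1073748616 bytes in 129 parts is consistent with 8 MiB parts: 128 full parts of 8388608 bytes cover 1073741824 bytes, leaving 6792 bytes for the last part. If those assumptions hold, calling s3_multipart_etag "$file" 8 should print the same string as the ETag header.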

Best,
Sebastian

[1] http://stackoverflow.com/questions/6591047/etag-definition-changed-in-amazon-s3
[2] http://stackoverflow.com/a/19304527/5953351
[3] https://raw.githubusercontent.com/antespi/s3md5/master/s3md5

Matthias Petri

May 14, 2017, 8:36:18 PM
to Common Crawl
Thanks Sebastian!

Your script works well. Do you know by any chance whether the AWS CLI client performs this verification step automatically?

Again, I couldn't find any documentation on this.

I found this sentence in the documentation: "The AWS CLI will attempt to verify the checksum of downloads when possible, based on the ETag header returned from a GetObject request that's performed whenever the AWS CLI downloads objects from S3. If the calculated MD5 checksum does not match the expected checksum, the file is deleted and the download is retried."

but looking at the code I can't seem to find anything related to md5 checking that looks like what s3md5 does.

-Matthias

Sebastian Nagel

May 17, 2017, 4:12:12 AM
to common...@googlegroups.com
Hi Matthias,

great! And thanks to the author of s3md5!

> but looking at the code I can't seem to find anything related to md5 checking that looks like what
> s3md5 does.

I would just trust the documentation. If in doubt, you would have to check all imports (esp. boto) or debug the "aws s3 cp" command.

Best,
Sebastian