Hello!
I'm trying to create a reproducible Information Retrieval test collection from the CC-NEWS crawl. To that end, I'm trying to verify the integrity of the CC-NEWS files I downloaded from AWS. According to the AWS documentation found here:
the ETag response header should correspond to the MD5sum of the file. For example:
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
x-amz-id-2: VS8d6N6khsqAgPUiTtW1vnULOO6GKfqA6I76S5n9Qb1ul3KUt/6QN+4oGIa+mEtE
x-amz-request-id: 706EF34331861D8B
Date: Thu, 11 May 2017 00:26:05 GMT
Last-Modified: Thu, 02 Feb 2017 22:05:07 GMT
ETag: "9195bfa99ef5ecc9abae0d49cd19d169-129"
Accept-Ranges: bytes
Content-Type: binary/octet-stream
Content-Length: 1073748616
Server: AmazonS3
Length: 1073748616 (1.0G) [binary/octet-stream]
Saving to: ‘CC-NEWS-20170202210704-00000.warc.gz’
2017-05-11 10:29:04 (5.71 MB/s) - ‘CC-NEWS-20170202210704-00000.warc.gz’ saved [1073748616/1073748616]
Here I would expect "9195bfa99ef5ecc9abae0d49cd19d169" (the ETag value without the "-129" suffix) to be the MD5 sum of the downloaded file. However, this is not the case:
[mpetri@~]$ md5sum CC-NEWS-20170202210704-00000.warc.gz
ff41410b9705af54aaa0e8a10ce7c2c8 CC-NEWS-20170202210704-00000.warc.gz
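The same comparison can be sketched in Python, which may be handy when checking many files at once. This is only a minimal sketch of the check described above: the helper names `file_md5`, `etag_hex`, and `matches_etag` are made up for illustration, and dropping everything after the first "-" in the ETag is an assumption about how the value should be compared, not something the documentation confirms.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so a 1 GB WARC file never sits in memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def etag_hex(etag):
    """Strip the surrounding quotes and any "-N" suffix from an ETag value.

    Assumption: only the part before the first "-" is meant to be compared.
    """
    return etag.strip().strip('"').split("-", 1)[0]


def matches_etag(path, etag):
    """True if the file's MD5 digest equals the hex part of the ETag."""
    return file_md5(path) == etag_hex(etag).lower()
```

Calling `matches_etag("CC-NEWS-20170202210704-00000.warc.gz", '"9195bfa99ef5ecc9abae0d49cd19d169-129"')` would reproduce the mismatch shown above, given the `md5sum` output for that file.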
Is this kind of check not possible, or am I doing something wrong here? I tried multiple files, and the MD5 sum never matches the ETag.
Thanks,
Matthias Petri