{"url": "http://zh-yue.wikipedia.org/wiki/Fa", "digest": "Z7B5R4MSJTJMOOLF2L57QRDEQCA5OJP3", "length": "9150", "offset": "935270477", "filename": "common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936461266.22/warc/CC-MAIN-20150226074101-00253-ip-10-28-5-156.ec2.internal.warc.gz"}
I then download the URL (https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936461266.22/warc/CC-MAIN-20150226074101-00253-ip-10-28-5-156.ec2.internal.warc.gz) built from the filename field, which is a ~1 GB download. Yet when I extract it and open it in a text editor, this is all I get:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2015-05-16T05:38:48Z
WARC-Record-ID: <urn:uuid:2495048d-d982-4942-87c2-ab65fca8ba96>
Content-Length: 341
Content-Type: application/warc-fields
WARC-Filename: CC-MAIN-20150417045736-00251-ip-10-235-10-82.ec2.internal.warc.gz
robots: classic
hostname: ip-10-235-10-82.ec2.internal
software: Nutch 1.6 (CC)/CC WarcExport 1.0
isPartOf: CC-MAIN-2015-18
operator: CommonCrawl Admin
description: Wide crawl of the web for April 2015
publisher: CommonCrawl
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
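
One likely reason the dump stops there: a .warc.gz file is many gzip members concatenated back to back, one per record, and some decompressors (Ruby's Zlib::GzipReader among them) stop at the end of the first member. The record shown above is just the file's leading warcinfo record. A minimal sketch of reading every member, using the #unused/#finish pattern from the Zlib::GzipReader docs (the two "records" here are invented stand-ins for real WARC records):

```ruby
require "zlib"
require "stringio"

# Yield the decompressed contents of each gzip member in zio.
# GzipReader stops at the first member's end, so we use #unused
# to find where the next member starts and loop.
def each_gzip_member(zio)
  loop do
    gz = Zlib::GzipReader.new(zio)
    yield gz.read
    unused = gz.unused            # bytes buffered past this member's end
    gz.finish                     # end this member without closing zio
    zio.pos -= unused.bytesize if unused
    break if zio.eof?
  end
end

# Two fake records, gzipped independently and concatenated,
# mimicking the layout of a real .warc.gz file:
warc = Zlib.gzip("record one") + Zlib.gzip("record two")

records = []
each_gzip_member(StringIO.new(warc)) { |r| records << r }
```

With a real archive you would pass an open File instead of the StringIO.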
So how do you use the Common Crawl index? And once you have downloaded the gzip file, how do you extract the HTML?
The index record's offset and length fields mean you never need the whole 1 GB file: you can request just that record's bytes with a ranged S3 GET, where path, offset, and compressed_size come from the index entry:

resp = s3.get_object(bucket_name: "aws-publicdatasets", key: path, range: "bytes=#{offset}-#{offset + compressed_size - 1}")
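
The ranged GET works because each record in the archive is its own gzip member, so the slice the range request returns decompresses on its own. A self-contained sketch of the idea (the record contents are invented; with real data, offset and length come straight from the index entry, and the slice would be the ranged response body):

```ruby
require "zlib"

# Two fake records, each gzipped independently and concatenated,
# the way records are laid out in a real .warc.gz file:
rec1 = Zlib.gzip("WARC/1.0\r\n\r\n<html>first page</html>")
rec2 = Zlib.gzip("WARC/1.0\r\n\r\n<html>second page</html>")
warc = rec1 + rec2

offset = rec1.bytesize                   # the index's "offset" field
length = rec2.bytesize                   # the index's "length" field

# Stand-in for the body of the ranged S3/HTTP response:
slice = warc.byteslice(offset, length)

record = Zlib.gunzip(slice)              # one record is a valid gzip stream by itself
html   = record.split("\r\n\r\n", 2).last  # drop the header block (real records
                                           # have WARC and HTTP headers before the HTML)
```

Real records carry both a WARC header block and an HTTP header block before the payload, so you would split twice, but the byte-range-then-gunzip step is the same.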