Common crawl index usage?

Wenqin Ye

Jun 30, 2015, 12:11:03 PM

to common...@googlegroups.com

Querying the Common Crawl index returns a response like the following:

{"url": "http://zh-yue.wikipedia.org/wiki/Fa", "digest": "Z7B5R4MSJTJMOOLF2L57QRDEQCA5OJP3", "length": "9150", "offset": "935270477", "filename": "common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936461266.22/warc/CC-MAIN-20150226074101-00253-ip-10-28-5-156.ec2.internal.warc.gz"}

I then downloaded the file named in the "filename" field (https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936461266.22/warc/CC-MAIN-20150226074101-00253-ip-10-28-5-156.ec2.internal.warc.gz), which is a 1 GB download. Yet when I extract it and open it in a text editor, this is all I get:

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2015-05-16T05:38:48Z
WARC-Record-ID: <urn:uuid:2495048d-d982-4942-87c2-ab65fca8ba96>
Content-Length: 341
Content-Type: application/warc-fields
WARC-Filename: CC-MAIN-20150417045736-00251-ip-10-235-10-82.ec2.internal.warc.gz

robots: classic
hostname: ip-10-235-10-82.ec2.internal
software: Nutch 1.6 (CC)/CC WarcExport 1.0
isPartOf: CC-MAIN-2015-18
operator: CommonCrawl Admin
description: Wide crawl of the web for April 2015
publisher: CommonCrawl
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf




So how do you use the Common Crawl index? And once you have downloaded the gzip file, how do you extract the HTML?

Wenqin Ye

Jun 30, 2015, 1:48:58 PM

to common...@googlegroups.com

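Never mind, figured it out. You have to add a "range" to the S3 request, built from the offset and length in the index entry, so that only the bytes of that one record are fetched: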

resp = s3.get_object(bucket_name: "aws-publicdatasets", key: "#{path}", range: "bytes=#{offset}-#{offset + compressed_size - 1}")
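For anyone landing here later, a minimal sketch of the two steps around that request, assuming the index entry looks like the JSON above and that the fetched bytes form one standalone gzip member holding a single WARC response record. The helper names byte_range and split_record are made up for illustration and are not part of any library; only Ruby's standard json, zlib, and stringio are used, and no network call is made.

```ruby
require "json"
require "zlib"
require "stringio"

# Build the HTTP Range value for one index entry (hypothetical helper).
# The index returns "offset" and "length" as strings, so convert them to
# integers first; adding the raw strings would concatenate them instead.
def byte_range(entry)
  offset = Integer(entry.fetch("offset"))
  length = Integer(entry.fetch("length"))
  "bytes=#{offset}-#{offset + length - 1}"
end

# Decompress the bytes returned by the ranged request and split the record
# into WARC headers, HTTP headers, and the payload (hypothetical helper).
# Assumes a "response" record, where blank lines separate the three parts.
def split_record(gz_bytes)
  record = Zlib::GzipReader.new(StringIO.new(gz_bytes)).read.b
  warc_headers, http_headers, body = record.split("\r\n\r\n", 3)
  { warc: warc_headers, http: http_headers, body: body }
end

entry = JSON.parse('{"offset": "935270477", "length": "9150"}')
puts byte_range(entry) # prints "bytes=935270477-935279626"
```

The string from byte_range can be passed as the range: option in the get_object call above, and the response body can then be handed to split_record. The :body value is the raw payload (the HTML for an HTML page) in binary encoding, followed by the record's trailing blank lines, so it may still need its character encoding set before further processing.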