Common crawl index usage?


Wenqin Ye

Jun 30, 2015, 12:11:03 PM
to common...@googlegroups.com
Using the common crawl index you get a response like the following: 

{"url": "http://zh-yue.wikipedia.org/wiki/Fa", "digest": "Z7B5R4MSJTJMOOLF2L57QRDEQCA5OJP3", "length": "9150", "offset": "935270477", "filename": "common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936461266.22/warc/CC-MAIN-20150226074101-00253-ip-10-28-5-156.ec2.internal.warc.gz"}


I then downloaded the file named in the "filename" field ( https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936461266.22/warc/CC-MAIN-20150226074101-00253-ip-10-28-5-156.ec2.internal.warc.gz ), which is about a 1 GB download. Yet when I extract it and open it in a text editor, this is all I get:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2015-05-16T05:38:48Z
WARC-Record-ID: <urn:uuid:2495048d-d982-4942-87c2-ab65fca8ba96>
Content-Length: 341
Content-Type: application/warc-fields
WARC-Filename: CC-MAIN-20150417045736-00251-ip-10-235-10-82.ec2.internal.warc.gz

robots: classic
hostname: ip-10-235-10-82.ec2.internal
software: Nutch 1.6 (CC)/CC WarcExport 1.0
isPartOf: CC-MAIN-2015-18
operator: CommonCrawl Admin
description: Wide crawl of the web for April 2015
publisher: CommonCrawl
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf




So how do you use the common-crawl index? And once you download the gzip file how do you extract the html?

Wenqin Ye

Jun 30, 2015, 1:48:58 PM
to common...@googlegroups.com
Never mind, I figured it out. You don't download the whole file; you add a "range" to the S3 request, built from the "offset" and "length" values in the index response:
resp = s3.get_object(bucket_name: "aws-publicdatasets", key: "#{path}", range: "bytes=#{offset}-#{offset + compressed_size - 1}")

I wish the "get started" pages were clearer about this.
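
For anyone landing here with the same two questions, below is a minimal Ruby sketch of the whole flow, under a few assumptions. It uses a plain HTTP Range request with Net::HTTP instead of the S3 client call shown above. The helper names (byte_range, fetch_record, split_warc_record) are made up for illustration. The base URL is the one quoted in the first post and may no longer be where the data is hosted. It also assumes the fetched bytes are exactly one gzipped WARC record laid out as WARC headers, blank line, HTTP headers, blank line, payload.

```ruby
require "json"
require "net/http"
require "uri"
require "zlib"
require "stringio"

# Build the Range header value for one index entry.
# The index returns "offset" and "length" as strings, so convert them first.
# HTTP byte ranges are inclusive, hence the "- 1".
def byte_range(entry)
  offset = Integer(entry.fetch("offset"))
  length = Integer(entry.fetch("length"))
  "bytes=#{offset}-#{offset + length - 1}"
end

# Fetch only the bytes of one record (hypothetical helper).
# base_url is an assumption, e.g. the host quoted in the first post.
def fetch_record(base_url, entry)
  uri = URI.join(base_url, entry.fetch("filename"))
  request = Net::HTTP::Get.new(uri)
  request["Range"] = byte_range(entry)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  # A server that honours the Range header answers 206 Partial Content.
  raise "unexpected status #{response.code}" unless response.code == "206"
  response.body
end

# Decompress one gzipped WARC record and split it into its three parts.
# The payload of a "response" record is the HTML you are after.
def split_warc_record(gz_bytes)
  record = Zlib::GzipReader.new(StringIO.new(gz_bytes)).read.b
  warc_headers, http_headers, body = record.split("\r\n\r\n", 3)
  {
    warc_headers: warc_headers,
    http_headers: http_headers,
    body: body.to_s.delete_suffix("\r\n\r\n") # drop the record terminator
  }
end

# Usage (not run here, since it needs network access):
#   entry  = JSON.parse(index_line)   # one JSON line from the index
#   gz     = fetch_record("https://aws-publicdatasets.s3.amazonaws.com/", entry)
#   html   = split_warc_record(gz)[:body]
```

For the index entry in the first post, byte_range gives "bytes=935270477-935279626", i.e. about 9 KB instead of the 1 GB file. Splitting on the first two blank lines is a shortcut; a stricter reader would use the Content-Length values in the WARC and HTTP headers.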