How to get actual HTML pages with http://index.commoncrawl.org

Aline Bessa

Apr 20, 2015, 12:04:11 PM
to common...@googlegroups.com
Hi folks,

I can find urls in index.commoncrawl.org, but can't seem to get the corresponding HTML page. How can I do it?

Thanks!

Tom Morris

Apr 20, 2015, 8:35:41 PM
to common...@googlegroups.com
On Mon, Apr 20, 2015 at 12:04 PM, Aline Bessa <ali...@gmail.com> wrote:

I can find urls in index.commoncrawl.org, but can't seem to get the corresponding HTML page. How can I do it?

The JSON structure returned by the index server provides a pointer to the file, and the offset within that file, where the contents of the URL are stored. For example:

{"urlkey": "com,freebase)/", "timestamp": "20150227040031", "url": "http://www.freebase.com/", "length": "10550", "filename": "common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00275-ip-10-28-5-156.ec2.internal.warc.gz", "digest": "SS6DS3UCISSQNIAO6W7JMFQNGVTH5HQU", "offset": "478515266"} 

You'll need to write some code that accesses Amazon S3 and extracts the page(s) you're interested in, using the "filename", "offset" and "length" fields of the record.
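As a rough sketch of what that programming could look like in Python: the record's "offset" and "length" describe one gzip-compressed WARC record inside the file, so an HTTP range request for just those bytes is enough. The function names and the `base_url` parameter below are hypothetical (the thread does not say which host serves the "filename" path), and the header/body split assumes a plain WARC response record.

```python
import gzip
import urllib.request


def byte_range(record):
    """Build the HTTP Range header value covering one index record."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # Range end is inclusive
    return f"bytes={start}-{end}"


def fetch_record(record, base_url):
    """Download only the bytes of one record.

    base_url is an assumption: whatever HTTP endpoint exposes the
    path given in record["filename"].
    """
    url = base_url.rstrip("/") + "/" + record["filename"]
    req = urllib.request.Request(url, headers={"Range": byte_range(record)})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def split_warc_record(compressed):
    """Decompress one record and split it into its three parts:
    WARC headers, HTTP response headers, and the page body."""
    raw = gzip.decompress(compressed)
    warc_headers, _, rest = raw.partition(b"\r\n\r\n")
    http_headers, _, body = rest.partition(b"\r\n\r\n")
    # A WARC record is terminated by two CRLFs; drop them from the body.
    if body.endswith(b"\r\n\r\n"):
        body = body[:-4]
    return warc_headers, http_headers, body
```

With the freebase.com record above, `byte_range` covers bytes 478515266 through 478525815 of the WARC file; the HTML is the third value returned by `split_warc_record(fetch_record(record, base_url))`.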

Tom

Tom Morris

Apr 20, 2015, 9:02:55 PM
to common...@googlegroups.com
Oops, ignore my last message.  Ilya's answer buried in the announcement thread is:

On Mon, Apr 20, 2015 at 5:20 PM, Ilya Kreymer <ikre...@gmail.com> wrote:

There's not yet an official interface or UI for doing so, as there was in the old index.

However, there's an 'unofficial' way to do this, as the software supports access to the original resource using the replay URL form. For example, given this index record:


{"urlkey": "org,commoncrawl)/", "timestamp": "20150302032705", "url": "http://commoncrawl.org/", "length": "2526", "filename": "common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936462700.28/warc/CC-MAIN-20150226074102-00159-ip-10-28-5-156.ec2.internal.warc.gz", "digest": "QE4UUUWUJWEZBBK6PUG3CHFAGEKDMDBZ", "offset": "53235662"}
You can access the original resource via this url, using curl or wget:
curl http://index.commoncrawl.org/CC-MAIN-2015-11/20150302032705id_/http://commoncrawl.org/
wget http://index.commoncrawl.org/CC-MAIN-2015-11/20150302032705id_/http://commoncrawl.org/
Note that the format here is: /CC-MAIN-2015-11/ + the timestamp + id_/ + the url
Please note that this capability is part of the pywb replay software, and may change in the future for CommonCrawl. It's not guaranteed to work in all cases.
This replay serves the original response HTTP headers as well, which may not be consistent with the content and may not always work in the browser.
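For anyone scripting this, the URL format can be assembled from an index record along these lines. This is only a sketch: the helper names are made up for illustration, and taking the collection name from the path component after "crawl-data" in "filename" is an assumption based on the two example records in this thread.

```python
def collection_from_filename(filename):
    """Guess the collection (e.g. CC-MAIN-2015-11) from a record's filename.

    Assumes the path component after "crawl-data" is the collection name,
    as in the example records in this thread.
    """
    parts = filename.split("/")
    return parts[parts.index("crawl-data") + 1]


def replay_url(record, server="http://index.commoncrawl.org"):
    """Build server/collection/timestamp + id_/ + original url."""
    collection = collection_from_filename(record["filename"])
    return f"{server}/{collection}/{record['timestamp']}id_/{record['url']}"


record = {
    "timestamp": "20150302032705",
    "url": "http://commoncrawl.org/",
    "filename": "common-crawl/crawl-data/CC-MAIN-2015-11/segments/"
                "1424936462700.28/warc/"
                "CC-MAIN-20150226074102-00159-ip-10-28-5-156.ec2.internal.warc.gz",
}
print(replay_url(record))
```

For the commoncrawl.org record this prints the same URL as the curl and wget commands above.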

Aline Bessa

Apr 27, 2015, 3:03:59 PM
to common...@googlegroups.com
Hi Ilya,

I noticed that I can get some pages without the timestamp and id_ part. If a page has more than one version (more than one timestamp associated with it), which one is fetched? The latest, the earliest, or is it undefined?