Announcing: New CommonCrawl Index and Query Api


Ilya Kreymer

Mar 26, 2015, 12:02:40 AM
to common...@googlegroups.com
Hello CommonCrawl,

I am happy to announce a new URL index and query API for the CommonCrawl WARC dataset.

To start off, I've focused on building the index for a single crawl and wanted to show it to the community for testing/feedback.

This first index (for the Jan 2015 crawl) is built and available for access/querying at:


A full reference for the API is available here:

Here are a few quick example queries:

Exact:

Prefix:

Domain:
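
For illustration, queries of each type look roughly like this against the pywb-based index server (the CC-MAIN-2015-11 endpoint that appears later in this thread is used as a placeholder collection, and the parameter names are assumptions based on the pywb CDX server API, so they may differ slightly in this deployment):

curl 'http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=commoncrawl.org/&output=json'        (exact URL)
curl 'http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=commoncrawl.org/faq/*&output=json'   (prefix match via trailing *)
curl 'http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.commoncrawl.org&output=json'       (domain match via leading *.)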


This index format allows for quick page-count queries and provides pagination support:

Num Pages Query:

Last Page: (pages are 0-indexed)
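
For example (again a sketch against the CC-MAIN-2015-11 placeholder endpoint, assuming the pywb pagination parameters):

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.commoncrawl.org&showNumPages=true        (returns the number of result pages as JSON)
http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.commoncrawl.org&output=json&page=0       (first page)

If the page count is N, the last page is page=N-1, since pages are 0-indexed.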

A more complete explanation of the query API is available here:

(please let me know if anything doesn't work as expected)


A bit more background:

The index format is the same one that has been in use for years by the Wayback Machine: a compressed flat-file index with a secondary index for binary search. For whatever reason, this format has come to be known as a 'ZipNum CDX', and more info on the format is available here
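
Roughly, the sorted CDX(J) lines are gzipped in fixed-size blocks, and a small plain-text secondary index holds one line per block, so a lookup binary-searches the secondary index and then decompresses only the matching block(s). A made-up sketch of the two layers (field values and the exact secondary-index layout are illustrative assumptions, not taken from the real index):

com,example)/ 20150226000000    cdx-00000.gz    143215    64520    12     (secondary index line: search key, block file, offset, length, block number; tab-delimited)
com,example)/ 20150226074102 {"url": "http://example.com/", "digest": "...", "length": "1024", "offset": "12345", "filename": "crawl-data/.../example.warc.gz"}     (one CDXJ line inside that gzipped block)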

I have previously worked as the lead engineer on the IA Wayback Machine, and have now returned to the field of web archiving to work on my own.

One of the projects I'm currently working on is a brand-new implementation of Wayback Machine replay and indexing tools (https://github.com/ikreymer/pywb). This project includes all the tools for indexing WARCs, as well as the HTTP service for querying the index (and many other tools related to web archives).

The specific deployment of pywb running on "index.commoncrawl.org" is available at:

Additionally, all the tools used for building the index are available at:

The webarchive-indexing repo provides a series of MRJob scripts that can be run on Hadoop/EMR. I've tried to make these tools as generic as possible to allow for bulk indexing of any WARC data. (I may need to add some more CommonCrawl-specific examples.)
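
As a sketch of how such a job is launched (the script name, bucket, and paths below are placeholders rather than the repo's actual entry points; only the mrjob options are standard), an MRJob run on EMR follows the usual pattern:

python indexjob.py -r emr --no-output --output-dir s3://my-bucket/cc-cdx/ s3://my-bucket/warc-paths.txt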

I believe it is important for all the indexing tools to be fully open to complement CommonCrawl's mission of providing open data to the public.

I hope to enable others to also build the index as needed.

Hopefully this index will be a first step in making CommonCrawl even easier to work with.

I'd like to hear any bug reports, feedback, suggestions, comments, etc., especially on the format and API.

For now, only the basic fields from the WARC files have been included, but it would of course be possible to include additional data as needed, and WAT/WET files could be indexed as well if necessary.

Once there is some feedback, I hope to build indexes for other crawls, and hopefully a cumulative index, if there is interest in that.

Finally, a big thanks to Stephen for giving me account access and helping out with various sysadmin tasks to help make this happen!


Happy querying,

Ilya


 

Kevin Fink

Mar 26, 2015, 1:20:01 AM
to common...@googlegroups.com
Awesome! Thank you very much for this!

Kevin Fink
biztech.ninja



Pavel Smrz

Mar 26, 2015, 9:01:35 AM
to common...@googlegroups.com

Laura Dietz

Mar 26, 2015, 9:50:47 AM
to common...@googlegroups.com
Pavel,

I am not an expert, but I see different time stamps, different offsets, and different lengths.
Are these different versions of the same page?

Cheers,
Laura

Mat Kelcey

Mar 26, 2015, 10:19:00 AM
to common...@googlegroups.com

Great work, Ilya! It really improves access to the data!

Pavel Smrz

Mar 26, 2015, 10:22:28 AM
to common...@googlegroups.com
... it seems that different crawling nodes accessed the same URL at (slightly) different times. That would make sense for the main pages of news agencies and very active blogs, but not for the pages mentioned.
The relevant WARC records are in different files, so the record offsets naturally differ.

Pavel

--
Pavel Smrz

Stephen Merity

Mar 27, 2015, 5:31:39 AM
to common...@googlegroups.com
Thanks again, Ilya! I've mentioned before that this is something the community has desperately wanted, and seeing a member of the community pull it together so well and so quickly has been an absolute delight!

I highly recommend that anyone interested in using it test it out and see how it works for their use cases, as Ilya is very interested in feedback.

Pavel, as for the repeats of certain URLs, I'm currently investigating this. Whilst they're expected to happen occasionally (we keep content on redirects even if we might have crawled it before), a small number of these pages occur far more often than expected.
Regards,
Stephen Merity
Data Scientist @ Common Crawl

Alex

Mar 31, 2015, 12:13:48 PM
to common...@googlegroups.com
Ilya,

thank you for a great contribution!

Could you also please share details on the operational side: what cluster configuration did you use on EMR to create a ~120 GB index out of the 145 TB corpus of 1.9 billion pages? How long does it take?

Thanks in advance.

Alex.

Ilya Kreymer

Mar 31, 2015, 12:51:16 PM
to common...@googlegroups.com
Hi Alex,

Sure, the whole process takes about 8-9 hours currently. There are two or three jobs that need to be run: one to index the individual files, one to sample the URL space (optional), and one to build the final sorted cluster. A more detailed explanation of the jobs and all the tools used is available here: https://github.com/ikreymer/webarchive-indexing

Most of the time is spent reading the individual WARCs and outputting a per-WARC index (CDX file). These indexes are actually also available at:
s3://aws-publicdatasets/common-crawl/cc-index/cdx/

That took about 7 hours this time, and the URL space sampling and final sort took ~30 min each. Thus far, I've been running each job manually, so there were a few minutes of downtime between jobs, but I hope to fully automate the pipeline in the future. The sampling job is also not needed every time.

The EMR cluster config was:

1 m1.xlarge master
2 m1.large core
50 m1.xlarge task

All the instances (besides the master) were spot instances with the bid set to one cent above the current spot price. Total instance hours used was 3744.

I hope to optimize this a bit more for future indexes.
Let me know if this answers your question, or if you have any suggestions also (as I'm still experimenting with the configuration).

Ilya




Aline Bessa

Apr 20, 2015, 12:11:22 PM
to common...@googlegroups.com
Hi all,

Sorry if this is a simple question, but I don't know how to fetch a page listed in the index (its HTML) using this interface. Is it possible?

Ilya Kreymer

Apr 20, 2015, 5:20:31 PM
to common...@googlegroups.com
Hi Aline,

There's not yet an official interface or UI for doing so like in the old index.

However, there's an 'unofficial' way to do this, as the software supports access to the original resource using the replay URL form.


{"urlkey": "org,commoncrawl)/", "timestamp": "20150302032705", "url": "http://commoncrawl.org/", "length": "2526", "filename": "common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936462700.28/warc/CC-MAIN-20150226074102-00159-ip-10-28-5-156.ec2.internal.warc.gz", "digest": "QE4UUUWUJWEZBBK6PUG3CHFAGEKDMDBZ", "offset": "53235662"}

You can access the original resource via this url, using curl or wget:
curl http://index.commoncrawl.org/CC-MAIN-2015-11/20150302032705id_/http://commoncrawl.org/
wget http://index.commoncrawl.org/CC-MAIN-2015-11/20150302032705id_/http://commoncrawl.org/

Note the format here is: /CC-MAIN-2015-11/ + timestamp + id_ + / + url
Please note that this capability is part of the pywb replay software and may change in the future for CommonCrawl. It's not guaranteed to work in all cases.
The replay serves the original response HTTP headers as well, which may not be consistent with the content and may not always work in the browser.

The plan is to have a UI similar to the old index UI. However, I thought I'd mention this option in case it helps with using the index in the meantime.

Ilya


Aline Bessa

Apr 20, 2015, 5:25:47 PM
to common...@googlegroups.com
Great, Ilya! Thanks!



Aline Bessa

Apr 20, 2015, 8:44:12 PM
to common...@googlegroups.com
Ilya, another question: is it possible to have access to all URLs in the Jan or Feb 2015 crawls? I want to sample them. 

Thanks for all the help!

Dominik Stadler

Apr 22, 2015, 3:50:44 PM
to common...@googlegroups.com
Hi,

I am also interested in accessing the index in a different way. For my
small research project I need a large number of files of certain file
types to mass-test some frameworks that handle files, e.g. Apache Tika
and Apache POI; see
https://github.com/centic9/CommonCrawlDocumentDownload. Currently I am
using the previous URL index, which stored the data in a different
format.

Thanks... Dominik.

John Wiseman

Apr 28, 2015, 10:54:14 PM
to common...@googlegroups.com
In the old trivio common_crawl_index, each item in the index included "arcFileOffset" and "compressedSize" properties, which made it possible to do an efficient S3 request using a Range header to get just the portion of the ARC file corresponding to the URL of interest and decompress it. Are you planning on offering something similar in the new interface?

Thanks,
John




Ilya Kreymer

Apr 28, 2015, 11:15:40 PM
to common...@googlegroups.com
Those fields are 'offset' and 'length' in the current index, and correspond to the WARC offset and compressed length that you would use as part of the range request.

{"urlkey": "org,commoncrawl)/", "timestamp": "20150302032705", "url": "http://commoncrawl.org/", "length": "2526", "filename": "common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936462700.28/warc/CC-MAIN-20150226074102-00159-ip-10-28-5-156.ec2.internal.warc.gz", "digest": "QE4UUUWUJWEZBBK6PUG3CHFAGEKDMDBZ", "offset": "53235662"}

The end of the range is offset + length - 1, i.e. 53235662 + 2526 - 1 = 53238187. You could then do:
curl -r 53235662-53238187 https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936462700.28/warc/CC-MAIN-20150226074102-00159-ip-10-28-5-156.ec2.internal.warc.gz | zcat | less
to get the full WARC record.
There's not yet a UI for the query API, just the raw JSON result output.

Ilya




John Wiseman

Apr 29, 2015, 12:56:39 AM
to common...@googlegroups.com
That's great news!

Thanks,
John

John Wiseman

Apr 29, 2015, 11:52:46 AM
to common...@googlegroups.com
I got a result I didn't completely expect with one of the indices.  http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=metafilter.com&output=json returns several results that look like this, which I expected:

{
    "digest": "HLHP6HW2LKLFJVBRAUNORMFIGCBFFGVA",
    "filename": "common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936469077.97/warc/CC-MAIN-20150226074109-00185-ip-10-28-5-156.ec2.internal.warc.gz",
    "length": "29152",
    "offset": "588020911",
    "timestamp": "20150306165228",
    "urlkey": "com,metafilter)/"
}

But it also returns one result like this, which was a small surprise:

{
    "digest": "DFV6OAJFLQF5OK7DFHRFBKUX2IMY7DMZ",
    "filename": "common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936469305.48/warc/CC-MAIN-20150226074109-00043-ip-10-28-5-156.ec2.internal.warc.gz",
    "length": "29143",
    "offset": "600214246",
    "timestamp": "20150306183442",
    "urlkey": "com,metafilter)/"
}

I just want to confirm that this is the intended behavior. You've got to do something with dot-segments in URLs, and this seems as good as any, right? I just have to keep in mind the difference between "urlkey" and "url" in the result.

Thanks,
John

Ilya Kreymer

Apr 29, 2015, 1:08:03 PM
to common...@googlegroups.com
Yes, I should point out that the urlkey is not only a reversed but also a 'canonicalized' or 'normalized' URL, produced using the surt library: https://github.com/ikreymer/surt (which is itself a port of several iterations of IA canonicalization from https://github.com/internetarchive/webarchive-commons).

The '..' resolution is one of the things the surt library does. I think this makes sense; entering "http://www.metafilter.com/tags/.." into a browser also resolves it to "http://www.metafilter.com/".

To specifically look for the exact original URL, it's also possible to do a regex filter ending with $, like this:
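
For example, something along these lines (a sketch assuming pywb's filter parameter syntax of field:regex, which may differ in this deployment):

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=metafilter.com&output=json&filter=url:.*metafilter\.com/$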

Ilya