Common Crawl

Welcome to the Common Crawl Group!


Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.


This group is intended to enable discussion and encourage collaboration among the coders, hackers, data scientists, developers and organizations interested in working with open web crawl data.

This group is a place to:
* Discuss challenges
* Share ideas for projects and products
* Look for collaborators and partners
* Offer advice and share methods
* Ask questions and get advice from others
* Show off cool stuff you build
* Keep up to date on the latest news from Common Crawl



Showing 1-20 of 343 topics
S3 Access Denied, April WAT files. Gregory Ray 5/29/15
S3 Error or Something Else? (Python/MRJob/Amazon EMR) Pingometer LLC 5/26/15
[VOTE] release 0.6 DigitalPebble 5/22/15
Blekko & CommonCrawl Tom Morris 5/22/15
Missing URLs from CommonCrawl Index Hassan Amir 5/22/15
Convert commoncrawl keyword search script to Hadoop EMR script sohail ahmed 5/20/15
Re: Need a sampling of 10,000 urls Tom Morris 5/13/15
subset of CC data Peter Cawdron 5/12/15
What hadoop version is used in cc-warc-examples for Java language Anatoly Vostryakov 5/5/15
Getting all URLS for Jan 2015 or Feb 2015 crawl Aline Bessa 5/1/15
Announcing: New CommonCrawl Index and Query Api Ilya Kreymer 4/29/15
Small Portion of CC data Aline Bessa 4/28/15
Inverted indices for Common Crawl data Aline Bessa 4/28/15
How to get actual HTML pages with http://index.commoncrawl.org Aline Bessa 4/27/15
Number of URLs in March 2015 index & MIME type breakdown Tom Morris 4/25/15
Python program to analyze Common Crawl index Tom Morris 4/24/15
Finding URLs by content type / file extension Tom Morris 4/22/15
Now that Blekko is no more... Jeremy Wilson 4/22/15
Errors searching for single URL Aline Bessa 4/20/15
Why do the crawls significantly vary in size? shlomi...@gmail.com 4/13/15