Common Crawl

Welcome to the Common Crawl Group!


Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.


This group is intended to enable discussion and encourage collaboration among the coders, hackers, data scientists, developers and organizations interested in working with open web crawl data.

This group is a place to:
* Discuss challenges
* Share ideas for projects and products
* Look for collaborators and partners
* Offer advice and share methods
* Ask questions and get advice from others
* Show off cool stuff you build
* Keep up to date on the latest news from Common Crawl



Showing 1-20 of 356 topics
Question to favicon crawler ... ich 7/3/15
Can't find raw data for wwwranking.webdatacommons.org Kevin Burton 7/2/15
Which Languages contains by CC sohail ahmed 7/2/15
Common crawl index usage? Wenqin Ye 6/30/15
Common Crawl index access and TLD Sree Aurovindh Viswanathan 6/30/15
Pointers on extracting Information from Unstructured HTML pages on Common Crawl Sree Aurovindh Viswanathan 6/26/15
How often is the Common Crawl rejected from a website? Logan Scovil 6/25/15
The size of warc, wet, wat files Xinding Sun 6/24/15
S3 Error or Something Else? (Python/MRJob/Amazon EMR) Pingometer LLC 6/20/15
About how much data does Common Crawl store? Christopher Lupo 6/20/15
Project ideas using common-crawl dataset Pramod Bharadwaj C 6/19/15
Data access for Facebook kettl...@gmail.com 6/15/15
Request for clarifications on commoncrawl Srinath Achanta 6/14/15
Errors when reading from s3 with python / mrjob / gzipstream Martin Zahra 6/13/15
S3 Access Denied, April WAT files. Gregory Ray 5/29/15
[VOTE] release 0.6 DigitalPebble 5/22/15
Blekko & CommonCrawl Tom Morris 5/22/15
Missing URLs from CommonCrawl Index Hassan Amir 5/22/15
Convert commoncrawl keyword search script to Hadoop EMR script sohail ahmed 5/20/15
Re: Need a sampling of 10,000 urls Tom Morris 5/13/15