Common Crawl

Welcome to the Common Crawl Group!

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

This group is intended to enable discussion and encourage collaboration among the coders, hackers, data scientists, developers and organizations interested in working with open web crawl data.

This group is a place to:
* Discuss challenges
* Share ideas for projects and products
* Look for collaborators and partners
* Offer advice and share methods
* Ask questions and get advice from others
* Show off cool stuff you build
* Keep up to date on the latest news from Common Crawl

Showing 1-20 of 388 topics
ANN: Web Table Corpus containing 233 million tables released Robert Meusel 11/19/15
Winter 2015 Crawl Robert Meusel 11/19/15
Persian N-gram Data? Masoud Komeily 11/5/15
mothership testservice-pa googleapis Lisa Agostoni 11/3/15
Question to favicon crawler ... ich 11/1/15
pages with specific language mucahid kutlu 10/30/15
Can CC fork of Nutch work with S3 only ? Christian Pérez-Llamas 10/28/15
Discovering 301s / 302s ? Soren Flexner 10/18/15
Common Crawl and Apache Spark giorgio79 10/8/15
Request for clarifications on commoncrawl Srinath Achanta 10/8/15
Obtaining Russian Data Lewis John Mcgibbney 10/6/15
Getting Redirect Information From WAT Robert Meusel 10/3/15
Common Crawl now has a URL index! Lisa Green 10/2/15
Domain Coverage Alexander Mitchell 10/1/15
Is there a search engine currently using CommonCrawl data? Mike Onghai 9/23/15
Domain urls needed Akash Varma 9/23/15
Job Opening at Common Crawl - Crawl Engineer / Data Scientist Sara Crouse 9/23/15
Greetings from Sara, Common Crawl’s New Director Sara Crouse 9/17/15
small amount of domains crawled Chen Yaniv 9/6/15
Bug tracker? Anomalies in most crawled domains Tom Morris 8/26/15