Common Crawl

Welcome to the Common Crawl Group!

Common Crawl, a non-profit organization, provides an open repository of web crawl data that is freely accessible to all. In doing so, we aim to advance the open web and democratize access to information. Today, the Common Crawl Corpus encompasses over two petabytes of web crawl data collected over eight years and ongoing. As the largest, most comprehensive, open repository of web crawl data on the cloud, we contribute to the thriving open data commons that drives innovation, research, and education.

This group is for discussion and collaboration among all those who use or seek to use Common Crawl data and/or share an interest in the open data ecosystem.

Please use this forum to:

Showing 1-20 of 556 topics
Contributing to Common Crawl Indexing SD 12:57 PM
All data contained in latest crawl? brano199 7/21/17
Upgrade to Common Crawl Index Server Sebastian Nagel 7/20/17
Question on the host-level web graph Akash 7/19/17
Do we have a Slack group? 7/18/17
Using Common Crawl Aaron Johnson 7/15/17
CommonCrawl Index Server responds with "502 Bad Gateway" 7/14/17
Dataset for May and June 2017 Zvonimir Sabljic 7/10/17
NEED HELP 7/10/17
Aigerim Serikbekova 7/10/17
Common Crawl now has a URL index! Lisa Green 7/10/17
common crawl dataset 7/7/17
Any interest in running Apache Tika as part of CommonCrawl? 7/5/17
WET file for news-archive Spider99 7/5/17
[Feature] Getting latest index 7/4/17
June 2017 crawl archive now available Sebastian Nagel 7/4/17
Having Trouble working on Common Crawl Data Gopika Bhardwaj 6/28/17
Common Crawl data in action Lukasz from Webfinery 6/27/17
How many Hops? Asha Patel 6/26/17
Advise Needed on Crawling Question Asha Patel 6/26/17