Common Crawl

Welcome to the Common Crawl Group!

Common Crawl, a non-profit organization, provides an open repository of web crawl data that is freely accessible to all. In doing so, we aim to advance the open web and democratize access to information. Today, the Common Crawl Corpus encompasses over two petabytes of web crawl data collected over eight years, and collection is ongoing. As the largest and most comprehensive open repository of web crawl data on the cloud, we contribute to the thriving open data commons that drives innovation, research, and education.

This group is for discussion and collaboration among all those who use or seek to use Common Crawl data and/or share an interest in the open data ecosystem.

Please use this forum to:


Feed crawler with URIs taken from WET files Dimitris Anag 9/19/17
url crawled but links to same domain not crawled, possibly just pagination or other limits I am not aware of? David Cottrell 9/11/17
Are index files of older crawls changing? brano199 9/4/17
How to read raw index files brano199 9/1/17
Cloudfront support - HTTP/2 download brano199 9/1/17
Microdata Parser for Commoncrawl Gautam SHAHI 8/30/17
August 2017 Crawl Archive Now Available Sebastian Nagel 8/27/17
Contributing to Common Crawl Indexing SD 8/26/17
Pickup Chinese web pages from the data set. 邓尧 8/24/17
Host- and domain-level web graph data sets of May/June/July 2017 crawl Sebastian Nagel 8/18/17
Noisy text classification ajay kumar 8/16/17
Common Crawl index server: DNS switch on Tue, Aug 15, 8:00 UTC Sebastian Nagel 8/10/17
Question about HTML parsing Yossi 8/9/17
needed advise serikbek...@gmail.com 8/9/17
Errors When Running Example Code Marianne Fletcher 8/4/17
July 2017 crawl archive now available Sebastian Nagel 8/1/17
WET file for news-archive Spider99 7/27/17
Getting the data Nikola Madjarevic 7/26/17
All data contained in latest crawl? brano199 7/21/17
Upgrade to Common Crawl Index Server Sebastian Nagel 7/20/17