Common Crawl

Welcome to the Common Crawl Group!

Common Crawl, a non-profit organization, provides an open repository of web crawl data that is freely accessible to all. In doing so, we aim to advance the open web and democratize access to information. Today, the Common Crawl Corpus encompasses over two petabytes of web crawl data collected over eight years, and collection is ongoing. As the largest, most comprehensive open repository of web crawl data in the cloud, we contribute to the thriving open data commons that drives innovation, research, and education.

This group is for discussion and collaboration among all those who use or seek to use Common Crawl data and/or share an interest in the open data ecosystem.

Please use this forum to:

Showing 1-20 of 626 topics
Questions about common crawl Karthik Shyamsunder 5/18/18
first level links Roxana Danger 5/17/18
Top level Domains / Sub domains with links to the corpus data Simon Burfield 5/9/18
Host- and domain-level web graph data sets of Nov/Dec/Jan 2017/2018 crawls Sebastian Nagel 5/7/18
WET file for news-archive Spider99 5/7/18
Host- and domain-level web graph data sets of Feb/Mar/Apr 2018 crawls Sebastian Nagel 5/7/18
Common Crawl for News Articles Charu Arora 5/3/18
April 2018 Crawl Archive Now Available Sebastian Nagel 5/2/18
Extracting a Specific language Ralf F 4/21/18
Iterate through warc.gz without downloading it Bogdan Metea 4/5/18
help us by sharing your opinion on the value of web-archives as born-digital research materials: Peter Mechant 4/5/18
March 2018 Crawl Archive Now Available Sebastian Nagel 3/29/18
Offset and Length of warc segment Yuheng Du 3/28/18
ChatNoir: First Public Search Engine for the Common Crawl Martin Potthast 3/27/18
Finding content with a Creative Commons license 3/15/18
cdx_toolkit, a cdx client in Python Greg Lindahl 3/12/18
Running news-crawl in a docker container Bogdan Metea 3/9/18
Re: Common Crawl Data Sebastian Nagel 3/9/18
Concerning Common Crawl monthly data release Ernest 3/7/18
How to avoid 503 error Hao Zhang 3/7/18