Common Crawl

Welcome to the Common Crawl Group!

Common Crawl, a non-profit organization, provides an open repository of web crawl data that is freely accessible to all. In doing so, we aim to advance the open web and democratize access to information. Today, the Common Crawl Corpus encompasses over two petabytes of web crawl data collected over eight years, and collection is ongoing. As the largest, most comprehensive open repository of web crawl data on the cloud, we contribute to the thriving open data commons that drives innovation, research, and education.

This group is for discussion and collaboration among all those who use or seek to use Common Crawl data and/or share an interest in the open data ecosystem.

Please use this forum to:

Common Crawl Meetup (?) bakztfuture 2/23/18
Host- and domain-level web graph data sets of Nov/Dec/Jan 2017/2018 crawls Sebastian Nagel 2/23/18
Is it possible to crawl a set of urls for scientific research purpose. 2/13/18
Data Enrichment Raman Parashar 2/10/18
Difference of size between WET and WARC 2/9/18
I'm totally NEW to crawler. Ki Kim 2/8/18
Extracting <a hrefs> from the data 2/8/18
Page ranks available for the host-level web graph of Aug/Sept/Oct 2017 crawls Sebastian Nagel 2/8/18
Overloading and bulk index downloads Sebastian Nagel 2/6/18
Cannot fetch a url Hao Zhang 2/4/18
I have requested for Google to re-crawl my robot file City Strip 2/2/18
How to proceed with the data for expense categorization? Rahul Tiwari 1/30/18
January 2018 Crawl Archive Now Available Sebastian Nagel 1/29/18
Preview release of URL in columnar format Sebastian Nagel 1/26/18
I want to Get worlds all active website URLS 1/25/18
Type of crawl? Thirumalai Raj R 1/15/18
Heading ids for the website Samuel 1/13/18
**Beginner** HELP Thirumalai Raj R 1/9/18
robots.txt question Yossi 12/31/17
Pickup Chinese web pages from the data set. 邓尧 12/25/17