Common Crawl

Welcome to the Common Crawl Group!

Common Crawl, a non-profit organization, provides an open repository of web crawl data that is freely accessible to all. In doing so, we aim to advance the open web and democratize access to information. Today, the Common Crawl Corpus encompasses over two petabytes of web crawl data collected over eight years, and collection is ongoing. As the largest, most comprehensive open repository of web crawl data in the cloud, we contribute to the thriving open data commons that drives innovation, research, and education.

This group is for discussion and collaboration among all those who use or seek to use Common Crawl data and/or share an interest in the open data ecosystem.

Please use this forum to:

Showing 1-20 of 592 topics
Topic | Author | Date
I want to Get worlds all active website URLS | | 1/18/18
Type of crawl? | Thirumalai Raj R | 1/15/18
Heading ids for the website | Samuel | 1/13/18
**Beginner** HELP | Thirumalai Raj R | 1/9/18
robots.txt question | Yossi | 12/31/17
Pickup Chinese web pages from the data set. | 邓尧 | 12/25/17
Host- and domain-level web graph data sets of Aug/Sept/Oct 2017 crawl | Sebastian Nagel | 12/24/17
JSON API | | 12/24/17
December 2017 Crawl Archive Now Available | Sebastian Nagel | 12/22/17
Pages versus URLs, and uniqueness of WAT file entries | Henry S Thompson | 12/4/17
CC-SA of Common Crawl | Laura Dietz | 12/3/17
November 2017 Crawl Archive Now Available | Sebastian Nagel | 11/30/17
time series models from common crawl data | | 11/9/17
Noob ? regarding phishing URLs | Stephen Bright | 11/2/17
October 2017 Crawl Archive Now Available | Sebastian Nagel | 10/29/17
CDN for faster download or alternative protocols - UDT,QUIC | brano199 | 10/23/17
Microdata Parser for Commoncrawl | Gautam Kishore Shahi | 10/9/17
How crawling is performed | Spider99 | 10/9/17
Best Cloud Services for Web crawling | | 10/6/17
Download Wet Files of Web Pages with a Given Tag | Dakila | 10/4/17