Common Crawl

Welcome to the Common Crawl Group!

Common Crawl, a non-profit organization, provides an open repository of web crawl data that is freely accessible to all. In doing so, we aim to advance the open web and democratize access to information. Today, the Common Crawl Corpus encompasses over two petabytes of web crawl data collected over eight years, and collection is ongoing. As the largest, most comprehensive open repository of web crawl data on the cloud, we contribute to the thriving open data commons that drives innovation, research, and education.

This group is for discussion and collaboration among all those who use or seek to use Common Crawl data and/or share an interest in the open data ecosystem.

Please use this forum to:

Showing 1-20 of 457 topics
Topic | Author | Date
get an WARC archive with all files from a domain | David Portabella | 9/25/16
Missing Open Graph property | Christian Lund | 9/25/16
how complete is CommonCrawl? | David Portabella | 9/25/16
Index / API for WAT files | Christian Lund | 9/25/16
Missing Indexes from November 2014 and before | zbagz | 9/24/16
java.lang.OutOfMemoryError: Java heap space | David Portabella | 9/22/16
Common Crawl Index Down | Eddie Johnson | 9/19/16
August 2016 crawl archive now available, release of robots.txt and redirects data set | Sebastian Nagel | 9/16/16
CCTLD Scrap | Bhavik Hingrajia | 9/15/16
Looking for contractor experienced with CC, Java, EMR | Clayton Scott | 9/14/16
Get title field for from CDX API | Neil | 9/8/16
New to Crawler | Maytham Fahmi | 8/31/16
Any other WARC archives like Common Crawl? | Liz Ron | 8/30/16
URL Search Tool is down | Maximilian Böhm | 8/10/16
July 2016 crawl archive now available | Sebastian Nagel | 8/9/16
Host-level WebGraph & PageRank datasets from the June 2016 crawl | Sylvain Zimmer | 8/2/16
Can an mime type error have an influence on WARC file content? | Vincent Boiteau-Robert | 8/1/16
Script produces Unicode Errors for certain Common Crawl Segments | bakztfuture | 7/29/16
I have downloaded a single warc.gz file . but it has metadata information but not the content | Bhavana | 7/26/16
IP searching | Dan Wolff | 7/26/16