Common Crawl

Welcome to the Common Crawl Group!


Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.


This group is intended to enable discussion and encourage collaboration within the community of coders, hackers, data scientists, developers, and organizations interested in working with open web crawl data.

This group is a place to:
* Discuss challenges
* Share ideas for projects and products
* Look for collaborators and partners
* Offer advice and share methods
* Ask questions and get advice from others
* Show off cool stuff you build
* Keep up to date on the latest news from Common Crawl



Recent topics (showing 20 of 327):
* Why do the crawls significantly vary in size? (shlomi...@gmail.com, 4/13/15)
* Errors searching for single URL (Aline Bessa, 4/11/15)
* Any interest in running Apache Tika as part of CommonCrawl? (Allison, Timothy B., 4/7/15)
* Extracted text in WET files (Lukas Michelbacher, 4/6/15)
* Announcing: New CommonCrawl Index and Query Api (Ilya Kreymer, 3/31/15)
* Announcing: command-line client for index server (Ilya Kreymer, 3/30/15)
* Narrowing down the parent domain/website (One Speed, 3/16/15)
* steps to run the examples (Srinivasan Venkatachary, 3/15/15)
* Common Crawl enhancements to Nutch (Peter Dietz, 3/12/15)
* Create CommonCrawl files using Java (tot...@di.uniroma1.it, 3/9/15)
* Sitemap indexing problems for robots.txt (topeci...@gmail.com, 3/2/15)
* URL Index for a single page (Aline Bessa, 2/26/15)
* WARC-Record-ID uniqueness? (Titus Barik, 2/18/15)
* Does the Common Crawl include SSL sites? (Colin Dellow, 2/18/15)
* robot.txt files for denied sites (Michael Pastore, 2/9/15)
* Exploring commoncrawl with a keyword (Paul Tolmer, 2/9/15)
* What hadoop version is used in cc-warc-examples for Java language (Anatoly Vostryakov, 1/28/15)
* Requesting advice on WARC library to use (Titus Barik, 1/27/15)
* WikiReverse project (Ross Fairbanks, 1/26/15)
* Filtering whole pages with ruby streaming (Christian Becker, 1/11/15)