Common Crawl

Welcome to the Common Crawl Group!

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

This group is intended to enable discussion and encourage collaboration among the coders, hackers, data scientists, developers and organizations interested in working with open web crawl data.

This group is a place to:
*Discuss challenges
*Share ideas for projects and products
*Look for collaborators and partners
*Offer advice and share methods
*Ask questions and get advice from others
*Show off cool stuff you build
*Keep up to date on the latest news from Common Crawl

Issues accessing Common Crawl data Chris Grayson 4/22/14
Does a single file in a single segment provide a reasonably random cross section of crawl data? David Parks 4/22/14
Finding set of URLs in Common Crawl Metadata Sambit Tripathy 4/18/14
Segments with missing WAT dir DigitalPebble 4/18/14
Trying to read arc.gz files using CommonCrawl Support Library Laurier Rochon 4/17/14
Only Spanish Alejandro Moleiro 4/14/14
Total size of just metadata? 4/13/14
Word frequency count from common crawl Etesh Mangray 4/8/14
Frequency data HTML tags used on websites dianne finch 4/6/14
Example code for working with WARC files Lisa Green 4/6/14
how to use the common_crawl_index remote_copy script Alexander Czech 4/3/14
Costs involved in extracting URL list for 2014 crawl 3/24/14
Winter and Spring Crawl Robert Meusel 3/23/14
Performance of EMR with CC Soren Flexner 3/13/14
Which top level domains are crawled? Alexander Czech 3/12/14
Information needed about the crawling parameters of the CC-MAIN-2013-48 November dataset Zahid Adeel 3/12/14
Meanpath Jan 2014 Torrent - 1.6TB of crawl data from 115m websites. Adam Seabrook 3/10/14
Seeking Developer with Common Crawl Experience for Quick Hacky Project 3/4/14
list of example open source apps that work with common crawl? Jason 2/21/14
How much Porn is in the Common Crawl Corpus? Alex Goretoy 2/21/14