Common Crawl

Welcome to the Common Crawl Group!

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

This group is intended to enable discussions and encourage collaboration between the community of coders, hackers, data scientists, developers and organizations interested in working with open web crawl data. 

This group is a place to:
* Discuss challenges
* Share ideas for projects and products
* Look for collaborators and partners
* Offer advice and share methods
* Ask questions and get advice from others
* Show off cool stuff you build
* Keep up to date on the latest news from Common Crawl

Showing 1-20 of 383 topics
Common Crawl and Apache Spark giorgio79 10/8/15
Request for clarifications on commoncrawl Srinath Achanta 10/8/15
Obtaining Russian Data Lewis John Mcgibbney 10/6/15
mothership testservice-pa googleapis Lisa Agostoni 10/6/15
Getting Redirect Information From WAT Robert Meusel 10/3/15
Common Crawl now has a URL index! Lisa Green 10/2/15
Domain Coverage Alexander Mitchell 10/1/15
Is there a search engine currently using CommonCrawl data? Mike Onghai 9/23/15
Domain urls needed Akash Varma 9/23/15
Job Opening at Common Crawl - Crawl Engineer / Data Scientist Sara Crouse 9/23/15
Greetings from Sara, Common Crawl’s New Director Sara Crouse 9/17/15
small amount of domains crawled Chen Yaniv 9/6/15
Bug tracker? Anomalies in most crawled domains Tom Morris 8/26/15
Mistake in Common Crawl Index announcement blog post Tom Morris 8/26/15
Crawling Strategy of newer Crawls Robert Meusel 8/25/15
Random sampling from Common Crawl data PythonGuru 8/25/15
common-crawl folder is empty Martin Thurn 8/19/15
Monthly archive info 8/11/15
I'm looking for 1.6TB of crawl data from 115m websites Vesela Gavrailova 8/10/15
maintained list of domains in index Jamie Costello 8/10/15