Common Crawl

Welcome to the Common Crawl Group!

Common Crawl, a non-profit organization, provides an open repository of web crawl data that is freely accessible to all. In doing so, we aim to advance the open web and democratize access to information. Today, the Common Crawl Corpus encompasses over two petabytes of web crawl data collected over eight years, and collection is ongoing. As the largest and most comprehensive open repository of web crawl data in the cloud, we contribute to the thriving open data commons that drives innovation, research, and education.

This group is for discussion and collaboration among all those who use or seek to use Common Crawl data and/or share an interest in the open data ecosystem.

Please use this forum to:
  • Discuss challenges
  • Offer advice and share methods
  • Ask questions and get advice from others
  • Share ideas for projects and products
  • Look for collaborators and partners
  • Convey new and/or derivative uses of the data
  • Show off cool stuff you build
  • Keep up to date on the latest news from Common Crawl