
Subject: Inquiry About Accessing Common Crawl Data for Commercial Use


Manmohan Nayak

Mar 6, 2025, 6:17:56 PM
to Common Crawl

Dear Common Crawl Team,

I hope this message finds you well. My name is Manmohan, and I am a software developer/data scientist working on a project that involves analyzing global news articles from 2008 onwards in multiple languages. I came across Common Crawl and am impressed by the scale and accessibility of your dataset.

I would like to clarify a few points before proceeding:

  1. Commercial Use: I understand that Common Crawl data is available under the CC BY 4.0 license. Are there any additional restrictions or considerations for using the data in a commercial product?

  2. Attribution: Could you provide guidance on how to properly attribute Common Crawl in a commercial application?

  3. Accessing News Data: Are there any best practices or tools you recommend for filtering and extracting news articles from the WARC/WET files?

  4. Updates: How frequently is the dataset updated, and is there a way to track changes or additions to the crawl data?

If there are any additional resources, documentation, or contacts you could share to help me get started, I would greatly appreciate it.

Thank you for your time and support. I look forward to your response.

Best regards,

Jen English

Mar 7, 2025, 2:17:06 PM
to Common Crawl
Manmohan,


Commercial Use: I understand that Common Crawl data is available under the CC BY 4.0 license. Are there any additional restrictions or considerations for using the data in a commercial product?

Please note that all of the web content in our datasets is copyrighted by people other than us, and we do not and cannot offer a license to the crawled page contents. For more information about how we collect data, please see our Terms of Use and our FAQ.

Attribution: Could you provide guidance on how to properly attribute Common Crawl in a commercial application?

We do not currently have guidelines on how to cite the usage of Common Crawl data. However, we suggest including "Common Crawl", our website (https://commoncrawl.org/), and the identifiers of the crawls that were used (e.g., CC-MAIN-2025-08).
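
For readers who want to generate such a credit line automatically, here is a minimal sketch of the suggestion above. It is not an official citation format: the function name `format_attribution`, the wording of the output, and the check that identifiers start with "CC-" are all assumptions made for illustration.

```python
def format_attribution(crawl_ids):
    """Build a one-line credit naming Common Crawl, its website, and the crawls used.

    The output wording is hypothetical; only the three ingredients
    (name, website, crawl identifiers) come from the suggestion above.
    """
    # Assumption: crawl identifiers look like "CC-MAIN-2025-08", so they start with "CC-".
    unexpected = [cid for cid in crawl_ids if not cid.startswith("CC-")]
    if unexpected:
        raise ValueError(f"unexpected crawl identifier(s): {unexpected}")

    # De-duplicate and sort so the credit line is stable across runs.
    ids = ", ".join(sorted(set(crawl_ids)))
    return f"Data source: Common Crawl (https://commoncrawl.org/), crawls: {ids}"


if __name__ == "__main__":
    print(format_attribution(["CC-MAIN-2025-08"]))
```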

Accessing News Data: Are there any best practices or tools you recommend for filtering and extracting news articles from the WARC/WET files?

For news articles, you may be interested in our news dataset, which was started in 2016:  https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html
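
As a starting point for working with the news dataset, the sketch below builds the URLs of the per-month file listings and expands the paths in a listing into download URLs. It rests on assumptions that should be checked against the index page linked above: that files are grouped under `crawl-data/CC-NEWS/<year>/<month>/`, that each month has a `warc.paths.gz` listing, and that paths are resolved against `https://data.commoncrawl.org/`. The function names are hypothetical.

```python
from datetime import date

# Assumption: WARC paths are relative to this host, as in the link above.
BASE_URL = "https://data.commoncrawl.org/"


def news_listing_urls(start, end):
    """Yield one listing URL per calendar month from start to end, inclusive.

    Assumes a crawl-data/CC-NEWS/<YYYY>/<MM>/warc.paths.gz layout.
    """
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield f"{BASE_URL}crawl-data/CC-NEWS/{year:04d}/{month:02d}/warc.paths.gz"
        month += 1
        if month > 12:
            year, month = year + 1, 1


def to_download_urls(paths_text):
    """Turn the decompressed lines of a listing into full HTTPS URLs."""
    return [BASE_URL + line.strip() for line in paths_text.splitlines() if line.strip()]


if __name__ == "__main__":
    for url in news_listing_urls(date(2024, 12, 1), date(2025, 2, 1)):
        print(url)
```

Because the news dataset only goes back to 2016, articles from 2008 to 2015 would have to come from elsewhere, for example the main crawls.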


Updates: How frequently is the dataset updated, and is there a way to track changes or additions to the crawl data?

We release new datasets on a monthly basis, which you can see from the dropdown menu here: https://commoncrawl.org/overview
Note that the content of each crawl (i.e., the URLs crawled) will vary from one crawl to the next. You can explore crawl overlap, domains crawled, and other statistics here: https://commoncrawl.github.io/cc-crawl-statistics/
For any errata affecting specific crawls you can check: https://commoncrawl.org/errata
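
One possible way to track new releases programmatically, sketched under assumptions: the index server is assumed to publish a JSON list of crawls at `https://index.commoncrawl.org/collinfo.json`, with each entry carrying an `id` field such as `CC-MAIN-2025-08`. The helper names and the sample data are hypothetical; confirm the endpoint and field names before relying on them.

```python
import json
from urllib.request import urlopen

# Assumption: this endpoint lists all crawls as a JSON array of objects with an "id" key.
COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def fetch_collinfo(url=COLLINFO_URL):
    """Download the crawl list as text (performs a network request)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def new_crawl_ids(collinfo_text, already_seen):
    """Return the ids in the crawl list that are not in already_seen, in listed order."""
    crawls = json.loads(collinfo_text)
    return [crawl["id"] for crawl in crawls if crawl["id"] not in already_seen]


if __name__ == "__main__":
    # Illustrative sample only; real entries may carry more fields.
    sample = '[{"id": "CC-MAIN-2025-08"}, {"id": "CC-MAIN-2025-05"}]'
    print(new_crawl_ids(sample, already_seen={"CC-MAIN-2025-05"}))
```

Storing the set of identifiers already processed and re-running the comparison about once a month would match the release cadence described above.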


If there are any additional resources, documentation, or contacts you could share to help me get started, I would greatly appreciate it.

The best place to start is our Get Started page: https://commoncrawl.org/get-started 
You will find additional examples, use cases, and other information on our website and blog, and you can also search our mailing list archives here on Google Groups.

Best, 
Jen
