Questions on News Common Crawl data


Yiyun Zhang

Jul 29, 2025, 4:11:41 PM
to Common Crawl
Hi, Common Crawl team. I am an undergraduate student at UC Davis. I currently work in a communication lab and am looking for news data for possible research. I read a peer's paper in which, according to the methods section, they downloaded a JSON file that includes news text and image URLs. However, I only get the website URLs from the WARC files, so I think I may have missed something. Could you please tell me how to get the same JSON output? Thank you for your time!

Sebastian Nagel

Jul 30, 2025, 9:47:44 AM
to common...@googlegroups.com
Hi,

> downloaded a JSON file in a format that includes news text and image
> URLs.

I'm not aware that such data is provided by Common Crawl.

Could you share a link to the paper? Or the exact description
of where the data was downloaded from?

Hope this helps to understand the problem. Thanks!

Best,
Sebastian

Yiyun Zhang

Jul 30, 2025, 12:52:02 PM
to Common Crawl
Thank you for your reply. Here is the paper: https://arxiv.org/abs/2503.20960
The content is located on page 3, under the "Dataset" section.
"We query the publicly available Common Crawl archives for the corresponding publishers and extracted each article’s text, headline, publication date, image_urls, and other metadata in JSON format."

Sebastian Nagel

Jul 30, 2025, 3:54:54 PM
to common...@googlegroups.com
Hi,

thanks! This helps to understand the problem.

They use news-please [1] which reads WARC files,
parses the HTML and extracts structured data
from it.

I tried it a couple of years ago. The quality of the
extracted text and data was good. But the price
is that it wasn't the fastest extraction tool.
You may try it. Alternatively, ask the authors
whether they can share the extracted data with you.

Best,
Sebastian

[1] https://github.com/fhamborg/news-please/
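
To illustrate the idea of turning crawled HTML into a JSON record, here is a minimal sketch using only the Python standard library. It is not news-please and does not reproduce its extraction quality or its API; the class name, the function name, the sample HTML, and the output field names are all assumptions made for this example. For real work, follow the news-please README.

```python
import json
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Toy extractor (hypothetical): collects <title>, <p> text, and <img src> URLs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self.image_urls = []
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = tag
        elif tag == "p":
            self._current = tag
            self.paragraphs.append("")
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def extract_article(html, url):
    """Return a JSON-serializable dict; the field names are illustrative only."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    paragraphs = [p.strip() for p in parser.paragraphs if p.strip()]
    return {
        "url": url,
        "title": parser.title.strip(),
        "text": "\n".join(paragraphs),
        "image_urls": parser.image_urls,
    }


# Made-up page standing in for the HTML payload of one WARC response record.
sample_html = """<html><head><title>Example headline</title></head>
<body><p>First paragraph.</p><img src="https://example.com/a.jpg">
<p>Second paragraph.</p></body></html>"""

record = extract_article(sample_html, "https://example.com/story")
print(json.dumps(record, indent=2))
```

Real news pages need much more than this (boilerplate removal, publication dates, encodings), which is the part a dedicated extractor handles.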

Yiyun Zhang

Jul 31, 2025, 4:52:09 PM
to Common Crawl
Thank you for your reply and time! I will check that out.

Yiyun Zhang

Aug 13, 2025, 4:54:04 PM
to Common Crawl
Hi, I tried news-please, and I have one question. In the paper, they have data from the New York Times, which requires a subscription. When I accessed Common Crawl, I did not get any data that requires a subscription. Does Common Crawl provide data that requires a subscription?

Greg Lindahl

Aug 14, 2025, 7:32:19 PM
to common...@googlegroups.com
Yiyun,

Common Crawl's CCBot only crawls the public web -- it doesn't log in to any website. In the past, we crawled NYT webpages, because they were public. At some point, the NYT blocked us in robots.txt, and we stopped crawling them. Also, they sent us a legal demand that we erase past crawls. This demand was made public in a NYT submission to a US Copyright Office request for public comments about AI crawling.

We recommend that you not use any of the NYT data in our crawl.

-- greg
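
As a rough illustration of how a robots.txt rule blocks one crawler while leaving a site open to others, the sketch below uses Python's standard `urllib.robotparser`. The robots.txt content and the URLs are made up for this example and are not taken from any real publisher; "OtherBot" is a hypothetical user agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow everything for CCBot, allow all other agents.
robots_txt = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/news/story.html"
ccbot_allowed = parser.can_fetch("CCBot", url)
other_allowed = parser.can_fetch("OtherBot", url)

print("CCBot allowed:", ccbot_allowed)
print("OtherBot allowed:", other_allowed)
```

A crawler that honors robots.txt stops fetching the site once such a rule appears, which is why later crawls no longer contain pages from a publisher that has added one.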



Jose Mendez

Aug 15, 2025, 1:49:52 AM
to common...@googlegroups.com
The information behind the paywall probably won't be available. Crawlers only access publicly available information.

Rich Skrenta

Aug 15, 2025, 2:00:50 AM
to common...@googlegroups.com
Publishers frequently expose content outside their paywall initially to attract crawlers and for SEO, and then pull it back behind their pay-gates after an interval (7-30 days) as bait for monetization. CCBot sometimes encounters this category of content while the pages are publicly available.

I agree with Greg's recommendation to avoid NYT content in general. In the near future we will be posting an "opt-out" registry of hosts which should be avoided for any inclusion in training datasets, to avoid complaints from potentially litigious publishers.

Rich
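
Until such a registry exists, its format is unknown. Purely as a sketch of how a list of hosts to avoid could be applied when assembling a dataset, the example below filters URLs against a hypothetical set of hostnames. The set contents, the function name, and the choice to also exclude subdomains are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical opt-out list; a real registry may use a different format.
OPT_OUT_HOSTS = {"blocked.example"}


def is_opted_out(url, opt_out_hosts):
    """True if the URL's host is a listed host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in opt_out_hosts)


urls = [
    "https://news.example.org/story",
    "https://blocked.example/article",
    "https://www.blocked.example/section/article",
]

kept = [u for u in urls if not is_opted_out(u, OPT_OUT_HOSTS)]
print(kept)
```

Matching on the parsed hostname rather than on a substring of the URL avoids dropping unrelated sites whose paths happen to contain a listed name.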
