Is CC-NEWS included in the Common Crawl index API or in CC-MAIN in general?


Max Dallabetta

Jul 4, 2022, 4:41:29 PM
to Common Crawl
I was wondering whether CC-NEWS, the news sub-dataset of Common Crawl, is actually included in CC-MAIN as a true subset and is therefore indexed by the index API. If not, is there any alternative index for CC-NEWS?

I want to query CC-NEWS for specific publishers/domains and would like to avoid iterating over the whole dataset for various reasons.

Thanks
Max

Sebastian Nagel

Jul 7, 2022, 11:37:51 AM
to common...@googlegroups.com
Hi Max,

unfortunately, we do not yet provide any URL index for the CC-NEWS
dataset. There is some overlap between CC-NEWS and the main crawls,
so if you know the domain names you could also run a search there.

Best,
Sebastian
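
For readers who want to try the domain search suggested above, here is a minimal Python sketch against the main-crawl URL index at index.commoncrawl.org. The crawl ID `CC-MAIN-2022-27`, the helper names, and the use of `matchType=domain` are assumptions chosen for illustration, not something stated in this thread. Note that this only finds pages captured by the main crawls; it does not index CC-NEWS itself.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public endpoint of the Common Crawl URL index (CDX-style API).
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(domain, crawl_id="CC-MAIN-2022-27", limit=None):
    """Build an index query URL for all captures under a domain.

    crawl_id is an assumed example; substitute the main crawl you need.
    """
    params = {"url": domain, "matchType": "domain", "output": "json"}
    if limit is not None:
        params["limit"] = str(limit)
    return f"{INDEX_HOST}/{crawl_id}-index?{urlencode(params)}"


def parse_index_lines(body):
    """Parse the response body: one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_captures(domain, crawl_id="CC-MAIN-2022-27", limit=10):
    """Query the index over HTTP and return the capture records."""
    url = build_index_query(domain, crawl_id, limit)
    with urlopen(url, timeout=30) as resp:
        return parse_index_lines(resp.read().decode("utf-8"))
```

Each returned record is a dictionary describing one capture; in the JSON output these normally include fields such as `url`, `filename`, `offset`, and `length`, which locate the record inside a WARC file. A call such as `fetch_captures("example.com")` would need to be repeated per main crawl, since each crawl has its own index.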