Is CC-NEWS included in the Common Crawl index API or in CC-MAIN in general?
Max Dallabetta
Jul 4, 2022, 4:41:29 PM
to Common Crawl
I was wondering whether CC-NEWS, the news sub-dataset of Common Crawl, is actually included in CC-MAIN as a true subset and therefore indexed by the index API. If not, is there any alternative index for CC-NEWS?
I want to query CC-NEWS for specific publishers/domains and would like to avoid iterating over the whole dataset for various reasons.
Thanks
Max
Sebastian Nagel
Jul 7, 2022, 11:37:51 AM
to common...@googlegroups.com
Hi Max,
unfortunately, we do not yet provide any URL index for the CC-NEWS
dataset. There is some overlap between CC-NEWS and the main crawls,
so if you know the domain names you could also run a search there.
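
For anyone who wants to try the domain search against the main crawls, here is a minimal sketch in Python. It assumes the public index server at index.commoncrawl.org and uses "CC-MAIN-2022-27" only as an example crawl ID; the helper names (build_query_url, parse_records, search_domain) are hypothetical and not part of any Common Crawl tooling. It only finds pages that were also captured by the main crawl, so it is not a substitute for a CC-NEWS index.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Common Crawl index server.
INDEX_SERVER = "https://index.commoncrawl.org"


def build_query_url(crawl_id, domain):
    """Build an index query URL matching every captured path under `domain`."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{params}"


def parse_records(body):
    """Parse the response body: one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def search_domain(crawl_id, domain):
    """Query one main-crawl index for a domain (performs a network request)."""
    with urlopen(build_query_url(crawl_id, domain)) as resp:
        return parse_records(resp.read().decode("utf-8"))


# Example usage (crawl ID and domain are placeholders):
#   for rec in search_domain("CC-MAIN-2022-27", "example.com"):
#       print(rec["url"], rec["filename"], rec["offset"], rec["length"])
```

Each returned record should point to a WARC file plus an offset and length, so single captures can be fetched with a range request instead of downloading whole files. Repeat the query per crawl ID to cover more than one main crawl.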