Coordination of news datasets

Lawrence Stewart

Dec 1, 2022, 7:27:44 PM
to Common Crawl
Hi,

I'm wondering how the crawling of news sites is coordinated. Is it basically just whoever submits what, or is there a group (or groups) dividing up the work?

Digging into the monthly URLs, I noticed quite a bit of local news, some lower-quality aggregators that make up a large portion of a month's records, and a surprising lack of crypto news sites.

Sebastian Nagel

Dec 2, 2022, 5:15:56 AM
to common...@googlegroups.com
Hi Lawrence,

the news crawling is done on a single node. The seeds (news feeds and
sitemaps) for the crawler were initially taken from DMOZ (see [1]),
and there have also been contributions from the Common Crawl community
(e.g. [2]). I'm currently upgrading the news crawler (software and
hardware); with the upgrade, additional news sites found on
Wikidata [3] will be included.
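
To illustrate, discovery from a single seed boils down to something
like this sketch (Python with feedparser; the feed URL is a
placeholder, and this is not the crawler's actual code, which is
built on StormCrawler):

import feedparser  # pip install feedparser

# Parse one RSS/Atom feed from the seed list and collect the article
# links it announces; these become the URLs to fetch.
feed = feedparser.parse("https://news.example.com/rss.xml")
for entry in feed.entries:
    if "link" in entry:
        print(entry.link)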

> quite a bit of local news

Yes, but that's intended: it's a consequence of having a large,
multilingual seed list.

> some lower quality aggregators that make up a large portion of the
> records in a month

Yes, there are some sites that push a lot of news articles, most of
them likely auto-generated, especially financial news.

> a surprising lack of crypto news sites.

Interesting.

If you have any list of news sites to share, please let us know!

Best,
Sebastian

[1] https://github.com/commoncrawl/news-crawl/issues/8
[2] https://github.com/commoncrawl/news-crawl/issues/12
[3] https://github.com/commoncrawl/news-crawl/issues/50

Lawrence Stewart

Dec 2, 2022, 11:23:50 AM
to Common Crawl
I'll dig a bit more into some other months; so far I've only gone through October 2022. Most of the larger crypto news sites were absent; it's possible they don't have the correct news sitemaps or RSS/Atom feeds. I'll do some more digging here.

One site that stood out from October, and that I couldn't figure out how it got in, was Nike.com, which had 74,418 records in the month's WARCs. I'll work on sharing the code to reproduce this. But is it possible that: a) it was a news source? b) URLs can be duplicated across WARC files? c) something else?
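
Roughly, my per-host count looks like the following sketch (using
warcio on one of the month's WARC files; the file name is illustrative,
not the exact file):

from collections import Counter
from urllib.parse import urlsplit

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Count response records per host in a single CC-NEWS WARC file.
counts = Counter()
with open("CC-NEWS-20221001004522-00100.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            counts[urlsplit(url).hostname] += 1

for host, n in counts.most_common(20):
    print(f"{n:8d}  {host}")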

Lawrence Stewart

Dec 2, 2022, 1:07:15 PM
to Common Crawl
It would be nice to have an index of URLs for the news datasets, like the one that exists for the monthly crawls.

I don't believe this exists, does it?
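
For the monthly crawls it's just a query against the CDX index server,
e.g. something like this (assuming CC-MAIN-2022-40 is the collection
covering that period):

import json

import requests  # pip install requests

# Query the monthly crawl's CDX index for a domain; the server returns
# one JSON record per line. No equivalent endpoint exists for CC-NEWS.
resp = requests.get(
    "https://index.commoncrawl.org/CC-MAIN-2022-40-index",
    params={"url": "nike.com/*", "output": "json", "limit": "5"},
    timeout=60,
)
for line in resp.text.strip().splitlines():
    rec = json.loads(line)
    print(rec["timestamp"], rec["url"])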

Sebastian Nagel

Dec 2, 2022, 1:37:50 PM
to common...@googlegroups.com
Hi Lawrence,

no, there is no index yet. It will come sooner or later, but likely
with some delay for updates (e.g. monthly).

> One site that stood out from October, and that I couldn't figure
> out how it got in, was Nike.com, which had 74,418 records in the
> month's WARCs.

It was there until I blocked it.

> a) it was a news source?

In my opinion it isn't. But the site announces parts of its content
the same way news sites do: via news feeds and news sitemaps. The
problem is that some news sites sell slots in their feeds and
sitemaps and place advertisements there. The crawler follows these
links the same way it follows links to news articles, and because of
a news sitemap auto-detection feature, thousands of "news" articles
from the advertised site may then be crawled.

This issue is to be addressed together with the upgrade. I'm not sure
what the solution will be: disabling the auto-detection, or a strict
cross-submit verification. The latter isn't trivial because quite a
few sites delegate the assembly and hosting of their feeds and
sitemaps to third-party domains.
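
The strict check would be something along these lines (just a sketch,
using tldextract; as said, it would wrongly reject the sites that host
their feeds on third-party domains):

import tldextract  # pip install tldextract

def same_registered_domain(seed_url: str, linked_url: str) -> bool:
    """Accept a link only if it stays on the announcing site's
    registered domain (public-suffix aware)."""
    seed = tldextract.extract(seed_url)
    linked = tldextract.extract(linked_url)
    return (seed.domain, seed.suffix) == (linked.domain, linked.suffix)

# An ad slot in a publisher's news sitemap pointing off-site:
print(same_registered_domain(
    "https://news.example.com/news-sitemap.xml",
    "https://www.example.com/article"))    # True, same site
print(same_registered_domain(
    "https://news.example.com/news-sitemap.xml",
    "https://www.nike.com/some-product"))  # False, cross-submit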

> b) URLs can be duplicated across WARC files?

Because the crawler uses only feeds and sitemaps, there should be
very few duplicates in general, and no duplicated URLs at all except
in one case: when the crawler crashed and was restarted, there may be
a few duplicated WARC records, one written immediately before the
crash and the duplicate soon after it.
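
If you want to verify this on your side, a quick sketch (warcio
again; the file paths are placeholders):

from collections import Counter

from warcio.archiveiterator import ArchiveIterator

# Count how often each URL occurs across a set of WARC files and
# report any that appear more than once.
seen = Counter()
for path in ["news-1.warc.gz", "news-2.warc.gz"]:  # placeholders
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                seen[url] += 1

duplicates = {url: n for url, n in seen.items() if n > 1}
print(len(duplicates), "duplicated URLs")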

Best,
Sebastian

Lawrence Stewart

Dec 2, 2022, 3:39:16 PM
to Common Crawl
Hi Sebastian,

Thanks for the information, that's very helpful. 

I wasn't aware of the extent to which paid submissions happen in news feeds. That does make sense for Nike. Can the seed file used by the node be downloaded? Or do you have any examples of feeds that could contain the Nike entries? If it's too much work to find specifics, that's alright; I'm just curious.

With the example of Nike, ideally the blog/news content would be included but not the products (I assume products made up a large part of those records). Similar to https://github.com/commoncrawl/news-crawl/issues/41, I used URL patterns in a previous crawling project; they work reasonably well but can be a pain when a site changes its URL structure, and that could be tricky to manage across the hundreds of thousands(?) of domains here. Another option is a rule-based or decision-tree classifier to decide on content, but that's not a trivial task either, especially without introducing overhead into the crawler from deciding whether a page is news content before yielding it or appending it to the WARC.
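
As a sketch of what I mean by URL patterns (the Nike rule below is
invented, purely illustrative):

import re
from urllib.parse import urlsplit

# Hypothetical per-domain allowlist: keep editorial paths, drop the
# rest. Rules like these break whenever a site restructures its URLs.
ALLOW = {
    "nike.com": re.compile(r"^/(news|stories)/"),
}

def looks_like_news(url: str) -> bool:
    parts = urlsplit(url)
    host = parts.hostname or ""
    domain = ".".join(host.split(".")[-2:])  # naive, ignores ccTLDs
    rule = ALLOW.get(domain)
    return bool(rule and rule.match(parts.path))

print(looks_like_news("https://www.nike.com/news/announcement"))  # True
print(looks_like_news("https://www.nike.com/t/air-max-shoe"))     # False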

Lawrence Stewart

Dec 2, 2022, 3:41:18 PM
to Common Crawl
I should add that following URL patterns also involved keeping an index to avoid recrawling old, already-crawled content.
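
As a sketch of what that index could look like (a SQLite-backed seen
set; illustrative, not exactly what I used):

import sqlite3

# Persistent seen-set: check-and-insert before each fetch so
# already-crawled URLs are skipped across restarts.
db = sqlite3.connect("seen_urls.db")
db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

def should_fetch(url: str) -> bool:
    """True the first time a URL is seen, False on every later call."""
    try:
        with db:
            db.execute("INSERT INTO seen (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:  # primary-key hit: already crawled
        return False

print(should_fetch("https://news.example.com/a"))  # True
print(should_fetch("https://news.example.com/a"))  # False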