Missing site articles in new dataset

Nikolay Kushin

Dec 27, 2021, 12:11:58 PM

to Common Crawl

I started working with the news dataset and discovered that, for some news sites, particular articles from a given period are missing, while other articles from the same news site with the same publishing date do appear in the dataset. I saw such cases for January 2018, November 2018 and January 2020.

Could someone suggest what the reason for this behavior might be?

I tried looking at the WARC files dated from the article's publishing date through the following day. In general, the publishing date on the original website and the date of appearance in the dataset seem to correlate fairly well: in most cases I looked at it was the same day, rarely the next day.

It also seems that the particular articles I was looking for are discoverable through sitemaps.

Thanks in advance! I also very much appreciate all the work done to publish this dataset!

Sebastian Nagel

Jan 2, 2022, 6:04:39 AM

to common...@googlegroups.com

Hi Nikolay,

> discovered that for some news sites particular articles for the given
> period of time are missing. At the same time other articles from the
> same new site with the same publishing date appear in the dataset.
> Those were some cases for January 2018, November 2018 and January
> 2020.

The news crawler relies on news feeds and news sitemaps but skips
news items with a publication date older than 30 days if the
feed/sitemap indicates the publication date.
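
To illustrate the 30-day rule, here is a minimal sketch in Python. It is not the crawler's actual code; the function name `should_fetch` and the handling of the exact 30-day boundary are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Age threshold taken from the description above.
MAX_AGE = timedelta(days=30)


def should_fetch(published: Optional[datetime], now: datetime) -> bool:
    """Hypothetical filter: decide whether a feed/sitemap item is fetched."""
    if published is None:
        # The feed/sitemap gives no publication date, so the age rule
        # cannot be applied and the item is not skipped by it.
        return True
    # Skip items whose publication date is older than 30 days.
    return now - published <= MAX_AGE


now = datetime(2022, 1, 2, tzinfo=timezone.utc)
print(should_fetch(now - timedelta(days=3), now))   # recent item
print(should_fetch(now - timedelta(days=45), now))  # item older than 30 days
```

So an article that first shows up in a feed/sitemap more than 30 days after its stated publication date would not be picked up later.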

The crawler monitors about 50k feeds and 500k news sitemaps (not
counting sitemap indexes and non-news sitemaps) from 12,000 domains /
news sites. This means not every feed/sitemap (aka "seed") can be
fetched at short intervals. The crawler adapts to the change
frequency of a seed and revisits frequently changing seeds every 90
minutes. If there are no changes, the interval may grow to 90 days, which
avoids spending significant resources on stale seeds.
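
As a rough sketch of what such an adaptive schedule can look like: only the two bounds (90 minutes and 90 days) come from the description above. The doubling rule and the name `next_interval` are assumptions for the example, not the crawler's real scheduling logic.

```python
from datetime import timedelta

# Bounds taken from the description above.
MIN_INTERVAL = timedelta(minutes=90)
MAX_INTERVAL = timedelta(days=90)


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Hypothetical rule: reset on change, otherwise back off by doubling."""
    if changed:
        # The seed had new content: go back to the shortest interval.
        return MIN_INTERVAL
    # Nothing new: wait twice as long, but never longer than the cap.
    return min(current * 2, MAX_INTERVAL)


interval = MIN_INTERVAL
for _ in range(3):
    interval = next_interval(interval, changed=False)
print(interval)  # interval after three unchanged fetches
print(next_interval(interval, changed=True))  # reset after a change
```

Under this doubling assumption, a seed would need 11 consecutive unchanged fetches to go from 90 minutes to the 90-day cap, and a seed that has drifted to a long interval can miss articles published before its next visit.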

Potential reasons why a news article was missed:
- the 90-minute interval can be too long for larger news sites, esp.
on Monday mornings when many news articles are published
- some news sites organize feeds/sitemaps by topic and/or geographic
region, which in turn means that some of them are not re-fetched in
time to catch all news articles
- the crawler is operated with a high level of politeness and a guaranteed
and comparatively long delay (6 seconds) between successive requests to
the same domain
- sometimes due to site re-launches or changes of feed / sitemap URLs
the crawler loses a news site entirely. Automatic detection isn't
perfect and it may take quite a long time (even months) until the
feeds/sitemaps are added again.
- StormCrawler does a nice job running for months or years with minimal
supervision. However, there have been some downtimes in the past.
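
To put the politeness delay into numbers, here is a small back-of-the-envelope calculation. It gives an upper bound only: it ignores response times and assumes requests to one domain are strictly sequential. The helper `max_requests` is made up for the example.

```python
# Values taken from the list above.
FETCH_DELAY_S = 6             # delay between requests to the same domain
REVISIT_INTERVAL_S = 90 * 60  # shortest seed revisit interval, in seconds


def max_requests(window_s: int, delay_s: int = FETCH_DELAY_S) -> int:
    """Upper bound on requests to one domain within a time window."""
    return window_s // delay_s


print(max_requests(REVISIT_INTERVAL_S))  # 900 requests per 90 minutes
print(max_requests(24 * 60 * 60))        # 14400 requests per day
```

This budget has to cover both the feeds/sitemaps of a domain and its article pages, so a site with many topic or regional feeds and a high publishing rate can exceed it.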

> I tried to look at the warc files dated from publishing date of
> article + 1 day. And in general it seems that the dates are more or
> less correlate between publishing on original website and appearing in
> the dataset - in most cases I looked at same day, rarely next day.

Thanks for sharing this information. Indeed, maximizing freshness is the
intended behavior. But it is certainly not perfect, and as a result some
news articles are occasionally not included.

> Thanks in advance! I also very much appreciate a lot of work done to
> publish this dataset!

In general, we cannot, and do not even aim to, guarantee completeness
for the news dataset. Instead we want to provide a broad collection
of news from all over the world in a multitude of languages, currently
more than 100 languages from over 200 top-level domains. In addition,
the crawler fully respects robots.txt exclusions.

Regarding completeness, the news dataset is definitely no replacement for
a subscription-based news provider.

Best,
Sebastian


Nikolay Kushin

Jan 3, 2022, 5:01:44 AM

to Common Crawl

Thanks a lot for your answer!
