[CC-NEWS] Time of Crawling vs Time of Posting


Shuheng Liu

Sep 18, 2021, 11:14:39 PM
to Common Crawl
(I sent a similar post an hour ago but I cannot seem to find it now...)

Hi everyone

I am new to Common Crawl and I would like to use CC-NEWS for a project.

However, I would like to know whether the time of crawling corresponds to the time the news article was posted. Specifically, if a news article was crawled in, say, September 2021, was it also posted in that month? If not, how far back can its publication date be?

Thank you!
Shuheng Liu

Stephane Coulondre

Sep 19, 2021, 4:39:03 AM
to Common Crawl
Hi Shuheng,
The date of the crawl is usually mentioned in the crawl release announcement; the crawl is usually performed within the preceding 30 days.
Separately, although this is not your original question, it is important for everyone to know that Common Crawl and CC-NEWS are not intended to be complete crawls, but rather to provide a representative sample.
Therefore, there is no guarantee that a specific news article will ever be crawled (more details in this research paper: https://rodgerbenham.github.io/mbptcm20-cikm.pdf).
Best regards
Stephane

Sebastian Nagel

Sep 19, 2021, 12:48:42 PM
to common...@googlegroups.com
Hi Shuheng,

in addition to Stephane's comment...

The crawler that builds the CC-NEWS collection relies on news feeds and news sitemaps
and skips over news items with a publication date older than 30 days. However, this
requires that the feed/sitemap indicates the publication dates and that the dates
are correct. Otherwise, outdated news or pages may slip into the collection.

Because millions of feeds and sitemaps are followed, the crawler cannot revisit
every feed/sitemap even daily. Instead, it adapts to the change frequency of a feed/sitemap
and revisits a frequently changing feed/sitemap every 90 minutes. If there are no changes,
the interval may grow to 90 days, which also avoids spending significant resources
on stale feeds/sitemaps.
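
As a rough illustration only (this is not the actual crawler code), the two rules
described above might be sketched in Python as follows. The function names, the
handling of missing dates, and the halve/double interval adjustment are assumptions
made for this sketch; only the 30-day cutoff and the 90-minute and 90-day bounds
come from the description above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(days=30)          # cutoff for publication dates
MIN_INTERVAL = timedelta(minutes=90)  # shortest revisit interval
MAX_INTERVAL = timedelta(days=90)     # longest revisit interval


def is_fresh(published: Optional[datetime], now: datetime) -> bool:
    """Return True if an item should be kept (hypothetical helper).

    Assumption: an item without a publication date cannot be checked,
    so it is kept. This is how outdated pages may slip in.
    """
    if published is None:
        return True
    return now - published <= MAX_AGE


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Return the next revisit interval for a feed/sitemap (hypothetical helper).

    Assumption: halve the interval when the feed changed, double it
    otherwise, then clamp to [MIN_INTERVAL, MAX_INTERVAL].
    """
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


if __name__ == "__main__":
    now = datetime(2021, 9, 19, tzinfo=timezone.utc)
    print(is_fresh(datetime(2021, 9, 1, tzinfo=timezone.utc), now))
    print(is_fresh(datetime(2021, 7, 1, tzinfo=timezone.utc), now))
    print(next_interval(timedelta(hours=6), changed=False))
```

Under these assumptions, a feed that keeps changing stays at the 90-minute floor,
and a feed that never changes drifts up to the 90-day ceiling.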

Best,
Sebastian

Shuheng Liu

Sep 19, 2021, 11:50:32 PM
to Common Crawl
Hi Stephane and Sebastian

Thank you so much for your replies! They are extremely helpful!

Sincerely
Shuheng
