How frequently is Commoncrawl data updated, and what is its coverage level?


Don Boscow

May 22, 2023, 12:54:32 PM
to Common Crawl

How often is Common Crawl updated? Daily, weekly, or monthly? If Meghan Markle wears a Versace gown, that becomes a BBC article, and that article shows up when Googling "meghan markle" 2-3 minutes after the BBC publishes it. What is the equivalent time for CC?

And secondly, is there a place where I can see CC's coverage level? I mean: which websites it covers fully, which ones it covers partially, whether it covers reuters.com at all, how much of vice.com it covers, etc.?

Hynek Kydlíček

May 22, 2023, 6:40:35 PM
to Common Crawl
Hi, some statistics are available at https://commoncrawl.github.io/cc-crawl-statistics/plots/domains. The crawls are usually released roughly every 2-3 months.

On Monday, May 22, 2023 at 18:54:32 UTC+2, yourdo...@gmail.com wrote:

Sebastian Nagel

May 25, 2023, 12:14:02 PM
to common...@googlegroups.com
Hi Hynek, hi Don Boscow,

yes, currently crawls are released about every two months.

> If Meghan Markle wears a Versace gown, that becomes
> a BBC article, and that article shows up on Googling "meghan markle"
> 2-3 minutes after the publishing of the article by BBC. What is the
> equivalent time for CC?

Common Crawl crawls are sample snapshots of the web. There's never a
guarantee that any given page or URL is included. The likelihood that
a page is included increases if
- the domain has a high harmonic centrality rank and is allowed to
contribute more pages to the dataset
- the link to a page is shared often in the public web or is even
provided on a sitemap. That way it's easier for the crawler to
find the link and more likely that it is followed.

The time from link discovery until the release of a crawl dataset is
at least a few weeks.

> And secondly, is there a place where I can see CC coverage level?

I'm not aware of any such report. The most difficult part is to estimate
the size of the web as such. There's a huge variance (see for example
https://www.worldwidewebsize.com/).
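
For a single site, one rough check is to ask the public URL index
(https://index.commoncrawl.org/) which captures of a domain one crawl
contains. The following is only a minimal sketch: the helper names are
made up for illustration, CC-MAIN-2023-14 is just an example crawl ID,
and the result says what was captured in that crawl, not what share of
the site it represents.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public CDX-style index server for Common Crawl.
INDEX_SERVER = "https://index.commoncrawl.org"


def build_query(crawl_id, domain, limit=None):
    """Build an index query URL for all captures under a domain.

    crawl_id is an example value such as "CC-MAIN-2023-14"; the list of
    available crawls is published at INDEX_SERVER + "/collinfo.json".
    """
    params = {"url": f"{domain}/*", "output": "json"}
    if limit is not None:
        params["limit"] = limit
    return f"{INDEX_SERVER}/{crawl_id}-index?{urlencode(params)}"


def parse_records(body):
    """Parse the response body: one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_captures(crawl_id, domain, limit=5):
    """Fetch up to `limit` capture records (needs network access)."""
    with urlopen(build_query(crawl_id, domain, limit)) as resp:
        return parse_records(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Illustrative call only; the output depends on the crawl queried.
    for record in fetch_captures("CC-MAIN-2023-14", "reuters.com"):
        print(record.get("timestamp"), record.get("url"))
```

Repeating the query for several crawl IDs gives a rough picture of how
a site's presence changes from one snapshot to the next.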

Best,
Sebastian