Get total number of unique URLs?


Ernest Mugo

Jun 15, 2021, 9:16:47 AM
to Common Crawl

Is it possible to get the total number of unique URLs across all crawls since 2008, e.g., using Athena?

All suggestions are welcome!

Sebastian Nagel

Jun 15, 2021, 9:58:42 AM
Hi Ernest,

you can get a close approximation using HyperLogLog cardinality estimates.

- using Athena/Presto I've tried this only for one TLD and one year of data, see

- the cc-crawl-statistics also include HLL sketches. Merging all sketches from 2008 until CC-MAIN-2021-21
results in 57 billion unique URLs, see figure "crawl size cumulative" on

Note: the news data set isn't included in this number but would not change the picture significantly.
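To illustrate why per-crawl HLL sketches can be merged into a cumulative count, here is a minimal sketch in Python. It is a toy HyperLogLog written for this example, not the code or the sketch format used by cc-crawl-statistics; the class name, the precision p=12, the SHA-1 based hash, and the example.com URLs are all assumptions made for illustration. The key point is that merging takes the register-wise maximum, which gives the same registers as a sketch built over the union of the inputs, so URLs seen in several crawls are not counted twice.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustration only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the first 8 bytes of SHA-1 (an assumption;
        # real implementations typically use a faster non-cryptographic hash).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Register-wise maximum: equivalent to sketching the union of both sets.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Two hypothetical "crawls" that share 20,000 URLs; the true union is 100,000.
crawl_a = [f"https://example.com/page/{i}" for i in range(0, 60_000)]
crawl_b = [f"https://example.com/page/{i}" for i in range(40_000, 100_000)]

sketch_a, sketch_b, direct = HyperLogLog(), HyperLogLog(), HyperLogLog()
for url in crawl_a:
    sketch_a.add(url)
    direct.add(url)
for url in crawl_b:
    sketch_b.add(url)
    direct.add(url)

merged = HyperLogLog()
merged.merge(sketch_a)
merged.merge(sketch_b)
estimate = merged.cardinality()
print(f"estimated unique URLs: {estimate:.0f} (true: 100000)")
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), i.e. about 1.6%, which is why the result is an approximation rather than an exact count.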

Let me know if you need more information!


Ernest Mugo

Jun 15, 2021, 12:56:31 PM
to Common Crawl
Thanks. This answers my question.