Get total number of unique URLs?


Ernest Mugo

Jun 15, 2021, 9:16:47 AM
to Common Crawl
Hi!

Is it possible to get the total number of unique URLs across all crawls since 2008, e.g., using Athena?

All suggestions are welcome!

Sebastian Nagel

Jun 15, 2021, 9:58:42 AM
to common...@googlegroups.com
Hi Ernest,

You can get a close approximation using HyperLogLog (HLL) cardinality estimates.

- Using Athena/Presto, I've only tried this for one TLD and one year of data; see
https://github.com/commoncrawl/cc-notebooks/blob/master/cc-index-table/cc-main-2013-2019-metrics.ipynb

- The cc-crawl-statistics project also includes HLL sketches. Merging all sketches from 2008 through CC-MAIN-2021-21
results in an estimated 57 billion unique URLs; see the figure "crawl size cumulative" at
https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize

Note: the news data set isn't included in this number, but adding it would not change the picture significantly.
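
To illustrate why merging per-crawl sketches gives a union count rather than a sum, here is a minimal HyperLogLog sketch in Python. It is only a toy under stated assumptions: the class name, the SHA-1 based hash, the precision p=12, and the example.com URLs are all hypothetical, and it does not reproduce the sketch format or code actually used by cc-crawl-statistics.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog (illustrative only, not the cc-crawl-statistics code)."""

    def __init__(self, p=12):
        self.p = p                      # precision: 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash (an assumption of this sketch).
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # The union of two sets is represented by the register-wise maximum.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)            # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)          # small-range (linear counting) correction
        return raw


# Two hypothetical "crawls" that share 10,000 of their 30,000 URLs each.
crawl_a = HyperLogLog()
crawl_b = HyperLogLog()
for i in range(0, 30000):
    crawl_a.add(f"https://example.com/page/{i}")
for i in range(20000, 50000):
    crawl_b.add(f"https://example.com/page/{i}")

merged = crawl_a.merge(crawl_b)
print(round(crawl_a.estimate()), round(crawl_b.estimate()), round(merged.estimate()))
```

The merged estimate should land near the 50,000 distinct URLs in the union rather than near 60,000, because URLs seen in both crawls hit the same registers. With 4,096 registers the expected relative error is roughly 1.6%.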

Let me know if you need more information!

Best,
Sebastian

Ernest Mugo

Jun 15, 2021, 12:56:31 PM
to Common Crawl
Thanks. This answers my question.