URL count in CommonCrawl as compared to CommonScreens


Siddartha

Nov 14, 2022, 4:42:39 PM
to Common Screens
I extracted all unique URLs (protocol + netloc) from the most recent CommonCrawl index and see 19,101,716 unique URLs. I did the same for your URL index and see 55,585,805 unique URLs.
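
For reference, a count of this kind can be sketched roughly as follows. This is only a minimal illustration, not the script that was actually used: the function name `unique_origins` and the sample URLs are assumptions, and a real run would stream URLs from the index files instead of a list.

```python
from urllib.parse import urlsplit

def unique_origins(urls):
    """Return the set of 'scheme://netloc' strings seen in an iterable of URLs."""
    origins = set()
    for url in urls:
        parts = urlsplit(url.strip())
        # Skip anything without both a scheme and a netloc (e.g. relative paths).
        if parts.scheme and parts.netloc:
            # Lowercasing the whole netloc is a simplification: it also
            # lowercases any userinfo part, which is rare in crawl data.
            origins.add(f"{parts.scheme.lower()}://{parts.netloc.lower()}")
    return origins

# Hypothetical sample input.
sample = [
    "https://example.com/a",
    "https://example.com/b?x=1",
    "http://example.com/",
    "https://blog.example.com:8443/post",
]
print(len(unique_origins(sample)))  # 3
```

Note that `http://` and `https://` variants of the same host, and different ports, count as separate entries under this definition.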

What is the difference between the two datasets? Or perhaps my methods are not accurate?

Thanks!

Sebastian Nagel

Nov 15, 2022, 4:04:21 AM
to Common Screens
Hi Siddartha,

the number of unique host names successfully crawled in recent main
crawls of the Common Crawl is around 45 million, see the second plot on

If URL protocol/scheme and all of netloc/authority (including the port)
are taken together the number should definitely exceed 45 million.
Which approach did you use to count over the Common Crawl index?
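
To illustrate why a scheme + netloc count cannot be smaller than a host name count, here is a small sketch. The helper `count_keys` and the sample URLs are made up for illustration and are not part of any Common Crawl tooling.

```python
from urllib.parse import urlsplit

def count_keys(urls):
    """Return (unique host names, unique scheme+netloc pairs) for the given URLs."""
    hosts, origins = set(), set()
    for url in urls:
        parts = urlsplit(url)
        if not parts.hostname:
            continue
        # .hostname is already lowercased and excludes the port.
        hosts.add(parts.hostname)
        # Scheme and full netloc (including any port) taken together.
        origins.add((parts.scheme.lower(), parts.netloc.lower()))
    return len(hosts), len(origins)

# Hypothetical sample: two hosts, reached via different schemes and ports.
sample = [
    "http://example.com/",
    "https://example.com/",
    "https://example.com:8443/",
    "https://www.example.com/",
]
print(count_keys(sample))  # (2, 4)
```

Every host contributes at least one scheme + netloc pair, so the second number is always at least as large as the first.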

Best,
Sebastian

Common Screens

Nov 18, 2022, 3:30:09 PM
to Common Screens
Common Screens does not need URLs; it needs domain names to begin the capture process. We have a source of around 70 million domain names and 384 million host names from the Common Crawl web graph.
We first resolve each domain name or host name to an IP address, and attempt a screen capture only if resolution succeeds. Of the 70 million domains, only 52 million were found viable.
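
The resolve-then-capture filter described above could look roughly like the sketch below. It is an assumption-laden illustration, not the actual Common Screens pipeline: the names `resolves`, `viable_hosts`, and `fake_resolver` are hypothetical, and the demo uses a stub resolver so it runs without network access. By default the standard-library `socket.getaddrinfo` would be used.

```python
import socket

def resolves(host, resolver=socket.getaddrinfo):
    """Return True if the host name resolves to at least one address."""
    try:
        return len(resolver(host, None)) > 0
    except (socket.gaierror, UnicodeError):
        # gaierror: lookup failed; UnicodeError: name is not a valid IDNA label.
        return False

def viable_hosts(hosts, resolver=socket.getaddrinfo):
    """Keep only hosts that resolve; only these would be queued for capture."""
    return [h for h in hosts if resolves(h, resolver)]

def fake_resolver(host, port):
    """Stub resolver for the demo (hypothetical data, documentation IP range)."""
    table = {"example.com": "192.0.2.1"}
    if host not in table:
        raise socket.gaierror("name not known")
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", (table[host], 0))]

print(viable_hosts(["example.com", "no-such-host.invalid"], fake_resolver))
# ['example.com']
```

At the scale of tens of millions of names, the lookups would presumably need to be batched or run concurrently; this sketch resolves one name at a time.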

Common Screens

Nov 18, 2022, 3:45:18 PM
to Common Screens

Besides the web graph from Common Crawl, we use the front-page.com search engine to identify new domains: approximately 10 million, which may be over and above the Common Crawl list of domains. The first domain-only capture is complete, and we are progressing with captures of the top 100 million hosts (basically subdomains) based on rank.