Number of unique Hosts in CommonCrawl

Tom Alby

Feb 3, 2021, 3:03:56 PM
to Common Crawl
Hi again,

I am searching for a list of unique hosts in the CommonCrawl crawls. I have used Athena with this query:

SELECT DISTINCT url_host_name, content_languages
FROM "ccindex"."ccindex"
WHERE subset = 'warc'

This resulted in 379,756,278 hosts. Then I looked at the nodes file for Jul/Aug/Sep (thanks again to Sebastian Nagel for helping), and it has 538,570,861 hosts. My assumption is that the additional hosts were seen in crawls as link targets, but their pages were never crawled. Am I right?

I also downloaded all 300 URL index files of the latest crawl and got a different number again, but I assume, based on what I have read here in this group, that a crawl will not include all known URLs.
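
For readers who want to reproduce a count like this from the URL index files, here is a minimal sketch. It assumes each index line has the layout "<SURT key> <timestamp> <JSON record>" and that the JSON record carries a "url" field; the function name hosts_in_index and the sample lines are made up for illustration.

```python
import json
from urllib.parse import urlsplit

def hosts_in_index(lines):
    """Collect distinct host names from URL index lines.

    Assumed line layout: "<SURT key> <timestamp> <JSON record>".
    Lines that do not match this layout are skipped.
    """
    hosts = set()
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) != 3:
            continue
        try:
            url = json.loads(parts[2])["url"]
            host = urlsplit(url).hostname
        except (ValueError, KeyError, TypeError):
            continue  # malformed JSON, missing "url", or unparsable URL
        if host:
            hosts.add(host)
    return hosts

# Hypothetical sample lines, not taken from a real crawl.
sample = [
    'com,example)/ 20210120000000 {"url": "https://www.example.com/", "status": "200"}',
    'com,example)/about 20210120000001 {"url": "https://www.example.com/about", "status": "200"}',
    'org,example)/ 20210120000002 {"url": "http://example.org/", "status": "200"}',
]
unique_hosts = hosts_in_index(sample)
```

In practice the lines would be streamed from the gzip-compressed index files rather than held in a list, and only the set of hosts would stay in memory.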

Best

Tom

Tom Alby

Feb 3, 2021, 3:05:45 PM
to Common Crawl
Minus content_languages in the query, that is; with that column included, the result would actually be more hosts :)

Sebastian Nagel

Feb 3, 2021, 3:34:51 PM
to common...@googlegroups.com
Hi Tom,

the number of distinct host names in a monthly crawl is currently
around 50k. The webgraph includes all hosts of 3 crawls either
visited by the crawler or seen in outgoing links. As a further subtlety,
the hostnames in the graphs are normalized and cleaned up:
- www. prefix removed
- only host names with a valid TLD suffix are kept (e.g. IP addresses are removed)
while the column url_host_name in the index holds exactly the string
returned by the method java.net.URL.getHost().
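
To illustrate why the two counts differ, here is a rough sketch of the kind of clean-up described above. It is not the actual webgraph code: the tiny KNOWN_TLDS set stands in for a real TLD list, and normalize_host and the sample host names are made up for illustration.

```python
import ipaddress

# Hypothetical, tiny stand-in for a real list of valid TLDs.
KNOWN_TLDS = {"com", "org", "net", "de"}

def normalize_host(host):
    """Return a cleaned-up host name, or None if the host is dropped."""
    host = host.lower().rstrip(".")
    # Drop IP addresses (IPv4, or IPv6 with or without brackets).
    try:
        ipaddress.ip_address(host.strip("[]"))
        return None
    except ValueError:
        pass  # not an IP address, keep going
    # Remove a leading "www." prefix.
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep only names that end in a known TLD and have at least two labels.
    labels = host.split(".")
    if len(labels) < 2 or labels[-1] not in KNOWN_TLDS:
        return None
    return host

# Hosts visited by the crawler vs. hosts only seen in outgoing links.
crawled = {"www.example.com", "example.com", "192.0.2.1"}
linked = {"www.example.org", "blog.example.de", "localhost"}

# Graph nodes: union of both sets, normalized, with dropped hosts removed.
graph_nodes = {h for h in map(normalize_host, crawled | linked) if h}
```

Here the three raw crawled host names collapse to a single node, while the link-only hosts add two more, so the node count and a DISTINCT count over url_host_name are not directly comparable.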

> that a crawl will not include all known URLs.

Yes. No way, the web is too big, we need to sample.

Best,
Sebastian

Tom Alby

Feb 3, 2021, 3:42:37 PM
to Common Crawl
Thanks again, Sebastian. Where could I have RTFM to learn this and not ask these questions? :) I was really searching for this information.