Host- and Domain-Level Web Graphs Oct/Nov/Jan 2020-2021


Sebastian Nagel

Feb 10, 2021, 9:46:59 AM
to Common Crawl
Hi all,

we're pleased to announce the 15th release of our webgraph data set
built from the latest three monthly crawl archives (October,
November/December 2020 and January 2021). More details and links to
download the data set can be found on our blog [1].

Best,
Sebastian

[1] https://commoncrawl.org/2021/02/host-and-domain-level-web-graphs-oct-nov-jan-2020-2021/

Tom Alby

Feb 28, 2021, 2:40:58 PM
to Common Crawl
Hi Sebastian,

thank you for the announcement. I was wondering why there are fewer hosts in the nodes files than in the previous release. I found 490.193.249 hosts compared to 538.570.861. I understand that not all known URLs are crawled in every crawl, so the URLs crawled could have linked to fewer hosts. Is that a correct interpretation?

Best

Tom

Sebastian Nagel

Feb 28, 2021, 4:40:08 PM
to common...@googlegroups.com
Hi Tom,

> I understand that not all known URLs are crawled in every crawl

Yes, every monthly crawl includes just a sample of the web. And we're not even able to crawl all URLs
found by the crawler.

> so the URLs crawled could have linked to fewer hosts

Well, the difference in the number of hosts is only about 9%, and it's difficult to state what the exact
reasons are. Just some remarks:

- the two combinations of three monthly crawls used to build the graphs are comparable in size,
but the combined Oct, Nov/Dec, Jan crawls are slightly smaller than the previous ones.
See the numbers below.

- the number of dangling nodes went down. Yes, this could be explained by "linked to fewer hosts".

- a graph with fewer hosts isn't necessarily a bad thing: in 2017 the crawler hit link spam,
which inflated the host-level graph to 5 billion nodes, see [1].
Since then, the crawler has been configured not to visit *known* link farms and sites which
create lots of subdomains purely for SEO purposes.
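
To make the term concrete: a dangling node is a node without outgoing edges, for instance a host that was linked to but whose pages were never fetched. Here is a minimal sketch of how such nodes could be counted, assuming the graph is available as a set of node ids plus a list of (from_id, to_id) edge pairs. The function name and the toy data are hypothetical and only serve as illustration:

```python
def count_dangling(node_ids, edges):
    """Count nodes that have no outgoing edge (out-degree 0)."""
    # Every node that appears as the source of an edge has an outlink.
    has_outlink = {src for src, _dst in edges}
    return sum(1 for node in node_ids if node not in has_outlink)


# Toy graph (hypothetical): nodes 2 and 3 never appear as an edge source.
nodes = {0, 1, 2, 3}
edges = [(0, 1), (0, 2), (1, 2)]

dangling = count_dangling(nodes, edges)
print(f"dangling: {dangling}, not dangling: {len(nodes) - dangling}")
```

In this toy graph, nodes 2 and 3 are dangling: node 2 is only linked to, and node 3 has no edges at all.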


The numbers in millions:

                    Jul/Aug/Sep   Oct/Nov/Jan

(all captures, incl. 404s, redirects, robots.txt)
  pages                11,214        10,668
  uniq. URLs            9,754         9,253
  hosts                    99           100
  domains                  53            51

(successfully fetched)
  pages                 9,068         8,775
  uniq. URLs            7,892         7,720

(webgraph)
  hosts                   539           490
    dangling              467           414
    not dangling           72            76
  domains                  89            86
    dangling               45            43
    not dangling           44            43
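
The relative changes between the two releases can be recomputed from the table with a few lines of Python. This is only a sketch: the helper name `rel_change` and the dictionary labels are made up for illustration, and the values are the rounded numbers (in millions) from the table above.

```python
def rel_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# (Jul/Aug/Sep, Oct/Nov/Jan) in millions, copied from the table above
counts = {
    "pages (all captures)": (11214, 10668),
    "pages (fetched)": (9068, 8775),
    "webgraph hosts": (539, 490),
    "  dangling": (467, 414),
    "  not dangling": (72, 76),
}

for name, (old, new) in counts.items():
    print(f"{name:<22} {rel_change(old, new):+.1f}%")
```

With these rounded figures, the host-level graph shrinks by roughly 9% and the dangling nodes by roughly 11%, while the number of non-dangling hosts grows by a few percent.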


Notes:
- for graph node numbers see the webgraph *.stats files
- numbers about successful fetches are provided in the crawl stats [2]
- counts over all captures are done using the columnar index [3] and the query
    SELECT COUNT(*) AS n_page_captures,
           cardinality(approx_set(url)) AS uniq_urls_estim,
           COUNT(DISTINCT url_host_name) AS uniq_hosts,
           COUNT(DISTINCT url_host_registered_domain) AS uniq_domains
    FROM "ccindex"."ccindex"
    WHERE (crawl = 'CC-MAIN-2020-45'
        OR crawl = 'CC-MAIN-2020-50'

Best,
Sebastian


[1] https://commoncrawl.org/2017/11/host-and-domain-level-web-graphs-augseptoct-2017/
[2] https://commoncrawl.github.io/cc-crawl-statistics/
[3] https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/