Hi Tom,
> I understand that not all known URLs are crawled in every crawl
Yes, every monthly crawl includes just a sample of the web. And we're not even able to crawl all URLs
found by the crawler.
> so the URLs crawled could have linked to less hosts
Well, the difference in the number of hosts is only about 10%, and it's difficult to say what the
exact reasons are. Just some remarks:
- the combinations of three monthly crawls used to build the two graphs are comparable in size,
  but the combined Oct, Nov/Dec, Jan crawls are slightly smaller than the previous ones.
  See the numbers below.
- the number of dangling nodes went down. Yes, this could be explained by "linked to less hosts".
- a graph with fewer hosts isn't necessarily a bad thing: in 2017 the crawler hit link spam
  which inflated the host-level graph to 5 billion nodes, see [1].
  Since then the crawler has been configured not to visit *known* link farms and sites which
  create lots of subdomains purely for SEO purposes.
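
To make the term concrete: a dangling node is a node without outgoing links, for example a host that is linked to but was not crawled itself. A minimal Python sketch, assuming a made-up edge list (the host names and the helper `split_dangling` are hypothetical, not taken from the webgraph files):

```python
def split_dangling(edges):
    """Return (dangling, not_dangling) node sets for a directed edge list."""
    nodes = set()
    has_outlink = set()
    for source, target in edges:
        nodes.update((source, target))
        has_outlink.add(source)
    # Nodes that never appear as a link source are dangling.
    return nodes - has_outlink, has_outlink

# Hypothetical host-level links (source host -> target host).
example_edges = [
    ("a.example", "b.example"),
    ("a.example", "c.example"),
    ("b.example", "c.example"),
]
dangling, not_dangling = split_dangling(example_edges)
print(sorted(dangling), sorted(not_dangling))
```

Here only `c.example` is dangling: it receives links but has none of its own.
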
The numbers in millions:

                    Jul/Aug/Sep   Oct/Nov/Jan
(all captures, incl. 404s, redirects, robots.txt)
  pages                  11,214        10,668
  uniq. urls              9,754         9,253
  hosts                      99           100
  domains                    53            51
(successfully fetched)
  pages                   9,068         8,775
  uniq. urls              7,892         7,720
(webgraph)
  hosts                     539           490
    dangling                467           414
    not dangl.               72            76
  domains                    89            86
    dangling                 45            43
    not dangl.               44            43

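
The relative changes in the webgraph rows can be worked out with a few lines of Python. This is only a sketch over the numbers listed above (in millions); the dictionary keys and the helper `relative_change` are names chosen for illustration:

```python
# Webgraph node counts in millions: (Jul/Aug/Sep, Oct/Nov/Jan).
counts = {
    "hosts": (539, 490),
    "hosts dangling": (467, 414),
    "hosts not dangling": (72, 76),
    "domains": (89, 86),
    "domains dangling": (45, 43),
    "domains not dangling": (44, 43),
}

def relative_change(old, new):
    """Change from old to new as a percentage of old."""
    return 100.0 * (new - old) / old

changes = {name: relative_change(old, new) for name, (old, new) in counts.items()}
for name, pct in changes.items():
    print(f"{name:22s} {pct:+6.1f}%")
```

With these inputs the host-level graph shrinks by roughly 9%, driven by the dangling hosts (about 11% fewer), while the number of non-dangling hosts grows by about 5.6%.
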
Notes:
- for graph node numbers see the webgraph *.stats files
- numbers about successful fetches are provided in the crawl stats [2]
- counts over all captures are done using the columnar index [3] and the query:
    SELECT COUNT(*) AS n_page_captures,
           cardinality(approx_set(url)) AS uniq_urls_estim,
           COUNT(DISTINCT url_host_name) AS uniq_hosts,
           COUNT(DISTINCT url_host_registered_domain) AS uniq_domains
    FROM "ccindex"."ccindex"
    WHERE (crawl = 'CC-MAIN-2020-45'
        OR crawl = 'CC-MAIN-2020-50'
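
A query of this shape can also be generated for any list of crawls. A minimal Python sketch: the helper name `build_count_query` is hypothetical, it uses `IN (...)` in place of the chain of `OR` conditions, and the crawl list (in particular `CC-MAIN-2021-04` as the January crawl) is an assumption:

```python
def build_count_query(crawls, table='"ccindex"."ccindex"'):
    """Build the capture-count query for the given crawl IDs."""
    if not crawls:
        raise ValueError("at least one crawl ID is required")
    crawl_list = ", ".join("'{}'".format(c) for c in crawls)
    return (
        "SELECT COUNT(*) AS n_page_captures,\n"
        "       cardinality(approx_set(url)) AS uniq_urls_estim,\n"
        "       COUNT(DISTINCT url_host_name) AS uniq_hosts,\n"
        "       COUNT(DISTINCT url_host_registered_domain) AS uniq_domains\n"
        "FROM {}\n"
        "WHERE crawl IN ({})".format(table, crawl_list)
    )

# Assumed crawl IDs for the Oct, Nov/Dec and Jan crawls.
query = build_count_query(["CC-MAIN-2020-45", "CC-MAIN-2020-50", "CC-MAIN-2021-04"])
print(query)
```

The printed statement can be pasted into the SQL engine used with the columnar index.
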
Best,
Sebastian
[1] https://commoncrawl.org/2017/11/host-and-domain-level-web-graphs-augseptoct-2017/
[2] https://commoncrawl.github.io/cc-crawl-statistics/
[3] https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/