Hi,
first, let's state the numbers and agree on terminology (but I think we mean the same):
- "domain" is the part of a host name one level below the ICANN registry suffix,
e.g., "
example.com" or "
example.co.uk" but not "
www.example.com", cf. [1].
- about 30 million distinct domains are crawled every month. This number includes
only successful fetches, not failures or redirects.
- there is a large overlap among the 30 million domains between monthly crawls;
however, the number of unique domains increases if multiple crawls are combined:
for the last year it's 55 million, or 65 million if redirects and 404s
are also counted [3]
- the domain-level webgraphs [2] contain more domains (around 90 million) because they
also include those domains which are linked from the crawled ones but haven't
been visited by the crawler
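To make the "registered domain" notion above concrete, here is a minimal Python sketch. The tiny hardcoded SUFFIXES set is just for illustration; real code should consult the full public suffix list, e.g. via Guava's InternetDomainName [1] or the tldextract library.

```python
# Minimal sketch of registered-domain extraction. The tiny SUFFIXES set
# stands in for the full public suffix list -- use tldextract or Guava's
# InternetDomainName in real code.
SUFFIXES = {"com", "org", "co.uk"}

def registered_domain(host):
    """Return the registry suffix plus one label, or None if no suffix matches."""
    labels = host.lower().rstrip(".").split(".")
    # try the longest candidate suffix first: the tail shrinks as i grows
    for i in range(len(labels)):
        if ".".join(labels[i:]) in SUFFIXES and i > 0:
            return ".".join(labels[i - 1:])
    return None
```

With this, "www.example.com" maps to "example.com" while "example.co.uk" stays as-is, matching the definition above.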
There are multiple reasons why a domain is not crawled:
- crawling is disallowed by its robots.txt
- it is unreachable (DNS resolution failed, no response)
- it is not known to the crawler
- it is excluded as a spam domain (we know around 350,000 of those)
- it is a low-ranking domain, not selected for the frontier
Yes, there are limits regarding "staffing/resources": 3 billion pages per month
is the current limit. We also need to make sure that we provide representative
samples in our "snapshot" crawls. That means we should include more pages from
well-known, frequently visited domains, while low-ranking domains are included
with low probability. This somewhat emulates the "random surfer", who will only
occasionally visit a parked domain. My guess would be that around 50% of the
300 million registered domains are parked. In the worst case they host keyword
and link spam, which we try not to include (we cannot avoid it fully anyway).
Another reason not to blindly crawl parked domains: every host/domain first
requires a DNS lookup and a fetch of its robots.txt; doing this for a single
page is a waste of resources.
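To illustrate the rank-based selection idea (this is my own sketch, not the actual crawler code, and the scoring formula is made up): domains could be admitted to the frontier with a probability that decays with rank, so the long tail of parked and low-ranking domains is sampled only occasionally.

```python
import random

def inclusion_probability(rank, total_domains):
    # Hypothetical scoring, for illustration only: top-ranked domains
    # get probability ~1.0, the long tail progressively less.
    return min(1.0, (total_domains / rank) ** 0.5 / 100)

def sample_frontier(ranked_domains):
    """Keep each domain with a rank-dependent probability (rank 1 = best)."""
    total = len(ranked_domains)
    return [d for rank, d in enumerate(ranked_domains, start=1)
            if random.random() < inclusion_probability(rank, total)]
```

The exact decay would of course be tuned against the crawl budget; the point is only that selection probability falls off with rank.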
> Thanks and keep up the great work! I will be contributing as soon as I can!
If you have a list of domain names you are able to share, let us know.
Even if we do not crawl the entire list, it may be a precious resource
to evaluate what is missing.
Thanks,
Sebastian
[1] https://github.com/google/guava/wiki/InternetDomainNameExplained
[2] http://commoncrawl.org/2018/08/webgraphs-may-june-july-2018/
[3] calculated using the columnar index over 11 months, see
http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
SELECT COUNT( DISTINCT url_host_registered_domain ) as count
FROM "ccindex"."ccindex"
WHERE subset = 'warc';
-- omit the WHERE clause to also include redirects and 404s