New to Common Crawl - Missing domains?


Player One

Aug 23, 2018, 4:14:11 PM
to Common Crawl
Hello! I'm new to the Common Crawl data set. I've seen the vertices.txt file from the Webgraph, which contains about 82 million unique domains, but when I roll up the WARC files by TLD I only get ~35 million TLDs. Two questions:

1.) Why do you only crawl 90 million domains? There should be close to 300 million domains out there. Is this just a staffing/resource issue? If I can get you a list of more would you crawl it?
2.) I'm getting odd results where a domain exists in the vertices.txt list and the domain resolves, but I get no data from the WARC files and there is no entry in the cc-index. Any ideas why this could be happening? Maybe the domain was offline during the last crawl?

Thanks and keep up the great work! I will be contributing as soon as I can!

Sebastian Nagel

Aug 23, 2018, 5:44:14 PM
to common...@googlegroups.com
Hi,

first, let's state the numbers and agree on terminology (but I think we mean the same):

- "domain" is the part of a host name one level below the ICANN registry suffix,
e.g., "example.com" or "example.co.uk" but not "www.example.com", cf. [1].

- about 30 million different domains are crawled every month. This number only includes
successful fetches but not failures and redirects.

- there is a large overlap among the 30 million domains between monthly crawls;
however, the number of unique domains increases if multiple crawls are combined:
for the last year it's 55 million, or 65 million if redirects and 404s
are also counted [3]

- the domain-level webgraphs [2] contain more domains (around 90 million) because they
also include those domains which are linked from the crawled ones but haven't
been visited by the crawler
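
To illustrate the terminology above, here is a minimal Python sketch of rolling
host names up to registered domains before counting them. The SUFFIXES set is a
hypothetical, tiny stand-in for the full public suffix list, and
registered_domain is an illustrative helper, not part of any Common Crawl tooling.

```python
# Hypothetical, tiny stand-in for the public suffix list (assumption);
# a real roll-up would load the complete list instead.
SUFFIXES = {"com", "org", "uk", "co.uk"}


def registered_domain(host):
    """Return the name one level below the longest matching suffix.

    Returns None if the host is itself a suffix or matches no known suffix.
    """
    labels = host.lower().rstrip(".").split(".")
    # Candidates run from the longest possible suffix to the shortest,
    # so "co.uk" is matched before "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in SUFFIXES:
            if i == 0:
                return None  # the host is a bare suffix such as "co.uk"
            return ".".join(labels[i - 1:])
    return None


hosts = [
    "www.example.com",
    "example.com",
    "blog.example.co.uk",
    "www.example.co.uk",
]
# Several host names collapse into one registered domain each.
unique_domains = {registered_domain(h) for h in hosts}
print(sorted(unique_domains))
```

Counting host names or other groupings instead of registered domains gives
different totals, so it helps to apply the same roll-up on both sides of a
comparison.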

There are multiple reasons why a domain is not crawled:
- disallows crawling in the robots.txt
- unreachable (DNS failed, no response)
- not known to the crawler
- excluded as a spam domain (we know of around 350,000)
- a low-ranking domain, not selected for the frontier

Yes, there are limits regarding "staffing/resource": 3 billion pages per month
is the current limit. We also need to make sure that we provide representative
samples in our "snapshot" crawls. That means we should include more pages from
well-known, frequently visited domains, while low-ranking domains are included with
low probability. The idea is to roughly emulate the "random surfer" who will only
occasionally visit a parked domain. My guess would be that around 50% of the
300 million registered domains are parked. In the worst case they host keyword
and link spam, which we try not to include (though we cannot avoid it fully).

Another reason not to blindly crawl parked domains: every host/domain
first requires resolving the DNS and fetching the robots.txt, and doing this
for a single page is a waste of resources.

> Thanks and keep up the great work! I will be contributing as soon as I can!

If you have a list of domain names you are able to share, let us know.
Even if we do not crawl the entire list, it may be a precious resource
to evaluate what is missing.

Thanks,
Sebastian


[1] https://github.com/google/guava/wiki/InternetDomainNameExplained
[2] http://commoncrawl.org/2018/08/webgraphs-may-june-july-2018/
[3] calculated using the columnar index over 11 months,
see http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
SELECT COUNT( DISTINCT url_host_registered_domain ) as count
FROM "ccindex"."ccindex"
WHERE subset = 'warc';
-- omit the WHERE clause to include redirects and 404s

Player One

Aug 23, 2018, 6:38:28 PM
to Common Crawl
Fantastic! Thank you for the quick response! I apologize if I came off as attacking the data, as that was not my intention at all. I'm just curious how things currently work and wondering how I can help contribute resources to your cause. I work for a large web hosting company, so I think there is plenty of opportunity for us to help, and I am very knowledgeable with Spark and Amazon EMR.

I will have my company get in touch with Sara Crouse to see if we can collaborate. I will also check with our legal team to see when I can send over this list of domains I have. 

I will be in touch. Thanks again and looking forward to collaborating with you!