Definition of domain


Amir Shukayev

Apr 19, 2024, 11:38:36 AM
to Common Crawl
https://commoncrawl.github.io/cc-crawl-statistics/plots/domains.html

The domain parsers I have tried on WARC-Target-URI values all behave slightly differently, and some fail to parse certain URIs at all.

What method is used to extract the domains for this table? Is there anything in Java/Scala we can use to replicate it?

Thank you!
Amir

Greg Lindahl

Apr 19, 2024, 12:29:10 PM
to common...@googlegroups.com
Amir,

The "registered domains" mention at the top of that webpage is your
clue -- these are domains you can buy from a registrar. The source of
truth is Mozilla's Public Suffix List.

We use pypi's tldextract in our Python code, and in our Nutch fork the
code is in src/java/org/apache/nutch/util/domain/
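
To make the idea concrete, here is a rough sketch of how Public Suffix
List matching turns a URL into a registered domain. It is not the code
behind the statistics page: the function name registered_domain and the
SAMPLE_RULES set are made up for illustration, the rule set is a tiny
hand-picked subset, and internationalized hostnames are not handled. A
real run would load the full list, or simply use a library such as
tldextract.

```python
import ipaddress
from urllib.parse import urlsplit

# Hand-picked subset of Public Suffix List rules, for illustration only.
# "*.ck" is a wildcard rule and "!www.ck" is an exception to it.
SAMPLE_RULES = {"com", "org", "uk", "co.uk", "*.ck", "!www.ck"}


def registered_domain(url, rules=SAMPLE_RULES):
    """Return the registered domain of `url`, or None if there is none."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return None  # malformed URL, e.g. a broken IPv6 literal
    if not host:
        return None

    # IP addresses have no registered domain.
    try:
        ipaddress.ip_address(host)
        return None
    except ValueError:
        pass

    labels = host.rstrip(".").split(".")
    suffix_len = 1  # implicit default rule "*": the last label is the suffix

    # Try every trailing part of the host, from longest to shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        n = len(labels) - i
        if "!" + candidate in rules:
            # An exception rule always wins; the public suffix is the
            # rule with its leftmost label removed.
            suffix_len = n - 1
            break
        wildcard = "*." + ".".join(labels[i + 1:])
        if candidate in rules or wildcard in rules:
            # Otherwise the matching rule with the most labels wins.
            suffix_len = max(suffix_len, n)

    # The registered domain is the public suffix plus one more label.
    if len(labels) <= suffix_len:
        return None  # the host is itself a public suffix
    return ".".join(labels[-(suffix_len + 1):])


print(registered_domain("https://www.example.com/path"))  # example.com
print(registered_domain("http://news.bbc.co.uk/"))        # bbc.co.uk
print(registered_domain("https://co.uk/"))                # None
```

Parsers tend to disagree on exactly the cases this sketch glosses over:
which version of the list they ship, whether they honor the list's
private section, and how they treat IP addresses, trailing dots and
non-ASCII hostnames.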

Please do report back if this clue helps, I'd like to improve the
documentation on this page and also in our columnar index.

-- greg