Amir,
The "registered domains" mention at the top of that webpage is your
clue -- these are domains you can buy from a registrar. The source of
truth is Mozilla's Public Suffix List.
We use the tldextract package from PyPI in our Python code, and in our
Nutch fork the code is in src/java/org/apache/nutch/util/domain/.
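
To make the idea concrete, here is a rough sketch of how a registered
domain falls out of the suffix list: find the longest public suffix the
hostname ends with, then keep one more label to its left. This is not
our production code or tldextract's implementation. The function name
and the four-entry suffix set are made up for illustration, and the
sketch ignores the wildcard and exception rules that the real Public
Suffix List contains.

```python
# Hypothetical toy subset of suffix rules; the real Public Suffix List
# has thousands of entries plus wildcard ("*.") and exception ("!") rules.
TOY_SUFFIXES = {"com", "org", "uk", "co.uk"}


def registered_domain(hostname, suffixes=TOY_SUFFIXES):
    """Return the public suffix plus one label to its left, or None."""
    labels = hostname.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest, so that
    # "co.uk" is matched before "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                # The hostname is itself a public suffix, so there is
                # no registered domain.
                return None
            return ".".join(labels[i - 1:])
    # No suffix in the toy set matched.
    return None


print(registered_domain("index.commoncrawl.org"))
print(registered_domain("www.example.co.uk"))
```

With the toy set above, the first call gives commoncrawl.org and the
second gives example.co.uk, since "co.uk" is the longest matching
suffix.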
Please do report back on whether this clue helps. I'd like to improve
the documentation on this page and also in our columnar index.
-- greg