Extract Domain Names from CommonCrawl


Rémi Lécussan

Jun 20, 2023, 12:37:59 PM
to Common Crawl
Hello, 

I'm new here, and a bit lost to be honest. I'd like to extract domain names from CommonCrawl and sort them like this: 


Only domain names, no subdomains or anything else.
Also, if I extract them from a recent crawl (2023), will I get only domain names created in 2023, or domains of any age? My ultimate goal is to extract recently created domains; I don't know if that is doable. 

Thanks very much for your answers,

Have a nice day,

Rémi

 

Sebastian Nagel

Jun 23, 2023, 3:00:27 AM
to common...@googlegroups.com
Hi Rémi,

there are two options:
- extract the domain names from the columnar index [1]
- use the latest of the webgraphs [2]
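
For the first option, a minimal sketch of a query over the columnar index, written here as a small Python helper that only builds the SQL text. It assumes the index is registered as the table ccindex.ccindex in a SQL engine such as Athena, that the columns crawl, subset and url_host_registered_domain are available, and that CC-MAIN-2023-23 is the crawl of interest; the helper name build_domain_query is made up for illustration.

```python
import re


def build_domain_query(crawl: str, table: str = "ccindex.ccindex") -> str:
    """Return SQL text listing the distinct registered domains of one crawl.

    Assumptions: the table name and the column names must match how the
    columnar index was registered in your own SQL engine.
    """
    # Only accept values shaped like a crawl label, because the value is
    # embedded directly into the query text.
    if not re.fullmatch(r"CC-MAIN-\d{4}-\d{2}", crawl):
        raise ValueError(f"unexpected crawl label: {crawl!r}")
    return (
        "SELECT DISTINCT url_host_registered_domain\n"
        f"FROM {table}\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        "  AND url_host_registered_domain IS NOT NULL\n"
        "ORDER BY url_host_registered_domain"
    )


print(build_domain_query("CC-MAIN-2023-23"))
```

Filtering on crawl and subset keeps the scan limited to one crawl, and selecting the registered-domain column avoids parsing host names by hand.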

Note that the webgraphs also include domains that were not crawled
(excluded by robots.txt, not sampled, etc.) but are known from links.
These domains are not checked for being registered; only the format
of the domain name in the link is validated.
If the crawler visited a page, then the URL, including the domain
name, is automatically verified.
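
For the webgraph option, a rough sketch of reading domain names from a domain-level vertices file. It assumes each line is tab-separated with the domain in the second field, written in reversed notation (com.example for example.com); please check the format notes of the release you download. The function names and the sample lines are made up for illustration.

```python
from typing import Iterable, Iterator


def unreverse(name: str) -> str:
    """Turn a reversed name such as 'com.example' into 'example.com'."""
    return ".".join(reversed(name.split(".")))


def domains_from_vertices(lines: Iterable[str]) -> Iterator[str]:
    """Yield domain names from tab-separated vertex lines.

    Assumption: the second field holds the reversed domain name.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip lines that do not have a usable second field.
        if len(fields) < 2 or not fields[1]:
            continue
        yield unreverse(fields[1])


# Made-up sample lines in the assumed format: id, reversed domain, host count.
sample = ["0\tcom.example\t3\n", "1\torg.commoncrawl\t2\n"]
print(sorted(domains_from_vertices(sample)))
```

For a real, gzip-compressed file, the same function can be fed from gzip.open(path, "rt", encoding="utf-8"), which reads the file line by line without loading it into memory.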

Best,
Sebastian

[1]
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
[2] https://commoncrawl.org/tag/webgraph/