Hi Sam,
thanks for sharing your insights!
> It is a bit expensive though to check active status regularly and I
> wondered if common crawls could derive such a list as a byproduct of
> its crawling. E.g. let us say that it crawls
> domain1.com/subpage_something. If it has content,
> domain1.com can be automatically flagged as active,
> without checking the root domain itself.
> Is this perhaps already available somewhere within commoncrawl or
> could be done easily as feature. It would save us a great deal of
> time.
Actually, this is already possible with little effort using the columnar
index [1,2] by modifying the example query [3]:
- if it's about registered or pay-level domains (one level below the
  public suffix), use "url_host_registered_domain"
- for any host name, use "url_host_name"
- the subset "warc" includes only the successful fetches;
  "robotstxt" or "crawldiagnostics" would also include domains which
  were accessed by the crawler but were disallowed or responded
  with a redirect, a 404, etc.
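For illustration, the modified query could be assembled like this (a
minimal Python sketch; the table name "ccindex.ccindex" and the crawl
id are assumptions following the cc-index-table setup and should be
adapted to your own Athena configuration):

```python
# Sketch only: build the Athena/Presto SQL for listing "active"
# registered domains from the Common Crawl columnar index.
# Assumptions: the table is registered as "ccindex.ccindex" (as in the
# cc-index-table instructions) and the crawl id is just an example.

def active_domains_query(crawl: str,
                         column: str = "url_host_registered_domain") -> str:
    """Return SQL selecting distinct domains that had at least one
    successful fetch (subset = 'warc') in the given monthly crawl."""
    return (
        f"SELECT DISTINCT {column}\n"
        "FROM ccindex.ccindex\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'"
    )

# Pass column="url_host_name" to get host names instead of
# registered domains.
print(active_domains_query("CC-MAIN-2024-42"))
```

Restricting on the "crawl" and "subset" partition columns keeps the
amount of data scanned (and the cost) low.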
Best,
Sebastian
[1]
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
[2]
https://github.com/commoncrawl/cc-index-table
[3]
https://github.com/commoncrawl/cc-index-table/blob/main/src/sql/examples/cc-index/count-domains-of-tld.sql
On 10/24/24 13:27, Sam wrote:
> Hi
> I was revisiting some of my older posts from 2022 on a particular task
> (obtaining list of active domains) and wanted to ask if there is
> something available in the meantime. Will explain in a bit more detail.
>
> We are collecting domains from various sources:
> - common crawl
> - various top 1 million lists like Cloudflare radar
>   (https://radar.cloudflare.com/domains), the Tranco list
>   (https://tranco-list.eu/), Cisco, etc.
> - Google CrUx report (15 million domains):
>   https://developer.chrome.com/docs/crux
> - other sources
>
> We regularly check the domains from our list if they are active, i.e.
> resolvable and having some relevant content (no errors, 404, etc.). We
> do this because we categorized these domains for two of our platforms
> (ecommerce ones for www.productcategorization.com and general domains
> for www.websitecategorizationapi.com). At the moment we have 31 million