Feature suggestion - active domains


Sam

Oct 24, 2024, 1:02:01 PM
to Common Crawl
Hi,
I was revisiting some of my older posts from 2022 on a particular task (obtaining a list of active domains) and wanted to ask whether something has become available in the meantime. I'll explain in a bit more detail.

We are collecting domains from various sources:
- common crawl
- various top-1-million lists, e.g. Cloudflare Radar (https://radar.cloudflare.com/domains), the Tranco list (https://tranco-list.eu/), Cisco, etc.
- Google CrUx report (15 million domains): https://developer.chrome.com/docs/crux
- other sources

We regularly check whether the domains on our list are active, i.e. resolvable and serving some relevant content (no errors, 404s, etc.). We do this because we categorize these domains for two of our platforms: ecommerce domains for www.productcategorization.com and general domains for www.websitecategorizationapi.com. At the moment we have 31 million active domains and 5 million ecommerce ones.

It is somewhat expensive to check active status regularly, though, and I wondered whether Common Crawl could derive such a list as a byproduct of its crawling.

E.g. suppose it crawls domain1.com/subpage_something. If that page has content, domain1.com can automatically be flagged as active, without checking the root domain itself.
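For one-off spot checks of a single domain (as opposed to a bulk list), this idea can be sketched against Common Crawl's public CDX index API. The sketch below assumes the index.commoncrawl.org endpoint and an October 2024 crawl ID; the parsing step is demonstrated on a hard-coded sample response line so it runs without network access.

```python
import json

def cdx_query_url(domain: str, crawl: str = "CC-MAIN-2024-42") -> str:
    """Build a CDX API URL matching any capture under `domain`.

    The crawl ID is an assumption; pick a real one from the crawl list.
    """
    return (f"https://index.commoncrawl.org/{crawl}-index"
            f"?url={domain}&matchType=domain&output=json&limit=1")

def is_active(cdx_response_text: str) -> bool:
    """Count a domain as 'active' if any capture returned HTTP 200."""
    for line in cdx_response_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("status") == "200":
            return True
    return False

# Sample JSON-lines response (shape modeled on the CDX server's output):
sample = ('{"urlkey": "com,domain1)/subpage_something", "status": "200", '
          '"url": "https://domain1.com/subpage_something"}')
print(is_active(sample))  # True
```

Fetching `cdx_query_url(...)` with an HTTP client and feeding the body to `is_active` would complete the check; one successful capture of any subpage is enough to flag the registered domain.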

Is this perhaps already available somewhere within Common Crawl, or could it easily be added as a feature? It would save us a great deal of time.

Thanks.

Sebastian Nagel

Oct 25, 2024, 8:59:04 AM
to common...@googlegroups.com
Hi Sam,

thanks for sharing your insights!

> It is a bit expensive though to check active status regularly and I
> wondered if common crawls could derive such a list as a byproduct of
> its crawling.

> E.g. let us say that it crawls domain1.com/subpage_something. If it
> has content domain1.com can be automatically flagged as active,
> without checking the root domain itself.

> Is this perhaps already available somewhere within commoncrawl or
> could be done easily as feature. It would save us a great deal of
> time.

Actually, this is already possible with little effort using the columnar
index [1,2], by modifying the example query [3].

- if it's about registered or pay-level domains (one level below the
  public suffix), use "url_host_registered_domain"
- for any host name: "url_host_name"
- the subset "warc" includes only the successful fetches;
  "robotstxt" or "crawldiagnostics" would also include domains which
  have been accessed by the crawler, but were disallowed or responded
  with a redirect, 404, etc.
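These fields combine into a simple aggregation. Since the real query runs on Athena or Spark against the ccindex table [2], the sketch below only demonstrates the query shape using Python's sqlite3 against a fake mini-index with two of the real columns; in practice you would also add a partition predicate such as crawl = 'CC-MAIN-2024-42' (an assumed crawl ID).

```python
import sqlite3

# Toy in-memory table mimicking two columns of the columnar index.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE ccindex (url_host_registered_domain TEXT, subset TEXT)"
)
con.executemany(
    "INSERT INTO ccindex VALUES (?, ?)",
    [
        ("domain1.com", "warc"),             # successful fetches
        ("domain1.com", "warc"),
        ("domain2.com", "crawldiagnostics"), # redirect / 404 only
        ("domain3.com", "robotstxt"),        # disallowed by robots.txt
    ],
)

# Domains with at least one successful fetch, i.e. "active" in the
# sense of the original question.
rows = con.execute(
    """
    SELECT url_host_registered_domain, COUNT(*) AS pages
    FROM ccindex
    WHERE subset = 'warc'
    GROUP BY url_host_registered_domain
    ORDER BY pages DESC
    """
).fetchall()
print(rows)  # [('domain1.com', 2)]
```

Restricting to subset = 'warc' is what makes appearance in the result equivalent to "at least one page fetched successfully"; widening the predicate to the other subsets would also pull in domains that only answered with errors or robots.txt blocks.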

Best,
Sebastian

[1]
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
[2] https://github.com/commoncrawl/cc-index-table
[3]
https://github.com/commoncrawl/cc-index-table/blob/main/src/sql/examples/cc-index/count-domains-of-tld.sql


Sam

Oct 28, 2024, 11:11:32 AM
to Common Crawl
Hi,
thanks for your help, I will check it out. I do have some experience with the columnar index + WARC, as I set up a script in 2022 to find all URLs across the dataset that mentioned "text1" somewhere.
It worked great: by querying for "pain points" we got ideas for new tools.

I wonder whether Common Crawl will gain popularity for a specific reason (I may be wrong): LLM companies probably use it as one of their default data sources, so getting ranked in LLMs (a kind of LLM SEO) would involve checking which domains are popular in Common Crawl for a specific field. Just a thought.

Thanks again for the help.