Hi Gianni,
Using the columnar index [1], the list of subdomains is just one query away:
SELECT COUNT(*) AS pages,
       COUNT(DISTINCT url_host_registered_domain) AS domains,
       substr(url_host_name, 1,
              (length(url_host_name) - length(url_host_registered_domain))) AS subdomain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2021-25'
GROUP BY substr(url_host_name, 1,
                (length(url_host_name) - length(url_host_registered_domain)))
ORDER BY pages DESC;
Notes:
- the query covers the June 2021 crawl only: crawl = 'CC-MAIN-2021-25'
- for substr and length, cf. Presto's string functions [2]
Here are the top results (an empty subdomain means the host name equals the registered domain):
     pages   domains  subdomain
1208961605  19898627  www.
1090634484  26585731  (empty)
  16497523     32654  forum.
  10375685    152410  m.
  10232964    175212  blog.
   9708356     97375  shop.
   6730975      5648  forums.
   4623635     12810  wiki.
   3742254      7528  community.
Alternatively, the host-level web graphs [3] could be used:
+ they also include domain names seen as links but not crawled
-/+ no page counts, but page rank and harmonic centrality scores instead
- you would need to write a program that extracts the subdomain prefixes
  from the host names and does the aggregation
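
A minimal sketch of such a program in Python, in case it helps. It rests on several assumptions that are not checked here: that each line of the vertices file has the form "<id><TAB><reversed host name>" (e.g. "com.example.www"), and that the registered domain can be found with a suffix list. PUBLIC_SUFFIXES below is a tiny hypothetical stand-in; a real run would need the full Public Suffix List. All function names are made up for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny suffix set for illustration only;
# a real run would need the complete Public Suffix List.
PUBLIC_SUFFIXES = {"com", "org", "net", "co.uk"}


def unreverse(host_rev):
    """Turn 'com.example.www' into 'www.example.com'."""
    return ".".join(reversed(host_rev.split(".")))


def registered_domain(host):
    """Return public suffix plus one label, or None if none can be found."""
    labels = host.split(".")
    # Start with the longest candidate so that 'co.uk' wins over 'uk'.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


def subdomain_prefix(host, domain):
    """Same idea as the substr() expression: host name minus registered domain."""
    return host[: len(host) - len(domain)]


def count_prefixes(vertex_lines):
    """Map each prefix to (number of hosts, number of distinct domains)."""
    hosts = Counter()
    domains = {}
    for line in vertex_lines:
        _vertex_id, host_rev = line.rstrip("\n").split("\t")
        host = unreverse(host_rev)
        domain = registered_domain(host)
        if domain is None:
            continue  # skip hosts without a recognizable registered domain
        prefix = subdomain_prefix(host, domain)
        hosts[prefix] += 1
        domains.setdefault(prefix, set()).add(domain)
    return {prefix: (hosts[prefix], len(domains[prefix])) for prefix in hosts}
```

Note that this counts hosts, not pages, since the web graph has no page counts. For the real files you would stream the lines from the (compressed) vertices file instead of holding them in a list.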
Finally:
- maybe ask the authors of [4] how they created their subdomain lists?
- the approach of [5] is acceptable for a single domain with not too many
  pages, but it will definitely not scale to your use case of extracting
  all subdomains
Best,
Sebastian
[1] https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
[2] https://prestodb.io/docs/current/functions/string.html
[3] https://commoncrawl.org/2021/05/host-and-domain-level-web-graphs-feb-apr-may-2021/
[4] https://github.com/carlbordum/common-crawl-subdomains
[5] https://github.com/lgandx/CCrawlDNS/