Extracting subdomains from all URLs

213 views
Skip to first unread message

Gianni Perez

unread,
Jul 26, 2021, 5:45:36 PM7/26/21
to Common Crawl

Hi all,

Is there an intuitive and resource-friendly way to extract all subdomains from all crawled URLs so that I can end up with a list similar to this:


Rgds,
G

Sebastian Nagel

unread,
Jul 27, 2021, 5:42:37 AM7/27/21
to common...@googlegroups.com
Hi Gianni,

using the columnar index [1] the list of subdomains is just one query away:

SELECT COUNT(*) AS pages,
COUNT(DISTINCT url_host_registered_domain) AS domains,
substr(url_host_name, 1,
(length(url_host_name) - length(url_host_registered_domain))) AS subdomain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2021-25'
GROUP BY substr(url_host_name, 1,
(length(url_host_name) - length(url_host_registered_domain)))
ORDER BY pages DESC;

Note:
- that's for the June crawl only: crawl = 'CC-MAIN-2021-25'
- cf. Presto's string functions [2]

Here the top-10 results:

pages domains subdomain
1208961605 19898627 www.
1090634484 26585731
16497523 32654 forum.
10375685 152410 m.
10232964 175212 blog.
9708356 97375 shop.
6730975 5648 forums.
4623635 12810 wiki.
3742254 7528 community.


Alternatively, the host-level web graphs [3] could be used:
+ includes also domain names seen as links but not crawled
-/+ no page counts, page rank and harmonic centrality scores instead
- would need to write a program to extract the subdomain prefixes
from the host name and do the aggregation


Finally:
- maybe ask the authors of [4] how they created their
subdomain lists?
- the approach of [5] is acceptable for a single domain with not
too many pages but will definitely not scale to your use case
"extract all subdomains"


Best,
Sebastian


[1] https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
[2] https://prestodb.io/docs/current/functions/string.html
[3] https://commoncrawl.org/2021/05/host-and-domain-level-web-graphs-feb-apr-may-2021/
[4] https://github.com/carlbordum/common-crawl-subdomains
[5] https://github.com/lgandx/CCrawlDNS/
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com
> <mailto:common-crawl...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/common-crawl/89c8900b-01d9-4dda-b79e-7585e596205bn%40googlegroups.com
> <https://groups.google.com/d/msgid/common-crawl/89c8900b-01d9-4dda-b79e-7585e596205bn%40googlegroups.com?utm_medium=email&utm_source=footer>.

Gianni Perez

unread,
Jul 29, 2021, 3:27:10 PM7/29/21
to Common Crawl
Thanks!!
Reply all
Reply to author
Forward
0 new messages