Retrieving all URLs and endpoints associated with a specific domain.

Vansh Devgan

Jun 27, 2025, 10:59:15 AM
to Common Crawl

Hello Team,

I wanted to inquire whether it is feasible to programmatically extract all endpoints, URLs, and subdomains associated with a specific root domain. Additionally, is it possible to directly enumerate all endpoints or URLs that contain a particular subdomain, such as api.example.com? If so, could you please advise on the most efficient and optimized approaches or tools to achieve this?

Thank you for your guidance.

Best regards,
Vansh Devgan

Sebastian Nagel

Jun 27, 2025, 11:22:40 AM
to common...@googlegroups.com
Hi,

Have a look at the columnar URL index:

https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format

With SQL on that index it's easy to filter by host name (a subdomain
such as api.example.com) and to group by pay-level domain (the
"registered domain", e.g. example.com).
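As a rough sketch of what such queries could look like, the Python snippet below only builds the SQL strings; running them needs a query engine (for example Amazon Athena or Spark) set up as described in the linked post. The table name `ccindex`, the columns `url`, `url_host_name` and `url_host_registered_domain`, and the partition columns `crawl` and `subset` are assumptions based on the published index schema and should be checked against the current one. The helper names and the crawl label are illustrative only.

```python
import re

# Assumed schema: table `ccindex` with columns `url`, `url_host_name`,
# `url_host_registered_domain` and partition columns `crawl`, `subset`.
TABLE = "ccindex"

HOST_RE = re.compile(r"[a-z0-9]([a-z0-9.-]*[a-z0-9])?")
CRAWL_RE = re.compile(r"CC-MAIN-\d{4}-\d{2}")


def _checked(name, crawl):
    """Lowercase the host or domain and reject values unsafe to inline in SQL."""
    name = name.lower()
    if not HOST_RE.fullmatch(name):
        raise ValueError(f"unexpected host or domain: {name!r}")
    if not CRAWL_RE.fullmatch(crawl):
        raise ValueError(f"unexpected crawl label: {crawl!r}")
    return name


def urls_for_host(host, crawl):
    """SQL listing the distinct URLs captured for one host name."""
    host = _checked(host, crawl)
    return (
        "SELECT DISTINCT url\n"
        f"FROM {TABLE}\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        f"  AND url_host_name = '{host}'"
    )


def hosts_for_domain(domain, crawl):
    """SQL listing every host name seen under one registered domain."""
    domain = _checked(domain, crawl)
    return (
        "SELECT url_host_name, COUNT(*) AS captures\n"
        f"FROM {TABLE}\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        f"  AND url_host_registered_domain = '{domain}'\n"
        "GROUP BY url_host_name\n"
        "ORDER BY captures DESC"
    )


if __name__ == "__main__":
    # The crawl label is an example; substitute the crawl you want to query.
    print(urls_for_host("api.example.com", "CC-MAIN-2024-10"))
    print(hosts_for_domain("example.com", "CC-MAIN-2024-10"))
```

Restricting the query to a single `crawl` and `subset` keeps the scanned data (and, on Athena, the cost) small; to cover several crawls, the same query can be repeated per crawl label and the results merged.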

Let us know if you need more advice!
Thanks for your question!

Best,
Sebastian