I'm looking for the worldwide list of domains (and subdomains), without internal pages (only the homepage).
Is there a way to get it?
If so, is there a way to get them with PHP?
Thank you
Amol Khade
May 12, 2017, 1:01:30 AM
To: Common Crawl
Hello Luca, I have seen your query and I have several possible solutions. I have already done many operations on Common Crawl, so I can extract almost anything required from the Common Crawl dataset: URLs, phone numbers, possible addresses, email addresses, cities and countries, IP addresses, servers, titles, keywords, website technologies and more.
I would love to discuss this with you in detail. Looking forward to your response. :)
All domains whose content is crawled are already published every month as part of the statistics and
counts. See this thread for how to access the lists:
https://groups.google.com/d/topic/common-crawl/vsD4vBpDdG0/discussion
Of course, hosts which are not crawled are not contained in this list, for example because they
disallow access in their robots.txt.
We are currently preparing a host-level webgraph of the last 3 months. It will also contain hosts
which are known only by links. Host names are only weakly verified (the name looks valid); there is
no check of whether the DNS look-up resolves, etc. The release is planned for next week and will be
announced on this list.
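To illustrate what "looks valid" could mean in practice, here is a minimal Python sketch of a purely syntactic host name check. It is an assumption for illustration only, not the check actually used for the webgraph; the function name `looks_valid` and the rule of requiring at least two labels are hypothetical choices.

```python
import re

# One DNS label: 1-63 characters, letters/digits/hyphens,
# with no leading or trailing hyphen (hypothetical rule set).
_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def looks_valid(hostname: str) -> bool:
    """Syntactic check only; says nothing about whether the name resolves in DNS."""
    name = hostname.lower().rstrip(".")
    if not name or len(name) > 253:
        return False
    labels = name.split(".")
    # Assumption: require at least two labels, e.g. "example.com".
    return len(labels) >= 2 and all(_LABEL.fullmatch(label) for label in labels)
```

A name that passes this kind of check may still not exist, so a DNS look-up would be needed before relying on it.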
> without internal pages (only the homepage).
The "homepage" is in most cases accessible by
http://hostname/
If you really need the correct form, e.g.
https://hostname/index.jsp
the easiest way is to mine these from the URL index files.
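As a rough idea of how such mining could look, here is a small Python sketch. It assumes each index line has the layout `<urlkey> <timestamp> <json>` with a `url` field in the JSON part; please check this against the index files you download. The function name `homepage_urls` is hypothetical. The sketch only collects captures whose path is the root (`/`), which tells you whether a host was seen under `http` or `https`; it does not try to discover forms like `/index.jsp`.

```python
import json
from urllib.parse import urlsplit


def homepage_urls(index_lines):
    """Map each host to the first root-path URL seen for it in the index lines."""
    homepages = {}
    for line in index_lines:
        # Assumed layout: "<urlkey> <timestamp> <json>"
        parts = line.strip().split(" ", 2)
        if len(parts) != 3:
            continue
        try:
            record = json.loads(parts[2])
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        url = record.get("url")
        if not url:
            continue
        split = urlsplit(url)
        host = split.hostname
        if host is None:
            continue
        # Keep only captures of the root path without a query string.
        if split.path in ("", "/") and not split.query:
            homepages.setdefault(host, url)
    return homepages
```

You would feed it the lines of a decompressed index file, for example `homepage_urls(open("some-index-file.txt", encoding="utf-8"))`, where the file name is a placeholder.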