to Common Crawl
Hi all,
I'm looking for the worldwide list of domains (and subdomains), without internal pages (only the homepage).
Is there a way to get it?
If so, is there a way to get them with PHP?
Thank you
Amol Khade
May 12, 2017, 1:01:30 AM
to Common Crawl
Hello Luca, I have seen your query and I have several possible solutions. I have already done a lot of work on Common Crawl, so I can extract almost anything required from the Common Crawl dataset, such as URLs, phone numbers, possible addresses, email addresses, cities and countries, IP addresses, servers, titles, keywords, website technologies, and more.
I would love to discuss this with you in detail. Looking forward to your response. :)
to common...@googlegroups.com
Hi Luca,
All domains from which content is crawled are already published every month as part of the
statistics and counts. See this thread for how to access the lists:
https://groups.google.com/d/topic/common-crawl/vsD4vBpDdG0/discussion
Of course, hosts which are not crawled are not contained in these lists, for example because they
disallow access in their robots.txt.
We are currently preparing a host-level webgraph of the last 3 months. It will also contain hosts
which are known only from links. Note that host names are only weakly verified: we check that a
name looks valid, but not whether the DNS look-up resolves, etc. The release is planned for next
week and will be announced on this list.
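To illustrate the difference between a name that "looks valid" and one that actually resolves, here is a minimal Python sketch. It is not the check used for the webgraph; the function names `looks_valid` and `resolves` and the exact syntax rules (ASCII labels of 1 to 63 characters, at least two labels) are assumptions made for this example.

```python
import re
import socket

# One DNS label: 1-63 chars of letters, digits or hyphens,
# not starting or ending with a hyphen (assumed rule for this sketch).
_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def looks_valid(host: str) -> bool:
    """Weak, purely syntactic check: says nothing about whether the host exists."""
    host = host.lower().rstrip(".")
    if not host or len(host) > 253:
        return False
    labels = host.split(".")
    # Require at least "name.tld" and every label to be well-formed.
    return len(labels) >= 2 and all(_LABEL.fullmatch(label) for label in labels)


def resolves(host: str) -> bool:
    """Stronger check: ask the local resolver (needs network access)."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False
```

A host list filtered only with `looks_valid` can still contain names that no longer exist; running `resolves` over millions of hosts is slow and should be rate-limited.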
> without internal pages (only the homepage).
The "homepage" is in most cases accessible at
http://hostname/
If you really need the exact form, e.g.
https://hostname/index.jsp
the easiest way is to mine these from the URL index files.
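As a rough sketch of what mining the index could look like in Python: the code below assumes each index line has the form `<key> <timestamp> <json>` and that the JSON part carries `url` and `status` fields. Please check this against the index files you actually download. The function name `homepage_candidates` and the ranking heuristic (prefer the root path, then no query string, then the shortest path) are my own assumptions, not an official method.

```python
import json
from urllib.parse import urlsplit


def homepage_candidates(cdx_lines):
    """Map each host to the successfully fetched URL that looks most like a homepage."""
    best = {}
    for line in cdx_lines:
        # Assumed line layout: "<key> <timestamp> <json>"
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue
        try:
            record = json.loads(parts[2])
        except json.JSONDecodeError:
            continue
        url = record.get("url")
        # Keep only successful fetches; redirects and errors are skipped.
        if not url or str(record.get("status")) != "200":
            continue
        split = urlsplit(url)
        host = split.hostname
        if not host:
            continue
        # Lower rank is better: root path, then no query, then shorter path.
        rank = (split.path not in ("", "/"), bool(split.query), len(split.path))
        if host not in best or rank < best[host][0]:
            best[host] = (rank, url)
    return {host: url for host, (rank, url) in best.items()}
```

Because this keeps only one small entry per host, it can be run over the index files as a stream without loading them into memory. The shortest-path rule is only a heuristic: a host whose root redirects elsewhere may need the redirect target followed instead.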