How to get the full list of domains


Luca

May 11, 2017, 4:39:00 PM
To: Common Crawl

Hi all,
I'm looking for a worldwide list of domains (and subdomains), without internal pages (only the homepage).
Is there a way to get it?
If so, is there a way to get it with PHP?

Thank you

Amol Khade

May 12, 2017, 1:01:30 AM
To: Common Crawl
Hello Luca,
I saw your question and have several possible solutions. I have done a lot of work with Common Crawl and can extract almost anything from the Common Crawl dataset: URLs, phone numbers, possible addresses, email addresses, cities and countries, IP addresses, servers, titles, keywords, website technologies, and more.

I would love to discuss this with you in detail. Looking forward to your response. :)

My Skype ID : amol.l.khade
Email ID : amol.l...@gmail.com

Best Regards
Amol Khade
Linkedin : https://in.linkedin.com/in/amollkhade

Sebastian Nagel

May 12, 2017, 4:41:00 AM
To: common...@googlegroups.com
Hi Luca,

all domains whose content is crawled are already published every month as part of the statistics
and counts. See this thread for how to access the lists:
https://groups.google.com/d/topic/common-crawl/vsD4vBpDdG0/discussion
Of course, hosts that are not crawled are not contained in this list, for example because they
disallow access in their robots.txt.

We are currently preparing a host-level webgraph of the last 3 months. It will also contain hosts
which are known only from links. Host names are only weakly verified (they look valid); we do not
check whether the DNS look-up resolves, etc. The release is planned for next week and will be
announced on this list.
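
To illustrate the difference between "looks valid" and "the DNS look-up resolves", here is a
minimal Python sketch. It is not the check used for the webgraph; the function names
looks_valid and resolves and the RFC 1123 style label rule are assumptions made for this example.

```python
import re
import socket

# Hypothetical syntactic rule (RFC 1123 style labels); an assumption for this
# sketch, not the verification actually applied to the webgraph host names.
_LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def looks_valid(hostname: str) -> bool:
    """Weak check: the name is syntactically plausible. No network access."""
    host = hostname.rstrip(".").lower()
    if not host or len(host) > 253:
        return False
    labels = host.split(".")
    # Require at least two labels, e.g. "example.com".
    return len(labels) >= 2 and all(_LABEL.match(label) for label in labels)


def resolves(hostname: str) -> bool:
    """Strong check: ask the resolver whether the name has an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False
```

A host name can pass looks_valid and still fail resolves, which is why a list built only
from links may contain hosts that do not exist.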

> without internal pages (only the homepage).

The "homepage" is in most cases accessible by
http://hostname/
If you really need the correct form, e.g.
https://hostname/index.jsp
the easiest way is to mine these from the URL index files.
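
As a rough sketch of that mining step in Python: the code below assumes each URL index line has
the form "SURT-key timestamp JSON", where the JSON object has "url" and "status" fields with the
status stored as a string. It also assumes, as a heuristic only, that the successfully fetched
URL with the shortest path and no query string is the best homepage candidate for a host. The
function name homepage_candidates is made up for this example.

```python
import json
from urllib.parse import urlsplit


def homepage_candidates(cdx_lines):
    """Map each host name to its shortest-path URL seen with status 200.

    Assumes lines of the form "<surt key> <timestamp> <json>"; lines that do
    not fit this shape are skipped.
    """
    best = {}  # host -> (path, url)
    for line in cdx_lines:
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue
        try:
            record = json.loads(parts[2])
            split = urlsplit(record.get("url", ""))
        except ValueError:  # malformed JSON or malformed URL
            continue
        if record.get("status") != "200":
            continue
        if not split.hostname or split.query:
            continue
        path = split.path or "/"
        current = best.get(split.hostname)
        # Heuristic: a shorter path is assumed to be closer to the homepage.
        if current is None or len(path) < len(current[0]):
            best[split.hostname] = (path, record["url"])
    return {host: url for host, (_path, url) in best.items()}
```

The input can be any iterable of lines, for example an opened and decompressed index file.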

Best,
Sebastian
