Missing URLs from CommonCrawl Index

Hassan Amir

не прочитано,

19 мая 2015 г., 10:15:3519.05.2015

– common...@googlegroups.com

Hello CommonCrawl,

I noticed a lot of missing urls from CommonCrawl index, comapred to Alexa top million here i found many domains not available in the index

Does that means its crawled but not in index ? or its not crawled at all ?

Thanks in advance

Tom Morris

не прочитано,

19 мая 2015 г., 15:43:1919.05.2015

– common...@googlegroups.com

On Tue, May 19, 2015 at 10:15 AM, Hassan Amir <hsn.e...@gmail.com> wrote:

I noticed a lot of missing urls from CommonCrawl index, comapred to Alexa top million here i found many domains not available in the index

Does that means its crawled but not in index ? or its not crawled at all ?

I haven't heard any reports that the current index is missing pages as was sometimes the case with the old index.

The URL list comes from Blekko, not Alexa, and I don't think they've disclosed how it's generated, so it's not too surprising that it doesn't match up.

If you want the Alexa sites and you're happy with just their home pages, the HTTP Archive crawls the top 500K sites.

http://httparchive.org/urls.php

If you can get the results that you want from just the metadata, it's all loaded up in Big Query for easy access.

http://bigqueri.es/c/http-archive

Tom

Stephen Merity

не прочитано,

19 мая 2015 г., 15:49:5319.05.2015

– common...@googlegroups.com

Hi Hassan,

The most likely reason that some of the domains you are looking for are not in the Common Crawl archive is that they've been asked not to be crawled using the robots.txt directive. LinkedIn is a good example of that as they only whitelist very specific crawlers, disallowing all other crawlers from accessing their data.

To ensure we're good net citizens, we also obey robots.txt when it is specified.

Another possible reason is that the Alexa top million list is somewhat old now and some of the web properties might have disappeared.

On a tangent, I've found it odd that the Alexa top million domains has been so popular for so long. I went to investigate it myself some time ago and discovered many of the domains are actually URLs. For example:
999995,jocolibrary.bibliocommons.com/user/login

999879,youtube.com/user/RhettandLink
999854,youtube.com/user/AContrariProject
999780,learn.greycampus.com/user/login

999768,youtube.com/user/OfficialTrapCity

999747,reddit.com/user/rauelius

18573,grokbase.com

980073,grokbase.com/user/%D8%A7%D9%84%D8%A8%D8%AD%D8%A7%D8%B1-%D8%B3%D8%B9%D8%AF

etc

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at http://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.

--

Regards,

Stephen Merity

Data Scientist @ Common Crawl

Tom Morris

не прочитано,

19 мая 2015 г., 17:52:0419.05.2015

– common...@googlegroups.com

On Tue, May 19, 2015 at 3:49 PM, Stephen Merity <ste...@commoncrawl.org> wrote:

The most likely reason that some of the domains you are looking for are not in the Common Crawl archive is that they've been asked not to be crawled using the robots.txt directive. LinkedIn is a good example of that as they only whitelist very specific crawlers, disallowing all other crawlers from accessing their data.

Does the crawl archive the robots.txt files in these cases?

Another possible reason is that the Alexa top million list is somewhat old now and some of the web properties might have disappeared.

This help page at Alexa says that the list is updated daily based on a 1-month average ranking. Is it out of date? How long ago did the list stop getting updated?

The HTTP Archive publishes the list of URLs that they used for the latest crawl:

curl http://httparchive.org/downloads/httparchive_urls.gz | zgrep -o -e "[^']*://[^']*" | gzip > httparchive_urls.txt.gz

Perhaps some portion of that list would be useful to include in the Common Crawl crawls - for some degree of comparability, if no other reason.

On a tangent, I've found it odd that the Alexa top million domains has been so popular for so long. I went to investigate it myself some time ago and discovered many of the domains are actually URLs. For example:
999995,jocolibrary.bibliocommons.com/user/login
999879,youtube.com/user/RhettandLink
999854,youtube.com/user/AContrariProject
999780,learn.greycampus.com/user/login
999768,youtube.com/user/OfficialTrapCity
999747,reddit.com/user/rauelius
18573,grokbase.com
980073,grokbase.com/user/%D8%A7%D9%84%D8%A8%D8%AD%D8%A7%D8%B1-%D8%B3%D8%B9%D8%AF

I actually looked at that a couple of weeks ago and found it strange as well, but it's less than 8,000 URLs with many of them being on just a few domains:

youtube.com 1081

my.tv.sohu.com 412

reddit.com 220

sites.google.com 136

nicovideo.jp 128

themeforest.net 119

xhamster.com 113

blogs.technet.com 97

dailymotion.com 92

empowernetwork.com 88

Tom

Hassan Amir

не прочитано,

19 мая 2015 г., 18:31:0319.05.2015

– common...@googlegroups.com

Thanks folks, i really appreciate it.

After more digging comparing CC Index with httparchive and Alexa top million found that some URLs not appear in CC Index (some have robots files and domain age over a year)

Example:

keywordrevealer.com

imalimedia.net

calendarwiz.com
...
..

Regards

Tom Morris

не прочитано,

20 мая 2015 г., 18:10:1920.05.2015

– common...@googlegroups.com

On Tue, May 19, 2015 at 6:31 PM, Hassan Amir <hsn.e...@gmail.com> wrote:

After more digging comparing CC Index with httparchive and Alexa top million found that some URLs not appear in CC Index (some have robots files and domain age over a year)

Example:
keywordrevealer.com
imalimedia.net
calendarwiz.com

Calendarwiz.com is in the latest crawl:

http://index.commoncrawl.org/CC-MAIN-2015-14-index?url=http%3A%2F%2Fwww.calendarwiz.com%2F*&output=json#

The other two aren't, but they're also pretty far down in the Alexa rankings (rank of ~40K)

The crawl, by it's very nature, is a sampling of the web, so it's never going to be complete. I don't think the Common Crawl folks have said anything about how it aligns with any other sample. The Internet Archive may be better aligned with Alexa ranking, so you could try checking there. e.g. https://web.archive.org/web/*/http://www.keywordrevealer.com/*

One weird thing is that imalimedia has no robots.txt. I wonder if the crawler is confused by that (although it shouldn't be).

Tom

Hassan Amir

не прочитано,

21 мая 2015 г., 08:23:0521.05.2015

– common...@googlegroups.com

I'm not sure how the indexing or the crawler work, however i found that portent.com mentioned keyword revealer here

so i inserted portent.com in CC Index and looked at the results here, and it appear that the above link (http://www.portent.com/blog/ppc/8-free-keyword-research-tools-ppc-advertising.htm) is not found in the CC index, maybe that why keywordrevealer it self is not in the index.

same applies for these links

http://www.webdeveloper.com/forum/showthread.php?300323-What-is-Suitable-Keyword-search-tool&s=81e9097953509bb9b229fd2e7b548174

http://wizzley.com/keyword-research-tool/

I hope that might help improving the engine.

Regards

Tom Morris

не прочитано,

22 мая 2015 г., 13:21:5322.05.2015

– common...@googlegroups.com

On Thu, May 21, 2015 at 8:23 AM, Hassan Amir <hsn.e...@gmail.com> wrote:

I'm not sure how the indexing or the crawler work, however i found that portent.com mentioned keyword revealer here

so i inserted portent.com in CC Index and looked at the results here, and it appear that the above link (http://www.portent.com/blog/ppc/8-free-keyword-research-tools-ppc-advertising.htm) is not found in the CC index, maybe that why keywordrevealer it self is not in the index.

Whether the crawler is working off a fixed list of URLs or starting with a seed list and maintaining a frontier of new URLs encountered, it's eventually going to run out of time, bandwidth, money, whatever and have to stop. Not all URLs encountered will be crawled.

Tom

Greg Lindahl

не прочитано,

22 мая 2015 г., 13:42:1622.05.2015

– common...@googlegroups.com

On Tue, May 19, 2015 at 03:43:18PM -0400, Tom Morris wrote:

> The URL list comes from Blekko, not Alexa, and I don't think they've
> disclosed how it's generated, so it's not too surprising that it doesn't
> match up.

There's not much to disclose -- Blekko, as a search engine, has quite
different opinions about websites and pages than Alexa's
toolbar-generated stats. Alexa users visit lots of websites that
blekko thinks are "bad". SEO that fools Google but not blekko results
in a lot of sites being in Alexa's top million, but not Blekko's crawl
frontier. On the flip side, there are probably plenty of sites whose
SEO fooled Blekko and not Google.

-- greg

Hassan Amir

не прочитано,

22 мая 2015 г., 14:28:2322.05.2015

– common...@googlegroups.com

Being Said,

How come we still find un-index fresh urls (posted 2015) within same indexed domain in CC INDEX ?

Ответить всем

Отправить сообщение автору

Переслать