I noticed a lot of URLs missing from the Common Crawl index. Comparing it against the Alexa top million, I found many domains that are not available in the index. Does that mean they were crawled but not indexed, or were they not crawled at all?
The most likely reason that some of the domains you are looking for are not in the Common Crawl archive is that the site owners have asked not to be crawled, using robots.txt directives. LinkedIn is a good example: they whitelist only very specific crawlers and disallow all other crawlers from accessing their data.
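If you want to check this for a given site, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content below is made up for illustration (it is not LinkedIn's actual file), and `is_allowed` is a hypothetical helper name. The user agent checked is `CCBot`, the name the Common Crawl crawler identifies itself with.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that whitelists one crawler and blocks all others.
EXAMPLE_ROBOTS = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    for agent in ("Googlebot", "CCBot"):
        allowed = is_allowed(EXAMPLE_ROBOTS, agent, "https://example.com/page")
        print(agent, "allowed" if allowed else "blocked")
```

For a real site you would download its `/robots.txt` first and pass the text in. A site with rules like the ones above would simply never show up in the crawl.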
Another possible reason is that the Alexa top million list is somewhat old now and some of the web properties might have disappeared.
On a tangent, I've found it odd that the Alexa top million list has been so popular for so long. I investigated it myself some time ago and discovered that many of the "domains" are actually URLs. For example:
999995,jocolibrary.bibliocommons.com/user/login
imalimedia.net
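To spot such entries before comparing against the index, something along these lines might work. It is only a sketch: it assumes each line is `rank,entry` (with the rank sometimes missing, as in the second example), and `split_alexa_line` is a hypothetical helper, not part of any Alexa tooling.

```python
def split_alexa_line(line: str):
    """Split one list line into (rank, host, path).

    rank is None when the line has no leading numeric rank;
    path is "" when the entry is a bare domain.
    """
    line = line.strip()
    rank_text, sep, entry = line.partition(",")
    if sep and rank_text.isdigit():
        rank = int(rank_text)
    else:
        # No "rank," prefix: treat the whole line as the entry.
        rank, entry = None, line
    host, slash, path = entry.partition("/")
    return rank, host, slash + path


if __name__ == "__main__":
    samples = [
        "999995,jocolibrary.bibliocommons.com/user/login",
        "imalimedia.net",
    ]
    for sample in samples:
        rank, host, path = split_alexa_line(sample)
        kind = "URL with path" if path else "bare domain"
        print(rank, host, path, kind)
```

Comparing only the `host` part against the index avoids false "missing" results caused by entries that carry a path.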
After more digging, comparing the CC Index with httparchive and the Alexa top million, I found that some URLs do not appear in the CC Index (some of them have robots.txt files and a domain age of over a year). Example:
I'm not sure how the indexing or the crawler works. However, I found that portent.com mentions Keyword Revealer, so I looked up portent.com in the CC Index and went through the results. It appears that the above link (http://www.portent.com/blog/ppc/8-free-keyword-research-tools-ppc-advertising.htm) is not in the CC Index. Maybe that is why keywordrevealer itself is not in the index.
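For anyone who wants to repeat this lookup from a script, here is a rough sketch that builds a query URL for the CC Index server. It assumes the CDX-style endpoint at index.commoncrawl.org with `url` and `output` parameters. The collection name `CC-MAIN-2015-18` is only a placeholder (pick whichever crawl you want to search), and `build_index_query` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed base address of the public CC Index server.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(collection: str, url_pattern: str) -> str:
    """Build an index query URL for one crawl collection.

    url_pattern may end in /* to match every captured page under a domain.
    """
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{collection}-index?{params}"


if __name__ == "__main__":
    # Placeholder collection name; substitute the crawl you are checking.
    print(build_index_query("CC-MAIN-2015-18", "portent.com/*"))
```

Fetching that URL should return one JSON record per captured page, so you can check whether a specific page was captured in that crawl. A page missing from one crawl may still be present in another, so it is worth checking more than one collection.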