I’m curious as to the number of sites that have disallowed Common Crawl in their robots.txt. This topic was touched on in these two posts:
https://groups.google.com/forum/#!topic/common-crawl/3KSsO2riUVE
https://groups.google.com/forum/#!topic/common-crawl/HypfDOpdH5A
but I still don’t know whether there’s any way to gather data or stats on which sites rejected the crawler. (Specifically, I’m trying to find a rough percentage of sites that allowed the Google, Bing, etc. crawlers through but not Common Crawl’s.) Any help would be much appreciated. Thanks!
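
To make the comparison concrete, here is a minimal sketch of the per-site check, assuming the robots.txt text for each site has already been fetched by some other means. It uses Python's standard `urllib.robotparser`; the helper names `allowed_agents` and `blocks_only_common_crawl` are made up for illustration, and the user-agent tokens (`CCBot` for Common Crawl, `Googlebot`, `bingbot`) and the choice to test only the root path `/` are assumptions. It does not address how to collect the robots.txt files in the first place.

```python
from urllib.robotparser import RobotFileParser


def allowed_agents(robots_txt, agents, url="/"):
    """Return {agent: True/False} for whether each agent may fetch `url`.

    `robots_txt` is the raw text of a robots.txt file (already downloaded).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


def blocks_only_common_crawl(robots_txt):
    """True if CCBot is disallowed at "/" while Googlebot and bingbot are allowed.

    Hypothetical helper: the agent list and the root-path test are assumptions.
    """
    result = allowed_agents(robots_txt, ["CCBot", "Googlebot", "bingbot"])
    return (not result["CCBot"]) and result["Googlebot"] and result["bingbot"]


# Example robots.txt that singles out Common Crawl's crawler.
sample = """User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

print(allowed_agents(sample, ["CCBot", "Googlebot", "bingbot"]))
print(blocks_only_common_crawl(sample))
```

Running `blocks_only_common_crawl` over a sample of sites and dividing the number of `True` results by the sample size would give the rough percentage; how representative that is depends entirely on how the sample is chosen.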
Stephen, thanks again for that. Interesting article, too.