How often is Common Crawl rejected by a website?


Logan Scovil

Jun 19, 2015, 4:15:05 PM
to common...@googlegroups.com

I’m curious as to the number of sites that have disallowed Common Crawl in their robots.txt.  This topic was touched on in these two posts:

https://groups.google.com/forum/#!topic/common-crawl/3KSsO2riUVE

https://groups.google.com/forum/#!topic/common-crawl/HypfDOpdH5A

but I still don’t know if there’s any way to gather some data/stats on which sites rejected the crawler.  (Specifically, I’m trying to find a rough percentage of sites that allowed the Google, Bing, etc. crawlers through but not Common Crawl’s.)  Any help would be much appreciated.  Thanks!
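One rough way to run that comparison, given the contents of a site's robots.txt, is to parse it with Python's standard urllib.robotparser and test each crawler's user-agent token. A minimal sketch, where the function name and the root-path check are illustrative assumptions:

    # Illustrative sketch: given robots.txt text for a site, check whether the
    # major search crawlers are allowed at the root while Common Crawl's CCBot
    # (its usual user-agent token) is not. The root-path check is a simplification.
    from urllib.robotparser import RobotFileParser

    def blocks_ccbot_only(robots_txt: str, url: str = "http://example.com/") -> bool:
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        allows_google = parser.can_fetch("Googlebot", url)
        allows_bing = parser.can_fetch("bingbot", url)
        allows_ccbot = parser.can_fetch("CCBot", url)
        return allows_google and allows_bing and not allows_ccbot

Counting how many domains satisfy this check, out of all domains with a robots.txt, gives the kind of rough percentage asked about above.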

Tom Morris

Jun 19, 2015, 5:36:52 PM
to common...@googlegroups.com
I think it would be useful to include the robots.txt files in the
crawl archives, but that doesn't appear to be the case currently.

There is a smattering of them (69 in the sample of 2.2M hosts & 120M
unique URLs that I looked at in the 2015-18 crawl), but they appear to
be included only by accident (when they're explicitly linked in the
HTML of a page?) rather than as a matter of course.

Tom

Logan Scovil

Jun 22, 2015, 9:18:56 PM
to common...@googlegroups.com
Thanks for the reply!  But, yeah, I guess I'm out of luck for the moment :( 

Stephen Merity

Jun 22, 2015, 10:28:51 PM
to common...@googlegroups.com
Hi Logan,

During the crawl I keep statistics on the URLs we attempt to grab. For every ~20 million URLs successfully crawled, we'll see around 5 million that are denied by robots.txt and approximately 150k that we decide to avoid because the robots.txt specifies an extreme delay (dozens of seconds between individual pages, for example).

For seeing which crawlers are permitted, analyzing robots.txt is, as Tom says, your best bet. I don't have an answer on hand, but I do have 726,880 robots.txt files from the top million domains as provided by Alexa's top-1m.csv. I was planning on expanding this to a few hundred million robots.txt files and then releasing it, but haven't had the time. If you're interested in the latter, I can provide you with a Redis database of the 700k robots.txt files, which weighs in at around 1 GB on disk.

--
Regards,
Stephen Merity
Data Scientist @ Common Crawl
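The "extreme delay" case above comes from the Crawl-delay directive, which Python's standard urllib.robotparser can also read (Python 3.6+). A minimal sketch; the 10-second threshold is an arbitrary illustration, not Common Crawl's actual cutoff:

    # Illustrative sketch: flag robots.txt files that ask CCBot for a very long
    # delay between requests. The threshold is arbitrary, not Common Crawl's.
    from urllib.robotparser import RobotFileParser

    def has_extreme_delay(robots_txt: str, threshold_seconds: float = 10.0) -> bool:
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        delay = parser.crawl_delay("CCBot")
        return delay is not None and delay > threshold_seconds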

Logan Scovil

Jun 23, 2015, 2:07:38 PM
to common...@googlegroups.com
Yes, if you could provide me with those robots.txt files that would be fantastic!  Many thanks for your help.  

Tom Morris

Jun 23, 2015, 4:25:24 PM
to common...@googlegroups.com
It would be great if you would publish your analysis (and any code
used to do the analysis) to save others the trouble of redoing the
work (and to allow them to validate the methodology).

Tom

Stephen Merity

Jun 25, 2015, 5:13:20 AM
to common...@googlegroups.com
Linked is the 826MB Redis database dump (rdb) for 726,880 robots.txt files from the Alexa Top Million.

The format is:
  • The hash "url" maps the domain to the final robots.txt URL
  • The hash "redirects" maps the domain to the number of redirects before finding the final robots.txt URL (personal interest)
  • A standard key-value pair, where the key is the domain and the value is the robots.txt contents

I'd be curious to know how useful this is to people, or what they'd be interested in using it for, as I wasn't necessarily intending to release it as a dataset. The closest I've seen has been "What one may find in robots.txt", which used Common Crawl as a domain source, but the collected data was not released.
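For anyone who loads that dump into a local Redis instance, here is a minimal sketch of how it could be queried; the redis-py client and the example domain are assumptions for illustration, while the hash names and key layout come from the description above:

    # Minimal sketch: query the robots.txt dump after loading the .rdb file
    # into a local Redis server (layout as described above).
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    domain = "example.com"  # hypothetical domain, for illustration only

    # Hash "url": domain -> final robots.txt URL reached after redirects
    final_url = r.hget("url", domain)

    # Hash "redirects": domain -> number of redirects before the final URL
    redirect_count = r.hget("redirects", domain)

    # Plain key: domain -> raw robots.txt contents
    robots_txt = r.get(domain)

    print(domain, final_url, redirect_count)
    if robots_txt is not None:
        print(robots_txt[:200])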

Logan Scovil

Jul 6, 2015, 4:51:19 PM
to common...@googlegroups.com

Stephen, thanks again for that.  Interesting article, too.


Something to note: I believe the Alexa crawler missed certain robots.txt files. For certain domains, when one attempts to navigate to "domain"/robots.txt, they are simply redirected to the domain's home page. As far as I can tell, this is caused by one of two things: 1.) the robots.txt file doesn't exist, or 2.) the robots.txt file does exist but must be accessed by specifically navigating to http://"domain"/robots.txt or www."domain"/robots.txt, etc. In the robots.txt dump, these kinds of domains were accompanied by their home page's HTML in lieu of the robots.txt text.

PS: Tom, I will be sure to do that.
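A small, purely illustrative heuristic for the entries described above, where the stored value is the home page's HTML rather than robots.txt directives (the function name and thresholds are assumptions, not part of the dump):

    # Hypothetical heuristic: flag dump entries whose "robots.txt" value looks
    # like an HTML home page served after a redirect rather than directives.
    def looks_like_html(content: str) -> bool:
        head = content.lstrip().lower()
        return (
            head.startswith("<!doctype html")
            or head.startswith("<html")
            or "<body" in head[:2000]
        )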