robots.txt files for denied sites


Michael Pastore

Feb 7, 2015, 8:09:19 AM
to common...@googlegroups.com
When the crawler attempts to crawl a site whose robots.txt file blocks CCBot (or blocks crawling in general, or otherwise limits what and how much of the site is crawled), is that robots.txt file stored? Basically, is there some indicator of sites that were not crawled? The reason I am asking is as follows. I am interested in studying certain industries, and part of this entails understanding their web presence. How would I know whether the data for such an industry is present in the body of data? Say I was interested in soft-drink manufacturers, and a big player in this industry is the company behind the soft drink "Slurm". Is there some indicator of whether the parent site (e.g. slurm-soda.com) blocked or limited any crawls? Then this could be accounted for in an analysis of the data.

Thanks,

Mike

Stephen Merity

Feb 9, 2015, 9:43:45 PM
to common...@googlegroups.com
Hi Michael,

I do believe Nutch keeps all the robots.txt files it comes across, though that would be in the raw (and not publicly distributed) Nutch crawler output that we have before we process it into the publicly available WARC/WAT/WET files. I also don't remember off the top of my head how simple it would be to pull that information out; at the very least it would require reprocessing the raw datasets, which is a fairly large task.

I do feel that a dataset of robots.txt files could be a valuable resource, however, and for that very reason (and for my own enjoyment) I've played around with writing a robots.txt crawler in Go. Grabbing robots.txt files is a fun and relatively simple problem, since you don't need to worry too much about rate limiting: most of the requests go to entirely separate domains!
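To give a rough idea of the shape of that, here is a minimal sketch rather than the actual crawler; the domain list is a placeholder, and a real run over millions of domains would also cap the number of goroutines in flight.

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "sync"
    "time"
)

func main() {
    // Placeholder list; in practice this would be streamed from a file.
    domains := []string{"example.com", "example.org"}

    client := &http.Client{Timeout: 10 * time.Second}
    var wg sync.WaitGroup

    for _, d := range domains {
        wg.Add(1)
        go func(domain string) {
            defer wg.Done()
            // One robots.txt request per domain, so per-host politeness
            // is largely a non-issue.
            resp, err := client.Get("http://" + domain + "/robots.txt")
            if err != nil {
                fmt.Printf("%s: fetch failed: %v\n", domain, err)
                return
            }
            defer resp.Body.Close()
            body, err := ioutil.ReadAll(resp.Body)
            if err != nil {
                fmt.Printf("%s: read failed: %v\n", domain, err)
                return
            }
            fmt.Printf("%s: HTTP %d, %d bytes\n", domain, resp.StatusCode, len(body))
        }(d)
    }
    wg.Wait()
}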

For your purpose, I'm curious how indicative robots.txt would be, though. To have an idea of which companies from a certain industry don't have a web presence (and which robots.txt files you'd be interested in looking at), you'd likely already have a list of them, as "discovering" their absence from the Common Crawl data would be quite a hard problem. If you do have a small list that you can enumerate easily, grabbing the robots.txt files from their domains and parsing them with one of the many available robots.txt parsing tools would get you the results you're interested in without having to wade through hundreds of millions of domains' worth of robots.txt data.
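As a very rough illustration of that kind of check (a minimal sketch, not a substitute for a proper robots.txt parser: the blocksAgent helper below is made up for this example and only looks for a blanket "Disallow: /" under a matching User-agent group, ignoring Allow rules, wildcards, and so on):

package main

import (
    "bufio"
    "fmt"
    "strings"
)

// blocksAgent reports whether the robots.txt text contains a blanket
// "Disallow: /" rule inside a User-agent group matching agent or "*".
func blocksAgent(robotsTxt, agent string) bool {
    applies := false   // does the current group apply to our agent?
    lastWasUA := false // was the previous line a User-agent line?
    scanner := bufio.NewScanner(strings.NewReader(robotsTxt))
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        lower := strings.ToLower(line)
        switch {
        case strings.HasPrefix(lower, "user-agent:"):
            if !lastWasUA {
                applies = false // a new group starts here
            }
            ua := strings.TrimSpace(line[len("user-agent:"):])
            if ua == "*" || strings.EqualFold(ua, agent) {
                applies = true
            }
            lastWasUA = true
        case strings.HasPrefix(lower, "disallow:"):
            if applies && strings.TrimSpace(line[len("disallow:"):]) == "/" {
                return true
            }
            lastWasUA = false
        default:
            lastWasUA = false
        }
    }
    return false
}

func main() {
    sample := "User-agent: CCBot\nDisallow: /\n"
    fmt.Println(blocksAgent(sample, "CCBot")) // prints: true
}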

P.S. I loved the reference to Slurm Soda! All glory to the hypnotoad, of course.

--
Regards,
Stephen Merity
Data Scientist @ Common Crawl