Questions

Martin Streicher

Feb 13, 2013, 9:52:22 PM
to sp...@googlegroups.com
I have a few questions:

  1. Do I run the spider, let it collect data about all followed URLs, and then use some method to cull, say, all 301s from the results?
  2. Or, can I get a callback for each URL and its results?
  3. Can I tell the spider to ignore or not follow links off the site or host?
  4. Is the data from a run of the spider accessible in any other way?
  5. Is the data maintained so I can compare one run to the next?
  6. Are there examples or more extensive documentation on how to achieve any of the aforementioned steps?

postmodern

Feb 13, 2013, 10:25:30 PM
to sp...@googlegroups.com
1 and 2: You can use the `every_page` callback [1] for every response, or `every_ok_page` [2] to get only the 200 OK responses.
3. You can use the Spidr.host [3] or Spidr.site [4] methods to restrict Spidr to a given domain.
4. Spidr::Agent#history [5] and Spidr::Agent#failures [6] store the URLs the agent has visited and the URLs it failed to request, respectively. Response/page data is not kept otherwise; it is garbage collected.
5. That depends on what you want to compare.
6. Examples and API documentation are all there is currently. http://rubydoc.info/gems/spidr/frames

[1]: http://rubydoc.info/gems/spidr/Spidr/Events#every_page-instance_method
[2]: http://rubydoc.info/gems/spidr/Spidr/Events#every_ok_page-instance_method
[3]: http://rubydoc.info/gems/spidr/Spidr#host-class_method
[4]: http://rubydoc.info/gems/spidr/Spidr#site-class_method
[5]: http://rubydoc.info/gems/spidr/Spidr/Agent#history-instance_method
[6]: http://rubydoc.info/gems/spidr/Spidr/Agent#failures-instance_method
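
A rough sketch of how answers 1 through 5 could fit together, assuming the spidr gem is installed. `Spidr.host`, `every_page`, `page.code`, and `page.url` are the Spidr calls mentioned above; the helper names (`crawl_statuses`, `urls_with_status`, `diff_runs`, `save_run`, `load_run`) and the JSON file format are made up for illustration and are not part of Spidr. Since Spidr does not keep page data between runs, the sketch records a URL-to-status map itself and compares two such maps.

```ruby
require 'json'

# Crawl a single host and record the HTTP status code of every response.
# Spidr.host keeps the agent on the given host; every_page fires per response.
def crawl_statuses(host)
  require 'spidr' # loaded lazily so the helpers below work without the gem

  statuses = {}
  Spidr.host(host) do |spider|
    spider.every_page do |page|
      statuses[page.url.to_s] = page.code
    end
  end
  statuses
end

# Hypothetical helper: URLs that answered with a given status, e.g. 301.
def urls_with_status(statuses, code)
  statuses.select { |_url, status| status == code }.keys
end

# Hypothetical helpers: persist one run so that a later run can be compared.
def save_run(path, statuses)
  File.write(path, JSON.pretty_generate(statuses))
end

def load_run(path)
  JSON.parse(File.read(path))
end

# Hypothetical helper: what changed between two URL-to-status maps.
def diff_runs(previous, current)
  shared = current.keys & previous.keys
  {
    added:   current.keys - previous.keys,
    removed: previous.keys - current.keys,
    changed: shared.reject { |url| previous[url] == current[url] }
  }
end
```

A run might then look like `statuses = crawl_statuses('example.com')`, followed by `urls_with_status(statuses, 301)` to list the redirects and `diff_runs(load_run('last.json'), statuses)` to compare against a previously saved run. The file name `last.json` is only a placeholder.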


-- 
Blog: http://postmodern.github.com/
GitHub: https://github.com/postmodern
Twitter: @postmodern_mod3
PGP: 0xB9515E77