Hey,
I'm wondering how I should go about getting redirect destinations in
my spider. The idea is that I'm crawling multiple sites that have
overlapping items (links to sites), and the only way I can check
whether two items are the same is to compare their urls (e.g. site1
has an item with url http://example.com and site2 has an item with
the same url http://example.com).
For now I've just written a redirect handler for urllib2 in my
pipeline to get the destination url, and that works fine since I only
issue a HEAD request and parse the headers. But I would like it done
in a Scrapy-like way so it would honour settings like DOWNLOAD_DELAY etc.
I was thinking about creating a custom request that sends a HEAD
request and then somehow returns the Location header, but that might
not work as I'd like, because once I yield a RedirectRequest in
parse() I can't return the items any more.
Another way would be to write a custom redirect handler as described
in:
http://groups.google.com/group/scrapy-users/browse_thread/thread/60577c2b3aec6a17/10c4d069d3cebca4?lnk=gst&q=redirect#
But my question is: how do I get the response (the redirect
destination) back in the parse() method so I can pass all the parsed
items to my pipeline? Can anybody point me in the right direction?
class SiteSpider(MyCustomSpider):
	name = 'site.com'
	allowed_domains = ['site.com']
	start_urls = ['http://www.site.com']

	def parse(self, response):
		items = []
		hxs = HtmlXPathSelector(response)
		crawled = hxs.select("...")
		for crawled_item in crawled:
			item = CrawledItem()
			item['name'] = crawled_item.select("...").extract()
			item['url'] = crawled_item.select("...").extract()
			# If the url is a redirect then get its destination
			if self.is_redirect(item['url']):
				item['url'] = REDIRECT_DESTINATION
			items.append(item)
		return items
Thanks