Hey,
I'm wondering how I should go about getting redirect destinations in
my spider. The idea is that I'm crawling multiple sites that have
overlapping items (links to sites), and the only way I can check
whether the items are the same is to compare their URLs (e.g. site1
has an item with url http://example.com and site2 has an item with
url http://example.com).
For now I've written a redirect handler for urllib2 in my pipeline to
get the destination URL, and that works fine since I only issue a HEAD
request and parse the headers. But I would like it done in a
Scrapy-like way, so that it respects settings like DOWNLOAD_DELAY etc.
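For reference, here's roughly what that pipeline code does, rewritten as a minimal sketch against Python 3's urllib.request (on Python 2 the same idea works with urllib2; HeadRequest and resolve_redirect are just names I picked):

```python
import urllib.request

class HeadRequest(urllib.request.Request):
    """A Request that issues HEAD instead of GET, so only headers come back."""
    def get_method(self):
        return "HEAD"

def resolve_redirect(url):
    """Follow any redirects and return the final destination URL."""
    # urlopen's default HTTPRedirectHandler follows 301/302/303/307
    # responses; geturl() then reports the URL we actually ended up at.
    resp = urllib.request.urlopen(HeadRequest(url))
    return resp.geturl()
```

This works, but of course it bypasses Scrapy's scheduler entirely, which is exactly the problem.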
I was thinking about creating a custom request that sends a HEAD
request and then somehow returns the Location header, but that might
not work the way I'd like: once I yield a RedirectRequest in parse(),
I can't return the items any more.
Another way would be to write a custom redirect handler as described
in:
http://groups.google.com/group/scrapy-users/browse_thread/thread/60577c2b3aec6a17/10c4d069d3cebca4?lnk=gst&q=redirect#
But my question would be: how do I get the response (the redirect
destination) back into the parse() method, so that I can pass all the
parsed items to my pipeline? Can anybody point me in the right
direction on how I would go about doing this?
class SiteSpider(MyCustomSpider):
    name = 'site.com'
    allowed_domains = ['site.com']
    start_urls = ['http://www.site.com']

    def parse(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        crawled = hxs.select("...")
        for crawled_item in crawled:
            item = CrawledItem()
            item['name'] = crawled_item.select("...").extract()
            item['url'] = crawled_item.select("...").extract()
            # If the url is a redirect then get its destination
            if self.is_redirect(item['url']):
                item['url'] = REDIRECT_DESTINATION  # <-- this is the part I'm missing
            items.append(item)
        return items
Thanks