How to get the redirect destination?


tuisu

Jul 28, 2010, 4:54:19 AM
to scrapy-users
Hey,

I'm wondering how I should go about getting redirect destinations in
my spider. The idea is that I'm crawling multiple sites that have
overlapping items (links to sites), and the only way I can check whether
two items are the same is to compare their urls (e.g. site1 has an item
with url http://example.com and site2 has an item with url http://example.com).

For now I just wrote a redirect handler for urllib2 in my pipeline to
get the destination url, and that works fine since I only issue a HEAD
request and parse the headers. But I would like it to be done in a
Scrapy-like way, so that it would honour settings like DOWNLOAD_DELAY etc.
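For reference, the redirect-handler approach described above looks roughly like this (a sketch using Python 3's urllib.request rather than the original urllib2; all names here are illustrative, not from the original post):

```python
import urllib.request

class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Remembers the redirect target instead of only silently following it."""

    def __init__(self):
        self.destination = None

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # urllib has already resolved newurl to an absolute URL here.
        self.destination = newurl
        return super().redirect_request(req, fp, code, msg, headers, newurl)

def resolve_url(url):
    """Issue a HEAD request and return the redirect destination, if any."""
    handler = RecordingRedirectHandler()
    opener = urllib.request.build_opener(handler)
    # HEAD keeps the transfer cheap: only the headers are needed.
    opener.open(urllib.request.Request(url, method="HEAD"))
    return handler.destination or url
```

The drawback, as noted above, is that requests made this way bypass Scrapy's scheduler entirely, so DOWNLOAD_DELAY and friends don't apply.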

I was thinking about creating a custom request to send a HEAD request
and then somehow return the Location header, but that might not
work as I'd like, because when I yield a RedirectRequest in parse()
I can't return the items any more.

Another way would be to write a custom redirect handler as described
in:
http://groups.google.com/group/scrapy-users/browse_thread/thread/60577c2b3aec6a17/10c4d069d3cebca4?lnk=gst&q=redirect#

But my question is how to get the response (redirect destination) back
in the parse() method so I could pass all the parsed items to my
pipeline. Can anybody point me in the right direction on how I would
go about doing this?

class SiteSpider(MyCustomSpider):

    name = 'site.com'
    allowed_domains = ['site.com']
    start_urls = ['http://www.site.com']

    def parse(self, response):
        items = []

        hxs = HtmlXPathSelector(response)
        crawled = hxs.select("...")

        for crawled_item in crawled:
            item = CrawledItem()

            item['name'] = crawled_item.select("...").extract()
            item['url'] = crawled_item.select("...").extract()

            # If the url is a redirect then get its destination
            if self.is_redirect(item['url']):
                item['url'] = REDIRECT_DESTINATION

            items.append(item)

        return items

Thanks

Pablo Hoffman

Aug 27, 2010, 6:07:36 PM
to scrapy...@googlegroups.com
On Wed, Jul 28, 2010 at 01:54:19AM -0700, tuisu wrote:
> But my question would be how to get the response (redirect
> destination) back in the parse() method so I could pass all the parsed
> items to my pipeline? Can anybody point me in the right direction on
> how I would go about doing this?

I think disabling the redirect middleware and defining `handle_httpstatus_list`
in your spider could be enough for your needs:

Put in your project settings:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
}

And in your spider:

class SiteSpider(MyCustomSpider):

    handle_httpstatus_list = [301, 302, 303]

    [ ... rest of spider code ... ]


See HttpErrorMiddleware for more info about the `handle_httpstatus_list`
attribute:
http://doc.scrapy.org/topics/spider-middleware.html#module-scrapy.contrib.spidermiddleware.httperror

Pablo.

Pablo Hoffman

Aug 27, 2010, 6:08:56 PM
to scrapy...@googlegroups.com
Forgot to mention: you would extract the redirect destination in your
spider from response.headers['Location'].
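Putting the two replies together, the extraction could look something like this (a hypothetical helper, not part of Scrapy's API; it also resolves relative Location values against the request url with urljoin, and accepts bytes values since Scrapy header values are bytes):

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = (301, 302, 303)

def redirect_destination(status, headers, request_url):
    """Return the absolute redirect target, or None if the response
    is not a redirect. `headers` is any mapping with a Location key;
    in a spider it would be response.headers."""
    if status not in REDIRECT_STATUSES:
        return None
    location = headers.get("Location")
    if location is None:
        return None
    if isinstance(location, bytes):
        location = location.decode("latin-1")
    # Location may be relative, so resolve it against the request url.
    return urljoin(request_url, location)
```

In parse() you could then do something like `item['url'] = redirect_destination(response.status, response.headers, response.url) or response.url`.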