Follow redirect and get target URL

Manuel 'naeg' Rotter

Jul 19, 2011, 11:50:57 AM
to scrapy...@googlegroups.com
Hello

I'm trying to follow a link that returns a 301 redirect, just to find out where it leads. The problem is that the approach has to work in many different cases.
The target URL is not always in a meta tag, and there is not always a single redirect; sometimes there are two. So the easiest way is to keep following redirects until there are none left.

After following the redirect, I'd like to go back to the original page where the redirecting link was scraped, by saving that page's link in the meta dict beforehand.

So how can I define a function that follows a link to the end, like a browser does?
Also, do I have to do something after following a redirect? The crawler simply seems to stop (allowed_domains?).


Greetings naeg


Pablo Hoffman

Jul 20, 2011, 10:39:32 AM
to scrapy...@googlegroups.com
Redirects should be followed regardless of the allowed_domains attribute: they're
resolved at the downloader level, so they never get back to the spider.

After you receive the final response, in the spider, you can check:

response.meta['redirect_urls']

to find out which URLs the request was redirected through.
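
As a rough sketch of how that could be used: the helper below (redirect_chain is a hypothetical name, not part of Scrapy) combines the 'redirect_urls' list with the final response URL. It assumes the key is absent when no redirect happened, so it falls back to an empty list. The 'origin_page' key in the comment is also hypothetical; it stands for whatever you stored in the request's meta before following the link.

```python
def redirect_chain(meta, final_url):
    """Return every URL visited, from the first request to the final response."""
    # 'redirect_urls' is assumed to be missing when nothing was redirected,
    # so fall back to an empty list instead of raising KeyError.
    return list(meta.get("redirect_urls", [])) + [final_url]


# Inside a spider callback this would be called roughly as:
#     chain = redirect_chain(response.meta, response.url)
#     origin = response.meta.get("origin_page")  # hypothetical key you set yourself
chain = redirect_chain(
    {"redirect_urls": ["http://example.com/a", "http://example.com/b"]},
    "http://example.com/final",
)
print(chain[0], "->", chain[-1])
```

The first element is the URL that was originally requested and the last one is where the redirects ended up.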

You may want to increase the REDIRECT_MAX_TIMES setting if the default (20) is not
enough, but it should be enough in most cases.
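
For illustration only, raising the limit in a project's settings module might look like this (the value 30 is an arbitrary example, not a recommendation):

```python
# settings.py of the Scrapy project
REDIRECT_ENABLED = True    # already the default; shown only for context
REDIRECT_MAX_TIMES = 30    # illustrative value, raised from the default of 20
```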

Pablo.

