redirecting issue with scrapy


Gaurang shah

Mar 5, 2015, 4:41:43 AM
to scrapy...@googlegroups.com
Hi Guys, 

I am trying to scrape a website with Scrapy. The problem is that whenever I request the page I need to scrape data from, I get redirected to another page. If I visit that page manually in the browser, it is not redirected; I checked the response code as well, and it shows 200.

With Scrapy, however, the request is redirected and I see status code 302.

Following is the website I am trying to scrape.

In the Scrapy logs I see the following entries.
2015-03-05 15:08:36+0530 [lonamrk] DEBUG: Redirecting (302) to <GET http://www.lonmark.org/sitemap> from <GET http://www.lonmark.org/membership/directory/partners>
2015-03-05 15:08:37+0530 [lonamrk] DEBUG: Redirecting (302) to <GET http://www.lonmark.org/sitemap> from <GET http://www.lonmark.org/sitemap>
2015-03-05 15:08:37+0530 [lonamrk] DEBUG: Redirecting (302) to <GET http://www.lonmark.org/sitemap> from <GET http://www.lonmark.org/sitemap>
2015-03-05 15:08:41+0530 [lonamrk] DEBUG: Redirecting (302) to <GET http://www.lonmark.org/sitemap> from <GET http://www.lonmark.org/sitemap>

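Following is the code.
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class Spider(BaseSpider):
    name = "lonamrk"
    allowed_domains = ["lonmark.org"]
    # First URL requested in the log above.
    start_urls = ["http://www.lonmark.org/membership/directory/partners"]
    # Request.meta = {'dont_redirect': True,
    #                 'handle_httpstatus_list': [302]}

    def parse(self, response):
        print(response.url)
        hxs = HtmlXPathSelector(response)
        company_links = hxs.select("//*[@id='page_content']/table/tbody/tr[1]/td[1]/a/@href")
        for link in company_links:
            # extract() returns the href as a string; parse_company_info is not shown in this post.
            yield Request("http://www.lonmark.org/membership/directory/" + link.extract(),
                          callback=self.parse_company_info)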



If I uncomment those two lines to stop the redirection, I get nothing in the response body.

Could someone please help me figure out what to do?
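
For reference, dont_redirect and handle_httpstatus_list are keys that Scrapy reads from each request's meta dict, so they are normally passed per request rather than assigned on the Request class. The sketch below only builds such a dict; no_redirect_meta is a hypothetical helper name, not part of Scrapy. Note that with redirects disabled the callback receives the 302 response itself, which usually carries little or no body, and that would match the empty body described above.

```python
def no_redirect_meta(statuses=(302,)):
    """Build per-request meta that disables redirects and lets the
    given HTTP statuses reach the spider callback.

    Hypothetical helper for illustration; only the two dict keys
    are Scrapy's own.
    """
    return {
        "dont_redirect": True,
        "handle_httpstatus_list": list(statuses),
    }


# Intended use inside a spider (assumes Scrapy is installed):
#   yield Request(url, meta=no_redirect_meta(), callback=self.parse)
meta = no_redirect_meta()
print(meta)
```

This only stops Scrapy from following the redirect; it does not change why the server sends the 302 in the first place.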

Travis Leleu

Mar 5, 2015, 11:03:31 AM
to scrapy-users

Sounds like the site is detecting that you're scraping and trying to prevent it. I'd suggest looking into user agent middlewares to mimic a browser UA string.
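
A minimal sketch of such a middleware, under some assumptions: the class name BrowserUserAgentMiddleware, the list BROWSER_USER_AGENTS, and the UA strings in it are illustrative and not taken from this thread. A Scrapy downloader middleware is a plain class with a process_request(request, spider) method, so the sketch does not need to import Scrapy.

```python
import random

# Illustrative browser-like User-Agent strings (assumed values, swap in your own).
BROWSER_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0",
]


class BrowserUserAgentMiddleware(object):
    """Downloader middleware that puts a browser-like User-Agent on every request."""

    def process_request(self, request, spider):
        # Overwrite whatever User-Agent the request currently carries.
        request.headers["User-Agent"] = random.choice(BROWSER_USER_AGENTS)
        # Returning None lets Scrapy continue handling the request normally.
        return None


# To enable it, settings.py would need an entry along these lines
# (the module path "myproject.middlewares" is hypothetical):
#   DOWNLOADER_MIDDLEWARES = {
#       "myproject.middlewares.BrowserUserAgentMiddleware": 543,
#   }
```

If a single fixed string is enough, setting USER_AGENT in settings.py is likely the simpler route, since Scrapy's built-in user agent middleware applies it to requests that do not already set the header.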

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users...@googlegroups.com.
To post to this group, send email to scrapy...@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.

Gaurang shah

Mar 9, 2015, 1:06:56 PM
to scrapy...@googlegroups.com
Hi Travis, 

Thanks for the advice, it worked. Now I am able to scrape the page.

I have posted questions on this forum before but did not get any helpful replies, so I thought the forum was inactive and did not expect an answer to this one. Thanks to you, my problem is resolved.

Thanks a lot.

Gaurang Shah
Blog: qtp-help.blogspot.com
Mobile: +91 738756556
