request.replace turns into infinite loop.


Artur Daschevici

Nov 7, 2012, 12:57:45 PM
to scrapy...@googlegroups.com
Hello. I have searched for this but I haven't found any satisfactory answer.

This is my code. I found some replies that say this should work, but when I run it as part of my spider I get an infinite loop that apparently gets stuck on the line that prints rt.url.
I suspect there may be a problem with the priority of the middlewares, but I am not sure.

from auctionzip.settings import USER_AGENT_LIST
import random
from scrapy import log

class RandomUserAgentMiddleware(object):

    def process_request(self, request, spider):
        ua = random.choice(USER_AGENT_LIST)
        tmp = "http://www.freelancer.com"
        # build a new request pointing at a fixed test URL and return it
        rt = request.replace(url=tmp)
        #if ua:
        #    request.headers.setdefault('User-Agent', ua)
        #    request.headers.setdefault('Referrer', 'http://www.auctionzip.com')
        print rt.url
        return rt
        #log.msg('>>>> UA %s' % request.headers)

    def process_response(self, request, response, spider):
        print "in response handler"
        return response

This is the order of my middlewares:

DOWNLOADER_MIDDLEWARES = {
    'auctionzip.middlewares.random_user_agent.RandomUserAgentMiddleware': 100,
    #'wines_crawler.middlewares.tor_anonymizer.TorMiddleWare': 401,
    #'wines_crawler.middlewares.random_proxy.ProxyMiddleware': 401,
    #'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 402,
    'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 600,
    #'wines_crawler.middlewares.retry_change.RetryTORChangeProxyMiddleWare': 600,
    #'wines_crawler.middlewares.retry_change.RetryChangeProxyMiddleware': 600,
}

If anyone has an idea or can point me in the right direction, it would be very much appreciated.

Thank you.

Artur Daschevici

Nov 7, 2012, 4:04:56 PM
to scrapy...@googlegroups.com
After further inspection I realized that the request never returns from process_request(). I still need help with this, as it's still not working.
From what I've read it should work, but it doesn't seem to.

Pablo Hoffman

Nov 7, 2012, 4:36:13 PM
to scrapy...@googlegroups.com
You can't "replace" the request object in a downloader middleware (it's too late). When process_request() returns a Request, Scrapy stops the current request and reschedules the returned one through the middleware chain from the start, so a middleware that always returns a replaced request loops forever. Replace would only work in a spider middleware.

If you want to implement a random user agent middleware, just modify the header directly, like the builtin UserAgentMiddleware does.
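A minimal sketch of that approach, essentially turning the commented-out lines from the original snippet back on — mutate the request in place and return None so it continues through the chain:

from auctionzip.settings import USER_AGENT_LIST
import random

class RandomUserAgentMiddleware(object):

    def process_request(self, request, spider):
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            # mutate the existing request instead of replacing it
            request.headers.setdefault('User-Agent', ua)
        # returning None lets the request continue through the rest of
        # the downloader middleware chain and get downloaded normally
        return None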




Artur Daschevici

Nov 7, 2012, 6:42:30 PM
to scrapy...@googlegroups.com
The code was just for testing purposes; the random UA works fine.

What I am trying to do is load a local file as the response when the page has already been downloaded, and only do that when I pass an argument to the spider.
So I am trying to download a page like so:

class AuctionzipSpider(BaseSpider):

    name = 'auctionzip'
    #start_urls = ["http://www.google.com"]

    def __init__(self, name=None, **kwargs):
        BaseSpider.__init__(self, name)
        self.usrs = {}
        self.params = kwargs
        self.search_radius = '30'
        self.zipcode = '0146'
        if 'debug' in kwargs:
            self.debug = self.params['debug']
        else:
            self.debug = False
        if self.debug:
            print self.debug
        if 'search_radius' in kwargs:
            self.search_radius = self.params['search_radius']
        if 'zipcode' in kwargs:
            self.zipcode = self.params['zipcode']

    def start_requests(self):
        # URLS.SEARCH_URL is defined elsewhere in the project
        base_url = URLS.SEARCH_URL % (self.search_radius, self.zipcode)
        start_url = base_url
        print start_url
        yield self.make_requests_from_url(start_url)

And just as a test, instead of going to the URL defined in start_urls, I want it to download the Google home page, for example.
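One hedged sketch of how that could work: a downloader middleware's process_request() may return a Response object, which short-circuits the download entirely. The cache_path() helper and the debug flag checked here are illustrative assumptions, not part of the original project:

import os

from scrapy.http import HtmlResponse

def cache_path(url):
    # hypothetical helper mapping a URL to a local file path
    return os.path.join('cache', url.replace('/', '_'))

class LocalCacheMiddleware(object):

    def process_request(self, request, spider):
        path = cache_path(request.url)
        if getattr(spider, 'debug', False) and os.path.exists(path):
            body = open(path, 'rb').read()
            # returning a Response from process_request() stops the
            # download; process_response() of other middlewares still runs
            return HtmlResponse(url=request.url, body=body, request=request)
        # fall through to a normal download
        return None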
--
Regards,
Arthur

Artur Daschevici

Nov 9, 2012, 11:31:11 PM
to scrapy...@googlegroups.com
Thanks. I managed to get that working with the tips.
What I did was:

from scrapy.http import Request

class ChangeUrl(object):

    def process_start_requests(self, start_requests, spider):
        # spider middleware hook: runs over the requests produced by the
        # spider's start_requests() before they reach the scheduler
        print "inside spider middleware"

        for r in start_requests:
            if isinstance(r, Request):
                curl = "http://www.google.com"
                # replacing is fine here, before the request has been
                # handed to the downloader
                r = r.replace(url=curl)
            yield r

This can provide some fine-grained control over the caching of various scrapes.
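For completeness, a spider middleware only takes effect once it is enabled in the project settings; the module path below is an assumption based on the project layout in this thread:

SPIDER_MIDDLEWARES = {
    'auctionzip.middlewares.change_url.ChangeUrl': 500,
}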
--
Regards,
Arthur
