request.replace turns into infinite loop.


Artur Daschevici

Nov 7, 2012, 12:57:45 PM
to scrapy...@googlegroups.com
Hello. I have searched for this but I haven't found any satisfactory answer.

This is my code. I found some replies that say this should work, but when I run it as part of my spider I get an infinite loop that apparently gets stuck on the line that prints rt.url.
I suspect there may be a problem with the priority of the middlewares, but I am not sure.

from auctionzip.settings import USER_AGENT_LIST
import random
from scrapy import log

class RandomUserAgentMiddleware(object):

    def process_request(self, request, spider):
        ua = random.choice(USER_AGENT_LIST)
        tmp = "http://www.freelancer.com"
        # build a new request pointing at a fixed test URL and return it
        rt = request.replace(url=tmp)
        #if ua:
        #    request.headers.setdefault('User-Agent', ua)
        #    request.headers.setdefault('Referrer', 'http://www.auctionzip.com')
        print rt.url
        return rt
        #log.msg('>>>> UA %s' % request.headers)

    def process_response(self, request, response, spider):
        print "in response handler"
        return response

This is the order of my middlewares:

DOWNLOADER_MIDDLEWARES = {
    'auctionzip.middlewares.random_user_agent.RandomUserAgentMiddleware': 100,
    #'wines_crawler.middlewares.tor_anonymizer.TorMiddleWare': 401,
    #'wines_crawler.middlewares.random_proxy.ProxyMiddleware': 401,
    #'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 402,
    'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 600,
    #'wines_crawler.middlewares.retry_change.RetryTORChangeProxyMiddleWare': 600,
    #'wines_crawler.middlewares.retry_change.RetryChangeProxyMiddleware': 600,
}

If anyone has an idea or can point me in the right direction, it would be very much appreciated.

Thank you.

Artur Daschevici

Nov 7, 2012, 4:04:56 PM
to scrapy...@googlegroups.com
After further inspection I realized that the request never returns from process_request(). I still need help with this, as it's still not working.
From what I've read it should work, but it doesn't seem to.

Pablo Hoffman

Nov 7, 2012, 4:36:13 PM
to scrapy...@googlegroups.com
You can't "replace" the request object in a downloader middleware (it's too late). When process_request() returns a Request, Scrapy stops the current request and reschedules the returned one through the middleware chain from the start, so a middleware that always returns a replaced request loops forever. Replace would only work in a spider middleware.

If you want to implement a random user agent middleware, just modify the header directly, like the builtin UserAgentMiddleware does.
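A minimal sketch of that approach, essentially turning the commented-out lines from the original snippet back on — mutate the request in place and return None so it continues through the chain:

from auctionzip.settings import USER_AGENT_LIST
import random

class RandomUserAgentMiddleware(object):

    def process_request(self, request, spider):
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            # mutate the existing request instead of replacing it
            request.headers.setdefault('User-Agent', ua)
        # returning None lets the request continue through the rest of
        # the downloader middleware chain and get downloaded normally
        return None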




Artur Daschevici

Nov 7, 2012, 6:42:30 PM
to scrapy...@googlegroups.com
The code was just for testing purposes; the random UA works fine.

What I am trying to do is load a local file as the response when the page has already been downloaded, and only do that when I pass an argument to the spider.
So I am trying to download a page like so:

class AuctionzipSpider(BaseSpider):

    name = 'auctionzip'
    #start_urls = ["http://www.google.com"]

    def __init__(self, name=None, **kwargs):
        BaseSpider.__init__(self, name)
        self.usrs = {}
        self.params = kwargs
        self.search_radius = '30'
        self.zipcode = '0146'
        if 'debug' in kwargs:
            self.debug = self.params['debug']
        else:
            self.debug = False
        if self.debug:
            print self.debug
        if 'search_radius' in kwargs:
            self.search_radius = self.params['search_radius']
        if 'zipcode' in kwargs:
            self.zipcode = self.params['zipcode']

    def start_requests(self):
        # URLS.SEARCH_URL is defined elsewhere in the project
        base_url = URLS.SEARCH_URL % (self.search_radius, self.zipcode)
        start_url = base_url
        print start_url
        yield self.make_requests_from_url(start_url)

And just as a test, instead of going to the URL defined in start_urls, I want it to download the Google home page, for example.
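One hedged sketch of how that could work: a downloader middleware's process_request() may return a Response object, which short-circuits the download entirely. The cache_path() helper and the debug flag checked here are illustrative assumptions, not part of the original project:

import os

from scrapy.http import HtmlResponse

def cache_path(url):
    # hypothetical helper mapping a URL to a local file path
    return os.path.join('cache', url.replace('/', '_'))

class LocalCacheMiddleware(object):

    def process_request(self, request, spider):
        path = cache_path(request.url)
        if getattr(spider, 'debug', False) and os.path.exists(path):
            body = open(path, 'rb').read()
            # returning a Response from process_request() stops the
            # download; process_response() of other middlewares still runs
            return HtmlResponse(url=request.url, body=body, request=request)
        # fall through to a normal download
        return None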
--
Regards,
Arthur

Artur Daschevici

Nov 9, 2012, 11:31:11 PM
to scrapy...@googlegroups.com
Thanks. I managed to get that working with the tips.
What I did was:

from scrapy.http import Request

class ChangeUrl(object):

    def process_start_requests(self, start_requests, spider):
        # spider middleware hook: runs over the requests produced by the
        # spider's start_requests() before they reach the scheduler
        print "inside spider middleware"

        for r in start_requests:
            if isinstance(r, Request):
                curl = "http://www.google.com"
                # replacing is fine here, before the request has been
                # handed to the downloader
                r = r.replace(url=curl)
            yield r

This can provide some fine-grained control over the caching of various scrapes.
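For completeness, a spider middleware only takes effect once it is enabled in the project settings; the module path below is an assumption based on the project layout in this thread:

SPIDER_MIDDLEWARES = {
    'auctionzip.middlewares.change_url.ChangeUrl': 500,
}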
--
Regards,
Arthur
