Re: Reading start_urls from redis queue


Pedro

Apr 19, 2013, 7:20:16 AM
to scrapy...@googlegroups.com
It is not possible to tell the type of your 'server' attribute from the code you posted, but if it's a Redis object as defined in the redis package on PyPI, blpop takes a string as its argument, and you're passing a list.


On Thursday, 18 April 2013 20:14:22 UTC+2, Roy Klopper wrote:
Hi There,

This is my first post in this group; I haven't been able to find the right answer to my question yet, so here it is. I'm trying to implement a CrawlSpider that crawls indefinitely, and I would like to feed it new domains to crawl dynamically from a Redis list using the blocking pop (blpop) method.

When I implement it as follows, the spider iterates through the list but never handles the yielded Request objects. I use an empty start_urls list and define start_requests as follows:

    def start_requests(self):
        while True:
            source, domain = self.server.blpop(['domains'])
            if not domain:
                continue
            self.log(domain)
            yield Request("http://%s" % domain)

My Python skills are not that fabulous so I'm wondering if someone could point me in the right direction.

Regards,

Roy

Roy Klopper

Apr 19, 2013, 8:05:54 AM
to scrapy...@googlegroups.com
Hi Pedro,

The Redis connection is working properly; server is indeed an instance of redis.Redis(). The blpop method takes a list of keys to pop from, and the values returned by blpop are also correct. My problem is that blpop blocks (as the method is intended to), hence the crawler never starts any Request.

FYI (I use redis-py):
blpop(keys, timeout=0)
- LPOP a value off of the first non-empty list named in the keys list.
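As an aside, the blocking behaviour can be made concrete with a small stdlib sketch (no Redis server needed; FakeRedis and its methods only mimic redis-py's lpop/rpush and are not part of any library):

```python
from collections import deque

# In-memory stand-in for redis-py (FakeRedis is illustrative, not a real
# library class). redis-py's blpop(keys, timeout=0) blocks forever when
# every named list is empty, which stalls Scrapy's single-threaded
# reactor; a non-blocking lpop returns None and hands control back.
class FakeRedis:
    def __init__(self):
        self.lists = {}

    def lpop(self, key):
        # Non-blocking pop: None when the list is empty or missing.
        q = self.lists.get(key)
        return q.popleft() if q else None

    def rpush(self, key, value):
        self.lists.setdefault(key, deque()).append(value)

server = FakeRedis()
print(server.lpop('domains'))  # None: nothing queued, control returns at once
server.rpush('domains', 'example.com')
print(server.lpop('domains'))  # 'example.com'
```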



On Friday, 19 April 2013 14:20:16 UTC+3, Pedro wrote:

Rolando Espinoza La Fuente

Apr 19, 2013, 2:14:27 PM
to scrapy...@googlegroups.com
On Fri, Apr 19, 2013 at 8:05 AM, Roy Klopper <r.kl...@qbikz.com> wrote:

My problem is that blpop blocks (as the method is intended to), hence the crawler never starts any Request.

You are looking for the spider_idle signal and the DontCloseSpider exception; scrapy_redis contains an implementation of this pattern.

If it is suitable for your use case, you can use scrapy_redis's RedisSpider class directly.
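The spider_idle/DontCloseSpider control flow can be sketched roughly as follows. This is a stripped-down, runnable illustration with stdlib stand-ins for Scrapy's pieces (real code would import DontCloseSpider from scrapy.exceptions and connect a handler to signals.spider_idle via crawler.signals.connect); the class and method names here are illustrative, not scrapy_redis's actual code:

```python
from collections import deque

class DontCloseSpider(Exception):
    """Stand-in for scrapy.exceptions.DontCloseSpider."""

class RedisFedSpider:
    """Illustrative stand-in, not scrapy_redis's actual RedisSpider."""

    def __init__(self, queue):
        self.queue = queue    # stand-in for the Redis 'domains' list
        self.scheduled = []   # requests handed to the "engine"

    def next_requests(self):
        # Non-blocking drain: take whatever is queued right now, then stop.
        while self.queue:
            domain = self.queue.popleft()
            yield 'http://%s' % domain

    def on_idle(self):
        # Runs when the scheduler is empty (the spider_idle signal in
        # Scrapy). Feed any newly queued domains, then raise
        # DontCloseSpider so the engine keeps the spider alive.
        for url in self.next_requests():
            self.scheduled.append(url)
        raise DontCloseSpider

queue = deque(['example.com'])
spider = RedisFedSpider(queue)
try:
    spider.on_idle()
except DontCloseSpider:
    pass
print(spider.scheduled)  # ['http://example.com']
```

The key design point is that the pop is non-blocking: the idle handler drains whatever is queued and returns immediately, so the reactor never stalls, and DontCloseSpider keeps the crawl open for the next idle cycle.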

Regards,
Rolando

Roy Klopper

Apr 24, 2013, 8:29:00 AM
to scrapy...@googlegroups.com
Sweet! Exactly what I was looking for. I already use scrapy-redis for the scheduler. Thanks for developing it, Rolando! Hero of the day :-)

Regards,

Roy

On Friday, 19 April 2013 21:14:27 UTC+3, Rolando Espinoza La Fuente wrote:

Roy Klopper

Apr 26, 2013, 4:48:23 AM
to scrapy...@googlegroups.com
I haven't had time to implement it yet; I will this weekend. One question, though: I see you extend BaseSpider. Is it also possible to use CrawlSpider in combination with the Redis scheduler and the Redis spider?



On Wednesday, 24 April 2013 15:29:00 UTC+3, Roy Klopper wrote:

Rolando Espinoza La Fuente

Apr 26, 2013, 5:34:04 AM
to scrapy...@googlegroups.com

