Implementation of persistent request queue


Milan Munzar

Jun 21, 2012, 1:52:57 PM
to scrapy-users
Hi

In my project I'm following all the links on the current page (hundreds of
requests in one crawl), and I do this for every relevant page I find. It
seems I need to implement a persistent request queue for the scheduler.
I'd appreciate any suggestions on my proposal.

My idea would simply be to implement a URL-storing database fed by a
spider middleware. This database would in turn feed a memory queue of
limited length, and from there everything would proceed as normal.

First I would try it without the memory queue to see whether it is fast
enough. I was thinking of using SQLite.
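As a rough illustration of that idea, a spider middleware could intercept the requests a spider yields and park their URLs in SQLite. This is only a sketch, assuming a recent Scrapy: the class name, the pending table and the REQUEST_DB setting are invented here, and the other half of the design (pulling URLs back out of the database into a bounded memory queue and re-scheduling them) is left out. It would be enabled through the SPIDER_MIDDLEWARES setting.

    import sqlite3
    from scrapy.http import Request

    class SqliteRequestStoreMiddleware(object):
        # Spider middleware sketch: divert outgoing request URLs into SQLite.

        def __init__(self, db_path):
            self.conn = sqlite3.connect(db_path)
            self.conn.execute(
                'CREATE TABLE IF NOT EXISTS pending (url TEXT PRIMARY KEY)')

        @classmethod
        def from_crawler(cls, crawler):
            # REQUEST_DB is a made-up setting name used only in this sketch.
            return cls(crawler.settings.get('REQUEST_DB', 'requests.db'))

        def process_spider_output(self, response, result, spider):
            for request_or_item in result:
                if isinstance(request_or_item, Request):
                    # Park the URL in SQLite instead of passing the request
                    # straight on to the scheduler; something else has to
                    # feed it back into the crawl later.
                    self.conn.execute(
                        'INSERT OR IGNORE INTO pending (url) VALUES (?)',
                        (request_or_item.url,))
                    self.conn.commit()
                else:
                    yield request_or_item

One caveat with a URL-only table: the callback, headers and meta of each request are lost, which is part of why Scrapy's own disk queues and scrapy-redis serialize whole request objects rather than bare URLs.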

I also wonder what SCHEDULER_DISC_QUEUE does.

Thanks





Martin Loy

Jun 25, 2012, 5:27:08 PM
to scrapy...@googlegroups.com
Hi, try scrapy-redis :) It does what you're looking for :)

pip install scrapy-redis
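Hooking it in is a matter of a few settings; roughly the following, per the scrapy-redis README (exact setting names can differ between versions, and the Redis host/port values are only examples):

    # settings.py -- route scheduling and duplicate filtering through Redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    SCHEDULER_PERSIST = True      # keep the queue in Redis between runs
    REDIS_HOST = "localhost"
    REDIS_PORT = 6379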

Regards

M









--
There never was a friend who did a dwarf a favour, nor an enemy who did him a wrong, that was not repaid in full.

Shane Evans

Jul 2, 2012, 9:45:29 PM
to scrapy...@googlegroups.com
Depending on available memory and your request object size, you should typically get anywhere from tens of thousands to tens of millions of requests in memory. If you are worried about memory pressure, it's probably wise to test, unless your numbers are clearly in excess of this.
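A back-of-the-envelope way to sanity-check that range, assuming (purely for illustration) that an in-memory Request with its URL, headers, callback reference and meta costs on the order of a kilobyte:

    # Rough estimate only; measure your own spider if memory is tight.
    bytes_available = 1 * 1024 ** 3      # say ~1 GB spare for the queue
    bytes_per_request = 1024             # assumed average Request footprint
    print(bytes_available // bytes_per_request)   # ~1 million requests

A request that is ten times heavier or lighter shifts that to roughly a hundred thousand or ten million, which is where the range above comes from.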

The latest scrapy has support for serializing requests to disk, which happens during the crawl and will reduce memory usage if the number of outstanding requests is large. See http://scrapy.readthedocs.org/en/latest/topics/jobs.html for more details.
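In practice that means setting JOBDIR, either on the command line (the linked page uses the form scrapy crawl somespider -s JOBDIR=crawls/somespider-1) or in settings.py; the path below is only an example:

    # settings.py -- persist scheduler queues and spider state under this dir
    JOBDIR = 'crawls/myspider-1'

This is also where the setting from the original question comes in: it is spelled SCHEDULER_DISK_QUEUE, and it names the queue class used for the on-disk side of the scheduler (by default a pickle-based LIFO disk queue; the exact class path varies between Scrapy versions, so check your version's default settings before overriding it).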

As Martin mentioned, scrapy-redis is another option. It's particularly useful if you want to share state between more than one scrapy process.

If you do want to implement something yourself, it's worth looking at the code for scrapy-redis and the persistent job state, as they provide good examples of how to hook into scrapy to manage the requests.
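For orientation, the surface a scheduler replacement has to cover is fairly small; the skeleton below is modelled on the methods Scrapy's own scheduler and scrapy-redis expose (from_crawler, open, close, enqueue_request, next_request, has_pending_requests, __len__). The backing store here is just a Python list so the sketch runs as-is; a persistent version would swap it for SQLite, Redis or the on-disk queues, and would also need to handle the duplicate filter, which is omitted.

    class PersistentSchedulerSketch(object):
        # Minimal stand-in showing the scheduler interface; not persistent yet.

        def __init__(self, settings):
            self.settings = settings
            self.queue = []          # swap for a persistent queue

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings)

        def open(self, spider):
            # Load any previously persisted requests for this spider here.
            self.spider = spider

        def close(self, reason):
            # Flush whatever is still queued back to storage here.
            pass

        def enqueue_request(self, request):
            self.queue.append(request)
            return True

        def next_request(self):
            # Return None when there is nothing left to crawl.
            return self.queue.pop() if self.queue else None

        def has_pending_requests(self):
            return len(self) > 0

        def __len__(self):
            return len(self.queue)

It gets wired in through the SCHEDULER setting, e.g. SCHEDULER = 'myproject.scheduler.PersistentSchedulerSketch' (the module path is hypothetical).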
