Do pipelines block Scrapy from crawling?

Lee H.

Aug 29, 2015, 11:49:01 AM
to scrapy-users
If I have a really slow pipeline, say one that writes items out to a database on a really slow remote server, what would happen? Would the items just stack up in memory until they are finally processed (meaning my only problem might be memory), or would Scrapy's crawling of pages halt because of this too?

I'm thinking that when an item is passed to a pipeline's `process_item` method, Scrapy just carries on to the next request regardless of what happens in the pipeline?

I'm using an MS-SQL writer pipeline based on dirbot-mysql, adapted to MS-SQL. I'm trying to understand the real advantage of using twisted adbapi, though. I understand it speeds up writing items to the db: it works asynchronously, switching between connections in the pool, so if one connection starts blocking it jumps to another that isn't. But that only affects the item-writing phase, right? If the pipeline isn't blocking the crawl anyway, then so what?
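
For reference, here is a minimal sketch of the adbapi pattern I mean (the pymssql driver, connection settings, and table/column names below are placeholders I made up, not my real ones):

```python
import logging

from twisted.enterprise import adbapi


class MssqlWriterPipeline(object):

    def __init__(self):
        # adbapi keeps a pool of connections and runs each query in a
        # thread from Twisted's thread pool, so a slow query doesn't
        # block the reactor (and thus doesn't block the crawl).
        self.dbpool = adbapi.ConnectionPool(
            'pymssql',
            server='myserver',      # placeholder connection settings
            user='user',
            password='secret',
            database='scraping',
        )

    def process_item(self, item, spider):
        # runInteraction returns a Deferred; Scrapy waits on it for this
        # particular item while the reactor keeps servicing requests.
        d = self.dbpool.runInteraction(self._do_insert, item)
        d.addErrback(self._handle_error, item, spider)
        # Hand the item on to the next pipeline stage once the write is done.
        d.addCallback(lambda _: item)
        return d

    def _do_insert(self, cursor, item):
        # Placeholder table and columns.
        cursor.execute(
            "INSERT INTO items (url, title) VALUES (%s, %s)",
            (item['url'], item['title']),
        )

    def _handle_error(self, failure, item, spider):
        logging.error('DB write failed for %r: %s', item, failure)

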


Lee H.

Aug 30, 2015, 12:00:17 AM
to scrapy-users
OK, I see now that if I don't use Twisted's adbapi and blocking occurs in the pipeline (e.g. if I artificially add a `time.sleep(100)`), the whole of Scrapy stops until the 100 seconds are over. After all, Scrapy is single-threaded, just asynchronous, so if the pipeline blocks like this, everything blocks. Whereas if I use Twisted's adbapi and add an artificial block like this, Twisted just moves on to a non-blocking task (like a Scrapy Request or something) and the spider can march onward.
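
To illustrate, here is a toy version of the two cases I tried (the class and method names are just made up for the example):

```python
import time

from twisted.internet.threads import deferToThread


class BlockingSleepPipeline(object):
    def process_item(self, item, spider):
        # Runs in the reactor thread: while this sleeps, nothing else in
        # Scrapy runs, because the whole thing is single-threaded.
        time.sleep(100)
        return item


class DeferredSleepPipeline(object):
    def process_item(self, item, spider):
        # The sleep happens in a thread-pool thread instead; Scrapy only
        # waits on the Deferred for this one item, and the reactor keeps
        # scheduling requests and responses in the meantime.
        return deferToThread(self._slow_write, item)

    def _slow_write(self, item):
        time.sleep(100)   # stand-in for a slow database write
        return item
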

I'm still curious though. If I had a really slow db and used adbapi, what would happen? In my experiments it seems that all items simply pile up at the end and carry on getting written to the db, with the extra writing time (perhaps inflated by my artificial delays) added on top of the scrape time without the pipeline. Are there any other concerns?

In particular, I'm worried about:

1) Does Scrapy automatically throttle the crawler if too many items pile up, and if so, is that a concern anyway?
2) Could this lead to memory issues? (Is it just items that would pile up, or would Requests/Responses end up hanging around too?)

Artur Gaspar

Aug 31, 2015, 9:07:31 PM
to scrapy-users
Yes, Scrapy does stop the crawler if too many requests are being processed, including in the pipelines. I have had it happen once: a service that a pipeline depended on went down, and so did the entire crawler; as soon as the service was up again, crawling resumed.

The code responsible for this is in scraper.py. Requests keep being fed into the scraper slot until it reaches max_active_size, and a request only leaves the slot after its callback has been called and its output has been fully processed, including through the pipelines. Because of max_active_size, memory usage will not grow forever.
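
Roughly, the relevant logic looks like this (a paraphrase from memory of scrapy/core/scraper.py around Scrapy 1.0, not the verbatim source):

```python
from collections import deque


class Slot(object):
    """Scraper slot: tracks responses currently being processed."""

    MIN_RESPONSE_SIZE = 1024

    def __init__(self, max_active_size=5000000):
        self.max_active_size = max_active_size   # roughly 5 MB of response bodies
        self.queue = deque()
        self.active_size = 0

    def add_response_request(self, response, request):
        # Every response in flight counts its body size against the slot.
        self.queue.append((response, request))
        self.active_size += max(len(response.body), self.MIN_RESPONSE_SIZE)

    def finish_response(self, response, request):
        # Only decremented after the callback output has been processed,
        # including every Deferred returned by the item pipelines.
        self.active_size -= max(len(response.body), self.MIN_RESPONSE_SIZE)

    def needs_backout(self):
        # While this is True, the engine stops pulling new requests from
        # the scheduler, which is what pauses the crawl.
        return self.active_size > self.max_active_size
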