CloseSpider extension does not close spider immediately


Balthazar Rouberol

Mar 25, 2013, 9:06:29 AM
to scrapy...@googlegroups.com
Hi all,

I'm writing a small test spider that scrapes only 5 pages of a website and then shuts itself down.
To do so, I'm using the standard extension scrapy.contrib.closespider.CloseSpider, with CLOSESPIDER_PAGECOUNT = 5 defined in settings.py.

My spider indeed closes itself, but only after having crawled 20 pages:

    {'downloader/request_bytes': 5932,
     'downloader/request_count': 20,
     'downloader/request_method_count/GET': 20,
     'downloader/response_bytes': 693738,
     'downloader/response_count': 20,
     'downloader/response_status_count/200': 19,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'closespider_pagecount',
     'finish_time': datetime.datetime(2013, 3, 25, 13, 1, 27, 182406),
     'item_scraped_count': 18,
     'log_count/DEBUG': 44,
     'log_count/INFO': 4,
     'request_depth_max': 2,
     'response_received_count': 19,
     'scheduler/dequeued': 20,
     'scheduler/dequeued/memory': 20,
     'scheduler/enqueued': 75,
     'scheduler/enqueued/memory': 75,
     'start_time': datetime.datetime(2013, 3, 25, 13, 1, 23, 337858)}

Is this behaviour normal?

Thanks in advance

--
Balthazar



Balthazar Rouberol

Mar 25, 2013, 9:20:15 AM
to scrapy...@googlegroups.com
I ran the same test with CLOSESPIDER_PAGECOUNT = 100 and 135.
The spider stopped with 'downloader/request_count' values of 115 and 150, respectively.

It seems that whatever value of CLOSESPIDER_PAGECOUNT I choose, there is always an offset of 15 when the spider stops.

Pablo Hoffman

Apr 11, 2013, 6:50:15 PM
to scrapy...@googlegroups.com
It depends on how many concurrent requests are running. When the page count reaches the limit (100, say), a *request* to stop the spider is triggered, but the spider needs to finish the downloads already in progress before shutting down. This lets the spider finish in a more consistent state, where every request is either fully processed or still in the scheduler (i.e. no partially processed requests).
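
To illustrate the explanation, here is a toy model of that shutdown behaviour. It is not Scrapy's actual code, only a sketch: the function `simulate_crawl` and its parameters are hypothetical names, and it assumes the downloader keeps a fixed number of slots full, that responses come back one at a time, and that 16 requests run concurrently (which I believe is the default for the CONCURRENT_REQUESTS setting, but check your version).

```python
from collections import deque


def simulate_crawl(pagecount_limit, concurrency, total_available=1000):
    """Toy model (not Scrapy internals) of a graceful page-count shutdown.

    Returns (requests_sent, responses_received).
    """
    in_flight = deque()      # requests currently being downloaded
    requests_sent = 0
    responses_received = 0
    closing = False          # set once the page-count limit is reached

    while True:
        # Keep the download slots full, unless a close has been requested
        # or the scheduler has nothing left.
        while (not closing
               and len(in_flight) < concurrency
               and requests_sent < total_available):
            requests_sent += 1
            in_flight.append(requests_sent)

        if not in_flight:
            break  # nothing left to wait for: the spider is closed

        # One download completes.
        in_flight.popleft()
        responses_received += 1

        # Reaching the limit only *requests* a close; the requests that
        # are already in flight are still allowed to finish.
        if responses_received >= pagecount_limit:
            closing = True

    return requests_sent, responses_received


if __name__ == "__main__":
    for limit in (5, 100, 135):
        sent, _ = simulate_crawl(limit, concurrency=16)
        print(limit, sent)
    # prints:
    # 5 20
    # 100 115
    # 135 150
```

Under these assumptions the overshoot is always `concurrency - 1`, which would match the constant offset of 15 reported above. In the same model, a concurrency of 1 stops exactly at the limit, so lowering CONCURRENT_REQUESTS in settings.py should shrink the overshoot, at the cost of a slower crawl.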

