Splash spider never completes

Sean Keane

Dec 7, 2016, 10:30:22 PM

to scrapy...@googlegroups.com

I created a spider that uses Splash, and it never seems to complete, i.e. it runs for two days and then I finally stop it.

I have the following settings for my spider:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Can someone provide some advice on how I should debug the issue?

Thanks

Sean

Paul Tremberth

Dec 12, 2016, 4:19:16 AM

to scrapy-users

Hi Sean Keane,

I believe you need to tell us a bit more about the type of crawl you are doing.

Is it a broad crawl with lots of domains?
Is it a CrawlSpider with rules that can pick up a lot of pages?

What about the download rate: is it stable, or does the crawl slow down over time?

While the crawl is running, if you're on Python 2, you could also use the telnet console to check what's going on:

https://doc.scrapy.org/en/latest/topics/telnetconsole.html

Hope this helps,

/Paul

Sean Keane

Dec 12, 2016, 10:14:13 PM

to scrapy...@googlegroups.com

Paul,

It's a CrawlSpider. With the rules I defined, I only expect it to crawl about 1500 pages, but after two days I can see that it has crawled 8000 pages. I'm using scrapyd to run my spider and tracking what's going on via the logs. The download rate of the spider looks to be consistent. I suspect that it's crawling the same page multiple times. I guess I will have to log which page it's crawling to determine if that is the case.
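
A minimal sketch of that duplicate check, in plain Python so it can be tried outside the spider. The `UrlTally` class and the example URLs are hypothetical, not part of Scrapy or scrapy-splash; the assumption is that the spider callback would call `tally.record(response.url)` for each page and log any URL whose count goes above 1.

```python
from collections import Counter


class UrlTally:
    """Counts how many times each URL has been seen (hypothetical helper)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, url):
        """Register one visit and return the running count for this URL."""
        self.counts[url] += 1
        return self.counts[url]

    def duplicates(self):
        """Return {url: count} for every URL seen more than once."""
        return {url: n for url, n in self.counts.items() if n > 1}


# Made-up URLs standing in for the response.url values seen by the callback.
tally = UrlTally()
for url in [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",
    "http://example.com/a?page=1",
]:
    tally.record(url)

print(tally.duplicates())
```

The comparison is on the exact URL string, so pages that differ only in query string or fragment count as distinct. If the tally shows few exact repeats but many near-identical URLs, that would point at the link extraction rules rather than at the dupefilter.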