Splash spider never completes

Sean Keane

Dec 7, 2016, 10:30:22 PM
to scrapy...@googlegroups.com
I created a spider that uses Splash, and it never seems to complete: it runs for two days and then I finally stop it.

I have the following settings for my spider:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'


Can someone provide some advice on how I should debug the issue?

Thanks

Sean 

Paul Tremberth

Dec 12, 2016, 4:19:16 AM
to scrapy-users
Hi Sean Keane,

I believe you need to tell us a bit more about the type of crawl you are doing.
Is it a broad crawl with lots of domains?
Is it a CrawlSpider with rules that can pick up a lot of pages?

What about the download rate: is it stable, or does the crawl slow down over time?

While the crawl is running, if you're on Python 2, you could also use the telnet console to check what's going on.

/Paul

Sean Keane

Dec 12, 2016, 10:14:13 PM
to scrapy...@googlegroups.com
Paul,

It's a CrawlSpider. With the rules I defined, I only expect it to crawl about 1500 pages, but after two days I can see that it has crawled 8000 pages. I'm using scrapyd to run my spider and I track what's going on via the logs. The download rate of the spider looks consistent. I suspect that it's crawling the same page multiple times. I guess I will have to log which pages it's crawling to determine whether that is the case.
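
A minimal sketch of that check, assuming the log was written at DEBUG level so that Scrapy's "Crawled (200) <GET ...>" lines are present. The function names and sample lines are hypothetical, and the optional " via ..." suffix in the pattern is an assumption about how scrapy-splash prints requests that go through Splash. It only counts how often each URL shows up in the log; it does not explain why a page is repeated.

```python
import re
from collections import Counter

# Matches Scrapy's per-request debug lines, e.g.
#   DEBUG: Crawled (200) <GET http://example.com/a> (referer: None)
# The optional " via <splash url>" part is an assumption about how
# scrapy-splash formats requests routed through Splash.
CRAWLED_RE = re.compile(r"Crawled \((\d+)\) <(\w+) (\S+?)(?: via \S+)?>")


def count_crawled_urls(lines):
    """Count how many times each URL appears in 'Crawled (...)' log lines."""
    counts = Counter()
    for line in lines:
        match = CRAWLED_RE.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts


def repeated_urls(counts):
    """Return only the URLs that were crawled more than once."""
    return {url: n for url, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical sample lines; in practice, iterate over the job's log file.
    sample = [
        "DEBUG: Crawled (200) <GET http://example.com/a> (referer: None)",
        "DEBUG: Crawled (200) <GET http://example.com/a> (referer: None)",
        "DEBUG: Crawled (200) <GET http://example.com/b> (referer: None)",
    ]
    for url, n in repeated_urls(count_crawled_urls(sample)).items():
        print(n, url)
```

If the same URL appears many times, the next thing to look at is probably whether the requests differ in some way the dupefilter treats as significant (for example query strings or Splash arguments).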

thank you,

Sean