Splash spider never completes


Sean Keane

2016-12-07 22:30:22
to: scrapy...@googlegroups.com
I created a spider that uses Splash, and it seems to never complete; i.e., it runs for two days and then I finally stop it.

I have the following settings for my spider:

    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Can someone provide some advice on how I should debug the issue?



Paul Tremberth

2016-12-12 04:19:16
to: scrapy-users
Hi Sean Keane,

I believe you need to tell us a bit more on the type of crawl you are doing.
Is it a broad crawl with lots of domains?
Is it a CrawlSpider with rules that can pick up a lot of pages?

What about the download rate: do you see it stable or does the crawl slow down?

While the crawl is running, if you're on Python 2, you could also use the telnet console to check what's going on.


Sean Keane

2016-12-12 22:14:13
to: scrapy...@googlegroups.com

It's a CrawlSpider. With the rules defined, I am only expecting it to crawl about 1500 pages, but after two days I can see that it has crawled 8000 pages. I'm using scrapyd to run my spider and track what's going on via the logs. The download rate of the spider looks to be consistent. I suspect that it's crawling the same page multiple times. I guess I will have to log which page it's crawling to determine if that is the case.
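
One way to check the "same page multiple times" suspicion without changing the spider is to count URLs in the existing scrapyd log. The sketch below assumes the log was written at DEBUG level and contains Scrapy's usual "Crawled (200) <GET url> ..." lines; the function names and the log path are hypothetical, and the exact shape of the request text for Splash requests (for example a trailing "via <splash endpoint>") is an assumption that the pattern only tolerates, not relies on.

```python
import re
from collections import Counter

# Matches Scrapy's DEBUG line for a downloaded response, for example:
#   ... DEBUG: Crawled (200) <GET http://example.com/a> (referer: None)
# Only the first URL after the HTTP method is captured, so any extra
# text inside the angle brackets (assumed possible with Splash) is ignored.
CRAWLED_RE = re.compile(r"Crawled \(\d+\) <\w+ ([^\s>]+)")


def count_crawled_urls(lines):
    """Return a Counter mapping each crawled URL to how often it was logged."""
    counts = Counter()
    for line in lines:
        match = CRAWLED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def repeated_urls(counts):
    """Keep only the URLs that were logged more than once."""
    return {url: n for url, n in counts.items() if n > 1}


def report_repeats(log_path):
    """Print repeated URLs from a log file, most frequent first."""
    with open(log_path) as log_file:
        counts = count_crawled_urls(log_file)
    repeats = repeated_urls(counts)
    for url, n in sorted(repeats.items(), key=lambda item: -item[1]):
        print(n, url)
```

Calling report_repeats() with the path of the job's log file (a hypothetical path such as "logs/myproject/myspider/job.log") would list the URLs fetched more than once. If the list is empty but the page count is still far above 1500, the extra pages are probably distinct URLs (for example the same page with different query strings or session parameters), which points at the link extraction rules rather than the dupefilter.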

Thank you,
