Splash spider never completes

Sean Keane

2016-12-07 22:30:22
to: scrapy...@googlegroups.com
I created a spider that uses Splash, and it never seems to complete: it runs for two days and then I finally stop it.

I have the following settings for my spider:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
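
For comparison, the scrapy-splash README pairs the two settings above with several others. A sketch of the fuller block follows; the SPLASH_URL value is an assumption (a Splash instance on the local machine) and is not taken from this thread, so it would need to be adjusted to the actual deployment:

```python
# settings.py fragment, following the scrapy-splash README.
# Assumption: Splash is reachable at this address; change it as needed.
SPLASH_URL = 'http://localhost:8050'

# Downloader middlewares that route requests through Splash.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# The two settings already quoted in this post.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Only relevant if the HTTP cache is enabled.
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```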


Can someone provide some advice on how I should debug the issue?

Thanks

Sean 

Paul Tremberth

2016-12-12 04:19:16
to: scrapy-users
Hi Sean Keane,

I believe you need to tell us a bit more about the type of crawl you are doing.
Is it a broad crawl with lots of domains?
Is it a CrawlSpider with rules that can pick up a lot of pages?

What about the download rate: is it stable, or does the crawl slow down?

While the crawl is running, if you're on Python 2, you could also use the telnet console to check what's going on.

/Paul

Sean Keane

2016-12-12 22:14:13
to: scrapy...@googlegroups.com
Paul,

It's a CrawlSpider. With the rules defined, I am only expecting it to crawl about 1500 pages, but after two days I can see that it has crawled 8000 pages. I'm using scrapyd to run my spider and track what's going on via the logs. The download rate of the spider looks to be consistent. I suspect that it's crawling the same page multiple times. I guess I will have to log which pages it's crawling to determine if that is the case.
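
One way to check for repeated pages without changing the spider is to count URLs in the existing scrapyd log. The sketch below assumes the log was written at DEBUG level and contains Scrapy's usual "Crawled (200) <GET url>" lines; the function names and the sample lines are made up for illustration, and the pattern should be checked against the real log before relying on it:

```python
import re
from collections import Counter

# Assumed line shape: "... DEBUG: Crawled (200) <GET http://host/path> (referer: ...)".
# The URL is captured up to the first whitespace or ">".
CRAWLED_RE = re.compile(r"Crawled \(\d+\) <\w+ ([^\s>]+)")


def count_crawled_urls(log_lines):
    """Count how many times each URL appears in 'Crawled' log lines."""
    counts = Counter()
    for line in log_lines:
        match = CRAWLED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def repeated_urls(counts):
    """Return only the URLs that were crawled more than once."""
    return {url: n for url, n in counts.items() if n > 1}


# Illustrative sample lines (not from a real crawl).
SAMPLE_LOG = [
    "2016-12-12 22:14:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com/a> (referer: None)",
    "2016-12-12 22:14:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com/b> (referer: http://example.com/a)",
    "2016-12-12 22:14:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com/a> (referer: http://example.com/b)",
    "2016-12-12 22:14:16 [scrapy.core.engine] INFO: Closing spider (finished)",
]

if __name__ == "__main__":
    for url, n in repeated_urls(count_crawled_urls(SAMPLE_LOG)).items():
        print(n, url)
```

For a real run, the lines of the scrapyd log file would be passed in place of SAMPLE_LOG. Comparing raw URLs is only a first approximation: pages that differ only in query string or fragment are counted as distinct here.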

thank you,

Sean