Hi all,
I'm implementing a spider that works through a proxy, so I've written my own proxy middleware (a rough sketch is below the list). So far it works fine.
What I ultimately want to achieve is:
1) assign a proxy
2) start scraping
3) when the proxy address becomes outdated, broken, etc., switch to a new healthy proxy
4) continue scraping
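
For reference, this is roughly what my middleware looks like at the moment (the proxy addresses are placeholders for my real pool):

import random

class RotatingProxyMiddleware(object):
    # Placeholder addresses; in my real code these come from a proxy pool.
    PROXIES = [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
    ]

    def process_request(self, request, spider):
        # Assign a proxy once and keep it until it breaks (steps 1 and 2 above).
        request.meta.setdefault('proxy', random.choice(self.PROXIES))

I enable it through DOWNLOADER_MIDDLEWARES in settings.py.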
The problem is that whenever a proxy address goes bad, Scrapy just hangs there waiting for the TCP connection to time out.
I wanted to use the retry middleware (RetryMiddleware), but it doesn't help here, because Scrapy never gets a response back at all, so there is no response.status for RETRY_HTTP_CODES to match against.
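
For reference, the retry-related settings I've been experimenting with look roughly like this (the exact values are just my current guesses):

RETRY_ENABLED = True
RETRY_TIMES = 5          # retry a few times before giving up on a request
DOWNLOAD_TIMEOUT = 30    # hoping to fail faster than the OS-level TCP timeout

With or without these, this is what the log shows while the spider sits there: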
2014-10-13 16:46:22-0700 [proxy_test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-10-13 16:46:53-0700 [proxy_test] DEBUG: Retrying <GET http://some/website> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2014-10-13 16:46:54-0700 [proxy_test] DEBUG: Retrying <GET http://some/website> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2014-10-13 16:46:54-0700 [proxy_test] DEBUG: Retrying <GET http://some/website> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2014-10-13 16:46:55-0700 [proxy_test] DEBUG: Retrying <GET http://some/website> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2014-10-13 16:46:56-0700 [proxy_test] DEBUG: Retrying <GET http://some/website> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2014-10-13 16:46:57-0700 [proxy_test] DEBUG: Retrying <GET http://some/website> (failed 1 times): TCP connection timed out: 60: Operation timed out.
Is there any way I can handle this timeout issue and swap proxies when it happens?
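For example, would catching the failure in process_exception of my middleware and swapping the proxy there be a reasonable approach? Something along these lines (untested, extending the sketch above):

import random

from twisted.internet.error import ConnectionRefusedError, TCPTimedOutError

class RotatingProxyMiddleware(object):
    # ... PROXIES and process_request as in the sketch above ...

    def process_exception(self, request, exception, spider):
        # When the current proxy can't be reached, switch to another one
        # and return the request so Scrapy re-schedules it.
        if isinstance(exception, (TCPTimedOutError, ConnectionRefusedError)):
            request.meta['proxy'] = random.choice(self.PROXIES)
            request.dont_filter = True  # so the dupefilter doesn't drop the retried request
            return request

Or is there a better built-in way to do this?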
Thanks!