TCP connection timed out problem


Sungmin Lee

Oct 13, 2014, 8:32:43 PM
to scrapy...@googlegroups.com
Hi all,

I'm implementing a spider that works through a proxy, so I've overridden the proxy middleware. So far it works well.

What I ultimately want to achieve is:

1) assign a proxy
2) start scraping
3) when the proxy address is outdated, broken, etc., switch to a new, healthy proxy
4) continue scraping


The problem is that whenever a proxy address goes bad, Scrapy just hangs there waiting for a TCP response.
I wanted to use RetryMiddleware, but it doesn't help because Scrapy never returns a response, so there is no response.status to check.

2014-10-13 16:46:22-0700 [proxy_test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-10-13 16:46:53-0700 [proxy_test] DEBUG: Retrying <GET http://some/website> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2014-10-13 16:46:54-0700 [proxy_test] DEBUG: Retrying <GET http://some/website> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2014-10-13 16:46:54-0700 [proxy_test] DEBUG: Retrying <GET http://some/website> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2014-10-13 16:46:55-0700 [proxy_test] DEBUG: Retrying <GET http://some/website> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2014-10-13 16:46:56-0700 [proxy_test] DEBUG: Retrying <GET http://some/website> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2014-10-13 16:46:57-0700 [proxy_test] DEBUG: Retrying <GET http://some/website> (failed 1 times): TCP connection timed out: 60: Operation timed out.


Is there any way I can handle this timeout and switch to a new proxy when it happens?

Thanks!

lnxpgn

Oct 13, 2014, 10:10:45 PM
to scrapy...@googlegroups.com
Implement process_request(), process_response(), and process_exception() in your own proxy middleware, and disable the built-in proxy middleware. If a proxy is outdated, return the Request again, either from process_response() (based on the HTTP status code) or from process_exception(). The Request will then be processed again, including being assigned a new proxy.
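
A minimal sketch of that approach, not a definitive implementation. The proxy list, the set of "bad" status codes, the retry limit, the class name, and the "proxy_retries" meta key are all assumptions made for illustration; only the three hook methods and the "proxy" meta key come from Scrapy itself.

```python
import random

# Hypothetical proxy pool; replace with your own source of proxies.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

# Assumption: these status codes mean "the proxy is bad". Tune for your setup.
BAD_PROXY_STATUSES = {403, 407, 502, 503, 504}

# Assumption: give up after this many proxy switches for a single request.
MAX_PROXY_RETRIES = 5


class RotatingProxyMiddleware:
    """Downloader middleware that assigns a proxy and swaps it on failure."""

    def __init__(self, proxies=None):
        self.proxies = list(proxies or PROXIES)

    def _pick_proxy(self, exclude=None):
        # Prefer any proxy other than the one that just failed.
        candidates = [p for p in self.proxies if p != exclude]
        return random.choice(candidates or self.proxies)

    def _retry_with_new_proxy(self, request):
        retries = request.meta.get("proxy_retries", 0) + 1
        if retries > MAX_PROXY_RETRIES:
            return None  # limit reached, let the caller decide what to do
        failed = request.meta.get("proxy")
        # dont_filter=True so the duplicate filter does not drop the retry.
        retry = request.replace(dont_filter=True)
        retry.meta["proxy"] = self._pick_proxy(exclude=failed)
        retry.meta["proxy_retries"] = retries
        return retry

    def process_request(self, request, spider):
        # Assign a proxy only if the request does not carry one yet.
        request.meta.setdefault("proxy", self._pick_proxy())
        return None  # continue normal processing

    def process_response(self, request, response, spider):
        if response.status in BAD_PROXY_STATUSES:
            # Fall back to the original response once the limit is reached.
            return self._retry_with_new_proxy(request) or response
        return response

    def process_exception(self, request, exception, spider):
        # This sketch retries on every download exception. To be stricter,
        # check isinstance(exception, ...) against the timeout errors in
        # twisted.internet.error (e.g. TCPTimedOutError, TimeoutError).
        # Returning None hands the exception on to other middlewares.
        return self._retry_with_new_proxy(request)
```

To use it, register the class in DOWNLOADER_MIDDLEWARES and set the built-in HttpProxyMiddleware entry to None; the exact import path of the built-in middleware depends on your Scrapy version. Lowering DOWNLOAD_TIMEOUT may also shorten the wait before a dead proxy is detected.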

