Scrapy with https proxy


Oana Goga

Aug 25, 2011, 11:04:23 PM
to scrapy-users, oana...@lip6.fr
Hi,

I am trying to use Scrapy to access HTTPS pages through a proxy, but I cannot get it to work.
When I fetch or view https://www.paypal.com with Scrapy I get a 501 (Not Implemented) error, but when I fetch the same page with wget everything works. Here are the steps I follow:

$ export http_proxy="http://us.proxymesh.com:31280"
$ export https_proxy="http://us.proxymesh.com:31280"

$ scrapy view https://www.paypal.com
2011-08-25 19:41:43-0700 [scrapy] INFO: Scrapy 0.12.0.2545 started (bot: nice_bot)
2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled extensions: FeedExporter, TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlCanonicalizerMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled item pipelines:
2011-08-25 19:41:43-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-08-25 19:41:43-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-08-25 19:41:43-0700 [default] INFO: Spider opened
2011-08-25 19:41:43-0700 [scrapy] DEBUG: Cookie: None for https://www.paypal.com
2011-08-25 19:41:44-0700 [scrapy] INFO: Set-Cookie: [] from https://www.paypal.com
2011-08-25 19:41:44-0700 [default] DEBUG: Crawled (501) <GET https://www.paypal.com> (referer: None)
2011-08-25 19:41:44-0700 [default] INFO: Closing spider (finished)
2011-08-25 19:41:48-0700 [default] INFO: Spider closed (finished)



$ wget https://www.paypal.com
--2011-08-25 19:44:08--  https://www.paypal.com/
Resolving us.proxymesh.com... 184.106.76.204
Connecting to us.proxymesh.com|184.106.76.204|:31280... connected.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html'


I am using Scrapy 0.12.0.2545, Twisted 11.0.0 and Python 2.7.

After some investigation, it appears that instead of issuing a CONNECT request and then sending the GET through the resulting tunnel, Scrapy only issues a GET request, which causes the fetch to fail.
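
To make the expected sequence concrete, here is a minimal sketch of the CONNECT-then-GET handshake using only the Python standard library. It is not Scrapy code; the function names (build_connect_request, parse_status_code, fetch_via_tunnel) are hypothetical, and port 443 is assumed for the target host.

```python
import socket
import ssl


def build_connect_request(host, port=443):
    """Return the bytes a client sends to ask an HTTP proxy for a tunnel."""
    target = "{}:{}".format(host, port)
    return "CONNECT {0} HTTP/1.1\r\nHost: {0}\r\n\r\n".format(target).encode("ascii")


def parse_status_code(response_head):
    """Extract the numeric status code from a raw HTTP response head."""
    status_line = response_head.split(b"\r\n", 1)[0]
    return int(status_line.split(b" ", 2)[1])


def fetch_via_tunnel(proxy_host, proxy_port, host, path="/", timeout=30):
    """Fetch https://host/path through an HTTP proxy using CONNECT."""
    sock = socket.create_connection((proxy_host, proxy_port), timeout=timeout)
    try:
        # Step 1: ask the proxy to open a raw TCP tunnel to host:443.
        sock.sendall(build_connect_request(host))
        head = b""
        while b"\r\n\r\n" not in head:
            chunk = sock.recv(4096)
            if not chunk:
                raise IOError("proxy closed the connection during CONNECT")
            head += chunk
        status = parse_status_code(head)
        if status != 200:
            raise IOError("proxy refused CONNECT with status {}".format(status))

        # Step 2: start TLS inside the tunnel, then send an ordinary GET.
        context = ssl.create_default_context()
        sock = context.wrap_socket(sock, server_hostname=host)
        request = "GET {} HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n".format(path, host)
        sock.sendall(request.encode("ascii"))
        response = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            response += chunk
        return response
    finally:
        sock.close()
```

With the proxy from the steps above, the call would be fetch_via_tunnel("us.proxymesh.com", 31280, "www.paypal.com"). A client that skips step 1 and sends the GET for an https:// URL directly to the proxy is what produces the behaviour I am seeing.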

Do you have any idea why this happens and how it can be fixed?

Thanks,
Oana







Pablo Hoffman

Aug 26, 2011, 1:44:47 PM
to scrapy...@googlegroups.com
HTTPS proxies are not supported yet. There's more information in this ticket:
http://dev.scrapy.org/ticket/159


Palash Jain

Jun 21, 2016, 4:05:22 AM
to scrapy-users
Hi, did you manage to get it to work?
I am facing the same issue and can't get it working. Any help would be appreciated.

陈伟伟

Jun 21, 2016, 7:23:46 PM
to scrapy-users, oana...@lip6.fr


On Friday, August 26, 2011 at 11:04:23 AM UTC+8, Oana Goga wrote:

Does Scrapy work with HTTP proxies?

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware. 
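
As a rough illustration of how exported http_proxy / https_proxy variables like the ones in the original post are picked up from the environment, here is a small sketch using the standard library's urllib.request.getproxies_environment(). The helper name proxy_for is hypothetical, and that HttpProxyMiddleware resolves the variables in exactly this way is an assumption, not something confirmed in this thread.

```python
import os
from urllib.parse import urlsplit
from urllib.request import getproxies_environment


def proxy_for(url):
    """Return the proxy URL the environment selects for url, or None."""
    scheme = urlsplit(url).scheme
    # getproxies_environment() maps "<scheme>_proxy" variables to a dict
    # keyed by scheme, e.g. {"http": "...", "https": "..."}.
    return getproxies_environment().get(scheme)


# Mirror the exports from the original post for this demonstration.
os.environ["http_proxy"] = "http://us.proxymesh.com:31280"
os.environ["https_proxy"] = "http://us.proxymesh.com:31280"

print(proxy_for("https://www.paypal.com"))
```

Note that this only shows which proxy is selected for a URL. It says nothing about whether the client then tunnels HTTPS traffic through that proxy with CONNECT, which is the part the original question is about.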