Best way to set http_proxy via scrapy program?

Edward

Feb 21, 2012, 9:22:49 AM

to scrapy-users

Hi,

I run all my spiders programmatically using a scheduler and Amazon EC2
instances. Typically, I launch them using Fabric and an SSH
connection.

This means that I don't easily get the ability to set an environment
variable before my scheduler calls "scrapy crawl xxx" over a remote
shell.

I'm looking for a way to set http_proxy from within my spider. Some
notes:

- I've enabled HttpProxyMiddleware and confirmed everything works if I
set http_proxy by hand before running my spider from the command
line.
- The project I use has multiple spiders (I call different ones via
scrapy crawl xxx), but not all of them should use a proxy, so I'm
looking for a solution I can apply from within a spider class.
- My proxy requires authentication; I normally set http_proxy like
user:pa...@proxy.com:808
- I can pass the proxy string in to my spider via the -a keyword args.

What's the easiest way to set http_proxy? I tried using
os.environ['http_proxy'] = proxy_string, and os.putenv('http_proxy',
proxy_string), but no dice in either case.

Thanks!

Leonardo Lazzaro

Feb 21, 2012, 1:50:49 PM

to scrapy...@googlegroups.com

You can set meta['proxy'] on each request, or use the proxy downloader middleware (scrapy/contrib/downloadermiddleware/httpproxy.py).
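
A minimal sketch of the per-request approach, assuming a request object that exposes a `meta` dict the way Scrapy's Request does. The helper name `with_proxy` and the proxy address are hypothetical, chosen only for illustration:

```python
def with_proxy(request, proxy_url):
    """Tag a Scrapy-style request so it is routed through proxy_url.

    `request` is assumed to expose a `meta` dict, as Scrapy's Request does.
    """
    request.meta['proxy'] = proxy_url
    return request


# Hypothetical use inside a spider callback:
#     yield with_proxy(Request(url, callback=self.parse),
#                      'http://proxy.example.com:808')
```

Because the proxy is chosen per request, a spider that should not use the proxy simply never calls the helper.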

hope it helps
leo

--
http://www.lazzaroleonardo.com.ar/

twitter @llazzaro

Edward

Feb 21, 2012, 3:57:44 PM

to scrapy-users, pabloh...@gmail.com

Hi Leonardo,

Thanks for the reply. However, I don't think that approach works with
proxies that require authentication (as my example shows). This is
what Pablo Hoffman stated in an earlier thread on the mailing list:
http://groups.google.com/group/scrapy-users/browse_frm/thread/9fa92a86d0af2835/

Perhaps Pablo could explain how I take my "user : pa...@proxy.com:808"
string and use my spider with the proxy? For example, I don't think I
can just set Proxy-Authorization as is, because the user/pass has to
be base64-encoded first, doesn't it?
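
For reference, the HTTP Basic scheme does expect the "user:password" pair to be base64-encoded, and the standard library covers that in a couple of lines. A sketch in Python 3 syntax; the function name `basic_proxy_auth` is hypothetical:

```python
import base64


def basic_proxy_auth(user, password):
    """Return a Proxy-Authorization header value for the HTTP Basic scheme."""
    credentials = '{}:{}'.format(user, password).encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')
```

The returned string could then be assigned to the request's Proxy-Authorization header, alongside a meta['proxy'] value that carries only the proxy address without the credentials.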

Isn't there just a way I can set the environment variable or otherwise
pass my http_proxy string to urllib2?

Cheers!

Pablo Hoffman

Mar 31, 2012, 10:16:40 PM

to scrapy-users

You could set the environment variable by implementing a custom Scrapy command:
http://doc.scrapy.org/en/latest/topics/commands.html#custom-project-commands

But here is a simple middleware that configures the proxy from a Scrapy setting instead, with authentication support, which I think is closer to what you need.
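
The attached middleware is not reproduced in this copy of the thread. What follows is only a sketch of the general idea, not the original code. It assumes Python 3, a Scrapy version that builds middlewares through `from_crawler`, a hypothetical setting named `HTTP_PROXY`, and a hypothetical `use_proxy` spider attribute so that individual spiders can opt in:

```python
import base64
from urllib.parse import unquote, urlsplit


class SettingsProxyMiddleware:
    """Downloader-middleware sketch: route requests via the proxy in a setting.

    HTTP_PROXY is a hypothetical setting name, for example
    'http://user:secret@proxy.example.com:808'.
    """

    def __init__(self, proxy_url):
        self.proxy = None
        self.auth = None
        if not proxy_url:
            return  # no proxy configured: the middleware does nothing
        parts = urlsplit(proxy_url)
        # Keep only scheme://host:port; credentials travel in a header instead.
        self.proxy = '{}://{}'.format(parts.scheme, parts.netloc.rpartition('@')[2])
        if parts.username:
            creds = '{}:{}'.format(unquote(parts.username),
                                   unquote(parts.password or ''))
            self.auth = 'Basic ' + base64.b64encode(creds.encode('utf-8')).decode('ascii')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('HTTP_PROXY'))

    def process_request(self, request, spider):
        # Only spiders that set use_proxy = True (assumed attribute) are proxied.
        if self.proxy is None or not getattr(spider, 'use_proxy', False):
            return None
        request.meta['proxy'] = self.proxy
        if self.auth:
            request.headers['Proxy-Authorization'] = self.auth
        return None
```

A class like this would be enabled through the DOWNLOADER_MIDDLEWARES setting, with HTTP_PROXY defined in the project settings or passed on the command line, so nothing has to be exported in the remote shell before "scrapy crawl xxx" runs.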
