Setting Dynamic Proxies to Scrapy


Kyle Clarke

Jan 13, 2010, 3:17:10 AM
to scrapy-users
Hi all, I was wondering if I could get some input on an implementation
I am about to start - I'm not a Python guru (yet... though I am a PHP
dev). I am looking at dynamically setting proxy hosts for Scrapy. For example:

For all of the crawling I plan to do, I always intend to be behind a
proxy. As much as I plan not to have the proxy banned from the
websites I am crawling, I must at least plan for that eventuality.
I'm not going to hammer a site and cause all sorts of headaches by
being blatantly dumb; I want to scrape the publicly available data
without arousing "too much" attention.

So my thoughts were to:

1) Create a Python script (though it could be PHP/bash etc) to use Scrapy
to crawl a website like http://www.proxy4free.com/list/webproxy_rating1.html
and store either a single proxy value or a list of proxies to use.

2) The script would then set the proxy setting on the server
dynamically, as per the urllib and urllib2 docs.

3) The script would then call "python scrapy-ctl.py project_name
website_url".

4) Hooks inside the spider would check whether the crawled page
returned the appropriate response - if so, it would continue.

5) If an incorrect response was found, a "banned exception" for
example would be raised. The Scrapy engine would then be stopped/killed
with scrapyengine.stop() / scrapyengine.kill() or similar, and the
exception would be returned to the calling script.

6) The calling script would check for such a "banned exception"
in its try/except block and, if found, would either scrape for another
proxy or use one it had already scraped earlier. It would then dynamically
set this proxy, call the "python scrapy-ctl.py project_name website_url"
command again and carry on as normal (see the rough sketch after this list).

What I don't know is whether Python can set OS settings, like a proxy
setting, on the fly - perhaps some better Python devs here could help
me with that.
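
For example, my current assumption is that setting the standard proxy
environment variable from Python would be enough, at least for urllib2 -
just a sketch with a made-up proxy address:

    import os
    import urllib2

    # Made-up proxy address; urllib2's default opener picks up http_proxy
    # from the environment, so requests made after this go through it.
    os.environ["http_proxy"] = "http://1.2.3.4:8080"

    print urllib2.urlopen("http://www.example.com/").code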

The reason I want to do this is for administrative purposes, e.g. I will
be sending email notifications from Scrapy when exceptions are raised
etc. so I can tweak the engine and its settings. But I don't want to
have to continually set a proxy setting on the server, and it would be
much cleaner if this was automated.

I'd love to hear any thoughts anyone has on this - I'm pretty sure
this is a good way to go about it; I just need to do some reading on
Python and OS settings etc. All thoughts welcome!
Thanks
Kyle from New Zealand

PS On a completely different note, is it possible to suppress the
terminal output from the scrapy command "python scrapy-ctl.py
project_name website_url"? E.g. I return and store entire web responses,
so my terminal window gets a lot of output displayed - I can just use
the shell commands when I want to output something. It's not a biggie
and doesn't really affect anything - was just curious - thanks

Pablo Hoffman

Jan 13, 2010, 12:54:44 PM
to scrapy...@googlegroups.com
Hi Kyle,

I think you could do all that in a custom downloader middleware, similar to the
HttpProxyMiddleware (scrapy/contrib/downloadermiddleware/httpproxy.py).

The difference is that it would load/parse the list of proxies in its constructor
(when the Scrapy process is initialized), and then you would also add a
process_response() method that monitors for "banned" responses and
(when one is found) switches to the next available proxy. No need for an
extra process for launching the Scrapy process.
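
Something along these lines, just as a rough sketch (the proxy list, the
banned-response check and the class name are placeholders, not a tested
implementation):

    class RotatingProxyMiddleware(object):
        """Hypothetical custom downloader middleware (sketch only)."""

        def __init__(self):
            # Load/parse the proxy list here (file, scraped page, DB, ...).
            # Hard-coded placeholders for the sake of the example.
            self.proxies = ["http://1.2.3.4:8080", "http://5.6.7.8:3128"]
            self.current = 0

        def process_request(self, request, spider):
            # Attach the currently selected proxy to every outgoing request
            request.meta["proxy"] = self.proxies[self.current]

        def process_response(self, request, response, spider):
            # "Banned" detection is site-specific; an HTTP 403 is just an example
            if response.status == 403:
                self.current = (self.current + 1) % len(self.proxies)
                retry = request.copy()
                retry.meta["proxy"] = self.proxies[self.current]
                retry.dont_filter = True  # don't let the dupe filter drop the retry
                return retry
            return response

You would then enable it through the DOWNLOADER_MIDDLEWARES setting of
your project.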

Regarding the scrapy-ctl.py question, if you use a log file the logging will be
stored there. See LOG_FILE setting:
http://doc.scrapy.org/topics/settings.html#log-file
You can also use LOG_STDOUT setting for redirecting your standard output to the
log file too:
http://doc.scrapy.org/topics/settings.html#log-stdout
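
For example, in your project settings file:

    # settings.py (excerpt)
    LOG_FILE = "scrapy.log"   # write log messages here instead of the terminal
    LOG_STDOUT = True         # also redirect standard output to the log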

Pablo.


Jacob Perkins

Jan 14, 2010, 10:35:01 AM
to scrapy-users
Hi Kyle,

I've tried doing something similar to what you're thinking, and
relying on open proxies can cause a ton of frustration (because most
of them time out, drop requests, and are otherwise very slow). An
alternative is to run your own Squid proxies on EC2 and rotate them
periodically. I have the proxy instances add and remove themselves
from SimpleDB, then I have my own custom downloader middleware that
looks up the available proxies from that same SimpleDB domain and
sets a random choice as the request proxy.
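
Roughly like this - only a sketch with made-up names (the SimpleDB
domain, the "url" attribute and the refresh interval are specific to my
setup, and the boto credentials come from the usual environment/config):

    import random
    import time

    import boto  # boto's SimpleDB support


    class SimpleDBProxyMiddleware(object):

        REFRESH_SECS = 300  # how often to re-read the proxy list

        def __init__(self):
            self.sdb = boto.connect_sdb()
            self.domain = self.sdb.get_domain("proxies")  # made-up domain name
            self.proxies = []
            self.last_refresh = 0

        def _refresh(self):
            if not self.proxies or time.time() - self.last_refresh > self.REFRESH_SECS:
                # each item is assumed to carry a "url" attribute like http://host:port
                self.proxies = [item["url"] for item in
                                self.domain.select('select url from `proxies`')]
                self.last_refresh = time.time()

        def process_request(self, request, spider):
            self._refresh()
            if self.proxies:
                request.meta["proxy"] = random.choice(self.proxies)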

Jacob


Leonardo Lazzaro

Jan 14, 2010, 11:51:59 AM
to scrapy...@googlegroups.com
That's true, but if you use a lot of proxies in parallel, you can get very good performance.
I don't have EC2! :P

Kyle Clarke

Jan 15, 2010, 1:47:51 AM
to scrapy-users
Thank you both, Pablo & Jacob - good suggestions for my crawler
going forward. Most likely, at the start of this application I will
create my own middleware, generic enough to look up any source of
proxies available. I think at first I will look at an open proxy list
and move up to a more stable and better-performing option such as EC2 -
thanks for your input here Jacob. I'll try not to get so frustrated!
Thanks again
Kyle

Kyle Clarke

Jan 20, 2010, 6:45:07 PM
to scrapy-users
Additionally - I have to agree with Jacob, open proxies are pretty
flaky and slow. I am therefore dropping open proxies for my anonymity
and instead using https://www.torproject.org/ - hope this helps anyone
else in the same situation. If so, then please support the project.
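
In case it helps: the setup I'm assuming is Tor with an HTTP proxy such
as Privoxy in front of it (Tor itself only speaks SOCKS), so Scrapy just
sees an ordinary local HTTP proxy - 8118 is Privoxy's default port,
adjust for your install:

    import os

    # Local Privoxy (or Polipo) instance forwarding to Tor
    os.environ["http_proxy"] = "http://127.0.0.1:8118"
    # Scrapy's built-in HttpProxyMiddleware reads http_proxy from the
    # environment, so no extra middleware is needed for a single fixed proxy.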
Best
Kyle