For all of the crawling I plan to do, I will always be behind a
proxy. While I intend to avoid getting that proxy banned from the
websites I am crawling, I must at least plan for the
eventuality. In other words, I'm not going to hammer a site & cause all sorts of
headaches by being blatantly dumb; I want to scrape the publicly
available data without arousing "too much" attention.
So my thoughts were:
1) Create a Python script (though it could be PHP/bash etc.) that uses Scrapy
to crawl a website like http://www.proxy4free.com/list/webproxy_rating1.html
& store either a single proxy value or a list of proxies to
use.
2) The script would then set the proxy setting on the server
dynamically, as per the urllib and urllib2 docs.
3) The script would then run the "python scrapy-ctl.py project_name
website_url" command.
4) Hooks inside the spider would check whether each crawled page
returned the appropriate response - if so, it would continue.
5) If an incorrect response was found, a "banned exception", for
example, would be raised. The Scrapy engine would then be stopped/
killed with scrapyengine.stop() / scrapyengine.kill() or similar, and the
exception would be returned to the calling script (see the sketch just after this list).
6) The calling script would check for such a "banned exception"
in its try/except block &, if found, would either scrape for another
proxy or use one it had already scraped earlier. It would then dynamically
set this proxy, call the "python scrapy-ctl.py project_name
website_url" command again & carry on as normal.
What I don't know is whether you can set OS-level settings, like a
proxy setting, on the fly with Python - perhaps some better Python devs here
could help me out.
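To make it concrete, this is the sort of thing I'm imagining for the calling
script - though I don't know if setting environment variables for the child
process (rather than a real OS-wide setting) is enough, and using a non-zero
exit code to stand in for the "banned exception" is just an assumption on my
part, since an exception can't cross the process boundary directly:

    import os
    import subprocess

    proxies = ['http://1.2.3.4:8080', 'http://5.6.7.8:3128']  # scraped earlier

    for proxy in proxies:
        # child processes inherit os.environ, and Scrapy's HttpProxyMiddleware
        # picks up the standard http_proxy/https_proxy environment variables
        os.environ['http_proxy'] = proxy
        os.environ['https_proxy'] = proxy
        ret = subprocess.call(['python', 'scrapy-ctl.py', 'project_name', 'website_url'])
        if ret == 0:
            break  # crawl finished without the "banned" signal
        # non-zero exit code taken to mean "banned" -- move on to the next proxy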
The reason I want to do this is for administrative purposes, e.g. I will
be sending email notifications from Scrapy when exceptions are raised
etc. so I can tweak the engine & its settings. But I don't want to have to
continually set a proxy setting on the server by hand & it would be much
cleaner if this were automated.
I'd love to hear any thoughts anyone has on this - I'm pretty sure
this is a good way to go about it; I will just need to do some reading
on Python & OS settings etc. All thoughts welcome!
Thanks
Kyle from New Zealand
PS On a completely different note, is it possible to suppress the
terminal output from the scrapy command "python scrapy-ctl.py
project_name website_url"? E.g. I return & store entire web responses,
so my terminal window gets a lot of output displayed - I can just use
the shell commands when I want to output something. It's not a biggie
& doesn't really affect anything - I was just curious - thanks
I think you could do all that in a custom downloader middleware, similar to the
HttpProxyMiddleware (scrapy/contrib/downloadermiddleware/httpproxy.py).
The difference is that it would load/parse the list of proxies in its constructor
(when the Scrapy process is initialized), and then you would also add a
process_response() method that would monitor for the "banned" responses and
(when one is found) switch to the next available proxy. No need to have an
extra process for launching the Scrapy process.
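Something along these lines (an untested sketch - the proxy list, the "banned"
test and the middleware name are just placeholders, and it assumes the
downloader honours request.meta['proxy']):

    import random

    class RotatingProxyMiddleware(object):

        def __init__(self):
            # load/parse the proxy list once, when the Scrapy process starts
            self.proxies = ['http://1.2.3.4:8080', 'http://5.6.7.8:3128']
            self.current = self.proxies[0]

        def process_request(self, request, spider):
            # tell the downloader which proxy to use for this request
            request.meta['proxy'] = self.current

        def process_response(self, request, response, spider):
            if self._banned(response):
                # drop the bad proxy, pick another one and retry the request
                if self.current in self.proxies:
                    self.proxies.remove(self.current)
                self.current = random.choice(self.proxies)
                retry = request.copy()
                retry.dont_filter = True
                return retry  # rescheduled, so process_request assigns the new proxy
            return response

        def _banned(self, response):
            # site-specific: e.g. a 403 or a "you have been blocked" page
            return response.status == 403

You would then enable it through the DOWNLOADER_MIDDLEWARES setting like any
other downloader middleware.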
Regarding the scrapy-ctl.py question, if you use a log file the logging will be
stored there. See the LOG_FILE setting:
http://doc.scrapy.org/topics/settings.html#log-file
You can also use the LOG_STDOUT setting to redirect your standard output to the
log file too:
http://doc.scrapy.org/topics/settings.html#log-stdout
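For example, in your project's settings.py (the log file name is arbitrary):

    LOG_FILE = 'crawl.log'   # write Scrapy's log to a file instead of the terminal
    LOG_STDOUT = True        # also capture anything printed to standard output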
Pablo.
I've tried doing something similar to what you're thinking, and
relying on open proxies can cause a ton of frustration (because most
of them time out, drop requests, and are otherwise very slow). An
alternative is to run your own squid proxies on EC2 and rotate them
periodically. I have the proxy instances add & remove themselves in
SimpleDB, then I have my own custom downloader middleware that looks
up available proxies from that same SimpleDB domain and sets a random
choice to be the request proxy.
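The middleware boils down to something like this (a trimmed-down sketch rather
than my exact code - the 'proxies' domain name and the 'endpoint' attribute are
just examples, and it assumes boto is installed with AWS credentials configured):

    import random
    import boto

    class SimpleDBProxyMiddleware(object):

        def __init__(self):
            sdb = boto.connect_sdb()  # credentials come from the environment/boto config
            domain = sdb.get_domain('proxies')
            # each item was registered by a proxy instance when it started up
            self.proxies = [item['endpoint']
                            for item in domain.select('select endpoint from `proxies`')]

        def process_request(self, request, spider):
            # pick a random live proxy for every request
            request.meta['proxy'] = random.choice(self.proxies)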
Jacob