Scrapy + ProxyMesh to crawl Google News?


naaboo

Nov 8, 2012, 5:38:55 AM
to scrapy...@googlegroups.com
Heya

I need to scrape Google News items, so I need a pool of proxies and a way to rotate through them.

I've played around with a rotating-proxy script (kindly provided here: http://mahmoud.abdel-fattah.net/2012/04/16/using-scrapy-with-different-many-proxies/)
and used 150 proxies from hidemyass.com.

But Google blocked me as soon as I made a request.

So I tried using ProxyMesh, but the same thing happens :(

When I check with whatsmyip.org, I get a new IP on every request (so to me this means the proxy middleware is configured correctly).
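
For readers following along, a rotating-proxy downloader middleware of the kind described above might look roughly like this. It is only a sketch, not the script from the linked post: the class name `RandomProxyMiddleware` and the `PROXY_LIST` setting are made up for illustration. The one Scrapy convention it relies on is that a request is routed through the proxy URL stored in `request.meta["proxy"]`.

```python
import random


class RandomProxyMiddleware:
    """Hypothetical downloader middleware that picks a random proxy per request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("proxy list is empty")
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed custom setting (a list of proxy URLs),
        # not a built-in Scrapy setting.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Scrapy sends the request through the proxy URL found in meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

Such a class would be enabled through `DOWNLOADER_MIDDLEWARES` in `settings.py`. Note that a different exit IP on every request only confirms that rotation works; it says nothing about whether the target site has already blocked those IPs.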

Do you have any tips for me to solve this problem?


THANKS!


PS: I'm running a CLI-only Ubuntu EC2 instance.

naaboo

Nov 9, 2012, 2:42:15 PM
to scrapy...@googlegroups.com
I got an answer from one of the ProxyMesh guys:

Unfortunately, Google is a popular site to scrape, and ProxyMesh has been blocked by them. So for scraping Google I always recommend http://www.trustedproxies.com/

Jacob

naaboo

Nov 19, 2012, 2:00:34 PM
to scrapy...@googlegroups.com
Hi Ray,

Sorry for my late reply.

What exactly is your problem?
The data you show there seems to be fine.

I am not sure how I can help you out.

Best
Naaboo

On Wednesday, November 14, 2012 3:54:18 PM UTC+1, Ray wrote:
Hi Naaboo,

I am also using Scrapy to crawl information from a website.
I am new to it. I tried to scrape information from the website below and can insert the title, link, and time into a MySQL db; I will use Django to show them.
Could you teach me how to crawl a news site?
Thanks, my email ophc...@yahoo.com


'158', 'Top', '/', '2012-11-14 22:18:58', '\n                '
'159', 'Computers: Programming: Resources', '/Computers/Programming/Resources/', '2012-11-14 22:18:58', '\n                        '
'160', 'Free Python and Zope Hosting Directory', 'http://www.oinko.net/freepython/', '2012-11-14 22:18:58', '\n            \n                    '
'161', 'Social Bug', 'http://win32com.goermezer.de/', '2012-11-14 22:18:58', '\n            \n                    '
'162', 'Computers: Programming: Languages: Python: Resources', '/Computers/Programming/Languages/Python/Resources/', '2012-11-14 22:18:59', '\n                        '