Impossible to scrape google search results.. Right?

7,514 views

elio

Feb 12, 2013, 2:12:26 PM
to scrapy-users
Hi,

I need to find websites that expose their usage of certain cloud
services in their HTML, and I am trying to do it with Scrapy.

piece of code:

class GoogleSpider(BaseSpider):
    name = "google"
    url = 'https://www.google.com/#q=elio'
    rules = (Rule(SgmlLinkExtractor('//h3'), callback="parse_item"),)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)

With:
BOT_NAME = 'MOZILLA'
BOT_VERSION = '7.0'
But Google gives no results.

Thanks,
Elio

Sandip Shah

Feb 12, 2013, 2:18:11 PM
to scrapy...@googlegroups.com
You won't be able to use Scrapy for that ... you'll need to use something like Selenium ... or PyV8.

SS




--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users...@googlegroups.com.
To post to this group, send email to scrapy...@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.



Nhan Ho

Feb 12, 2013, 2:29:37 PM
to scrapy...@googlegroups.com
Why couldn't we use Scrapy for such a task?

Nhan

Andres Vargas

Feb 12, 2013, 4:06:33 PM
to scrapy...@googlegroups.com
Use the Google API for that.
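As a sketch of what that looks like in practice: Google's Custom Search JSON API takes a `key`, a search-engine id (`cx`), and the query `q`. The key and cx values below are placeholders, and this only builds the request URL, it doesn't send it:

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_search_url(query, api_key, cx):
    """Build a Custom Search API request URL (no request is made here)."""
    params = {"key": api_key, "cx": cx, "q": query}
    return API_ENDPOINT + "?" + urlencode(params)

# Placeholder credentials -- substitute your own key and engine id.
print(build_search_url("elio", "YOUR_API_KEY", "YOUR_CX_ID"))
```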




--
Andres Vargas
www.zodman.com.mx

Randall Morgan

Feb 12, 2013, 4:18:07 PM
to scrapy...@googlegroups.com
Google tracks which browsers are being used and ignores bots to keep
the load down on their servers. You can gain access to Google's search
results by applying for an API key.
Google uses complex methods to decide whether a bot or a browser
connected. Selenium is a browser automation tool suite, so Google
cannot tell whether a person or a script is controlling the browser.
Selenium was designed not so much for scraping as for web site
testing. As such it works with the big three browsers and perhaps
others. It is also great for sites with a lot of dynamic content,
since it uses the browser's own engine and DOM. Items can be selected
by tag name, id, CSS selector, or XPath.
If you ask me whether it can be done, the answer is YES, it can always
be done. The correct questions, however, are: what will it cost, and
how long will it take?
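A rough sketch of that Selenium approach, for illustration only. The import is guarded so the snippet stays inert without Selenium installed, and the "h3 > a" selector is a guess at Google's result markup, not something from this thread:

```python
# Selenium sketch: drive a real browser so the site sees normal traffic.
try:
    from selenium import webdriver
except ImportError:  # Selenium not installed; keep the sketch importable
    webdriver = None

def google_result_links(query):
    """Open Google in a real browser and collect result links.

    Uses the modern find_elements(by, value) form of the Selenium
    bindings; the CSS selector may need adjusting to the live markup.
    Selenium also supports lookup by tag name, id, and XPath.
    """
    if webdriver is None:
        raise RuntimeError("Selenium is not installed")
    driver = webdriver.Firefox()
    try:
        driver.get("https://www.google.com/search?q=" + query)
        links = [a.get_attribute("href")
                 for a in driver.find_elements("css selector", "h3 > a")]
    finally:
        driver.quit()
    return links
```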

Randall Morgan

Feb 12, 2013, 4:20:59 PM
to scrapy...@googlegroups.com
Just a note: Google used to allow limited access by bots, permitting
only a few results to be scraped at a time. Years ago I had success
scraping for multiple terms by using a rotating queue. However, I have
heard this is much harder now. Selenium is easy to use, and there are
Python bindings for it.
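That rotating-queue idea can be sketched with the standard library alone; the terms and batch size below are made-up examples, not anything from the thread:

```python
from collections import deque

def rotating_batches(terms, batch_size):
    """Yield search terms round-robin, one batch at a time, so no
    single term is requested repeatedly in a row."""
    queue = deque(terms)
    while True:
        batch = [queue[i] for i in range(min(batch_size, len(queue)))]
        queue.rotate(-batch_size)  # advance the queue for the next batch
        yield batch

batches = rotating_batches(["aws", "azure", "gce"], 2)
print(next(batches))  # ['aws', 'azure']
print(next(batches))  # ['gce', 'aws']
```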

Steven Almeroth

Feb 15, 2013, 2:47:16 PM
to scrapy...@googlegroups.com
It might be a good idea for you to start off with experimenting on the command line:

  $ scrapy shell "https://www.google.com/#q=elio"

Also note that the class attribute is called start_urls (not url), and it needs to be a list or a tuple:

   start_urls = ('https://www.google.com/#q=elio',)

And finally, there is no BOT_VERSION anymore, try the USER_AGENT setting instead.
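Put together, the two corrections look something like this (the browser string is just an example value; everything else follows the original post):

```python
# settings.py -- BOT_VERSION is gone; set the User-Agent header directly
# (the browser string below is only an example value)
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'

# in the spider class: the attribute is start_urls, a list or tuple
start_urls = ('https://www.google.com/#q=elio',)
```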

John

Mar 19, 2014, 9:34:24 AM
to scrapy...@googlegroups.com
It might be slightly off topic, but since you had no success with Scrapy:

For Google there is an open-source PHP scraper which works very well for this purpose: http://scraping.compunect.com
I also stumbled over a few others, but this one is currently the best and it is kept updated.

It has proper IP management, local caching, DOM parsing and other features. Pretty much all you need, actually.

In general, scraping Google is not impossible, but they tend to block IP addresses very quickly when they are abused for automated access; that PHP scraper uses proxies and a hard rate limitation to avoid annoying Google.
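The "hard rate limitation" part is easy to sketch in Python (the delay and jitter values below are arbitrary examples, not recommendations):

```python
import random
import time

class Throttle:
    """Enforce a minimum delay (plus random jitter) between requests."""

    def __init__(self, min_delay=10.0, jitter=5.0):
        self.min_delay = min_delay
        self.jitter = jitter
        self._last = 0.0

    def wait(self):
        """Sleep until at least min_delay (+ jitter) has elapsed since
        the previous call, then record the current time."""
        elapsed = time.monotonic() - self._last
        pause = self.min_delay + random.uniform(0, self.jitter) - elapsed
        if pause > 0:
            time.sleep(pause)
        self._last = time.monotonic()
```

Each download would then be preceded by a `throttle.wait()` call; proxy rotation would sit on top of this, picking a different exit IP per batch of requests.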

Paul Tremberth

Mar 19, 2014, 9:53:06 AM
to scrapy...@googlegroups.com
Hi,

It's possible to scrape results from Google using Scrapy.
Here's an example scrapy shell session:

paul@wheezy:~$ scrapy shell "https://www.google.com/search?q=scrapy"
2014-03-19 14:49:24+0100 [scrapy] INFO: Scrapy 0.23.0 started (bot: scrapybot)
2014-03-19 14:49:24+0100 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-03-19 14:49:24+0100 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-03-19 14:49:24+0100 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-03-19 14:49:25+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-03-19 14:49:25+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-03-19 14:49:25+0100 [scrapy] INFO: Enabled item pipelines: 
2014-03-19 14:49:25+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-19 14:49:25+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-19 14:49:25+0100 [default] INFO: Spider opened
2014-03-19 14:49:25+0100 [default] DEBUG: Crawled (200) <GET https://www.google.com/search?q=scrapy> (referer: None)
[s] Available Scrapy objects:
[s]   item       {}
[s]   request    <GET https://www.google.com/search?q=scrapy>
[s]   sel        <Selector xpath=None data=u'<html itemscope="" itemtype="http://sche'>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <Spider 'default' at 0x34220d0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: import w3lib.url

In [2]: sel.css("#search ol > li.g > h3.r > a::attr(href)").extract()
Out[2]: 

In [3]: [w3lib.url.url_query_parameter(u, "q") for u in sel.css("#search ol > li.g > h3.r > a::attr(href)").extract()]
Out[3]: 

In [4]: 



Of course, it's rather minimalist and Google may block you after a few requests.
I'll leave the pagination, and the choice of a relevant USER_AGENT setting, to you.
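For what it's worth, the `w3lib.url.url_query_parameter` helper used above can be approximated with the standard library alone. The example redirect URL below is made up in the shape of Google's `/url?q=...` result links:

```python
from urllib.parse import parse_qs, urlparse

def url_query_parameter(url, name):
    """Return the first value of a query parameter, like w3lib's helper."""
    values = parse_qs(urlparse(url).query).get(name)
    return values[0] if values else None

href = "https://www.google.com/url?q=http://scrapy.org/&sa=U"
print(url_query_parameter(href, "q"))  # http://scrapy.org/
```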

/Paul

Sonal Gupta

Aug 20, 2015, 4:59:58 AM
to scrapy-users
Hello All

I want to scrape the keys from Google. I am giving you a reference; I need something exactly like that.

