doing a Google search from Clojure?

Rich Morin

Mar 22, 2013, 3:09:07 AM
to clo...@googlegroups.com
I've been successfully using slurp and laser to harvest and pull
apart some web pages. However, I can't figure out how to use
Google Search from my code.

My first thought was to use the Google Search API, but after
a lot of frustration in trying to get and use an API key, I
gave up on that.

My next thought was to slurp in a page from the interactive
Google Search facility, using the URL from Advanced Search:

"http://www.google.com/search?hl=en&as_q=..."

However, this gives me a 403 nastygram:

IOException Server returned HTTP response code: 403 for URL:
https://www.google.com/search?hl=en&as_q=&as_epq=...
sun.net.www.protocol.http.HttpURLConnection.getInputStream
(HttpURLConnection.java:1436)

Has anyone here, by chance, been able to do this sort of thing?

-r

--
http://www.cfcl.com/rdm Rich Morin
http://www.cfcl.com/rdm/resume r...@cfcl.com
http://www.cfcl.com/rdm/weblog +1 650-873-7841

Software system design, development, and documentation


Cedric Greevey

Mar 22, 2013, 3:32:33 AM
to clo...@googlegroups.com
Change your code so it spoofs a common browser user-agent, change your DHCP-assigned IP address, and try again. They're probably trying to obstruct bots from making overwhelming numbers of requests or something. As long as you don't flood them with requests at a higher rate than a human would generate by clicking, I don't see any ethical issue with circumventing their countermeasures, especially not if the search will be triggered by a user input to your application anyway.


--
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



Jonathan Fischer Friberg

Mar 22, 2013, 9:26:09 AM
to clo...@googlegroups.com

juan.facorro

Mar 22, 2013, 11:00:22 AM
to clo...@googlegroups.com
Setting the user agent did the trick, at least in my case.

(ns google-search
  (:import [java.net URL URLEncoder]))

(def google-search-url "http://www.google.com/search?q=")
(def user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172")

;; Open a connection that identifies itself with a browser-like User-Agent.
(defn open-connection [url]
  (doto (.openConnection url)
    (.setRequestProperty "User-Agent" user-agent)))

;; Read the response one byte at a time, treating each byte as a char.
(defn get-response [url]
  (let [conn (open-connection url)
        in (.getInputStream conn)
        sb (StringBuilder.)]
    (loop [c (.read in)]
      (if (neg? c)
        (str sb)
        (do
          (.append sb (char c))
          (recur (.read in)))))))

(defn search [query]
  (let [url (URL. (str google-search-url (URLEncoder/encode query "UTF-8")))]
    (get-response url)))

(spit "response.html" (search "URLEncoder java 7"))
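
For reference, a minimal sketch of what the URL construction in search produces. The helper name search-url is hypothetical (it is not part of the code above), and passing an explicit "UTF-8" encoding is an assumption on my part:

```clojure
(import '[java.net URLEncoder])

;; Hypothetical helper: builds the search URL without fetching anything.
(defn search-url [^String query]
  (str "http://www.google.com/search?q="
       ;; URLEncoder turns spaces into + and escapes reserved characters.
       (URLEncoder/encode query "UTF-8")))

(search-url "URLEncoder java 7")
;; => "http://www.google.com/search?q=URLEncoder+java+7"
```

Printing the URL this way is a quick check that the query is escaped as expected before any request goes out.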

HIH,

Juan

Jim - FooBar();

Mar 22, 2013, 11:20:23 AM
to clo...@googlegroups.com
On 22/03/13 15:00, juan.facorro wrote:
> (do
>   (.append sb (char c))

do you really need the 'do'?

Jim

Juan Martín

Mar 22, 2013, 11:25:59 AM
to clo...@googlegroups.com
Yes, the do is necessary since the character needs to be appended to the StringBuilder and recur needs to be called after doing that.

I actually took the code from the clojure.core/slurp function :).
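
To spell out why: if accepts a single expression per branch, so without the do the append and the recur would be read as extra arguments to if and the form would not compile. A minimal sketch of the same loop, using a StringReader as a stand-in for the HTTP stream (read-all is a hypothetical name, not from the code I posted):

```clojure
(import '[java.io StringReader])

;; Hypothetical helper mirroring the loop in get-response.
(defn read-all [rdr]
  (let [sb (StringBuilder.)]
    (loop [c (.read rdr)]
      (if (neg? c)
        (str sb)
        ;; `do` groups the side effect (.append) with the tail call (recur)
        ;; into the single "else" expression that `if` expects.
        (do
          (.append sb (char c))
          (recur (.read rdr)))))))

(read-all (StringReader. "hello"))
;; => "hello"
```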

Cheers,

Juan

--
Juan Facorro

Jim - FooBar();

Mar 22, 2013, 11:26:16 AM
to clo...@googlegroups.com
ooops! I'm really sorry! my bad!

Jim

Rich Morin

Mar 22, 2013, 12:02:53 PM
to clo...@googlegroups.com
On Mar 22, 2013, at 08:00, juan.facorro wrote:
> Setting the user agent did the trick, at least in my case.

Thanks! Using your code, I was able to bring in a page and
write it to a file. I was then able to confirm that it had
the expected content. FTW!

I still think this should be easier, but now I'm back on a
productive path.

juan.facorro

Mar 22, 2013, 12:44:24 PM
to clo...@googlegroups.com
I gave the code another look and remembered that slurp can actually handle a bunch of types as input, so I just passed the InputStream from the connection and got the same results. Additionally, in the code I posted before, the get-response function never closed the stream, which slurp does.

(ns google-search
  (:import [java.net URL URLEncoder]))

(def google-search-url "http://www.google.com/search?q=")
(def user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172")

(defn open-connection [url]
  (doto (.openConnection url)
    (.setRequestProperty "User-Agent" user-agent)))

;; slurp reads the whole stream and closes it when done.
(defn get-response [url]
  (let [conn (open-connection url)]
    (slurp (.getInputStream conn))))

(defn search [query]
  (let [url (URL. (str google-search-url (URLEncoder/encode query "UTF-8")))]
    (get-response url)))

(spit "response.html" (search "clojure google"))
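
If you would rather make the close explicit instead of relying on slurp, with-open is one way to do it. This is only a sketch under my own assumptions: fetch-body is a hypothetical name, and it takes the user agent as an argument rather than using the def above.

```clojure
(import '[java.net URL])

;; Hypothetical helper: returns the response body as a string.
(defn fetch-body [^URL url ^String user-agent]
  (let [conn (doto (.openConnection url)
               (.setRequestProperty "User-Agent" user-agent))]
    ;; with-open closes the stream when the body finishes,
    ;; even if reading throws an exception.
    (with-open [in (.getInputStream conn)]
      (slurp in))))
```

Something like (fetch-body (URL. "http://www.google.com/search?q=clojure") user-agent) should then behave like get-response.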

J

Armando Blancas

Mar 22, 2013, 1:54:37 PM
to clo...@googlegroups.com
Rich, you may want to check out clojure-http-client.

(require '[clj-http.client :as client])
;; client/get returns a response map; the page itself is under :body.
(spit "result.html" (:body (client/get "http://www.google.com/search?q=clojure")))

Anthony Grimes

Mar 22, 2013, 2:37:44 PM
to clo...@googlegroups.com
clojure-http-client is more or less unmaintained. https://github.com/dakrone/clj-http is the canonical http client these days.

Lazybot has a plugin for doing this with the google ajax api, if that's helpful. No API key needed. https://github.com/flatland/lazybot/blob/develop/src/lazybot/plugins/google.clj

Armando Blancas

Mar 22, 2013, 3:19:10 PM
to clo...@googlegroups.com
Thanks, Anthony; will use that one.