
how to extract url's from html source of google search result


sujeet kumar

Jun 11, 2005, 2:44:03 PM
hi
I want to make a Tk window where you enter a search string; it
searches for that string on Google and prints the web addresses (http
URLs) of the results in a TkFrame of that window. My program connects
to the net and gets the HTML source through the function "http.get".
Now, from the HTML source, how can I find the URLs of the search
results? Can I do it with a regular expression, or is there another way?
Any suggestions are welcome.
Thanks
sujeet


Marcel Molina Jr.

Jun 11, 2005, 8:30:35 PM

The URI.extract method from the uri library can extract an array of URIs from
a string:

require 'uri'
URI.extract('My favorite site is http://google.com')
# => ["http://google.com"]

An optional second argument can limit the schemes that it will match against
and return:

URI.extract('Why do people use mailto:m...@lala.org links?')
# => ["mailto:m...@lala.org"]
URI.extract('Why do people use mailto:m...@lala.org links?', 'http')
# => []
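
As a rough sketch of how this might apply to the original question, the
same call can be run over a fetched page, passing an array of schemes so
that only web links come back. The html string below is a made-up stand-in
for whatever "http.get" returns, not real Google output, and the variable
names are illustrative only:

```ruby
require 'uri'

# Hypothetical stand-in for the HTML body returned by the http.get call.
html = <<~HTML
  <p><a href="http://www.ruby-lang.org/en/">Ruby</a></p>
  <p><a href="mailto:someone@example.com">Mail</a></p>
  <p><a href="https://example.com/docs">Docs</a></p>
HTML

# Restrict matching to http and https; drop repeated links with uniq.
urls = URI.extract(html, %w[http https]).uniq
puts urls
```

Because this scans the raw text, it will also pick up URLs that appear
outside of links (in scripts or comments, for example), so the result may
need further filtering.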

marcel
--
Marcel Molina Jr. <mar...@vernix.org>


Alexey Verkhovsky

Jun 11, 2005, 8:44:38 PM
Marcel Molina Jr. wrote:

>On Sun, Jun 12, 2005 at 03:44:03AM +0900, sujeet kumar wrote:
>>how can I find the url's of the search. Can i
>>do it by regular expression or any other way.
>
>The URI.extract method from the uri library can extract an array of uri's from
>a string:

A universal regexp that finds URIs in arbitrary text is a
complicated thing, indeed. Besides, it can produce false positives
(finding things that look like URIs, but aren't).

If you are sure that the page is well-formed XHTML (I'm not sure whether
that's the case with Google), you might instead parse it with
REXML, and use XPath to retrieve the href attributes of all <a>..</a>
elements, selecting only those that start with "http://" (there may also
be mailto:, ftp:, JavaScript calls etc).
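
A minimal sketch of that approach, assuming the input really is
well-formed (REXML raises a parse error otherwise). The xhtml fragment is
invented for illustration and carries no namespace declaration; it is not
what Google actually serves:

```ruby
require 'rexml/document'

# Hypothetical well-formed XHTML fragment standing in for the fetched page.
xhtml = <<~XHTML
  <html>
    <body>
      <a href="http://www.ruby-lang.org/">Ruby</a>
      <a href="mailto:someone@example.com">Mail</a>
      <a href="javascript:void(0)">Script</a>
      <a href="http://example.com/docs">Docs</a>
    </body>
  </html>
XHTML

doc = REXML::Document.new(xhtml)

# Find every <a> element, read its href attribute, skip anchors without one.
hrefs = REXML::XPath.match(doc, '//a').map { |a| a.attributes['href'] }.compact

# Keep only the links that start with "http://".
http_links = hrefs.select { |href| href.start_with?('http://') }
puts http_links
```

Unlike a regexp over the whole text, this only looks at real link
elements, at the cost of failing outright on markup that is not well-formed.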

Best regards,
Alexey Verkhovsky


Eric Hodel

Jun 11, 2005, 10:24:16 PM

Why not use the Google API?

--
Eric Hodel - drb...@segment7.net - http://segment7.net
FEC2 57F1 D465 EB15 5D6E 7C11 332A 551C 796C 9F04
