With WWW::Mechanize, the only example I found was
http://www.zenspider.com/pipermail/ruby/2005-July/002068.html. I tried
to simplify this to the script below, but it just prints out "My wife
is ".
Rubyful Soup <http://www.crummy.com/software/RubyfulSoup/> also seems
like a great library, but there doesn't seem to be a single example
(only Python ones
<http://www.crummy.com/software/BeautifulSoup/examples.html>).
#!/usr/bin/env ruby
require 'mechanize'
agent = WWW::Mechanize.new
agent.user_agent_alias = 'Windows IE 6'
# get first page
page = agent.get('http://www.dankohn.com/')
md = page.body.match /My wife, (\w+\s\w+)<\/a>/m
printf "My wife is ", md
Thanks in advance for any help you can offer.
This can turn HTML into well-formed XML
http://rubyforge.org/projects/tidy/
which is much easier to parse.
Maybe worth a look.
What are you actually trying to accomplish?
James
--
http://www.ruby-doc.org - The Ruby Documentation Site
http://www.rubyxml.com - News, Articles, and Listings for Ruby & XML
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
I was trying to start with baby steps to understand the methods these
libraries support. Specifically, I was trying to fetch my own web
page, and then use a regex to match to my wife's name, "Julie Pullen",
since I have link text on www.dankohn.com saying "My wife, Julie
Pullen". I was then going to gradually increase the complexity of the
scraping.
Thanks in advance for any example scripts or documentation that you can
provide showing web scraping in Ruby.
And Lyndon, I'm a huge fan of Tidy for cleaning up my own web pages,
but I'm not sure it's helpful here, as I was aiming to use regexes to
parse the HTML rather than the DOM.
print "My wife is ", md[1]
Michel.
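For anyone following along, here is a minimal sketch of why that one-line
change works. printf treats its first argument as a format string, and
"My wife is " has no %s placeholder, so the extra argument is never
printed. The helper name wife_name and the sample markup are assumptions
made for illustration; they are not taken from the real page.

```ruby
# Hypothetical helper: pulls the name out of link text such as
# "My wife, Julie Pullen</a>". Returns nil when nothing matches.
def wife_name(html)
  md = html.match(/My wife, (\w+\s\w+)<\/a>/m)
  md && md[1]
end

# Stand-in for page.body; this markup is assumed, not copied from the site.
sample = '<a href="julie.html">My wife, Julie Pullen</a>'

# The original call was: printf "My wife is ", md
# With no placeholder in the format string, md is simply ignored.

# Either of these prints the captured group instead.
printf("My wife is %s\n", wife_name(sample))
print "My wife is ", wife_name(sample), "\n"
```

Checking for nil before indexing matters once the page changes and the
regex stops matching, since nil[1] would raise.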
Well, DOM allows you to use XPath, which is a powerful query mechanism.
This article,
http://www-128.ibm.com/developerworks/java/library/j-jtp03225.html?ca=dgr-jw26XQuery
is XQuery-specific, but relies on XPath.
An example from the article:
//td[contains(a/small/text(), "New York, NY")]
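As a rough sketch of how that expression could be run from Ruby, the
snippet below feeds it to REXML, which ships with Ruby. It assumes the
HTML has already been cleaned into well-formed XML (for example with
Tidy). The sample table, the constant SAMPLE_XML, and the helper name
new_york_cells are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical helper: returns the <td> elements selected by the XPath
# expression quoted from the article.
def new_york_cells(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//td[contains(a/small/text(), "New York, NY")]')
end

# Assumed sample input; real pages would need tidying into XML first.
SAMPLE_XML = <<~XML
  <table>
    <tr><td><a href="ny.html"><small>New York, NY</small></a></td></tr>
    <tr><td><a href="sf.html"><small>San Francisco, CA</small></a></td></tr>
  </table>
XML

# Print the link target of each matching cell.
new_york_cells(SAMPLE_XML).each do |td|
  puts td.elements['a'].attributes['href']
end
```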
Install Watir (using the gem or the Windows setup program that is
available at http://wtr.rubyforge.org/) and try this in IRB:
require 'watir'
ie = Watir::IE.new
ie.goto "dankohn.com"
julie_lines = ie.text.scan(/.*Julie.*/)
link = ie.link(:text, /Julie/)
link.click
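The scan call above can be tried without a browser. This is only a
sketch on a plain string: the helper name lines_mentioning and the sample
text are assumptions, standing in for whatever ie.text returns.

```ruby
# Hypothetical helper mirroring text.scan(/.*Julie.*/). Because "." does
# not match a newline by default, each match is one whole line.
def lines_mentioning(text, word)
  text.scan(/.*#{Regexp.escape(word)}.*/)
end

# Assumed stand-in for the visible text of the page.
page_text = "Dan Kohn\nMy wife, Julie Pullen\nContact\n"

puts lines_mentioning(page_text, "Julie")
```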
Cheers,
Dave