However, as bad as the language is, LWP is one of the best libraries
around when it comes to web-related applications. Most notably,
I have never found a library that can parse HTML as well as
LWP's HTML parser. It is my eternal hope that I can find a library as
good and dump the language.
With the advent of Ruby on Rails, I am hopeful that there might be a
package in Ruby that gives Perl's HTML parser a run for its money.
I'm not looking for an XML parser; XML parsers just can't handle
many of the web sites I want to parse. Neither can expat, libxml2,
or some of the more popular libraries. Don't suggest I pass it through
Tidy then parse the XML. There are a lot of pages that Tidy can't
Finally, there will be some smartass who will say that I should use
web sites that are written in good HTML. I don't have a choice of which
pages I, or the people who ask me to write scripts, take our content
from. Fine. If you have the millions to pay all those webmasters to
hire HTML gurus who will generate good HTML, let me know and
I will email you a list. As for me, I am too busy with real work on my
own projects to go around nagging people working on other things to
improve their coding style.
The reply-to email address is olczy...@yahoo.com.
This is an address I ignore.
To reply via email, remove 2002 and change yahoo to
Thaddeus L. Olczyk, PhD
There is a difference between
*thinking* you know something,
and *knowing* you know something.
Look at Narf, and its htmltools and xmltree.
Or Michael Neumann's Mechanize. It wraps htmltools and xmltree.
http://www.ruby-doc.org - The Ruby Documentation Site
http://www.rubyxml.com - News, Articles, and Listings for Ruby & XML
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
Have you tried libxml2 in parse_html mode with the recover option on?
I've never had a problem with any site. It handles broken, nasty HTML
(Disclaimer: I don't know if the Ruby bindings expose this
Here's the original BeautifulSoup. Look like what you need?
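For anyone following along, here is a rough sketch of the distinction this thread keeps coming back to: a strict XML parser rejects tag soup outright, while a tolerant HTML parser keeps going. This is not BeautifulSoup itself; it uses only Python's standard library (xml.etree.ElementTree and html.parser) as a stand-in, and the BROKEN_HTML string, LinkCollector class, and strict_parse_ok helper are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical tag soup: unclosed <p> and <b>, an unquoted attribute,
# a bare ampersand, and no closing </html>.
BROKEN_HTML = (
    "<html><body><p>Fish & chips<p>"
    "<a href=menu.html>Menu<b>today</a></body>"
)


class LinkCollector(HTMLParser):
    """Collects href values and text without requiring well-formed markup."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def strict_parse_ok(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


collector = LinkCollector()
collector.feed(BROKEN_HTML)
collector.close()

print("XML parser accepts it:", strict_parse_ok(BROKEN_HTML))
print("Links found by the tolerant parser:", collector.links)
```

The tolerant parser only reports tags and text as it meets them; it does not build a repaired tree, which is the extra work a library like BeautifulSoup takes on.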
Would anyone be interested either as a user or a developer?
I'm not a Python guy, so I don't know the library. However, I just
browsed through the site and if you ask me, it looks downright handy.
James Edward Gray II
I used Mechanize over the weekend and I just love it. In fact I had a
couple small problems that Michael fixed within hours.
I am using it to automate renewal of library books using my library's
web-site. I was amazed at how quickly I got my solution working, because
the library web-site software has some gnarly URLs and redirects that I
figured would be "fun" to deal with. But Mechanize makes it trivial.
Anyhow, the HTML from the library web-site parses fine and I easily scrape
out the information I care about (book titles, authors and due dates).
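As a rough idea of what that kind of scraping looks like, here is a minimal sketch. It is not Mechanize and not the actual script; it uses Python's standard-library html.parser, and the page markup, column order, values, and the names LoanTableParser and parse_loans are all assumptions made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical loans page: the markup, column order, and values are invented.
# Note that none of the <td> or <tr> elements are closed.
SAMPLE_PAGE = """
<table>
<tr><td>Example Book One<td>A. Author<td>2005-03-14
<tr><td>Example Book Two<td>B. Writer<td>2005-03-21
</table>
"""


class LoanTableParser(HTMLParser):
    """Collects the text of each <td>, grouped by <tr>, tolerating unclosed cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row implicitly ends whatever cell was open.
            self.rows.append([])
            self.in_cell = False
        elif tag == "td" and self.rows:
            # A new cell implicitly ends the previous one.
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def parse_loans(page):
    """Return one dict per three-cell row, assuming title/author/due order."""
    parser = LoanTableParser()
    parser.feed(page)
    parser.close()
    keys = ("title", "author", "due")
    return [
        dict(zip(keys, (cell.strip() for cell in row)))
        for row in parser.rows
        if len(row) == 3
    ]


for loan in parse_loans(SAMPLE_PAGE):
    print(loan["due"], loan["title"], "by", loan["author"])
```

A real library site would also need the login, redirect, and form handling that Mechanize provides; this only covers pulling fields out of the page once you have it.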
+1 I would use it