HTML parsing as good as Perl's.

TLOlczyk

Jun 21, 2005, 12:51:47 PM

First let me be very clear. I hate the language that Larry "should be
lined up against a " Wall has written. IMO it encourages people
to program with, well only men can program that way, instead of their
heads.

However, as bad as the language is, LWP is one of the best libraries
around when it comes to web-related applications. Most notably,
I have never found a library that can parse HTML as well as
LWP's HTML parser. It is my eternal hope that I can find a library as
good and dump the language.

With the advent of Ruby on Rails, I am hopeful that there might be a
package in Ruby that gives Perl's HTML parser a run for its money.

I'm not looking for an XML parser; XML parsers just can't handle
many of the web sites I want to parse. Neither can expat, libxml2,
or some of the more popular libraries. Don't suggest I pass it through
Tidy and then parse the XML. There are a lot of pages that Tidy can't
handle.

Finally, there will be some smartass who will say that I should use
web sites that are written in good HTML. I don't have a choice of which
pages I, or the people who ask me to write scripts, take our content
from. Fine. If you have the millions to pay all those webmasters to
hire HTML gurus who will generate good HTML, let me know and
I will email you a list. As for me, I am too busy with real work on my
own projects to go around nagging people working on other things to
improve their coding style.

Thanks

The reply-to email address is olczy...@yahoo.com.
This is an address I ignore.
To reply via email, remove 2002 and change yahoo to
interaccess,

**
Thaddeus L. Olczyk, PhD

There is a difference between
*thinking* you know something,
and *knowing* you know something.

James Britt

Jun 21, 2005, 1:08:41 PM

TLOlczyk wrote:
> First let me be very clear. I hate the language that Larry "should be
> lined up against a " Wall has written. IMO it encourages people
> to program with, well only men can program that way, instead of their
> heads.
>
> However, as bad as the language is, LWP is one of the best libraries
> around when it comes to web-related applications. Most notably,
> I have never found a library that can parse HTML as well as
> LWP's HTML parser. It is my eternal hope that I can find a library as
> good and dump the language.
>
>
> With the advent of Ruby on Rails, I am hopeful that there might be a
> package in Ruby that gives Perl's HTML parser a run for its money.

Look at Narf, and its htmltools and xmltree.
Or Michael Neumann's Mechanize. It wraps htmltools and xmltree.

James

--

http://www.ruby-doc.org - The Ruby Documentation Site
http://www.rubyxml.com - News, Articles, and Listings for Ruby & XML
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys

Mark Thomas

Jun 21, 2005, 1:22:17 PM

> I'm not looking for an XML parser; XML parsers just can't handle
> many of the web sites I want to parse. Neither can expat, libxml2,
> or some of the more popular libraries.

Have you tried libxml2 in parse_html mode with the recover option on?
I've never had a problem with any site. It handles broken, nasty HTML
quite nicely.

(Disclaimer: I don't know if the Ruby bindings expose this
functionality).
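
A minimal sketch of what tolerant parsing of broken markup looks like
in practice. It does not use libxml2 or any Ruby binding: as a
stand-in it assumes Python's standard-library html.parser, which also
keeps going when tags are unclosed or attributes are unquoted. The
LinkCollector class and the sample markup are hypothetical, made up
for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, however sloppy the markup is."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names before calling us;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical tag soup: unclosed <p>, <b> and <a>, upper-case tags,
# an unquoted attribute value, and a stray closing tag.
BROKEN_HTML = """
<html><body>
<P>First paragraph with <b>bold text that is never closed
<p>Second paragraph <A HREF=renew.cgi?id=42>renew</a>
</i><a href="http://example.com/page">example
</body>
"""

collector = LinkCollector()
collector.feed(BROKEN_HTML)
collector.close()
print(collector.links)
```

An event-style parser like this never builds a tree, so it has nothing
to repair; a tree-building parser in a recovery mode has to guess where
the missing close tags belong, which is where parsers tend to differ.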

Daniel Amelang

Jun 21, 2005, 1:51:02 PM

I did a poor man's port of BeautifulSoup once...if there's enough
interest, we could turn it into something useful. I assume you're
doing some screen scraping thing?

Here's the original BeautifulSoup. Look like what you need?

http://www.crummy.com/software/BeautifulSoup/

Would anyone be interested either as a user or a developer?

Dan

James Edward Gray II

Jun 21, 2005, 2:05:46 PM

I'm not a Python guy, so I don't know the library. However, I just
browsed through the site and if you ask me, it looks downright handy.

James Edward Gray II

Ryan Leavengood

Jun 21, 2005, 2:24:49 PM

James Britt said:
>
> Look at Narf, and its htmltools and xmltree.
> Or Michael Neumann's Mechanize. It wraps htmltools and xmltree.

I used Mechanize over the weekend and I just love it. In fact I had a
couple small problems that Michael fixed within hours.

I am using it to automate renewal of library books using my library's
web-site. I was amazed at how quickly I got my solution working, because
the library web-site software has some gnarly URLs and redirects that I
figured would be "fun" to deal with. But Mechanize makes it trivial.

Anyhow, the HTML from the library web-site parses fine and I easily scrape
out the information I care about (book titles, authors, and due dates).
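
For illustration only, a rough sketch of that kind of scrape. It is not
Mechanize and not the actual script: it assumes Python's
standard-library html.parser, and the LoanTableParser class, the table
layout, and the sample rows are all hypothetical. Rows and cells are
opened by their start tags alone, so missing </td> and </tr> tags do
not matter.

```python
from html.parser import HTMLParser


class LoanTableParser(HTMLParser):
    """Gather the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, None outside a row
        self._cell = None   # text chunks of the cell being read, or None

    def _end_cell(self):
        if self._cell is not None:
            # Join the chunks and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row:       # rows without <td> cells (headers) are dropped
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()     # a new <tr> implicitly closes the old row
            self._row = []
        elif tag == "td" and self._row is not None:
            self._end_cell()    # a new <td> implicitly closes the old cell
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._end_row()         # flush a row left open at end of input


# Hypothetical page fragment with made-up loans and sloppy markup.
SAMPLE = """
<table border=1>
<tr><th>Title<th>Author<th>Due
<tr><td>Book One<td>A. Author<td>2005-07-05
<tr><td><b>Book Two</td><td>B. Writer<td> 2005-07-12
</table>
"""

parser = LoanTableParser()
parser.feed(SAMPLE)
parser.close()
for title, author, due in parser.rows:
    print(f"{title} by {author}, due {due}")
```

Fetching the page, logging in, and following the redirects are left
out here; the sketch covers only the parsing step.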

Ryan

Ezra Zygmuntowicz

Jun 21, 2005, 5:18:55 PM

+1 I would use it

-Ezra Zygmuntowicz
Yakima Herald-Republic
WebMaster
509-577-7732
ez...@yakima-herald.com