HTML parsing as good as Perls.
TLOlczyk  
Newsgroups: comp.lang.ruby
From: TLOlczyk <olczyk2...@yahoo.com>
Date: Tue, 21 Jun 2005 16:51:47 GMT
Local: Tues, Jun 21 2005 12:51 pm
Subject: HTML parsing as good as Perls.
First let me be very clear. I hate the language that Larry "should be
lined up against a " Wall has written. IMO it encourages people
to program with, well only men can program that way, instead of their
heads.

However, as bad as the language is, LWP is one of the best libraries
around when it comes to web-related applications. Most notably,
I have never found a library which can parse HTML as well as
LWP's HTML parser. It is my eternal hope that I can find a library as
good, and dump the language.

With the advent of Ruby on Rails, I am hopeful that there might be a
package in Ruby that gives Perl's HTML parser a run for its money.

I'm not looking for an XML parser; XML parsers just can't handle
many of the web sites I want to parse. Neither can expat, libxml2
or some of the more popular libraries. Don't suggest I pass it through
Tidy then parse the XML. There are a lot of pages that Tidy can't
handle.
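
To make the distinction above concrete, here is a minimal sketch of a strict XML parser rejecting tag soup while a tolerant HTML tokenizer still extracts links and text from it. It uses Python's standard library purely for illustration; it is not a Ruby solution and not the Perl parser mentioned above. The sample markup and the names `TAG_SOUP`, `parses_as_xml`, `LinkCollector` and `collect` are invented for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical tag soup: unclosed <p> and <b>, a stray </i>,
# an unquoted attribute value and a missing </html>.
TAG_SOUP = "<html><body><p>First<p>Second <b>bold</i> <a href=page.html>link</a></body>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class LinkCollector(HTMLParser):
    """Collect href values and visible text without requiring well-formed input."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep only non-blank text runs.
        if data.strip():
            self.text.append(data.strip())


def collect(markup):
    """Feed markup to the tolerant tokenizer and return (links, text)."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links, collector.text
```

Calling `parses_as_xml(TAG_SOUP)` reports a failure, while `collect(TAG_SOUP)` still yields the link target and the text runs. Note that this tokenizer only reports tags as they appear; it does not repair the document into a tree, which is the harder job a full HTML parser takes on.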

Finally, there will be some smartass who will say that I should use
web sites that are written in good HTML. I don't have a choice of which
pages I, or the people who ask me to write scripts, take our content
from. Fine. If you have the millions to pay all those webmasters to
hire HTML gurus that will generate good HTML let me know and
I will email you a list. As for me, I am too busy with real work on my
own projects to go around nagging people working on other things to
improve their coding style.

Thanks

The reply-to email address is olczyk2...@yahoo.com.
This is an address I ignore.
To reply via email, remove 2002 and change yahoo to
interaccess,

**
Thaddeus L. Olczyk, PhD

There is a difference between
*thinking* you know something,
and *knowing* you know something.


James Britt  
Newsgroups: comp.lang.ruby
From: James Britt <jame...@neurogami.com>
Date: Wed, 22 Jun 2005 02:08:41 +0900
Local: Tues, Jun 21 2005 1:08 pm
Subject: Re: HTML parsing as good as Perls.

TLOlczyk wrote:
> First let me be very clear. I hate the language that Larry "should be
> lined up against a " Wall has written. IMO it encourages people
> to program with, well only men can program that way, instead of their
> heads.

> However, as bad as the language is, LWP is one of the best libraries
> around when it comes to web-related applications. Most notably,
> I have never found a library which can parse HTML as well as
> LWP's HTML parser. It is my eternal hope that I can find a library as
> good, and dump the language.

> With the advent of Ruby on Rails, I am hopeful that there might be a
> package in Ruby that gives Perl's HTML parser a run for its money.

Look at Narf, and its htmltools and xmltree.
Or Michael Neumann's Mechanize. It wraps htmltools and xmltree.

James

--

http://www.ruby-doc.org - The Ruby Documentation Site
http://www.rubyxml.com  - News, Articles, and Listings for Ruby & XML
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com  - Playing with Better Toys


Mark Thomas  
Newsgroups: comp.lang.ruby
From: "Mark Thomas" <m...@thomaszone.com>
Date: 21 Jun 2005 10:22:17 -0700
Local: Tues, Jun 21 2005 1:22 pm
Subject: Re: HTML parsing as good as Perls.

> I'm not looking for an XML parser; XML parsers just can't handle
> many of the web sites I want to parse. Neither can expat, libxml2
> or some of the more popular libraries.

Have you tried libxml2 in parse_html mode with the recover option on?
I've never had a problem with any site. It handles broken, nasty HTML
quite nicely.

(Disclaimer: I don't know if the Ruby bindings expose this
functionality).


Daniel Amelang  
Newsgroups: comp.lang.ruby
From: Daniel Amelang <daniel.amel...@gmail.com>
Date: Wed, 22 Jun 2005 02:51:02 +0900
Local: Tues, Jun 21 2005 1:51 pm
Subject: Re: HTML parsing as good as Perls.
I did a poor man's port of BeautifulSoup once...if there's enough
interest, we could turn it into something useful. I assume you're
doing some screen scraping thing?

Here's the original BeautifulSoup. Look like what you need?

http://www.crummy.com/software/BeautifulSoup/

Would anyone be interested either as a user or a developer?

Dan


James Edward Gray II  
Newsgroups: comp.lang.ruby
From: James Edward Gray II <ja...@grayproductions.net>
Date: Wed, 22 Jun 2005 03:05:46 +0900
Local: Tues, Jun 21 2005 2:05 pm
Subject: Re: HTML parsing as good as Perls.
On Jun 21, 2005, at 12:51 PM, Daniel Amelang wrote:

> I did a poor man's port of BeautifulSoup once...if there's enough
> interest, we could turn it into something useful. I assume you're
> doing some screen scraping thing?

> Here's the original BeautifulSoup. Look like what you need?

> http://www.crummy.com/software/BeautifulSoup/

> Would anyone be interested either as a user or a developer?

I'm not a Python guy, so I don't know the library.  However, I just  
browsed through the site and if you ask me, it looks downright handy.

James Edward Gray II


Ryan Leavengood  
Newsgroups: comp.lang.ruby
From: "Ryan Leavengood" <mrc...@netrox.net>
Date: Wed, 22 Jun 2005 03:24:49 +0900
Local: Tues, Jun 21 2005 2:24 pm
Subject: Re: HTML parsing as good as Perls.
James Britt said:

> Look at Narf, and its htmltools and xmltree.
> Or Michael Neumann's Mechanize. It wraps htmltools and xmltree.

I used Mechanize over the weekend and I just love it. In fact I had a
couple small problems that Michael fixed within hours.

I am using it to automate renewal of library books using my library's
web-site. I was amazed at how quickly I got my solution working, because
the library web-site software has some gnarly URLs and redirects that I
figured would be "fun" to deal with. But Mechanize makes it trivial.

Anyhow, the HTML from the library web-site parses fine and I easily scrape
out the information I care about (book titles, authors and due dates).
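
The kind of extraction described above can be sketched roughly as follows. This is not Mechanize and not the script from this post; it is a Python standard-library illustration only. The table layout, the placeholder values and the names `SAMPLE`, `RowScraper` and `scrape_rows` are assumptions, since the real library page is not shown in the thread.

```python
from html.parser import HTMLParser

# Hypothetical markup with placeholder values. The closing </td> and </tr>
# tags are left out on purpose, as sloppy pages often omit them.
SAMPLE = """
<table>
<tr><td>Title A<td>Author A<td>2005-07-05
<tr><td>Title B<td>Author B<td>2005-07-12
</table>
"""


class RowScraper(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts; any open cell is implicitly finished.
            self.rows.append([])
            self._in_cell = False
        elif tag == "td" and self.rows:
            # A new cell starts; the previous one is implicitly finished.
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def scrape_rows(markup):
    """Return one (title, author, due date) style tuple per non-empty row."""
    scraper = RowScraper()
    scraper.feed(markup)
    scraper.close()
    return [tuple(cell.strip() for cell in row) for row in scraper.rows if row]
```

Calling `scrape_rows(SAMPLE)` gives one tuple of cell texts per row. A real script would also have to fetch the page, follow redirects and log in, which is the part Mechanize handles.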

Ryan


Ezra Zygmuntowicz  
Newsgroups: comp.lang.ruby
From: Ezra Zygmuntowicz <e...@yakima-herald.com>
Date: Wed, 22 Jun 2005 06:18:55 +0900
Local: Tues, Jun 21 2005 5:18 pm
Subject: Re: HTML parsing as good as Perls.

On Jun 21, 2005, at 11:05 AM, James Edward Gray II wrote:

+1 I would use it

-Ezra Zygmuntowicz
Yakima Herald-Republic
WebMaster
509-577-7732
e...@yakima-herald.com

