Extracting text

danwa...@gmail.com

Aug 23, 2007, 2:15:02 PM
to whytheluckystuff
I'm trying to extract just the text nodes, one by one, for processing
from a set of HTML documents.

I've been using this bit of code with some success:

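require 'open-uri'
require 'hpricot'
doc = Hpricot(open("http://#{url}"))
doc.traverse_text { |text| puts text }  # puts is a placeholder for the real processing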

My problem is that I want to exclude text inside of script and style
elements. Is there a simple way to exclude these types of elements
when traversing text? What would be the appropriate way of filtering
out these elements using Hpricot?

The pages that I am scraping were selected at random, and I don't
know anything about their structure or design a priori. I've looked
for an example or a tutorial, but haven't been able to find anything
helpful.

Just as an example of what I'm looking for, I have written a similar
script using Java and the W3C's Tidy parser with code that looks like
this:

!childNode.getNodeName().equalsIgnoreCase("script")
    && !childNode.getNodeName().equalsIgnoreCase("style")
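
To make the intent concrete, here is a rough Ruby sketch of the same name check applied while walking a tree. It is only an illustration, not Hpricot code: it uses REXML from Ruby's standard distribution as a stand-in parser, so it assumes well-formed markup, which randomly selected pages often are not. The names each_visible_text, SKIPPED_TAGS, and SAMPLE are made up for the example.

```ruby
require 'rexml/document'

# Element names whose contents should be ignored (hypothetical constant).
SKIPPED_TAGS = %w[script style].freeze

# Yields each non-blank text node under +node+, in document order,
# without descending into <script> or <style> elements.
def each_visible_text(node, &block)
  node.children.each do |child|
    case child
    when REXML::Element
      # Same test as the Java condition: skip by (case-insensitive) node name.
      next if SKIPPED_TAGS.include?(child.name.downcase)
      each_visible_text(child, &block)
    when REXML::Text
      text = child.value.strip
      yield text unless text.empty?
    end
  end
end

# A small well-formed sample page (made up for this example).
SAMPLE = <<~HTML
  <html>
    <head>
      <title>Demo</title>
      <style>p { color: red; }</style>
      <script>var hidden = 1;</script>
    </head>
    <body>
      <p>Hello <b>world</b></p>
    </body>
  </html>
HTML

texts = []
each_visible_text(REXML::Document.new(SAMPLE)) { |t| texts << t }
puts texts
```

Skipping the whole subtree as soon as a script or style element is reached avoids having to inspect each text node's ancestors afterwards.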

Thank you in advance for any help that anyone might be able to offer.

--dan
