I've been using this bit of code with some success:

    require 'hpricot'
    require 'open-uri'
    doc = Hpricot(open("http://#{url}"))
    doc.traverse_text { |text| puts text }  # e.g. print each text node

My problem is that I want to exclude text inside of script and style
elements. Is there a simple way to exclude these types of elements
when traversing text? What would be the appropriate way of filtering
out these elements using Hpricot?

The pages that I am scraping were selected at random, so I don't know
anything about their structure or design a priori. I've looked for an
example or a tutorial, but haven't been able to find anything helpful.

Just as an example of what I'm looking for, I have written a similar
script using Java and the W3C's Tidy parser with code that looks like
this:

    !childNode.getNodeName().equalsIgnoreCase("script")
        && !childNode.getNodeName().equalsIgnoreCase("style")
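
In Ruby, I imagine the equivalent check would look roughly like the
sketch below. It assumes that Hpricot text nodes respond to `parent` and
that elements respond to `name`; the helper name
`inside_skipped_element?` is made up for illustration, and I don't know
whether this is the idiomatic Hpricot way to do it.

```ruby
# Element names whose text should be skipped (mirrors the Java check).
SKIPPED_ELEMENTS = %w[script style].freeze

# Returns true if any ancestor of +node+ is a <script> or <style> element.
# Assumption: nodes expose #parent and elements expose #name.
# The helper name is hypothetical, not part of Hpricot.
def inside_skipped_element?(node)
  ancestor = node.respond_to?(:parent) ? node.parent : nil
  while ancestor
    if ancestor.respond_to?(:name) &&
       SKIPPED_ELEMENTS.include?(ancestor.name.to_s.downcase)
      return true
    end
    ancestor = ancestor.respond_to?(:parent) ? ancestor.parent : nil
  end
  false
end

# Intended use with Hpricot (untested):
#   doc.traverse_text do |text|
#     puts text.to_s unless inside_skipped_element?(text)
#   end
```

Walking up every ancestor, rather than only the immediate parent, is
meant to cover markup nested inside a script or style element.
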
Thank you in advance for any help that anyone might be able to offer.
--dan