REXML screen scraping questions

Dan Kohn

Sep 14, 2005, 7:52:04 AM

My goal here is to take an HTML table and convert it into an array of
arrays, with each inner array representing the 5 columns of cells in a
given row and the outer array representing the whole table.

I'm using REXML to parse the DOM tree. I would appreciate suggestions
on cleaning up the code below. I've run into the following problems:

+ The result of the first XPath.match produces two root-level TR tags,
which causes REXML to fail on reparsing with an "attempted adding
second root element to document" error. My solution was to add
root-level <top> tags, but that's an ugly hack.

+ The biggest problem is that while XPath.match generates an array, the
REXML functions are no longer able to parse it. Instead, I settled on
the hack of converting the array to a string and then having REXML
reparse it. Is this really the best way to deal with recursive
parsing?

+ I can't create rowarray or tablearray because I get an
"xmlscrape.rb:45: undefined local variable or method `rowarray' for
main:Object (NameError)" error.

+ Ruby doesn't crash if I remove rowdom and just run the XPath on row.
However, I then get duplicates because it runs across the full DOM
tree, not just the portion of the tree I've selected in that loop. Is
there a way to have REXML realize that I want to work with a subset of
the tree, other than my too-complex string-conversion and reloading?

+ The :compress_whitespace directive does not seem to correctly realize
that newlines within a text entity are just regular whitespace and so
should be compressed. My solution was to use string.gsub to replace
all newlines with spaces at the start.

+ Some important text is inside <A> tags, but it's hard to remove a tag
while preserving the text inside. I finally got the replace_tag syntax
working and put it in a replace_tag method, so I'm good to go there.

I'm obviously new to Ruby, so any help you can offer on cleaning this
up would be greatly appreciated.

- dan
--
Dan Kohn <mailto:d...@dankohn.com>
<http://www.dankohn.com/> <tel:+1-415-233-1000>

require "rexml/document"
include REXML
string = <<EOF
<html>
<tr>
<td class="t4" nowrap="nowrap">9-Jan-05</td>
<td class="t4"><a href="javascript:lu('OZ')">OZ</a> 0204 F
Class
<a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/ICN,itn/air/mp">
ICN</a> to <a
href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/LAX,itn/air/mp">
LAX</a></td>
<td class="t4" nowrap="nowrap">5,968</td>
<td class="t4" nowrap="nowrap">2,984</td>
<td class="t4" nowrap="nowrap">8,952</td>
</tr>
<tr>
<td class="t4" nowrap="nowrap">19-Jan-05</td>
<td class="t4">MILEAGE PLUS UPGRADE AWARD
15,000 MILES</td>
<td class="t4" nowrap="nowrap">-15,000</td>
<td> </td>
<td class="t4" nowrap="nowrap">-15,000</td>
</tr>
</html>
EOF

def remove_tag( rexml_array, tag )
  # Removes tag but leaves the text inside the tag as text inside the
  # parent of the now removed tag
  while rexml_array.elements["//#{tag}"]
    rexml_array.elements["//#{tag}"].replace_with(
      Text.new( rexml_array.elements["//#{tag}"].text.strip ) )
  end
end

doc = Document.new( string.gsub!(/\n| /," "), { :compress_whitespace => :all } )
table = XPath.match( doc, "//tr[count(td)=5]")
#doc = Document.new File.new( "uamileage.html")
#rows = XPath.match( doc, "//tr[count(td)=5][position()=6 or position()=7]")
table = "<top>", table, "</top>"
tabledom = Document.new( table.to_s)

XPath.each( tabledom,"/top/tr") { |row|
  rowdom = Document.new( row.to_s)
  XPath.each( rowdom,"//tr/td") { |cell|
    remove_tag( cell,"a")
    celltext = cell.texts.to_s
    print celltext,"\n"
    # rowarray << celltext
  }
  puts "\n --- \n"
  # tablearray << rowarray
}
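
On the question of working with a subset of the tree: an XPath that starts with `//` is evaluated from the document root even when an element is passed as the context, which is one likely reason for the duplicates. A minimal sketch, assuming a small stand-in table rather than the real mileage page, shows a relative path (`td`) keeping the search inside each row, so neither the `<top>` wrapper nor the reparse is needed:

```ruby
require "rexml/document"

# Hypothetical stand-in for the scraped page (not the original data).
html = <<EOF
<html>
<tr><td>9-Jan-05</td><td>5,968</td></tr>
<tr><td>19-Jan-05</td><td>-15,000</td></tr>
</html>
EOF

doc = REXML::Document.new( html )

tablearray = []
REXML::XPath.each( doc, "//tr" ) { |row|
  rowarray = []
  # "td" is relative to row, so only this row's own cells are visited
  REXML::XPath.each( row, "td" ) { |cell| rowarray << cell.text }
  tablearray << rowarray
}

tablearray.each { |el| puts el.join(":") }
```

Both arrays are assigned before anything is appended to them, which also avoids the NameError mentioned above.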

daz

Sep 15, 2005, 12:48:08 AM

Dan Kohn wrote:
> My goal here is to take an HTML table and convert it into an array of
> arrays, with each inner array representing the 5 columns of cells in a
> given row and the outer array representing the whole table.
>
> I'm using REXML to parse the DOM tree. I would appreciate suggestions
> on cleaning up the code below. I've run into the following problems:
>

> [snip problems] :-)

Hiya Dan,

I know you've tried other packages but I think REXML isn't what
you want for sloppy old HTML work.

Below is an "htmltools" implementation using your input.
(If you've used Mechanize, html/tree may be installed, already.)

If you just want to search the data with regexen, you
might make a method which yields the data to its block
at the point where it's been collated.
(Post back if you need any help with that)
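
A minimal sketch of that idea, assuming the cell data has already been collated into nested arrays (the `each_cell` name and the sample rows are hypothetical, not part of htmltools):

```ruby
# Hypothetical helper: walks already-collated rows and yields each cell
# to the caller's block together with its row and column index.
def each_cell( rows )
  rows.each_with_index do |row, r|
    row.each_with_index do |data, c|
      yield r, c, data
    end
  end
end

rows = [ ["9-Jan-05", "5,968"], ["19-Jan-05", "-15,000"] ]

# The block decides what to keep; here, only cells that look like dates
dates = []
each_cell( rows ) { |r, c, data|
  dates << data if data =~ /\A\d+-[A-Z][a-z]{2}-\d\d\z/
}
puts dates.join(", ")
```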

#-----------------------------------------------------------
require 'html/tree' # http://ruby-htmltools.rubyforge.org/

exa = HTMLTree::Parser.new(verbose=true, false)
#exa.parse_file_named('xxx.html')
exa.feed(string) # replacing '.parse_file_named'

# Pick out all <tr> tags under the <html> tag
exa.html.children.select {|e0| e0.tag == 'tr'}.each do |tr|
  # p [:line, __LINE__, tr.to_s]
  tdn = 0
  # Pick out all <td> tags under successive <tr> tags
  tr.select {|e1| e1.tag == 'td'}.each do |td|
    data = ''; tdn += 1
    # Inside the <td> are untagged data or nested tags
    td.each do |item|
      if item.data?
        # p [:data, item.to_s]
        data << item.to_s
      else
        # p [:line, __LINE__, item.tag, item.attributes]
      end
    end
    data.gsub!(/\s+/, ' ')
    ### yield data # ? (from a method)
    puts '#--td%02d--> %s' % [tdn, data]
  end
  puts '#' << '='*55
end

#--td01--> 9-Jan-05
#--td02--> OZ 0204 F Class ICN to LAX
#--td03--> 5,968
#--td04--> 2,984
#--td05--> 8,952
#=======================================================
#--td01--> 19-Jan-05
#--td02--> MILEAGE PLUS UPGRADE AWARD 15,000 MILES
#--td03--> -15,000
#--td04-->  
#--td05--> -15,000
#=======================================================

It's up to you to make the output _and_ the code look pretty.

daz

Dan Kohn

Sep 15, 2005, 4:55:58 AM

Daz, thank you so much for taking the time to code that. I was also
busy today, and got my code working with REXML. Could you please take
a look at my code below and share your thoughts on whether you'd still
switch to htmltools.

The issue is that I'm creating a hundred different screen scrapers for
every frequent flyer program. Any scraper is, of course, brittle, but
it seemed to me like a DOM/XPath-based technique is both less likely to
break from small tweaks to the page and is also generally far more
concise to program. The downside, and it may be too big, is that my
code is awfully inefficient, and also requires that tidy be run on the
HTML before I start.

Also, since you're taking a look, could you please tell me if there's
any more concise way to initialize my arrays. (Ruby generally seems to
figure out variables, but this would only run if I explicitly used
Array.new.)
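
On the array question: a local variable has to be assigned before `<<` can be called on it, which is why the earlier version raised NameError. A small sketch with made-up rows shows `[]` as shorthand for `Array.new`, and `map` as one way to skip the explicit initialization altogether:

```ruby
rows = [ %w[a b], %w[c d] ]   # hypothetical stand-in for parsed rows

# Explicit accumulator: [] is equivalent to Array.new
tablearray = []
rows.each { |row| tablearray << row.map { |cell| cell.upcase } }

# Same result without an accumulator: map returns the new array
mapped = rows.map { |row| row.map { |cell| cell.upcase } }

puts tablearray.inspect
puts mapped.inspect
```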

# (require, include, and the string heredoc are the same as in my first post)

def remove_tag( rexml_array, tag )
  # Removes tag but leaves the text inside the tag as text inside
  # the parent of the now removed tag
  while rexml_array.elements["//#{tag}"]
    rexml_array.elements["//#{tag}"].replace_with(
      Text.new( rexml_array.elements["//#{tag}"].text.strip ) )
  end
end

doc = Document.new( string.gsub!(/\n| /," "), { :compress_whitespace => :all } )

tablearray = Array.new
XPath.each( doc,"//tr[count(td)=5]") { |row|
  rowarray = Array.new
  rowdom = Document.new( row.to_s)
  XPath.each( rowdom,"//td") { |cell|
    remove_tag( cell,"a")
    rowarray << cell.texts.to_s
  }
  tablearray << rowarray
}
tablearray.each {|el| print el.join(":"),"\n"}
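
As an alternative to replacing the `<a>` elements, the text of a cell and all of its descendants can be gathered recursively and the whitespace collapsed afterwards. A sketch, assuming a cut-down cell modeled on the sample row (the `all_text` helper and the shortened `href` values are hypothetical, not part of REXML or the original page):

```ruby
require "rexml/document"

# Hypothetical helper: concatenates the text of node and all its descendants.
def all_text( node )
  node.children.map { |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then all_text( child )
    else ""
    end
  }.join
end

cell_xml = <<EOF
<td><a href="x">OZ</a> 0204 F
Class <a href="y">
ICN</a> to <a href="z">
LAX</a></td>
EOF

cell = REXML::Document.new( cell_xml ).root
# Newlines count as whitespace here, so one gsub covers them too
celltext = all_text( cell ).gsub( /\s+/, " " ).strip
puts celltext
```

Because the tree is only read, not modified, the same document can be queried again afterwards.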

Even better is some other scraping I do on the same page, where in each
case I only need a one-dimensional array:

XPath.each( xml, "//td[@class='t3'][2]") { |cell|
  summaryarray << cell.texts.to_s
}

XPath.each( xml, "//td[@colspan='4']/child::*") { |cell|
  actsumarray << cell.text.to_s
}

Thanks again, Daz, for taking the time to look at my (first ever Ruby)
code. Any other suggestions you could offer would be greatly
appreciated.

- dan

daz

Sep 15, 2005, 9:46:45 AM

Dan Kohn wrote:
> [...]

> The issue is that I'm creating a hundred different screen scrapers for
> every frequent flyer program. Any scraper is, of course, brittle, but
> it seemed to me like a DOM/XPath-based technique is both less likely to
> break from small tweaks to the page and is also generally far more
> concise to program. The downside, and it may be too big, is that my
> code is awfully inefficient, and also requires that tidy be run on the
> HTML before I start.

Hi Dan,

Your code, IMHO, is inefficient due to the use of 'industrial grade'
software for a lightweight task, not from your coding.
I've run traces on REXML progs and the detailed work it carries out
is quite incredible (and necessary for its power).
Estimating conservatively, from timing and profiling of comparable
scripts, I'd say that I could run 15 pages through 'tools' for each
one going through REXML ... probably as many as 30 ... and even more
once you count the pre-processing with Tidy.
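
A rough way to reproduce this kind of comparison is Ruby's standard `benchmark` library. The sketch below times only a REXML parse of a made-up 200-row table; the htmltools side would be timed the same way, and the figures will vary by machine, so it makes no claim about the ratios quoted above:

```ruby
require "benchmark"
require "rexml/document"

# Made-up input: 200 rows of 5 cells each
row  = "<tr>" + "<td>x</td>" * 5 + "</tr>"
html = "<html>" + row * 200 + "</html>"

rows = nil
elapsed = Benchmark.realtime {
  doc  = REXML::Document.new( html )
  rows = REXML::XPath.match( doc, "//tr" ).size
}
puts "parsed %d rows in %.4f s" % [ rows, elapsed ]
```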

>
> Also, since you're taking a look, could you please tell me if there's
> any more concise way to initialize my arrays. (Ruby generally seems to
> figure out variables, but this would only run if I explicitly used
> Array.new.)
>

That's not a factor :)

>
> Thanks again, Daz, for taking the time to look at my (first ever Ruby)
> code. Any other suggestions you could offer would be greatly
> appreciated.
>
> - dan

Glad to help.

Just one suggestion; your REXML experience won't be wasted --
don't hesitate to use REXML when it's needed (or at the weekends ;)
- it is /class/, as you know.
For this specific task, with speed being important, you need to use
a lighter package. I've used only one for any length of time, so I
can't compare with others.
Many folks would tackle this job with hand-parsing/regexps or this:
http://raa.ruby-lang.org/project/htmltokenizer/ - which may offer
you even better performance.

# Script used for timing comparisons against your latest.
#--------------------------------------------------------
exa = HTMLTree::Parser.new(verbose=true, ws=false)
exa.feed(string) # replacing '.parse_file_named'

tablearray = []

exa.html.children.select {|e0| e0.tag == 'tr'}.each do |tr|

  rowarray = []

  tr.select {|e1| e1.tag == 'td'}.each do |td|
    data = ''

    td.each do |item|
      data << item.to_s if item.data?
    end
    data.gsub!(/(\s| )+/, ' ')
    rowarray << data
  end
  tablearray << rowarray
end
tablearray.each {|el| puts el.join(":")}
#-------------------------------------------------------------------
9-Jan-05:OZ 0204 F Class ICN to LAX:5,968:2,984:8,952
19-Jan-05:MILEAGE PLUS UPGRADE AWARD 15,000 MILES:-15,000: :-15,000
#-------------------------------------------------------------------

Cheers,

daz
--

BTW, 'tools' does a similar job to Tidy (outputting to REXML format !):

require 'html/xpath' # http://ruby-htmltools.rubyforge.org/
exa = HTMLTree::Parser.new(verbose=false, strip_white=false)
exa.feed(string)
puts exa.tree.as_rexml_document

Gavin Kistner

Sep 15, 2005, 10:00:11 AM

On Sep 14, 2005, at 10:51 PM, daz wrote:
> If you just want to search the data with regexen, you
> might make a method which yields the data to its block
> at the point where it's been collated.
> (Post back if you need any help with that)

Following is a sample of what I do for screen scraping: Net::HTTP
and regex only. Just look for an indicative message on the screen,
abstract it appropriately, and use it as an anchor for the data you
want.

The following undocumented script hammers the WUnderground.com server
to get min/max/average temperatures for a given city (airport code)
for a given date range (optionally across many years). I used it (and
Excel) to create
http://phrogz.net/tmp/BoulderTemperatures_LateSeptember.pdf
and
http://phrogz.net/tmp/CopperTemperatures_LateSeptember.pdf
(Trying to give a bunch of family members coming into town for a
wedding a feel for the potential temperature ranges, and the
variations possible within a given 5-day period.)

require 'net/http'
require 'date'

def get_temperatures( airport_code, date_range, year_range=nil )
  if year_range
    d1 = date_range.first
    d2 = date_range.last
    dates = year_range.collect { |year|
      ( Date.new( year, d1.mon, d1.day )..Date.new( year, d2.mon, d2.day ) ).to_a
    }.flatten
  else
    dates = date_range.to_a
  end

  Net::HTTP.start('www.wunderground.com', 80) { |http|
    dates.collect { |date|
      url = "/history/airport/#{airport_code}/#{date.year}/#{date.mon}/#{date.day}/DailyHistory.html"
      html = http.get( url ).body
      stats = { :min=>'Min', :max=>'Max', :mean=>'Mean' }
      stats.each { |key,val|
        if str = html[ %r{#{val}(?: Temp(?:erature))?</td>.+?<td[^>]*>(.+?)</td>}im , 1 ]
          temp = str[ %r{(\d+).+?\°F}i, 1 ]
          stats[ key ] = temp ? temp.to_f : nil
        else
          stats[ key ] = nil
        end
      }
      if stats[ :min ] && stats[ :max ]
        DayTemperature.new( date, stats[ :min ], stats[ :max ], stats[ :mean ] )
      end
    }.compact
  }
end

class DayTemperature
  attr_accessor :date, :min, :max, :mean
  def initialize( date, min, max, mean=nil )
    @date = date
    @min = min
    @max = max
    @mean = mean
  end
  def to_s
    "%s\t%3i\t%3i\t%4i" % [ "#{@date.year}-#{@date.mon}-#{@date.day}", @max, @mean, @min ]
  end
end

temps = get_temperatures( 'KBJC', Date.new( 2000, 8, 15 )..Date.new( 2000, 9, 15 ), 1990..2005 )
puts temps.join( "\n" )
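
The anchor technique can be tried offline without hitting any server. A minimal sketch, assuming a hand-written fragment in place of a fetched page (the markup, labels, and values are made up for illustration and are not WUnderground's real HTML):

```ruby
# Hypothetical fragment standing in for a fetched page.
html = <<EOF
<tr><td>Mean Temperature</td>
<td class="v">62 F</td></tr>
<tr><td>Max Temperature</td>
<td class="v">75 F</td></tr>
EOF

stats = {}
{ :max => "Max", :mean => "Mean" }.each { |key, label|
  # Anchor on the label cell, then capture the contents of the next cell
  str = html[ %r{#{label} Temperature</td>.+?<td[^>]*>(.+?)</td>}im, 1 ]
  stats[ key ] = str ? str[ /(\d+)/, 1 ].to_f : nil
}

puts stats.inspect
```

The `m` flag lets `.` match the newline between the two cells, and the non-greedy `.+?` stops at the first cell that follows the label.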