Best way to parse/update HTML file?

Bucco

Jun 24, 2005, 9:13:48 PM

Sorry for the newbie question. I am trying to find the best method for
parsing an HTML file and changing one tag/item. Unfortunately, REXML
chokes on the file because of the incomplete tags. Completing the tags
is not an option either. What is the best way to find a specific tag
in an HTML file and change its text and attribute settings?

Thanks:)

SA

daz

Jun 26, 2005, 5:37:31 AM

Bucco wrote:
> Sorry for the newbie question.

This has been answered once or twice before by this group ;-)

The best way tends to involve using a package although
you /could/ work your way through it using regular expressions.

If you're likely to be doing this kind of thing in the
future, you'll be glad you spent a bit of time installing
one; then it's always available.

As I recall, a different package is often
recommended but I don't know which is best.

This is what some of us use:
http://ruby-htmltools.rubyforge.org/ (Ned Konz +)

Examples are included but here's another ...

#-----------------------------------------------------------------
EXAMPLE = <<EOX
<html lang="en">
<head>
<title>Page title</title>
</head>
<body>
<div id="Header">
<h1><a href="http://xxxx.net/"><em>q</em>URL.net</a></h1>
<p>For When You Want a Quick URL</p>
</div>
<hr>
<div id="Content">
<form action="http://xxxx.net/" method="post">
<fieldset>
<legend>Enter a <abbr title="Uniform Resource Locator">URL</abbr> to make into a xxxx:</legend>
<input id="InputURL" type="text" size="40" maxlength="65535" name="url" value="">
<input type="submit" name="action" value="Create xxxx">
</fieldset>
</form>
</div>
<hr>
<a href="http://xxxx.net/pages/contact">Contact</a> -
<a href="http://xxxx.net/downloads/">Downloads</a> -
<a href="http://xxxx.net/">Create</a> -
<a href="http://xxxx.net/pages/terms">Terms of Use</a> -
<a href="http://xxxx.net/pages/list">List</a> -
<a href="http://xxxx.net/pages/prefs">Preferences</a>
</body>
</html>
EOX

require 'html/tree' # http://ruby-htmltools.rubyforge.org/

verbose = true

exa = HTMLTree::Parser.new(verbose, !false)
#exa.parse_file_named('xxxx_net.html')
exa.feed(EXAMPLE) # replaces '.parse_file_named'

item_a = exa.html.select {|ea| ea.tag == 'a'}
item_a.each {|ea| p [:ahref, ea['href']]}
puts '+'*100

exa.html.each do |ea|
  p [ea.tag, ea['href']]
  ea.each do |item|
    if item.data?
      p [:data, item.to_s]
    elsif item.tag == 'a'
      item['href'].sub!(/xxxx/, 'mysite')
    end
  end
  puts '='*100
  ea.dump
end

### exa.html.dump

#-----------------------------------------------------------------

Output from the script above is too long to post here,
so I've uploaded it to:
http://www.d10.karoo.net/ruby/example_html_parse.txt

Hope this is of some use,

daz
--
JARH (Nihon-style) http://qurl.net/h3

mathew

Jun 27, 2005, 12:43:15 PM

For invalid "tag soup" HTML, your best bet is probably to use
html/htmltokenizer.

<URL:http://rubyforge.org/projects/htmltokenizer/>

It'll search for specified 'tags', returning the text skipped over,
which you can put into a buffer. Then you can get the attributes of the
'tag', and modify them, and put the result in the buffer. Finally, you
can slurp in the rest of the pseudo-HTML.
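
The buffer approach described above can be sketched with nothing but Ruby's standard StringScanner. This is not htmltokenizer itself, only an illustration of the same idea: copy the skipped text, rewrite one tag, then append the rest untouched. The helper name rewrite_first_tag and its regular expressions are assumptions made up for this sketch, and the attribute matching is deliberately naive.

```ruby
require 'strscan'

# Hypothetical helper (not part of htmltokenizer): sets one attribute on the
# first occurrence of a tag and copies everything else through unchanged.
def rewrite_first_tag(html, tag_name, attr_name, new_value)
  scanner = StringScanner.new(html)
  buffer  = ''.dup
  tag_re  = /<#{Regexp.escape(tag_name)}\b[^>]*>/i

  # 1. Search for the tag; scan_until returns the skipped text plus the tag.
  consumed = scanner.scan_until(tag_re)
  return html.dup unless consumed          # tag not found: nothing to change
  tag = scanner.matched
  buffer << consumed[0...-tag.size]        # text skipped over goes in the buffer

  # 2. Modify the attribute, or add it when the tag does not have it yet.
  attr_re = /(\b#{Regexp.escape(attr_name)}\s*=\s*)(?:"[^"]*"|'[^']*'|[^\s>]+)/i
  new_tag =
    if tag =~ attr_re
      tag.sub(attr_re) { "#{$1}\"#{new_value}\"" }
    else
      tag.sub(/>\z/) { " #{attr_name}=\"#{new_value}\">" }
    end
  buffer << new_tag

  # 3. Slurp in the rest of the document without looking at it.
  buffer << scanner.rest
end

soup = '<p>Intro<input type="text" name="url" value=""><p>Unclosed'
puts rewrite_first_tag(soup, 'input', 'value', 'http://example.com/')
```

Because only the matched tag is ever rewritten, unclosed tags elsewhere in the file pass through exactly as they were.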

mathew

Brad Wilson

Jun 27, 2005, 3:35:13 PM

If you're comfortable "cleaning it up", why not tidy it to XHTML then
use the XML parser? This is the approach I took recently when I needed
it.
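
As a rough sketch of the second half of that approach, here is how the edit might look in REXML once the markup is already well-formed XHTML. The tidy step itself is not shown, and the method name retarget_link, the id "home" and the sample markup are assumptions for illustration only.

```ruby
require 'rexml/document'

# Hypothetical helper: finds the <a> element with the given id and changes
# both its href attribute and its text. Assumes the input is well-formed.
def retarget_link(xhtml, id, new_href, new_text)
  doc  = REXML::Document.new(xhtml)
  link = REXML::XPath.first(doc, "//a[@id='#{id}']")
  return xhtml unless link                 # no such element: leave input alone

  link.attributes['href'] = new_href       # change an attribute setting
  link.text = new_text                     # change the element's text
  doc.to_s
end

sample = <<XHTML
<html>
  <body>
    <a id="home" href="http://xxxx.net/">Create</a>
    <hr/>
  </body>
</html>
XHTML

puts retarget_link(sample, 'home', 'http://mysite.net/', 'Home')
```

REXML may serialize attributes with different quote characters than the input used, so compare results by content rather than byte for byte.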

Bucco

Jun 27, 2005, 8:18:01 PM

I have a couple more questions, then:

1. I tried the example for the htmltokenizer and got an error around
assert. Where/what is the assert method?

2. What do you mean by "slurp" in the rest of the text?

3. Any better examples of how to use htmltokenizer?

Thanks:)
SA

mathew

Jun 28, 2005, 12:21:46 PM

Bucco wrote:
> 1. I tried the example for the htmltokenizer and got an error around
> assert. Where/what is the assert method?

An error around "assert" is likely an internal error of some kind.
Assertions are pieces of code placed in software to detect invalid
arguments to methods, internal data structure inconsistencies, and so on.

For example, consider the Ruby URI library. It doesn't support all kinds
of URI. So, it would be a good idea if it were to assert that the URI it
is being passed is one of the kinds it actually knows how to parse. That
way, someone innocently using the library with the wrong kind of URI
will discover the problem immediately, rather than being passed back bad
data, or having some bizarre error occur in the middle of the library code.

So it could be that you're passing an invalid argument to a method of
htmltokenizer. It's also possible that you're triggering a bug in the
library.
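
To make the idea concrete, here is a minimal sketch of an assertion guarding a method argument. Core Ruby has no built-in assert method (test frameworks such as Test::Unit supply one), so the assert helper below is hand-rolled, and http_host is a made-up method for illustration; neither comes from htmltokenizer or from the URI library.

```ruby
require 'uri'

# Hand-rolled assertion helper (an assumption for this sketch).
def assert(condition, message = 'assertion failed')
  raise ArgumentError, message unless condition
end

# Hypothetical method that only knows how to handle http/https URIs.
def http_host(text)
  uri = URI.parse(text)
  # Fail immediately on a kind of URI this method cannot handle,
  # instead of returning bad data later on.
  assert(uri.is_a?(URI::HTTP), "unsupported kind of URI: #{text}")
  uri.host
end

puts http_host('http://example.com/page')

begin
  http_host('mailto:someone@example.com')
rescue ArgumentError => e
  puts "caught: #{e.message}"
end
```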

> 2. What do you mean by "slurp" in the rest of the text?

"slurp" meaning "pull in the entire content of the file from the current
file pointer onwards, without performing any processing on it".

As in:

file = File.new("something.gif")
data = file.read # slurp!

<URL:http://www.retrologic.com/jargon/S/slurp.html>
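
A small sketch of the "from the current file pointer onwards" part: read one line normally, then slurp whatever is left. StringIO stands in for a real file here so the example is self-contained, and the method name split_header is an assumption for illustration.

```ruby
require 'stringio'

# Hypothetical helper: reads the first line, then slurps the remainder
# in one call without processing it.
def split_header(io)
  first = io.gets   # advances the file pointer past the first line
  rest  = io.read   # slurp: everything from the pointer to the end
  [first, rest]
end

io = StringIO.new("header line\nbody line 1\nbody line 2\n")
first, rest = split_header(io)
p first
p rest
```

The same two calls work on an object returned by File.open, since both File and StringIO respond to gets and read.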

> 3. Any better examples of how to use htmltokenizer?

require 'html/htmltokenizer'

#[...]

# Parse all the images and links out of the web page
tokenizer = HTMLTokenizer.new(@body)
@images = Array.new
@links = Array.new
lastlink = '' # most recent link seen before the current tag
while tag = tokenizer.getTag('img', 'a')
  if tag.tag_name == 'img'
    url = tag.attr_hash['src']
    uri = @uri.merge(url) # resolve against the page's own URI
    @images.push([uri.to_s, lastlink])
  else
    url = tag.attr_hash['href']
    uri = @uri.merge(url)
    @links.push(uri.to_s)
    lastlink = uri.to_s
  end
end

That's the only time I've used it, I'm afraid. Still, it might give you
some ideas.

mathew

why the lucky stiff

Jun 28, 2005, 4:27:47 PM

Bucco wrote:

Hi. I know I'm a bit late to the discussion, so 'sokay if you have an
answer already.

A really fantastic HTML parser library is HTree by Tanaka Akira.

<http://cvs.m17n.org/~akr/htree/>

It's completely forgiving of bad HTML and you can import the document
into REXML through the HTree parser.

require 'htree'
HTree.parse( "<b>Bad markup" ).to_rexml

The only downside is that you'll need to install the iconv library,
which can be a bit of a pain to track down on Windows. Other than that,
it's a great package.

_why

daz

Jun 29, 2005, 2:58:57 AM

_why wrote:
>
> [snip]

>
> A really fantastic HTML parser library is HTree by Tanaka Akira.
>

I'm glad you brought that in because I tried it last year and saw
that it was a serious "heavy horse" and perhaps a little bit
_more_ than I was looking for.
It was adding XHTML namespace prefixes to all tags, so a
horizontal rule, for example, became:

<{http://www.w3.org/1999/xhtml}hr>

>
> It's completely forgiving of bad HTML and you can import the
> document into REXML through the HTree parser.
>
> require 'htree'
> HTree.parse( "<b>Bad markup" ).to_rexml
>

It may have been in its early stages of development, but my
assumption that HTree would be too strict is under review :-)
Applying your example, I get the result I was expecting
without all that namespace stuff.

> The only downside is that you'll need to install the iconv library,
> which can be a bit of a pain to track down on Windows.

Instead of hunting around for that, I'd made a dummy Ruby version
(no functionality for those who don't need any):

In my old version there are two files which require 'iconv' -
("text.rb" and "encoder.rb") which I changed to:

begin
  require 'iconv'
rescue LoadError
  require 'htree/iconv_dummy'
end

Then, add this dummy file as:
lib\ruby\site_ruby\1.8\htree\iconv_dummy.rb

#-------------------------------------------------------------
class Iconv

  ## For testing : Not part of the HTree package ##
  warn "Using dummy iconv lib: #{__FILE__}"
  IC_DUMMY = true

  def Iconv.open(to, from)
    inst = Iconv.new(to, from) # initialize requires both arguments
    block_given? ? yield(inst) : inst
  end
  def Iconv.iconv(to, from, *strs)
    strs.join
  end
  def Iconv.conv(to, from, str)
    str
  end
  def Iconv.list
    raise 'No Iconv.list'
  end
  def initialize(to, from)
  end
  def close
    ''
  end
  def iconv(str, strt = 0, len = -1)
    (len and !( len < 0 )) or len = str.size - strt
    str[strt, len]
  end

  module Failure
    def initialize(*args) # 3
    end
    def success
    end
    def failed
    end
    def inspect
    end
  end

  # class InvalidEncoding < ArgumentError; end
  # class IllegalSequence < ArgumentError; end
  # class InvalidCharacter < ArgumentError; end
  # class OutOfRange < RuntimeError; end

  def Iconv.charset_map
    raise 'No Iconv.charset_map'
  end
end
#-------------------------------------------------------------

>
> _why
>

daz

Bill Guindon

Jul 9, 2005, 9:17:27 AM

There's a page on the Rails site that covers the iconv installation on Windows:
http://wiki.rubyonrails.com/rails/show/iconv

Once I had the iconv.so in a library path, and iconv.dll in
windows\system32, I ran the test-all.rb. Got an error due to a lack
of /dev/null, but that was fixed by creating a dev directory, and
adding an empty 'null' file to it.

That should really be swapped out to point at a temp dir, but with that
setup, all of the htree tests passed.


--
Bill Guindon (aka aGorilla)