Thanks:)
SA
This has been answered once or twice before by this group ;-)
The best way tends to involve using a package although
you /could/ work your way through it using regular expressions.
If you're likely to be doing this kind of thing in the
future, you'll be glad you spent a bit of time installing one;
then it's always available.
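For comparison, the regular-expression route might look roughly like the sketch below. It assumes every href value is double-quoted and written in lower case, which real-world pages often violate, and rewrite_hrefs is just a made-up helper name.

```ruby
# Regex-only sketch: swap a string inside every double-quoted href value.
# rewrite_hrefs is a hypothetical helper, not part of any library.
def rewrite_hrefs(html, from, to)
  html.gsub(/href="([^"]*)"/) do
    # $1 is the old attribute value; only its first occurrence of 'from' changes
    %(href="#{$1.sub(from, to)}")
  end
end

sample = '<a href="http://xxxx.net/pages/list">List</a> - <a href="http://xxxx.net/">Create</a>'
puts rewrite_hrefs(sample, 'xxxx', 'mysite')
```

Anything outside an href attribute is left alone, but single-quoted or unquoted attributes slip through, which is why a parser is usually the safer choice.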
As I recall, a different package is often
recommended but I don't know which is best.
This is what some of us use:
http://ruby-htmltools.rubyforge.org/ (Ned Konz +)
Examples are included but here's another ...
#-----------------------------------------------------------------
EXAMPLE = <<EOX
<html lang="en">
<head>
<title>Page title</title>
</head>
<body>
<div id="Header">
<h1><a href="http://xxxx.net/"><em>q</em>URL.net</a></h1>
<p>For When You Want a Quick URL</p>
</div>
<hr>
<div id="Content">
<form action="http://xxxx.net/" method="post">
<fieldset>
<legend>Enter a <abbr title="Uniform Resource Locator">URL</abbr> to make into a xxxx:</legend>
<input id="InputURL" type="text" size="40" maxlength="65535" name="url" value="">
<input type="submit" name="action" value="Create xxxx">
</fieldset>
</form>
</div>
<hr>
<a href="http://xxxx.net/pages/contact">Contact</a> -
<a href="http://xxxx.net/downloads/">Downloads</a> -
<a href="http://xxxx.net/">Create</a> -
<a href="http://xxxx.net/pages/terms">Terms of Use</a> -
<a href="http://xxxx.net/pages/list">List</a> -
<a href="http://xxxx.net/pages/prefs">Preferences</a>
</body>
</html>
EOX
require 'html/tree' # http://ruby-htmltools.rubyforge.org/
verbose = true
exa = HTMLTree::Parser.new(verbose, true)
#exa.parse_file_named('xxxx_net.html')
exa.feed(EXAMPLE) # replaces '.parse_file_named'
item_a = exa.html.select {|ea| ea.tag == 'a'}
item_a.each {|ea| p [:ahref, ea['href']]}
puts '+'*100
exa.html.each do |ea|
p [ea.tag, ea['href']]
ea.each do |item|
if item.data?
p [:data, item.to_s]
elsif item.tag == 'a'
item['href'].sub!(/xxxx/, 'mysite')
end
end
puts '='*100
ea.dump
end
### exa.html.dump
#-----------------------------------------------------------------
Output from the script above is too long to post here,
so I've uploaded it to:
http://www.d10.karoo.net/ruby/example_html_parse.txt
Hope this is of some use,
daz
--
JARH (Nihon-style) http://qurl.net/h3
For invalid "tag soup" HTML, your best bet is probably to use
html/htmltokenizer.
<URL:http://rubyforge.org/projects/htmltokenizer/>
It'll search for specified 'tags', returning the text skipped over,
which you can put into a buffer. Then you can get the attributes of the
'tag', and modify them, and put the result in the buffer. Finally, you
can slurp in the rest of the pseudo-HTML.
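The buffer idea can be tried out with StringScanner from the standard library. To be clear, this is only a stand-in for the same approach and is NOT the htmltokenizer API; rewrite_links is a hypothetical name, and the 'xxxx' to 'mysite' swap is just borrowed from earlier in the thread.

```ruby
require 'strscan'

# Stand-in for the tokenizer loop using StringScanner (standard library).
# This is not the htmltokenizer API, only the same skip/buffer/modify idea.
def rewrite_links(html)
  scanner = StringScanner.new(html)
  buffer  = ''.dup
  # scan_until returns the text skipped over plus the matched tag itself
  while (chunk = scanner.scan_until(/<a\b[^>]*>/i))
    tag = scanner.matched
    buffer << chunk[0, chunk.size - tag.size]  # text before the tag
    buffer << tag.sub(/xxxx/, 'mysite')        # the modified tag
  end
  buffer << scanner.rest                       # slurp in the remainder
end

puts rewrite_links('<p>Hi <A HREF="http://xxxx.net/">x</a> and <b>unclosed')
```

Because nothing is ever checked for balance, unclosed or mis-nested tags pass straight through untouched.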
mathew
1. I tried the example for the htmltokenizer and got an error around
assert. Where/what is the assert method?
2. What do you mean by "slurp" in the rest of the text?
3. Any better examples how to use htmltokenizer?
Thanks:)
SA
An error around "assert" is likely an internal error of some kind.
Assertions are pieces of code placed in software to detect invalid
arguments to methods, internal data structure inconsistencies, and so on.
For example, consider the Ruby URI library. It doesn't support all kinds
of URI. So, it would be a good idea if it were to assert that the URI it
is being passed is one of the kinds it actually knows how to parse. That
way, someone innocently using the library with the wrong kind of URI
will discover the problem immediately, rather than being passed back bad
data, or having some bizarre error occur in the middle of the library code.
So it could be that you're passing an invalid argument to a method of
htmltokenizer. It's also possible that you're triggering a bug in the
library.
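To make the idea concrete, here is a minimal sketch of a hand-rolled assertion guarding a method argument. The names simple_assert and fetchable_uri are invented for illustration; neither is part of the URI library or of htmltokenizer.

```ruby
require 'uri'

# Minimal hand-rolled assertion. 'simple_assert' and 'fetchable_uri' are
# hypothetical names for illustration only.
def simple_assert(condition, message = 'assertion failed')
  raise ArgumentError, message unless condition
end

def fetchable_uri(text)
  uri = URI.parse(text)
  # Fail right away on a kind of URI this code does not know how to handle
  simple_assert(uri.is_a?(URI::HTTP), "unsupported URI: #{text}")
  uri
end

puts fetchable_uri('http://example.com/').host
begin
  fetchable_uri('mailto:someone@example.com')
rescue ArgumentError => e
  puts "caught: #{e.message}"
end
```

The caller with the wrong kind of URI finds out at the call site, instead of somewhere deep inside the library.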
> 2. What do you mean by "slurp" in the rest of the text?
"slurp" meaning "pull in the entire content of the file from the current
file pointer onwards, without performing any processing on it".
As in:
file = File.new("something.gif")
data = file.read # slurp!
<URL:http://www.retrologic.com/jargon/S/slurp.html>
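The "from the current file pointer onwards" part is easier to see with a small sketch. StringIO stands in for a real file here so the snippet is self-contained, and header_and_rest is a made-up helper name.

```ruby
require 'stringio'

# StringIO stands in for a real file so the sketch needs nothing on disk.
def header_and_rest(io)
  header = io.gets   # advances the file pointer past the first line
  rest   = io.read   # slurp: everything from the pointer to the end
  [header, rest]
end

header, rest = header_and_rest(StringIO.new("first\nsecond\nthird\n"))
p header
p rest
```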
> 3. Any better examples how to use htmltokenizer?
require 'html/htmltokenizer'
#[...]
# Parse all the images and links out of the web page
tokenizer = HTMLTokenizer.new(@body)
@images = Array.new
@links = Array.new
lastlink = ''
while tag = tokenizer.getTag('img', 'a')
if tag.tag_name == 'img'
url = tag.attr_hash['src']
uri = @uri.merge(url)
@images.push([uri.to_s, lastlink])
else
url = tag.attr_hash['href']
uri = @uri.merge(url)
@links.push(uri.to_s)
lastlink = uri.to_s
end
end
That's the only time I've used it, I'm afraid. Still, it might give you
some ideas.
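In case the @uri.merge calls look mysterious: they resolve a possibly relative link against the page's own address. A small sketch using only the standard URI library, with example.com addresses and a helper name (resolve) that are purely illustrative:

```ruby
require 'uri'

# 'resolve' is a hypothetical helper; merge is the standard-library call.
def resolve(base, link)
  URI.parse(base).merge(link).to_s
end

puts resolve('http://example.com/a/b.html', '../img/x.png')
puts resolve('http://example.com/a/b.html', 'c.html')
puts resolve('http://example.com/a/b.html', 'http://other.example/')
```

Absolute links come back unchanged, so the same call is safe for both kinds.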
mathew
Hi. I know I'm a bit late to the discussion, so 'sokay if you have an
answer already.
A really fantastic HTML parser library is HTree by Tanaka Akira.
<http://cvs.m17n.org/~akr/htree/>
It's completely forgiving of bad HTML and you can import the document
into REXML through the HTree parser.
require 'htree'
HTree.parse( "<b>Bad markup" ).to_rexml
The only downside is that you'll need to install the iconv library,
which can be a bit of a pain to track down on Windows. Other than that,
it's a great package.
_why
I'm glad you brought that in because I tried it last year and saw
that it was a serious "heavy horse" and perhaps a little bit
_more_ than I was looking for.
It was adding XHTML namespace prefixes to all tags, so a
horizontal rule, for example, became:
<{http://www.w3.org/1999/xhtml}hr>
>
> It's completely forgiving of bad HTML and you can import the
> document into REXML through the HTree parser.
>
> require 'htree'
> HTree.parse( "<b>Bad markup" ).to_rexml
>
It may have been in its early stages of development, but my
assumption that HTree would be too strict is under review :-)
Applying your example, I get the result I was expecting
without all that namespace stuff.
> The only downside is that you'll need to install the iconv library,
> which can be a bit of a pain to track down on Windows.
Instead of hunting around for that, I'd made a dummy Ruby version
(no functionality for those who don't need any):
In my old version there are two files which require 'iconv' -
("text.rb" and "encoder.rb") which I changed to:
begin
require 'iconv'
rescue LoadError
require 'htree/iconv_dummy'
end
Then, add this dummy file as:
lib\ruby\site_ruby\1.8\htree\iconv_dummy.rb
#-------------------------------------------------------------
class Iconv
## For testing : Not part of the HTree package ##
warn "Using dummy iconv lib: #{__FILE__}"
IC_DUMMY = true
def Iconv.open(to, from)
inst = Iconv.new(to, from)
block_given? ? yield(inst) : inst
end
def Iconv.iconv(to, from, *strs)
strs.join
end
def Iconv.conv(to, from, str)
str
end
def Iconv.list
raise 'No Iconv.list'
end
def initialize(to, from)
end
def close
''
end
def iconv(str, strt = 0, len = -1)
(len and !( len < 0 )) or len = str.size - strt
str[strt, len]
end
module Failure
def initialize(*args) # 3
end
def success
end
def failed
end
def inspect
end
end
# class InvalidEncoding < ArgumentError; end
# class IllegalSequence < ArgumentError; end
# class InvalidCharacter < ArgumentError; end
# class OutOfRange < RuntimeError; end
def Iconv.charset_map
raise 'No Iconv.charset_map'
end
end
#-------------------------------------------------------------
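The require-with-fallback trick used for 'iconv' above works for any optional library. A generic sketch, where PassThrough and pick_converter are invented names and the library name passed in is whatever you are trying to load:

```ruby
# Pass-through stand-in used when the real library cannot be loaded.
# PassThrough and pick_converter are hypothetical names.
module PassThrough
  def self.conv(_to, _from, str)
    str   # no conversion at all, just hand the string back
  end
end

# Returns the real Iconv class if the named library loads, else the stub.
# Assumes the library defines Iconv when it does load.
def pick_converter(lib_name)
  require lib_name
  Object.const_get(:Iconv)
rescue LoadError
  PassThrough
end

converter = pick_converter('a_library_that_is_not_installed')
puts converter.conv('utf-8', 'iso-8859-1', 'unchanged text')
```

Callers only ever see one object responding to conv, so nothing else in the program needs to know which one it got.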
daz
There's a page on the Rails site that covers the iconv installation on Windows:
http://wiki.rubyonrails.com/rails/show/iconv
Once I had the iconv.so in a library path, and iconv.dll in
windows\system32, I ran the test-all.rb. Got an error due to a lack
of /dev/null, but that was fixed by creating a dev directory, and
adding an empty 'null' file to it.
I should really swap that out to have it point to a temp dir, but with that
setup, all of the htree tests passed.
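For what it's worth, a test script could avoid the hard-coded /dev/null along these lines. This is only a sketch and assumes a Ruby new enough to have File::NULL (1.9.3 or later, as far as I know); discard and scratch_path are made-up helper names, not anything from the htree tests.

```ruby
require 'tmpdir'

# File::NULL is '/dev/null' on Unix-like systems and 'NUL' on Windows.
def discard(text)
  File.open(File::NULL, 'w') { |sink| sink.write(text) }
end

# Alternative: a scratch file under the system temp dir.
# The default file name is arbitrary.
def scratch_path(name = 'htree-test-null')
  File.join(Dir.tmpdir, name)
end

discard("output nobody needs\n")
puts scratch_path
```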
--
Bill Guindon (aka aGorilla)