Greetings!
On Tue, May 29, 2012 at 6:38 AM, Daniel Jabbour <d
...@inigral.com> wrote:
> Hi all-
> So I'm running an HTML fragment through nokogiri, and I'm getting some
> odd behavior. My fragment is a little non-standard, HTML wise, but I'm
> confused by what's going on, and wondering if you can help explain
> (and suppress) the behavior I'm seeing.
> Basically, I've got a fragment, which I'm running through
> Nokogiri::HTML.fragment(<fragment>). The fragment contains an a tag,
> around a table (ex <a ...><table>...</table></a>). However, when I run
> the fragment through nokogiri and to_s it, it rewrites the a tag so it
> no longer surrounds the table (ex <a...></a><table>...</table>).
> So basically:
> 1) Why is it doing this?
> 2) How can I stop this behavior?
Nokogiri attempts to "fix" broken HTML. This is a feature. It uses libxml2
(or, if you're on JRuby, Xerces instead) to do this; in order to support
many of the HTML-specific features, Nokogiri needs the parsed document to
be compliant HTML.
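Tables within an anchor are, as far as I know, illegal HTML: in HTML 4, <a>
is an inline element and <table> is block-level, so the parser closes the
anchor before the table starts. Here's a sample script: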
#! /usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
puts Nokogiri::HTML::Document.parse("<html><body><a href='http://foo.com/bar'></a></body></html>").to_html
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html><body><a href="http://foo.com/bar"></a></body></html>
puts Nokogiri::HTML::Document.parse("<html><body><a href='http://foo.com/bar'><table><tr><td>foo</td></tr></table></a></body></html>").to_html
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html><body>
# <a href="http://foo.com/bar"></a><table><tr><td>foo</td></tr></table>
# </body></html>