Nokogiri messing with my a tag


Daniel Jabbour

May 29, 2012, 6:38:39 AM
to nokogiri-talk
Hi all-

So I'm running an HTML fragment through nokogiri, and I'm getting some
odd behavior. My fragment is a little non-standard, HTML wise, but I'm
confused by what's going on, and wondering if you can help explain
(and suppress) the behavior I'm seeing.

Basically, I've got a fragment, which I'm running through
Nokogiri::HTML.fragment(<fragment>). The fragment contains an a tag
around a table (ex <a ...><table>...</table></a>). However, when I run
the fragment through nokogiri and to_s it, it rewrites the a tag so it
no longer surrounds the table (ex <a...></a><table>...</table>).

So basically:

1) Why is it doing this?
2) How can I stop this behavior?

Any insight would be appreciated!

Thanks kindly,
Daniel

Mike Dalessio

May 29, 2012, 7:27:30 AM
to nokogi...@googlegroups.com
Greetings!

Nokogiri attempts to "fix" broken HTML. This is a feature. It uses libxml2 (or, if you're on JRuby, Xerces instead) to do this; in order to support many of the HTML-specific features, Nokogiri needs the parsed document to be compliant HTML.

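Tables within an anchor are, as far as I know, invalid HTML: in HTML 4, <a> is an inline element and cannot contain a block-level element like <table>. So the parser closes the anchor before the table starts. Here's a sample script: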

#! /usr/bin/env ruby

require 'rubygems'
require 'nokogiri'

puts Nokogiri::HTML::Document.parse("<html><body><a href='http://foo.com/bar'></a></body></html>").to_html
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html><body><a href="http://foo.com/bar"></a></body></html>

puts Nokogiri::HTML::Document.parse("<html><body><a href='http://foo.com/bar'><table><tr><td>foo</td></tr></table></a></body></html>").to_html
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html><body>
# <a href="http://foo.com/bar"></a><table><tr><td>foo</td></tr></table>
# </body></html>

