Strange output of `to_xhtml`


Goran Topic

Feb 14, 2014, 4:27:00 AM
to nokogi...@googlegroups.com
Hello!


I have to process some XHTML documents, and there are some puzzling things going on. My experience with Nokogiri is limited, so it might be my fault, and I'd appreciate advice.

Here is a minimal example that I tested with Nokogiri 1.6.1 on OS X with ruby-2.0.0-p353, as well as with Nokogiri 1.6.0 on Ubuntu with ruby-1.9.3-p327.

require 'nokogiri'

xhtml = <<XHTML
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"></html>
XHTML
puts Nokogiri::HTML(xhtml).to_xml

And the output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml version="1.0" encoding="utf-8"??>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"></html>

1) Why is the XHTML namespace declaration repeated? (If more namespaces are listed, only the default, unprefixed xmlns is repeated.)

2) Why does the xml processing instruction gain an extra question mark at the end?

3) Why does it insert a doctype in front of the xml processing instruction? Of the browsers I tested, both Firefox and Chrome refuse to parse the file when they encounter this order.

4) Can I select another doctype, for example HTML5?

I also tried Nokogiri 1.6.1 on jruby-1.7.10. Its output:

<html xmlns="http://www.w3.org/1999/xhtml">
</html>

But I can't use Nokogiri on JRuby for other reasons, mainly because it creates malformed XHTML: any unknown <tag></tag> becomes <tag /></tag>, leaving an unbalanced closing tag. For example:

xhtml = <<XHTML
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org/1998/Math/MathML">
<svg:svg></svg:svg>
</html>
XHTML

puts Nokogiri::HTML(xhtml).to_xhtml

outputs

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org/1998/Math/MathML">
<head />
</head>
<body>
  <svg:svg />
</svg:svg>


</body>
</html>

Anyway, for my current work the only real obstacle is #1 (the doubled xmlns attribute) on MRI; I was going to drop the doctype and the processing instruction anyway by serializing the document root instead of the whole document. Still, can someone confirm I'm not doing something wrong? Nokogiri is so much nicer to work with than REXML, if only it wouldn't make my documents invalid.

Thank you!


Goran
