And the output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml version="1.0" encoding="utf-8"??>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"></html>
1) Why is the XHTML namespace repeated? (If more namespaces are listed, only the default, unprefixed xmlns is repeated.)
2) Why does the XML processing instruction gain an extra question mark at the end?
3) Why is a doctype inserted in front of the XML processing instruction? Both browsers I tested, Firefox and Chrome, refuse to parse the file when they encounter this order.
4) Can I select a different doctype, for example HTML5?
I also tried Nokogiri 1.6.1 on jruby-1.7.10. Its output:
<html xmlns="http://www.w3.org/1999/xhtml">
</html>
But I can't use Nokogiri on JRuby for other reasons, mainly because it produces malformed XHTML (any unknown <tag></tag> becomes <tag /></tag>, leaving an unbalanced closing tag):
puts Nokogiri::HTML(xhtml).to_xhtml
outputs
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org/1998/Math/MathML">
<head />
</head>
<body>
<svg:svg />
</svg:svg>
</body>
</html>
Anyway, for my current work the only real obstacle is #1 (the doubled xmlns attribute) on MRI; I was going to get rid of the doctype and processing instruction anyway by converting the document root instead. Still, can someone confirm I'm not doing something wrong? Nokogiri is so much nicer to work with than REXML, if only it didn't make my documents invalid.
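In case it helps anyone reproduce #1 without eyeballing the output, here is a minimal sketch in plain Ruby (no Nokogiri involved) that lists attribute names repeated in the first element start tag of a string. `duplicate_attributes` is a hypothetical helper of my own, not part of any library, and the regexes assume attribute values contain no `>` and nothing that looks like ` name=`, so treat it as a rough diagnostic rather than a validator.

```ruby
# Hypothetical helper (not a library API): returns the attribute names that
# occur more than once in the first element start tag of the given markup.
def duplicate_attributes(markup)
  # Skip <!DOCTYPE ...> and <?xml ...?> by requiring a letter right after "<".
  start_tag = markup[/<[A-Za-z][^>]*>/]
  return [] unless start_tag

  # Collect every attribute name that is followed by "=".
  names = start_tag.scan(/\s([\w:.-]+)\s*=/).flatten
  names.select { |name| names.count(name) > 1 }.uniq
end

output = '<html xmlns="http://www.w3.org/1999/xhtml" ' \
         'xmlns="http://www.w3.org/1999/xhtml"></html>'
p duplicate_attributes(output) # => ["xmlns"]
```

A prefixed declaration such as xmlns:m counts as a different name, so it is only reported if it is itself repeated.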
Thank you!
Goran