to_xml() adds linefeeds between tags, how to stop it?


Jani Patokallio

Dec 6, 2010, 10:08:46 PM
to nokogiri-talk
Greetings,

So I've got an XML document where whitespace is meaningful, but
roundtripping it through Nokogiri seems to add in linefeeds (\n)
between tag levels, even when indenting is turned off. Is there a way
to switch this off?

(rdb:1) doc = Nokogiri::XML('<p><b>Foo</b></p>')
#<Nokogiri::XML::Document:0x3bccee4 name="document"
children=[#<Nokogiri::XML::Element:0x3bccca0 name="p"
children=[#<Nokogiri::XML::Element:0x3bcca20 name="b"
children=[#<Nokogiri::XML::Text:0x3bcc7a0 "Foo">]>]>]>
(rdb:1) doc.to_xml
"<?xml version=\"1.0\"?>\n<p>\n  <b>Foo</b>\n</p>\n"
(rdb:1) doc.to_xml(:indent => 0)
"<?xml version=\"1.0\"?>\n<p>\n<b>Foo</b>\n</p>\n"

Of course I can always gsub("\n", "") the output, but that seems
unnecessarily ugly.

Cheers,
-jani

Mike Dalessio

Dec 6, 2010, 11:39:38 PM
to nokogi...@googlegroups.com
On Mon, Dec 6, 2010 at 10:08 PM, Jani Patokallio <jpat...@iki.fi> wrote:
> Greetings,
>
> So I've got an XML document where whitespace is meaningful

How so? Can you provide an explanation, and some details? How about a real example?

> , but
> roundtripping it through Nokogiri seems to add in linefeeds (\n)
> between tag levels, even when indenting is turned off.  Is there a way
> to switch this off?
>
> (rdb:1) doc = Nokogiri::XML('<p><b>Foo</b></p>')
> #<Nokogiri::XML::Document:0x3bccee4 name="document"
> children=[#<Nokogiri::XML::Element:0x3bccca0 name="p"
> children=[#<Nokogiri::XML::Element:0x3bcca20 name="b"
> children=[#<Nokogiri::XML::Text:0x3bcc7a0 "Foo">]>]>]>
> (rdb:1) doc.to_xml
> "<?xml version=\"1.0\"?>\n<p>\n  <b>Foo</b>\n</p>\n"
> (rdb:1) doc.to_xml(:indent => 0)
> "<?xml version=\"1.0\"?>\n<p>\n<b>Foo</b>\n</p>\n"

Well, if you provide a document with some whitespace between tags, like this:

Nokogiri::XML('<p>  <b>Foo</b></p>').to_xml
# <?xml version="1.0"?>
# <p>  <b>Foo</b></p>

you see that Nokogiri (actually, libxml2 v2.7.7) respects the inter-tag whitespace and doesn't insert newlines.

I think a real example might help me understand your issue better, and will result in a more useful and targeted answer.

---
mike dalessio / @flavorjones

Jani Patokallio

Dec 7, 2010, 9:36:23 PM
to nokogiri-talk
Long story short, I'm manipulating OpenOffice ODT, which in its
autogenerated form does not contain any inter-tag whitespace.
Example:

<office:text xmlns:office="foo" xmlns:text="bar"><text:p>Plain</text:p><text:p><text:span text:style-name="T2">Bold</text:span></text:p></office:text>

If I load that in Nokogiri and save with to_xml, I get this:

irb(main):001:0> @doc.to_xml
"<?xml version=\"1.0\"?>\n<office:text xmlns:office=\"foo\" xmlns:text=
\"bar\">\n  <text:p>Plain</text:p>\n  <text:p>\n    <text:span
text:style-name=\"T2\">Bold</text:span>\n  </text:p>\n</office:text>
\n"=> "<?xml version=\"1.0\"?>\n<text>\n

Or in human-readable form with namespaces put back in:

<office:text>
  <text:p>Plain</text:p>
  <text:p>
    <text:span text:style-name="T2">Bold</text:span>
  </text:p>
</office:text>

And now, if I load that in OO, it inserts a spurious single space
after "Bold" that wasn't there before. If you resave the file, it
changes into this:

<office:text ...><text:p>Plain</text:p><text:p><text:span text:style-name="T2">Bold</text:span> </text:p></office:text>

In other words, other whitespace is stripped out, but the whitespace
after the bold span is apparently considered meaningful and kept,
albeit collapsed to a single space. Whether OO *should* interpret the
input this way is certainly debatable, but unfortunately that's the
way it is now.

Cheers,
-jani


Mike Dalessio

Dec 8, 2010, 9:29:16 AM
to nokogi...@googlegroups.com
On Tue, Dec 7, 2010 at 9:36 PM, Jani Patokallio <jpat...@iki.fi> wrote:
> Long story short, I'm manipulating OpenOffice ODT, which in its
> autogenerated form does not contain any inter-tag whitespace.
> Example:
>
> <office:text xmlns:office="foo" xmlns:text="bar"><text:p>Plain</text:p><text:p><text:span text:style-name="T2">Bold</text:span></text:p></office:text>

Ah, interesting. Most people who ask about whitespace preservation are doing something hacky in their tests.

> If I load that in Nokogiri and save with to_xml, I get this:
>
> irb(main):001:0> @doc.to_xml
> "<?xml version=\"1.0\"?>\n<office:text xmlns:office=\"foo\" xmlns:text=
> \"bar\">\n  <text:p>Plain</text:p>\n  <text:p>\n    <text:span
> text:style-name=\"T2\">Bold</text:span>\n  </text:p>\n</office:text>
> \n"=> "<?xml version=\"1.0\"?>\n<text>\n
>
> Or in human-readable form with namespaces put back in:
>
> <office:text>
>   <text:p>Plain</text:p>
>   <text:p>
>     <text:span text:style-name="T2">Bold</text:span>
>   </text:p>
> </office:text>
>
> And now, if I load that in OO, it inserts a spurious single space
> after "Bold" that wasn't there before.  If you resave the file, it
> changes into this:
>
> <office:text ...><text:p>Plain</text:p><text:p><text:span text:style-name="T2">Bold</text:span> </text:p></office:text>
>
> In other words, other whitespace is stripped out, but the whitespace
> after the bold span is apparently considered meaningful and kept,
> albeit collapsed to a single space.  Whether OO *should* interpret the
> input this way is certainly debatable, but unfortunately that's the
> way it is now.

OK, if you check out the documentation for Node#serialize, you'll see that you have the option of passing in some "save options":


By default, what Node#to_s, #to_xml, #to_html, etc. all do is to use the save option FORMAT, which prints nicely. Usually, that's a win.

If you want to turn it off, do this:

  doc.serialize :save_with => 0

More semantically, you should be able to also do:

  doc.serialize :save_with => Nokogiri::XML::Node::SaveOptions.new

but I just found a bug with that syntax while I was writing this response, so don't use it. :-\

Thanks for using Nokogiri!
-m

Jani Patokallio

Dec 8, 2010, 10:07:08 PM
to nokogiri-talk
On Dec 9, 1:29 am, Mike Dalessio <mike.dales...@gmail.com> wrote:
> By default, what Node#to_s, #to_xml, #to_html, etc. all do is to use the
> save option FORMAT, which prints nicely. Usually, that's a win.
>
> If you want to turn it off, do this:
>
>   doc.serialize :save_with => 0

Thanks, that looks *almost* perfect. Unfortunately my actual code is
running #to_xml on a NodeSet, not a Node, so I have to add in a map
loop:

nodeset.map {|e| e.serialize(:save_with => 0)}

Works fine, but to the casual reader that doesn't exactly yell out
"output as XML without prettyprinting". I don't suppose it would be
possible to add in, e.g., a ":format => false" option for #to_xml & co?

Cheers,
-jani

Mike Dalessio

Dec 8, 2010, 11:26:16 PM
to nokogi...@googlegroups.com
Well, in two years, you're the first person I've talked to who had a real use case for ignoring whitespace (read: not trying to make unsemantic tests pass). :)

Though perhaps we could give this idiom some attention in the nokogiri.org tutorials ... see https://github.com/flavorjones/nokogiri.org-tutorials/issues/issue/12