[boost] An invalid XML character (Unicode: 0x8) problem because of property_tree::xml_parser::write_xml


Rohan Shetty

Mar 2, 2015, 12:21:59 AM
to bo...@lists.boost.org
Hi,
I have used the following C++ code to generate the XML:

    boost::property_tree::ptree ptResponse;
    // Populate the tree from the Microsoft Outlook contacts
    std::stringstream buf;
    const std::string enc("utf-8");
    boost::property_tree::xml_writer_settings<char> settings(' ', 0, enc);
    boost::property_tree::xml_parser::write_xml(buf, ptResponse, settings);

This works fine.
But on one customer's machine, reading this XML content in a Java program fails with the following error:
An invalid XML character (Unicode: 0x8) was found in the element content of the document.

Any help in solving this is appreciated.
Regards,
Rohan

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Mathias Gaunard

Mar 2, 2015, 11:40:54 AM
to bo...@lists.boost.org
On 02/03/2015 05:59, Rohan Shetty wrote:
> Hi,
> I have used the following C++ code to generate the XML:
>
>     boost::property_tree::ptree ptResponse;
>     // Populate the tree from the Microsoft Outlook contacts
>     std::stringstream buf;
>     const std::string enc("utf-8");
>     boost::property_tree::xml_writer_settings<char> settings(' ', 0, enc);
>     boost::property_tree::xml_parser::write_xml(buf, ptResponse, settings);
>
> This works fine.
> But on one customer's machine, reading this XML content in a Java program fails with the following error:
> An invalid XML character (Unicode: 0x8) was found in the element content of the document.
>
> Any help in solving this is appreciated.

I don't understand, the error message is quite explicit: your data isn't
utf-8 even though you said it was. What were you expecting to happen?

Also this would probably be more suited to the boost-users mailing list.

Rohan Shetty

Mar 3, 2015, 12:37:04 AM
to bo...@lists.boost.org
Hi Mathias,
Thanks for your response.
I was expecting write_xml (with "utf-8") to escape special characters (e.g. < replaced with &lt;) or to strip any invalid characters (i.e. anything other than #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]).
Is this part of write_xml()?
Do let me know if this is not clear.
Regards,
Rohan
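
For reference, the character range quoted above is the "Char" production of XML 1.0. A minimal sketch of it as a predicate on Unicode code points follows; is_xml10_char is a hypothetical helper name chosen for illustration, not a function provided by Boost.PropertyTree.

```cpp
// XML 1.0 "Char" production, as quoted in the message above:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// is_xml10_char is a hypothetical name, not part of Boost.
constexpr bool is_xml10_char(char32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

Under this rule the backspace character (0x8) from the Java error message is rejected, while tab, line feed and carriage return are accepted.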

Mathias Gaunard

Mar 3, 2015, 6:58:16 AM
to bo...@lists.boost.org
This mailing-list uses bottom- and inline-posting, please lay out your
responses accordingly.

On 03/03/2015 04:11, Rohan Shetty wrote:
> Hi Mathias,
> Thanks for your response.
> I was expecting write_xml (with "utf-8") to escape special characters
> (e.g. < replaced with &lt;) or to strip any invalid characters (i.e.
> anything other than #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> [#x10000-#x10FFFF]).
> Is this part of write_xml()?
> Do let me know if this is not clear.
> Regards,
> Rohan

It is not reasonable to expect that the write_xml function would
silently drop data by default.
If you want invalid data to be removed, you'll have to do this yourself
prior to calling the function.
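
One way to do that cleanup, as a minimal sketch rather than a definitive implementation: the helper below removes the control bytes that XML 1.0 forbids from a single string. strip_xml_control_chars is a hypothetical name (not part of Boost), and the sketch assumes the input is UTF-8 or plain ASCII. It covers only the range below 0x20, which is where the reported 0x8 falls.

```cpp
#include <string>

// Hypothetical helper (not part of Boost): returns a copy of `in` without
// the C0 control bytes that XML 1.0 forbids. Tab (0x09), line feed (0x0A)
// and carriage return (0x0D) are kept.
// Assumption: `in` is UTF-8 or ASCII, where bytes below 0x20 never occur
// inside a multi-byte sequence, so a byte-wise scan is safe.
std::string strip_xml_control_chars(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        const bool forbidden = c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
        if (!forbidden) {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

Each value would be passed through such a function before it is stored in the ptree, so that write_xml only ever sees data that an XML parser will accept. Whether dropping the character is acceptable, as opposed to rejecting the record, depends on the application.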

This signature of write_xml doesn't actually do anything encoding-wise,
it outputs your data as-is, and marks the data as being the encoding you
specified.

It might be more sensible to set up the encoding correctly though, or to
convert your data to the right encoding.
There is another overload of write_xml that can imbue a locale when
writing the data, which can be used for transparent transcoding.
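
As an illustration of converting data to the right encoding by hand, here is a minimal sketch that assumes the source strings are ISO-8859-1 (Latin-1). That assumption is made purely for the example, since the thread does not say what encoding the Outlook data uses, and latin1_to_utf8 is a hypothetical helper name rather than a Boost function.

```cpp
#include <string>

// Hypothetical helper (not part of Boost): converts ISO-8859-1 bytes to
// UTF-8. Every Latin-1 byte maps to the code point of the same value, so
// bytes below 0x80 are copied and the rest become two-byte sequences.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation byte
        }
    }
    return out;
}
```

Note that a conversion like this copies bytes below 0x80 unchanged, so a control character such as 0x8 would still be present afterwards and would need to be handled separately.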

Bjorn Reese

Mar 3, 2015, 7:18:10 AM
to bo...@lists.boost.org
On 03/03/2015 04:11 AM, Rohan Shetty wrote:

> I was expecting write_xml (with "utf-8") to escape special characters (e.g. < replaced with &lt;) or to strip any invalid characters (i.e. anything other than #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]).
> Is this part of write_xml()?

Please read the documentation:

"RapidXML does not fully support the XML standard; it is not capable
of parsing DTDs and therefore cannot do full entity substitution.

[...]

Please note that RapidXML does not understand the encoding
specification. If you pass it a character buffer, it assumes the data
is already correctly encoded; if you pass it a filename, it will read
the file using the character conversion of the locale you give it (or
the global locale if you give it none). This means that, in order to
parse a UTF-8-encoded XML file into a wptree, you have to supply an
alternate locale, either directly or by replacing the global one."

http://www.boost.org/doc/html/boost_propertytree/parsers.html

Rohan Shetty

Mar 4, 2015, 8:27:00 AM
to bo...@lists.boost.org
On 03/03/2015 5:28 PM, Mathias Gaunard wrote:
> This mailing-list uses bottom- and inline-posting, please lay out your
> responses accordingly.
>
> It is not reasonable to expect that the write_xml function would
> silently drop data by default.
> If you want invalid data to be removed, you'll have to do this yourself
> prior to calling the function.
>
> This signature of write_xml doesn't actually do anything encoding-wise,
> it outputs your data as-is, and marks the data as being the encoding you
> specified.
>
> It might be more sensible to set up the encoding correctly though, or to
> convert your data to the right encoding.
> There is another overload of write_xml that can imbue a locale when
> writing the data, which can be used for transparent transcoding.

Thanks Mathias.

Rohan Shetty

Mar 4, 2015, 8:27:13 AM
to bo...@lists.boost.org
On 03/03/2015 5:48 PM, Bjorn Reese wrote:
> Please read the documentation:
>
> "RapidXML does not fully support the XML standard; it is not capable
> of parsing DTDs and therefore cannot do full entity substitution.
>
> [...]
>
> Please note that RapidXML does not understand the encoding
> specification. If you pass it a character buffer, it assumes the data
> is already correctly encoded; if you pass it a filename, it will read
> the file using the character conversion of the locale you give it (or
> the global locale if you give it none). This means that, in order to
> parse a UTF-8-encoded XML file into a wptree, you have to supply an
> alternate locale, either directly or by replacing the global one."
>
> http://www.boost.org/doc/html/boost_propertytree/parsers.html

Thanks Bjorn.