SimpleXML's unsuitability as an XML serializer

Geoffrey Sneddon

Sep 4, 2008, 7:22:17 PM
to habar...@googlegroups.com
Half the time, SimpleXML doesn't escape text nodes. This makes it
really quite unusable if we want separation between the tree and the
output. See <http://bugs.php.net/bug.php?id=44478>: the behaviour of
SimpleXML changed in a big way in PHP 5.2.6 as a "bug fix", completely
breaking backwards compatibility. Consider the script below:

<?php

// Create SimpleXMLElement
$foo = new SimpleXMLElement('<foo/>');

// Add an element using addChild() with text
$foo->addChild('bar', 'this&that');

// Add an element using a property
$foo->lol = 'this&that';

// Add a child using addChild() then set text as a property
$foo->addChild('what');
$foo->what = 'this&that';

// Add a child using a property with invalid UTF-8
$foo->magic = "\x80";

// Serialize it
echo $foo->asXML();

?>

On PHP 5.2.6 that outputs:

$ php fo.php
PHP Warning: SimpleXMLElement::addChild(): unterminated entity reference that in /Users/gsnedders/Desktop/fo.php on line 7

Warning: SimpleXMLElement::addChild(): unterminated entity reference that in /Users/gsnedders/Desktop/fo.php on line 7
PHP Warning: SimpleXMLElement::asXML(): string is not in UTF-8 in /Users/gsnedders/Desktop/fo.php on line 20

Warning: SimpleXMLElement::asXML(): string is not in UTF-8 in /Users/gsnedders/Desktop/fo.php on line 20
<?xml version="1.0"?>
<foo><bar>this</bar><lol>this&amp;that</lol><what>this&amp;that</what><magic></magic></foo>

On PHP 5.2.4:

$ php fo.php
Warning: SimpleXMLElement::addChild(): unterminated entity reference that in /home/caius/- on line 7

Warning: main(): unterminated entity reference that in /home/caius/- on line 14

Warning: SimpleXMLElement::asXML(): string is not in UTF-8 in /home/caius/- on line 20
<?xml version="1.0"?>
<foo><bar>this</bar><lol>this&amp;that</lol><what>this</what><magic></magic></foo>

This makes SimpleXML entirely unusable as a serializer: all escaping
of text must happen within the serializer, not in the code that builds
the tree. On PHP 5.2.6, the second parameter of
SimpleXMLElement::addChild() must have all ampersands escaped (but
nothing else, so htmlspecialchars() is wrong there); the behaviour of
'<' is what would be expected: it is escaped in the output as "&lt;".
If a text node contains any invalid UTF-8 characters, none of its text
is output.
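
As a rough illustration of the ampersand-only escaping described above, here is a minimal sketch. The helper name add_text_child() is hypothetical (it is not part of SimpleXML), and the sketch assumes addChild() keeps treating its second parameter as text in which only '&' needs pre-escaping.

```php
<?php

// Hypothetical helper, not part of SimpleXML: pre-escape ampersands only,
// because addChild() is assumed to handle '<' itself, as described above.
function add_text_child(SimpleXMLElement $parent, string $name, string $text)
{
    return $parent->addChild($name, str_replace('&', '&amp;', $text));
}

$foo = new SimpleXMLElement('<foo/>');
add_text_child($foo, 'bar', 'this&that');

// Serialize it; the text should now survive instead of being truncated
echo $foo->asXML();
```

This only papers over one code path: it does nothing for property assignment or for invalid UTF-8, which is part of why it is a workaround rather than a fix.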

We currently use SimpleXML in places and DOM in others for
serialization. The behaviour is wacky enough on its own, but the fact
that it underwent a backwards-incompatible change in 5.2.6 makes me
absolutely certain that we should not use it to build trees and
serialize them to XML. We should use DOM everywhere.
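
For comparison, a minimal sketch of building the same kind of tree with DOM. The helper name append_text_element() is hypothetical. It uses createTextNode() on purpose, since the value argument of DOMDocument::createElement() is not escaped either.

```php
<?php

// Hypothetical helper: append <$name> to $parent with $text as a text node,
// leaving all escaping to the serializer.
function append_text_element(DOMNode $parent, string $name, string $text): DOMElement
{
    $doc = $parent instanceof DOMDocument ? $parent : $parent->ownerDocument;
    $element = $doc->createElement($name);
    $element->appendChild($doc->createTextNode($text));
    $parent->appendChild($element);
    return $element;
}

$doc = new DOMDocument('1.0');
$foo = $doc->appendChild($doc->createElement('foo'));
append_text_element($foo, 'bar', 'this&that');

// Serialize it; the ampersand is escaped on output as &amp;
echo $doc->saveXML();
```

The tree holds the raw string 'this&that' and escaping happens only in saveXML(), which is the separation between tree and output argued for above.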


--
Geoffrey Sneddon
<http://gsnedders.com/>

Umbrae

Sep 4, 2008, 11:32:10 PM
to habari-dev
I've had much better professional experience with DOM as a whole
anyhow, despite the somewhat lacking PHP documentation. I'm all in
favor of this change.

Ali B.

Sep 5, 2008, 9:04:56 AM
to habar...@googlegroups.com
We have also verified that SimpleXML on Windows behaves differently from SimpleXML on Linux! On Linux, addChild filters the ampersands, while on Windows it doesn't.

I have created a branch to tackle the switch to DOM (http://svn.habariproject.org/habari/branches/090508-dom/). If you'd like to contribute to it, it would be greatly appreciated. I intend to start by converting the atomhandler.
--
Ali B / dmondark
http://www.awhitebox.com

Matt Read

Sep 5, 2008, 9:44:22 AM
to habar...@googlegroups.com
On Fri, Sep 5, 2008 at 9:04 AM, Ali B. <dmon...@gmail.com> wrote:
> We have also verified that SimpleXML on Windows behaves different than
> SimpleXML on linux! On linux the addChild filters the ampersands while on
> windows it doesn't.

Here on Linux it does not filter the ampersand, so I'm guessing it
has something to do with the SimpleXML/libxml versions?

> I have created a branch to tackle the switch to DOM
> (http://svn.habariproject.org/habari/branches/090508-dom/). If you'd like
> the contribute to it, It would be greatly appreciate it. I am intending to
> start with converting the atomhandler.

Cool, and good luck ;)


--
Matt Read
http://mattread.com

Geoffrey Sneddon

Sep 5, 2008, 8:19:58 PM
to habar...@googlegroups.com

On 5 Sep 2008, at 14:44, Matt Read wrote:

> On Fri, Sep 5, 2008 at 9:04 AM, Ali B. <dmon...@gmail.com> wrote:
>> We have also verified that SimpleXML on Windows behaves different
>> than
>> SimpleXML on linux! On linux the addChild filters the ampersands
>> while on
>> windows it doesn't.
>
> Here on Linux it does not filter the ampersand.. So i'm guessing it
> has something to with smiplexml/libxml versions?

libxml hasn't changed in a very long time. It has to do only with
PHP's inconsistency with itself.
