Reading CDATA from XML docs


cdouglas

Oct 25, 2006, 12:34:26 PM
to phpsoa
Hello,
I just found SDO this morning and it's exactly what I was looking for
to manipulate XML documents. So far, everything I have tried has
worked except retrieving XML inside a CDATA section inside another
XML document; that just returns an empty string. I can get data
from all of the other elements.

In the example below, I am trying to get the MOREXML out into a string
so that I can then load it and manipulate it, write it back to the
original XML document, and then continue processing.

<test>
<entry1>
<data><![CDATA[<?xml version="1.0"
encoding="UTF-8"?><MOREXML>....</MOREXML>]]></data>
</entry1>
</test>

Thanks
Chris

simon...@googlemail.com

Oct 26, 2006, 4:16:28 AM
to phpsoa
Hi Chris.

I just tried this and I see the same effect, i.e. the resulting SDO
field is superficially empty. I'll take a look and see if I can work
out what's going on.

Regards

Simon

simon...@googlemail.com

Oct 26, 2006, 6:29:35 AM
to phpsoa
OK, so the reason that you don't see CDATA is that the support for CDATA
is missing. The callback that libxml2 (which SDO uses for XML parsing)
invokes to handle CDATA doesn't do anything:

void sdo_cdataBlock(void *ctx, const xmlChar *value, int len)
{
}

We base the PHP SDO implementation on the Tuscany C++ SDO implementation,
so I've just posted to the Tuscany list (it will appear at
http://www.mail-archive.com/tuscany-dev%40ws.apache.org/index.html at
some point) to work out why this is. If it's just missing by mistake
I'll propose a patch.

Regards

Simon

cdouglas

Oct 26, 2006, 8:27:57 AM
to phpsoa
Thank you very much. Almost all of the data I need to manipulate
is in CDATA sections, so if this can be resolved that would be great.
Otherwise I am going to have to learn Java and use XMLBeans like our
other developer does, and I would rather not.

gcha...@googlemail.com

Oct 30, 2006, 12:00:33 PM
to phpsoa
Just to hopefully add a little more weight to this requirement. I have
been looking at using SDO to parse Atom Syndication Format messages. I
have encountered a number of examples that use CDATA to escape embedded
HTML. I would therefore also like to see CDATA support added to SDO.

Simon Laws

Oct 31, 2006, 3:20:05 AM
to php...@googlegroups.com

simon...@googlemail.com

Nov 3, 2006, 9:47:05 AM
to phpsoa
An update. I posted a question to Tuscany about how to handle CDATA but
didn't get any response. I've investigated the problem a bit more.
There are a number of options for representing CDATA in SDO, for
example

1) Duplicate the CDATA string as is, including the "<![CDATA[" and
"]]>" markers, to the appropriate property in the data object hierarchy.
2) Duplicate the CDATA string excluding the "<![CDATA[" and "]]>"
markers and instigate a special flag to indicate that CDATA is present.

CDATA is the specific concern of XML, i.e. the character entities that
CDATA protects an XML parser from are of no concern to SDO because SDO
is not intended to be tied directly to XML. So given the example
options above we either expose the specifics of XML to the SDO core
(option 2) or to the SDO user (option 1).

Neither is particularly attractive.

1) appears to be the simplest approach to implement because it provides
a mechanism for the user to read and create CDATA without having to
provide much special support in SDO. 2) is more involved, particularly
because CDATA can appear mixed in with other text strings, so a sequence
may need to be used to represent properties that have a mixture of text
and CDATA, marking those sequence entries that are CDATA.

1) does require changes (at least in C++ SDO) because XML parsers tend
to be too helpful in this case for processing CDATA. XML parsers,
libxml2 in particular, recognize the "<![CDATA[" and "]]>" sequence as
a special indicator and throw it away returning just the text it
includes. We would have to reintroduce it and store it in the parameter
value in question. The C++ SDO implementation uses a lot of XML string
handling before the parameter value is actually stored which URL
encodes parts of the CDATA markers so this would have to be fixed. When
writing out the CDATA strings any string typed properties would have to
be scanned for the markers so that the appropriate libxml2 functions
can be called to get the CDATA sections in the right place.
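
To illustrate the write-out scan described above, here is a minimal C++
sketch that splits a string property into plain-text and CDATA pieces by
looking for the markers. It is not the Tuscany code: TextSegment and
splitMixedText are hypothetical names, and a real serializer would
presumably pass each piece to the matching libxml2 writer call instead of
collecting them in a vector.

```cpp
#include <string>
#include <vector>

// One piece of a string property: either ordinary text or the body of a
// CDATA section (stored without its markers).
struct TextSegment {
    std::string text;
    bool isCData;
};

// Splits a property value into plain-text and CDATA segments by scanning
// for the "<![CDATA[" and "]]>" markers.
std::vector<TextSegment> splitMixedText(const std::string& value) {
    const std::string start = "<![CDATA[";
    const std::string end = "]]>";
    std::vector<TextSegment> segments;
    std::size_t pos = 0;
    while (pos < value.size()) {
        std::size_t open = value.find(start, pos);
        if (open == std::string::npos) {
            // No further markers: the rest is ordinary text.
            segments.push_back({value.substr(pos), false});
            break;
        }
        if (open > pos) {
            segments.push_back({value.substr(pos, open - pos), false});
        }
        std::size_t bodyBegin = open + start.size();
        std::size_t close = value.find(end, bodyBegin);
        if (close == std::string::npos) {
            // Unterminated marker: keep the remainder as ordinary text.
            segments.push_back({value.substr(open), false});
            break;
        }
        segments.push_back({value.substr(bodyBegin, close - bodyBegin), true});
        pos = close + end.size();
    }
    return segments;
}
```

For the value xxx<![CDATA[some data]]> this yields the plain segment "xxx"
followed by a CDATA segment "some data". An unterminated start marker is
left as ordinary text rather than guessed at.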

I have a test implementation of 1) which needs some more work before I
could check it in, but let me know if this is going in the right
direction and I'll do the work. In particular, when a CDATA section
appears in an XML file, are you happy to have this appear verbatim in a
data object property? The result is that, as a user, you would have to
read the property, parse out the CDATA markers and perform whatever
processing you require. When putting the string back you would have to
make sure that the CDATA markers appear correctly if you want this to
write back to XML without error.
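
For a feel of what option 1) means on the user side, here is a minimal
sketch of the two helpers a caller might write: one to take the markers
off a property value that is a single CDATA section, and one to put them
back before writing. unwrapCData and wrapCData are hypothetical names, not
part of SDO, and the sketch is in C++ only to match the parser code quoted
in this thread; the same string handling would apply from PHP.

```cpp
#include <string>

// Marker strings defined by the XML specification.
const std::string kCDataStart = "<![CDATA[";
const std::string kCDataEnd = "]]>";

// If value is exactly one CDATA section, stores its body in inner and
// returns true. Returns false for anything else (plain text, mixed text,
// or several sections back to back).
bool unwrapCData(const std::string& value, std::string& inner) {
    const std::size_t overhead = kCDataStart.size() + kCDataEnd.size();
    if (value.size() < overhead) return false;
    if (value.compare(0, kCDataStart.size(), kCDataStart) != 0) return false;
    if (value.compare(value.size() - kCDataEnd.size(), kCDataEnd.size(),
                      kCDataEnd) != 0) return false;
    std::string body = value.substr(kCDataStart.size(), value.size() - overhead);
    // An end marker inside the body means this is more than one section.
    if (body.find(kCDataEnd) != std::string::npos) return false;
    inner = body;
    return true;
}

// Wraps text in CDATA markers. Any "]]>" inside the text is split across
// two adjacent sections, because a CDATA section cannot contain it.
std::string wrapCData(const std::string& text) {
    std::string out = kCDataStart;
    std::size_t pos = 0;
    while (true) {
        std::size_t hit = text.find(kCDataEnd, pos);
        if (hit == std::string::npos) {
            out += text.substr(pos);
            break;
        }
        out += text.substr(pos, hit - pos);
        out += "]]" + kCDataEnd + kCDataStart + ">";
        pos = hit + kCDataEnd.size();
    }
    out += kCDataEnd;
    return out;
}
```

A caller would unwrap the property, process the embedded XML, and pass the
result through wrapCData before assigning it back, so that the markers are
well formed when the document is written out.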

Thoughts?

Simon

cdouglas

Nov 3, 2006, 12:26:13 PM
to phpsoa
I am fine with #1, especially since it's less work for you guys and you
have it mostly done.

Simon Laws

Nov 3, 2006, 12:53:26 PM
to php...@googlegroups.com
On 11/3/06, cdouglas <ch...@douglas2000.com> wrote:
> I am fine with #1, especially since it's less work for you guys and you
> have it mostly done.

OK Chris, thanks, I'll push on and complete the changes.
Simon


cem

Nov 6, 2006, 10:26:04 AM
to phpsoa
simon...@googlemail.com wrote:
> An update. I posted a question to Tuscany about how to handle CDATA but
> didn't get any response. I've investigated the problem a bit more.
> There are a number of options for representing CDATA in SDO, for
> example
>
> 1) Duplicate the CDATA string as is, including the "<![CDATA[" and
> "]]>" markers, to the appropriate property in the data object hierarchy.
> 2) Duplicate the CDATA string excluding the "<![CDATA[" and "]]>"
> markers and instigate a special flag to indicate that CDATA is present.
>
> CDATA is the specific concern of XML, i.e. the character entities that
> CDATA protects an XML parser from are of no concern to SDO because SDO
> is not intended to be tied directly to XML. So given the example
> options above we either expose the specifics of XML to the SDO core
> (option 2) or to the SDO user (option 1).
>
> Neither is particularly attractive.
> ...

Simon,

It's surprising that neither the Java nor the C++ SDO specs mention
this issue, and I noticed that there is an issue open against the C++
SDO spec about it. I mostly agree with you about how it should be
handled: a CDATA section is simply text which has been marked as not to
be parsed, so depending on its position in the document, it should be
presented as either the value of an SDO Property or as a text element
in an SDO Sequence. However, we should go for your option (2). It would
be out of keeping for SDO to present the tags to the user; they must be
stripped out first.

You say you are working on a solution - presumably this is in the
Tuscany C++ library? I don't think the PHP SDO Core should need to
change at all. You need to ensure that the C++ code does not process
the data between the CDATA start and end tags.

As you mention, the C++ code may need to mark the property internally
so that it knows to write out the data as a CDATA section if necessary.

simon...@googlemail.com

Nov 7, 2006, 5:30:38 AM
to phpsoa
The issue we have with trying to mark a CDATA section somehow
internally, i.e. a solution where the <![CDATA[ and ]]> tags don't
appear in the SDO property or sequence text strings, is twofold.

Firstly, the really difficult case is when a CDATA section appears as
part of an element of simple type string. In this case I have nothing
to hang a flag off to indicate that only part of the string is not
parsed. This is less of a problem for mixed types, as the sequence can
be used to separate parseable text from CDATA and some suitable flag
instigated.

Secondly, this issue has been raised in the SDO spec group but no
discussion has taken place to date. On this basis I would favour the
simplest and least invasive solution, in terms of changes to SDO
infrastructure, pending any decision, which we can feed into of course,
on what the spec group is going to do. This has the benefit of being
better than throwing away the CDATA, as we do at the moment, being
quick to implement, and it allows us to experiment a little with CDATA
before committing to more radical SDO core changes.

Simon

simon...@googlemail.com

Nov 9, 2006, 8:47:47 AM
to phpsoa
OK, so I have finally checked in a fix for this. This took a little
while as I was talking with Caroline, trying to work out how best to
organize our CVS with respect to the pecl4win build. Anyhow, the result
is that I raised a bug to record this change
(http://pecl.php.net/bugs/bug.php?id=9287) and checked in the change to
the BRANCH_1_0_5 branch in our CVS project. The change affects all of
the C++ code under SDO/commonj/sdo, so you will need to check this code
out and recompile it. This code will not be compiled automatically at
the moment by pecl4win, as pecl4win picks the code from HEAD. This sounds
awkward, but we have done this on purpose so that the development changes
we make for each next release are not subject to the next automatic
Windows build. If you can't face doing a Windows build yourself let me
know, because I'm keen to get some testing done on these changes before
we agree on them for the next release.

The solution I have gone with for now is my solution #1 as previously
suggested, i.e. I make the CDATA strings available in SDO properties
marked with the XML CDATA markers. I have submitted a patch to the
Tuscany C++ SDO project, but time will tell whether they find this
acceptable.

Simon

cdouglas

Nov 10, 2006, 12:25:23 PM
to phpsoa
When I tried to compile it, it failed with the following error:

/home/cdouglas/phpsdo/pecl/sdo/commonj/sdo/SDOString.cpp:37: error:
explicit specialization of 'SDOString std::basic_string<char,
std::char_traits<char>, std::allocator<char> >::toLower(unsigned int,
unsigned int)' must be introduced by 'template <>'
/home/cdouglas/phpsdo/pecl/sdo/commonj/sdo/SDOString.cpp:37: error: no
member function 'toLower' declared in 'std::basic_string<char,
std::char_traits<char>, std::allocator<char> >'
/home/cdouglas/phpsdo/pecl/sdo/commonj/sdo/SDOString.cpp:37: error:
invalid function declaration
make: *** [commonj/sdo/SDOString.lo] Error 1

simon...@googlemail.com

Nov 20, 2006, 7:22:47 AM
to phpsoa
Chris, so sorry I missed this post. It's been sitting in my reader for
over a week and I didn't notice it in the sea of other stuff. You can
basically ignore the error you are getting from your build. If you
delete the file SDOString.cpp it should work. This is happening because
the Tuscany C++ team have typedef'd the SDOString class to a std::string
and I mistakenly left the old implementation class hanging around in
our CVS repository.

Sorry about that

Simon

cdouglas

Nov 20, 2006, 4:58:38 PM
to phpsoa
I tried it and now I get the error below. I'm sorry I am not more
helpful with this, I am not a C++ guy.
I deleted the entire archive and re-extracted it, deleted the file and
tried again with the same results.

make: *** No rule to make target
`/downloads/SDO-1.0.4/commonj/sdo/SDOString.cpp', needed by
`commonj/sdo/SDOString.lo'. Stop.

gcha...@googlemail.com

Nov 22, 2006, 4:43:31 AM
to phpsoa
Hi,

I've just been through the same process and encountered this on
Windows. The problem is SDOString.cpp is no longer required, but is
still referred to by the config.w32 and config.m4 files. Try removing
the line "commonj/sdo/SDOString.cpp \" from config.m4 and then running
through your build from the beginning again.

Hope this fixes it.

simon...@googlemail.com

Nov 22, 2006, 6:37:25 AM
to phpsoa
Sorry Chris. Graham is correct; there is a configuration file change
that's needed as well. I thought I had made the change, but for some
reason I only checked in the change I made for Windows and not for Linux.
There are files that control the build process and dictate what files
are included:

config.m4 - Linux
config.w32 - Windows

I will just retest my change to the Linux file and check it in. I'll
repost here when I've done it.

Regards

Simon

cdouglas

Nov 22, 2006, 9:18:10 AM
to phpsoa
I took out the line in config.m4 and it compiled and installed this
time. Thanks for the help. I'll let you know how well it works with my
CDATA.

cdouglas

Nov 22, 2006, 10:46:59 AM
to phpsoa
I tried it and I get no difference in operation. I am wondering if I
didn't get it from CVS correctly. The command I used was:
cvs -d :pserver:cvs...@cvs.php.net:/repository checkout -r
Root_BRANCH_1_0_5 pecl/sdo (also tried it without the Root_)

I then cd pecl/sdo and ran the following:
phpize
./configure
make
make install

The new sdo.so is being loaded, but the version shown when I do a
phpinfo still says 1.0.4.

gcha...@googlemail.com

Nov 22, 2006, 12:23:45 PM
to phpsoa
The 1.0.4 could be a red herring (mine also says that, but CDATA is
working for me). Could you try the following test, which is running
fine on my machine:

Save the following to a file called test.xsd

======================================================
<?xml version="1.0" encoding="utf-8" ?>
<xs:schema targetNamespace="testNS" xmlns="testNS"
xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="doc">
<xs:complexType>
<xs:sequence>
<xs:element name="v1" type="xs:string" />
<xs:element name="v2" type="xs:string" />
<xs:element name="v3" type="xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>

</xs:schema>
======================================================

Save the following to a file called test.in.xml

======================================================
<?xml version="1.0" encoding="utf-8"?>
<doc xmlns="testNS">

<v1><![CDATA[
<p>
A bit of escaped content
</p>
]]></v1>
<v2><![CDATA[<?xml version="1.0"


encoding="UTF-8"?><MOREXML>....</MOREXML>

]]></v2>
<v3>Value3</v3>

</doc>
======================================================

Create a php script with the following:

======================================================
<?php
$xmldas = SDO_DAS_XML::create('test.xsd');
$doc = $xmldas->loadFile('test.in.xml');
$xmldas->saveFile($doc, 'test.out.xml', 4);
?>
======================================================

When you run the script, you should see a file created called
test.out.xml which has the preserved CDATA sections. If the CDATA
sections are lost then it's probably your build and we can go from
there. If they are preserved, then it may be your XML schema, and we can
start discussing that.

I hope this helps.

simon...@googlemail.com

Nov 22, 2006, 12:35:50 PM
to phpsoa
Hi Chris, we can find out if you have the right code quite easily. If
you take a look at the file commonj/sdo/SAX2Parser.cpp you should see
the following method implementation:

void sdo_cdataBlock(void *ctx, const xmlChar *value, int len)
{
    if (!((SAX2Parser*)ctx)->parserError)
    {
        SDOXMLString valueAsString(value, 0, len);

        SDOXMLString cdata(PropertySetting::CDataStartMarker);
        cdata = cdata + valueAsString;
        cdata = cdata + PropertySetting::CDataEndMarker;

        ((SAX2Parser*)ctx)->characters(cdata);
    }
}

If this method exists but is empty then you don't have the right code.
The branch you need is BRANCH_1_0_5. The Root_BRANCH_1_0_5 tag marks the
point at which we branched to add this code (and some other stuff), so
Root_BRANCH_1_0_5 won't have the change.

Graham has been playing with this also and has spotted a situation
where it doesn't apparently work. Something to do with schema
correctness. He is going to post also.

If you find that you have the code and that whatever Graham says checks
out then you may have straight away found a corner case that I didn't
take account of.

Here is a very simple sample that I just ran:

XSD
===


<?xml version="1.0" encoding="UTF-8"?>

<schema xmlns="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.org/test"
xmlns:tns="http://www.example.org/test">

<complexType name="TestType">
<sequence>
<element name="entry" type="string"/>
</sequence>
</complexType>

<element name="test" type="tns:TestType"/>
</schema>

XML
===

<?xml version="1.0" encoding="UTF-8"?>

<tns:test xmlns:tns="http://www.example.org/test"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.example.org/test cdata1.xsd ">
<entry>xxx<![CDATA[some data and some <MoreXML></MoreXML>]]></entry>
</tns:test>

These are pretty similar to your original example so I'm hoping that we
can get you up and running. If you suspect that your examples are
different in some important way that we might not have considered, can
you distil out the difference and we'll make an example and try it here.


Regards

Simon

simon...@googlemail.com

Nov 22, 2006, 12:56:33 PM
to phpsoa
The version numbers are a red herring. The 1.0.5 branch is us preparing
for the next release and we haven't fixed up the version numbering yet.

S

cem

Nov 27, 2006, 5:54:01 AM
to phpsoa
There's a new 1.1.0 release available now with Simon's fixes
incorporated. A pecl upgrade should get it for you.

cdouglas

Nov 27, 2006, 2:58:58 PM
to phpsoa
I got the 1.1.0 release and tested it and it is working with my
documents. Thank you very much. If I come across any issues, I'll
post back.