
parsing xml xmp rdf


Jeff

Mar 21, 2009, 4:29:30 PM

I have an XML file I need to parse.

For example, I'd like to pull out the keywords in this bit:

<rdf:li>keyword 1</rdf:li>
<rdf:li>keyword 2</rdf:li>
<rdf:li>keyword 3</rdf:li>

from the sample XMP below.

I found this: http://us2.php.net/xml
Is that about it for parsing XML in PHP? I'm not familiar enough with
XML to figure it out from the skimpy examples.

How do I go about parsing this? Slicing it out with a regex looks
easier than that! Is there something else?

Jeff

<?xpacket begin="?" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.2-c063
53.352624, 2008/07/30-18:12:18 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
...

<rdf:Description rdf:about=""
xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/">
<photoshop:ColorMode>3</photoshop:ColorMode>
<photoshop:ICCProfile>sRGB IEC61966-2.1</photoshop:ICCProfile>
<photoshop:Category>cat</photoshop:Category>
<photoshop:AuthorsPosition>Jeff Title</photoshop:AuthorsPosition>
<photoshop:SupplementalCategories>
<rdf:Bag>
<rdf:li>category 1</rdf:li>
<rdf:li>category 2</rdf:li>
</rdf:Bag>
</photoshop:SupplementalCategories>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">my document title</rdf:li>
</rdf:Alt>
</dc:title>
<dc:creator>
<rdf:Seq>
<rdf:li>Jeff</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default">test image</rdf:li>
</rdf:Alt>
</dc:description>
<dc:subject>
<rdf:Bag>
<rdf:li>keyword 1</rdf:li>
<rdf:li>keyword 2</rdf:li>
<rdf:li>keyword 3</rdf:li>
</rdf:Bag>
</dc:subject>
<dc:rights>
<rdf:Alt>
<rdf:li xml:lang="x-default">my copyright</rdf:li>
</rdf:Alt>
</dc:rights>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>

Troy Bourdon

Mar 22, 2009, 1:49:38 AM

I've never written any PHP code but I did take a quick look at the link
you supplied and it looks like standard stuff. There are a few general
ways to parse XML. One way that's been gaining popularity is XQuery. You
essentially issue a query to the document and a node/nodes are returned
which match your query. Another way is to register callback functions for
a node, start parsing and when the node is encountered by the parser your
registered callback is invoked. This looks to be the model used by the
library you referenced. Either way is certainly easier than rolling your
own parser, no matter how handy you are with regex.
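
A minimal sketch of the callback model described above, assuming PHP's bundled XML Parser functions (the ones documented at the page Jeff linked). The helper name `collect_keywords` and the trimmed sample are made up for illustration. Note that this parser is not namespace-aware here: it simply reports the element name as written, prefix included, so the match on the literal strings 'dc:subject' and 'rdf:li' only works while the file keeps using those prefixes.

```php
<?php
// Trimmed copy of the XMP sample from the first post.
$xmp = <<<'XML'
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:creator><rdf:Seq><rdf:li>Jeff</rdf:li></rdf:Seq></dc:creator>
<dc:subject>
<rdf:Bag>
<rdf:li>keyword 1</rdf:li>
<rdf:li>keyword 2</rdf:li>
<rdf:li>keyword 3</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
XML;

// Hypothetical helper: returns the text of each rdf:li found inside dc:subject.
function collect_keywords(string $xml): array
{
    $keywords  = [];
    $inSubject = false; // true while the parser is inside <dc:subject>
    $inItem    = false; // true while inside an <rdf:li> under dc:subject
    $text      = '';    // character data collected for the current item

    $parser = xml_parser_create();
    // Element names are upper-cased by default; turn that off to see them as written.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // Called for every opening tag.
        function ($parser, $name, $attrs) use (&$inSubject, &$inItem, &$text) {
            if ($name === 'dc:subject') {
                $inSubject = true;
            } elseif ($inSubject && $name === 'rdf:li') {
                $inItem = true;
                $text = '';
            }
        },
        // Called for every closing tag.
        function ($parser, $name) use (&$keywords, &$inSubject, &$inItem, &$text) {
            if ($inItem && $name === 'rdf:li') {
                $keywords[] = trim($text);
                $inItem = false;
            } elseif ($name === 'dc:subject') {
                $inSubject = false;
            }
        }
    );

    // Text may arrive in several chunks, so append rather than assign.
    xml_set_character_data_handler(
        $parser,
        function ($parser, $data) use (&$inItem, &$text) {
            if ($inItem) {
                $text .= $data;
            }
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }

    return $keywords;
}

echo implode(', ', collect_keywords($xmp)), "\n"; // prints: keyword 1, keyword 2, keyword 3
```

The state flags are the price of the callback style: the parser only tells you what it is passing over, so the code has to remember where it is. The "Jeff" item under dc:creator is skipped because it is not inside dc:subject.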

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

Jerry Stuckle

Mar 22, 2009, 7:48:50 AM

Jeff,

You definitely don't want to try to do this with regexes. You will find
it much more complicated than using the PHP tools to do it.

If you check the left frame on the page you referenced, you will see
several ways of parsing XML are listed, amongst them SimpleXML and DOM
XML. I prefer the latter.

Parsing XML is not easy - but the tools in PHP make it quite simple.
And they aren't very hard to learn, especially if you have an OO background.


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstu...@attglobal.net
==================

Jeff

Mar 22, 2009, 1:05:29 PM

The DOM XML seems like it has more tools and makes more sense to me.

I'm really confused by the data though and can't seem to google up
any answers:

I don't understand the colons.

<dc:subject>
^^^^^^^^^^^^^^^^^
Is the tagname "dc" or "dc:subject"?

<rdf:Bag>
^^^^^^^^^^^^^
What is that, an attribute or just a first child?

<rdf:li>keyword 1</rdf:li>
<rdf:li>keyword 2</rdf:li>

So, if I wanted to: $node->get_content of those keywords, what is the node?


> Parsing XML is not easy

This XML is not easy for me to figure out. Not that I parse XML often.

Jeff

amygdala

Mar 22, 2009, 6:20:19 PM

Jeff wrote:

> Jerry Stuckle wrote:
>> Jeff wrote:


> I don't understand the colons.
>
> <dc:subject>
> ^^^^^^^^^^^^^^^^^
> Is the tagname "dc" or "dc:subject"
>
> <rdf:Bag>
> ^^^^^^^^^^^^^
> What is that, an attribute or just a first child?

Jeff,

The part to the left of the colon is a namespace prefix. If you are
familiar with namespaces in programming languages you can
probably see the analogy right away. If not, you may want to read up on
namespaces in XML (or namespaces in general, for that matter). It
basically comes down to being able to define names in their own
namespace, so that they don't conflict with names from elsewhere that
happen to be the same.

jeff:li (element li in namespace jeff)

would be different than

amygdala:li (element li in namespace amygdala)

To my knowledge SimpleXML doesn't have the tools to do fancy stuff with
namespaces, if at all. I'm not even sure whether SimpleXML simply
ignores the namespaces in case it is not able to handle them, such that
you are still able to access the elements.

So my best bet in this case would be to use DOM XML too.
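
To make the jeff:li versus amygdala:li point concrete, here is a small sketch using the PHP 5 DOM extension. The two namespace URIs are invented for the example. It also shows what the DOM reports for the "is the tagname dc or dc:subject" question: the full name, the prefix, the local name, and the namespace URI are all available separately.

```php
<?php
// Two elements with the same local name "li" in two different namespaces.
// Both namespace URIs are hypothetical, made up for this illustration.
$xml = <<<'XML'
<root xmlns:jeff="http://example.com/ns/jeff"
      xmlns:amygdala="http://example.com/ns/amygdala">
  <jeff:li>from jeff</jeff:li>
  <amygdala:li>from amygdala</amygdala:li>
</root>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Show how the DOM splits each element name into its parts.
foreach ($doc->documentElement->childNodes as $node) {
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        continue; // skip the whitespace text nodes between elements
    }
    printf(
        "nodeName=%s prefix=%s localName=%s namespaceURI=%s\n",
        $node->nodeName,
        $node->prefix,
        $node->localName,
        $node->namespaceURI
    );
}

// Asking for "li" in one namespace does not return the "li" from the other.
$jeffItems = $doc->getElementsByTagNameNS('http://example.com/ns/jeff', 'li');
$jeffCount = $jeffItems->length;
$jeffText  = $jeffItems->item(0)->textContent;
```

Here `$jeffCount` is 1 and `$jeffText` is "from jeff": the lookup matches on the namespace URI plus the local name, so the amygdala:li element is left out even though its local name is also "li".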

--
Amygdala
http://amygdala.110mb.com/

Jeff

Mar 23, 2009, 3:06:19 AM

Thanks, I've had a chance to look through it.

I'll give it a spin...

Jeff

Jeff

Mar 23, 2009, 3:12:51 AM


Thanks, I had no idea what to look for!

It looks like DOM XML has to hack its way through namespaces. I
suppose I should just hack through some examples and give up on
analyzing! All the samples I see ignore namespace except for this user bit:

$dom = domxml_open_mem($xmlval);
$ctx=xpath_new_context($dom);
$ctx->xpath_register_ns("yns","http://your.name.space/uri");
$nodes = $dom->get_elements_by_tagname("yns:tagname",$ctx);

Which adds more questions than answers as almost none of those methods
are documented!

I thought this would be easy.

Jeff



Jeroen

Mar 23, 2009, 6:06:20 AM

Jeff,

Be careful, it may seem confusing, but DOM is not DOM XML. For PHP5,
you should use the DOM functions, as documented here:
http://nl3.php.net/manual/en/book.dom.php.

If you want to grab all XML elements from within a certain
namespace, you could use the getElementsByTagNameNS method, specifying
the target namespace URI. Actually, XML parsing is much easier than
you would think. You only need to have a basic knowledge, not so much
of the PHP DOM extension, but rather of XML itself.
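
A short sketch of that suggestion applied to the keywords from the first post, assuming the PHP 5 DOM extension. The sample is trimmed from the posted XMP, and the variable names are only illustrative. The namespace URIs are copied from the xmlns:dc and xmlns:rdf declarations in the sample.

```php
<?php
// Trimmed copy of the XMP sample from the first post.
$xmp = <<<'XML'
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:creator><rdf:Seq><rdf:li>Jeff</rdf:li></rdf:Seq></dc:creator>
<dc:subject>
<rdf:Bag>
<rdf:li>keyword 1</rdf:li>
<rdf:li>keyword 2</rdf:li>
<rdf:li>keyword 3</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
XML;

// Namespace URIs exactly as declared in the document.
$dcNs  = 'http://purl.org/dc/elements/1.1/';
$rdfNs = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';

$doc = new DOMDocument();
$doc->loadXML($xmp);

$keywords = [];
// First find <dc:subject>, then the <rdf:li> items beneath it.
foreach ($doc->getElementsByTagNameNS($dcNs, 'subject') as $subject) {
    foreach ($subject->getElementsByTagNameNS($rdfNs, 'li') as $item) {
        $keywords[] = trim($item->textContent);
    }
}

echo implode(', ', $keywords), "\n"; // prints: keyword 1, keyword 2, keyword 3
```

Searching for rdf:li from the document root would also pick up the "Jeff" item under dc:creator, which is why the lookup starts from the dc:subject element.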

Jeroen Aarts
http://www.clickworks.be

Jeff

Mar 23, 2009, 10:23:16 AM

OK, it looks to me that this is roughly a PHP implementation of
JavaScript methods.

>
> If you want to grab all XML elements from within a certain names
> space, you could use the getElementsByTagNameNS functions,

I couldn't find anything on that in the php manual although I found it
here:
http://www.phpbuilder.com/manual/en/function.dom-domdocument-getelementsbytagnamens.php

It seems to refer to the pre : part

<photoshop:SupplementalCategories>
 ^^^^^^^^^
  <rdf:Bag>
   ^^^
    <rdf:li>category 1</rdf:li>
     ^^^
    <rdf:li>category 2</rdf:li>
  </rdf:Bag>
</photoshop:SupplementalCategories>

as a prefix.

So it looks to me like I'll need to use DOMDocument to
get the "photoshop:SupplementalCategories" element

and then use DOMElement to get the list.

As near as I can tell the namespace refers to this:

<?xpacket begin="?" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.2-c063
53.352624, 2008/07/30-18:12:18 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

But I'm not sure why I should care about that.

I think we have more than one definition of namespace.

> specifying
> the target namespace URI. Actually, XML parsing is much easier than
> you would think.

I actually started off thinking this would be easy and have moved
incrementally to more and more difficult! At the moment I think it's
ridiculously hard.

At any rate this is Adobe stuff, and I've always found Adobe software
to be needlessly complicated and obtuse, and that's just
from trying to use it!

Jeff

Jeroen

Mar 24, 2009, 5:58:41 AM

Wrong, the W3C issued the DOM specification, which has implementations
in almost all computer languages, including JScript, Javascript,
PHP, ...

As all these implementations refer to the same specification, method
names and object property names are likely to be the same, and
the functionality is then, of course, equivalent.

> > If you want to grab all XML elements from within a certain names
> > space, you could use the getElementsByTagNameNS functions,
>
> I couldn't find anything on that in the php manual although I found it
> here: http://www.phpbuilder.com/manual/en/function.dom-domdocument-geteleme...

What about http://be2.php.net/manual/en/domdocument.getelementsbytagnamens.php ?

>
>    It seems to refer to the pre : part
>
>           <photoshop:SupplementalCategories>
>             ^^^^^^^^
>              <rdf:Bag>
>               ^^^
>                 <rdf:li>category 1</rdf:li>
>                  ^^^
>                 <rdf:li>category 2</rdf:li>
>              </rdf:Bag>
>           </photoshop:SupplementalCategories>
>
> as a prefix.
>
>    So what it looks to me is that I'll need to use to use DOMDocument to
>   get the "photoshop:SupplementalCategories" element
>
> and then using DOMElement to get the list.
>
>    As near as I can tell the namespace refers to this:
>
> <?xpacket begin="?" id="W5M0MpCehiHzreSzNTczkc9d"?>
> <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.2-c063
> 53.352624, 2008/07/30-18:12:18        ">
>     <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
>                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> But I'm not sure why I should care about that.
>
> I think we have more than one definition of namespace.

The prefixes, like 'rdf:', refer to a namespace declared as
xmlns:rdf="some-namespace" somewhere *above* (on the element itself or
on one of its ancestors). You get the elements in that namespace
(hence, the ones carrying the rdf: prefix) by calling
getElementsByTagNameNS('some-namespace', 'element-name-without-prefix').
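
The same prefix-to-URI idea can be sketched with DOMXPath, which is part of the PHP 5 DOM extension. The prefixes registered below (r, d, ps) are deliberately different from the ones in the file, to show that only the namespace URI has to match. The sample is trimmed from the posted XMP and the variable names are illustrative.

```php
<?php
// Trimmed copy of the XMP sample from the first post.
$xmp = <<<'XML'
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/">
<photoshop:SupplementalCategories>
<rdf:Bag>
<rdf:li>category 1</rdf:li>
<rdf:li>category 2</rdf:li>
</rdf:Bag>
</photoshop:SupplementalCategories>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:subject>
<rdf:Bag>
<rdf:li>keyword 1</rdf:li>
<rdf:li>keyword 2</rdf:li>
<rdf:li>keyword 3</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
XML;

$doc = new DOMDocument();
$doc->loadXML($xmp);

$xpath = new DOMXPath($doc);
// The prefixes used in queries are local aliases; the URIs must match the document.
$xpath->registerNamespace('r', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');
$xpath->registerNamespace('d', 'http://purl.org/dc/elements/1.1/');
$xpath->registerNamespace('ps', 'http://ns.adobe.com/photoshop/1.0/');

// Keywords: every rdf:li in the rdf:Bag directly under dc:subject.
$keywords = [];
foreach ($xpath->query('//d:subject/r:Bag/r:li') as $li) {
    $keywords[] = trim($li->textContent);
}

// Supplemental categories: same shape, different parent element.
$categories = [];
foreach ($xpath->query('//ps:SupplementalCategories/r:Bag/r:li') as $li) {
    $categories[] = trim($li->textContent);
}

echo implode(', ', $keywords), "\n";   // prints: keyword 1, keyword 2, keyword 3
echo implode(', ', $categories), "\n"; // prints: category 1, category 2
```

This is why the URI in xmlns:rdf="..." matters: it is the actual identity of the namespace, and the rdf: prefix is just the shorthand this particular file chose for it.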

Anyway, discussions about XML and namespaces are off topic here,
search the appropriate newsgroups.

>
>   specifying
>
> > the target namespace URI. Actually, XML parsing is much easier than
> > you would think.
>
> I actually started off thinking this would be easy and have moved
> incrementally to more and more difficult! At the moment I think it's
> ridiculously hard.
>
>    At any rate this is Adobe stuff and I've always found Adobe software
> to be needlessly complicated and obtuse and that's just
>                    from trying to use it!
>
>    Jeff
>
> You only need to have a basic knowledge, not so much
>
>
>
> > of the PHP DOM extension, but rather of XML itself.
>
> > Jeroen Aarts

> > http://www.clickworks.be