Parsing XML tags with namespace qualifier?

Roy

Nov 22, 2008, 8:32:41 AM
to beautifulsoup
Hi,

I'm trying to parse a WordPress export file. In my code I get all the
<item> tags by soup.findAll('item'), then for each item I'm going to
extract the individual fields. Then I ran into some trouble because
the export file has some tags like:

<item>
...
<content:encoded><![CDATA[]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>338</wp:post_id>
...
</item>

I'm not sure how to extract the value of the namespace-qualified tags;
obviously item.wp:post_id doesn't work. Does anyone know how to do this?

Thanks!

Aaron DeVore

Nov 23, 2008, 1:59:19 AM
to beauti...@googlegroups.com
I don't quite get what you mean. Why not just use
item.find("wp:post_id")? Or, if you want to get the value,
item.find("wp:post_id").string?

-Aaron
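
Aaron's suggestion can be sketched roughly as follows. This is a minimal illustration, not code from the thread: it assumes the later bs4 package (where findAll is spelled find_all) with Python's built-in "html.parser", and the SAMPLE markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the export file described above.
SAMPLE = """
<channel>
  <item>
    <wp:post_id>338</wp:post_id>
  </item>
  <item>
    <wp:post_id>339</wp:post_id>
  </item>
</channel>
"""

# Assumption: bs4 with the stdlib "html.parser" backend, which keeps the
# "wp:" prefix as part of the tag name.
soup = BeautifulSoup(SAMPLE, "html.parser")

# Pass the qualified name as a string instead of using attribute access,
# since item.wp:post_id is not valid Python syntax.
post_ids = [item.find("wp:post_id").string for item in soup.find_all("item")]
print(post_ids)
```

The qualified name is treated as an ordinary tag name here; no namespace resolution takes place.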

Lino Mastrodomenico

Nov 22, 2008, 7:14:33 PM
to beauti...@googlegroups.com
2008/11/22 Roy <roy...@gmail.com>:

> I'm trying to parse a WordPress export file. In my code I get all the
> <item> tags by soup.findAll('item'), then for each item I'm going to
> extract the individual fields. Then I ran into some trouble because
> the export file has some tags like:
>
> <item>
> ...
> <content:encoded><![CDATA[]]></content:encoded>
> <excerpt:encoded><![CDATA[]]></excerpt:encoded>
> <wp:post_id>338</wp:post_id>
> ...
> </item>

As much as I love BeautifulSoup, for parsing valid XML with
namespaces, isn't a real XML parser more appropriate?

Something like ElementTree, which is included by default in Python 2.5
and later (and can be installed in older versions).
See: http://docs.python.org/library/xml.etree.elementtree.html

--
Lino Mastrodomenico
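
For comparison, the ElementTree route Lino mentions might look like the sketch below. The sample document and the namespace URIs in NS are assumptions made for illustration; a real export file declares its own URIs on the root element, and the mapping has to match those.

```python
import xml.etree.ElementTree as ET

# Hypothetical export fragment; the xmlns URIs are assumed, not taken
# from the original poster's file.
SAMPLE = """<rss xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <item>
      <content:encoded><![CDATA[Hello <b>world</b>]]></content:encoded>
      <wp:post_id>338</wp:post_id>
    </item>
  </channel>
</rss>"""

# Prefix-to-URI mapping used for lookups; must agree with the document.
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wp": "http://wordpress.org/export/1.0/",
}

root = ET.fromstring(SAMPLE)
records = []
for item in root.iter("item"):
    # findtext resolves the prefix through NS and returns the element text.
    post_id = item.findtext("wp:post_id", namespaces=NS)
    body = item.findtext("content:encoded", namespaces=NS)
    records.append((post_id, body))

print(records)
```

Unlike the string-based lookup in Beautiful Soup, the prefixes here are resolved to namespace URIs, so the prefix used in the query does not have to match the one used in the file.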

Aaron DeVore

Nov 23, 2008, 3:33:56 PM
to beauti...@googlegroups.com
On Sat, Nov 22, 2008 at 4:14 PM, Lino Mastrodomenico
<l.mastro...@gmail.com> wrote:
> As much as I love BeautifulSoup, for parsing valid XML with
> namespaces, isn't a real XML parser more appropriate?
>
> Something like ElementTree, which is included by default in Python 2.5
> and later (and can be installed in older versions).
> See: http://docs.python.org/library/xml.etree.elementtree.html


Beautiful Soup seems to handle XML namespaces reasonably well. There
are a couple of modifications to sgmllib's regular expressions that
make sgmllib handle namespaces correctly. When querying I just use
find*("ns:tagname"). There's no particular need to do something
special. Unless, of course, I don't understand that part of Beautiful
Soup. :P

-Aaron
