Read in XML without escaping CDATA block contents

spiffytech

Sep 8, 2012, 1:50:49 PM
to beauti...@googlegroups.com
I'm reading in an XML file that contains CDATA blocks like this:

<bullet begin="2">
    <text><![CDATA[<P ALIGN="LEFT">The purposes of information security policies.</P>]]></text>
</bullet>

When I print out the BeautifulSoup tag tree, I get this:

<bullet begin="2">
    <text>&lt;P ALIGN="LEFT"&gt;The purposes of information security policies.&lt;/P&gt;</text>
</bullet>

Is there a way to make BeautifulSoup preserve the CDATA markup instead of escaping the block contents?

Leonard Richardson

Sep 8, 2012, 2:24:02 PM
to beauti...@googlegroups.com
> Is there a way to make BeautifulSoup preserve the CDATA markup instead of
> escaping the block contents?

You're seeing decisions made by lxml's XML parser.

"By default, lxml's parser will strip CDATA sections from the tree and
replace them by their plain text content. As real applications for
CDATA are rare, this is the best way to deal with this issue."

-- http://lxml.de/api.html#cdata

If you're using pure lxml, you can preserve the CDATA blocks by
passing strip_cdata=False into the XMLParser constructor. Beautiful
Soup's lxml tree builder passes strip_cdata=False, but Beautiful Soup
uses a different parser interface from the default, and
strip_cdata=False has no effect when using this interface. I believe
this is a bug in lxml.
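
For the pure-lxml route, here is a minimal sketch of what passing strip_cdata=False looks like. It assumes lxml is installed and reuses the snippet from the question; the variable names (SAMPLE, preserved, stripped) are only illustrative, and this bypasses Beautiful Soup entirely.

```python
from lxml import etree

# The fragment from the original question, as bytes.
SAMPLE = (
    b'<bullet begin="2">'
    b'<text><![CDATA[<P ALIGN="LEFT">The purposes of information '
    b'security policies.</P>]]></text>'
    b'</bullet>'
)

# Default parser: CDATA sections are replaced by their plain text content,
# so the markup inside is escaped again on serialization.
stripped = etree.tostring(etree.fromstring(SAMPLE), encoding="unicode")

# Parser created with strip_cdata=False: the CDATA section is kept in the
# tree and written back out as a CDATA section.
cdata_parser = etree.XMLParser(strip_cdata=False)
preserved = etree.tostring(
    etree.fromstring(SAMPLE, cdata_parser), encoding="unicode"
)

print(stripped)
print(preserved)
```

Comparing the two printed strings should show the escaped form from the question in the first and the original CDATA block in the second.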

In my tests, the only parser that preserved that CDATA block was
Python's built-in HTMLParser. If preserving the CDATA block is
essential, I recommend you either parse the data with HTMLParser, or
work directly with lxml--whichever is easier.

Leonard