> Is there a way to make BeautifulSoup preserve the CDATA markup instead of
> escaping the block contents?
You're seeing decisions made by lxml's XML parser. From the lxml
documentation:
"By default, lxml's parser will strip CDATA sections from the tree and
replace them by their plain text content. As real applications for
CDATA are rare, this is the best way to deal with this issue."
If you're using pure lxml, you can preserve the CDATA blocks by
passing strip_cdata=False into the XMLParser constructor. Beautiful
Soup's lxml tree builder passes strip_cdata=False, but Beautiful Soup
uses a different parser interface from the default, and
strip_cdata=False has no effect when using this interface. I believe
this is a bug in lxml.
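
For reference, here is a minimal sketch of the pure-lxml route, assuming
lxml is installed. The sample document and the variable names are made
up for illustration; only XMLParser and strip_cdata come from the
discussion above.

```python
from lxml import etree

# Hypothetical sample input containing one CDATA section.
xml = "<root><![CDATA[a < b]]></root>"

# strip_cdata=False asks lxml to keep CDATA sections in the tree
# instead of replacing them with plain text.
parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring(xml, parser)

# The text content is the same either way...
print(root.text)

# ...but the CDATA markup should survive serialization.
serialized = etree.tostring(root)
print(serialized)
```

Without strip_cdata=False, the serialized output would escape the
contents (a &lt; b) rather than wrap them in a CDATA section.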
In my tests, the only parser that preserved that CDATA block was
Python's built-in HTMLParser. If preserving the CDATA block is
essential, I recommend you either parse the data with HTMLParser, or
work directly with lxml--whichever is easier.
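
If you want to stay inside Beautiful Soup, the HTMLParser route might
look roughly like the sketch below. It assumes the bs4 package and its
"html.parser" tree builder; the sample document is made up. Whether the
CDATA markup survives can depend on your Python and Beautiful Soup
versions, so check the printed output on your own setup.

```python
from bs4 import BeautifulSoup, CData

# Hypothetical sample input containing one CDATA section.
data = "<root><![CDATA[a < b]]></root>"

# "html.parser" selects the tree builder backed by Python's built-in
# HTMLParser instead of lxml.
soup = BeautifulSoup(data, "html.parser")

# Collect any strings that were parsed as CDATA sections.
cdata_nodes = soup.find_all(string=lambda s: isinstance(s, CData))

print(soup)
print(cdata_nodes)
```

Keep in mind that HTMLParser is an HTML parser, so tag names are
lowercased and XML-specific features are not supported.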