Problem parsing CData in beautifulsoup4 4.5.3 - CData disappears

Michael Podvinec

Apr 11, 2017, 7:28:25 AM

to beautifulsoup

Hi,

Sorry if this turns out to be a double-post.

I am trying to parse some XHTML-like markup (Confluence storage format) and Beautifulsoup fails to parse included CData at all.

Here is a minimal, synthetic example:

Beautiful Soup 4.5.3

lxml 3.7.3

Python 2.7

from bs4 import BeautifulSoup, CData
from lxml import etree

inputstring = '<root><![CDATA[git config --global]]></root>'

# Parse with Beautiful Soup, using lxml's HTML parser
soup = BeautifulSoup(inputstring, "lxml")
# Parse the same string directly with lxml's XML parser for comparison
root = etree.fromstring(inputstring)

print "Input: " + inputstring
print "Pretty: " + soup.prettify()
print "Text: " + soup.text
print "lxml parsed: " + etree.tostring(root)

$ python parseHtml_test.py
Input: <root><![CDATA[git config --global]]></root>
Pretty: <html>
 <body>
  <root>
  </root>
 </body>
</html>

Expected: "git config --global" is contained in the <root> tag, probably stripped of the CData tag, according to the documentation

Obtained: the <root> tag is empty.

Am I doing something stupid or is this perhaps a bug in beautifulsoup?

Thanks,

-Michael

leonardr

May 7, 2017, 9:42:51 AM

to beautifulsoup

Michael,

I investigated this as a possible bug in Beautiful Soup and determined that the problem (if there is a problem) is in lxml.

The way lxml treats CDATA has been the topic of Beautiful Soup bug reports before:
https://bugs.launchpad.net/beautifulsoup/+bug/1275085

What's new here is that the CDATA is being removed altogether rather than being replaced with its text. lxml has a tendency to silently drop things it doesn't understand, which has also been the subject of Beautiful Soup bug reports:
https://bugs.launchpad.net/beautifulsoup/+bug/1668070

Fortunately, in your case there's an easy solution. It looks like you're parsing XML. Beautiful Soup is primarily an HTML parser, so when you call soup = BeautifulSoup(inputstring, "lxml") you are running the markup through lxml's HTML parser. When the HTML parser encounters tags like <root>, it goes into "thing I don't understand" mode and is likely to drop markup it thinks it can't handle.

If you instead call soup = BeautifulSoup(inputstring, "lxml-xml") or soup = BeautifulSoup(inputstring, "xml"), you will run the markup through lxml's XML parser, which will not be freaked out by a tag called <root> and which will preserve the content of the CDATA (if not the CDATA markup itself).
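As a minimal sketch of that suggestion, here is the example string from the original post run through the "xml" parser. It is written for Python 3 and assumes both bs4 and lxml are installed; the variable name root_text is only for illustration.

```python
from bs4 import BeautifulSoup

inputstring = '<root><![CDATA[git config --global]]></root>'

# "xml" (or "lxml-xml") selects lxml's XML parser instead of its HTML parser.
soup = BeautifulSoup(inputstring, "xml")

# The CDATA content should survive as the text of <root>,
# although the CDATA markup itself is not kept.
root_text = soup.root.get_text()
print(root_text)
```

Whether this is enough depends on the input: markup that is not well-formed XML may still cause trouble for the XML parser.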

Leonard

Michael Podvinec

Jul 10, 2017, 9:00:30 AM

to beautifulsoup

Hi Leonard,

Thank you very much for investigating this problem in depth and pointing out what is going wrong and where.

I had tried to use the XML parser, but since the original document isn't real XML either, this led to other problems.

In my use case, I needed to parse the document, do some changes and write out the document again.

In the end, I resorted to a workaround, hiding the CDATA blocks from the parser before parsing and restoring them afterwards. Not very nice, but it worked for my one-off use case.

import re

# Protect CDATA blocks
regex = re.compile(r'<!\[CDATA\[(.+?)\]\]>', re.DOTALL)
inputData = regex.sub(r'X![CDATA[\1]]X', inputData)

and then reverting this substitution after editing was complete.
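For anyone who wants the full round trip, the protect and restore steps might look roughly like the sketch below. The protect pattern is the one from this post. The restore pattern and the function names protect_cdata and restore_cdata are assumptions for illustration, since the reverse substitution is not shown above.

```python
import re

# Pattern from the post: matches a CDATA section and captures its content.
CDATA_RE = re.compile(r'<!\[CDATA\[(.+?)\]\]>', re.DOTALL)
# Assumed inverse pattern (not in the original post): matches the placeholder.
PLACEHOLDER_RE = re.compile(r'X!\[CDATA\[(.+?)\]\]X', re.DOTALL)


def protect_cdata(markup):
    """Hide CDATA sections from the parser by rewriting their delimiters."""
    return CDATA_RE.sub(r'X![CDATA[\1]]X', markup)


def restore_cdata(markup):
    """Turn the placeholders back into real CDATA sections."""
    return PLACEHOLDER_RE.sub(r'<![CDATA[\1]]>', markup)


original = '<root><![CDATA[git config --global]]></root>'
protected = protect_cdata(original)
# ... parse, edit and serialize the protected markup here ...
restored = restore_cdata(protected)

print(protected)
print(restored)
```

Two things to check before relying on this: the (.+?) pattern skips empty CDATA sections, and a serializer may escape characters such as < or & inside the placeholder text, which would then need unescaping when restoring.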

Thanks,

-Michael
