Problem parsing CData in Beautiful Soup 4.5.3: CData disappears

Michael Podvinec

Apr 11, 2017, 7:28:25 AM
to beautifulsoup
Hi, 

Sorry if this turns out to be a double-post. 

I am trying to parse some XHTML-like markup (Confluence storage format), and Beautiful Soup drops the embedded CDATA entirely instead of parsing it. 


Here is a minimal, synthetic example: 
Beautiful Soup 4.5.3
lxml 3.7.3
Python 2.7

from bs4 import BeautifulSoup, CData
from lxml import etree

inputstring = '<root><![CDATA[git config --global]]></root>'
soup = BeautifulSoup(inputstring, "lxml")
root = etree.fromstring(inputstring)
print "Input: " + inputstring
print "Pretty: " + soup.prettify()
print "Text: " + soup.text
print "lxml parsed: " + etree.tostring(root)

$ python parseHtml_test.py
Input: <root><![CDATA[git config --global]]></root>
Pretty: <html>
 <body>
  <root>
  </root>
 </body>
</html>

Expected: "git config --global" is contained in the <root> tag, probably stripped of the CData tag, according to the documentation
Obtained: the <root> tag is empty. 

Am I doing something stupid or is this perhaps a bug in beautifulsoup? 

Thanks, 
-Michael 

leonardr

May 7, 2017, 9:42:51 AM
to beautifulsoup
Michael,

I investigated this as a possible bug in Beautiful Soup and determined that the problem (if there is a problem) is in lxml.

The way lxml treats CDATA has been the topic of Beautiful Soup bug reports before:
https://bugs.launchpad.net/beautifulsoup/+bug/1275085

What's new here is that the CDATA is being removed altogether rather than being replaced with its text. lxml has a tendency to silently drop things it doesn't understand, which has also been the subject of Beautiful Soup bug reports:
https://bugs.launchpad.net/beautifulsoup/+bug/1668070

Fortunately, in your case there's an easy solution. It looks like you're parsing XML. Beautiful Soup is primarily an HTML parser, so when you call soup = BeautifulSoup(inputstring, "lxml") you are running the markup through lxml's HTML parser. When the HTML parser encounters tags like <root>, it goes into "thing I don't understand" mode and is likely to drop markup it thinks it can't handle.

If you instead call soup = BeautifulSoup(inputstring, "lxml-xml") or soup = BeautifulSoup(inputstring, "xml"), you will run the markup through lxml's XML parser, which will not be freaked out by a tag called <root> and which will preserve the content of the CDATA (if not the CDATA markup itself).
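
A minimal sketch of the suggestion above, reusing the input string from the original post. It assumes both Beautiful Soup 4 and lxml are installed; the variable name root_text is only for illustration. With the XML parser the CDATA wrapper itself is not kept, but its text should come through as ordinary text inside <root>.

```python
from bs4 import BeautifulSoup

inputstring = '<root><![CDATA[git config --global]]></root>'

# "xml" (or its alias "lxml-xml") selects lxml's XML parser
# instead of lxml's HTML parser.
soup = BeautifulSoup(inputstring, "xml")

# The CDATA markup is gone, but the text it wrapped is still in <root>.
root_text = soup.root.get_text()
print("Text: " + root_text)
```

Note that this only helps when the input is well-formed enough for an XML parser to accept it.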

Leonard

Michael Podvinec

Jul 10, 2017, 9:00:30 AM
to beautifulsoup
Hi Leonard, 

Thank you very much for investigating this problem in depth and pointing out what is going wrong and where.

I had tried to use the XML parser, but since the original document isn't real XML either, this led to other problems. 
In my use case, I needed to parse the document, make some changes and write the document out again. 

In the end, I resorted to a workaround, hiding the CDATA blocks from the parser before parsing and restoring them afterwards. Not very nice, but it worked for my one-off use case. 

import re

# Protect CDATA blocks: replace the angle brackets with "X" so the parser
# sees the whole block as plain text
regex = re.compile(r'<!\[CDATA\[(.+?)\]\]>', re.DOTALL)
inputData = regex.sub(r'X![CDATA[\1]]X', inputData)

and then reverting this substitution after editing was complete. 
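
For anyone reading along, here is a rough sketch of the full round trip, including the reverse substitution that the post only describes in words. The function names protect_cdata and restore_cdata and the restore pattern are assumptions made for illustration, not the code actually used. It uses only the standard library.

```python
import re

# Forward pattern as in the post above; the reverse pattern is an assumption.
CDATA_RE = re.compile(r'<!\[CDATA\[(.+?)\]\]>', re.DOTALL)
PROTECTED_RE = re.compile(r'X!\[CDATA\[(.+?)\]\]X', re.DOTALL)


def protect_cdata(markup):
    """Hide CDATA blocks by swapping the outer angle brackets for 'X'."""
    return CDATA_RE.sub(r'X![CDATA[\1]]X', markup)


def restore_cdata(markup):
    """Undo protect_cdata once parsing and editing are finished."""
    return PROTECTED_RE.sub(r'<![CDATA[\1]]>', markup)


original = '<root><![CDATA[git config --global]]></root>'
protected = protect_cdata(original)
# ... parse `protected`, edit the tree, serialize it back to a string ...
restored = restore_cdata(protected)

print(protected)
print(restored)
```

This is only safe when the CDATA content survives the parser unchanged as text. Content that contains characters such as < or & may be altered or escaped on the way through, so it is worth checking the output for documents like that.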

Thanks, 
-Michael 

