Hi,
Sorry if this turns out to be a double-post.
I am trying to parse some XHTML-like markup (Confluence storage format), and Beautiful Soup drops the CDATA section it contains entirely.
Versions: Beautiful Soup 4.5.3, lxml 3.7.3, Python 2.7.
Here is a minimal, synthetic example:
from bs4 import BeautifulSoup
from lxml import etree
inputstring = '<root><![CDATA[git config --global]]></root>'
# Parse with Beautiful Soup on top of lxml's HTML parser.
soup = BeautifulSoup(inputstring, "lxml")
# Parse the same string directly with lxml's XML parser, for comparison.
root = etree.fromstring(inputstring)
print "Input: " + inputstring
print "Pretty: " + soup.prettify()
print "Text: " + soup.text
print "lxml parsed: " + etree.tostring(root)
$ python parseHtml_test.py
Input: <root><![CDATA[git config --global]]></root>
Pretty: <html>
<body>
<root>
</root>
</body>
</html>
Expected: "git config --global" is contained in the <root> tag, probably with the CDATA wrapper stripped, as the documentation describes.
Obtained: the <root> tag is empty.
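For comparison, here is a small sketch of the behaviour I expected. It uses only the standard library's xml.etree.ElementTree, which is my substitution to keep it self-contained; it is not the lxml HTML parser that Beautiful Soup uses in the example above. The helper name cdata_text is made up for illustration. An XML parser treats the CDATA section as ordinary character data of <root>:

```python
# Sketch only: standard-library XML parser, not Beautiful Soup or lxml.
import xml.etree.ElementTree as ET

inputstring = '<root><![CDATA[git config --global]]></root>'


def cdata_text(markup):
    """Parse markup as XML and return the root element's text.

    Hypothetical helper for illustration; the CDATA wrapper is dropped
    by the parser and its content becomes the element's text.
    """
    root = ET.fromstring(markup)
    return root.text


print("Text: " + cdata_text(inputstring))  # prints "Text: git config --global"
```

Passing "xml" instead of "lxml" to BeautifulSoup selects lxml's XML parser rather than its HTML parser, so it may behave more like this, but I have not verified that with the versions listed above.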
Am I doing something stupid, or is this perhaps a bug in Beautiful Soup?
Thanks,
-Michael