Gustavo Ambrozio
Apr 20, 2012, 3:19:52 PM
to beautifulsoup
Hi,
In my HTML I have a bunch of code like this:
<html><body><script>a="<a href='teste'>";</script></body></html>
If I parse this with BS4 I get what is the correct (in the strict
sense) result:
>>> from bs4 import BeautifulSoup
>>> from bs4 import CData
>>> html = "<html><body><script>a=\"<a href='teste'>\";</script></body></html>"
>>> s = BeautifulSoup(html)
>>> s
<html><body><script>a="&lt;a href='teste'&gt;";</script></body></html>
But even though it's correct, it does not work in the browser, because
now the browser thinks the variable a is "&lt;a href='teste'&gt;".
Weird but true.
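As far as I can tell, this happens because HTML parsers treat the contents of a <script> element as raw text, so entities such as &lt; are not decoded there. Here is a minimal sketch of that behavior, using Python 3's standard html.parser as a stand-in for the browser (that substitution is my assumption, and ScriptTextCollector and script_text are names made up for this illustration):

```python
from html.parser import HTMLParser


class ScriptTextCollector(HTMLParser):
    """Hypothetical helper: collects the text found inside <script> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script content is handed over verbatim; entities are not decoded.
        if self.in_script:
            self.chunks.append(data)


def script_text(html):
    """Return the concatenated text of all <script> elements in html."""
    parser = ScriptTextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


escaped = "<html><body><script>a=\"&lt;a href='teste'&gt;\";</script></body></html>"
raw = "<html><body><script>a=\"<a href='teste'>\";</script></body></html>"

print(script_text(escaped))  # the entities stay as written
print(script_text(raw))      # the markup-like text stays as written
```

In both cases the script body comes back exactly as it appears in the source, which is why the escaped form ends up as the literal value of the variable.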
I'd like to get the code wrapped in a cdata, so I did this:
>>> script_text = s.html.body.script.children.next()
>>> cdata = CData(script_text.string)
>>> cdata
u'a="<a href=\'teste\'>";'
>>> script_text.replace_with(cdata)
>>> s
<html><body><script><![CDATA[a="&lt;a href='teste'&gt;";]]></script></body></html>
Is this the correct behavior? Shouldn't it be:
<html><body><script><![CDATA[a="<a href='teste'>";]]></script></body></html>
If not, how can I achieve this? I know that doing:
>>> s.encode(formatter=None)
'<html><body><script><![CDATA[a="<a href=\'teste\'>";]]></script></body></html>'
works, but I'd still like it to fix invalid characters elsewhere, for
example inside an <a> tag, and formatter=None would not fix those.
Any help is appreciated.
Cheers,
Gustavo