Invalid html is parsed correctly but left broken on output?

Nick Welch

Nov 18, 2009, 12:02:05 AM

to beautifulsoup

I was assuming that this assertion would pass:

html = '<h4>foo & bar</h4>'
assert BS.BeautifulSoup(html).renderContents() == '<h4>foo &amp; bar</h4>'

But it doesn't! I just get the original stray '&' in the output. And
actually, the same happens with respect to '<'. BeautifulSoup is smart
enough to not be tricked by these when building its tree, but then it
gladly serializes them back out as invalid html.

How do I get more sane/safe output from BeautifulSoup?

Aaron DeVore

Nov 18, 2009, 12:55:27 PM

to beauti...@googlegroups.com

The function you want is prettify(), not renderContents().
renderContents does just what it sounds like; it renders the
*children* of the tag. In this case, that's "foo & bar".

Cheers!
Aaron DeVore

Nick Welch

Nov 19, 2009, 12:14:50 AM

to beautifulsoup

I thought prettify just modified the whitespace to make the output
more human readable, and a short test supports that notion:

>>> html = BS.BeautifulSoup('<h4>foo & bar</h4>')
>>> html.renderContents()
'<h4>foo & bar</h4>'

>>> html.prettify()
'<h4>\n foo & bar\n</h4>'

What I want the output to be is:

'<h4>foo &amp; bar</h4>'
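
For illustration, here is a rough sketch of the behavior being asked for: parse leniently, then escape text when writing it back out, so that a bare '&' is serialized as '&amp;'. It does not use BeautifulSoup at all. It relies only on Python 3's standard html.parser and html.escape, and the names EscapingSerializer and reserialize are hypothetical helpers made up for this example. It drops comments and doctypes and would wrongly escape the contents of script or style elements, so treat it as a demonstration of the idea rather than a replacement serializer.

```python
from html import escape
from html.parser import HTMLParser


class EscapingSerializer(HTMLParser):
    """Hypothetical helper: re-emit parsed HTML with text and attributes escaped."""

    def __init__(self):
        # convert_charrefs=True means text reaches handle_data already decoded.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive decoded, so escape them again on output.
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # A stray '&' and a proper '&amp;' both arrive here as '&';
        # escaping on the way out makes the two cases serialize identically.
        self.parts.append(escape(data, quote=False))


def reserialize(markup):
    """Parse markup and return it with text nodes escaped (sketch only)."""
    parser = EscapingSerializer()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(reserialize('<h4>foo & bar</h4>'))
```

Because the text is decoded on the way in and escaped on the way out, input that was already valid passes through unchanged, while the stray ampersand gets repaired.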
