BeautifulSoup did not give me unicode, dammit


Nick Welch

Dec 4, 2009, 3:13:58 AM
to beautifulsoup
>>> BS.BeautifulSoup(u'\u201cnext generation\u201d').renderContents()
'\xe2\x80\x9cnext generation\xe2\x80\x9d'

It's giving me back a utf8-encoded str object. What gives?

Aaron DeVore

Dec 4, 2009, 12:27:31 PM
to beauti...@googlegroups.com
renderContents converts to a str on Python 2.*. Use this instead:

unicode(BS.BeautifulSoup(u'\u201cnext generation\u201d'))

-Aaron DeVore

sparky

Jan 14, 2010, 2:06:10 AM
to beautifulsoup
I have this, which breaks and, in my opinion, shouldn't:

>>> print BS(u'\u201cnext generation\u201d').renderContents()
“next generation”
>>> print unicode(BS(u'\u201cnext generation\u201d').renderContents())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

Am I missing something too?

C.

On Dec 4 2009, 7:27 pm, Aaron DeVore <aaron.dev...@gmail.com> wrote:
> renderContents converts to a str on Python 2.*. Use this instead:
>
> unicode(BS.BeautifulSoup(u'\u201cnext generation\u201d'))
>
> -Aaron DeVore
>

Aaron DeVore

Jan 15, 2010, 6:24:10 PM
to beauti...@googlegroups.com
On Wed, Jan 13, 2010 at 11:06 PM, sparky <craig.s...@gmail.com> wrote:
> >>> print BS(u'\u201cnext generation\u201d').renderContents()
> “next generation”
> >>> print unicode(BS(u'\u201cnext generation\u201d').renderContents())
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
>
> Am I missing something too?

I don't have a good grasp of Unicode/encodings so I can't give an
authoritative word on this. However, I do know where the error is
coming from. Calling unicode(soup.renderContents()) attempts to convert
a UTF-8 encoded str to unicode using the ascii codec, which can't decode
the bytes that encode \u201c. Instead, you need to decode with utf-8:

soup.renderContents().decode('utf-8')
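
For what it's worth, here is a minimal sketch of the same failure without BeautifulSoup in the picture at all. It assumes only that the rendered output is UTF-8 bytes, as in the first message; the variable names are made up for illustration and aren't part of the library.

```python
# The original unicode text from the thread.
text = u'\u201cnext generation\u201d'

# Stand-in for what renderContents() returned: UTF-8 encoded bytes.
encoded = text.encode('utf-8')

# Decoding those bytes as ascii fails, because 0xe2 is outside the
# 0-127 range that the ascii codec covers.
try:
    encoded.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the codec that produced the bytes round-trips cleanly.
decoded = encoded.decode('utf-8')

print(ascii_ok)
print(decoded == text)
```

On Python 2, unicode(some_str) with no encoding argument falls back to the ascii codec, which is why the explicit .decode('utf-8') is needed.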

-Aaron DeVore
