&nbsp; converted to a Unicode character?

Geoff Dillon

Mar 30, 2012, 4:38:29 PM
to beautifulsoup
I am trying to scrape some data out of the HTML of a Belarc Advisor scan, and
I'm finding that whenever the source HTML contains "&nbsp;" the
resulting string in bs4 looks like a capital A with a caret (Â). I
haven't found anybody else with this problem yet. Has anybody seen
this?

Leonard Richardson

Mar 30, 2012, 4:54:18 PM
to beauti...@googlegroups.com
&nbsp; corresponds to the Unicode character NO-BREAK SPACE.

>>> print u"\N{NO-BREAK SPACE}"

NO-BREAK SPACE is encoded in UTF-8 as "\xc2\xa0".

>>> u"\N{NO-BREAK SPACE}".encode("utf8")
'\xc2\xa0'

But in Latin-1, "\xc2\xa0" decodes to the character Â followed by a NO-BREAK SPACE.

>>> print u"\N{NO-BREAK SPACE}".encode("utf8").decode("latin-1")
Â

You're viewing a UTF-8 document using a browser that interprets it as Latin-1.
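
The interpreter sessions above use Python 2 syntax. For readers on Python 3, the same round trip might look roughly like the sketch below. The variable names are made up for illustration, and the last step only shows that the wrong decode is reversible. It is not one of the fixes suggested in this thread.

```python
# Python 3 sketch of the encoding mismatch described above.
# All variable names here are illustrative.
NBSP = "\N{NO-BREAK SPACE}"  # the character that &nbsp; stands for

# UTF-8 stores NO-BREAK SPACE as two bytes.
utf8_bytes = NBSP.encode("utf-8")

# Reading those two bytes as Latin-1 yields two characters:
# a capital A with circumflex, then a no-break space.
misread = utf8_bytes.decode("latin-1")

# Undoing the wrong decode recovers the original single character.
repaired = misread.encode("latin-1").decode("utf-8")

print(repr(utf8_bytes))
print(repr(misread))
print(repaired == NBSP)
```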

There are a number of solutions:

1. Add a <meta> tag that specifies the document's encoding, so the
browser knows to interpret it as UTF-8.

2. Encode the document as Latin-1 using soup.encode("latin-1")
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-encoding

3. Turn the NO-BREAK SPACE back into &nbsp; with soup.encode(formatter="html")
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters
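
Options 2 and 3 could be tried out with something like the following Python 3 sketch. The markup string is a made-up stand-in for the Belarc Advisor page, and the choice of the "html.parser" backend is an assumption; the original message does not say which parser was used.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document standing in for the real scan output.
markup = "<p>Total&nbsp;memory</p>"
soup = BeautifulSoup(markup, "html.parser")

# After parsing, the entity has become the NO-BREAK SPACE character itself.
text = soup.p.get_text()

# Option 2: emit Latin-1 bytes, where NO-BREAK SPACE is the single byte 0xA0.
latin1_bytes = soup.encode("latin-1")

# Option 3: keep the default UTF-8 output, but write named entities again.
entity_bytes = soup.encode(formatter="html")

print(repr(text))
print(latin1_bytes)
print(entity_bytes)
```

Either byte string can then be written to a file opened in binary mode. Whether it displays correctly still depends on the viewer assuming the matching encoding, which is what option 1 addresses.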

Leonard
