>>> print u"\N{NO-BREAK SPACE}"
NO-BREAK SPACE is encoded in UTF-8 as "\xc2\xa0".
>>> u"\N{NO-BREAK SPACE}".encode("utf8")
'\xc2\xa0'
But in Latin-1, each of those bytes is a character of its own, so "\xc2\xa0" decodes to the character Â followed by a NO-BREAK SPACE.
>>> print u"\N{NO-BREAK SPACE}".encode("utf8").decode("latin-1")
Â
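If you're on Python 3, the same round trip can be sketched roughly as below. This is only an illustration of the mechanism, using nothing but the standard library; the variable names are mine.

```python
# Python 3 version of the session above; no u"" prefix is needed.
nbsp = "\N{NO-BREAK SPACE}"

# UTF-8 represents NO-BREAK SPACE as two bytes: 0xC2 0xA0.
utf8_bytes = nbsp.encode("utf-8")

# Latin-1 maps every byte to exactly one character, so reading those two
# bytes as Latin-1 yields two characters: U+00C2 (Â) and U+00A0.
misread = utf8_bytes.decode("latin-1")

print(repr(utf8_bytes))
print(misread)
```

The second character of misread is still a NO-BREAK SPACE, which is why only the stray Â is visible.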
You're viewing a UTF-8 document using a browser that interprets it as Latin-1.
There are a number of solutions:
1. Add a <meta> tag that specifies the document's encoding, so the
browser knows to interpret it as UTF-8.
2. Encode the document as Latin-1 using soup.encode("latin-1")
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-encoding
3. Turn the NO-BREAK SPACE back into &nbsp; with soup.encode(formatter="html")
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters
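Options 2 and 3 can be sketched as follows. This is a minimal example that assumes bs4 is installed and uses the built-in "html.parser"; the one-line sample document is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document containing a NO-BREAK SPACE.
soup = BeautifulSoup("<p>a&nbsp;b</p>", "html.parser")

# Default output is UTF-8: the NO-BREAK SPACE becomes the bytes \xc2\xa0,
# which a browser reading the page as Latin-1 shows with a stray Â.
utf8_bytes = soup.encode()

# Option 2: encode as Latin-1, where NO-BREAK SPACE is the single byte \xa0.
latin1_bytes = soup.encode("latin-1")

# Option 3: let the "html" formatter write the named entity instead.
entity_bytes = soup.encode(formatter="html")

print(utf8_bytes)
print(latin1_bytes)
print(entity_bytes)
```

For option 1, a tag such as <meta charset="utf-8"> in the document's <head> is the usual way to declare the encoding.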
Leonard