Baffling encoding problem

guillaume...@imperial.ac.uk
Apr 11, 2018, 5:48:21 PM
to beautifulsoup
Consider the following document, windows-1252 encoded (referred to as a in the calls below):

b'<html><head>\n<title>Message: &#147;Our Line&#146;s Been Changed Again&#148;</title>\n</head>\n<p>Message: &#147;Our Line&#146;s Been Changed Again&#148;</p>\n<p>But... \x93What Does It Mean?\x97Not Very Much.\x94 </p\n</body>\n</html>\n'

Notice how the Windows smart quotes are escaped as numeric character references in the title and the first paragraph, but appear as raw windows-1252 bytes in the second paragraph. Calling BeautifulSoup like this:

bs4.BeautifulSoup(a, from_encoding='windows-1252').prettify('utf-8')
b'<html>\n <head>\n  <title>\n   Message: \xc2\x93Our Line\xc2\x92s Been Changed Again\xc2\x94\n  </title>\n </head>\n <body>\n  <p>\n   Message: \xc2\x93Our Line\xc2\x92s Been Changed Again\xc2\x94\n  </p>\n  <p>\n   But... \xe2\x80\x9cWhat Does It Mean?\xe2\x80\x94Not Very Much.\xe2\x80\x9d\n  </p>\n </body>\n</html>\n'

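Now what's weird here is that the raw smart quotes have been correctly transcoded to UTF-8; however, the HTML-escaped sequences are mangled: \xc2\x93 is the UTF-8 encoding of U+0093, a C1 control character, not of the left double quotation mark (U+201C, which is \xe2\x80\x9c in UTF-8); yet \x93 is the correct windows-1252 byte for that quote....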

So somehow the escaped sequences have been (correctly) transcoded to windows-1252, but then incorrectly translated to UTF-8...
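
To make the byte-level observation concrete, here is a minimal sketch in plain Python 3 that uses nothing from Beautiful Soup. It only reproduces the two byte sequences seen in the output above; it does not claim to show what any parser does internally. The variable names are mine, chosen for illustration.

```python
# Raw byte from the last paragraph: a left double quote in windows-1252.
raw = b"\x93"
decoded = raw.decode("windows-1252")
print(hex(ord(decoded)))          # code point after decoding as windows-1252
print(decoded.encode("utf-8"))    # same bytes as in the last paragraph of the output

# The number in &#147; read directly as a Unicode code point instead.
naive = chr(147)
print(hex(ord(naive)))            # a C1 control character, not a quote
print(naive.encode("utf-8"))      # same bytes as in the title of the output
```

The first path gives b'\xe2\x80\x9c' and the second gives b'\xc2\x93', which are exactly the two forms that appear in the prettified result.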

What's going on? Interestingly html5lib works correctly, but both html.parser and lxml fail:

In [51]: diagnose(a)
Diagnostic running on Beautiful Soup 4.4.1
Python version 3.4.3 (default, Nov 28 2017, 16:40:41) 
[GCC 4.8.4]
Found lxml version 3.8.0.0
Found html5lib version 1.0b3

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<html>
 <head>
  <title>
   Message: “Our Line’s Been Changed Again”
  </title>
 </head>
 <p>
  Message: “Our Line’s Been Changed Again”
 </p>
 <p>
  But... “What Does It Mean?—Not Very Much.”
 </p>
</html>

--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
 <head>
  <title>
   Message: “Our Line’s Been Changed Again”
  </title>
 </head>
 <body>
  <p>
   Message: “Our Line’s Been Changed Again”
  </p>
  <p>
   But... “What Does It Mean?—Not Very Much.”
  </p>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <head>
  <title>
   Message: “Our Line’s Been Changed Again”
  </title>
 </head>
 <body>
  <p>
   Message: “Our Line’s Been Changed Again”
  </p>
  <p>
   But... “What Does It Mean?—Not Very Much.”
  </p>
 </body>
</html>

--------------------------------------------------------------------------------
Trying to parse your markup with ['lxml', 'xml']
Here's what ['lxml', 'xml'] did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<html>
 <head>
  <title>
   Message: “Our Line’s Been Changed Again”
  </title>
 </head>
 <p>
  Message: “Our Line’s Been Changed Again”
 </p>
 <p>
  But... “What Does It Mean?—Not Very Much.”
 </p>
</html>
--------------------------------------------------------------------------------
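
One possible reading of the difference between parsers, offered as an assumption rather than something the output above confirms: HTML5 parsing rules remap numeric character references in the 0x80-0x9F range through windows-1252, while a literal reading treats the number as a Unicode code point. The sketch below contrasts the two using only the standard library's html.unescape (which follows the HTML5 rule); it says nothing definite about what html.parser, lxml, or html5lib do inside Beautiful Soup.

```python
import html

ref = "&#147;"

# HTML5-style resolution: 147 (0x93) is remapped to the windows-1252 character.
html5_char = html.unescape(ref)

# Literal resolution: 147 taken as a Unicode code point.
literal_char = chr(int(ref[2:-1]))

print(repr(html5_char), html5_char.encode("utf-8"))
print(repr(literal_char), literal_char.encode("utf-8"))
```

If a parser resolved the reference the literal way, it would yield the \xc2\x93 bytes seen in the title; the HTML5 way would yield \xe2\x80\x9c.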
