Baffling encoding problem

guillaume...@imperial.ac.uk

Apr 11, 2018, 5:48:21 PM

to beautifulsoup

Consider the following document, windows-1252 encoded:

b'<html><head>\n<title>Message: &#147;Our Line&#146;s Been Changed Again&#148;</title>\n</head>\n<p>Message: &#147;Our Line&#146;s Been Changed Again&#148;</p>\n<p>But... \x93What Does It Mean?\x97Not Very Much.\x94 </p\n</body>\n</html>\n'

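Notice that the Windows smart quotes are escaped as numeric character references in the title and the first paragraph, but appear as raw bytes in the second paragraph. Calling BeautifulSoup as follows (with the bytes above bound to a) produces the output shown beneath the call: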

bs4.BeautifulSoup(a, from_encoding='windows-1252').prettify('utf-8')

b'<html>\n <head>\n  <title>\n   Message: \xc2\x93Our Line\xc2\x92s Been Changed Again\xc2\x94\n  </title>\n </head>\n <body>\n  <p>\n   Message: \xc2\x93Our Line\xc2\x92s Been Changed Again\xc2\x94\n  </p>\n  <p>\n   But... \xe2\x80\x9cWhat Does It Mean?\xe2\x80\x94Not Very Much.\xe2\x80\x9d\n  </p>\n </body>\n</html>\n'

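Now what's weird here is that the raw smart quotes have been correctly transcoded to UTF-8, whereas the HTML-escaped sequences are mangled: \xc2\x93 is the UTF-8 encoding of U+0093, a C1 control character, not of the left double quote U+201C (\xe2\x80\x9c). Yet \x93 is the correct windows-1252 byte for that quote...

So it looks as if the escaped sequences have been resolved to the right windows-1252 byte values, but those values were then treated as Unicode code points, instead of being decoded as windows-1252, before the output was encoded to UTF-8...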
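To make the difference concrete, here is a minimal sketch using only the Python standard library. It is not a reproduction of what Beautiful Soup does internally; it only shows the two code points involved and, as a point of comparison, the remapping that html.unescape applies to the same reference. The variable names are mine.

```python
import html

# &#147; is a numeric character reference to code point 147 (0x93).
# Taken literally, that is U+0093, a C1 control character.
literal = chr(147)
print(literal.encode('utf-8'))    # b'\xc2\x93'

# The raw byte 0x93, decoded as windows-1252, is the left double quote.
decoded = b'\x93'.decode('windows-1252')
print(decoded.encode('utf-8'))    # b'\xe2\x80\x9c'

# html.unescape remaps references in the 0x80-0x9F range the way
# browsers do, so it yields the quote rather than the control character.
remapped = html.unescape('&#147;')
print(remapped == decoded)        # True
```

The first result matches the mangled bytes in the title and first paragraph; the second matches the bytes in the second paragraph.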

What's going on? Interestingly html5lib works correctly, but both html.parser and lxml fail:

In [51]: diagnose(a)

Diagnostic running on Beautiful Soup 4.4.1

Python version 3.4.3 (default, Nov 28 2017, 16:40:41)

[GCC 4.8.4]

Found lxml version 3.8.0.0

Found html5lib version 1.0b3

Trying to parse your markup with html.parser

Here's what html.parser did with the markup:

<html>

<head>

<title>

Message: “Our Line’s Been Changed Again”

</title>

</head>

Message: “Our Line’s Been Changed Again”

But... “What Does It Mean?—Not Very Much.”

</html>

--------------------------------------------------------------------------------

Trying to parse your markup with html5lib

Here's what html5lib did with the markup:

<html>

<head>

<title>

Message: “Our Line’s Been Changed Again”

</title>

</head>

<body>

Message: “Our Line’s Been Changed Again”

But... “What Does It Mean?—Not Very Much.”

</body>

</html>

--------------------------------------------------------------------------------

Trying to parse your markup with lxml

Here's what lxml did with the markup:

<html>

<head>

<title>

Message: “Our Line’s Been Changed Again”

</title>

</head>

<body>

Message: “Our Line’s Been Changed Again”

But... “What Does It Mean?—Not Very Much.”

</body>

</html>

--------------------------------------------------------------------------------

Trying to parse your markup with ['lxml', 'xml']

Here's what ['lxml', 'xml'] did with the markup:

<?xml version="1.0" encoding="utf-8"?>

<html>

<head>

<title>

Message: “Our Line’s Been Changed Again”

</title>

</head>

Message: “Our Line’s Been Changed Again”

But... “What Does It Mean?—Not Very Much.”

</html>

--------------------------------------------------------------------------------
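For the record, a possible stopgap, assuming the document contains no intentional C1 control characters: post-process the decoded text and reinterpret any character in the U+0080 to U+009F range as a windows-1252 byte. The helper name repair_c1 is hypothetical, and this is only a sketch of a workaround, not a fix for the underlying behaviour.

```python
import re

# Matches characters in the C1 control range U+0080..U+009F.
_C1 = re.compile('[\x80-\x9f]')

def _as_cp1252(match):
    ch = match.group(0)
    try:
        # Reinterpret the code point as a single windows-1252 byte.
        return bytes([ord(ch)]).decode('windows-1252')
    except UnicodeDecodeError:
        # A few bytes (0x81, for example) are undefined in windows-1252;
        # leave those characters untouched.
        return ch

def repair_c1(text):
    """Hypothetical helper: map stray C1 characters to windows-1252 glyphs."""
    return _C1.sub(_as_cp1252, text)

mangled = 'Message: \x93Our Line\x92s Been Changed Again\x94'
print(repair_c1(mangled))
```

This operates on str, so it would be applied to the decoded output (for example the string returned by prettify() when called without an encoding) before encoding to UTF-8.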
