I've attached a patch that should solve the problem. Try it out and
let me know how it works.

On the face of it, the solution is simple: if lxml demands a
bytestring, send it a bytestring. But there are bytestrings that lxml
can't parse. In my experience, in the absence of an encoding
declaration, lxml assumes a bytestring is UTF-8. (I don't know if it
always assumes UTF-8, or if it assumes the system default encoding.)
If the bytestring isn't actually UTF-8, lxml will raise an exception.
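
To make the "bytestrings that can't be parsed as UTF-8" case concrete, here is a minimal sketch in plain Python. It does not call lxml at all; it only shows that the same markup is valid UTF-8 in one encoding and invalid in another. The helper name is_valid_utf8 is made up for this illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


markup = "<p>café</p>"

# The same text as two different bytestrings.
utf8_bytes = markup.encode("utf-8")
latin1_bytes = markup.encode("latin-1")

print(is_valid_utf8(utf8_bytes))    # the UTF-8 encoding decodes cleanly
print(is_valid_utf8(latin1_bytes))  # the Latin-1 encoding is not valid UTF-8
```

A parser that assumes UTF-8 has no good way to handle the second bytestring unless something tells it the real encoding.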
This patch runs a little experiment when you load the lxml
treebuilder, to see whether lxml will parse Unicode strings or not. If
not, then any bytestring you send to lxml will first be converted to
Unicode and then re-encoded as UTF-8, so that lxml always receives a
UTF-8 bytestring.
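
Roughly, the idea can be sketched like this. This is not the attached patch, just an illustration under some assumptions: the names parser_accepts_unicode, prepare_markup and BytesOnlyParser are hypothetical, the stand-in parser only imitates a platform where Unicode parsing fails, and the source encoding is assumed to be known rather than detected.

```python
def parser_accepts_unicode(make_parser):
    """Feed a str to a fresh parser and report whether it was accepted."""
    try:
        parser = make_parser()
        parser.feed("<a/>")
        parser.close()
        return True
    except Exception:
        # Any failure is treated as "Unicode input is not supported".
        return False


def prepare_markup(markup, unicode_ok, source_encoding="utf-8"):
    """Return markup in a form the parser is expected to accept."""
    if unicode_ok:
        return markup
    if isinstance(markup, bytes):
        # Assumption: the caller knows the real encoding of the bytes.
        markup = markup.decode(source_encoding)
    # Always hand the parser a UTF-8 bytestring.
    return markup.encode("utf-8")


class BytesOnlyParser:
    """Hypothetical stand-in for an lxml build without Unicode support."""

    def feed(self, data):
        if isinstance(data, str):
            raise ValueError("Unicode parsing is not supported on this platform")

    def close(self):
        return None


# Run the experiment once, then normalize markup based on the result.
unicode_ok = parser_accepts_unicode(BytesOnlyParser)
latin1_markup = "<p>café</p>".encode("latin-1")
safe_markup = prepare_markup(latin1_markup, unicode_ok, source_encoding="latin-1")
```

With lxml installed, the same check could presumably be run by passing lxml.etree.XMLParser as make_parser, since lxml parsers expose feed() and close(). A real implementation would also need to work out the source encoding rather than take it as a parameter.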
This is not a great solution. I just want to see if I understand the
problem correctly. Try out this patch and see if your code works and
if you can run the test suite.

Leonard
On Tue, May 14, 2013 at 1:47 PM, Staffan Malmgren <staffan....@gmail.com> wrote:
> I have the same problem on Mac OS (10.8.3), using a pip-installed lxml, which
> I guess uses the system-wide libxml2. The results of a simple diagnostic are
> as follows:
>
>>>> diagnose("<foo>")
> Diagnostic running on Beautiful Soup 4.2.0
> Python version 3.3.1 (v3.3.1:d9893d13c628, Apr 6 2013, 11:07:11)
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
> I noticed that html5lib is not installed. Installing it may help.
> Found lxml version 3.1.2.0
>
> Trying to parse your markup with html.parser
> Here's what html.parser did with the markup:
> <foo>
> </foo>
> --------------------------------------------------------------------------------
> Trying to parse your markup with lxml
> lxml could not parse the markup.
> Traceback (most recent call last):
>   File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/diagnose.py", line 50, in diagnose
>     soup = BeautifulSoup(data, parser)
>   File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/__init__.py", line 172, in __init__
>     self._feed()
>   File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/__init__.py", line 185, in _feed
>     self.builder.feed(self.markup)
>   File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/builder/_lxml.py", line 191, in feed
>     self.parser.feed(markup)
>   File "parser.pxi", line 1104, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:88180)
> lxml.etree.ParserError: Unicode parsing is not supported on this platform
> --------------------------------------------------------------------------------
> Trying to parse your markup with ['lxml', 'xml']
> ['lxml', 'xml'] could not parse the markup.
> Traceback (most recent call last):
>   File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/diagnose.py", line 50, in diagnose
>     soup = BeautifulSoup(data, parser)
>   File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/__init__.py", line 172, in __init__
>     self._feed()
>   File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/__init__.py", line 185, in _feed
>     self.builder.feed(self.markup)
>   File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/builder/_lxml.py", line 83, in feed
>     self.parser.feed(data)
>   File "parser.pxi", line 1104, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:88180)
> lxml.etree.ParserError: Unicode parsing is not supported on this platform
> --------------------------------------------------------------------------------
>
> After trying a very simple implementation of your suggestion to re-encode
> data into a utf-8 bytestring if needed (see the attached patch, against the
> released bs4 4.2.0), I get this (expected) result instead:
>
>>>> diagnose("<foo>")
> Diagnostic running on Beautiful Soup 4.2.0
> Python version 3.3.1 (v3.3.1:d9893d13c628, Apr 6 2013, 11:07:11)
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
> I noticed that html5lib is not installed. Installing it may help.
> Found lxml version 3.1.2.0
>
> Trying to parse your markup with html.parser
> Here's what html.parser did with the markup:
> <foo>
> </foo>
> --------------------------------------------------------------------------------
> Trying to parse your markup with lxml
> Here's what lxml did with the markup:
> <html>
> <body>
> <foo>
> </foo>
> </body>
> </html>
> --------------------------------------------------------------------------------
> Trying to parse your markup with ['lxml', 'xml']
> Here's what ['lxml', 'xml'] did with the markup:
> <?xml version="1.0" encoding="utf-8"?>
>
> Hope this is of some help.
>
> Best regards,
>
> Staffan