Re: BeautifulSoup4 gives Unicode error using LXML on Python3.3, works on 3.2

Mark Grandi

Dec 12, 2012, 4:54:11 PM
to beautifulsoup
It seems that lxml just straight up doesn't work on python 3.3. Or the
feed() method doesn't at least.

Corvidae:as3Docs_2_docset markgrandi$ python3
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 01:25:11)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree as etree
>>> import io
>>> x = io.StringIO("heygirl")
>>> parser = etree.XMLParser()
>>> parser.feed(x.read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "parser.pxi", line 1105, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:87183)
lxml.etree.ParserError: Unicode parsing is not supported on this platform

I posted on the lxml mailing list but I have not heard anything back.
From the source, it seems that lxml throws this error if libxml2 cannot
find a proper encoding, so probably something that changed in
python3.3 broke that.

On Nov 28, 10:01 pm, Colin Davis <cir...@gmail.com> wrote:
> Good Evening.
>
> I'm running into an error between BeautifulSoup4 and a change in Python3.3
> that affects lxml.
>
> The LXML faq (http://lxml.de/dev/FAQ.html) states-
> In Python 3, lxml always returns Unicode strings for text and names, as
> does ElementTree. Since Python 3.3, Unicode strings that contain only ASCII
> encodable characters are generally as efficient as byte strings. In older
> versions of Python 3, the above mentioned drawbacks apply.
>
> I believe this is the cause of receiving the following error on Python3.3,
> when parsing some strings-
>
> lxml.etree.ParserError: Unicode parsing is not supported on this platform
>
> On Python3.2, parsing the same string works successfully.
>
> The relevant part of the trace appears to be-
>
>     soup = BeautifulSoup(formattedbody)
>   File "/usr/local/lib/python3.3/site-packages/bs4/__init__.py", line 172,
> in __init__
>     self._feed()
>   File "/usr/local/lib/python3.3/site-packages/bs4/__init__.py", line 185,
> in _feed
>     self.builder.feed(self.markup)
>   File "/usr/local/lib/python3.3/site-packages/bs4/builder/_lxml.py", line
> 194, in feed
>     self.parser.feed(markup)
>   File "parser.pxi", line 1105, in lxml.etree._FeedParser.feed
> (src/lxml/lxml.etree.c:87183)
> lxml.etree.ParserError: Unicode parsing is not supported on this platform
>
> Is this something that can/should be changed in BS4, or should I find a way
> to load things differently?
> [  Or stick with 3.2, I suppose ;)   ]
>
> Thank you for any thoughts,
> Colin

Thomas Kluyver

Feb 22, 2013, 11:57:38 AM
to beauti...@googlegroups.com
lxml is deliberately refusing to parse unicode strings where the XML/HTML declares a character encoding:

http://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings

leonardr

May 9, 2013, 12:28:09 PM
to beauti...@googlegroups.com


On Friday, February 22, 2013 11:57:38 AM UTC-5, Thomas Kluyver wrote:
> lxml is deliberately refusing to parse unicode strings where the XML/HTML declares a character encoding:
>
> http://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings

I can't get this error on Python 3.3.1 using lxml 3.2.0 and the BS 4.2.0 prerelease. At the end of this message are the results of running diagnostic() on some simple markup.

I'm inclined to take the error message at its word: "Unicode parsing is not supported on this platform". This might be a problem that's specific to Mac OS X.

If I can duplicate this problem I can add a workaround for it. I can pass the raw bytestring to lxml instead of running it through Unicode, Dammit. Or, in the worst case, I can run it through Unicode Dammit and then convert it to a UTF-8 bytestring. But I don't know when the problem happens. I'd appreciate it if someone who has a Mac could replicate my experiment.
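
A minimal sketch of the second fallback (decode first, then hand lxml a UTF-8 bytestring). This is not the code that went into Beautiful Soup: the helper names `to_utf8_bytes` and `parse_xml_bytes` are made up for illustration, and the `source_encoding` argument stands in for whatever Unicode, Dammit would have detected.

```python
def to_utf8_bytes(markup, source_encoding="utf-8"):
    """Return `markup` as a UTF-8 bytestring.

    `source_encoding` is an assumption supplied by the caller; a wrong guess
    raises UnicodeDecodeError instead of silently corrupting the text.
    """
    if isinstance(markup, bytes):
        markup = markup.decode(source_encoding)
    return markup.encode("utf-8")


def parse_xml_bytes(markup, source_encoding="utf-8"):
    """Feed lxml bytes only, so it never has to parse a Unicode string.

    Returns the root element, or None when lxml is not installed.
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    parser = etree.XMLParser()
    parser.feed(to_utf8_bytes(markup, source_encoding))
    return parser.close()
```

The catch, as the rest of the thread shows, is a document that declares a different encoding inside the markup itself: the bytes are then UTF-8 but the declaration says otherwise.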

Leonard

Diagnostic running on Beautiful Soup 4.2.0
Python version 3.3.1 (default, May  9 2013, 11:41:29)
[GCC 4.6.3]
I noticed that html5lib is not installed. Installing it may help.
Found lxml version 3.2.0.0

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<foo>
</foo>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <body>
  <foo>
  </foo>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with ['lxml', 'xml']
Here's what ['lxml', 'xml'] did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<foo/>

--------------------------------------------------------------------------------

Leonard Richardson

May 9, 2013, 12:32:48 PM
to beauti...@googlegroups.com
I looked at the lxml code and the error message happens if
_UNICODE_ENCODING is not set. That variable is set "to the internal
encoding name of Python unicode strings if libxml2 supports reading
native Python unicode. This depends on iconv and the local Python
installation."

So, yes, it is highly dependent on individual system setups. I'll
investigate skipping Unicode Dammit for lxml, but without being able
to duplicate the problem I can't be sure I've fixed it.

Leonard

Leonard Richardson

May 15, 2013, 11:28:26 AM
to beauti...@googlegroups.com
I've attached a patch that should solve the problem. Try it out and
let me know how it works.

On the face of it, the solution is simple: if lxml demands a
bytestring, send it a bytestring. But there are bytestrings that lxml
can't parse. In my experience, in the absence of an encoding
declaration, lxml assumes a bytestring is UTF-8. (I don't know if it
always assumes UTF-8, or if it assumes the system default encoding.)
If it's not UTF-8, lxml will raise an exception.

This patch runs a little experiment when you load the lxml
treebuilder, to see whether lxml will parse Unicode strings or not. If
not, then any bytestring you send to lxml will be converted to
Unicode, and then encoded as UTF-8.
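
For readers on Python 3, the experiment can be sketched roughly as follows. The function name `lxml_parses_unicode` is hypothetical, and returning None when lxml is missing is an addition for this example only; the calls used (`etree.XMLParser`, `feed`, `close`, `etree.ParserError`) are the same ones that appear in the tracebacks in this thread.

```python
def lxml_parses_unicode():
    """Probe whether lxml accepts a Unicode string on this system.

    Returns True or False, or None when lxml is not installed.
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    parser = etree.XMLParser(recover=True)
    try:
        # On affected systems this is the call that raises
        # "Unicode parsing is not supported on this platform".
        parser.feed("<a></a>")
        parser.close()
    except etree.ParserError:
        return False
    return True


if __name__ == "__main__":
    print("LXML will parse unicode? %s" % lxml_parses_unicode())
```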

This is not a great solution. I just want to see if I understand the
problem correctly. Try out this patch and see if your code works and
if you can run the test suite.

Leonard

On Tue, May 14, 2013 at 1:47 PM, Staffan Malmgren
<staffan....@gmail.com> wrote:
> I have the same problem on Mac OS (10.8.3), using a pip installed lxml which
> I guess uses the systemwide libxml2. The results of a simple diagnostic are
> as follows:
>
>>>> diagnose("<foo>")
> Diagnostic running on Beautiful Soup 4.2.0
> Python version 3.3.1 (v3.3.1:d9893d13c628, Apr 6 2013, 11:07:11)
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
> I noticed that html5lib is not installed. Installing it may help.
> Found lxml version 3.1.2.0
>
> Trying to parse your markup with html.parser
> Here's what html.parser did with the markup:
> <foo>
> </foo>
> --------------------------------------------------------------------------------
> Trying to parse your markup with lxml
> lxml could not parse the markup.
> Traceback (most recent call last):
> File
> "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/diagnose.py",
> line 50, in diagnose
> soup = BeautifulSoup(data, parser)
> File
> "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/__init__.py",
> line 172, in __init__
> self._feed()
> File
> "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/__init__.py",
> line 185, in _feed
> self.builder.feed(self.markup)
> File
> "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/builder/_lxml.py",
> line 191, in feed
> self.parser.feed(markup)
> File "parser.pxi", line 1104, in lxml.etree._FeedParser.feed
> (src/lxml/lxml.etree.c:88180)
> lxml.etree.ParserError: Unicode parsing is not supported on this platform
> --------------------------------------------------------------------------------
> Trying to parse your markup with ['lxml', 'xml']
> ['lxml', 'xml'] could not parse the markup.
> Traceback (most recent call last):
> File
> "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/diagnose.py",
> line 50, in diagnose
> soup = BeautifulSoup(data, parser)
> File
> "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/__init__.py",
> line 172, in __init__
> self._feed()
> File
> "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/__init__.py",
> line 185, in _feed
> self.builder.feed(self.markup)
> File
> "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/builder/_lxml.py",
> line 83, in feed
> self.parser.feed(data)
> File "parser.pxi", line 1104, in lxml.etree._FeedParser.feed
> (src/lxml/lxml.etree.c:88180)
> lxml.etree.ParserError: Unicode parsing is not supported on this platform
> --------------------------------------------------------------------------------
>
> After trying a very simple implementation of your suggestion to re-encode
> data into a utf-8 bytestring if needed (see the attached patch, against the
> released bs4 4.2.0), I get this (expected) result instead:
>
>>>> diagnose("<foo>")
> Diagnostic running on Beautiful Soup 4.2.0
> Python version 3.3.1 (v3.3.1:d9893d13c628, Apr 6 2013, 11:07:11)
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
> I noticed that html5lib is not installed. Installing it may help.
> Found lxml version 3.1.2.0
>
> Trying to parse your markup with html.parser
> Here's what html.parser did with the markup:
> <foo>
> </foo>
> --------------------------------------------------------------------------------
> Trying to parse your markup with lxml
> Here's what lxml did with the markup:
> <html>
> <body>
> <foo>
> </foo>
> </body>
> </html>
> --------------------------------------------------------------------------------
> Trying to parse your markup with ['lxml', 'xml']
> Here's what ['lxml', 'xml'] did with the markup:
> <?xml version="1.0" encoding="utf-8"?>
>
> Hope this can be of any help.
>
> Best regards,
>
> Staffan
patch

Staffan Malmgren

May 15, 2013, 1:08:26 PM
to beauti...@googlegroups.com, leon...@segfault.org
I'd love to try it out, but I cannot seem to find an attached patch?

Best regards,

Staffan

Leonard Richardson

May 15, 2013, 1:09:37 PM
to beauti...@googlegroups.com
The patch was definitely attached to my earlier message, but here it is
in plain text.

Leonard

=== modified file 'bs4/builder/__init__.py'
--- bs4/builder/__init__.py 2012-06-30 14:43:47 +0000
+++ bs4/builder/__init__.py 2013-05-15 14:17:16 +0000
@@ -6,6 +6,7 @@
ContentMetaAttributeValue,
whitespace_re
)
+from bs4.dammit import UnicodeDammit

__all__ = [
'HTMLTreeBuilder',
@@ -126,6 +127,17 @@
document_declared_encoding=None):
return markup, None, None, False

+ def _force_markup_to_unicode(self, markup, user_specified_encoding=None,
+ document_declared_encoding=None, is_html=True):
+ if isinstance(markup, unicode):
+ return markup, None, None, False
+
+ try_encodings = [user_specified_encoding, document_declared_encoding]
+ dammit = UnicodeDammit(markup, try_encodings, is_html=is_html)
+ return (dammit.markup, dammit.original_encoding,
+ dammit.declared_html_encoding,
+ dammit.contains_replacement_characters)
+
def test_fragment_to_document(self, fragment):
"""Wrap an HTML fragment to make it look like a document.


=== modified file 'bs4/builder/_htmlparser.py'
--- bs4/builder/_htmlparser.py 2013-05-07 12:19:02 +0000
+++ bs4/builder/_htmlparser.py 2013-05-15 13:38:06 +0000
@@ -132,14 +132,8 @@
declared within markup, whether any characters had to be
replaced with REPLACEMENT CHARACTER).
"""
- if isinstance(markup, unicode):
- return markup, None, None, False
-
- try_encodings = [user_specified_encoding, document_declared_encoding]
- dammit = UnicodeDammit(markup, try_encodings, is_html=True)
- return (dammit.markup, dammit.original_encoding,
- dammit.declared_html_encoding,
- dammit.contains_replacement_characters)
+ return self._force_markup_to_unicode(
+ markup, user_specified_encoding, document_declared_encoding)

def feed(self, markup):
args, kwargs = self.parser_args

=== modified file 'bs4/builder/_lxml.py'
--- bs4/builder/_lxml.py 2013-05-09 19:36:30 +0000
+++ bs4/builder/_lxml.py 2013-05-15 14:27:57 +0000
@@ -17,6 +17,16 @@
XML)
from bs4.dammit import UnicodeDammit

+# Try an experiment to see whether lxml will parse a Unicode string on
+# this system.
+LXML_WILL_PARSE_UNICODE = True
+_parser = etree.XMLParser(target=None, recover=True)
+try:
+ _parser.feed(u"<a></a>")
+except etree.ParserError:
+ LXML_WILL_PARSE_UNICODE = False
+print "LXML will parse unicode? %s" % LXML_WILL_PARSE_UNICODE
+
LXML = 'lxml'

class LXMLTreeBuilderForXML(TreeBuilder):
@@ -63,17 +73,34 @@
def prepare_markup(self, markup, user_specified_encoding=None,
document_declared_encoding=None):
"""
- :return: A 3-tuple (markup, original encoding, encoding
- declared within markup).
+ :return: A 4-tuple (markup, original encoding, encoding
+ declared within markup, whether any characters had to be
+ replaced with REPLACEMENT CHARACTER).
"""
- if isinstance(markup, unicode):
- return markup, None, None, False
-
- try_encodings = [user_specified_encoding, document_declared_encoding]
- dammit = UnicodeDammit(markup, try_encodings, is_html=True)
- return (dammit.markup, dammit.original_encoding,
- dammit.declared_html_encoding,
- dammit.contains_replacement_characters)
+ return markup, None, None, False
+ if LXML_WILL_PARSE_UNICODE:
+ return self._force_markup_to_unicode(
+ markup, user_specified_encoding, document_declared_encoding,
+ False)
+ else:
+ # lxml will not parse a Unicode string as XML on this
+ # computer. We must give it a bytestring.
+ if isinstance(markup, unicode):
+ # Encode the string as UTF-8 so lxml will parse it.
+ return markup.encode("utf8"), None, None, False
+ else:
+ # We have a bytestring, and lxml wants a bytestring,
+ # but lxml won't just take _any_ bytestring. In my
+ # tests, it needed a well-formed UTF-8
+ # bytestring. Convert the bytestring to Unicode and
+ # then encode it as UTF-8.
+ (markup, original_encoding, declared_html_encoding,
+ contains_replacement_characters) = self._force_markup_to_unicode(
+ markup, user_specified_encoding, document_declared_encoding,
+ False)
+ markup = markup.encode("utf8")
+ return (markup, original_encoding, declared_html_encoding,
+ contains_replacement_characters)

def feed(self, markup):
if isinstance(markup, bytes):
@@ -84,10 +111,10 @@
# or the parser won't be initialized.
data = markup.read(self.CHUNK_SIZE)
self.parser.feed(data)
- while data != '':
+ while data not in (u'', b''):
# Now call feed() on the rest of the data, chunk by chunk.
data = markup.read(self.CHUNK_SIZE)
- if data != '':
+ if data not in (u'', b''):
self.parser.feed(data)
self.parser.close()

@@ -190,6 +217,17 @@
def default_parser(self):
return etree.HTMLParser

+ def prepare_markup(self, markup, user_specified_encoding=None,
+ document_declared_encoding=None):
+ """
+ :return: A 4-tuple (markup, original encoding, encoding
+ declared within markup, whether any characters had to be
+ replaced with REPLACEMENT CHARACTER).
+ """
+ # lxml's HTML parser will always parse Unicode.
+ return self._force_markup_to_unicode(
+ markup, user_specified_encoding, document_declared_encoding)
+
def feed(self, markup):
self.parser.feed(markup)
self.parser.close()

Staffan Malmgren

May 15, 2013, 3:04:25 PM
to beauti...@googlegroups.com, leon...@segfault.org
Thank you! I applied the patch on a clean distribution of beautifulsoup 4.2.0, ran python setup.py install (since the problem manifests only on python 3.3, I had to make sure 2to3 was applied to the source tree), then attempted to run the test suite. It did not work right away, but after some light editing I was able to get all tests but one to pass. Since my edits were done on a source tree that had been processed by 2to3, I can't provide any meaningful diffs, but I'll just paste in the error log and what I did to make the tests pass:

1st run:

    (ferenda-py33)[staffan@mba beautifulsoup4-4.2.0]$ cd lib/python3.3/site-packages/bs4/
    (ferenda-py33)[staffan@mba bs4]$ python -m unittest discover -v -f tests
    LXML will parse unicode? False
    test_beautifulsoup_constructor_does_lookup (test_builder_registry.BuiltInRegistryTest) ... ERROR
    
    ======================================================================
    ERROR: test_beautifulsoup_constructor_does_lookup (test_builder_registry.BuiltInRegistryTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/tests/test_builder_registry.py", line 71, in test_beautifulsoup_constructor_does_lookup
        BeautifulSoup("", features="html")
      File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/__init__.py", line 172, in __init__
        self._feed()
      File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/__init__.py", line 185, in _feed
        self.builder.feed(self.markup)
      File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/builder/_lxml.py", line 229, in feed
        self.parser.feed(markup)
      File "parser.pxi", line 1104, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:88180)
    lxml.etree.ParserError: Unicode parsing is not supported on this platform
    
    ----------------------------------------------------------------------
    Ran 1 test in 0.002s
    
    FAILED (errors=1)

Two problems here: 

- LXMLTreeBuilder.prepare_markup converts to unicode, but this doesn't work either (i.e. the problem is not just in lxml's XML parser) -- I just removed the prepare_markup method and let the superclass (LXMLTreeBuilderForXML) handle it
- LXMLTreeBuilderForXML.prepare_markup starts off with "return markup, None, None, False", so the LXML_WILL_PARSE_UNICODE testing and converting code is never run -- I just commented out this return statement.

2nd run:

    (ferenda-py33)[staffan@mba bs4]$ python -m unittest discover -v -f tests
    LXML will parse unicode? False
[... passing tests skipped ...]    
    ======================================================================
    ERROR: test_beautifulstonesoup_is_xml_parser (test_lxml.LXMLTreeBuilderSmokeTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/tests/test_lxml.py", line 61, in test_beautifulstonesoup_is_xml_parser
        soup = BeautifulStoneSoup("<b />")
      File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/__init__.py", line 350, in __init__
        super(BeautifulStoneSoup, self).__init__(*args, **kwargs)
      File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/__init__.py", line 172, in __init__
        self._feed()
      File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/__init__.py", line 185, in _feed
        self.builder.feed(self.markup)
      File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/builder/_lxml.py", line 109, in feed
        data = markup.read(self.CHUNK_SIZE)
    AttributeError: 'bytes' object has no attribute 'read'
    
    ----------------------------------------------------------------------
    Ran 58 tests in 0.078s
    
    FAILED (errors=1)


In these cases, markup is a bytes object, so LXMLTreeBuilderForXML.feed doesn't wrap it like it would if markup was a (python 3) str -- I imported BytesIO from io and added the following just after the initial isinstance test in LXMLTreeBuilderForXML.feed:
        elif isinstance(markup, bytes):
            markup = BytesIO(markup)
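
The chunked feeding that this fix restores can be sketched in isolation. This is a standalone illustration, not the bs4 method: `iter_chunks` and its `chunk_size` default are invented names, and testing `not data` covers both the empty str and the empty bytes case that the patch handles with `data not in (u'', b'')`.

```python
import io


def iter_chunks(markup, chunk_size=512):
    """Yield `markup` piece by piece, whether it is str, bytes, or file-like."""
    if isinstance(markup, str):
        markup = io.StringIO(markup)
    elif isinstance(markup, (bytes, bytearray)):
        # Without this branch a bytes object reaches .read() and fails with
        # AttributeError: 'bytes' object has no attribute 'read'.
        markup = io.BytesIO(markup)
    while True:
        data = markup.read(chunk_size)
        if not data:  # '' for text streams, b'' for byte streams
            break
        yield data
```

Each yielded chunk would then be passed to the parser's feed() method, followed by a single close() at the end.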

3rd run:

    (ferenda-py33)[staffan@mba bs4]$ python -m unittest discover -v tests
    LXML will parse unicode? False
[... 327 passing tests skipped ...]
    
    ======================================================================
    FAIL: test_real_iso_latin_document (test_lxml.LXMLTreeBuilderSmokeTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/testing.py", line 361, in test_real_iso_latin_document
        self.assertEqual(result, expected)
    AssertionError: b'<html><head><meta content="text/html; charset=utf-8" http-equiv="Content-type"/></head><body><p>Sacr\xc3\x83\xc2\xa9 bleu!</p></body></html>' != b'<html><head><meta content="text/html; charset=utf-8" http-equiv="Content-type"/></head><body><p>Sacr\xc3\xa9 bleu!</p></body></html>'
    
    ----------------------------------------------------------------------
    Ran 328 tests in 0.613s
    
    FAILED (failures=1)

This last error was beyond my capabilities to fix.

Best regards,

Staffan

Leonard Richardson

May 15, 2013, 4:00:28 PM
to beauti...@googlegroups.com
> ======================================================================
> FAIL: test_real_iso_latin_document (test_lxml.LXMLTreeBuilderSmokeTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
> File
> "/Users/staffan/virtualenvs/ferenda-py33/lib/python3.3/site-packages/bs4/testing.py",
> line 361, in test_real_iso_latin_document
> self.assertEqual(result, expected)
> AssertionError: b'<html><head><meta content="text/html; charset=utf-8"
> http-equiv="Content-type"/></head><body><p>Sacr\xc3\x83\xc2\xa9
> bleu!</p></body></html>' != b'<html><head><meta content="text/html;
> charset=utf-8" http-equiv="Content-type"/></head><body><p>Sacr\xc3\xa9
> bleu!</p></body></html>'
>
> ----------------------------------------------------------------------
> Ran 328 tests in 0.613s
>
> FAILED (failures=1)
>
> This last error was beyond my capabilities to fix.

And that's the one that's the problem. That's an ISO-Latin-1 document
that has a <meta> tag saying it's ISO-Latin-1. Unicode, Dammit goes
through the document, figures out that it's in ISO-Latin-1, and
converts the whole thing to Unicode.

At this point the contents of the <meta> tag are inaccurate, because a
Unicode string is not in any particular encoding. My philosophy is to
deal with the problem later.

So at that point I feed the Unicode into a parser. As the document is
turned into a document object model, Beautiful Soup recognizes that
there's a <meta> tag that defines an encoding. It creates an internal
representation of a <meta> tag whose "encoding" attribute is a
placeholder value. At that point the inaccuracy is resolved. The
Python object representing the <meta> tag no longer says that the
document is in any particular encoding. When the document is encoded,
the placeholder value is replaced with the actual encoding.

In the newer versions of lxml, I'm not allowed to deal with the
problem later. I have to send lxml a bytestring. But lxml is less
lenient than Unicode, Dammit. If it doesn't like the bytestring I give
it, it will raise an exception.

And at this point we come to the failed test. Running the markup
through Unicode, Dammit and then encoding it to UTF-8 won't solve the
problem when the document contains a <meta> tag claiming that the
encoding is Latin-1. The failed test means that lxml will parse that
UTF-8 bytestring as Latin-1, which is completely wrong.
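
The byte sequence in the failed assertion can be reproduced with nothing but the standard codecs, which makes the double encoding easier to see. This is only a demonstration of the mechanism, not a trace of what lxml does internally.

```python
original = "Sacr\u00e9 bleu!"

# Step 1: the text is correctly encoded as UTF-8 (e-acute becomes two bytes).
utf8_bytes = original.encode("utf-8")

# Step 2: a parser trusting the <meta> tag reads those bytes as Latin-1,
# so each of the two bytes becomes its own character.
misread = utf8_bytes.decode("iso-8859-1")

# Step 3: encoding the misread text as UTF-8 again doubles the damage.
double_encoded = misread.encode("utf-8")

print(utf8_bytes)      # b'Sacr\xc3\xa9 bleu!'
print(double_encoded)  # b'Sacr\xc3\x83\xc2\xa9 bleu!'
```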

In real life there are some UTF-8 documents whose <meta> tags claim
they are Latin-1 documents. It's hard to separate Beautiful Soup's
behavior from lxml's behavior just by looking at test results, but I
believe lxml will parse those documents, incorrectly, as Latin-1.

If I could change the encoding to match what the <meta> tag says, lxml
would do the right thing. And I do have code for "sniffing" the <meta>
encoding without parsing the document. But at that point it's a huge
tower of hacks.
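
To give a feel for what such a hack might look like, here is a rough sketch. It is not Beautiful Soup's sniffing code: the regular expressions, `sniff_declared_encoding`, and `encode_as_declared` are all assumptions made for this example, and real-world markup has many more ways to declare an encoding.

```python
import re

# Deliberately simplified patterns; real declarations have more variants.
_XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._\-]+)["\']')
_META_CHARSET = re.compile(
    rb'<\s*meta[^>]+?charset\s*=\s*["\']?([A-Za-z0-9._\-]+)', re.IGNORECASE)


def sniff_declared_encoding(data):
    """Return the encoding named in the markup (lowercased), or None."""
    for pattern in (_XML_DECL, _META_CHARSET):
        match = pattern.search(data)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


def encode_as_declared(text, fallback="utf-8"):
    """Encode Unicode `text` so the bytes agree with its own declaration.

    Returns (bytes, encoding_used). When the declared encoding is unknown or
    cannot represent the text, this falls back to `fallback`, at which point
    the declaration and the bytes disagree again.
    """
    declared = sniff_declared_encoding(text.encode("ascii", "replace"))
    if declared is not None:
        try:
            return text.encode(declared), declared
        except (LookupError, UnicodeEncodeError):
            pass
    return text.encode(fallback), fallback
```

The fallback branch is where the approach gets shaky: it only helps when the declared encoding is both known and able to represent every character in the document.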

So I think we're headed towards a world in which lxml has its own,
less lenient equivalent of Unicode, Dammit. It won't allow a document
whose encoding is ambiguous to temporarily exist in that ambiguous
state. If your document doesn't meet lxml's standards, you'll need to
use another parser backend.

At any rate, the problem is complicated enough that I can't fix it
until it starts affecting one of my computers. I've filed a bug:

https://bugs.launchpad.net/beautifulsoup/+bug/1180527

That's all I can do for now.

Leonard

Staffan Malmgren

May 17, 2013, 6:11:11 AM
to beauti...@googlegroups.com, leon...@segfault.org
Thank you for the thorough explanation! I did go ahead and try to extend prepare_markup to encode the unicode string to the appropriate encoding, based on what Unicode, Dammit reported as the declared encoding, and got that last test case to work as well (on python 2.7 and 3.3). However, the code did indeed start to look like a tower of hacks. Even so, if you're interested, I'd be happy to clean it up and send you a patch.

Meanwhile, I managed to solve my particular problem in another way, by telling the lxml setup code to download, build and statically link libxml2 instead of relying on the MacOS supplied libs:

STATIC_DEPS=true pip install lxml

(Perhaps this could be added to the "Problems after installation" section of the docs?). This works well enough for me right now, and if I need to work around it in code, I guess I could create my own subclass of LXMLTreeBuilder with the required bytestring conversion.

Again, thank you very much for the help!

Best regards,

Staffan

Leonard Richardson

May 31, 2013, 9:36:48 AM
to beauti...@googlegroups.com
I've created an experimental branch that should solve this problem
while improving performance.

https://code.launchpad.net/~leonardr/beautifulsoup/let-lxml-handle-encoding

This branch splits Unicode, Dammit into two parts. One part comes up
with guesses as to a document's encoding, and the other part uses
those guesses to convert a bytestring to Unicode.

The lxml tree builder only uses the first part. Instead of trying to
feed Unicode to the lxml tree builder, it takes a possible encoding,
creates an lxml parser designed to parse a bytestring using that
encoding, and feeds the bytestring to the lxml parser. If the parser
raises an exception, it creates another parser using the next possible
encoding.
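
The try-each-guess loop can be illustrated without lxml. The sketch below swaps in the standard library's `xml.etree.ElementTree.XMLParser`, whose `encoding` argument overrides whatever the document declares; it is a stand-in for the technique, not the code in the branch, and `parse_with_candidates` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def parse_with_candidates(data, candidate_encodings):
    """Parse a bytestring, trying one fresh parser per candidate encoding.

    Returns (encoding, root) for the first candidate the parser accepts.
    """
    for encoding in candidate_encodings:
        # A new parser each time: a parser that has failed cannot be reused.
        parser = ET.XMLParser(encoding=encoding)
        try:
            parser.feed(data)
            root = parser.close()
        except (ET.ParseError, LookupError, ValueError):
            continue  # this guess was wrong, move on to the next one
        return encoding, root
    raise ValueError("no candidate encoding produced a parseable document")


if __name__ == "__main__":
    # b"\xe9" is not valid UTF-8 here, so the first guess is rejected.
    encoding, root = parse_with_candidates(
        b"<p>Sacr\xe9 bleu!</p>", ["utf-8", "iso-8859-1"])
    print(encoding, root.text)
```

The order of the candidates matters: a permissive single-byte encoding such as ISO-8859-1 accepts almost any byte sequence, so it belongs near the end of the list.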

Since lxml uses C code to decode the document to Unicode, this
improves performance. In this branch, running a seven-megabyte file
through the lxml treebuilder took half the time it takes with
Beautiful Soup 4.2.0.

I've written a test for lxml's nightmare scenario: a Unicode
bytestring describing an XML document whose declared encoding is
Shift-JIS but whose body contains a character that can't be
represented in Shift-JIS. The test passes on my computer. But since I
still don't have access to a Mac, I would like people who've
encountered the "Unicode parsing is not supported" problem to pull
this branch, and run the test suite on their computers to see if
everything works. I can build a tarball if that would be easier for
you.

This branch will not be going directly into Beautiful Soup. I changed
the tree builder interface, and I refactored a lot of Unicode, Dammit
code that doesn't have tests. But it would be really helpful to know
that this technique works.

Leonard

Staffan Malmgren

Jun 1, 2013, 3:25:18 PM
to beauti...@googlegroups.com
Wonderful!

Unfortunately, I still encounter the "Unicode parsing is not supported" error when running the test suite:

----------------------------------------------------------------------
E.....................................................EEEE../bs4/__init__.py:347: UserWarning: The BeautifulStoneSoup class is deprecated. Instead of using it, pass features="xml" into the BeautifulSoup constructor.
  'The BeautifulStoneSoup class is deprecated. Instead of using '
EEEEE.EEEEEEEEEE.EEEEEEEEEEEE..E.EEEEEEEEEEE.EEEE.E.E..........................................................................................................................................EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE........EE.E......E.................E................
======================================================================
ERROR: test_beautifulsoup_constructor_does_lookup (tests.test_builder_registry.BuiltInRegistryTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/staffan/wds/let-lxml-handle-encoding/py3k/bs4/tests/test_builder_registry.py", line 71, in test_beautifulsoup_constructor_does_lookup
    BeautifulSoup("", features="html")
  File "./bs4/__init__.py", line 170, in __init__
    self._feed()
  File "./bs4/__init__.py", line 184, in _feed
    self.builder.feed(self.markup)
  File "./bs4/builder/_lxml.py", line 225, in feed
    self.parser.feed(markup)
  File "parser.pxi", line 1127, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:89864)
lxml.etree.ParserError: Unicode parsing is not supported on this platform
----------------------------------------------------------------------

Similarly, this is the output of diagnose:

----------------------------------------------------------------------
Diagnostic running on Beautiful Soup 4.3.0
Python version 3.3.2 (default, May 29 2013, 09:47:32) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))]
I noticed that html5lib is not installed. Installing it may help.
Found lxml version 3.2.1.0

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<foo>
</foo>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
lxml could not parse the markup.
Traceback (most recent call last):
  File "./bs4/diagnose.py", line 53, in diagnose
    soup = BeautifulSoup(data, parser)
  File "./bs4/__init__.py", line 170, in __init__
    self._feed()
  File "./bs4/__init__.py", line 184, in _feed
    self.builder.feed(self.markup)
  File "./bs4/builder/_lxml.py", line 230, in feed
    self.parser.feed(markup)
  File "parser.pxi", line 1127, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:89864)
lxml.etree.ParserError: Unicode parsing is not supported on this platform
--------------------------------------------------------------------------------
Trying to parse your markup with ['lxml', 'xml']
['lxml', 'xml'] could not parse the markup.
Traceback (most recent call last):
  File "./bs4/diagnose.py", line 53, in diagnose
    soup = BeautifulSoup(data, parser)
  File "./bs4/__init__.py", line 170, in __init__
    self._feed()
  File "./bs4/__init__.py", line 184, in _feed
    self.builder.feed(self.markup)
  File "./bs4/builder/_lxml.py", line 118, in feed
    self.parser.feed(data)
  File "parser.pxi", line 1127, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:89864)
lxml.etree.ParserError: Unicode parsing is not supported on this platform
--------------------------------------------------------------------------------

Looking at the code with some simple printf debugging, it seems to me that LXMLTreeBuilderForXML.prepare_markup starts off by yielding a tuple where markup is unicode, which results in the "Unicode parsing is not supported" exception, but is then never called again, so the code path that calls EncodingDetector is never reached.

Best regards,

Staffan




Leonard Richardson

Jun 2, 2013, 1:47:25 PM
to beauti...@googlegroups.com
> Looking at the code with some simple printf debugging, it seems to me that
> LXMLTreeBuilderForXML.prepare_markup starts off by yielding a tuple where
> markup is unicode, which results in the "Unicode parsing is not supported"
> exception, but is then never called again, so the code path that calls
> EncodingDetector is never reached.

OK, try it again. I made etree.ParseError the kind of exception that
tells Beautiful Soup to call prepare_markup again.
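
The control flow being described, where a rejected attempt leads to the next candidate from prepare_markup, might look roughly like this. Everything here is a stand-in written for illustration: `MarkupRejected`, `feed_with_retry`, the candidate list, and the toy `bytes_only_parser` that imitates a platform refusing Unicode input are not the names or code used in the branch.

```python
class MarkupRejected(Exception):
    """Hypothetical signal that the parser refused one candidate."""


def prepare_markup(markup, candidate_encodings=("utf-8", "iso-8859-1")):
    """Yield (markup, encoding) attempts, most optimistic first."""
    if isinstance(markup, str):
        yield markup, None               # first try: hand over Unicode as-is
        markup = markup.encode("utf-8")  # afterwards, only offer bytes
    for encoding in candidate_encodings:
        yield markup, encoding


def feed_with_retry(markup, parse):
    """Call `parse` on each candidate until one is accepted."""
    for candidate, encoding in prepare_markup(markup):
        try:
            return parse(candidate, encoding)
        except MarkupRejected:
            continue  # ask prepare_markup for the next attempt
    raise MarkupRejected("no candidate was accepted")


def bytes_only_parser(candidate, encoding):
    """Toy parser that behaves like the affected systems in this thread."""
    if isinstance(candidate, str):
        raise MarkupRejected("Unicode parsing is not supported on this platform")
    try:
        return candidate.decode(encoding)
    except UnicodeDecodeError as exc:
        raise MarkupRejected(str(exc))
```

With this shape, every tree builder's feed() has to translate the parser's own exception into the retry signal, otherwise the loop never gets a chance to continue.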

Leonard

Staffan Malmgren

Jun 2, 2013, 2:19:12 PM
to beauti...@googlegroups.com
Great!

The new revision works well with LXMLTreeBuilderForXML, but gives the same error for LXMLTreeBuilder. I took a look at the differences, and I noticed that LXMLTreeBuilder.feed didn't catch etree.ParseError like LXMLTreeBuilderForXML.feed did. 

After changing this, the EncodingDetector seems to be called and almost everything works... except for one test, which I guess could be a side effect of letting lxml handle the decoding? For what it's worth, the same test fails on python 3.2, which does not have the same "Unicode parsing is not supported" issue (and seeing as the test uses a byte string, not a unicode string, maybe that is not so surprising).

OK, conversion is done.
Now running the unit tests.
............................................................/bs4/__init__.py:347: UserWarning: The BeautifulStoneSoup class is deprecated. Instead of using it, pass features="xml" into the BeautifulSoup constructor.
  'The BeautifulStoneSoup class is deprecated. Instead of using '
..................................E..................................................................................................................................................................................................................................................
======================================================================
ERROR: test_smart_quotes_converted_on_the_way_in (tests.test_lxml.LXMLTreeBuilderSmokeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./bs4/testing.py", line 335, in test_smart_quotes_converted_on_the_way_in
    soup = self.soup(quote)
  File "./bs4/testing.py", line 29, in soup
    return BeautifulSoup(markup, builder=builder, **kwargs)
  File "./bs4/__init__.py", line 170, in __init__
    self._feed()
  File "./bs4/__init__.py", line 184, in _feed
    self.builder.feed(self.markup)
  File "./bs4/builder/_lxml.py", line 226, in feed
    self.parser.close()
  File "parser.pxi", line 1209, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:90597)
  File "parsertarget.pxi", line 136, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:99900)
  File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:99807)
  File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:9383)
TypeError: function takes exactly 5 arguments (1 given)

----------------------------------------------------------------------
Ran 336 tests in 0.447s

FAILED (errors=1)

Please let me know if there is any other diagnostic or test I could run.

Best regards,

Staffan

