UnicodeDecodeError when calling render()

anatoly techtonik

Dec 23, 2012, 8:29:43 PM
to gen...@googlegroups.com
The following code fails with a UnicodeDecodeError, and I am completely puzzled about what it wants.

import urllib2

import genshi.input

# iURL holds the address of the page being parsed (defined elsewhere).
mbt_file = urllib2.urlopen(iURL)
mbt_genshi = genshi.input.HTMLParser(mbt_file)
parsed = mbt_genshi.parse()
parsed.select("head").render()

The full traceback:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    parsed.select("head").render()
  File "/usr/lib/pymodules/python2.6/genshi/core.py", line 183, in render
    return encode(generator, method=method, encoding=encoding, out=out)
  File "/usr/lib/pymodules/python2.6/genshi/output.py", line 57, in encode
    return _encode(''.join(list(iterator)))
  File "/usr/lib/pymodules/python2.6/genshi/output.py", line 223, in __call__
    for kind, data, pos in stream:
  File "/usr/lib/pymodules/python2.6/genshi/output.py", line 670, in __call__
    for kind, data, pos in stream:
  File "/usr/lib/pymodules/python2.6/genshi/output.py", line 771, in __call__
    for kind, data, pos in chain(stream, [(None, None, None)]):
  File "/usr/lib/pymodules/python2.6/genshi/output.py", line 586, in __call__
    for ev in stream:
  File "/usr/lib/pymodules/python2.6/genshi/core.py", line 288, in _ensure
    for event in stream:
  File "/usr/lib/pymodules/python2.6/genshi/path.py", line 581, in _generate
    for event in stream:
  File "/usr/lib/pymodules/python2.6/genshi/core.py", line 288, in _ensure
    for event in stream:
  File "/usr/lib/pymodules/python2.6/genshi/input.py", line 432, in _coalesce
    for kind, data, pos in chain(stream, [(None, None, None)]):
  File "/usr/lib/pymodules/python2.6/genshi/input.py", line 327, in _generate
    self.feed(data)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 252, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.6/HTMLParser.py", line 390, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.6/re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 9: ordinal not in range(128)


What is this error about?

Simon Cross

Dec 24, 2012, 3:14:04 AM
to gen...@googlegroups.com
Hi Anatoly

Could you try to construct a minimal test case?

Schiavo
Simon

anatoly techtonik

Dec 24, 2012, 3:37:27 AM
to genshi
Actually, the script I pasted is a minimal test case. =)

Simon Cross

Dec 24, 2012, 3:43:51 AM
to gen...@googlegroups.com
On Mon, Dec 24, 2012 at 10:37 AM, anatoly techtonik <tech...@gmail.com> wrote:
> Actually, the code script pasted is a minimal test case. =)

It references a giant blob of HTML.

anatoly techtonik

Dec 24, 2012, 4:45:41 AM
to genshi
I experimented with the encoding a bit, and it boiled down to http://genshi.edgewall.org/ticket/375, so I think that ticket is more important.
-- 
anatoly t.

Simon Cross

Dec 24, 2012, 4:58:50 AM
to gen...@googlegroups.com
On Mon, Dec 24, 2012 at 11:45 AM, anatoly techtonik <tech...@gmail.com> wrote:
> I experimented with encoding a bit and it boiled down to
> http://genshi.edgewall.org/ticket/375 so I think it is more important.

I closed that ticket as wontfix -- cleaning up HTML seems outside of
Genshi's scope and in any case it's not clear why Genshi would do a
better job than a dedicated tool.

Schiavo
Simon

anatoly techtonik

Dec 24, 2012, 5:57:39 AM
to genshi
Genshi has an HTML parser, so if the parser cannot handle HTML that is accepted and correctly rendered by at least three top browsers, Genshi is of little use. I used it because Genshi comes bundled with Trac, and it is used in a plugin that replaces links like "issue #423" for specific repositories with references to an external tracker.

If I can't use Genshi for parsing HTML, then I can't see the benefit of the XML-based complications in Trac's templating layer over the familiar Django and Jinja style.

Eli Stevens (Gmail)

Dec 24, 2012, 1:01:14 PM
to gen...@googlegroups.com
On Mon, Dec 24, 2012 at 2:57 AM, anatoly techtonik <tech...@gmail.com> wrote:
> Genshi has an HTML parser, so if parser can not handle HTML that is accepted and correctly rendered by at least three top browsers,

Just to chime in, I've had to deal with the difference between correct HTML and the HTML that will be rendered "correctly" by browsers previously, and the difference between the two is huge.  The amount of "garbage in, what you probably wanted out" is staggering.  I don't know if this is still true, but at the time even tools like Beautiful Soup couldn't properly parse a Google search result page, much less a tool that expected properly formed markup.  Expecting Genshi to replicate all of the cleanup code present in a browser doesn't make sense, IMO.

What we ended up doing was use the browser to parse the pages we were interested in, then use them to save an HTML version of the DOM.  Since the browser was just serializing the in-memory DOM, it was syntactically correct.  This was before the days of tools like PhantomJS, so it would probably be even easier now.

Eli

anatoly techtonik

Dec 24, 2012, 11:35:26 PM
to genshi
On Mon, Dec 24, 2012 at 9:01 PM, Eli Stevens (Gmail) <wicke...@gmail.com> wrote:
> On Mon, Dec 24, 2012 at 2:57 AM, anatoly techtonik <tech...@gmail.com> wrote:
>> Genshi has an HTML parser, so if parser can not handle HTML that is accepted and correctly rendered by at least three top browsers,
>
> Just to chime in, I've had to deal with the difference between correct HTML and the HTML that will be rendered "correctly" by browsers previously, and the difference between the two is huge.  The amount of "garbage in, what you probably wanted out" is staggering.  I don't know if this is still true, but at the time even tools like Beautiful Soup couldn't properly parse a Google search result page, much less a tool that expected properly formed markup.  Expecting Genshi to replicate all of the cleanup code present in a browser doesn't make sense, IMO.

The HTML5 standard actually describes all the cleanup procedures (see http://ejohn.org/blog/html-5-parsing/), so maybe Genshi should implement an HTML5Parser using http://code.google.com/p/html5lib/ and patch its existing HTMLParser to have an optional fallback mechanism?
 
> What we ended up doing was use the browser to parse the pages we were interested in, then use them to save an HTML version of the DOM.  Since the browser was just serializing the in-memory DOM, it was syntactically correct.  This was before the days of tools like PhantomJS, so it would probably be even easier now.

Yes, the tools have evolved. =)

Simon Cross

Dec 25, 2012, 1:36:33 AM
to gen...@googlegroups.com
On Mon, Dec 24, 2012 at 11:45 AM, anatoly techtonik <tech...@gmail.com> wrote:
> I experimented with encoding a bit and it boiled down to
> http://genshi.edgewall.org/ticket/375 so I think it is more important.

Your original example doesn't boil down to #375. It's an encoding
issue. Genshi trunk raises:

UnicodeError: source returned bytes, but no encoding specified

and setting "encoding='latin-1'" in the construction of HTMLParser
causes your example to work.

The attached patch to Genshi 0.6.x makes the behaviour there similar.
I haven't applied it to the 0.6.x branch yet because I still need to
think through all the ramifications.

Schiavo
Simon
use-input-encoding.diff
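
For readers following along, here is a minimal sketch of the check that the trunk error message implies: byte input is decoded only when the caller names an encoding, and nothing is guessed. `ensure_text` is a hypothetical helper written for illustration, not Genshi's code, and the sample markup is made up.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper (an assumption for illustration, not part of Genshi):
# decode byte input only when an encoding is given, otherwise refuse.


def ensure_text(data, encoding=None):
    """Return `data` as text, decoding byte strings with `encoding`."""
    if isinstance(data, bytes):  # on Python 2, `bytes` is an alias for `str`
        if encoding is None:
            raise UnicodeError(
                "source returned bytes, but no encoding specified")
        return data.decode(encoding)
    return data


# Made-up sample: UTF-8 bytes for one Cyrillic letter inside an attribute.
raw = b'<a title="\xd0\x9f&amp;">x</a>'

# With an explicit encoding the bytes become text before any parsing happens.
text = ensure_text(raw, encoding="utf-8")

# Without one, the helper fails loudly instead of falling back to ASCII.
try:
    ensure_text(raw)
    error_message = None
except UnicodeError as exc:
    error_message = str(exc)
```

With Genshi itself, the corresponding step is the `encoding` argument passed when constructing HTMLParser, as described above.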

anatoly techtonik

Dec 25, 2012, 3:23:04 AM
to genshi
But the downloaded content is 'utf-8', the page meta tag specifies 'utf-8', and the server header specifies 'utf-8' as well. And the undocumented encoding parameter in the HTMLParser constructor seems to default to 'utf-8' as well. Is it a problem with urlopen autoconverting to 'latin-1'?
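
One way to check what the bytes really are, sketched with the standard library only. The header value and the sample bytes below are assumptions made up for illustration, not data from the page in question. The sketch relies on two facts: a 'latin-1' decode can never fail, because every byte maps to a code point, so "works with latin-1" does not show that the content is Latin-1; and 0xd0, the byte named in the traceback, is a common lead byte for Cyrillic letters in UTF-8.

```python
# -*- coding: utf-8 -*-
from email.message import Message

# Hypothetical Content-Type header, standing in for the server's response.
header = Message()
header["Content-Type"] = "text/html; charset=utf-8"
declared = header.get_content_charset()  # charset parameter, lowercased

# Made-up sample: three Cyrillic letters encoded as UTF-8 (six bytes).
sample = b"\xd0\x9f\xd1\x80\xd0\xb8"

# Decoding with the declared charset gives three readable characters.
as_utf8 = sample.decode(declared)

# Latin-1 also "succeeds", but yields six unrelated characters.
as_latin1 = sample.decode("latin-1")

# ASCII fails on the very first byte, 0xd0, as in the traceback.
try:
    sample.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
```

If the decode with the declared charset succeeds and the result is readable, that charset is very likely the right one to pass to the parser.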

anatoly techtonik

Dec 25, 2012, 3:24:29 AM
to genshi
I am using Python 2. No bytes.