Re: [sphinx-users] SphinxError: Can't decode unicode within a doc

gilberto dos santos alves

Apr 17, 2013, 8:58:07 PM

to sphinx...@googlegroups.com

are you using
# -*- coding: utf-8 -*-

for your python files? see url [1]

url [1] http://www.python.org/dev/peps/pep-0263/
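For reference, a minimal sketch of what that declaration affects, assuming Python 3. The standard library function tokenize.detect_encoding reads the PEP 263 cookie from the first two lines of a Python source file. The sample source bytes are made up for illustration, and this only concerns .py files, not rst input.

```python
import io
import tokenize

# Hypothetical two-line Python source, held in memory as UTF-8 bytes.
source = b"# -*- coding: utf-8 -*-\nname = 'Gr\xc3\xbc\xc3\x9fe'\n"

# detect_encoding() looks for a PEP 263 cookie in the first two lines.
encoding, first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)

# The declared encoding is what turns the file's bytes into text.
text = source.decode(encoding)
print(text.splitlines()[1])
```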

2013/4/17 Conway M <c0ld...@gmail.com>

I am trying to compile the docs of Pandas but I am unable to get Sphinx to compile a document with some unicode. Is there some flag I need to specify to let Sphinx correctly build documents with unicode in them? In this case, I don't want Sphinx to decode the text.

.. _io.unicode:

Dealing with Unicode Data
~~~~~~~~~~~~~~~~~~~~~~~~~

The ``encoding`` argument should be used for encoded unicode data, which will
result in byte strings being decoded to unicode in the result:

.. ipython:: python

   data = 'word,length\nTr\xe4umen,7\nGr\xfc\xdfe,5'
   df = pd.read_csv(StringIO(data), encoding='latin-1')
   df
   df['word'][1]

Some formats which encode all characters as multiple bytes, like UTF-16, won't
parse correctly at all without specifying the encoding.

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/sphinx/cmdline.py", line 247, in main
    app.build(force_all, filenames)
  File "/usr/local/lib/python2.7/dist-packages/sphinx/application.py", line 211, in build
    self.builder.build_update()
  File "/usr/local/lib/python2.7/dist-packages/sphinx/builders/__init__.py", line 211, in build_update
    'out of date' % len(to_build))
  File "/usr/local/lib/python2.7/dist-packages/sphinx/builders/__init__.py", line 231, in build
    purple, length):
  File "/usr/local/lib/python2.7/dist-packages/sphinx/builders/__init__.py", line 131, in status_iterator
    for item in iterable:
  File "/usr/local/lib/python2.7/dist-packages/sphinx/environment.py", line 458, in update_generator
    self.read_doc(docname, app=app)
  File "/usr/local/lib/python2.7/dist-packages/sphinx/environment.py", line 609, in read_doc
    raise SphinxError(str(err))
SphinxError: 'utf8' codec can't decode byte 0xe4 in position 36: invalid continuation byte
> /usr/local/lib/python2.7/dist-packages/sphinx/environment.py(609)read_doc()
-> raise SphinxError(str(err))
(Pdb)

--
You received this message because you are subscribed to the Google Groups "sphinx-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sphinx-users...@googlegroups.com.
To post to this group, send email to sphinx...@googlegroups.com.
Visit this group at http://groups.google.com/group/sphinx-users?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.

--
gilberto dos santos alves
+55.11.98646-5049
sao paulo - sp - brasil

Conway M

Apr 18, 2013, 12:29:04 AM

to sphinx...@googlegroups.com

gilberto, thanks for the reply.

I should have been clearer. The above text exists in a restructured text file. I'm not sure I see how setting the source code encodings would affect Sphinx's processing of the restructured text.

Guenter Milde

Apr 18, 2013, 3:06:43 AM

to sphinx...@googlegroups.com

On 2013-04-17, Conway M wrote:

> I am trying to compile the docs of Pandas
> <https://github.com/pydata/pandas> but I am unable to get Sphinx to
> compile a document with some unicode. Is there some flag I need to
> specify to let Sphinx correctly build documents with unicode in them?

The default input encoding is 'utf8', so if your rst document is
utf8-encoded, it should be OK.

If not, please post more details (used encoding, docutils settings).
A minimal example (the part of the input file that caused the error) may
help further.

> In this case, I don't want Sphinx to decode the text.

Docutils/Sphinx will always decode the input into an "unicode" instance
and encode the output. All inner processing is done on "unicode" (or
derived) objects.
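A minimal sketch of that decode, process, encode pattern, assuming Python 3 (where the "unicode" type is simply str). The function name "render" and the strip() step are placeholders for illustration, not Docutils or Sphinx APIs.

```python
def render(raw: bytes, input_encoding: str = "utf-8",
           output_encoding: str = "utf-8") -> bytes:
    """Hypothetical stand-in for a docutils-style pipeline."""
    text = raw.decode(input_encoding)          # bytes -> str at the input boundary
    processed = text.strip()                   # placeholder for processing on str
    return processed.encode(output_encoding)   # str -> bytes at the output boundary

# Input stored as Latin-1; output written as UTF-8.
latin1_input = "  Grüße\n".encode("latin-1")
out = render(latin1_input, input_encoding="latin-1")
print(out.decode("utf-8"))
```

If the declared input encoding does not match the bytes, the first step is where a decode error like the one in the traceback would surface.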

...

>> File "/usr/local/lib/python2.7/dist-packages/sphinx/environment.py",
>> line 609, in read_doc
>> raise SphinxError(str(err))
>> SphinxError: 'utf8' codec can't decode byte 0xe4 in position 36: invalid
>> continuation byte
>> > /usr/local/lib/python2.7/dist-packages/sphinx/environment.py(609)read_doc()
>> -> raise SphinxError(str(err))
>> (Pdb)

It looks like the input file is either broken or not in utf8 encoding (which
then?).
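One way to check this is to try decoding the raw bytes and report where it fails. The following is only a sketch, assuming Python 3; the helper name "first_bad_byte" and the sample bytes are made up for illustration.

```python
def first_bad_byte(raw: bytes, encoding: str = "utf-8"):
    """Return (offset, byte, reason) for the first undecodable byte, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start, raw[err.start], err.reason
    return None

# Hypothetical file content: 'Träumen' stored as Latin-1 rather than UTF-8.
sample = "word,length\nTräumen,7\n".encode("latin-1")

print(first_bad_byte(sample))             # reports the 0xe4 byte
print(first_bad_byte(sample, "latin-1"))  # decodes cleanly
```

For a real file, the bytes could be read with open(path, "rb").read() and passed to the same helper.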

It looks like the input decoding is not done by docutils.io, but by the
Sphinx "wrapper" - this means you must tell Sphinx about the correct
"source_encoding"
http://sphinx-doc.org/config.html#confval-source_encoding.
Setting the Docutils config setting "input-encoding"
http://docutils.sourceforge.net/docs/user/config.html#input-encoding will
not help.
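For illustration, the setting goes into conf.py as a plain assignment. The fragment below is a sketch that spells out 'utf-8-sig', the value the Sphinx documentation linked above gives as the default; a project whose rst files are stored in another encoding would name that encoding instead.

```python
# conf.py (fragment)
# Encoding Sphinx uses to decode the rst source files.
# 'utf-8-sig' is the documented default, written out explicitly here.
source_encoding = 'utf-8-sig'
```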

Günter

Conway M

Apr 18, 2013, 11:38:48 AM

to sphinx...@googlegroups.com, mi...@users.sf.net

Günter, thanks for your response.

The conf.py did not have a source_encoding specified. So I assume it would just default to 'utf-8-sig'. Even explicitly specifying the encoding as 'utf-8-sig' produced the same error.

The snippet in the rst document that is causing the error is (also specified in the original post):

data = 'word,length\nTr\xe4umen,7\nGr\xfc\xdfe,5'

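The complete rst document can be found here <https://raw.github.com/pydata/pandas/master/doc/source/io.rst>. The resulting html should look like this <http://pandas.pydata.org/pandas-docs/dev/io.html#dealing-with-unicode-data>.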

One thing that I just realized is that other developers who have built the docs have built them exclusively on a Linux box. However, I am working off an Ubuntu 12.04 virtual machine running on Windows 7. So I'm not entirely convinced that the input file is broken; it might be a platform-dependent issue.

gilberto dos santos alves

Apr 18, 2013, 12:29:39 PM

to sphinx...@googlegroups.com

When you are running under VirtualBox, QEMU, or VMware, it works fine in my experience. I have used Ubuntu 12.04 on VirtualBox and QEMU on Windows, Mac, etc. Please check what your locale is (open a terminal, man locale). When you install the OS you can choose it (your language.codepage), so check that it is set up correctly. In my experience, if someone is using en_US.ascii, programs (Sphinx and others) do not handle this. My locale says pt_BR.utf-8; if I install only pt_BR.ascii, things do not go very well.
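The locale can also be inspected from Python itself. This is a small sketch, assuming Python 3 and only standard library calls; the helper "can_encode" is made up for illustration. Note that recent Python 3 versions may report UTF-8 even under a C/POSIX locale, so the output can differ from what Python 2.7 showed in 2013.

```python
import locale
import sys

# Encoding implied by the current locale settings (LANG, LC_ALL, ...).
preferred = locale.getpreferredencoding(False)
print("locale encoding:    ", preferred)
print("stdout encoding:    ", sys.stdout.encoding)
print("filesystem encoding:", sys.getfilesystemencoding())

def can_encode(text: str, encoding: str) -> bool:
    """Hypothetical helper: True if text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# An ASCII-only environment cannot represent the umlaut; UTF-8 can.
print(can_encode("Träumen", "ascii"))
print(can_encode("Träumen", "utf-8"))
```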

Lothar Braun

Apr 18, 2013, 1:15:03 PM

to sphinx...@googlegroups.com

I'm not a specialist on this, but looking at your html I would think that you want to see those 4 characters \xe4 in your html document, while Sphinx sees this as the utf-8 form of one (nonexistent) unicode character.

If that is correct, I would try some escaping to bypass the parsing of this sequence as utf-8.
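To make the distinction concrete, here is a small Python 3 sketch of the two readings: the four literal characters versus the single character the escape stands for. It says nothing about how Sphinx or the ipython directive actually treats the text; it only shows the difference at the string level.

```python
# Raw literal: backslash, x, e, 4 stay as four separate ASCII characters.
escaped = r"Tr\xe4umen"

# Normal literal: the escape is interpreted as the single character U+00E4.
interpreted = "Tr\xe4umen"

print(len(escaped), len(interpreted))
print(escaped.encode("utf-8"))      # pure ASCII bytes
print(interpreted.encode("utf-8"))  # contains a two-byte UTF-8 sequence
```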

Lothar

Conway M

Apr 18, 2013, 2:57:24 PM

to sphinx...@googlegroups.com, l...@aucotec.com

Lothar, that is correct. I would like to see those 4 characters in the html document. When you say "escaping", is there a way to mark a line in the rst document so that it is selectively not interpreted as UTF-8?

Conway M

Apr 18, 2013, 3:51:35 PM

to sphinx...@googlegroups.com

gilberto, my locale was set to en_US and I changed everything to en_US.UTF-8

I'm not sure if both were necessary but I did this:

sudo update-locale LANG=en_US.UTF-8

sudo update-locale LC_ALL=en_US.UTF-8

And it worked!!

Thank you gilberto...I would not have figured that out on my own!

gilberto dos santos alves

Apr 18, 2013, 5:33:40 PM

to sphinx...@googlegroups.com

very nice.

regards.

Guenter Milde

Apr 19, 2013, 5:15:36 AM

to sphinx...@googlegroups.com

On 2013-04-18, Conway M wrote:

> Günter, thanks for your response.

> The conf.py did not have a source_encoding specified. So I assume it would
> just default to 'utf-8-sig'. Even explicitly specifying the encoding as
> 'utf-8-sig' produced the same error.

Do you have non-ASCII chars in conf.py? Otherwise, specifying the source
encoding of conf.py is not necessary.
In any case, the source encoding of conf.py will not influence how Sphinx
decodes the rst input files.

> The snippet in the rst document that is causing the error is (also
> specified in the original post):

> *data = 'word,length\nTr\xe4umen,7\nGr\xfc\xdfe,5'*

Interestingly, this does not contain any non-ASCII characters, so it should
pass without problems!

> The complete rst document can be found here <https://raw.github.com/pydata/pandas/master/doc/source/io.rst>.

Here I see that this is part of an "ipython" directive. It is used similarly to
"code" or "code-block", so I assume it should

* treat the content as "literal", i.e. without special meaning to characters
like the backslash

* parse the content for syntax highlighting.

Maybe the extension defining "ipython" does not get this right and converts
\x.. to non-ASCII characters.

> The resulting html should look like this <http://pandas.pydata.org/pandas-docs/dev/io.html#dealing-with-unicode-data>.

Yes indeed, the whole block contains Unicode characters not present in the
input::

Out[1054]:
word length
0 Träumen 7
1 Grüße 5

It seems there is rather a problem with ipython or the interface.
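To illustrate where the non-ASCII characters come from: the source text is pure ASCII, but running it produces "ä", "ü" and "ß". The sketch below is a Python 3 rendering of the snippet (the original is Python 2, where the plain string literal is a byte string), and the standard library csv module stands in for pandas.read_csv so that it runs without pandas.

```python
import csv
import io

# Python 3 rendering of the snippet: a byte string with Latin-1 escapes.
data = b'word,length\nTr\xe4umen,7\nGr\xfc\xdfe,5'

# Decode as Latin-1, then parse; csv is a stand-in for pandas.read_csv here.
rows = list(csv.reader(io.StringIO(data.decode('latin-1'))))
for row in rows:
    print(row)
```

Any tool that executes the block and captures its output therefore has to handle non-ASCII text, even though the rst file itself contains none.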

(BTW, the example "Träumen" never appears alone in this form in German:
either it is the verb "träumen" (with small t) capitalized at the
beginning of a sentence like "Träumen werde ich." (Dream, I will.) or it
is the dative plural of the substantive "Traum" like in "in meinen
Träumen" (in my dreams).)

> One thing that I just realized is that other developers who have built the
> docs have built them exclusively on a Linux box. However, I am working off
> a Ubuntu 12.04 virtual machine running on Windows 7. So I'm not entirely
> convinced that the input file is broken; it might be a platform
> dependent issue.

From the other posts I learned that the issue could be solved with a locale
setting.

Günter
