Right solution to unicode error?

Anders

Nov 7, 2012, 5:17:42 PM

I've run into a Unicode error, and despite doing some googling, I
can't figure out the right way to fix it. I have a Python 2.6 script
that reads my Outlook 2010 task list. I'm able to read the tasks from
Outlook and store them as a list of objects without a hitch. But when
I try to print the tasks' subjects, one of the tasks is generating an
error:

Traceback (most recent call last):
  File "outlook_tasks.py", line 66, in <module>
    my_tasks.dump_today_tasks()
  File "C:\Users\Anders\code\Task List\tasks.py", line 29, in dump_today_tasks
    print task.subject
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 42: ordinal not in range(128)

(where task.subject was previously assigned the value of
task.Subject, aka the Subject property of an Outlook 2010 TaskItem)

From what I understand from reading online, the error is telling me
that the subject line contains an en dash and that Python is trying
to convert to ascii and failing (as it should).
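
A minimal sketch of what the traceback reports, assuming a made-up subject string (the real Outlook subject is not shown in the thread): encoding to ASCII fails at the first character outside the 0-127 range, and the exception object records which character it was and where.

```python
import unicodedata

# Hypothetical subject line; the real one from Outlook is not shown in the thread.
subject = u"Budget review \u2013 draft"

try:
    subject.encode("ascii")
except UnicodeEncodeError as exc:
    bad_char = exc.object[exc.start]        # the character that could not be encoded
    bad_name = unicodedata.name(bad_char)   # its official Unicode name
    bad_pos = exc.start                     # its index within the string
    print("%s at position %d" % (bad_name, bad_pos))
```

The `start` and `object` attributes are part of the standard UnicodeEncodeError exception, so the same approach can be used to locate the offending character in any string.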

Here's where I'm getting stuck. In the code above I was just printing
the subject so I can see whether the script is working properly.
Ultimately what I want to do is parse the tasks I'm interested in and
then create an HTML file containing those tasks. Given that, what's
the best way to fix this problem?

BTW, if there's a clear description of the best solution for this
particular problem – i.e., where I want to ultimately display the
results as HTML – please feel free to refer me to the link. I tried
reading a number of docs on the web but still feel pretty lost.

Thanks,
Anders

Prasad, Ramit

Nov 7, 2012, 6:07:33 PM

to Anders, pytho...@python.org

> particular problem - i.e., where I want to ultimately display the
> results as HTML - please feel free to refer me to the link. I tried

> reading a number of docs on the web but still feel pretty lost.
>

You can always encode in a non-ASCII codec.
`print task.subject.encode(<encoding>)` where <encoding> is something that
supports the characters you want, e.g. cp1252 or utf-8 (latin1 does not
include the en dash).

The list of built-in codecs can be found at:
http://docs.python.org/library/codecs.html#standard-encodings
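
One caveat worth checking before picking a codec: not every non-ASCII codec contains every character. The sketch below uses a hypothetical sample string and a hypothetical helper, `can_encode`, to try a few standard codecs against an en dash.

```python
sample = u"Status \u2013 done"   # hypothetical text containing an en dash

def can_encode(text, encoding):
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Try a handful of codecs from the standard-encodings list.
results = dict((name, can_encode(sample, name))
               for name in ("ascii", "latin-1", "cp1252", "utf-8"))
for name, ok in sorted(results.items()):
    print("%-8s %s" % (name, "ok" if ok else "cannot encode"))
```

Of these four, only cp1252 and utf-8 can represent U+2013. UTF-8 can represent any Unicode character, which makes it the safer default when the text is not known in advance.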

~Ramit

Oscar Benjamin

Nov 7, 2012, 6:27:17 PM

to Anders, pytho...@python.org

On 7 November 2012 22:17, Anders <aschne...@asha.org> wrote:
>
> Traceback (most recent call last):
> File "outlook_tasks.py", line 66, in <module>
> my_tasks.dump_today_tasks()
> File "C:\Users\Anders\code\Task List\tasks.py", line 29, in
> dump_today_tasks
> print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 42: ordinal not in range(128)
>

> Here's where I'm getting stuck. In the code above I was just printing
> the subject so I can see whether the script is working properly.
> Ultimately what I want to do is parse the tasks I'm interested in and
> then create an HTML file containing those tasks. Given that, what's
> the best way to fix this problem?

Are you using cmd.exe (standard Windows terminal)? If so, it does not
support unicode and Python is telling you that it cannot encode the
string in a way that can be understood by your terminal. You can try
using chcp to set the code page to something that works with your
script.

If you are only printing it for debugging purposes you can just print
the repr() of the string, which will be ASCII and will come out fine in
your terminal. If you want to write it to an HTML file you should
encode the string with whatever encoding (probably UTF-8) you use in
the HTML file. If you really just want your script to be able to print
unicode characters then you need to use something other than cmd.exe
(such as IDLE).
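
A rough sketch of both suggestions, not the OP's actual code: the subjects, the file name and the HTML skeleton are all made up for illustration. Debug output is escaped to pure ASCII so any terminal can show it, and the HTML file is written with an explicitly chosen, declared encoding. It is intended to run on Python 2.6+ as well as Python 3.

```python
import io
import os
import tempfile

# Hypothetical task subjects standing in for the Outlook data.
subjects = [u"Budget review \u2013 draft", u"Call caf\u00e9 supplier"]

# Debug output that is safe on any terminal: escape everything non-ASCII.
for s in subjects:
    print(s.encode("ascii", "backslashreplace").decode("ascii"))

# HTML output: pick one encoding, declare it in the page, and write with it.
# (Subjects containing <, > or & would also need HTML escaping.)
items = u"\n".join(u"<li>%s</li>" % s for s in subjects)
html = (u'<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        u"<title>Tasks</title></head>\n"
        u"<body><ul>\n%s\n</ul></body></html>\n" % items)

path = os.path.join(tempfile.mkdtemp(), "tasks.html")   # hypothetical location
with io.open(path, "w", encoding="utf-8") as f:
    f.write(html)
```

Using io.open with an explicit encoding keeps all the strings as unicode inside the program and encodes only at the point where the file is written.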

Oscar

Andrew Berg

Nov 7, 2012, 6:51:11 PM

to comp.lang.python

On 2012.11.07 17:27, Oscar Benjamin wrote:
> Are you using cmd.exe (standard Windows terminal)? If so, it does not
> support unicode

Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
the OP since Python versions below 3.3 don't support cp65001, but I
think it's important to point out that the Windows command line system
(it is not unique to cmd) does in fact support Unicode.
--
CPython 3.3.0 | Windows NT 6.1.7601.17835

Steven D'Aprano

Nov 7, 2012, 6:53:49 PM

On Wed, 07 Nov 2012 14:17:42 -0800, Anders wrote:

> I've run into a Unicode error, and despite doing some googling, I can't
> figure out the right way to fix it. I have a Python 2.6 script that
> reads my Outlook 2010 task list. I'm able to read the tasks from Outlook
> and store them as a list of objects without a hitch. But when I try to
> print the tasks' subjects, one of the tasks is generating an error:
>
> Traceback (most recent call last):
> File "outlook_tasks.py", line 66, in <module>
> my_tasks.dump_today_tasks()
> File "C:\Users\Anders\code\Task List\tasks.py", line 29, in
> dump_today_tasks
> print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 42: ordinal not in range(128)

This error confuses me. Is that an exact copy and paste of the error, or
have you edited it or reconstructed it? Because it seems to me that if
task.subject is a unicode string, as it appears to be, calling print on
it should succeed:

py> s = u'ABC\u2013DEF'
py> print s
ABC–DEF

What does type(task.subject) return?

--
Steven

Oscar Benjamin

Nov 7, 2012, 7:44:26 PM

to Andrew Berg, comp.lang.python

On 7 November 2012 23:51, Andrew Berg <bahamut...@gmail.com> wrote:
> On 2012.11.07 17:27, Oscar Benjamin wrote:
>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
>> support unicode
> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
> the OP since Python versions below 3.3 don't support cp65001, but I
> think it's important to point out that the Windows command line system
> (it is not unique to cmd) does in fact support Unicode.

I have tried to use code page 65001 and it didn't work for me even if
I did use a version of Python (possibly 3.3 alpha) that claimed to
support it. It turned out that there were other Windows related
problems with using the codepage so that I had to do something like

chcp 65001 && python myscript.py && chcp 1252

(It was important for all those commands to be on the same line.) I'm
not on Windows right now and I can't remember all the details but I
seem to remember that even with that awkwardness and changing the font
it still didn't actually work.

If you know how to make it work, I'd be interested to know.

Oscar

wxjm...@gmail.com

Nov 8, 2012, 6:01:14 AM

----------

The problem is not on the Python side, nor is it specific
to Python. It lies in the "coding of characters".

1) Unicode is an abstract entity; it has to be encoded
for the system/device that will host it.
Using Python:
<unicode>.encode(host_coding)

2) The host_coding scheme may not contain the
character (glyph/grapheme) corresponding to the
"unicode character". In that case there are two possible
solutions: "ignore" it or "replace" it with a
substitution character.
Using Python:
<unicode>.encode(host_coding, "ignore")
<unicode>.encode(host_coding, "replace")

3) Detecting the host_coding is the most difficult
task. Either you hard-code it or you
let Python find it via sys.stdout.encoding.

4) Due to the nature of unicode, this is the only
way to do it correctly.

Examples that fail and succeed as expected.
They are mainly Py3, but it doesn't matter. Note: in Py3, encode()
creates a byte string, which has to be
decoded to produce a native (unicode) string, here
with cp1252.

Py2

>>> u'éléphant\u2013abc'.encode('ascii')
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    u'éléphant\u2013abc'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> print(u'éléphant\u2013abc'.encode('cp1252'))
éléphant–abc
>>>

Py3

>>> 'éléphant\u2013abc'.encode('ascii')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
>>> 'éléphant\u2013abc'.encode('ascii', 'ignore')
b'lphantabc'
>>> 'éléphant\u2013abc'.encode('ascii', 'replace')
b'?l?phant?abc'
>>> 'éléphant\u2013abc'.encode('ascii', 'ignore').decode('cp1252')
'lphantabc'
>>> 'éléphant\u2013abc'.encode('ascii', 'replace').decode('cp1252')
'?l?phant?abc'
>>>
>>> 'éléphant\u2013abc'.encode('cp1252').decode('cp1252')
'éléphant–abc'

>>> sys.stdout.encoding
'cp1252'
>>> 'éléphant\u2013abc'.encode(sys.stdout.encoding).decode('cp1252')
'éléphant–abc'

etc
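
Besides "ignore" and "replace", the standard codecs machinery offers error handlers that keep the information instead of discarding it. This is a small sketch rather than part of the original examples: "backslashreplace" is handy for debugging output, and "xmlcharrefreplace" produces numeric character references that are valid in HTML, which fits the OP's goal.

```python
text = u"\u00e9l\u00e9phant\u2013abc"   # the same sample string as above

# Keep the unencodable characters as Python-style escapes (useful for debugging).
as_backslash = text.encode("ascii", "backslashreplace")

# Keep them as numeric character references (valid inside HTML/XML text).
as_charref = text.encode("ascii", "xmlcharrefreplace")

print(as_backslash.decode("ascii"))
print(as_charref.decode("ascii"))
```

Both results are plain ASCII bytes, so they survive any terminal or file encoding, and a browser will render the character references as the original characters.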

jmf

Hans Mulder

Nov 8, 2012, 6:40:11 AM

On 8/11/12 00:53:49, Steven D'Aprano wrote:
> This error confuses me. Is that an exact copy and paste of the error, or
> have you edited it or reconstructed it? Because it seems to me that if
> task.subject is a unicode string, as it appears to be, calling print on
> it should succeed:
>
> py> s = u'ABC\u2013DEF'
> py> print s
> ABC–DEF

That would depend on whether python thinks sys.stdout can
handle UTF8. For example, on my MacOS X box:

$ python2.6 -c 'print u"abc\u2013def"'
abc–def
$ python2.6 -c 'print u"abc\u2013def"' | cat

Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3: ordinal not in range(128)

This is because python knows that my terminal is capable
of handling UTF8, but it has no idea whether the program at
the other end of a pipe has that ability, so it falls
back to ASCII when sys.stdout goes to a pipe.

Apparently the OP has a terminal that doesn't handle UTF8,
or one that Python doesn't know about.
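
The terminal-versus-pipe difference can be inspected directly. The sketch below is written for Python 3 and assumes nothing about the OP's setup: a hypothetical helper, `describe`, reports what Python decided for a stream, and an explicitly encoded text wrapper avoids relying on that guess. An in-memory byte buffer stands in for the pipe.

```python
import io
import sys

def describe(stream):
    """Report whether a stream is a terminal and which encoding Python chose."""
    return {"isatty": stream.isatty(),
            "encoding": getattr(stream, "encoding", None)}

print(describe(sys.stdout))

# When the destination is a pipe or a file, choose the encoding explicitly
# instead of relying on the guess. A BytesIO stands in for the pipe here.
fake_pipe = io.BytesIO()
writer = io.TextIOWrapper(fake_pipe, encoding="utf-8", newline="\n")
writer.write(u"abc\u2013def\n")
writer.flush()
piped_bytes = fake_pipe.getvalue()
```

On Python 2, codecs.getwriter("utf-8")(sys.stdout) plays a similar role, and setting the PYTHONIOENCODING environment variable (available since Python 2.6) is another way to override the guess without changing the script.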

Hope this helps,

-- HansM

Anders Schneiderman

Nov 8, 2012, 9:00:43 AM

to Oscar Benjamin, pytho...@python.org

Thanks, Oscar and Ramit! This is exactly what I was looking for.

Anders

> -----Original Message-----
> From: Oscar Benjamin [mailto:oscar.j....@gmail.com]
> Sent: Wednesday, November 07, 2012 6:27 PM
> To: Anders Schneiderman
> Cc: pytho...@python.org
> Subject: Re: Right solution to unicode error?
>
> On 7 November 2012 22:17, Anders <aschne...@asha.org> wrote:
> >

> > Traceback (most recent call last):
> > File "outlook_tasks.py", line 66, in <module>
> > my_tasks.dump_today_tasks()
> > File "C:\Users\Anders\code\Task List\tasks.py", line 29, in
> > dump_today_tasks
> > print task.subject
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> > position 42: ordinal not in range(128)
> >

> > Here's where I'm getting stuck. In the code above I was just printing
> > the subject so I can see whether the script is working properly.
> > Ultimately what I want to do is parse the tasks I'm interested in and
> > then create an HTML file containing those tasks. Given that, what's
> > the best way to fix this problem?
>

> Are you using cmd.exe (standard Windows terminal)? If so, it does not

Oscar Benjamin

Nov 8, 2012, 9:06:42 AM

to Andrew Berg, comp.lang.python

On 8 November 2012 00:44, Oscar Benjamin <oscar.j....@gmail.com> wrote:
> On 7 November 2012 23:51, Andrew Berg <bahamut...@gmail.com> wrote:
>> On 2012.11.07 17:27, Oscar Benjamin wrote:

>>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
>>> support unicode

>> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
>> the OP since Python versions below 3.3 don't support cp65001, but I
>> think it's important to point out that the Windows command line system
>> (it is not unique to cmd) does in fact support Unicode.
>
> I have tried to use code page 65001 and it didn't work for me even if
> I did use a version of Python (possibly 3.3 alpha) that claimed to
> support it.

I stand corrected. I've just checked and codepage 65001 does work in
cmd.exe (on this machine):

O:\>Q:\tools\Python33\python -c print('abc\u2013def')

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Q:\tools\Python33\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 3: character maps to <undefined>

O:\>chcp 65001
Active code page: 65001

O:\>Q:\tools\Python33\python -c print('abc\u2013def')
abc-def

O:\>Q:\tools\Python33\python -c print('\u03b1')
α

It would be a lot better though if it just worked straight away
without me needing to set the code page (like the terminal in every
other OS I use).

Oscar

wxjm...@gmail.com

Nov 8, 2012, 10:05:14 AM

----------

It *WORKS* straight away. The problem is that
people do not wish to use unicode correctly
(eg. Mulder's example).
Read the point 1) and 4) in my previous post.

Unicode and in general the coding of the characters
have nothing to do with the os's or programming languages.

jmf

Oscar Benjamin

Nov 8, 2012, 1:32:11 PM

to wxjm...@gmail.com, pytho...@python.org

On 8 November 2012 15:05, <wxjm...@gmail.com> wrote:
> Le jeudi 8 novembre 2012 15:07:23 UTC+1, Oscar Benjamin a écrit :
>> On 8 November 2012 00:44, Oscar Benjamin <oscar.j....@gmail.com> wrote:
>> > On 7 November 2012 23:51, Andrew Berg <bahamut...@gmail.com> wrote:
>> >> On 2012.11.07 17:27, Oscar Benjamin wrote:
>>
>> >>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
>> >>> support unicode
>>
>> >> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
>> >> the OP since Python versions below 3.3 don't support cp65001, but I
>> >> think it's important to point out that the Windows command line system
>> >> (it is not unique to cmd) does in fact support Unicode.
>>
>> > I have tried to use code page 65001 and it didn't work for me even if
>> > I did use a version of Python (possibly 3.3 alpha) that claimed to
>> > support it.
>>
>> I stand corrected. I've just checked and codepage 65001 does work in
>> cmd.exe (on this machine):
>>

>> O:\>chcp 65001
>> Active code page: 65001
>>
>> O:\>Q:\tools\Python33\python -c print('abc\u2013def')
>> abc-def
>>
>> O:\>Q:\tools\Python33\python -c print('\u03b1')
>> α
>>
>> It would be a lot better though if it just worked straight away
>> without me needing to set the code page (like the terminal in every
>> other OS I use).
>

> It *WORKS* straight away. The problem is that
> people do not wish to use unicode correctly
> (eg. Mulder's example).
> Read the point 1) and 4) in my previous post.
>
> Unicode and in general the coding of the characters
> have nothing to do with the os's or programming languages.

I don't know what you mean that it works "straight away".

The default code page on my machine is cp850.

O:\>chcp
Active code page: 850

cp850 doesn't understand utf-8. It just prints garbage:

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
╬▒

Using the correct encoding doesn't help:

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('cp850'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in position 0: character maps to <undefined>

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in position 0: character maps to <undefined>

If I want the other characters to work I need to change the code page:

O:\>chcp 65001
Active code page: 65001

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
α

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
α
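
The two failure modes above can be reproduced without a Windows console. This is only an illustrative sketch: it decodes the UTF-8 bytes with cp850 to mimic what a cp850 console displays, and it checks whether cp850 has the character at all.

```python
alpha = u"\u03b1"   # GREEK SMALL LETTER ALPHA

# The bytes that the script writes to stdout.
utf8_bytes = alpha.encode("utf-8")

# What a console set to code page 850 makes of those two bytes.
shown_by_cp850 = utf8_bytes.decode("cp850")

# Whether cp850 can represent the character in the first place.
try:
    alpha.encode("cp850")
    in_cp850 = True
except UnicodeEncodeError:
    in_cp850 = False

print(repr(utf8_bytes), in_cp850)
```

The two UTF-8 bytes map to a box-drawing character and a shade block in cp850, which matches the garbage shown earlier. Because cp850 has no alpha, no choice of encoding on the Python side helps until the console's code page is changed.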

Oscar

Ian Kelly

Nov 8, 2012, 1:48:23 PM

to Python

On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin
<oscar.j....@gmail.com> wrote:
> If I want the other characters to work I need to change the code page:
>

> O:\>chcp 65001
> Active code page: 65001
>

> O:\>Q:\tools\Python33\python -c "import sys;
> sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> α
>
> O:\>Q:\tools\Python33\python -c "import sys;
> sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.en
> coding))"
> α

I find that I also need to change the font. With the default font,
printing '\u2013' gives me:

â€“

The only alternative font option I have in Windows XP is Lucida
Console, which at least works correctly, although it seems to be
lacking a lot of glyphs.

wxjm...@gmail.com

Nov 8, 2012, 2:30:37 PM

to wxjm...@gmail.com, pytho...@python.org

You are confusing two things: the coding of the
characters and the set of characters (glyphs/graphemes)
of a coding scheme.

It is always possible to safely encode a unicode string, but
the target coding may not contain the character.

Take a look at the output of this "special" interactive
interpreter, where the host coding (sys.stdout.encoding)
can be changed on the fly.

>>> s = 'éléphant\u2013abcéœ€'
>>> sys.stdout.encoding
'<unicode>'
>>> s
'éléphant–abcéœ€'
>>>
>>> sys.stdout.encoding = 'cp1252'
>>> s.encode('cp1252')
'éléphant–abcéœ€'
>>> sys.stdout.encoding = 'cp850'
>>> s.encode('cp850')

Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
  File "C:\Python32\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 8: character maps to <undefined>
>>> # but
>>> s.encode('cp850', 'replace')
'éléphant?abcé??'
>>>
>>> sys.stdout.encoding = 'utf-8'
>>> s
'Ã©lÃ©phantâ€“abcÃ©Å“â‚¬'
>>> s.encode('utf-8')
'éléphant–abcéœ€'
>>>
>>> sys.stdout.encoding = 'utf-16-le' <<<<<<<<<
>>> s
' é l é p h a n t a b c é S ¬ '
>>> s.encode('utf-16-le')
'éléphant–abcéœ€'

<<<<<<<<<<< some cheating here due to the mail system; it really looks like this.

jmf

wxjm...@gmail.com

Nov 8, 2012, 2:54:23 PM

to Python

--------

Font has nothing to do here.
You are "simply" wrongly encoding your "unicode".

>>> '\u2013'
'–'
>>> '\u2013'.encode('utf-8')
b'\xe2\x80\x93'
>>> '\u2013'.encode('utf-8').decode('cp1252')
'â€“'

jmf

Ian Kelly

Nov 8, 2012, 3:41:37 PM

to Python

On Thu, Nov 8, 2012 at 12:54 PM, <wxjm...@gmail.com> wrote:
> Font has nothing to do here.
> You are "simply" wrongly encoding your "unicode".
>
>>>> '\u2013'
> '–'
>>>> '\u2013'.encode('utf-8')
> b'\xe2\x80\x93'
>>>> '\u2013'.encode('utf-8').decode('cp1252')
> 'â€“'

No, it seriously is the font. This is what I get using the default
("Raster") font:

C:\>chcp 65001
Active code page: 65001

C:\>c:\python33\python
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u2013'
'â€“'
>>> import sys
>>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8'))
â€“
4

I should note here that the characters copied and pasted do not
correspond to the glyphs actually displayed in my terminal window. In
the terminal window I actually see:

ΓÇô

If I change the font to Lucida Console and run the *exact same code*,
I get this:

C:\>chcp 65001
Active code page: 65001

C:\>c:\python33\python
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u2013'
'–'

>>> import sys
>>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8'))
–
4

Why is the font important? I have no idea. Blame Microsoft.

Prasad, Ramit

Nov 8, 2012, 3:54:23 PM

to wxjm...@gmail.com, pytho...@python.org

wxjm...@gmail.com wrote:
>
> Le jeudi 8 novembre 2012 19:49:24 UTC+1, Ian a écrit :
> > On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin
> >
> > <oscar.j....@gmail.com> wrote:
> >
> > > If I want the other characters to work I need to change the code page:
> > >
> > > O:\>chcp 65001
> > > Active code page: 65001
> > >
> > > O:\>Q:\tools\Python33\python -c "import sys;
> > > sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> > > α
> > >
> > > O:\>Q:\tools\Python33\python -c "import sys;
> > > sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.en
> > > coding))"
> > > α
> >
> > I find that I also need to change the font. With the default font,
> >
> > printing '\u2013' gives me:
> > â€“
> >
> > The only alternative font option I have in Windows XP is Lucida
> > Console, which at least works correctly, although it seems to be
> > lacking a lot of glyphs.
>
> --------
>
> Font has nothing to do here.
> You are "simply" wrongly encoding your "unicode".
>

Why would font not matter? Unicode is the abstract definition
of all characters, right? From that we map the abstract
character to a code page/set, which gives real values for an
abstract character. From that code page we then visually display
the "real value" based on the font. If that font does
not have a glyph for a specific character (or has a different
glyph), then that is a problem unrelated to encoding.

Unicode->code page->font

> >>> '\u2013'
> '–'
> >>> '\u2013'.encode('utf-8')
> b'\xe2\x80\x93'
> >>> '\u2013'.encode('utf-8').decode('cp1252')
> 'â€“'
>

This is a mismatched translation between code pages; it is not
font related but is instead one abstraction "level" up.
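
The first two layers of that chain can be made concrete with a short sketch (illustrative only; the choice of encodings is arbitrary). One abstract code point gets different byte values under different encodings, and the mismatch appears when bytes produced by one are read back with another.

```python
dash = u"\u2013"   # EN DASH

# Layer 1: the abstract code point.
code_point = ord(dash)

# Layer 2: the concrete bytes, which depend on the encoding chosen.
encoded = dict((name, dash.encode(name))
               for name in ("cp1252", "utf-8", "utf-16-le"))

# The mismatch: bytes written as UTF-8 but interpreted as cp1252.
misread = encoded["utf-8"].decode("cp1252")

print(hex(code_point), encoded)
```

The font only enters at the last step, when the console picks a glyph for whatever characters the decoding step produced.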

Ian Kelly

Nov 8, 2012, 4:07:15 PM

to Python

On Thu, Nov 8, 2012 at 1:54 PM, Prasad, Ramit <ramit....@jpmorgan.com> wrote:
> Why would font not matter? Unicode is the abstract definition
> of all characters right? From that we map the abstract
> character to a code page/set, which gives real values for an
> abstract character. From that code page we then visually display
> the "real value" based on the font. If that font does
> not have a glyph for a specific character page (or a different
> glyph) then that is a problem and not related encoding.

Usually though when the font is missing a glyph for a Unicode
character, you just get a missing glyph symbol, such as an empty
rectangle. For some reason when using the default font, cmd seemingly
ignores the active code page, skips decoding the characters, and tries
to print the individual bytes as if using code page 437.
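
One way to check a suspicion like this is to decode the UTF-8 bytes with each candidate code page and compare the result with what was actually displayed. The helper below, `guess_code_page`, is a hypothetical name and a rough diagnostic sketch, not an established tool.

```python
def guess_code_page(original, garbled,
                    candidates=("cp437", "cp850", "cp1252", "latin-1")):
    """Return the candidate code pages under which the UTF-8 bytes of
    `original` would be displayed as `garbled`."""
    raw = original.encode("utf-8")
    matches = []
    for name in candidates:
        try:
            if raw.decode(name) == garbled:
                matches.append(name)
        except UnicodeDecodeError:
            pass   # some byte has no meaning in this code page
    return matches

# The three glyphs seen in the terminal window: gamma, C-cedilla, o-circumflex.
matches = guess_code_page(u"\u2013", u"\u0393\u00c7\u00f4")
print(matches)
```

The same function applied to the copied-and-pasted form of the output (a-circumflex, euro sign, left double quote) points at cp1252 instead, which is consistent with the terminal and the clipboard interpreting the same bytes differently.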

Oscar Benjamin

Nov 8, 2012, 4:37:48 PM

to wxjm...@gmail.com, pytho...@python.org

On 8 November 2012 19:54, <wxjm...@gmail.com> wrote:
> Le jeudi 8 novembre 2012 19:49:24 UTC+1, Ian a écrit :
>> On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin
>>
>> <oscar.j....@gmail.com> wrote:
>>
>> > If I want the other characters to work I need to change the code page:
>>
>> >
>>
>> > O:\>chcp 65001
>>
>> > Active code page: 65001
>>
>> >
>>
>> > O:\>Q:\tools\Python33\python -c "import sys;
>>

>> I find that I also need to change the font. With the default font,
>>
>> printing '\u2013' gives me:
>>
>> â€“
>>
>>
>>
>> The only alternative font option I have in Windows XP is Lucida
>>
>> Console, which at least works correctly, although it seems to be
>>
>> lacking a lot of glyphs.
>

> Font has nothing to do here.
> You are "simply" wrongly encoding your "unicode".
>
>>>> '\u2013'
> '–'
>>>> '\u2013'.encode('utf-8')
> b'\xe2\x80\x93'
>>>> '\u2013'.encode('utf-8').decode('cp1252')
> 'â€“'

You have correctly identified that the displayed characters are the
result of accidentally interpreting utf-8 bytes as if they were cp1252
or similar. However, it is not Ian or Python that is confusing the
encoding. It is cmd.exe that is confusing the encoding in a
font-dependent way. I also had to change the font as Ian describes
though I did it some time ago and forgot to mention it here.

jmf, can you please trim the text you quote removing the parts you are
not responding to and then any remaining blank lines that were
inserted by your reader/editor?

Oscar

Andrew Berg

Nov 8, 2012, 10:30:46 PM

to comp.lang.python

On 2012.11.08 08:06, Oscar Benjamin wrote:
> It would be a lot better though if it just worked straight away
> without me needing to set the code page (like the terminal in every
> other OS I use).

The crude equivalent of a .bashrc/.zshrc/whatever shell startup script for
cmd is a string value (REG_SZ) named autorun under
HKCU\Software\Microsoft\Command Processor, set to
whatever command(s) you want to run whenever the shell starts. Mine
has a value of '@chcp 65001>nul'. I actually run zsh when practical
(gotta love Cygwin) and I have an equivalent command in my .zshrc.
Getting unicode to work in Windows is a hassle, but it /can/ work.
CPython does have a bug that makes it annoying at times, though -
http://bugs.python.org/issue1602

wxjm...@gmail.com

Nov 9, 2012, 5:06:05 AM

---------

If you have something like this 'ΓÇô'; in
Unicode nomenclature:
>>> import unicodedata as ud
>>> for c in 'ΓÇô':
...     ud.name(c)
...
'GREEK CAPITAL LETTER GAMMA'
'LATIN CAPITAL LETTER C WITH CEDILLA'
'LATIN SMALL LETTER O WITH CIRCUMFLEX'

it is a sign of a "cp437" somewhere.

>>> '\u2013'.encode('utf-8').decode('cp437')
'ΓÇô'

That was on Windows 7. I do not remember ever having a "coding
of characters" issue on XP.

jmf