(Fucking) Unicode: console print statement and PythonWin: replacement for off-table chars HOWTO?


Robert

Jan 10, 2006, 2:28:07 PM
(windows or linux console)

>>> print u'\u034a'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\PYTHON23\lib\encodings\cp850.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u034a' in position 0: character maps to <undefined>
>>>

How can I get replacement behaviour into Python's print statement in
general?

Should I tinker with sys.stdout/stderr? sys.stdout.write(u) at least puts
out random chars. So print apparently does the encoding itself: it
evidently takes sys.stdout.encoding and encodes with 'strict'. Where is a
good and portable place to hook in, e.g. to get behaviour similar to
.encode(xy,'replace') or 'backslashreplace'?

Shouldn't 'replace' be the default behaviour for (tty) output?!
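
For comparison, here is a minimal sketch of what the standard error handlers do with the character from the traceback above. It is written for Python 3 (the session above is Python 2.3), cp850 is used only because it is the console codec in that traceback, and encode_with is a hypothetical helper name.

```python
# Encode one unencodable character with each of the standard error handlers.
text = u"abc\u034adef"   # U+034A has no mapping in cp850

def encode_with(errors, encoding="cp850"):
    """Return the encoded bytes, or the exception name if encoding fails."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError as exc:
        return type(exc).__name__

for errors in ("strict", "ignore", "replace", "backslashreplace"):
    print(errors, "->", encode_with(errors))
```

Only 'strict' raises; 'ignore' drops the character, 'replace' substitutes '?', and 'backslashreplace' keeps the code point visible as an escape.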

Background: my file-handling script fails on consoles that do not support
all filename chars. I want my apps to run on each platform as smoothly,
smartly and tolerantly as possible, without touching hundreds or
thousands of print/output statements. (Input is a separate issue, of
course.)

Second problem, with the PythonWin output functions: the PythonWin/win32
functions (which evidently do not support wide unicode, neither
automatically nor through xxxW functions) evidently use the Python default
encoding, but try the default locale first (default locale, then
'ascii'/site.encoding, then occasionally an exception!).
This can only be made tolerant of alien chars by hacking
site.py/sitecustomize.py/encoding (very sad to have to do this on each
Python installation).
Or is there a PythonWin function to set the encoding?
sys.setdefaultencoding is deleted completely; it is not even preserved as
sys._setdefaultencoding or so.
(The encoding to set would be 'mbcs', not the default locale (cp1252 on my
machine), because only mbcs is tolerant of foreign chars and converts them
to '?'.)

The PythonWin scintilla editor/interactive window is (evidently) better: it
evidently always uses 'mbcs'.

I have now decided to put 'mbcs' in site.py on Windows. Isn't that by far
the best acceptable default solution? 'utf-8' in site.py would be
acceptable to get some idea about alien chars, but will

Thus on my default Python/PythonWin installation on Windows, 4 encodings
are in action simultaneously:
* 'ascii' in site.py / str()
* 'mbcs' in PythonWin interactive/editor
* 'cp1252'+'ascii' in PythonWin/win32 output functions
* 'cp850' at console output
... and all output is intolerant of alien chars (except 'mbcs', and that
only in the primary _test_ field, PythonWin Interactive :-( ).
Isn't that designed by the Python creators to drive developers crazy?
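
To see which encodings are in play on a given installation, a small report along these lines may help. This is a sketch for Python 3; encodings_in_effect is a hypothetical helper, and it only covers what the interpreter itself exposes, not the encodings PythonWin or the win32 functions pick internally.

```python
import locale
import sys

def encodings_in_effect():
    """Collect the encodings that can influence text output in this process."""
    return {
        "default (str/bytes conversions)": sys.getdefaultencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "locale preferred": locale.getpreferredencoding(False),
    }

for name, value in sorted(encodings_in_effect().items()):
    print("%-34s %s" % (name + ":", value))
```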

Now, by setting site.py/encoding to 'mbcs' (or 'utf-8'), the problems in
PythonWin are solved to some degree. But so far I have no idea how to get
mbcs output for chars that exist in the code page and utf-8 or
backslashreplace for those that do not.
Also: is wide unicode output somehow possible with PythonWin, at least
in certain cases? By WM_SETTEXT, ...SETITEM ... tricks?

On Linux there is some improvement after setting
site.py/encoding='utf-8'. Still, the locale-sensitive encoding on ttys
should be tolerant (replace mode) by default.

Robert

PS:

This guy is also somewhat angry about the current situation:
http://blog.ianbicking.org/do-i-hate-unicode-or-do-i-hate-ascii.html

GvR felt safe with 'ascii' for "future improvements" like utf-8:
http://mail.python.org/pipermail/python-dev/2002-March/020962.html

My suggestions:
* Win/Linux: guessing at least 'mbcs' on Win and 'utf-8' on Linux for
site.encoding would be well worth the improvement step. Or provide a
prominent function (not the fragile sitexxxx.py interface) to change it.
(The current solution is very unportable and takes new programmers a very
long time to understand.)
And/or: make tty print somehow tolerant/char-replacing.
* PythonWin: always use 'mbcs' as the default encoding in win32 functions
(mbcs_encode is tolerant/replacing in itself), or make the encoding
tolerant/char-replacing.
And: add xxxW functions or even automatic unicode switching for the
major output functions (SetWindowText, SetItem, DrawText, ...).

gregarican

Jan 10, 2006, 2:39:25 PM
Robert wrote:

> (windows or linux console)
>
> >>> print u'\u034a'
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "C:\PYTHON23\lib\encodings\cp850.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\u034a' in position 0: character maps to <undefined>


Are you certain that this is a valid unicode character? Checking other
values (like \u0020 which is a blank space) seems to work okay. What
does \u034A represent?

Robert

Jan 10, 2006, 3:09:17 PM

Yes, it's delivered by the filesystem:
>>> glob.glob(u'test/*')[3]
u'sytest3\\\u041f\u043e\u0448\u0443\u043a.txt'

u'\u043a' is cyrillic: к

No matter; I guess no (small) system can know all unicode ranges in use
worldwide. The real problem is to get a smooth, smart and tolerant setup
by default, not a mixup of 4 codecs and (most bothersome) intolerant
exception breaks on simple tty/win output.

How can this be done best, and most tolerantly across
platforms/(Python) installations?

Robert

Fredrik Lundh

Jan 10, 2006, 3:03:34 PM
to pytho...@python.org

> Are you certain that this is a valid unicode character? Checking other
> values (like \u0020 which is a blank space) seems to work okay. What
> does \u034A represent?

>>> import unicodedata
>>> unicodedata.name(u"\u034A")
'COMBINING NOT TILDE ABOVE'

(space is a valid CP850 character, combining not tilde above is not).
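
The same check can be done for any character and code page by attempting the encoding. A minimal sketch for Python 3, where can_encode is a hypothetical helper and cp866 is added only as an example of a code page that does contain Cyrillic:

```python
import unicodedata

def can_encode(char, encoding):
    """True if `char` has a mapping in `encoding`."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for char in (u"\u0020", u"\u034a", u"\u043a"):
    print("U+%04X %-28s cp850=%s cp866=%s" % (
        ord(char), unicodedata.name(char),
        can_encode(char, "cp850"), can_encode(char, "cp866")))
```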

</F>

Robert

Jan 10, 2006, 3:46:08 PM

I tried around to get a tolerant print/PythonWin setup. It seems I can
live acceptably with this:

modifying site.py/encoding to 'mbcs' on Win and 'utf-8' or 'latin-1' or
the locale on Linux, and/or (more important) doing this at startup:

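# tolerant unicode output ... #
import sys
_stdout=sys.stdout
if sys.platform=='win32' and not sys.modules.has_key('pywin.framework.startup'):
    # plain Windows console (not PythonWin): use the console's own encoding;
    # .encoding can be missing or None, e.g. when output is redirected
    _stdoutenc=getattr(_stdout,'encoding',None) or sys.getdefaultencoding()
    class StdOut:
        def write(self,s):
            _stdout.write(s.encode(_stdoutenc,'backslashreplace'))
    sys.stdout=StdOut()
elif sys.platform.startswith('linux'):
    # Linux tty: use the locale's encoding, which may be None in the C locale
    import locale
    _stdoutenc=locale.getdefaultlocale()[1] or sys.getdefaultencoding()
    class StdOut:
        def write(self,s):
            _stdout.write(s.encode(_stdoutenc,'backslashreplace'))
    sys.stdout=StdOut()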


Fragile tricks... and pain on each project and Python installation.
Shouldn't something like that (or 'replace'), or a prominent
switch function for such behaviour, be the default for Python: output
the maximum, not the minimum?
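
For readers on Python 3, the effect of the wrapper class above can be sketched with io.TextIOWrapper and its errors argument instead of a custom class. This is an illustration under that assumption, not the setup used in this thread; tolerant_text_stream is a hypothetical name, and a BytesIO buffer stands in for a cp850 console.

```python
import io

def tolerant_text_stream(binary_stream, encoding, errors="backslashreplace"):
    """Wrap a byte stream so that unencodable characters are escaped, not fatal."""
    return io.TextIOWrapper(binary_stream, encoding=encoding, errors=errors)

# Stand-in for a cp850 console: collect the bytes that would be written.
console = io.BytesIO()
out = tolerant_text_stream(console, "cp850")
out.write(u"sytest3\\\u041f\u043e\u0448\u0443\u043a.txt")
out.flush()
print(console.getvalue())
```

For the real console, Python 3.7 and later offer sys.stdout.reconfigure(errors="backslashreplace"), and the PYTHONIOENCODING environment variable accepts an error handler after a colon (for example cp850:backslashreplace).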

Robert

Neil Hodgson

Jan 10, 2006, 4:23:51 PM
Robert:

> u'sytest3\\\u041f\u043e\u0448\u0443\u043a.txt'
>
> u'\u043a' is cyrillic: к
>
> no matter, I guess no (small) system can know all unicode ranges in use
> worldwide. The real problem is: to get a smooth, smart and tolerant setup
> by default - not a mixup of 4 codecs and (most bothersome) intolerant
> exception-breaks on simple tty-/win-outputs.

PythonWin did have some Unicode support but I think Mark Hammond was
discouraged by bugs. In pythonwin/__init__.py there is a setting
is_platform_unicode = 0 with a commented out real test for Unicode on
the next line. Change this to 1 and restart and you may see

>>> x = u'sytest3\\\u041f\u043e\u0448\u0443\u043a.txt'
>>> print x
sytest3\Пошук.txt
>>>

This is dependent on using fonts that contain the required
characters. Tested on Windows XP SP2 with PythonWin build 204.

Neil

Robert

Jan 11, 2006, 9:21:56 AM

Neil Hodgson wrote:
> Robert:

> PythonWin did have some Unicode support but I think Mark Hammond was
> discouraged by bugs. In pythonwin/__init__.py there is a setting
> is_platform_unicode = 0 with a commented out real test for Unicode on
> the next line. Change this to 1 and restart and you may see
>
> >>> x = u'sytest3\\\u041f\u043e\u0448\u0443\u043a.txt'
> >>> print x
> sytest3\Пошук.txt
> >>>

Thanks for that hint. But I found that it is still not consistent, or even
buggy:

After "is_platform_unicode = <auto>", scintilla displays some unicode
as you showed, but the win32 functions (e.g. MessageBox) still do not
pass wide unicode through. And pasting/inserting/parsing in scintilla
doesn't work correctly:

PythonWin 2.3.5 (#62, Feb 8 2005, 16:23:02) [MSC v.1200 32 bit (Intel)] on win32.
Portions Copyright 1994-2004 Mark Hammond (mham...@skippinet.com.au) - see 'Help/About PythonWin' for further copyright information.


>>> x = u'sytest3\\\u041f\u043e\u0448\u0443\u043a.txt'
>>> print x
sytest3\Пошук.txt

>>> print "sytest3\Пошук.txt"
sytest3\?????.txt

!!!

--------

Then I tried to do more utf-8 in __init__.py:

default_platform_encoding = "utf-8" #"mbcs" # Will it ever ...this?
default_scintilla_encoding = "utf-8" # Scintilla _only_ supports this ATM

Pasting around in scintilla then works correctly. But MessageBox then
shows the raw utf-8 encoded chars. Even German umlauts can no longer be
displayed on my machine, and when opening document files with chars
above 128, PythonWin breaks (because the files are not valid utf-8
streams, I guess):

>>> Traceback (most recent call last):
  File "C:\PYTHON23\Lib\site-packages\pythonwin\pywin\scintilla\document.py", line 27, in OnOpenDocument
    text = f.read()
  File "C:\Python23\lib\codecs.py", line 380, in read
    return self.reader.read(size)
  File "C:\Python23\lib\codecs.py", line 253, in read
    return self.decode(self.stream.read(), self.errors)[0]
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 19983: unexpected code byte
win32ui: OnOpenDocument() virtual handler (<bound method SyntEditDocument.OnOpenDocument of <pywin.framework.editor.color.coloreditor.SyntEditDocument instance at 0x00E356E8>>) raised an exception
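
The failure above is what strict utf-8 decoding does with a lone byte such as 0xa9. One common workaround is to try utf-8 first and fall back to a legacy code page. The following is a sketch for Python 3, not PythonWin's code; decode_with_fallback is a hypothetical helper, and cp1252 is assumed as the fallback only because it is the locale encoding mentioned earlier in the thread.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each encoding strictly; as a last resort use latin-1, which never fails."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return data.decode("latin-1"), "latin-1"

text, used = decode_with_fallback(b"Copyright \xa9 2004")
print(used, repr(text))
```

The fallback is a guess about the file's real encoding, so text decoded this way can still be wrong even though no exception is raised.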


Thus the result is: no combination provides a real improvement so far.
Wide unicode in the win32 functions is evidently not possible at all. I am
switching back to the original setup.

I guess I have to write special C code for my major wide unicode needs,
especially listctrl SetItem and TextOut stuff...

Or does anybody know of existing wide unicode functions/C code
parallel to the normal pywin32?

Robert

Thomas Heller

Jan 11, 2006, 10:06:05 AM
"Robert" <kxro...@googlemail.com> writes:

>
> Guess I have to create special C-code for my major wide unicode needs -
> especially listctrl-SetItem and TextOut-Stuff...
>
> Or does anybody know of some existing wide-unicode functions/C-code
> parallel to normal pywin32?

You could use ctypes to access and call the ...W functions directly.

Thomas

Neil Hodgson

Jan 11, 2006, 5:40:22 PM
Robert:

> After "is_platform_unicode = <auto>", scintilla displays some unicode
> as you showed. but the win32-functions (e.g. MessageBox) still do not
> pass through wide unicode.

Win32 issues are better discussed on the python-win32 mailing list
which is read by more of the people interested in working on this library.
http://mail.python.org/mailman/listinfo/python-win32
Patches that improve MessageBox in particular or larger sets of
functions in a general way are likely to be welcomed.

Neil

Robert

Jan 12, 2006, 3:35:46 AM

OK. I have no patches as of now, maybe later. I played with
Heller's ctypes for my urgent needs. That works correctly with unicode,
like this:

>>> import ctypes
>>> ctypes.windll.user32.MessageBoxW(0,u'\u041f\u043e\u0448\u0443\u043a.txt',0,0)
1

My recommendation for the general style of unicode integration in win32
in future:
* output functions should dispatch automatically on unicode parameters in
order to use the appropriate xxxW functions
* input functions (which are used much less frequently in apps!) should
accept an additional unicode=1 parameter (e.g.
SetWindowText(unicode=1); please no extra xxxW functions!), so that one
can easily virtualize apps with something like
xyfunc(...,unicode=ucflag)
* or: input functions should also switch to unicode automatically when a
significant string parameter of the call is unicode, same as with the
filesystem functions in normal Python. Example: win32api.FindFiles(u"*")
is the same as FindFiles("*",unicode=1) and is better than FindFilesW("*")

Thus existing ansi apps can be converted to unicode-aware apps with
minimum extra effort.
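
The dispatch idea in the list above can be sketched in plain Python. Everything here is hypothetical: dispatch_on_text_type is an illustrative helper, and the two find_files_* functions are stand-ins that only report which variant ran; they are not pywin32 functions. The sketch treats text (str in Python 3) as the unicode case and bytes as the ansi case.

```python
def dispatch_on_text_type(wide_func, ansi_func):
    """Build a wrapper that picks the wide or ANSI variant from its first argument."""
    def wrapper(text, *args, **kwargs):
        if isinstance(text, bytes):
            return ansi_func(text, *args, **kwargs)
        return wide_func(text, *args, **kwargs)
    return wrapper

# Stand-ins for a pair such as FindFilesA/FindFilesW.
def find_files_ansi(pattern):
    return ("ansi", pattern)

def find_files_wide(pattern):
    return ("wide", pattern)

find_files = dispatch_on_text_type(find_files_wide, find_files_ansi)
print(find_files(u"*"))
print(find_files(b"*"))
```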

Robert

Thomas Heller

Jan 12, 2006, 4:33:48 AM
"Robert" <kxro...@googlemail.com> writes:

> Neil Hodgson wrote:
>> Robert:
>>
>> > After "is_platform_unicode = <auto>", scintilla displays some unicode
>> > as you showed. but the win32-functions (e.g. MessageBox) still do not
>> > pass through wide unicode.
>>
>> Win32 issues are better discussed on the python-win32 mailing list
>> which is read by more of the people interested in working on this library.
>> http://mail.python.org/mailman/listinfo/python-win32
>> Patches that improve MessageBox in particular or larger sets of
>> functions in a general way are likely to be welcomed.
>
> ok. I have no patches so far as of now - maybe later. Played with
> Heller's ctypes for my urgent needs. That works correct with unicode
> like this:
>
>>>> import ctypes
>>>> ctypes.windll.user32.MessageBoxW(0,u'\u041f\u043e\u0448\u0443\u043a.txt',0,0)
> 1

FYI, if you assign the argtypes attribute for ctypes functions, the
ascii/unicode conversion is automatic (if needed).

So after these assignments:

from ctypes import c_int, c_char_p, c_wchar_p

ctypes.windll.user32.MessageBoxW.argtypes = (c_int, c_wchar_p, c_wchar_p, c_int)
ctypes.windll.user32.MessageBoxA.argtypes = (c_int, c_char_p, c_char_p, c_int)

both MessageBoxA and MessageBoxW can be called with either ansi or
unicode strings, and should work correctly. By default the conversion
is done with ('mbcs', 'ignore'), but this can also be changed,
ctypes-wide, with a call to ctypes.set_conversion_mode(encoding,errors).

You have to pass None for the third parameter (if not a string).
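
Put together as a small function, the declaration might look like the sketch below. It assumes Python 3 on Windows, where text is passed to c_wchar_p as str; message_box is a hypothetical wrapper name, and the pointer-sized c_void_p for the window handle is a choice made here rather than the c_int used above.

```python
import ctypes
import sys

def message_box(text, title=None, flags=0):
    """Show a Unicode message box via user32.MessageBoxW (Windows only)."""
    if sys.platform != "win32":
        raise OSError("MessageBoxW is only available on Windows")
    func = ctypes.windll.user32.MessageBoxW
    # Declared as (HWND, LPCWSTR, LPCWSTR, UINT) -> int
    func.argtypes = (ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_wchar_p, ctypes.c_uint)
    func.restype = ctypes.c_int
    # None becomes a NULL pointer for the owner window and for a missing title.
    return func(None, text, title, flags)
```

On Windows, message_box(u'\u041f\u043e\u0448\u0443\u043a.txt') should open a dialog showing the Cyrillic name, provided the dialog font contains those characters.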

Thomas

Robert

Jan 12, 2006, 5:01:06 PM

Thomas Heller wrote:

That is the right style of functionality, consistency and trouble-free
default execution flow that Python and PythonWin have been lacking so far.
Those have no prominent mode-setting function, no mode _tuple_ etc. so
far, and/or their defaults are set to break simple apps doing common tasks.

Only one question: is there a reason to have 'ignore' instead of 'replace'
as the default? Wouldn't 'replace' give better indications (as, for
example, every web browser does on unknown unicode chars, and even
mbcs_encode in 'strict' mode)? I cannot see any advantage of
'ignore' over 'replace' when strict equality has been given up anyway
...

Robert

Thomas Heller

Jan 13, 2006, 11:14:59 AM
"Robert" <kxro...@googlemail.com> writes:

> Thomas Heller wrote:

>> So after these assignments:
>>
>> ctypes.windll.user32.MessageBoxW.argtypes = (c_int, c_wchar_p,
>> c_wchar_p, c_int)
>> ctypes.windll.user32.MessageBoxA.argtypes = (c_int, c_char_p,
>> c_char_p, c_int)
>>
>> both MessageBoxA and MessageBoxW can be called with either ansi or
>> unicode strings, and should work correctly. By default the conversion
>> is done with ('mbcs', 'ignore'), but this can also be changed,
>> ctypes-wide, with a call to ctypes.set_conversion_mode(encoding,errors).
>
> That is a right style of functionality, consistency and duty-free
> default execution flow which python and pythonwin are lacking so far.
> Those have no prominent mode-setting function, the mode-_tuple_ etc. so
> far and/or defaults are set to break simple apps with common tasks.
>
> Only question: is there a reason to have 'ignore' instead of 'replace'
> as default? Wouldn't 'replace' deliver better indications (as for
> example every Webbrowser does on unknown unicode chars ; (and even
> mbcs_encode in 'strict'-mode) ). I can not see any advantage of
> 'ignore' vs. 'replace' when strict equality anyway has been given up

Hm, I don't know. I try to avoid converting questionable characters at
all, if possible. Then again, the error mode doesn't seem to change
anything with the "mbcs" encoding. WinXP, Python 2.4.2 on the console:

>>> u"abc\u034adef".encode("mbcs", "ignore")
'abc?def'
>>> u"abc\u034adef".encode("mbcs", "strict")
'abc?def'
>>> u"abc\u034adef".encode("mbcs", "error")
'abc?def'
>>>

With "latin-1", it is different:

>>> u"abc\u034adef".encode("latin-1", "ignore")
'abcdef'
>>> u"abc\u034adef".encode("latin-1", "replace")
'abc?def'
>>> u"abc\u034adef".encode("latin-1", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u034a' in position 3: ordinal not in range(256)
>>>

Thomas

Neil Hodgson

Jan 13, 2006, 5:30:49 PM
Thomas Heller:

> Hm, I don't know. I try to avoid converting questionable characters at
> all, if possible. Then, it seems the error-mode doesn't seem to change
> anything with "mbcs" encoding. WinXP, Python 2.4.2 on the console:
>
>>>> u"abc\u034adef".encode("mbcs", "ignore")
> 'abc?def'
>>>> u"abc\u034adef".encode("mbcs", "strict")
> 'abc?def'
>>>> u"abc\u034adef".encode("mbcs", "error")
> 'abc?def'
>
> With "latin-1", it is different:

Yes, there are no 'ignore' or 'strict' modes for mbcs. It is a
simple call to WideCharToMultiByte with no options set. 'ignore' may
need two calls with different values of the default character to allow
identification and removal of default characters as any given default
character may also appear naturally in the output. 'strict' and 'error'
would be easier to implement by checking both the return status and
lpUsedDefaultChar which is set when any default character insertion is done.

The relevant code is in dist\src\Objects\unicodeobject.c.
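
Without touching the C code, a lossy conversion can also be detected from Python by a round trip. This is a different technique from the lpUsedDefaultChar check described above, sketched for Python 3; both function names are hypothetical, and cp1252 is used as a stand-in because the mbcs codec exists only on Windows.

```python
def encodes_losslessly(text, encoding):
    """True if text survives an encode/decode round trip through `encoding`."""
    encoded = text.encode(encoding, "replace")
    return encoded.decode(encoding, "replace") == text

def strict_encode(text, encoding):
    """Emulate 'strict' on top of a codec that silently substitutes characters."""
    if not encodes_losslessly(text, encoding):
        raise ValueError("text is not representable in %s" % encoding)
    return text.encode(encoding, "replace")

print(encodes_losslessly(u"abcdef", "cp1252"))
print(encodes_losslessly(u"abc\u034adef", "cp1252"))
```

The round trip costs a second conversion, but it also catches substitutions other than '?', which a simple search for the default character would miss.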

Neil


Robert

Jan 15, 2006, 4:37:02 AM
Neil Hodgson wrote:

> Thomas Heller:
>
> > Hm, I don't know. I try to avoid converting questionable characters at
> > all, if possible. Then, it seems the error-mode doesn't seem to change
> > anything with "mbcs" encoding. WinXP, Python 2.4.2 on the console:
> >
> >>>> u"abc\u034adef".encode("mbcs", "ignore")
> > 'abc?def'
> >>>> u"abc\u034adef".encode("mbcs", "strict")
> > 'abc?def'
> >>>> u"abc\u034adef".encode("mbcs", "error")
> > 'abc?def'

Yes, I know; that's why 'mbcs' can also be set in site(customize).py to
solve some of the problems discussed. (The site.py mechanism doesn't allow
setting the mode as in ctypes.)

> > With "latin-1", it is different:
>
> Yes, there are no 'ignore' or 'strict' modes for mbcs. It is a
> simple call to WideCharToMultiByte with no options set. 'ignore' may
> need two calls with different values of the default character to allow
> identification and removal of default characters as any given default
> character may also appear naturally in the output. 'strict' and 'error'
> would be easier to implement by checking both the return status and
> lpUsedDefaultChar which is set when any default character insertion is done.

But as discussed, I would not take this as encouragement to dig
for a real 'strict' or 'ignore' for mbcs.
('replace' also creates no invalid chars. Both 'ignore' and 'replace'
change the stream, and equality cannot be preserved in principle.)
It's a political discussion whether the default mode should let things
through or be picky. (Detailed in
<1137059888.5...@o13g2000cwo.googlegroups.com>)

Better to change consciously to ('mbcs','replace').

The default behaviour of Python is a horror for any new programmer and
a reason to quickly go away to mature unicode platforms like Java. It
takes many, many hours to find out how everything hangs together in Python
and how to keep simple print actions from breaking the application
(especially when PythonWin is involved). This creates a lot of anger among
users and programmers. When strict conversion is really required for some
technical strings (very rare), programmers are naturally very aware of it.

Be a new programmer and try:

>>> print '\n'.join( glob.glob(u'test/*') )
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\PYTHON24\lib\encodings\cp850.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 146-150: character maps to <undefined>
>>>

=> "What is this cp850.py, and what does it have to do with my (undefined?)
files? A very nice language, which cannot print by default... go to Java ...
Bye"

My recommendation is to use 'backslashreplace' as the default mode. Nobody
is angry when alien chars are printed in this style.

Robert
