[Python-3000] encode function errors="replace", but print() failed, is this a bug?

Decheng Fan

Nov 19, 2008, 9:27:57 PM
to pytho...@python.org
Hi,

Recently I encountered a problem with the str.encode() function.  I used the function like this: s.encode("mbcs", "replace"), expecting it to eliminate all invalid characters.  However, it failed with the following message: UnicodeEncodeError: 'gbk' codec can't encode character '\ue104' in position 4: illegal multibyte sequence

Am I using it in a wrong way or is it a bug?

Platform: Windows Vista SP1, system default code page: 936 (zh-cn).  Program (test.py.txt) in attachment.

>python3 test.py
A
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    print(str.encode("mbcs", "replace").decode("mbcs", "replace"))
  File "C:\Python30\lib\io.py", line 1485, in write
    b = encoder.encode(s)
UnicodeEncodeError: 'gbk' codec can't encode character '\ue104' in position 4: illegal multibyte sequence
>python3 test.py
A
??黹擑魗??夈皠榸
B
>python3 test.py
A
掗??????駋勒锜
B

Thanks,

Decheng (AKA Robbie Mosaic) Fan
test.py.txt

Benjamin Peterson

Nov 19, 2008, 10:50:55 PM
to Decheng Fan, pytho...@python.org
2008/11/19 Decheng Fan <fande...@gmail.com>:

> Hi,
>
> Recently I encountered a problem with the str.encode() function. I used the
> function like this: s.encode("mbcs", "replace"), expecting it will eliminate
> all invalid characters. However it failed with the following message:
> UnicodeEncodeError: 'gbk' codec can't encode character '\ue104' in position
> 4: i
>
> Am I using it in a wrong way or is it a bug?

print() sends its data to sys.stdout, which encodes the data using its
own encoding. If you want to change this behavior, replace sys.stdout
with your own io.TextIOWrapper with 'replace' as the errors argument.
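
A minimal sketch of that suggestion, in case it helps. The helper names
replacing_writer and install_replacing_stdout are made up for illustration
and are not part of the standard library; the sketch also assumes that
sys.stdout has a .buffer attribute, which may not hold when stdout has
been swapped out for some other object.

```python
import io
import sys


def replacing_writer(binary_stream, encoding):
    """Return a text stream that writes '?' for unencodable characters.

    Hypothetical helper: wraps any binary stream in an io.TextIOWrapper
    that uses errors="replace" instead of the default "strict".
    """
    return io.TextIOWrapper(
        binary_stream,
        encoding=encoding,
        errors="replace",
        line_buffering=True,  # flush on newline, like an interactive stdout
    )


def install_replacing_stdout():
    """Swap sys.stdout for a replacing wrapper around the same byte stream.

    Assumes sys.stdout exposes .buffer and .encoding. The original stream
    stays reachable as sys.__stdout__.
    """
    sys.stdout = replacing_writer(sys.stdout.buffer, sys.stdout.encoding)
```

After calling install_replacing_stdout(), a print() of a string containing
'\ue104' should no longer raise on a gbk console; the character comes out
as '?' instead. On Python 3.7 and later, sys.stdout.reconfigure(errors="replace")
is probably the shorter route to the same effect.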


--
Cheers,
Benjamin Peterson
"There's nothing quite as beautiful as an oboe... except a chicken
stuck in a vacuum cleaner."
_______________________________________________
Python-3000 mailing list
Pytho...@python.org
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/python-3000-garchive-63646%40googlegroups.com

Martin v. Löwis

Nov 20, 2008, 1:45:41 AM
to Decheng Fan, pytho...@python.org
> Am I using it in a wrong way or is it a bug?

You are using it in a wrong way. The terminal window, on Windows, does
not use the "mbcs" encoding. Microsoft has two system encodings: the
"ANSI" code page (CP_ACP), called "mbcs" by Python, and the "OEM" code
page (CP_OEMCP). The latter is what the terminal window uses. Python
does not directly expose the Microsoft OEMCP codec; instead, it
determines the terminal's code page, and then carries its own codec for
that code page ("gbk" in your case).

To make your example work, replace "mbcs" with sys.stdout.encoding.
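
As a rough sketch of that change (the function name printable is made up
for illustration, and the fallback to "ascii" when sys.stdout.encoding is
not set is an assumption of this sketch, not something the original
script does):

```python
import sys


def printable(text, encoding=None):
    """Round-trip text through the output encoding, replacing misfits.

    Characters that the target encoding cannot represent come back as '?',
    so the result can be passed to print() without a UnicodeEncodeError
    on a stream that uses that encoding.
    """
    # Assumption: fall back to ASCII if stdout reports no encoding.
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, "replace").decode(encoding)
```

Usage would then be print(printable(s)) in place of the
encode("mbcs", "replace").decode("mbcs", "replace") round trip.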

HTH,
Martin
