
Unicode problem


Rehceb Rotkiv

Apr 7, 2007, 2:09:30 PM
Please have a look at this little script:

#!/usr/bin/python
import sys
import codecs
fileHandle = codecs.open(sys.argv[1], 'r', 'utf-8')
fileString = fileHandle.read()
print fileString

If I call it from a Bash shell like this

$ ./test.py testfile.utf8.txt

it works just fine, but when I try to pipe the output to another process
("|") or into a file (">"), e.g. like this

$ ./test.py testfile.utf8.txt | cat

I get an error:

Traceback (most recent call last):
  File "./test.py", line 6, in ?
    print fileString
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 538: ordinal not in range(128)

I have absolutely no idea what the problem is here. Can you help?

Thanks,
Rehceb

Gabriel Genellina

Apr 7, 2007, 3:46:49 PM
Rehceb Rotkiv wrote:

With codecs.open, you get Unicode when you read the file. When you
print the Unicode object, it is encoded using your terminal's default
encoding (utf8, I presume?).
But when you redirect the output, it is no longer connected to your
terminal, so no encoding can be assumed, and the default encoding
(ascii) is used.

Try this line at the top:
print "stdout:", sys.stdout.encoding, "default:", sys.getdefaultencoding()
I get "stdout: ANSI_X3.4-1968 default: ascii" normally and
"stdout: None default: ascii" when redirected.

You have to encode the Unicode object explicitly:
print fileString.encode("utf-8")
(or any other suitable encoding; I said utf-8 just because you read the
input file using that)

--
Gabriel Genellina
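
The explicit-encoding advice above can be illustrated with a minimal sketch. It is not code from the thread: the sample text and the variable names are made up for illustration, and it is written so that it should run on both Python 2.6+ and Python 3.3+ (the thread itself is about Python 2).

```python
# Hypothetical sample text; u'\xe4' is the character named in the traceback.
text = u"B\xe4r"

# With no known stream encoding, Python 2 falls back to ASCII,
# which cannot represent u'\xe4'. This reproduces that failure directly.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly produces plain bytes, which can safely be
# written to a pipe or a file regardless of the terminal settings.
utf8_bytes = text.encode("utf-8")
```

Here utf8_bytes is an ordinary byte string, so printing it in Python 2 no longer involves any implicit encoding step.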

Rehceb Rotkiv

Apr 8, 2007, 5:13:55 PM
On Sat, 07 Apr 2007 12:46:49 -0700, Gabriel Genellina wrote:

> You have to encode the Unicode object explicitly: print
> fileString.encode("utf-8")
> (or any other suitable one; I said utf-8 just because you read the input
> file using that)

Thanks! That's a nice little stumbling block for a newbie like me ;) Is
there a way to make utf-8 the default encoding for every string, so that
I do not have to encode each string explicitly?

"Martin v. Löwis"

Apr 8, 2007, 11:21:08 PM
to Rehceb Rotkiv
> Thanks! That's a nice little stumbling block for a newbie like me ;) Is
> there a way to make utf-8 the default encoding for every string, so that
> I do not have to encode each string explicitly?

You can make sys.stdout encode each string with UTF-8, with

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

Make sure then that *all* strings that you print
are Unicode strings.

HTH,
Martin
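
As a rough sketch of how the codecs.getwriter wrapper suggested above behaves, the example below wraps an in-memory byte stream instead of sys.stdout so the result can be inspected. The io.BytesIO stand-in and the variable names are assumptions made for illustration; they are not part of the original suggestion.

```python
import codecs
import io

# Stand-in for a byte-oriented output stream such as Python 2's sys.stdout.
raw = io.BytesIO()

# codecs.getwriter returns a StreamWriter class for the codec;
# calling it with a stream wraps that stream.
writer = codecs.getwriter("utf-8")(raw)

# Unicode strings written to the wrapper are encoded to UTF-8 on the way out.
writer.write(u"B\xe4r\n")

written = raw.getvalue()
```

The wrapper encodes whatever it is given, which is why the advice to pass it only Unicode strings matters.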

Georg Brandl

Apr 9, 2007, 3:46:13 AM
to pytho...@python.org
Martin v. Löwis wrote:

BTW, any reason why an EncodedFile can't act like a Unicode writer/reader object
if one of its encodings is explicitly set to None?

IMO the docs don't make it clear that getwriter() is the correct API to use
here. I've wanted to write "sys.stdout = codecs.EncodedFile(sys.stdout,
'utf-8')" more than once.

Georg

"Martin v. Löwis"

Apr 9, 2007, 9:33:22 AM
to Georg Brandl
> BTW, any reason why an EncodedFile can't act like a Unicode
> writer/reader object
> if one of its encodings is explicitly set to None?

AFAIU, that's not the intention of EncodedFile: instead, it is
meant to do recoding. I find it a pretty useless API, and
would rather see it go away than be enhanced.

Regards,
Martin
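
To make the "recoding" point above concrete, here is a small sketch of codecs.EncodedFile translating bytes from one encoding to another. The choice of latin-1 as the data encoding and the io.BytesIO stand-in for a real file are assumptions made purely for illustration.

```python
import codecs
import io

# Stand-in for a real binary file.
raw = io.BytesIO()

# Bytes written to the wrapper are decoded as latin-1 (data_encoding)
# and re-encoded as UTF-8 (file_encoding) before reaching the file.
recoder = codecs.EncodedFile(raw, data_encoding="latin-1", file_encoding="utf-8")

# b"\xe4" is the latin-1 byte for the character u'\xe4'.
recoder.write(b"B\xe4r")

recoded = raw.getvalue()
```

Both sides of the wrapper deal in bytes, so it converts between two encodings rather than between Unicode strings and bytes, which is the job codecs.getwriter does.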
