#!/usr/bin/python
import sys
import codecs
fileHandle = codecs.open(sys.argv[1], 'r', 'utf-8')
fileString = fileHandle.read()
print fileString
if I call it from a Bash shell like this
$ ./test.py testfile.utf8.txt
it works just fine, but when I try to pipe the output to another process
("|") or into a file (">"), e.g. like this
$ ./test.py testfile.utf8.txt | cat
I get an error:
Traceback (most recent call last):
File "./test.py", line 6, in ?
print fileString
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
position 538: ordinal not in range(128)
I absolutely don't know what's the problem here, can you help?
Thanks,
Rehceb
Using codecs.open, when you read the file you get Unicode. When you
print the Unicode object, it is encoded using your terminal's default
encoding (utf8 I presume?)
But when you redirect the output, it is no longer connected to your
terminal, so no encoding can be assumed, and the default encoding
(ascii) is used.
Try this line at the top:
print "stdout:", sys.stdout.encoding, "default:", sys.getdefaultencoding()
I get stdout: ANSI_X3.4-1968 default: ascii normally and stdout: None
default: ascii when redirected.
You have to encode the Unicode object explicitly:
print fileString.encode("utf-8")
(or any other suitable one; I said utf-8 just because you read the
input file using that)
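
A minimal sketch of the difference between the implicit ASCII fallback and
an explicit encode. The sample string and the helper name to_bytes are
made up for illustration; the code is written so that it should run under
both Python 2.7 and Python 3, which is why it avoids the print statement
used in the thread.

```python
# Sample text containing U+00E4, the character named in the traceback.
text = u"B\xe4r"

def to_bytes(text, encoding="utf-8"):
    # Hypothetical helper: encode explicitly instead of relying on
    # whatever encoding stdout happens to report.
    return text.encode(encoding)

# Roughly what happens when stdout has no encoding and ASCII is used.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The explicit encode succeeds regardless of where the output goes.
encoded = to_bytes(text)
```

In the original Python 2 script, the corresponding change is the
print fileString.encode("utf-8") line suggested above.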
--
Gabriel Genellina
> You have to encode the Unicode object explicitly: print
> fileString.encode("utf-8")
> (or any other suitable one; I said utf-8 just because you read the input
> file using that)
Thanks! That's a nice little stumbling block for a newbie like me ;) Is
there a way to make utf-8 the default encoding for every string, so that
I do not have to encode each string explicitly?
You can make sys.stdout encode each string with UTF-8, with
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
Make sure then that *all* strings that you print
are Unicode strings.
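
As a rough illustration of what the wrapper does, the sketch below applies
codecs.getwriter to an in-memory byte stream instead of sys.stdout, so the
result can be inspected. The io.BytesIO object is only a stand-in for a
redirected stdout and is an assumption of this example, not part of the
original script.

```python
import codecs
import io

# Stand-in for a byte-oriented stdout that has been redirected.
fake_stdout = io.BytesIO()

# Same call as above, applied to the stand-in stream.
writer = codecs.getwriter("utf-8")(fake_stdout)

# Unicode strings written to the wrapper are encoded as UTF-8 on the way out.
writer.write(u"B\xe4r\n")
written = fake_stdout.getvalue()
```

The wrapper expects Unicode input. In Python 2, a byte string passed to it
would first be decoded implicitly as ASCII, which fails for non-ASCII bytes;
hence the advice to print only Unicode strings.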
HTH,
Martin
BTW, any reason why an EncodedFile can't act like a Unicode writer/reader object
if one of its encodings is explicitly set to None?
IMO the docs don't make it clear that getwriter() is the correct API to use
here. I've wanted to write "sys.stdout = codecs.EncodedFile(sys.stdout,
'utf-8')" more than once.
Georg
AFAIU, that's not the intention of EncodedFile: instead, it is
meant to do recoding. I find it a pretty useless API, and
would rather see it go away than be enhanced.
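
To make "recoding" concrete, here is a small sketch of EncodedFile
translating between two byte encodings. The in-memory stream and the choice
of Latin-1 as the file encoding are assumptions made for the example.

```python
import codecs
import io

# Stand-in for a byte-oriented file whose contents are Latin-1.
raw = io.BytesIO()

# Callers exchange UTF-8 bytes with the wrapper; the file stores Latin-1.
recoder = codecs.EncodedFile(raw, data_encoding="utf-8",
                             file_encoding="latin-1")

# Writing UTF-8 bytes stores the Latin-1 equivalent in the file.
recoder.write(u"B\xe4r".encode("utf-8"))
stored = raw.getvalue()

# Reading goes the other way: Latin-1 in the file comes back as UTF-8.
raw.seek(0)
read_back = recoder.read()
```

Both sides of the wrapper deal in encoded bytes, never Unicode objects,
which is why it does not fit the sys.stdout use case discussed here.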
Regards,
Martin