I read a string from a UTF-8 file:
fichierLaTeX = codecs.open(sys.argv[1], "r", "utf-8")
s = fichierLaTeX.read()
fichierLaTeX.close()
I can then print the string without error with 'print s'.
Next I parse this string:
def parser(s):
    i = 0
    while i < len(s):
        if s[i:i+1] == '\\':
            i += 1
            if s[i:i+1] == '\\':
                print "backslash"
            elif s[i:i+1] == '%':
                print "pourcentage"
            else:
                if estUnCaractere(s[i:i+1]):
                    motcle = ""
                    while estUnCaractere(s[i:i+1]):
                        motcle += s[i:i+1]
                        i += 1
                    print "mot-clé '"+motcle+"'"
but when I run this code, I get this error:
Traceback (most recent call last):
  File "./versOO.py", line 115, in <module>
    parser(s)
  File "./versOO.py", line 105, in parser
    print "mot-clé '"+motcle+"'"
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
What must I do to solve this?
Thanks!
--
Fabrice DELENTE
>>> "mot-clé" + "mot-clé"
'mot-cl\xc3\xa9mot-cl\xc3\xa9'
>>> u"mot-clé" + u"mot-clé"
u'mot-cl\xe9mot-cl\xe9'
>>> "mot-clé" + u"mot-clé"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
codecs.open().read() returns unicode, but your literals are all bytestrings.
When you mix unicode and str, Python tries to convert the bytestring
to unicode using the ascii codec, which fails for any non-ASCII
character.
Change your string literals to unicode by adding the u-prefix and you should
be OK.
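
A minimal sketch of the same failure and the fix, written for Python 3 (where the conversion has to be spelled out) rather than the Python 2 used above. The names raw, motcle, error and fixed are made up for the illustration; raw stands in for what a Python 2 bytestring literal holds when the source file is UTF-8.

```python
# UTF-8 bytes of a literal like "mot-clé '" (what a Python 2 str would contain).
raw = b"mot-cl\xc3\xa9 '"
# Text as returned by codecs.open(...).read(); the value is hypothetical.
motcle = u"section"

error = None
try:
    # Python 2 performs this ASCII decode implicitly when str and unicode are mixed.
    raw.decode("ascii") + motcle
except UnicodeDecodeError as exc:
    error = exc

print(error)

# With unicode literals on both sides there is nothing left to decode.
fixed = u"mot-cl\xe9 '" + motcle + u"'"
print(fixed)
```

The byte 0xc3 at position 6 is the first of the two bytes that encode é in UTF-8, which is why the error message points there.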
Peter
Thanks, it solved the problem... for a while!
I now need to know whether s[i] gives the next byte or the next character,
when I scan the string s. I've googled pages about python and unicode,
but didn't find a solution to that. I scan the string read from the
file char by char to construct s, but now get the same error when just
trying 'print s'.
Is there a way to tell Python that all strings and characters are to
be treated as UTF-8? I have LC_ALL=en_GB.utf-8 in my shell, but Python
doesn't seem to use this variable.
Thanks!
--
Fabrice DELENTE
Assuming s = fichierLaTeX.read() as from your code snippet, the next
character. When in doubt, check what `type(s)` is; if it's <type
'str'>, indices are in bytes; if it's <type 'unicode'>, indices are in
code points.
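
A small sketch of the difference, written for Python 3, where the two types are called str (text) and bytes. The variable names text and data are only illustrative.

```python
text = u"mot-cl\xe9"          # text string "mot-clé": indexed by code point
data = text.encode("utf-8")   # the same content as UTF-8 bytes: é takes two bytes

print(type(text), len(text))  # length counted in code points
print(type(data), len(data))  # length counted in bytes

print(repr(text[6]))          # the whole character é
print(repr(data[6:7]))        # only the first of the two bytes encoding é
```

Scanning text character by character is therefore safe, while scanning data the same way would split é in two.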
Please give the full stack traceback for your error.
Cheers,
Chris
--
http://blog.rebertia.com
I have taken the easy way out: I read on a page that Python 3 works
in UTF-8 by default, so I downloaded and installed it.
Apart from a few surprises (print is now a function, the rules about
mixing spaces and tabs in indentation are much stricter, and I
guess more is to come :^) everything now works transparently.
Thanks again.
--
Fabrice DELENTE
> I have taken the easy way out, I read on a page that python 3 worked
> by default in UTF-8, so I downloaded and installed it.
Just a quick reminder: UTF-8 is not the same as unicode. Python3 works in
unicode and by default uses UTF-8 to read from or write into files.
Peter
> Just a quick reminder: UTF-8 is not the same as unicode. Python3 works in unicode and by default uses UTF-8 to read from or write into files.
I'm not the OP, but wanted to make sure I was fully understanding your
point.
Are you saying all open() calls in Python that read text files,
automatically convert UTF-8 content to Unicode in the same manner as the
following might when using Python 2.6?
codecs.open( fileName, mode='r', encoding='UTF8', ... )
Thanks for your feedback,
Malcolm
> Are you saying all open() calls in Python that read text files,
> automatically convert UTF-8 content to Unicode in the same manner as the
> following might when using Python 2.6?
>
> codecs.open( fileName, mode='r', encoding='UTF8', ... )
That's what I meant to say, but it's not actually true.
Quoting http://docs.python.org/py3k/library/functions.html#open
"""
open(file, mode='r', buffering=None, encoding=None, errors=None,
newline=None, closefd=True)
[...]
encoding is the name of the encoding used to decode or encode the file. This
should only be used in text mode. The default encoding is platform dependent
(whatever locale.getpreferredencoding() returns), but any encoding supported
by Python can be used. See the codecs module for the list of supported
encodings.
"""
So it just happened to be UTF-8 on my machine.
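
To avoid depending on the platform default, the encoding can be passed explicitly. A minimal sketch for Python 3; the file name exemple.tex and the variable names are made up for the illustration.

```python
import locale
import os
import tempfile

# The default that open() falls back to; this varies between machines.
print(locale.getpreferredencoding())

# Hypothetical scratch file, created only for this example.
path = os.path.join(tempfile.mkdtemp(), "exemple.tex")

# Naming the encoding makes the result independent of the locale.
with open(path, "w", encoding="utf-8") as f:
    f.write(u"mot-cl\xe9")

with open(path, "r", encoding="utf-8") as f:
    contenu = f.read()    # decoded text

with open(path, "rb") as f:
    octets = f.read()     # raw bytes as stored on disk

print(repr(contenu), repr(octets))
```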
Peter
>> Are you saying all open() calls in Python that read text files,
>> automatically convert UTF-8 content to Unicode in the same manner as the
>> following might when using Python 2.6?
>>
>> codecs.open( fileName, mode='r', encoding='UTF8', ... )
> That's what I meant to say, but it's not actually true.
Thanks for the clarification.
It sounds like Python 3 has unified the standard library open() function
and the codecs.open() into a single function?
In other words, would it be accurate to say that in Python 3, there is
no longer a need to use codecs.open()?
Any idea if the above applies to Python 2.7?
Thank you Peter!
Malcolm
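
For what it's worth, a sketch of one way to write this so that it runs on both lines: io.open() has been available since Python 2.6 and takes the same encoding argument as the Python 3 built-in open(). The file name and variable names below are made up, and the snippet is meant to be run under Python 3.

```python
import io
import os
import tempfile

# Hypothetical scratch file, created only for this example.
path = os.path.join(tempfile.mkdtemp(), "exemple.txt")

# io.open() accepts encoding= directly, so codecs.open() is not needed here.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\xe9")

with io.open(path, "r", encoding="utf-8") as f:
    texte = f.read()

# True on Python 3, where the built-in open() is io.open;
# on Python 2 the built-in open() is a different function.
same_function = io.open is open
print(repr(texte), same_function)
```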
I wish it were, though.
> Quoting http://docs.python.org/py3k/library/functions.html#open
>
> """
> open(file, mode='r', buffering=None, encoding=None, errors=None,
> newline=None, closefd=True)
>
> [...]
>
> encoding is the name of the encoding used to decode or encode the file. This
> should only be used in text mode. The default encoding is platform dependent
> (whatever locale.getpreferredencoding() returns), but any encoding supported
> by Python can be used. See the codecs module for the list of supported
> encodings.
> """
>
> So it just happened to be UTF-8 on my machine.
Unfortunately, it is not on US Windows.
Terry Jan Reedy