>>> f = open(filename)
>>> data = f.read()
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
data = f.read()
File "C:\Python30\lib\io.py", line 1724, in read
decoder.decode(self.buffer.read(), final=True))
File "C:\Python30\lib\io.py", line 1295, in decode
output = self.decoder.decode(input, final=final)
File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
10442: character maps to <undefined>
The string at position 10442 is something like this:
"query":"0 1»Ý \u2021 0\u201a0 \u2021»Ý ","
So what encoding value am I supposed to give? I tried f =
open(filename, encoding="cp1252") but still got the same error. I guess
Python 3 auto-detects it as cp1252
--
Anjanesh Lekshminarayanan
Thanks a lot ! utf-8 and latin1 were accepted !
Just so you know, latin-1 can decode any sequence of bytes, so it will always
work even if that's not the "real" encoding.
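A minimal sketch of why latin-1 "always works": every one of the 256 possible byte values maps to a character, so no input can hit an undefined slot.

```python
# latin-1 maps every byte 0x00-0xFF to a character, so decoding arbitrary
# bytes never raises -- even when they are not "really" latin-1 text.
data = bytes(range(256))          # all 256 possible byte values
text = data.decode('latin-1')     # never raises UnicodeDecodeError
print(len(text))                  # 256 characters, one per byte

# The round-trip is lossless, but the characters may be nonsense:
assert text.encode('latin-1') == data
```

That the decode succeeds tells you nothing about whether latin-1 is the file's real encoding.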
>
>
> On Thu, Jan 29, 2009 at 12:09 PM, Anjanesh Lekshminarayanan <mail <at>
anjanesh.net> wrote:
> > It does auto-detect it as cp1252- look at the files in the traceback and
> > you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong
> > encoding, try opening it as utf-8 or latin1 and see if that fixes it.
Benjamin, "auto-detect" has strong connotations of the open() call (with mode
including text and encoding not specified) reading some/all of the file and
trying to guess what the encoding might be -- a futile pursuit and not what the
docs say:
"""encoding is the name of the encoding used to decode or encode the file. This
should only be used in text mode. The default encoding is platform dependent,
but any encoding supported by Python can be passed. See the codecs module for
the list of supported encodings"""
On my machine [Windows XP SP3] sys.getdefaultencoding() returns 'utf-8'. It
would be interesting to know
(1) what is produced on Anjanesh's machine
(2) how the default encoding is derived (I would have thought I was a prime
candidate for 'cp1252')
(3) whether the 'default encoding' of open() is actually the same as the
'default encoding' of sys.getdefaultencoding() -- one would hope so but the docs
don't say so.
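A quick way to check both values on any given box (results vary by platform and locale, so none are shown here):

```python
# Two different "defaults" that are easy to conflate:
import locale
import sys

# Encoding used for implicit str<->bytes conversions inside Python:
print(sys.getdefaultencoding())

# Encoding that text-mode open() falls back on when encoding= is omitted:
print(locale.getpreferredencoding())
```

On a typical Western-European Windows box the second call is what yields 'cp1252'.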
> Thanks a lot ! utf-8 and latin1 were accepted !
Benjamin and Anjanesh, Please understand that
any_random_rubbish.decode('latin1') will be "accepted". This is *not* useful
information to be greeted with thanks and exclamation marks. It is merely a
by-product of the fact that *any* single-byte character set like latin1 that
uses all 256 possible bytes cannot fail, by definition; no character "maps to
<undefined>".
> If you want to read the file as text, find out which encoding it actually is.
In one of those encodings, you'll probably see some nonsense characters. If you
are just looking at the file as a sequence of bytes, open the file in binary
mode rather than text. That way, you'll avoid this issue altogether (just make
sure you use byte strings instead of unicode strings).
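The binary-mode suggestion can be sketched like this ('sample.json' is a stand-in name, and the payload is lifted from the bytes later reported in the thread):

```python
# Reading in binary mode sidesteps decoding entirely: the bytes come back
# untouched and no codec is consulted, so no UnicodeDecodeError can occur.
payload = b'"query":"0 1\xc2\xbb\xc3\x9d"'   # bytes as seen in the OP's file
with open('sample.json', 'wb') as f:
    f.write(payload)

with open('sample.json', 'rb') as f:          # 'rb' = binary mode
    raw = f.read()

print(type(raw))        # <class 'bytes'>
assert raw == payload   # byte-for-byte identical; nothing was decoded
```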
In fact, inspection of Anjanesh's report:
"""UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
10442: character maps to <undefined>
The string at position 10442 is something like this :
"query":"0 1»Ý \u2021 0\u201a0 \u2021»Ý"," """
prompts two observations:
(1) there is nothing in the reported string that can be unambiguously identified
as corresponding to "0x9d"
(2) it looks like a small snippet from a Python source file!
Anjanesh, Is it a .py file? If so, is there something like "# encoding: cp1252"
or "# encoding: utf-8" near the start of the file? *Please* tell us what
sys.getdefaultencoding() returns on your machine.
Instead of "something like", please report exactly what is there:
print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))
Cheers,
John
> First of all, you're right, that might be confusing. I was thinking of
auto-detect as in "check the platform and locale and guess what they usually
use". I wasn't thinking of it like the web browsers use it. I think it uses
locale.getpreferredencoding().
You're probably right. I'd forgotten about locale.getpreferredencoding(). I'll
raise a request on the bug tracker to get some more precise wording in the
open() docs.
> On my machine, I get sys.getpreferredencoding() == 'utf-8' and
locale.getdefaultencoding()== 'cp1252'.
sys <-> locale ... +1 long-range transposition typo of the year :-)
> If you check my response to Anjanesh's comment, I mentioned that he should
either find out which encoding it is in particular or he should open the file in
binary mode. I suggested utf-8 and latin1 because those are the most likely
candidates for his file since cp1252 was already excluded.
The OP is on a Windows machine. His file looks like a source code file. He is
unlikely to be creating latin1 files himself on a Windows box. Under the
hypothesis that he is accidentally or otherwise reading somebody else's source
files as data, it could be any encoding. In one package with which I'm familiar,
the encoding is declared as cp1251 in every .py file; AFAICT the only file with
non-ASCII characters is an example script containing his wife's name!
The OP's 0x9d is a defined character in code pages 1250, 1251, 1256, and 1257 --
admittedly all as implausible as the latin1 control character.
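A quick loop makes the point concrete: cp1252 is the odd one out for byte 0x9D, while its sibling code pages (and latin1) decode it to *something*, plausible or not.

```python
# Which single-byte encodings define byte 0x9D? cp1252 does not (hence the
# OP's UnicodeDecodeError); several related code pages do.
for codec in ('cp1250', 'cp1251', 'cp1252', 'cp1256', 'cp1257', 'latin-1'):
    try:
        print(codec, '->', ascii(b'\x9d'.decode(codec)))
    except UnicodeDecodeError:
        print(codec, '-> character maps to <undefined>')
```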
> Looking at a character map, 0x9d is a control character in latin1, so the page
is probably UTF-8 encoded. Thinking about it now, it could also be MacRoman but
that isn't as common as UTF-8.
Late breaking news: I presume you can see two instances of U+00DD (LATIN CAPITAL
LETTER Y WITH ACUTE) in the OP's report
"query":"0 1»Ý \u2021 0\u201a0 \u2021»Ý","
Well, u'\xdd'.encode('utf8') is '\xc3\x9d' ... the Bayesian score for utf8 just
went up a notch.
The preceding character U+00BB (looks like >>) doesn't cause an exception
because 0xBB unlike 0x9D is defined in cp1252.
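The UTF-8 hypothesis above is easy to verify: encoding »Ý (U+00BB, U+00DD) as UTF-8 yields exactly the lead-byte/0x9D pattern that blows up under cp1252.

```python
# If the file is UTF-8, '»' and 'Ý' occupy two bytes each:
print('\xbb\xdd'.encode('utf-8'))            # b'\xc2\xbb\xc3\x9d'

# Misreading those bytes as cp1252 fails precisely on 0x9D, which is
# undefined in that code page -- matching the OP's traceback:
try:
    b'\xc2\xbb\xc3\x9d'.decode('cp1252')
except UnicodeDecodeError as e:
    print(e.reason, 'at byte offset', e.start)
```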
Curiously, looking at the \uxxxx escape sequences:
\u2021 is "double dagger", \u201a is "single low-9 quotation mark" ... what
appears to be the value part of an item in a hard-coded dictionary is about as
comprehensible as the Voynich manuscript.
Trouble with cases like this is as soon as they become interesting, the OP often
snatches somebody's one-liner that "works" (i.e. doesn't raise an exception),
makes a quick break for the county line, and they're not seen again :-)
Cheers,
John
> (2) it looks like a small snippet from a Python source file!
It's a file containing just JSON data - but it has some unicode characters
as well, since it contains data from the web.
> Anjanesh, Is it a .py file
It's a .json file. I have a bunch of these json files which I'm parsing
using the json library.
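Since JSON pulled from the web is almost always UTF-8, passing an explicit encoding to open() sidesteps the cp1252 platform default entirely ('sample.json' is a stand-in file name):

```python
import json

# Write a small UTF-8 JSON file containing the non-ASCII characters
# from the thread, then read it back with the encoding stated explicitly.
with open('sample.json', 'w', encoding='utf-8') as f:
    f.write('{"query": "0 1\u00bb\u00dd \u2021"}')

with open('sample.json', encoding='utf-8') as f:   # not the platform default
    data = json.load(f)

print(data['query'])
```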
> Instead of "something like", please report exactly what is there:
>
> print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))
>>> print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))
b'":42,"query":"0 1\xc2\xbb\xc3\x9d \\u2021 0\\u201a0 \\u2'
> Trouble with cases like this is as soon as they become interesting, the OP often
snatches somebody's one-liner that "works" (i.e. doesn't raise an exception),
makes a quick break for the county line, and they're not seen again :-)
Actually, I moved the files to my Ubuntu PC, which has Python 2.5.2, and it
didn't give the encoding issue. I just couldn't spend that much time on
why a couple of these files had encoding issues in Py3 since I had to
parse a whole lot of files.
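For batch-parsing files whose encodings are uncertain, one sketch is to try UTF-8 first and fall back to cp1252. (The fallback list here is a guess, not something established in the thread, and the function name is made up.)

```python
import json

def load_json_tolerant(path):
    """Try decoding as UTF-8, then cp1252; raise if neither parses."""
    for enc in ('utf-8', 'cp1252'):
        try:
            with open(path, encoding=enc) as f:
                return json.load(f)
        except UnicodeDecodeError:
            continue  # wrong guess; try the next encoding
    raise ValueError('could not decode %s with any known encoding' % path)

# Demo with a file that is valid cp1252 but not valid UTF-8:
with open('demo.json', 'wb') as f:
    f.write(b'{"q": "\xbb"}')   # a lone 0xBB byte is illegal in UTF-8

print(load_json_tolerant('demo.json')['q'])   # '\xbb' i.e. '»'
```

Note that the fallback can silently mis-decode a file that happens to be neither encoding, so it trades correctness guarantees for throughput.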