>>> f = open(filename)
>>> data = f.read()
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
data = f.read()
File "C:\Python30\lib\io.py", line 1724, in read
decoder.decode(self.buffer.read(), final=True))
File "C:\Python30\lib\io.py", line 1295, in decode
output = self.decoder.decode(input, final=final)
File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
10442: character maps to <undefined>
The string at position 10442 is something like this:
"query":"0 1»Ý \u2021 0\u201a0 \u2021»Ý ","
So what encoding value am I supposed to give? I tried f =
open(filename, encoding="cp1252") but still got the same error. I guess
Python 3 auto-detects it as cp1252
--
Anjanesh Lekshminarayanan
Thanks a lot ! utf-8 and latin1 were accepted !
Just so you know, latin-1 can decode any sequence of bytes, so it will always
work even if that's not the "real" encoding.
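A minimal sketch of why latin-1 "always works": every one of the 256 possible byte values maps to a character, so no input can hit an undefined slot.

```python
# latin-1 maps every byte 0x00-0xFF to a character, so decoding arbitrary
# bytes never raises -- even when they are not "really" latin-1 text.
data = bytes(range(256))          # all 256 possible byte values
text = data.decode('latin-1')     # never raises UnicodeDecodeError
print(len(text))                  # 256 characters, one per byte

# The round-trip is lossless, but the characters may be nonsense:
assert text.encode('latin-1') == data
```

That the decode succeeds tells you nothing about whether latin-1 is the file's real encoding.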
>
>
> On Thu, Jan 29, 2009 at 12:09 PM, Anjanesh Lekshminarayanan <mail <at>
anjanesh.net> wrote:
> > It does auto-detect it as cp1252- look at the files in the traceback and
> > you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong
> > encoding, try opening it as utf-8 or latin1 and see if that fixes it.
Benjamin, "auto-detect" has strong connotations of the open() call (with mode
including text and encoding not specified) reading some/all of the file and
trying to guess what the encoding might be -- a futile pursuit and not what the
docs say:
"""encoding is the name of the encoding used to decode or encode the file. This
should only be used in text mode. The default encoding is platform dependent,
but any encoding supported by Python can be passed. See the codecs module for
the list of supported encodings"""
On my machine [Windows XP SP3] sys.getdefaultencoding() returns 'utf-8'. It
would be interesting to know
(1) what is produced on Anjanesh's machine
(2) how the default encoding is derived (I would have thought I was a prime
candidate for 'cp1252')
(3) whether the 'default encoding' of open() is actually the same as the
'default encoding' of sys.getdefaultencoding() -- one would hope so but the docs
don't say so.
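A quick way to check both values on any given box (results vary by platform and locale, so none are shown here):

```python
# Two different "defaults" that are easy to conflate:
import locale
import sys

# Encoding used for implicit str<->bytes conversions inside Python:
print(sys.getdefaultencoding())

# Encoding that text-mode open() falls back on when encoding= is omitted:
print(locale.getpreferredencoding())
```

On a typical Western-European Windows box the second call is what yields 'cp1252'.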
> Thanks a lot ! utf-8 and latin1 were accepted !
Benjamin and Anjanesh, Please understand that
any_random_rubbish.decode('latin1') will be "accepted". This is *not* useful
information to be greeted with thanks and exclamation marks. It is merely a
by-product of the fact that *any* single-byte character set like latin1 that
uses all 256 possible bytes cannot fail, by definition; no character "maps to
<undefined>".
> If you want to read the file as text, find out which encoding it actually is.
In one of those encodings, you'll probably see some nonsense characters. If you
are just looking at the file as a sequence of bytes, open the file in binary
mode rather than text. That way, you'll avoid this issue altogether (just make
sure you use byte strings instead of unicode strings).
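The binary-mode suggestion can be sketched like this ('sample.json' is a stand-in name, and the payload is lifted from the bytes later reported in the thread):

```python
# Reading in binary mode sidesteps decoding entirely: the bytes come back
# untouched and no codec is consulted, so no UnicodeDecodeError can occur.
payload = b'"query":"0 1\xc2\xbb\xc3\x9d"'   # bytes as seen in the OP's file
with open('sample.json', 'wb') as f:
    f.write(payload)

with open('sample.json', 'rb') as f:          # 'rb' = binary mode
    raw = f.read()

print(type(raw))        # <class 'bytes'>
assert raw == payload   # byte-for-byte identical; nothing was decoded
```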
In fact, inspection of Anjanesh's report:
"""UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
10442: character maps to <undefined>
The string at position 10442 is something like this :
"query":"0 1»Ý \u2021 0\u201a0 \u2021»Ý"," """
prompts two observations:
(1) there is nothing in the reported string that can be unambiguously identified
as corresponding to "0x9d"
(2) it looks like a small snippet from a Python source file!
Anjanesh, Is it a .py file? If so, is there something like "# encoding: cp1252"
or "# encoding: utf-8" near the start of the file? *Please* tell us what
sys.getdefaultencoding() returns on your machine.
Instead of "something like", please report exactly what is there:
print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))
Cheers,
John
> First of all, you're right, that might be confusing. I was thinking of
auto-detect as in "check the platform and locale and guess what they usually
use". I wasn't thinking of it like the web browsers use it. I think it uses
locale.getpreferredencoding().
You're probably right. I'd forgotten about locale.getpreferredencoding(). I'll
raise a request on the bug tracker to get some more precise wording in the
open() docs.
> On my machine, I get sys.getpreferredencoding() == 'utf-8' and
locale.getdefaultencoding()== 'cp1252'.
sys <-> locale ... +1 long-range transposition typo of the year :-)
> If you check my response to Anjanesh's comment, I mentioned that he should
either find out which encoding it is in particular or he should open the file in
binary mode. I suggested utf-8 and latin1 because those are the most likely
candidates for his file since cp1252 was already excluded.
The OP is on a Windows machine. His file looks like a source code file. He is
unlikely to be creating latin1 files himself on a Windows box. Under the
hypothesis that he is accidentally or otherwise reading somebody else's source
files as data, it could be any encoding. In one package with which I'm familiar,
the encoding is declared as cp1251 in every .py file; AFAICT the only file with
non-ASCII characters is an example script containing his wife's name!
The OP's 0x9d is a defined character in code pages 1250, 1251, 1256, and 1257 --
admittedly all as implausible as the latin1 control character.
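A quick loop makes the point concrete: cp1252 is the odd one out for byte 0x9D, while its sibling code pages (and latin1) decode it to *something*, plausible or not.

```python
# Which single-byte encodings define byte 0x9D? cp1252 does not (hence the
# OP's UnicodeDecodeError); several related code pages do.
for codec in ('cp1250', 'cp1251', 'cp1252', 'cp1256', 'cp1257', 'latin-1'):
    try:
        print(codec, '->', ascii(b'\x9d'.decode(codec)))
    except UnicodeDecodeError:
        print(codec, '-> character maps to <undefined>')
```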
> Looking at a character map, 0x9d is a control character in latin1, so the page
is probably UTF-8 encoded. Thinking about it now, it could also be MacRoman but
that isn't as common as UTF-8.
Late breaking news: I presume you can see two instances of U+00DD (LATIN CAPITAL
LETTER Y WITH ACUTE) in the OP's report
"query":"0 1»Ý \u2021 0\u201a0 \u2021»Ý","
Well, u'\xdd'.encode('utf8') is '\xc3\x9d' ... the Bayesian score for utf8 just
went up a notch.
The preceding character U+00BB (looks like >>) doesn't cause an exception
because 0xBB unlike 0x9D is defined in cp1252.
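The UTF-8 hypothesis above is easy to verify: encoding »Ý (U+00BB, U+00DD) as UTF-8 yields exactly the lead-byte/0x9D pattern that blows up under cp1252.

```python
# If the file is UTF-8, '»' and 'Ý' occupy two bytes each:
print('\xbb\xdd'.encode('utf-8'))            # b'\xc2\xbb\xc3\x9d'

# Misreading those bytes as cp1252 fails precisely on 0x9D, which is
# undefined in that code page -- matching the OP's traceback:
try:
    b'\xc2\xbb\xc3\x9d'.decode('cp1252')
except UnicodeDecodeError as e:
    print(e.reason, 'at byte offset', e.start)
```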
Curiously, looking at the \uxxxx escape sequences:
\u2021 is "double dagger", \u201a is "single low-9 quotation mark" ... what
appears to be the value part of an item in a hard-coded dictionary is about as
comprehensible as the Voynich manuscript.
Trouble with cases like this is as soon as they become interesting, the OP often
snatches somebody's one-liner that "works" (i.e. doesn't raise an exception),
makes a quick break for the county line, and they're not seen again :-)
Cheers,
John
> (2) it looks like a small snippet from a Python source file!
It's a file containing just JSON data - but it has some unicode characters
as well, since it contains data from the web.
> Anjanesh, Is it a .py file
It's a .json file. I have a bunch of these json files which I'm parsing
using the json library.
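Since JSON pulled from the web is almost always UTF-8, passing an explicit encoding to open() sidesteps the cp1252 platform default entirely ('sample.json' is a stand-in file name):

```python
import json

# Write a small UTF-8 JSON file containing the non-ASCII characters
# from the thread, then read it back with the encoding stated explicitly.
with open('sample.json', 'w', encoding='utf-8') as f:
    f.write('{"query": "0 1\u00bb\u00dd \u2021"}')

with open('sample.json', encoding='utf-8') as f:   # not the platform default
    data = json.load(f)

print(data['query'])
```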
> Instead of "something like", please report exactly what is there:
>
> print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))
>>> print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))
b'":42,"query":"0 1\xc2\xbb\xc3\x9d \\u2021 0\\u201a0 \\u2'
> Trouble with cases like this is as soon as they become interesting, the OP often
snatches somebody's one-liner that "works" (i.e. doesn't raise an exception),
makes a quick break for the county line, and they're not seen again :-)
Actually, I moved the files to my Ubuntu PC, which has Python 2.5.2, and it
didn't give the encoding issue. I just couldn't spend that much time on
why a couple of these files had encoding issues in Py3 since I had to
parse a whole lot of files.
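For batch-parsing files whose encodings are uncertain, one sketch is to try UTF-8 first and fall back to cp1252. (The fallback list here is a guess, not something established in the thread, and the function name is made up.)

```python
import json

def load_json_tolerant(path):
    """Try decoding as UTF-8, then cp1252; raise if neither parses."""
    for enc in ('utf-8', 'cp1252'):
        try:
            with open(path, encoding=enc) as f:
                return json.load(f)
        except UnicodeDecodeError:
            continue  # wrong guess; try the next encoding
    raise ValueError('could not decode %s with any known encoding' % path)

# Demo with a file that is valid cp1252 but not valid UTF-8:
with open('demo.json', 'wb') as f:
    f.write(b'{"q": "\xbb"}')   # a lone 0xBB byte is illegal in UTF-8

print(load_json_tolerant('demo.json')['q'])   # '\xbb' i.e. '»'
```

Note that the fallback can silently mis-decode a file that happens to be neither encoding, so it trades correctness guarantees for throughput.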