Py3: Read file with Unicode characters


Gnarlodious

Apr 8, 2010, 10:48:09 AM
Attempting to read a file containing Unicode characters such as ±:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
5007: ordinal not in range(128)

I did succeed by converting all the characters to HTML entities such
as "±", but I want the characters to be the actual font in the
source file. What am I doing wrong? My understanding is that ALL
strings in Py3 are unicode so... confused.

-- Gnarlie

Martin v. Loewis

Apr 8, 2010, 11:14:36 AM
to Gnarlodious
Gnarlodious wrote:
> Attempting to read a file containing Unicode characters such as ±:

> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
> 5007: ordinal not in range(128)
>
> I did succeed by converting all the characters to HTML entities such
> as "±", but I want the characters to be the actual font in the
> source file. What am I doing wrong? My understanding is that ALL
> strings in Py3 are unicode so... confused.

When opening the file, you need to specify the file encoding. If you
don't, it defaults to ASCII (in your situation; the specific default
depends on the environment).
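
To see which default applies on a given machine, a minimal sketch along these lines may help. It assumes the standard locale module, and the variable name default_encoding is only illustrative:

```python
import locale

# Encoding that open() falls back to in text mode when no encoding=
# argument is passed. It is derived from the environment (locale
# settings), which is why the same script can behave differently
# from one machine or shell to another.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If this prints an ASCII-based name (for example when the locale is "C" or unset), any non-ASCII byte in the file will fail to decode unless an encoding is given explicitly.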

Regards,
Martin

Gnarlodious

Apr 8, 2010, 11:52:06 AM
On Apr 8, 9:14 am, "Martin v. Loewis" wrote:

> When opening the file, you need to specify the file encoding.

OK, I had tried this:

open(path, 'r').read().encode('utf-8')

however I get error

TypeError: Can't convert 'bytes' object to str implicitly

I had assumed a Unicode string was a Unicode string, so why is it a
bytes string?

Sorry, doing Unicode in Py3 has really been a challenge.

-- Gnarlie

Martin v. Loewis

Apr 8, 2010, 1:04:38 PM
to Gnarlodious
Gnarlodious wrote:
> On Apr 8, 9:14 am, "Martin v. Loewis" wrote:
>
>> When opening the file, you need to specify the file encoding.
>
> OK, I had tried this:
>
> open(path, 'r').read().encode('utf-8')

No, when *opening* the file, you need to specify the encoding:

open(path, 'r', encoding='utf-8').read()
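
As a rough, self-contained sketch of the difference (the sample text and the temporary file name are made up for illustration, and the file is assumed to be UTF-8 encoded):

```python
import os
import tempfile

text = "tolerance: ±0.5"  # hypothetical sample content

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Write the sample as UTF-8 so there is something to read back.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # On disk, "±" is stored as the two bytes 0xc2 0xb1.
    with open(path, "rb") as f:
        raw = f.read()

    # Decoding happens while reading, so the encoding belongs to open().
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()

# Decoding the same bytes as ASCII fails on the 0xc2 byte.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# str.encode() goes the other way: it turns a str into bytes.
encoded = decoded.encode("utf-8")

print(type(decoded).__name__, type(encoded).__name__, ascii_failed)
```

Here read() already returns a str; calling .encode('utf-8') on it converts that str back into bytes, which is why the earlier attempt ended up with a bytes object.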

> Sorry, doing Unicode in Py3 has really been a challenge.

That's because you need to re-learn some things.

Regards,
Martin

Gnarlodious

Apr 8, 2010, 1:37:56 PM
On Apr 8, 11:04 am, "Martin v. Loewis" wrote:

> That's because you need to re-learn some things.

Apparently so, every little item is a lesson. Thank you.

-- Gnarlie
