Problems with Unicode in IDLE

Uncle Bruce

May 1, 2009, 8:12:52 PM

to nltk-users

I've started working through the NLTK book, but I've gotten bogged down
in Chapter 3, where it gets into Unicode and UTF-8 encodings.

I tried entering the line: "#-*- coding: utf-8 -*-"

but it doesn't seem to work. If I paste a character with a high
code point into the input line, all I get is a complaint about invalid
input.

Word, Toolbox, and other programs work fine using UTF-8 encoding.
What exactly do I need to do to get IDLE to use UTF-8?

(I'm running 64-bit Windows Vista.)

Uncle Bruce in Toronto

Steven Bird

May 2, 2009, 3:30:43 AM

to nltk-...@googlegroups.com

2009/5/2 Uncle Bruce <bruc...@rogers.com>:

> Word, Toolbox, and other programs work fine using UTF-8 encoding.
> What exactly do I need to do to get IDLE to use UTF-8?

Please see comp.lang.python about this:

http://groups.google.com/group/comp.lang.python/search?group=comp.lang.python&q=idle+unicode

Uncle Bruce

May 2, 2009, 9:20:17 AM

to nltk-users

I tried your link, but the notes there left me even more confused. I
did check out a few things that were mentioned in the notes, though.
1) my OS is VISTA64; Python 2.5.4; IDLE 1.2.4
2) sys.stdin.encoding is 'cp1252' (seems to be the standard Windows
code page for single byte encodings)
3) my Options / General / Default Source Encoding is set to UTF-8
4) my Base Editor Font is Doulos SIL
5) my background is decades in programming, but zero in Python and
zero in the UNIX world.
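The cp1252 value in item 2 may be the relevant detail: that code page has a slot for ð (U+00F0) but none for a character such as ǣ (U+01E3). The following is only a sketch of that difference, written in Python 3 syntax rather than the Python 2.5 used in this thread; the helper name fits_in is made up for illustration, and whether IDLE applies exactly this check is an assumption.

```python
# Sketch in Python 3 syntax; the thread itself uses Python 2.5.
# fits_in is a hypothetical helper, not part of IDLE or Python.

def fits_in(text, encoding):
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

eth = "\u00f0"        # ð, a character inside the cp1252 code page
ae_macron = "\u01e3"  # ǣ, a character outside the cp1252 code page

print(fits_in(eth, "cp1252"))
print(fits_in(ae_macron, "cp1252"))
print(fits_in(ae_macron, "utf-8"))
```

If the shell converts typed input to the console encoding before running it, a character that fails this test would have no representation and could be rejected.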

The IDLE window seems to handle Unicode characters that fit in a single byte OK:
>>> a = u'\xf0'
>>> print a
ð
works fine.

>>> b = 'ð'
>>> print b
ð
works fine.

But as soon as I use characters past single byte codepoint values,
IDLE rejects the input:
>>> c = 'ǣ'
Unsupported characters in input

These results seem to be independent of whether or not I have executed
the line:
"#-*- coding: utf-8 -*-"

As a first step, I need to figure out how to get IDLE to accept a wide
range of linguistic characters on the input line, and I haven't found
the magic formula to make it do that yet.
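One workaround that keeps the input line pure ASCII is to spell the character with an escape sequence or by its Unicode name. This is a minimal sketch in Python 3 syntax (in Python 2 the literal would need the u prefix, as in u'\u01e3'); it sidesteps the input problem rather than fixing IDLE.

```python
import unicodedata

# ASCII-only spelling of ǣ (U+01E3); in Python 2 this would be u"\u01e3".
c = "\u01e3"

# The same character, looked up by its official Unicode name.
same = unicodedata.lookup("LATIN SMALL LETTER AE WITH MACRON")

print(unicodedata.name(c))
print(hex(ord(c)))
print(c == same)
```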

Any help will be appreciated.

Uncle Bruce, still frustrated in Toronto


Javier Pueyo

May 2, 2009, 10:21:05 AM

to nltk-...@googlegroups.com

On Sat, 02-05-2009 at 06:20 -0700, Uncle Bruce wrote:

> But as soon as I use characters past single byte codepoint values,
> IDLE rejects the input:
> >>> c = 'ǣ'
> Unsupported characters in input

I have never used IDLE, but I installed it and tried your examples on my
Linux machine. The ǣ example worked well after I took a look at my
IDLE settings and changed the default IDLE interface font to one
containing those Unicode characters (Times New Roman, for example). I
have no idea whether IDLE works the same way on Windows, but
since it seems to be an IDLE issue, it might help to post a question
on their user list.

--J

Noorhan Abbas

May 2, 2009, 12:47:07 PM

to nltk-...@googlegroups.com

Hello,

I am using Unicode with IDLE. IDLE will handle Unicode properly if you create a new file and save your code there.

For instance, I used Idle with the Arabic language in the following manner:

- Open a new file

- Import the codecs module

- You can use codecs.encode(string, "UTF-8") or codecs.decode(string, "UTF-8")
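The steps above might look roughly like this. It is only a sketch in Python 3 syntax, not Nora's actual code: the file name sample.txt and the temporary directory are placeholders, and io.open stands in for whatever file-reading call was really used.

```python
import codecs
import io
import os
import tempfile

text = "\u0643"  # ك, ARABIC LETTER KAF, written as an ASCII-only escape

# Convert between text and UTF-8 bytes with the codecs module.
raw = codecs.encode(text, "utf-8")
back = codecs.decode(raw, "utf-8")

# Round-trip the text through a UTF-8 file (placeholder path).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)
with io.open(path, "r", encoding="utf-8") as f:
    from_file = f.read()

print(len(raw))
print(back == text)
print(from_file == text)
```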

I hope this is of some help.

Good luck,

Nora


Uncle Bruce

May 2, 2009, 5:04:44 PM

to nltk-users

I guess my first question is:
Have you ever been able to enter an Arabic character e.g. "ك" directly
into the IDLE 'command line'? I.e. make it part of your code, even if
only part of a unicode string constant.

I understand that I can read files, process them and write them out
again - but can I use high codepoint characters in my code without
having to convert them to ugly looking hexadecimal sequences?
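For what it is worth, the coding declaration is meant for source files rather than for the interactive prompt: a script saved as UTF-8 with that line at the top can contain such characters literally. The sketch below, in Python 3 syntax, writes a tiny script to a temporary file and runs it; the file name literals_demo.py is a placeholder, and whether IDLE 1.2.4 on Vista saves and runs such a file cleanly is not something this example verifies.

```python
import io
import os
import runpy
import tempfile

# A tiny script with a coding declaration and literal non-ASCII characters.
source = (
    "# -*- coding: utf-8 -*-\n"
    "c = u'ǣ'\n"
    "k = u'ك'\n"
)

# Save it as UTF-8 (placeholder file name) and run it as a script.
path = os.path.join(tempfile.mkdtemp(), "literals_demo.py")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(source)

namespace = runpy.run_path(path)

# Print code points rather than the characters themselves, so the
# output does not depend on the console encoding.
print(hex(ord(namespace["c"])))
print(hex(ord(namespace["k"])))
```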

Bruce in Toronto


Noorhan Abbas

May 2, 2009, 5:09:57 PM

to nltk-...@googlegroups.com

Well, no, I couldn't do that... it has to be through files.

Nora


Uncle Bruce

May 2, 2009, 5:22:02 PM

to nltk-users

Perhaps the Python interpreter can't handle anything outside of the
Latin-1 character set. I'll post a separate query to ask about that
(after I do a bit more reading in the help file).

Bruce

