Using Unicode text with nltk.text.Text()

Arthit Suriyawongkul

Apr 24, 2009, 11:27:44 AM

to nltk-...@googlegroups.com

Dear NLTK,

I'm trying to put a list containing Unicode strings into nltk.Text(),
but an error occurs:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/text.py", line 283, in __init__
    self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

I tried starting the Python shell with "python -U", but that doesn't work either.
Does anyone have any idea about this?

Or should I work around it by changing the code in text.py to something
like u" ".join(....)?

cheers,
Art

P.S. Thanks to Steven Bird for fixing the nltk-users link on
http://www.nltk.org/ :)

吴轲

Apr 24, 2009, 11:38:51 AM

to nltk-...@googlegroups.com

2009/4/24 Arthit Suriyawongkul <art...@gmail.com>

> Dear NLTK,
>
> I'm trying to put a list containing Unicode strings into nltk.Text(),
> but an error occurs:
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/text.py", line 283, in __init__
>     self.name = " ".join(map(str, tokens[:8])) + "..."
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

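In Python before 3.0, `str' constructs a byte string instead of a Unicode string, and calling it on a Unicode object encodes implicitly with the default (ASCII) codec, which is what fails here. You probably need to either encode your tokens yourself before they are passed to `str', or use `unicode' instead of `str'.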
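
A minimal sketch of the two behaviours described above, written so that it runs on both Python 2 and Python 3. The function names `implicit_ascii_name` and `unicode_name` are hypothetical and are not part of NLTK; the first only imitates the implicit ASCII encode that Python 2's `str' performs on a Unicode object, and the second stays in Unicode throughout, which is the "use `unicode' instead of `str'" option.

```python
# Thai tokens written as escapes so this file stays pure ASCII.
tokens = [u"\u0e44\u0e17\u0e22", u"\u0e20\u0e32\u0e29\u0e32"]


def implicit_ascii_name(tokens):
    # Imitates Python 2's str() on unicode objects: an ASCII encode
    # of every token before joining (hypothetical helper).
    return b" ".join(t.encode("ascii") for t in tokens[:8]) + b"..."


def unicode_name(tokens):
    # Joins the tokens as Unicode, so no codec is involved at all
    # (hypothetical helper).
    return u" ".join(tokens[:8]) + u"..."


try:
    implicit_ascii_name(tokens)
    failed = False
except UnicodeEncodeError:
    # Same exception type as in the traceback from the original post.
    failed = True

name = unicode_name(tokens)
```

With non-ASCII tokens the first helper raises UnicodeEncodeError, while the second returns a Unicode string without touching any codec.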

Arthit Suriyawongkul

Apr 25, 2009, 1:11:34 AM

to nltk-...@googlegroups.com

On Apr 24, 22:38, 吴轲 <ngu....@gmail.com> wrote:
> 2009/4/24 Arthit Suriyawongkul <art...@gmail.com>

>
> In Python before 3.0, `str' constructs a byte string instead of a Unicode
> string, so you probably need to either encode your tokens before applying
> them to `str' or use `unicode' instead of `str'.

Thank you, it works!

Instead of appending u"xxx", I append u"xxx".encode("utf-8", "replace").
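
A rough sketch of that workaround, assuming the tokens start out as Unicode strings. The helpers `encode_tokens` and `decode_tokens` are hypothetical names used only for illustration; the NLTK call itself is left out so the snippet runs without NLTK installed.

```python
def encode_tokens(tokens, encoding="utf-8"):
    # Turn each Unicode token into a byte string before it is handed on
    # (hypothetical helper, mirroring the .encode() call above).
    return [t.encode(encoding, "replace") for t in tokens]


def decode_tokens(byte_tokens, encoding="utf-8"):
    # Reverse step, for when results need to be shown as text again
    # (hypothetical helper).
    return [b.decode(encoding) for b in byte_tokens]


tokens = [u"\u0e44\u0e17\u0e22", u"text"]
byte_tokens = encode_tokens(tokens)   # these would go to nltk.Text()
roundtrip = decode_tokens(byte_tokens)

# Byte tokens are measured in bytes, not characters.
char_len = len(tokens[0])
byte_len = len(byte_tokens[0])
```

One thing to keep in mind with this approach: each Thai character here takes three bytes in UTF-8, so anything that depends on token length will count bytes rather than characters until the tokens are decoded again.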

thanks,
Art

--
The only thing the people will have to protect themselves
is the mercy and the strength of the state,
not the rights and liberties wrapped around them.
(yikes)
http://bit.ly/4DXd
