Converting non-ASCII text to nltk.Text


Yuliya Morozova

May 18, 2012, 11:46:04 AM
to nltk-users
I would like to tokenize a non-ASCII text and then convert the
resulting list of words to nltk.Text (in order to build a concordance
etc.).

As suggested in this discussion:
https://groups.google.com/group/nltk-users/browse_thread/thread/730d60be0baee773/fa82fa26553619e7?hl=es&lnk=gst&q=unicode+problem#fa82fa26553619e7

tokenizing can be done with the Punkt tokenizer.

>>> text = u'alles ist sch\xf6n'
>>> words = PunktWordTokenizer().tokenize(text)

When I try to convert the result to nltk.Text I get this error
message:

>>> words = nltk.Text(words)
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    words = nltk.Text(words)
  File "C:\Program Files (x86)\Python(26)\lib\site-packages\nltk\text.py", line 285, in __init__
    self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 3: ordinal not in range(128)

Thank you very much,
Yuliya.

Ben Martin

May 22, 2012, 2:04:02 PM
to nltk-...@googlegroups.com
Hi Yuliya,

Try converting your text first to utf-8. That should work. Your first 3 lines would be:

>>> text = u'alles ist sch\xf6n'
>>> text = text.encode('utf-8')
>>> words = PunktWordTokenizer().tokenize(text)

That should get rid of your error.

-Ben




Yuliya Morozova

May 23, 2012, 3:30:02 AM
to nltk-users
Hello Ben!
Thank you for looking into this issue:-)
I'm getting the same error message:

>>> text = u'alles ist sch\xf6n'
>>> text = text.encode('utf-8')
>>> words = PunktWordTokenizer().tokenize(text)
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    text = nltk.Text(words)
  File "C:\Program Files (x86)\Python(26)\lib\site-packages\nltk\text.py", line 285, in __init__
    self.name = " ".join(map(str, [x.encode('utf-8') for x in tokens[:8]])) + "..."
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)



Mikhail Korobov

May 23, 2012, 4:22:26 AM
to nltk-...@googlegroups.com
Hi Yuliya,

" ".join(map(str, [x.encode('utf-8') for x in tokens

is definitely a bug in NLTK, because x.encode('utf8') will fail for non-ASCII byte strings (Python will try to decode the byte string to unicode using the 'ascii' codec and then encode the resulting unicode string to utf8). Moreover, str(x.encode('utf8')) is useless because its argument is already a string.

This is partially fixed in recent NLTK (see https://github.com/nltk/nltk/blob/master/nltk/text.py#L298 ): the code should work for utf8 byte strings (but not for unicode strings, because str(unicode_string) will try to encode the unicode string to ascii, and that will fail for non-ASCII strings).

So you may try to update your NLTK (pip install -U nltk) and use Ben's advice.
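
To make the failure mode concrete, here is a small sketch (not from the thread) that performs the implicit decode step explicitly. In Python 2, calling .encode('utf-8') on a byte string silently does an 'ascii' decode first, which is exactly what blows up; the same decode can be written out by hand:

```python
# -*- coding: utf-8 -*-
# A UTF-8 byte string, like the tokens NLTK's old code passed to encode().
data = u'alles ist sch\xf6n'.encode('utf-8')  # contains the bytes 0xc3 0xb6 for \xf6

# Python 2's byte_string.encode('utf-8') implicitly does
# byte_string.decode('ascii').encode('utf-8'); the decode step fails.
try:
    data.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xc3 ...
```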

On Wednesday, May 23, 2012 at 13:30:02 UTC+6, Yuliya Morozova wrote:

Yuliya Morozova

May 26, 2012, 10:06:49 AM
to nltk-users
Thank you!
It works with the example of
u'alles ist sch\xf6n'
How can I read a file into this format?
The usual procedure of open(file).read() does not work.
codecs.open(file,encoding).read() does not work either.
Thank you very much,
Yulia.



Mikhail Korobov

May 26, 2012, 10:53:54 AM
to nltk-...@googlegroups.com
What encoding is your file in?

If it is utf8 then open(filename).read() should work as is.

If your file is e.g. in 'cp1251' encoding then the following should work (note that the second argument to codecs.open is the mode, not the encoding):

with codecs.open(filename, 'rt', 'cp1251') as f:
    utf8_text = f.read().encode('utf8')
    words = PunktWordTokenizer().tokenize(utf8_text) 



On Saturday, May 26, 2012 at 20:06:49 UTC+6, Yuliya Morozova wrote: