Hello Ben!
Thank you for looking into this issue :-)
I'm getting the same error message:
>>> text = u'alles ist sch\xf6n'
>>> text = text.encode('utf-8')
>>> words = PunktWordTokenizer().tokenize(text)
>>> text = nltk.Text(words)
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    text = nltk.Text(words)
  File "C:\Program Files (x86)\Python(26)\lib\site-packages\nltk\text.py", line 285, in __init__
    self.name = " ".join(map(str, [x.encode('utf-8') for x in tokens[:8]])) + "..."
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
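If I read the traceback right, the tokens coming out of the tokenizer are byte strings (because of the .encode('utf-8') step), and line 285 of text.py then calls x.encode('utf-8') on each of them. In Python 2, calling .encode() on a byte string first decodes it with the ASCII codec, which blows up on the 0xc3 byte of the 'ö'. A minimal reproduction of what I think is happening (plain Python 2, no NLTK involved):

>>> token = u'sch\xf6n'.encode('utf-8')  # byte string 'sch\xc3\xb6n', like the tokens above
>>> token.encode('utf-8')  # implicit ASCII decode happens first
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)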
On 22 May, 22:04, Ben Martin <g.ben.mar...@gmail.com> wrote:
> Hi Yuliya,
>
> Try converting your text to UTF-8 first. That should work. Your
> first three lines would be:
>
> >>> text = u'alles ist sch\xf6n'
> >>> text = text.encode('utf-8')
> >>> words = PunktWordTokenizer().tokenize(text)
>
> That should get rid of your error.
>
> -Ben
>
> On Fri, May 18, 2012 at 8:46 AM, Yuliya Morozova <miss.yuliya.moroz...@gmail.com> wrote:
> > I would like to tokenize a non-ASCII text and then convert the
> > resulting list of words to nltk.Text (in order to build a
> > concordance etc.)
>
> > As suggested in this discussion:
>
> > https://groups.google.com/group/nltk-users/browse_thread/thread/730d6...
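P.S. For what it's worth: if I skip the encode step entirely and keep the text as unicode, the error does not occur here (Python 2.6, NLTK 2.0 on my machine, so take this as an observation rather than the intended usage):

>>> import nltk
>>> from nltk.tokenize import PunktWordTokenizer
>>> text = u'alles ist sch\xf6n'
>>> words = PunktWordTokenizer().tokenize(text)  # unicode in, unicode tokens out
>>> text = nltk.Text(words)  # encoding unicode tokens inside __init__ works

I am not sure whether this is the right way to do it, though, or whether it will cause problems for the concordance later.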