decode NLTK text

barbara dantas

Nov 7, 2012, 2:23:25 PM

to nltk-...@googlegroups.com

Hi, I am using NLTK for an analysis in Portuguese.

The problem is that I am using a corpus that is not from NLTK.

I have already converted it into nltk.text, but it can't 'read' special characters like é, í, ç, ...

So, I really need help here, because if I use my decoded text, which is a string type, I can't do collocations, for example.

How do I decode NLTK text into utf8?

Álvaro Justen [Turicas]

Nov 7, 2012, 8:08:32 PM

to nltk-...@googlegroups.com

Hello, Bárbara, how are you doing?
You probably need to decode your string (using the same codec that was
used to encode it), so you'll have a unicode object that you can then
use with NLTK.
This is just a guess: NLTK needs some improvements related to
Unicode (I'm trying to help with that), but it will probably work.
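
As a minimal sketch of the suggestion above, written for Python 3: take the raw bytes, decode them once with the codec the file was saved in, and only then tokenise. The sample bytes, the `latin-1` codec, and the helper name `decode_corpus` are assumptions for illustration; the actual encoding of the corpus is not stated in this thread.

```python
def decode_corpus(raw_bytes, encoding):
    """Decode raw corpus bytes into a text string using the given codec."""
    # Decoding must use the same codec the file was encoded with;
    # a wrong codec either raises UnicodeDecodeError or yields garbled text.
    return raw_bytes.decode(encoding)


# Hypothetical sample: a Portuguese phrase saved as Latin-1 bytes.
raw = "é difícil começar".encode("latin-1")

text = decode_corpus(raw, "latin-1")
tokens = text.split()  # naive whitespace tokenisation, for illustration only
```

A token list like this is the kind of input that can be wrapped with `nltk.Text(tokens)` before calling methods such as `collocations()`.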

I'm Brazilian and can help you in Portuguese if you want.

[]s

--
Álvaro Justen "Turicas"
http://blog.justen.eng.br http://twitter.com/turicas
http://CursoDeArduino.com.br http://github.com/turicas
+55 21 9898-0141

barbara dantas

Nov 8, 2012, 2:17:33 PM

to nltk-...@googlegroups.com

Hi Álvaro! Yes, I would really like some help. I am using C-ORAL-BRASIL and I can decode my file once. After that, it turns into a mess!

What should I do?
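
One possible cause of text that decodes correctly once and then turns into a mess is a second encode/decode round trip with a different codec. The thread does not confirm that this is what happened here, so the Python 3 sketch below only illustrates the symptom, using a made-up sample word and assumed codecs (`utf-8` and `latin-1`).

```python
# Hypothetical sample: UTF-8 bytes as they might come from a corpus file.
raw = "ação".encode("utf-8")

# Decode exactly once, at the point where the file is read.
text = raw.decode("utf-8")

# Re-encoding and decoding again with a different codec garbles the accents.
garbled = text.encode("utf-8").decode("latin-1")

# The damage is reversible only if the exact wrong step is undone.
repaired = garbled.encode("latin-1").decode("utf-8")
```

Here `garbled` holds `aÃ§Ã£o`, while `repaired` is `ação` again. Keeping the text as a decoded string after the first step, and encoding only when writing output, avoids the second round trip.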

Álvaro Justen [Turicas]

Nov 8, 2012, 2:25:44 PM

to nltk-...@googlegroups.com

I'll answer you in private since this list is English-only.
