I'm working with some text in Spanish and I have a problem.
I'm studying from chapter 3 of Natural Language Processing with Python.
The first 'sentence' of my text (which is saved in a .txt file) is:
'De cómo don Quijote de la Mancha volvió a sus desvancimientos de
caballero andante, y de la venida a su lugar del Argamesilla de
ciertos caballeros granadinos'
I ran this code (similar to section 3.3 of the book):
###############################
from __future__ import division
import nltk, re, pprint
import codecs
import unicodedata

camino = 'D:/Quijote.txt'
# Decode the file into unicode strings, one per line
lines = codecs.open(camino, encoding='utf-16').readlines()
una = lines[0]
print una
una = una.lower()
# Convert the unicode string to an ASCII byte string with escape sequences
una = una.encode('unicode_escape')
print una
print nltk.word_tokenize(una)
###############################
It produces the following output:
###############################
De cómo don Quijote de la Mancha volvió a sus desvancimientos de
caballero andante, y de la venida a su lugar del Argamesilla de
ciertos caballeros granadinos
de c\xf3mo don quijote de la mancha volvi\xf3 a sus desvancimientos de
caballero andante, y de la venida a su lugar del argamesilla de
ciertos caballeros granadinos\r\n
['de', 'c', '\\', 'xf3mo', 'don', 'quijote', 'de', 'la', 'mancha',
'volvi', '\\', 'xf3', 'a', 'sus', 'desvancimientos', 'de',
'caballero', 'andante', ',', 'y', 'de', 'la', 'venida', 'a', 'su',
'lugar', 'del', 'argamesilla', 'de', 'ciertos', 'caballeros',
'granadinos', '\\', 'r', '\\', 'n']
###############################
As you can see, the last instruction, nltk.word_tokenize(una), doesn't
tokenize the text correctly: the accented characters and the line ending
are split into separate tokens.
Do you know how I can fix it?
Thanks a lot.
--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To post to this group, send email to nltk-...@googlegroups.com.
To unsubscribe from this group, send email to nltk-users+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nltk-users?hl=en.
\ufeffde c\xf3mo don quijote de la mancha volvi\xf3 a sus
desvancimientos de caballero andante, y de la venida a su lugar del
argamesilla de ciertos caballeros granadinos\r\n
['\\', 'ufeffde', 'c', '\\', 'xf3mo', 'don', 'quijote', 'de', 'la',
'mancha', 'volvi', '\\', 'xf3', 'a', 'sus', 'desvancimientos', 'de',
'caballero', 'andante', ',', 'y', 'de', 'la', 'venida', 'a', 'su',
'lugar', 'del', 'argamesilla', 'de', 'ciertos', 'caballeros',
'granadinos', '\\', 'r', '\\', 'n']
I think this is worse than before.
:(
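The leading \ufeff in this output looks like a byte order mark (BOM) that was decoded as an ordinary character. One guess is that the file was re-saved as UTF-8 with a BOM and then read with encoding='utf-8'; that is an assumption, since the changed code is not shown. A minimal sketch of the difference, written for Python 3 and using a made-up in-memory sample instead of the real file:

```python
import codecs

# Hypothetical sample: the first words of the line, encoded as UTF-8 with a
# BOM, standing in for the contents of the re-saved file.
raw = codecs.BOM_UTF8 + u"De cómo don Quijote".encode("utf-8")

# The plain utf-8 codec keeps the BOM as the character U+FEFF.
with_bom = raw.decode("utf-8")

# The utf-8-sig codec consumes a leading BOM if one is present.
without_bom = raw.decode("utf-8-sig")

print(repr(with_bom[0]))
print(without_bom.lower())
```

If this is what happened, passing encoding='utf-8-sig' to codecs.open should remove the stray '\\', 'ufeffde' tokens at the start of the list. The remaining '\\' tokens come from the unicode_escape step, not from the file encoding.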
On 7 Jan, 15:11, Rafael Calsaverini <rafael.calsaver...@gmail.com>
wrote:
> I think you need to use a utf8 encoding and a tokenizer that can process
> utf8 text. I had the same problem with Portuguese text and I created my own
> regexp tokenizer. There's a manual on how to do this on the NLTK website, in
> the "how to" section.
> ---
> Rafael Calsaverini
> Dep. de Física Geral, Sala 336
> Instituto de Física - Universidade de São Paulo
>
> rafael.calsaver...@gmail.com
> http://stoa.usp.br/calsaverini/weblog
> CEL: (11) 7525-6222
> USP: (11) 3091-6803
>
Cheers.
-Steven Bird
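For reference, the regexp tokenizer idea mentioned earlier in the thread can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code anyone in the thread actually used: it is written for Python 3, the sample line is a shortened copy of the sentence from the first message, and the names TOKEN_RE and tokenize are made up for the example. The main point is that the text stays a Unicode string the whole way through, with no encode('unicode_escape') step before tokenizing.

```python
import re
import unicodedata

# Shortened sample line from the thread, as a Unicode string with the
# trailing "\r\n" that readlines() leaves in place.
una = u"De cómo don Quijote de la Mancha volvió a sus desvancimientos\r\n"

# What the original code did: unicode_escape rewrites each non-ASCII
# character (and the line ending) as a literal backslash sequence, which a
# tokenizer then splits into pieces such as '\\' and 'xf3mo'.
escaped = una.lower().encode("unicode_escape")

# Illustrative pattern (an assumption, not NLTK's own tokenizer): a run of
# word characters, or a single character that is neither word nor space.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def tokenize(text):
    """Return lowercased word and punctuation tokens from a Unicode string."""
    # NFC keeps accented letters as single code points so \w matches them.
    text = unicodedata.normalize("NFC", text)
    # strip() drops the trailing "\r\n" before matching.
    return TOKEN_RE.findall(text.strip().lower())

tokens = tokenize(una)
print(tokens)
```

With the sample above, 'cómo' and 'volvió' each come out as one token and no '\\', 'r', 'n' tokens appear at the end. NLTK's RegexpTokenizer should accept a pattern of this kind as well, though that is not checked here.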