Tokenizing in Spanish

López

Jan 7, 2010, 2:07:52 PM

to nltk-users

Hello all,

I'm working with some text in Spanish and I have a problem.

I'm studying from chap. 3 of Natural Language Processing with Python.

The first 'sentence' of my text (which is saved in a .txt file) is:

'De cómo don Quijote de la Mancha volvió a sus desvancimientos de
caballero andante, y de la venida a su lugar del Argamesilla de
ciertos caballeros granadinos'

I'm using the following code (similar to sec. 3.3 of the book):

###############################
from __future__ import division
import nltk, re, pprint
import codecs
import unicodedata

camino = ('D:/Quijote.txt')
lines = codecs.open(camino, encoding='utf-16').readlines()
una=lines[0]
print una
una=una.lower()
una=una.encode('unicode_escape')
print una
print nltk.word_tokenize(una)
###############################

This produces the following output:

###############################
De cómo don Quijote de la Mancha volvió a sus desvancimientos de
caballero andante, y de la venida a su lugar del Argamesilla de
ciertos caballeros granadinos

de c\xf3mo don quijote de la mancha volvi\xf3 a sus desvancimientos de
caballero andante, y de la venida a su lugar del argamesilla de
ciertos caballeros granadinos\r\n

['de', 'c', '\\', 'xf3mo', 'don', 'quijote', 'de', 'la', 'mancha',
'volvi', '\\', 'xf3', 'a', 'sus', 'desvancimientos', 'de',
'caballero', 'andante', ',', 'y', 'de', 'la', 'venida', 'a', 'su',
'lugar', 'del', 'argamesilla', 'de', 'ciertos', 'caballeros',
'granadinos', '\\', 'r', '\\', 'n']
###############################

As you can see, the last instruction, nltk.word_tokenize(una), doesn't
tokenize the text correctly.

Do you know how I can fix it?

Thanks a lot.

Rafael Calsaverini

Jan 7, 2010, 2:11:44 PM

to nltk-...@googlegroups.com

I think you need to use a UTF-8 encoding and a tokenizer that can process UTF-8 text. I had the same problem with Portuguese text and I created my own regexp tokenizer. There's a manual on how to do this on the NLTK website, in the "how to" section.
---
Rafael Calsaverini
Dep. de Física Geral, Sala 336
Instituto de Física - Universidade de São Paulo

rafael.ca...@gmail.com
http://stoa.usp.br/calsaverini/weblog
CEL: (11) 7525-6222
USP: (11) 3091-6803
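
A minimal sketch of the regexp-tokenizer idea mentioned above, assuming Python 3 and only the standard `re` module. The pattern and the names `TOKEN_RE` and `tokenize` are hypothetical and are not taken from the NLTK how-to; the point is only that a Unicode-aware `\w` keeps accented letters inside a word.

```python
import re

# Hypothetical pattern: a run of word characters (optionally hyphenated),
# or a single punctuation mark. re.UNICODE makes \w match letters like "ó".
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]", re.UNICODE)


def tokenize(text):
    """Split a Unicode string into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


sentence = u"De cómo don Quijote de la Mancha volvió a sus desvancimientos de caballero andante, y de la venida"
tokens = tokenize(sentence.lower())
print(tokens)
```

The input here is a decoded Unicode string, not an encoded or escaped one, which is what lets the accented words survive as single tokens.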


López

Jan 7, 2010, 2:36:47 PM

to nltk-users

When I use UTF-8, it produces the following output:

\ufeffde c\xf3mo don quijote de la mancha volvi\xf3 a sus
desvancimientos de caballero andante, y de la venida a su lugar del
argamesilla de ciertos caballeros granadinos\r\n

['\\', 'ufeffde', 'c', '\\', 'xf3mo', 'don', 'quijote', 'de', 'la',
'mancha', 'volvi', '\\', 'xf3', 'a', 'sus', 'desvancimientos', 'de',
'caballero', 'andante', ',', 'y', 'de', 'la', 'venida', 'a', 'su',
'lugar', 'del', 'argamesilla', 'de', 'ciertos', 'caballeros',
'granadinos', '\\', 'r', '\\', 'n']

I think this is worse than before.

:(
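
The leading `\ufeff` in that output is a byte order mark (BOM). A short sketch of one way to drop it, assuming Python 3 and assuming the file really is UTF-8 with a BOM: Python's standard `utf-8-sig` codec strips the mark while decoding. The sample bytes below are built in memory purely for illustration; the same codec name can be passed as the `encoding` argument to `codecs.open`.

```python
import codecs

# Illustrative bytes: a UTF-8 BOM followed by a line ending in \r\n,
# as a Windows editor might save it (an assumption about the file).
raw = codecs.BOM_UTF8 + u"De cómo don Quijote\r\n".encode("utf-8")

# Plain "utf-8" keeps the BOM as the first character, U+FEFF.
with_bom = raw.decode("utf-8")

# "utf-8-sig" removes the BOM if one is present.
without_bom = raw.decode("utf-8-sig")

# strip() also removes the trailing \r\n before tokenizing.
line = without_bom.strip()

print(repr(with_bom[0]))
print(line)
```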


López

Jan 7, 2010, 3:27:30 PM

to nltk-users

A similar question (with an answer) is at:

http://groups.google.co.ve/group/nltk-users/browse_thread/thread/730d60be0baee773/fa82fa26553619e7?hl=es&lnk=gst&q=unicode+problem#fa82fa26553619e7

Cheers.

Steven Bird

Jan 8, 2010, 4:30:55 AM

to nltk-users

You should tokenize the Unicode string, not the encoded version of the
string (which has those backslash-escapes). Please take a closer look
at the examples in section 3.3 of the NLTK book.

-Steven Bird
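
To illustrate the point about backslash-escapes, here is a small sketch, assuming Python 3. A plain `re.findall` call stands in for `nltk.word_tokenize` so the example needs no NLTK install; that substitution is an assumption, and the real tokenizer's output may differ in detail. What it shows is that `encode('unicode_escape')` turns `ó` into the four literal characters `\xf3`, which any tokenizer then sees as punctuation plus letters.

```python
import re

line = u"De cómo don Quijote volvió\r\n"

# What the original script did: escape the string, then tokenize the escapes.
escaped = line.lower().encode("unicode_escape").decode("ascii")
print(escaped)  # the backslashes printed here are literal characters

# Stand-in tokenizer (hypothetical, not nltk.word_tokenize):
# runs of word characters, or single punctuation marks.
pattern = r"\w+|[^\w\s]"
bad_tokens = re.findall(pattern, escaped, re.UNICODE)

# The fix: keep the decoded Unicode string, strip the line ending, tokenize that.
good_tokens = re.findall(pattern, line.lower().strip(), re.UNICODE)

print(bad_tokens)
print(good_tokens)
```

In the script from the first message, this corresponds to dropping the `una.encode('unicode_escape')` line and passing `una` itself to the tokenizer.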
