Unicode Parsing Problem


Heidi

5 Nov 2009, 10:29:19
to nltk-users
I'm having a problem tokenizing Unicode text -- NLTK treats the non-ASCII
characters as word boundaries. I followed the example in the docs
(Section 3.3) but got different results:

>>> path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
>>> lines = codecs.open(path,encoding='latin2').readlines()
>>> line = lines[2]
>>> print line.encode('unicode_escape')
Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y\n
>>> print line
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały

>>> line.find(u'zosta\u0142y')
54
>>> line = line.lower()
>>> print line
niemców pod koniec ii wojny światowej na dolny śląsk, zostały

>>> nltk.word_tokenize(line)
[u'niemc', u'\xf3', u'w', u'pod', u'koniec', u'ii', u'wojny',
u'\u015b', u'wiatowej', u'na', u'dolny', u'\u015b', u'l', u'\u0105',
u'sk', u',', u'zosta', u'\u0142', u'y']
>>>

I just downloaded the latest nltk (2.0b6) and re-ran the test, with
the same results. I'm running Python v. 2.6.2 under Mac OS X 10.5.8.

Any help or suggestions?

Thanks,
Heidi

Emmanuel Ruellan

5 Nov 2009, 11:56:02
to nltk-...@googlegroups.com, heidi...@gmail.com
2009/11/5 Heidi <heidi...@gmail.com>

> I'm having a problem tokenizing Unicode text -- NLTK treats the non-ASCII
> characters as word boundaries. I followed the example in the docs
> (Section 3.3) but got different results:

I had the same issue and resorted to writing my own tokenizing function, which was a good exercise anyway. I did something like this:

>>> import re

>>> token_re = re.compile(r'\w+')

(A regular expression with no special precautions to handle Unicode)

>>> unicode_token_re = re.compile(r'\w+', re.U)

(A regular expression with the Unicode flag)

>>> polish_text = u"Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y\n"

>>> print polish_text
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały

>>> print ' | '.join(token_re.findall(polish_text))

Niemc | w | pod | koniec | II | wojny | wiatowej | na | Dolny | l | sk | zosta | y

(Wrong result!)

>>> print ' | '.join(unicode_token_re.findall(polish_text))

Niemców | pod | koniec | II | wojny | światowej | na | Dolny | Śląsk | zostały

(This is OK)
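
You can also do the same thing with NLTK's own RegexpTokenizer, which
(at least in the versions I've checked) compiles its pattern with
re.UNICODE by default -- worth verifying on 2.0b6:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+')  # relies on the default Unicode flag
>>> print ' | '.join(tokenizer.tokenize(polish_text))
Niemców | pod | koniec | II | wojny | światowej | na | Dolny | Śląsk | zostały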

Robert Felty

5 Nov 2009, 12:05:47
to nltk-...@googlegroups.com
I have discovered that this seems to be a problem with the current
default tokenizer (the Treebank tokenizer). The Punkt tokenizer seems to
work OK:

>>> from nltk.tokenize import PunktWordTokenizer
>>> text = u'alles ist sch\xf6n'
>>> PunktWordTokenizer().tokenize(text)
[u'alles', u'ist', u'sch\xf6n']
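
If wordpunct_tokenize is available in your version, it is built on the
same regexp machinery and should also keep accented characters inside
tokens, assuming its default flags include re.UNICODE:

>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(text)  # should not split on the ö
[u'alles', u'ist', u'sch\xf6n']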

Rob

Heidi

5 Nov 2009, 12:26:41
to nltk-users
Thank you! I feel less (aarrrggh someone stop me, too late!) punk'd
now.
