I'm having a problem parsing Unicode text -- NLTK treats the accented
(non-ASCII) characters as word boundaries. I followed the example in the
docs (Section 3.3) but got different results:
I had the same issue and resorted to writing my own tokenizing function, which was a good exercise anyway. I did something like this:
>>> import re
>>> token_re = re.compile(r'\w+')
(A regular expression with no special precautions to handle Unicode)
>>> unicode_token_re = re.compile(r'\w+', re.U)
(A regular expression compiled with the Unicode flag)
>>> polish_text = u"Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y\n"
>>> print polish_text
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
>>> print ' | '.join(token_re.findall(polish_text))
Niemc | w | pod | koniec | II | wojny | wiatowej | na | Dolny | l | sk | zosta | y
(Wrong result -- the accented characters were treated as token boundaries!)
>>> print ' | '.join(unicode_token_re.findall(polish_text))
Niemców | pod | koniec | II | wojny | światowej | na | Dolny | Śląsk | zostały
(This is OK)
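For what it's worth, the session above is Python 2. In Python 3, `str` is Unicode and `\w` matches Unicode word characters by default, so the plain pattern already does the right thing -- a minimal sketch (the variable names are just illustrative):

```python
import re

# Same Polish sample as above; in Python 3 this is a Unicode str literal.
polish_text = "Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y"

# No re.U needed: for str patterns, \w is Unicode-aware by default.
tokens = re.findall(r'\w+', polish_text)
print(' | '.join(tokens))
```

This prints `Niemców | pod | koniec | II | wojny | światowej | na | Dolny | Śląsk | zostały`, matching the `re.U` result from the Python 2 session.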