I'm having a problem parsing Unicode text -- NLTK treats the accented
(non-ASCII) characters as word boundaries. I followed the example in the
docs (Section 3.3) but got different results:
I had the same issue and resorted to writing my own tokenizing function, which was a good exercise anyway. I did something like this:
>>> import re
>>> token_re = re.compile(r'\w+')
(A regular expression with no special precautions to handle Unicode)
>>> unicode_token_re = re.compile(r'\w+', re.U)
(A regular expression compiled with the Unicode flag)
>>> polish_text = u"Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y\n"
>>> print polish_text
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
>>> print ' | '.join(token_re.findall(polish_text))
Niemc | w | pod | koniec | II | wojny | wiatowej | na | Dolny | l | sk | zosta | y
(Wrong result -- the accented characters were treated as token boundaries!)
>>> print ' | '.join(unicode_token_re.findall(polish_text))
Niemców | pod | koniec | II | wojny | światowej | na | Dolny | Śląsk | zostały
(This is OK)
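For what it's worth, the session above is Python 2. In Python 3, `str` is Unicode and `\w` matches Unicode word characters by default, so the plain pattern already does the right thing -- a minimal sketch (the variable names are just illustrative):

```python
import re

# Same Polish sample as above; in Python 3 this is a Unicode str literal.
polish_text = "Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y"

# No re.U needed: for str patterns, \w is Unicode-aware by default.
tokens = re.findall(r'\w+', polish_text)
print(' | '.join(tokens))
```

This prints `Niemców | pod | koniec | II | wojny | światowej | na | Dolny | Śląsk | zostały`, matching the `re.U` result from the Python 2 session.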