Hello everyone,
I am currently analyzing French texts with NLTK as part of a research
project. When I tokenize them, diacritics and other non-ASCII characters
show up as Unicode escape sequences:
>>> import codecs
>>> import nltk
>>> from nltk.tokenize.punkt import PunktWordTokenizer
>>> path = nltk.data.find(r"C:\Python26\LeHorla1.txt")
>>> lines = codecs.open(path, encoding="latin1").readlines()
>>> line = lines[0]
>>> print line.encode("unicode_escape")
Quelle journ\xe9e admirable! J'ai pass\xe9 toute la matin\xe9e
\xe9tendu sur l'herbe!
>>> print line
Quelle journée admirable! J'ai passé toute la matinée étendu sur
l'herbe!
>>> tokens = PunktWordTokenizer().tokenize(line)
>>> tokens
[u'Quelle', u'journ\xe9e', u'admirable', u'!', u'J', u"'ai", u'pass\xe9',
u'toute', u'la', u'matin\xe9e', u'\xe9tendu', u'sur', u'l', u"'herbe", u'!']
Is there any way to display the accented characters themselves, rather
than their escape sequences, in this list of tokens? My operating system
and locale are set up to render both Latin-1 and UTF-8 encoded characters.
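One possible workaround, sketched under the assumption that the escapes come from Python 2 showing the repr() of each list element rather than from the tokenizer itself: join the tokens into a single unicode string and print that instead of the list. The helper name format_tokens is hypothetical, and the tokens are typed in by hand so that the sketch does not depend on NLTK.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function  # keeps the sketch valid on Python 2.6+ and 3


def format_tokens(tokens):
    """Hypothetical helper: join unicode tokens into one unicode string.

    Displaying a list shows repr() of each element, and on Python 2 the
    repr() of a unicode string escapes non-ASCII characters. Printing a
    single unicode string does not go through repr().
    """
    return u"[" + u", ".join(tokens) + u"]"


if __name__ == "__main__":
    # A few tokens from the example above, typed in by hand (no NLTK needed).
    tokens = [u"Quelle", u"journ\xe9e", u"admirable", u"!"]
    print(format_tokens(tokens))
```

This only helps if the console encoding can represent the characters, which should hold given the locale setup described above. The tokens themselves are unchanged; only the way they are displayed differs.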
Any help or suggestions?
Thanking you in anticipation,
LeChevalier