[nltk-users] Tokenizing French text/Unicode representations

LeChevalier

May 25, 2010, 6:51:53 AM

to nltk-users

Hello everyone,

I am currently working on French texts using the NLTK to analyze them
in scope of a research project. Tokenizing them, I get the well-known
Unicode representations of diacritics, etc.:

>>> path = nltk.data.find("C:\Python26\LeHorla1.txt")
>>> lines = codecs.open(path, encoding="latin1").readlines()
>>> line = lines[0]
>>> print line.encode("unicode_escape")
Quelle journ\xe9e admirable! J'ai pass\xe9 toute la matin\xe9e
\xe9tendu sur l'herbe!
>>> print line
Quelle journée admirable! J'ai passé toute la matinée étendu sur
l'herbe!

>>> tokens = PunktWordTokenizer().tokenize(line)
>>> tokens
[u'Quelle', u'journ\xe9e', u'admirable', u'!', u'J', u"'ai",
u'pass\xe9', u'toute', u'la', u'matin\xe9e', u'\xe9tendu', u'sur', u'l',
u"'herbe", u'!']

Is there any way to display the missing characters in this list of
tokens? My operating system and locale are set up to render Latin-1
encoded and UTF-8 encoded characters.

Any help or suggestions?

Thanking you in anticipation,
LeChevalier

Peter Ljunglöf

May 25, 2010, 2:10:16 PM

to nltk-...@googlegroups.com

Hi,

On 25 May 2010, at 12:51, LeChevalier wrote:

>>>> tokens = PunktWordTokenizer().tokenize(line)
>>>> tokens
> [u'Quelle', u'journ\xe9e', u'admirable', u'!', u'J', u"'ai",
> u'pass\xe9', u'toute', u'la', u'matin\xe9e', u'\xe9tendu', u'sur', u'l',
> u"'herbe", u'!']
>
> Is there any way to display the missing characters in this list of
> tokens? My operating system and locale are set up to render Latin-1
> encoded and UTF-8 encoded characters.

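The crux is that you can't, at least not directly: Python lists (and sets and dicts) are not pretty-printed by the print statement. (This has to do with how the __str__ method of the built-in container types is defined: it shows the repr() of each element, and the repr() of a unicode string escapes non-ASCII characters.) So instead of: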

>>> print tokens

[u'Quelle', u'journ\xe9e', u'admirable', u'!', u'J', u"'ai", u'pass\xe9', u'toute', u'la', u'matin\xe9e', u'\xe9tendu', u'sur', u'l', u"'herbe", u'!']

...you can print each unicode word separately:

>>> for w in tokens: print w,
Quelle journée admirable ! J 'ai passé toute la matinée étendu sur l 'herbe !

...or you can join the list into one big (unicode) string:

>>> print " ".join(tokens)
Quelle journée admirable ! J 'ai passé toute la matinée étendu sur l 'herbe !

>>> print "[" + "] [".join(tokens) + "]"
[Quelle] [journée] [admirable] [!] [J] ['ai] [passé] [toute] [la] [matinée] [étendu] [sur] [l] ['herbe] [!]
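If a list-like display is wanted, the same join idea can be wrapped in a small helper. This is only a sketch: show_tokens is a hypothetical name, not part of NLTK, and the token list is a shortened copy of the one above.

```python
# show_tokens is a hypothetical helper for illustration, not an NLTK function.
def show_tokens(tokens):
    """Build a list-like display from the tokens themselves rather than
    from their repr(), so non-ASCII characters are not escaped."""
    return u"[" + u", ".join(tokens) + u"]"

# Shortened version of the token list from the question.
tokens = [u'Quelle', u'journ\xe9e', u'admirable', u'!']
display = show_tokens(tokens)

# Printing `display` on a terminal that can render the characters
# shows the accents instead of \xe9 escapes.
```

The result is an ordinary unicode string, so it can be printed or written to a file like any of the joined strings above.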

/Peter

PS. Here's a good explanation of the difference between __str__ and __repr__:

http://stackoverflow.com/questions/1436703/difference-between-str-and-repr-in-python
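The __str__ versus __repr__ behaviour described above can be seen with a tiny sketch. The Word class is a hypothetical wrapper invented only for this illustration; it makes visible which of the two methods a list calls on its elements.

```python
# Word is a hypothetical class, used only to show which method gets called.
class Word(object):
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return "str:" + self.text

    def __repr__(self):
        return "repr:" + self.text

w = Word("abc")

single = str(w)      # the object on its own uses __str__
in_list = str([w])   # the list formats each element with __repr__

print(single)
print(in_list)
```

The same thing happens with unicode strings inside a list: the list shows the repr() of each string, which is where the \xe9 escapes come from.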

Reply all

Reply to author

Forward

0 new messages