[POS-tagging] Portuguese words with accents tagging issue

Monica Neli

May 22, 2016, 5:36:27 AM

to nltk-users

Hello,

I can't get the correct part-of-speech tag when trying to tag words with accents such as 'trânsito' and 'direção'. I trained the tagger with the Mac-morpho corpus as shown below. However, when trying to tag the sentence "trânsito lento na direção da avenida Brasil", I get the tag None.

Can anyone help me please?

Thanks!

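import nltk

# Tagged sentences from the Mac-Morpho corpus (Brazilian Portuguese)
macMorpho = nltk.corpus.mac_morpho.tagged_sents()

# Use the first 90% of the sentences for training
sizeTraining = int(len(macMorpho) * 0.9)
training_sentences = macMorpho[:sizeTraining]

# Trigram tagger that backs off to a bigram tagger, then to a unigram tagger
tagger0 = nltk.tag.UnigramTagger(training_sentences)
tagger1 = nltk.tag.BigramTagger(training_sentences, backoff=tagger0)
posTagger = nltk.tag.TrigramTagger(training_sentences, backoff=tagger1)

words = nltk.word_tokenize("trânsito lento na direção da avenida Brasil")
print posTagger.tag(words)  # Python 2 print statement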

Result:

('tr\xc3\xa2nsito', None)

('dire\xc3\xa7\xc3\xa3o', None)

Alexis

May 24, 2016, 6:49:45 AM

to nltk-...@googlegroups.com

This looks like an encoding problem. The words you are trying to tag are utf-8 encoded, but what is the encoding of the mac_morpho corpus you trained on, and is it being read in correctly?
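To illustrate the kind of mismatch I mean, here is a minimal sketch. It does not use NLTK at all: `toy_model` is a hypothetical stand-in for a trained unigram tagger, and it assumes the trained model stores its tokens as Unicode text while the tokens being tagged arrive as UTF-8 byte strings (which is what the `\xc3\xa2` escapes in your output suggest).

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for a trained tagger: a lookup table keyed by
# Unicode text. The tag 'N' (noun) is used purely for illustration.
toy_model = {u'trânsito': 'N', u'direção': 'N'}


def lookup(token):
    """Return the tag stored for token, or None if the token is unknown."""
    return toy_model.get(token)


# The same word as raw UTF-8 bytes, as shown in the result above.
word_bytes = b'tr\xc3\xa2nsito'
# Decoding the bytes yields the Unicode text the table was keyed with.
word_text = word_bytes.decode('utf-8')

print(lookup(word_bytes))  # byte string does not match any Unicode key
print(lookup(word_text))   # decoded text matches
```

If this is what is happening, decoding the input with `.decode('utf-8')` before tokenizing should be enough to make the lookups succeed.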

My recommendation: switch to Python 3 immediately. It handles character encoding issues MUCH more sanely than Python 2, so you'll have a much easier time identifying and resolving the problem. (And if you're really lucky, it might just go away by itself.)

Best,

Alexis


Monica Neli

May 25, 2016, 6:11:26 AM

to nltk-users

Thanks! I switched to Python 3 and it worked!

=)

Pedro Marcal

May 25, 2016, 11:58:59 AM

to nltk-...@googlegroups.com

I wish it were that easy. I have 600 programs in Python 2.7.

Regards,

Pedro
