[POS-tagging] Portuguese words with accents tagging issue


Monica Neli

May 22, 2016, 5:36:27 AM
to nltk-users
Hello, 

I can't get the correct part-of-speech tag when tagging words with accents, such as 'trânsito' and 'direção'. I trained the tagger on the Mac-Morpho corpus as shown below. However, when I tag the sentence "trânsito lento na direção da avenida Brasil", the accented words get the tag None.
Can anyone help me, please?
Thanks!

import nltk

# Train on the first 90% of the Mac-Morpho tagged sentences
macMorpho = nltk.corpus.mac_morpho.tagged_sents()
sizeTraining = int(len(macMorpho) * 0.9)
training_sentences = macMorpho[:sizeTraining]

# Trigram tagger that backs off to a bigram tagger, then a unigram tagger
tagger0 = nltk.tag.UnigramTagger(training_sentences)
tagger1 = nltk.tag.BigramTagger(training_sentences, backoff=tagger0)
posTagger = nltk.tag.TrigramTagger(training_sentences, backoff=tagger1)

# Python 2 syntax (print statement)
words = nltk.word_tokenize("trânsito lento na direção da avenida Brasil")
print posTagger.tag(words)

Result: 
('tr\xc3\xa2nsito', None)
('dire\xc3\xa7\xc3\xa3o', None)

Alexis

May 24, 2016, 6:49:45 AM
to nltk-...@googlegroups.com
This looks like an encoding problem. The words you are trying to tag are utf-8 encoded, but what is the encoding of the mac_morpho corpus you trained on, and is it being read in correctly?

My recommendation: switch to Python 3 immediately. It handles character encoding issues MUCH more sanely than Python 2, so you'll have a much easier time identifying and resolving the problem. (And if you're really lucky, it might just go away by itself.)
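
To illustrate the kind of mismatch suspected here, below is a minimal Python 3 sketch. It does not use NLTK or Mac-Morpho: `toy_model` and `lookup` are hypothetical stand-ins for a trained tagger whose vocabulary is stored as Unicode text, and the tag values are illustrative only. It shows that a UTF-8 byte string never matches a text key, so the lookup returns None until the bytes are decoded.

```python
# Hypothetical stand-in for a trained model: a dict keyed by Unicode text.
# The tags are illustrative only, not read from the Mac-Morpho corpus.
toy_model = {"trânsito": "N", "direção": "N"}


def lookup(word):
    """Return the tag stored for word, or None when the key is not found."""
    return toy_model.get(word)


# The same word as raw UTF-8 bytes, as in the output quoted in the question.
utf8_bytes = "trânsito".encode("utf-8")

# A bytes key never compares equal to a str key, so nothing is found.
as_bytes = lookup(utf8_bytes)

# Decoding the bytes back to text makes the key match again.
as_text = lookup(utf8_bytes.decode("utf-8"))

print(repr(utf8_bytes), as_bytes, as_text)
```

If the same thing is happening in the tagger, decoding the input to Unicode before tokenizing (or running under Python 3, where string literals are Unicode by default) should let the lookup succeed.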

Best,

Alexis


Monica Neli

May 25, 2016, 6:11:26 AM
to nltk-users
Thanks! I switched to Python 3 and it worked!
=)

Pedro Marcal

May 25, 2016, 11:58:59 AM
to nltk-...@googlegroups.com
I wish it were that easy. I have 600 programs in Python 2.7.
Regards,
Pedro
