Hello,
I can't get the correct part-of-speech tag for words with accented characters, such as 'trânsito' and 'direção'. I trained the tagger on the Mac-Morpho corpus as shown below. However, when I tag the sentence "trânsito lento na direção da avenida Brasil", the accented words get the tag None.
Can anyone help me please?
Thanks!
import nltk

# Train on the first 90% of the Mac-Morpho tagged sentences
macMorpho = nltk.corpus.mac_morpho.tagged_sents()
sizeTraining = int(len(macMorpho) * 0.9)
training_sentences = macMorpho[:sizeTraining]
# Chain of n-gram taggers, each backing off to the simpler one
tagger0 = nltk.tag.UnigramTagger(training_sentences)
tagger1 = nltk.tag.BigramTagger(training_sentences, backoff=tagger0)
posTagger = nltk.tag.TrigramTagger(training_sentences, backoff=tagger1)
words = nltk.word_tokenize("trânsito lento na direção da avenida Brasil")
print posTagger.tag(words)
Result:
('tr\xc3\xa2nsito', None)
('dire\xc3\xa7\xc3\xa3o', None)
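
One possible explanation, offered as a guess rather than a confirmed diagnosis: the \xc3\xa2 escapes in the result suggest that the tokens are UTF-8 byte strings. If the corpus words are Unicode strings (an assumption that depends on the NLTK version), the two would never compare equal, so the lookup would fail for the accented words. The sketch below imitates this with a plain dictionary instead of the real tagger; `toy_model`, `lookup`, and the tags in it are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Toy stand-in for a unigram tagger: a dict from Unicode words to tags.
# The words and tags here are hypothetical, not taken from Mac-Morpho.
toy_model = {u"trânsito": "N", u"lento": "ADJ", u"direção": "N"}

def lookup(token):
    """Return the tag stored for token, or None if it is not in the model."""
    return toy_model.get(token)

raw = u"trânsito".encode("utf-8")    # UTF-8 bytes, like 'tr\xc3\xa2nsito'
print(lookup(raw))                   # -> None (bytes do not match the text key)
print(lookup(raw.decode("utf-8")))   # -> N (decoding first restores the match)
```

If that is what is happening here, decoding the input to Unicode before tokenizing may be enough.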