Part-of-speech tagger for Russian

Tsolak Ghukasyan

Jun 13, 2016, 9:48:33 PM
to nltk-dev
Hello everyone, 

I have trained an averaged perceptron model for part-of-speech tagging of Russian text and would like to incorporate it into NLTK. My goal is to change the pos_tag and pos_tag_sents functions in the tag module so that they work for Russian text as well as English.

For training I used the Russian National Corpus (94,240 tagged sentences with over 1 million tagged tokens): http://ruscorpora.ru/en/index.html
I used the methods of the PerceptronTagger class from the tag module to train the model and to store the model file.

The average accuracy of the model, measured by 10-fold cross-validation, is 99%.
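
For readers unfamiliar with the procedure, here is a minimal sketch of how an average accuracy can be measured with k-fold cross-validation. It is not the evaluation script used for the report: all function names are hypothetical, and a toy most-frequent-tag tagger stands in for the perceptron so that the example runs without NLTK or the corpus. Any training function that returns a `words -> [(word, tag), ...]` callable can be passed in its place.

```python
from collections import Counter, defaultdict


def k_fold_splits(items, k):
    """Yield (train, test) pairs; fold i tests on every k-th item starting at i.

    Assumes len(items) >= k >= 2 so that no fold is empty.
    """
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def train_majority_tagger(sentences):
    """Toy stand-in for a real tagger: remember each word's most frequent tag."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            all_tags[tag] += 1
    # Unknown words fall back to the most frequent tag overall.
    default = all_tags.most_common(1)[0][0]
    table = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lambda words: [(w, table.get(w, default)) for w in words]


def token_accuracy(tagger, sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sentence in sentences:
        words = [w for w, _ in sentence]
        predicted = tagger(words)
        for (_, gold), (_, guess) in zip(sentence, predicted):
            correct += gold == guess
            total += 1
    return correct / total


def cross_validate(sentences, k=10, train=train_majority_tagger):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = [
        token_accuracy(train(train_set), test_set)
        for train_set, test_set in k_fold_splits(sentences, k)
    ]
    return sum(scores) / len(scores)
```

Here `sentences` is a list of sentences, each a list of `(word, tag)` pairs, which is the same shape that NLTK's tagger training methods accept.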

You can find a more detailed description in the attachment.

Tsolak Ghukasyan
POS_Tag_Report.pdf

sabr

Feb 6, 2017, 7:28:46 PM
to nltk-dev
Hello,

Thanks for this contribution! Could you please tell me whether this has been merged into NLTK?

Tsolak Ghukasyan

Feb 8, 2017, 7:01:45 AM
to nltk...@googlegroups.com
Hello.

Yes, you can use the Russian tagger by calling the pos_tag function with the lang parameter set to 'rus'. The model file is in the NLTK Data repository.
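
A minimal sketch of that call, assuming NLTK is installed. The helper names are hypothetical, and the two resource identifiers tried in the download step are assumptions, since the name of the Russian model package has differed between NLTK releases. Input is pre-tokenized with a plain whitespace split to keep the example independent of any tokenizer model.

```python
def ensure_russian_model():
    """Try to fetch the Russian tagger model from NLTK Data.

    Both identifiers below are assumptions; an identifier that does not
    exist in the installed NLTK's index is reported and skipped.
    """
    import nltk  # imported lazily so this file loads even without NLTK

    for name in ("averaged_perceptron_tagger_ru", "averaged_perceptron_tagger_rus"):
        nltk.download(name, quiet=True)


def tag_russian(tokens):
    """Tag a list of Russian tokens via pos_tag with lang='rus'."""
    import nltk

    return nltk.pos_tag(tokens, lang="rus")


def format_tagged(pairs):
    """Render (word, tag) pairs as 'word/TAG word/TAG ...'."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


def demo():
    """Download the model if needed, then tag one short sentence."""
    ensure_russian_model()
    tokens = "Я люблю читать книги".split()
    print(format_tagged(tag_russian(tokens)))
```

Calling `demo()` prints the sentence with one tag attached to each token. For several sentences at once, pos_tag_sents takes a list of token lists and the same lang argument.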


tonko dvadva

Jun 27, 2017, 8:54:29 PM
to nltk-dev
Hi! Could you please add tag descriptions to the NLTK documentation/code?

On Tuesday, June 14, 2016 at 4:48:33 UTC+3, Tsolak Ghukasyan wrote:

Tsolak Ghukasyan

Jun 29, 2017, 8:51:03 AM
to nltk...@googlegroups.com
Hi!

Thanks for the suggestion. I will add them shortly. In the meantime, you can find the tagset description here.
