Part-of-speech tagger for Russian

Tsolak Ghukasyan

Jun 13, 2016, 9:48:33 PM

to nltk-dev

Hello everyone,

I have trained an averaged perceptron model for part-of-speech tagging of Russian text and would like to incorporate it into NLTK. My goal is to extend the pos_tag and pos_tag_sents functions in the tag module so that they work for Russian as well as English.

For training I used the Russian National Corpus (94,240 tagged sentences with over 1 million tagged tokens): http://ruscorpora.ru/en/index.html

I used the methods of the PerceptronTagger class from the tag module to train the model and to store the model file.

The average accuracy of the model, measured by 10-fold cross-validation, is 99%.
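For readers unfamiliar with the procedure, here is a minimal sketch of how an average accuracy over k folds can be computed. It is not the code used for the report: the function names k_fold_accuracy and train_baseline are hypothetical, and a simple most-frequent-tag baseline stands in for the perceptron so that the example needs only the standard library.

```python
from collections import Counter, defaultdict


def k_fold_accuracy(sentences, train_fn, k=10):
    """Return the mean token-level accuracy over k folds.

    sentences: list of sentences, each a list of (word, tag) pairs.
    train_fn:  takes training sentences and returns a function that
               maps a list of words to a list of predicted tags.
    """
    if k < 2 or k > len(sentences):
        raise ValueError("k must be between 2 and the number of sentences")
    scores = []
    for i in range(k):
        # Every k-th sentence (offset i) is held out; the rest are training data.
        test = sentences[i::k]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        tag = train_fn(train)
        correct = total = 0
        for sent in test:
            words = [w for w, _ in sent]
            gold = [t for _, t in sent]
            predicted = tag(words)
            correct += sum(p == g for p, g in zip(predicted, gold))
            total += len(gold)
        # Guard against a fold that contains only empty sentences.
        scores.append(correct / total if total else 0.0)
    return sum(scores) / k


def train_baseline(train_sentences):
    """Hypothetical stand-in model: tag each word with its most frequent tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in train_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]  # fallback for unseen words

    def tag(words):
        return [
            counts[w].most_common(1)[0][0] if w in counts else default
            for w in words
        ]

    return tag
```

Any trainer with the same shape can be passed as train_fn, so the baseline can be swapped for a real tagger without changing the evaluation loop.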

You can find a more detailed description in the attachment.

Tsolak Ghukasyan

POS_Tag_Report.pdf

sabr

Feb 6, 2017, 7:28:46 PM

to nltk-dev

Hello,

Thanks for this contribution! Could you please tell me whether this has been merged into NLTK?

Tsolak Ghukasyan

Feb 8, 2017, 7:01:45 AM

to nltk...@googlegroups.com

Hello.

Yes, you can use the Russian tagger by calling the pos_tag function with the lang parameter set to 'rus'. The model file is in the NLTK Data repository.
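A minimal sketch of such a call follows. The wrapper name tag_russian is hypothetical, and the sketch assumes that NLTK is installed and that the Russian model has already been fetched with nltk.download; the exact resource name depends on the NLTK version, so check the LookupError message if the model is missing. When either piece is unavailable, the wrapper returns None instead of failing.

```python
def tag_russian(tokens):
    """Tag a pre-tokenized Russian sentence with nltk.pos_tag.

    Returns a list of (token, tag) pairs, or None when NLTK or the
    Russian model is not available on this machine.
    """
    try:
        import nltk  # imported lazily so the sketch degrades gracefully
    except ImportError:
        return None
    try:
        # lang='rus' selects the Russian averaged perceptron model.
        return nltk.pos_tag(tokens, lang="rus")
    except LookupError:
        # Raised by NLTK when the model file has not been downloaded.
        return None


if __name__ == "__main__":
    tagged = tag_russian(["Мама", "мыла", "раму"])
    if tagged is None:
        print("NLTK or the Russian tagger model is not installed.")
    else:
        for token, tag in tagged:
            print(token, tag)
```

pos_tag expects a list of tokens, not a raw string, so the text has to be tokenized before tagging.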

tonko dvadva

Jun 27, 2017, 8:54:29 PM

to nltk-dev

Hi! Could you please add tag descriptions to the NLTK documentation/code?

On Tuesday, June 14, 2016 at 4:48:33 AM UTC+3, Tsolak Ghukasyan wrote:

Tsolak Ghukasyan

Jun 29, 2017, 8:51:03 AM

to nltk...@googlegroups.com

Hi!

Thanks for the suggestion. I will add them shortly. In the meantime, you can find the tagset description here.
