Building a POS tagger for an indigenous language using a small corpus

Paul C.

Mar 25, 2016, 8:46:45 PM

to nltk-users

Is it possible to build a POS tagger for a North American language using a small corpus (10,000 to 15,000 annotated sentences)?

I'm an indigenous languages researcher and I have a bit of experience with Python (game dev hobby), but I'm totally new to NLTK. From what I have read so far, I need a language model that contains rules on how to classify words in a particular language. How can I build this model using the manually annotated data I already have? What would be the first step? All the documentation I have seen so far uses pre-made models for the most widely spoken languages and doesn't explain how to create them.

Alexis

Mar 26, 2016, 4:14:07 AM

to nltk-...@googlegroups.com

I don't believe there are any NLTK modules designed specifically for so-called "low-resource languages", but the NLTK's standard modules should be useful to you. There are additional issues with modeling such languages (lots of missing data), so do google the term.

But you should read some more. The standard approach to POS-tagging is not based on explicit grammars, but on statistical models trained from a corpus through machine learning. Start with the NLTK book, focusing on the statistical chapters. (The CFG grammar module and its kin are primarily useful as teaching tools, not NLP technologies.) However, standard POS tagging solutions are based on tagging individual words. If your language is highly inflecting or polysynthetic, you'll need a morphological analysis component. These are much harder to train with a generic statistical tool, but I'm not up to date on the state of the art.
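As a rough illustration of the statistical approach described above, here is a minimal sketch that trains a tagger from manually annotated sentences using NLTK's n-gram taggers chained with backoff. The toy English data, the helper names `train_backoff_tagger` and `token_accuracy`, and the choice of `"NOUN"` as the default tag are all assumptions for illustration; a real corpus would be loaded from your own annotation files, and this is only one of several reasonable starting points, not a recommendation for any particular language.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Toy stand-in for the annotated corpus: a list of sentences, each a list of
# (word, tag) pairs. Words and tags are invented placeholders.
tagged_sents = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
]

# Hold out the last sentence so the tagger is scored on data it was not trained on.
train_sents, test_sents = tagged_sents[:-1], tagged_sents[-1:]


def train_backoff_tagger(train, default_tag="NOUN"):
    """Hypothetical helper: bigram -> unigram -> default-tag backoff chain."""
    t0 = DefaultTagger(default_tag)        # last resort for unseen words
    t1 = UnigramTagger(train, backoff=t0)  # most frequent tag per word
    t2 = BigramTagger(train, backoff=t1)   # word plus the previous tag
    return t2


def token_accuracy(tagger, gold_sents):
    """Hypothetical helper: fraction of tokens whose predicted tag matches gold."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tagger.tag(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0


tagger = train_backoff_tagger(train_sents)
print(tagger.tag(["the", "cat", "runs"]))
print(token_accuracy(tagger, test_sents))
```

The backoff chain matters with a small corpus: many word and context combinations will never have been seen in training, so each tagger hands unseen cases to a simpler one instead of leaving them untagged. Note that this tags whole word forms only; it does not address the morphological analysis that a highly inflecting or polysynthetic language would need.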

Alexis

Pedro Marcal

Mar 26, 2016, 7:45:09 AM

to nltk-...@googlegroups.com

I have developed a context-free parser called parts, based on statistical parsing and Design of Experiment. It's about two orders of magnitude less demanding on computing. However, it's not too well documented, so we would have to work interactively to get you started. 10,000 sentences should be sufficient to establish the grammar. You will have to add to a dictionary as you go.

Pedro Marcal
