Meaning of 'cutoff' in NgramTaggers


tom.pr...@gmail.com

Jan 27, 2015, 3:44:14 PM
to nltk-...@googlegroups.com
I'm a bit new to all this.
I am having a problem with automatic tagging.
I have 4000+ corpora, and each has a sentence like [(A,1) (B,2) (C,3) (D,4)].
I am using a 3-gram tagger that backs off to a 2-gram tagger, then a 1-gram tagger, then a default tagger. The problem is that the tags generated by the tagger don't correspond to the sequence 1,2,3,4. I am obviously doing something wrong here, so I wanted to ask about the cutoff setting. I'm not clear on what effect a non-zero setting would have. Could someone explain what cutoff=N actually means and does?

Alexis Dimitriadis

Jan 29, 2015, 4:36:00 AM
to nltk-...@googlegroups.com
The cutoff parameter is used while training a model. The following description appears in the documentation of the Ngram class:
   :param cutoff: If the most likely tag for a context occurs
        fewer than *cutoff* times, then exclude it from the
        context-to-tag table for the new tagger.
By excluding rare contexts, you let them fall through to the backoff tagger, which can presumably offer a better guess. Otherwise even a single instance in the training data would be enough for the current tagger (e.g., the trigram tagger) to return a result for that context.
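
To make the effect concrete, here is a minimal plain-Python sketch of the idea. It does not use NLTK; the names build_context_table and tag_with_backoff are made up for illustration, and the comparison follows the wording of the docstring quoted above ("fewer than cutoff times"). The exact boundary condition in your installed NLTK version may differ, so check the source if it matters.

```python
from collections import Counter, defaultdict


def build_context_table(observations, cutoff=0):
    """Build a context -> most likely tag table from (context, tag) pairs.

    Hypothetical helper, not NLTK's code. A context is dropped when its
    most likely tag was seen fewer than `cutoff` times.
    """
    counts = defaultdict(Counter)
    for context, tag in observations:
        counts[context][tag] += 1

    table = {}
    for context, tag_counts in counts.items():
        best_tag, hits = tag_counts.most_common(1)[0]
        # Keep the context only if its best tag is frequent enough.
        if hits >= cutoff:
            table[context] = best_tag
    return table


def tag_with_backoff(context, table, backoff_tag):
    """Return the table's tag for a context, or the backoff answer if absent."""
    return table.get(context, backoff_tag)


# Toy training data: one context seen three times, one seen only once.
observations = [
    (("A", "B"), "3"),
    (("A", "B"), "3"),
    (("A", "B"), "3"),
    (("X", "Y"), "9"),
]

table_all = build_context_table(observations, cutoff=0)  # keeps both contexts
table_cut = build_context_table(observations, cutoff=2)  # drops the rare one

print(tag_with_backoff(("X", "Y"), table_all, "DEFAULT"))
print(tag_with_backoff(("X", "Y"), table_cut, "DEFAULT"))
```

With cutoff=0 the context seen only once still gets its own tag; with cutoff=2 it is left out of the table, so the lookup falls through to whatever the backoff supplies.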

Alexis

Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis
