Fast POS Tagger

Martin

Jun 12, 2010, 2:24:42 PM
to nltk-users
Hi, everyone.

Can anyone recommend a faster alternative to nltk.pos_tag? I don't
mind using pipes, temporary files, FFI calls, etc. I just need fast
POS tagging from Python (to integrate with existing nltk code).
Thanks for your advice.

Martin

Jacob Perkins

Jun 13, 2010, 11:58:14 AM
to nltk-users
Hi Martin,

I'd recommend training your own tagger using BrillTagger,
NgramTaggers, etc. The ClassifierBasedTagger (which is what
nltk.pos_tag uses) is very slow. This article has some metrics and
links to articles on training your own:
http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/
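
To give the flavor of it, here's a rough, untested sketch of the kind of
backoff chain the article describes (the treebank slice and the 'NN'
default tag are just placeholders):

import nltk
from nltk.corpus import treebank

# train on the tagged treebank sample that ships with NLTK
train_sents = treebank.tagged_sents()[:3000]

# each tagger backs off to the previous one when it has no answer
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

print(t3.tag("This is a test sentence .".split()))

Once a chain like that works, you can use it as the initial tagger for
Brill training; the article covers that part.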

Jacob
---
http://streamhacker.com
http://twitter.com/japerk

Martin

Jun 13, 2010, 2:13:36 PM
to nltk-users
Thanks, Jacob.

Steven Bird

Jun 13, 2010, 10:59:33 PM
to nltk-users
You might also try NLTK's implementation of the TnT tagger
(nltk.tag.tnt) or use one of the off-the-shelf taggers listed here:

http://nlp.stanford.edu/links/statnlp.html
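
Note that the TnT implementation has to be trained before use; something
along these lines should work (untested):

>>> from nltk.corpus import treebank
>>> from nltk.tag import tnt
>>> tnt_tagger = tnt.TnT()
>>> tnt_tagger.train(treebank.tagged_sents()[:3000])
>>> tnt_tagger.tag("This is a test sentence .".split())

By default the NLTK TnT leaves unknown words tagged as 'Unk'; I believe
you can pass unk=nltk.DefaultTagger('NN'), Trained=True to the
constructor to fall back on a default tag instead.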

Peter Ljunglöf

Jun 14, 2010, 4:13:59 AM
to nltk-...@googlegroups.com
There is a wrapper for the Hunpos tagger in NLTK:

>>> hunpos = nltk.tag.HunposTagger("hunpos-1.0-macosx/english.model")
>>> hunpos.tag("What 's the airspeed of an unladen swallow ?".split())
[('What', 'WP'), ("'s", 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'NN'), ('swallow', 'VB'), ('?', '.')]

Hunpos is a fast open-source reimplementation of TnT: http://code.google.com/p/hunpos/

You have to install it separately (just download a binary), and you need a model. There are pre-trained English and Hungarian models on the homepage (and I can point you to good Swedish models), or you can train one yourself. There is no NLTK wrapper for training, so you have to do it on the command line - but it's not difficult.

/Peter

Martin

Jun 14, 2010, 11:06:03 AM
to nltk-users
Thanks, Steven and Peter.

Jacob Perkins

Jun 14, 2010, 11:18:48 AM
to nltk-users
I don't have any numbers, but I was recently playing with the TnT
tagger and it seemed significantly slower than Brill+Ngram taggers for
tagging (but training was pretty fast). It's a good, accurate tagger, though.

Peter Ljunglöf

Jun 15, 2010, 3:14:15 AM
to nltk-...@googlegroups.com
Yes, the NLTK port of TnT is slow; I don't know why. That's why I made the Hunpos wrapper.

/Peter

Martin

Jul 4, 2010, 1:23:27 PM
to nltk-users
My POS problem seems to be solved. Thanks again.

Slightly off topic, but does anyone have any advice on fast word and
sentence tokenization?

Right now, I'm using:

word_tokenizer = nltk.tokenize.WordPunctTokenizer()
sent_tokenizer = nltk.tokenize.PunktSentenceTokenizer()

These seem too slow for, e.g., tokenizing all of Wikipedia.
Maybe it's unrealistic to expect this to work at this scale on a single PC?
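
In case it matters, the inner loop is essentially just this (simplified):

for sent in sent_tokenizer.tokenize(text):
    words = word_tokenizer.tokenize(sent)

so the time is going almost entirely into those two tokenize calls.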

Bill Janssen

Jul 10, 2010, 10:39:43 PM
to nltk-users
Peter, where is the Hunpos wrapper? I don't have any
nltk.tag.HunposTagger in my install of 2.0b8. And could you please
show me how to build a model for it from the Brown corpus?

Thanks.

Bill

Peter Ljunglöf

Jul 19, 2010, 8:14:28 AM
to nltk-...@googlegroups.com
Hi Bill,

maybe I added it after 2.0b8 was released; in that case the wrapper is only in the Subversion repository, or you'll have to wait for the next release.

/Peter

PS. The wrapper can only be used with an existing model. To create a model from a corpus, you have to do it yourself from the command line. Look at the hunpos homepage for information on that: http://code.google.com/p/hunpos/
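
If I remember the format correctly, the training data is plain text with
one token and its tag per line, separated by a tab, and a blank line
between sentences, roughly like this:

The	DT
dog	NN
barks	VBZ
.	.

You then feed a file in that format to the hunpos-train program on the
command line; the homepage has the exact invocation.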

Bill Janssen

Jul 21, 2010, 7:11:59 PM
to nltk-users
Thanks. Looks clear enough. I'll pull the wrapper out of SVN and
train up a tagger on the Brown corpus.
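
For the record, my plan is roughly this (untested, and the tab-separated,
blank-line-delimited training format is just what I understood from
Peter's description):

from nltk.corpus import brown

# dump the Brown corpus as one "word<TAB>tag" pair per line,
# with a blank line between sentences, for hunpos-train
out = open('brown_train.txt', 'w')
for sent in brown.tagged_sents():
    for word, tag in sent:
        out.write('%s\t%s\n' % (word, tag))
    out.write('\n')
out.close()

Then I'll run hunpos-train over that file and point HunposTagger at the
resulting model.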

Bill
