Using nltk.pos_tag() as backoff


Oli

Apr 1, 2010, 7:15:54 AM
to nltk-users
Hi,

I am fairly new to NLTK and even Python. My first task is POS tagging, so I played around with the different taggers and corpora to get a feeling for it.

The standard tagger (nltk.pos_tag()) does pretty well in many cases, but there are also cases where the unigram tagger (or combinations of unigram, bigram, and affix taggers) does much better.

So... I thought about using pos_tag() as a backoff tagger in combination with the unigram and other taggers.

Here is my (simple) code for that:

import nltk

# get the standard tagger
t0 = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')

# unigram tagger with t0 as backoff
t1 = nltk.UnigramTagger(sent_tagged, backoff=t0)

The problem I am facing is the following: it took about 5 minutes (or more) to initialize the t1 tagger. When I combine other taggers (unigram and bigram, e.g.), it is much faster. Only when using the standard tagger as backoff does it take an uncommonly long time to initialize the tagger. The training set (sent_tagged) is nearly the complete corpus (treebank or another). I am using the first 50 sentences for testing and the rest of the corpus for training.

So... any ideas what I am doing wrong, or is it simply not possible to use the standard tagger as backoff the way I tried?
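
As a side note, the backoff idea itself can be sketched in plain Python (this is not NLTK's implementation; the class and method names here are invented for illustration). If a tagger's training or lookup consults its backoff once per token, an expensive backoff tagger would make those steps slow:

```python
from collections import Counter, defaultdict

class BackoffTagger:
    """Minimal unigram tagger with a backoff, loosely modelled on
    NLTK's tagger API. Not NLTK's actual implementation."""

    def __init__(self, tagged_sents, backoff=None):
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        # keep only the most frequent tag per word
        self.table = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(self, words):
        return [(w, self.tag_word(w)) for w in words]

    def tag_word(self, word):
        if word in self.table:
            return self.table[word]
        if self.backoff is not None:
            return self.backoff.tag_word(word)  # consult the backoff
        return None

train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
         [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
t1 = BackoffTagger(train)
print(t1.tag(["the", "dog", "sleeps"]))
# [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ')]
```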

Thank you in advance!
Oliver

PS: I am from Germany... so excuse my English :-)

Victor Miclovich

Apr 1, 2010, 7:23:10 AM
to nltk-...@googlegroups.com
A lot of this NLP stuff involves a trial-and-error approach...
Are you also planning to use NLTK for German?
Your idea is nice, by the way... sometimes you'll have to write your own tagger and use some cool stuff with regexes (regular expressions) to customize your programs to suit the context in which you want to use NLTK.
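
For example, a tiny regex-based tagger in the spirit of NLTK's RegexpTagger can be sketched like this (the patterns and tag choices are illustrative only, not a serious tag set):

```python
import re

# First matching pattern wins; the last pattern is a catch-all default.
PATTERNS = [
    (r".*ing$", "VBG"),          # gerunds
    (r".*ed$", "VBD"),           # simple past
    (r".*s$", "NNS"),            # plural nouns
    (r"^-?\d+(\.\d+)?$", "CD"),  # cardinal numbers
    (r".*", "NN"),               # default: noun
]

def regex_tag(word):
    for pattern, tag in PATTERNS:
        if re.match(pattern, word):
            return tag

print([(w, regex_tag(w)) for w in ["running", "jumped", "42"]])
# [('running', 'VBG'), ('jumped', 'VBD'), ('42', 'CD')]
```

Such a tagger is often used as a cheap backoff for unknown words rather than on its own.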


Oli

Apr 1, 2010, 7:38:42 AM
to nltk-users
Hi,

thank you for answering.

I thought about using (or just testing) tagging with German tags. There is a corpus available for non-commercial (academic) use:
http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/negra-corpus.html

But to begin with, I am going to work with English texts. The documents I have to tag are very technical, so I tried the "learned" category of the Brown corpus, and for this particular corpus pos_tag() does not do very well (around 65% accuracy). That's why I want to combine it with a unigram tagger (which does quite well on these texts).
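
For completeness, this is the kind of accuracy check I mean, written out by hand rather than with NLTK's built-in evaluation (the toy tagger and gold data below are made up for illustration):

```python
def accuracy(tagger_fn, gold_sents):
    """Token-level accuracy of a tagger against gold-tagged sentences.
    tagger_fn takes a list of words and returns (word, tag) pairs."""
    right = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]
        for (_, guess), (_, gold) in zip(tagger_fn(words), sent):
            total += 1
            if guess == gold:
                right += 1
    return right / total

# toy baseline that tags every word "NN"
gold = [[("the", "DT"), ("dog", "NN")], [("dogs", "NNS"), ("bark", "VB")]]
nn_tagger = lambda ws: [(w, "NN") for w in ws]
print(accuracy(nn_tagger, gold))  # 1 of 4 tokens correct -> 0.25
```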

But... there is the problem I described in my first post :-)

Oli

Victor Miclovich

Apr 1, 2010, 7:52:05 AM
to nltk-...@googlegroups.com
Describe it more clearly...
You stated the problem, but then it looks like you gave an answer. Please clarify the question (at least for me).

Oli

Apr 1, 2010, 7:59:03 AM
to nltk-users
OK... sorry, I'll try it again :-)

When doing the following:

import nltk

# get the standard tagger
t0 = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')

# unigram tagger with t0 as backoff
t1 = nltk.UnigramTagger(sent_tagged, backoff=t0)

It took about 5 minutes (or even longer) to initialize t1 (the unigram tagger with the standard tagger as backoff). This problem only occurs when using the standard tagger as backoff. What is the reason for the long initialization time?
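
One workaround, assuming the slow part really is this one-time initialization: train the combined tagger once, pickle the trained object to disk, and just load the pickle on later runs. A sketch, with a plain dict standing in for the trained tagger's state:

```python
import os
import pickle
import tempfile

# Stand-in for the trained tagger's lookup table; in practice you
# would pickle the trained t1 object itself after building it once.
model = {"dog": "NN", "the": "DT"}

path = os.path.join(tempfile.gettempdir(), "combined_tagger.pickle")

# expensive training happens only once, then the result is saved
with open(path, "wb") as f:
    pickle.dump(model, f)

# later runs: skip training entirely and just load the saved object
with open(path, "rb") as f:
    reloaded = pickle.load(f)

print(reloaded["dog"])  # NN
```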

Thanks!
Oli

Victor Miclovich

Apr 1, 2010, 8:07:29 AM
to nltk-...@googlegroups.com
Python is a slow language compared with C++, Java, and other compiled/optimized programming languages.
I'm afraid to say that with everything to do with data analytics in NLTK, especially when you have large amounts of data (to parse, to load, etc.), you will find that Python is slow.
It is advisable to use Python together with other languages that are relatively faster, such as Python/C++ bindings, or a binding to Java's Weka package:
http://www.cs.waikato.ac.nz/ml/weka/
http://kogs-www.informatik.uni-hamburg.de/~meine/weka-python/

Yep... that's definitely it... speed isn't that good for some applications you write.
