I am fairly new to NLTK, and even to Python. My first task is POS
tagging, so I played around with the different taggers and corpora to
get a feel for them.
The standard tagger (nltk.pos_tag()) does pretty well in many
cases, but there are also cases where the unigram tagger (or a
combination of unigram, bigram, and affix taggers) does much better.
So I thought about using the pos_tag() tagger as a backoff tagger in
combination with the unigram and other taggers.
Here is my (simple) code for that:
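import nltk
# get the standard tagger (the pickled model behind nltk.pos_tag())
t0 = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')
# unigram tagger with t0 as backoff; sent_tagged holds the tagged training sentences
t1 = nltk.UnigramTagger(sent_tagged, backoff=t0)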
The problem I am facing is the following: it takes about 5 minutes (or
more) to construct the t1 tagger. When I combine other taggers (e.g.
unigram and bigram), it is much faster. Only with the standard tagger as
backoff does initialization take this unusually long. The
training set (sent_tagged) is nearly the complete corpus (treebank or
another one): I use the first 50 sentences for testing and the rest of
the corpus for training.
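To make the setup concrete, here is a minimal sketch of the split and of the fast case (unigram plus bigram with a cheap backoff), with a timer around the construction. The tiny hand-made corpus and the names train_sents, test_sents and n_test are assumptions for illustration only; in my real runs the data comes from a full corpus and n_test is 50.

```python
import time
import nltk

# Tiny hand-made tagged corpus so the sketch runs without downloading data.
# In the real setup this would be a full corpus such as the treebank sample.
tagged_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("barks", "VBZ")],
]

n_test = 1  # assumption for the toy data; the post holds out the first 50
test_sents = tagged_sents[:n_test]
train_sents = tagged_sents[n_test:]

# Time the construction (= training) of the backoff chain.
start = time.perf_counter()
t0 = nltk.DefaultTagger("NN")                      # cheap last-resort tagger
t1 = nltk.UnigramTagger(train_sents, backoff=t0)   # falls back to t0
t2 = nltk.BigramTagger(train_sents, backoff=t1)    # falls back to t1
elapsed = time.perf_counter() - start

tagged = t2.tag([word for word, _ in test_sents[0]])
print("construction took %.4f s" % elapsed)
print(tagged)
```

Swapping t0 for the pickled standard tagger in this sketch is the only change that separates the fast case from the slow one.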
So, any ideas what I am doing wrong? Or is it simply not possible to
use the standard tagger as a backoff the way I tried?
Thank you in advance!
Oliver
PS.: I am from Germany...so excuse my English :-)
> --
> You received this message because you are subscribed to the Google
> Groups "nltk-users" group.
> To post to this group, send email to nltk-...@googlegroups.com.
> To unsubscribe from this group, send email to nltk-users+...@googlegroups.com
> .
> For more options, visit this group at http://groups.google.com/group/nltk-users?hl=en
> .
>
Thank you for answering.
I thought about using (or just testing) tagging with German tags.
There is a corpus available for non-commercial (academic) use:
http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/negra-corpus.html
But to begin with, I am going to work with English texts. The
documents I have to tag are very technical, so I tried the "learned"
category of the Brown corpus. For this particular category,
pos_tag() does not do very well (around 65% accuracy). That's
why I want to combine it with a unigram tagger (which does quite well
on these texts).
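For reference, this is roughly how such an accuracy figure can be computed: a minimal sketch with a hypothetical helper token_accuracy and a two-sentence toy corpus in Brown-style tags. With the real data, gold would be something like nltk.corpus.brown.tagged_sents(categories='learned'), which needs the corpus to be downloaded. One caveat worth keeping in mind: the Brown corpus and the treebank-trained standard tagger use different tagsets (e.g. AT versus DT for articles), so part of a low score may be tag mismatch rather than real errors.

```python
import nltk

def token_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        for (_, predicted), (_, gold_tag) in zip(tagger.tag(words), sent):
            correct += predicted == gold_tag
            total += 1
    return correct / total if total else 0.0

# Toy gold data (assumption); stands in for the Brown "learned" category.
gold = [
    [("the", "AT"), ("theory", "NN"), ("holds", "VBZ")],
    [("the", "AT"), ("results", "NNS"), ("hold", "VB")],
]
train, test = gold[:1], gold[1:]

unigram = nltk.UnigramTagger(train, backoff=nltk.DefaultTagger("NN"))
acc = token_accuracy(unigram, test)
print("accuracy: %.2f" % acc)
```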
But there is the problem I described in my first post :-)
Oli
When doing the following:
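import nltk
# get the standard tagger (the pickled model behind nltk.pos_tag())
t0 = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')
# unigram tagger with t0 as backoff; sent_tagged holds the tagged training sentences
t1 = nltk.UnigramTagger(sent_tagged, backoff=t0)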
it takes about 5 minutes (or even longer) to initialize t1 (the unigram
tagger with the standard tagger as backoff). This problem only occurs
when the standard tagger is used as backoff. What is the reason for the
long initialization time?
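One way to narrow this down is to check whether constructing the unigram tagger already calls the backoff tagger. The sketch below uses a hypothetical CountingBackoff class (not part of NLTK, written only for this test) that counts how often it is asked for a tag; the toy training data is also an assumption.

```python
import nltk
from nltk.tag.sequential import SequentialBackoffTagger

class CountingBackoff(SequentialBackoffTagger):
    """Hypothetical stand-in for a slow backoff tagger; counts its calls."""

    def __init__(self, fixed_tag):
        super().__init__()
        self.fixed_tag = fixed_tag
        self.calls = 0

    def choose_tag(self, tokens, index, history):
        # Every request for a tag is counted, then answered with a fixed tag.
        self.calls += 1
        return self.fixed_tag

# Toy training data (assumption); stands in for sent_tagged.
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
n_tokens = sum(len(sent) for sent in train_sents)

backoff = CountingBackoff("NN")
t1 = nltk.UnigramTagger(train_sents, backoff=backoff)

# Read the counter right after construction, before any tagging is done.
calls_during_training = backoff.calls
print("training tokens:", n_tokens)
print("backoff calls during construction:", calls_during_training)
```

If the two numbers match, the backoff tagger is run once per training token while t1 is being built, and a slow backoff such as the maxent model would then have to tag nearly the whole corpus at construction time.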
Thanks!
Oli
On 1 Apr., 13:52, Victor Miclovich <vicmiclov...@gmail.com> wrote:
> Describe it more clearly...
> You stated the problem but then it looks like you gave an answer.. Please
> clarify the question (at least for me)
>
> On Thu, Apr 1, 2010 at 2:38 PM, Oli <oliver.pes...@googlemail.com> wrote:
> > hi,
>
> > thank you for answering.
>
> > I thought about using (or just testing) the tagging with german tags.
> > There is a corpus available for non-commercial (academic) usage:
>
> >http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/negra-co...