As I understand it, nltk.pos_tag() is a tagger that's convenient to
use, and works in a lot of cases. If you take a look at the source for
that function, it loads up the "maxent treebank pos tagger" for
English, and uses that. So it might be useful for your application,
but you should test it out!
Possible issues:
- it's only useful if your application works with the same tag set
that this tagger produces. There are many different possible tag sets
-- you could make up your own -- but the ones used here are the ones
used in the Penn Treebank, so they're pretty common.
- It'll work much better if the text you want to tag is similar to the
text that this tagger was trained on.
Hope this helps!
--
-- alexr
On Mar 28, 12:34 pm, Alex Rudnick <alex.rudn...@gmail.com> wrote:
http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/tag/__init__.py#55
It loads the "currently recommended" tagger from a pickle file and
just uses that. And while it's a good tagger for some uses, it might
not be right for your particular application. But it should be a fine
default starting point, until/unless you find you need something
different.
On Wed, Mar 31, 2010 at 4:18 AM, Raymond <gunbus...@gmail.com> wrote:
> Well...actually I cannot find this maxent treebank pos tagger inside
> the documentation....
> So obviously I know nothing about it...
--
-- alexr
Something like the following should help you get an idea, and you may
also like to look at a post by Jacob Perkins here:
http://streamhacker.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/
import nltk
corpus = nltk.corpus.treebank.tagged_sents()
tagger = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')
tagger.evaluate(corpus)
Someone might be able to suggest a better corpus to evaluate against
though.
On Apr 1, 3:57 am, James Smith <jamessmith...@gmail.com> wrote:
> I believe the tagger was trained on the treebank corpus so it will be -
> very- accurate for that and similar texts.
interesting discussion...the question is: "How do I evaluate
pos_tag() on a corpus other than the treebank?"
When using pos_tag() on the treebank corpus, I got very good
results (clearly over 90%)... because pos_tag() was trained on the
treebank corpus.
But when using it on the brown corpus, the results are much weaker
(around 60%).
So...what is the reason for that? I thought about the tagsets used.
The treebank tagset is much smaller (46 tags) than the brown tagset
(472 tags), so it is clear that pos_tag() cannot give good results
when evaluated against the brown corpus.
An example:
- in the brown corpus we have these tags for singular nouns: NN, NN$,
NN+BEZ, NN+HVD, NN+HVZ, NN+IN, NN+MD, NN+NN
- in the treebank corpus we only have NN for singular nouns
Could that be the reason why pos_tag() didn't perform well on the
brown corpus? How can I find a "true" accuracy value for pos_tag()
on a corpus other than treebank?
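One rough way to compare across the two tagsets is to collapse Brown's compound tags down to a base tag before scoring. This is only a sketch: the splitting rule below is an assumption about Brown's tag notation, not an official mapping.

```python
# Hypothetical helper: collapse Brown compound tags (e.g. NN+BEZ, JJ-TL)
# to a base tag so they can be compared against Treebank-style tags.
# The splitting rule is an assumption, not an official NLTK mapping.
def simplify_brown_tag(tag):
    # Drop clitic parts ('+BEZ') and modifier parts ('-TL').
    return tag.split('+')[0].split('-')[0]

tagged = [('the', 'AT'), ('ocean', 'NN+BEZ'), ('great', 'JJ-TL')]
simplified = [(word, simplify_brown_tag(tag)) for word, tag in tagged]
print(simplified)  # [('the', 'AT'), ('ocean', 'NN'), ('great', 'JJ')]
```

Scoring against simplified tags would at least separate genuine mistakes from pure tagset mismatch.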
I hope you understand my question...my English is not the best :-)
thanks
Oli
Raymond - I recommend reading samples of each category in the brown
corpus to see if any of them are linguistically similar to your hotel
reviews. This can be more of an intuitive choice, or you could try
using grep to find key words or phrases you've identified in the hotel
reviews.
It's possible to train on just particular categories of the brown
corpus using the 'categories=[cat]' keyword argument to various tagger
functions. I hope this all underscores the need to train on text that
is similar to the text you want to tag. You can always bootstrap your
own corpus by training on an NLTK corpus, tagging your own text, then
going in and hand-correcting your tagged text to create a new training
corpus. Ideally, you can find interns or grad students to do the last
part :)
Hope that helps,
Jacob
---
http://streamhacker.com
http://twitter.com/japerk
I see there are many types of taggers.
If I want to train my own tagger, which tagger is the most
computationally economical?
I ask this because the data I have is quite huge.
And what type of tagger is pos_tag()?
Raymond
On Apr 2, 10:40 pm, Jacob Perkins <jap...@gmail.com> wrote:
> Oli - in addition to the tag differences, there's also language
> differences. pos_tag() should do well on conll2000 too, but brown is
> very different. The brown corpus is actually a collection of various
> small corpora, ranging from romance fiction to reviews published in
> newspapers. Each category can be quite different, and only some may be
> suitable for your purposes.
Do you mean which tagger will train the fastest, or which will tag
the fastest? Since training is a one-time cost, I think it's ok if it
takes a few minutes, or even hours. And for tagging, I believe
accuracy is far more important than speed. The method I recommend is
to combine many taggers to get higher accuracy, then do distributed
processing in order to speed things up. Here are some articles I
wrote that might help:
http://streamhacker.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/
http://streamhacker.com/2009/11/29/distributed-nltk-execnet/
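The combining idea is a backoff chain. Here's a minimal sketch with toy inline training data; a real setup would train on something like treebank.tagged_sents() instead:

```python
import nltk

# Toy tagged sentences just to make this runnable; substitute a real
# corpus (e.g. nltk.corpus.treebank.tagged_sents()) for actual use.
train = [
    [('the', 'DT'), ('ocean', 'NN'), ('sounds', 'VBZ'), ('great', 'JJ')],
    [('the', 'DT'), ('vast', 'JJ'), ('ocean', 'NN')],
]

# Each tagger defers to its backoff when it has no answer.
t0 = nltk.DefaultTagger('NN')               # last resort: guess noun
t1 = nltk.UnigramTagger(train, backoff=t0)  # most frequent tag per word
t2 = nltk.BigramTagger(train, backoff=t1)   # previous tag as extra context
print(t2.tag(['the', 'vast', 'ocean']))
```

Words the bigram tagger has never seen in context fall through to the unigram tagger, and completely unknown words get the default tag.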
> I ask this because the data I have is quite huge.
Sounds like a perfect candidate for distributed/parallel processing.
> And what type of tagger is pos_tag()?
I believe it's an instance of nltk.tag.ClassifierBasedPOSTagger using
nltk.classify.MaxentClassifier. Can someone else confirm?
I need faster tagging....
Raymond
The code you'll need is something along the lines of:
import pickle
tagger = ...  # code to train your tagger goes here
f = open('directory/tagger.pickle', 'wb')  # binary mode for pickled data
pickle.dump(tagger, f)
f.close()
And then when you need to load the tagger you use:
import pickle
f = open('directory/tagger.pickle', 'rb')  # binary mode again
tagger = pickle.load(f)
f.close()
If you save the tagger somewhere in the nltk_data hierarchy you can
also use the nltk.data.load() function.
This should alleviate your concerns about having to retrain your
tagger every time you run your program.
On Apr 3, 5:38 pm, Raymond <gunbuster...@gmail.com> wrote:
> No.......I don't know about parallel processing.
> I am doing a data mining project....
> each time I run my program, it has to tag 2000 reviews, which
> consumes a lot of time (1X minutes)
Hi Jacob, these are great posts!
I was curious why you didn't use the nltk.pos_tag() tagger as one of
your backoffs for this?
Also, do you (or anyone else?) have any intuition as to why a trigram
tagger might want only the two preceding words for tags? It would seem
to me that having the prior and following word might work better,
perhaps even using a 4Gram tagger for this too?
I am working on a hobby project to build a soft-AI that models a
person's speech patterns from their twitter feeds, blog posts,
mailing list posts, and IRC logs using an HMM trained on POS n-grams to
create a statistical grammar+speech model (subjectively N=4 seems to
be best). Right now I am using a chart parser called AGFL for my HMM
grammar labels and falling back to nltk.pos_tag when that fails to
completely parse sentences (which is fairly often).
I also use nltk.pos_tag() and TF-IDF to respond to messages from a
user, giving priority to nouns, and using nouns to govern when to
expire a contextual memory vector of the current conversation.
Do you or anyone else have any suggestions as to which type of
training set might work best, or if anyone has done work on marking up
training sets of twitter data with pos taggers? I saw in the word
sense disambiguation that twitter was a hot topic for this sort of
stuff.
The project lives at http://github.com/tdflatline/Resurrectron. When I
get it connected to twitter and polish it a bit more I plan on making
an official announcement to a few different lists to try to drum up
some more interest.
Right now I'm just looking for some suggestions though. You can see
the config file describing the capabilities at:
http://github.com/tdflatline/Resurrectron/blob/master/settings.cfg
and my ideas file at:
http://github.com/tdflatline/Resurrectron/blob/master/TODO.txt
--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To post to this group, send email to nltk-...@googlegroups.com.
To unsubscribe from this group, send email to nltk-users+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nltk-users?hl=en.
Did you empirically verify this conclusion with standard cross-
validation on training data? What sort of results did you get?
> As to why 3gram and not 4gram, I
> thought that a first order calculus was 3-gram so the use of 3 could capture
> that without complications. If the sentence was more complex it would be
> taken care by the training corpora anyway.
First order calculus is a property of the (assumed) underlying
reasoning engine, right? English is much more verbose than that, and
grammar dependencies can span multiple words until you get to the
actual words that are grounded in any reasoning model.
Simple example to illustrate that trigram position can influence
accuracy:
The vast ocean sounds great.
The vast ocean sounds can be wonderful.
How can a trigram using the two preceding terms disambiguate the fact
that 'sounds' is a verb in the first sentence but a noun in the
second?
If you use a trigram where you factor in the preceding and the
following term, the following counterexample quickly shows that you
may need 4-grams in some cases:
The vast ocean sounds birds from their nests at dawn.
The vast ocean sounds birds hear are wonderful.
Again, in the first sentence, sounds is a verb meaning "to announce,
order, or direct by or as a sound". In the second, it is again a noun.
> I took a look at your project and it looks really interesting. The
> accumulation of data may mean shorter topics. However the basic problem of
> text understanding remains. In Wilensky's work (1978), he concluded that
> only 60% of the necessary text was transmitted,
> the rest had to be inferred by the context. I think you may have the problem
> in spades here because a hastily penned sentence may assume more knowledge
> in the receiver. Wilensky's work is hard to get hold of. If you want a paper
> that addresses it you can go to my web-site and download an application
> paper entitled Planning Applied Module. This can be downloaded from the
> first page art. 3 of www.lifecyclevnv.com. It's 26 MB in length. The paper
> is the high point of the work on Conceptual dependency. I attach a summary
> of Conceptual Dependency that may help you understand it better.
Yeah, I'm aware of the problems of creating strong AI. I am interested
in taking a statistical short-cut around all the strong-AI reasoning
problems to build a chat bot that can be trained on a corpus of text,
and sound reasonably close to the source human being at least most of
the time, can respond to queries with relevant text, and most of all,
be fun to follow and interact with over twitter.
As far as I can tell, no one has attempted this. Maybe that's because
it's a crazy thing to attempt and everyone knows better. But I think
with enough polish and with the right source corpus, some amusing
results can be had (especially if you pick amusing and/or
controversial narrow-topic speakers to 'clone').
While it still has plenty of room for improvement, my code does
accomplish this, more or less. So far, training it on Morpheus from
the Matrix 1 and 2 has produced the most consistent and entertaining
output. I can't wait to try it on transcripts of Glenn Beck and Bill
O'Reilly.
I'm also eager to try to focus the HMM vocab on specific topics using
clustering (so I can actually create one HMM per clustered topic), but
am hitting bugs in the nltk clustering algorithms right now. I hope to
submit patches in a couple days.
--
Can I train my tagger on a big corpus such as brown, for
general-purpose use?
Because I cannot find text which is similar to the hotel reviews,
which are quite casual.
Thanks :)
>
> I was curious why you didn't use the nltk.pos_tag() tagger as one of
> your backoffs for this?
I'm pretty sure nltk.pos_tag() didn't exist when I wrote the
part-of-speech tagging series. But I do plan to do a part 4 sometime
this year (I'd like to say this month, but we'll see).
>
> Also, do you (or anyone else?) have any intuition as to why a trigram
> tagger might want only the two preceding words for tags? It would seem
> to me that having the prior and following word might work better,
> perhaps even using a 4Gram tagger for this too?
I've always thought so, but haven't done any testing to confirm. My
article on a classifier based chunker does imply that looking ahead
can help, but only if you also look behind:
http://streamhacker.com/2010/03/15/nltk-classifier-based-chunker-accuracy/
>
> I am working on a hobby project to build a soft-AI that models a
> person's speech patterns from their twitter feeds, blog posts,
> mailinglist posts, and irc logs using an HMM trained on POS Ngrams to
> create a statistical grammar+speech model (subjectively N=4 seems to
> be best). Right now I am using a chart parser called AGFL for my HMM
> grammar labels and falling back to nltk.pos_tag when that fails to
> completely parse sentences (which is fairly often).
It's fairly difficult to parse normal structured English, and now
you're trying to parse English in the wild? Good luck with that :)
I think training and partial parsing (chunking) will be your best bet.
>
> I also use nltk.pos_tag() and TF-IDF to respond to messages from a
> user, giving priority to nouns, and using nouns to govern when to
> expire a contextual memory vector of the current conversation.
>
> Do you or anyone else have any suggestions as to which type of
> training set might work best, or if anyone has done work on marking up
> training sets of twitter data with pos taggers? I saw in the word
> sense disambiguation that twitter was a hot topic for this sort of
> stuff.
There may be some categories of the brown corpus that might be
suitable, at least for initial training. Take a look at nltk_data/
corpora/brown/CONTENTS
>
> The project lives at http://github.com/tdflatline/Resurrectron. When I
I think specific categories of the brown corpus might be useful.
Definitely try the 'reviews' categories, and maybe check out some of
the others.
execnet essentially spawns new python interpreters, so there's no
shared memory or locking, and communication must be through messages
passed over channels. This may not be optimal for local parallel
processing, but since execnet can spawn over ssh, distributed
processing is just as easy. And execnet can spawn interpreters for
jython, ironpython, and other versions of python.
It's really an apples to oranges comparison, because you could still
run threads within a spawned execnet session. But personally, I find
it much simpler conceptually, though that's probably because I'm also
partial to erlang :)
Jacob
> > I was curious why you didn't use the nltk.pos_tag() tagger as one of
> > your backoffs for this?
>
> I'm pretty sure nltk.pos_tag() didn't exist when I wrote the part-of-
> speech tagging series. But I do plan to do a part-4 sometime this year
> (I'd like to say this month, but we'll see)
Ok. Awesome. I am very much looking forward to it.
> > Also, do you (or anyone else?) have any intuition as to why a trigram
> > tagger might want only the two preceding words for tags? It would seem
> > to me that having the prior and following word might work better,
> > perhaps even using a 4Gram tagger for this too?
>
> I've always thought so, but haven't done any testing to confirm. My
> article on a classifier based chunker does imply that looking ahead
> can help, but only if you also look behind:
http://streamhacker.com/2010/03/15/nltk-classifier-based-chunker-accuracy/
Yeah, after responding to Pedro above and taking the time to come up
with those 'ocean sounds' counterexamples to the prefix idea, I'm
pretty darn convinced that context can be really important to
accuracy.
One of the ideas that just struck me was that you might be able to
train a contextual 4gram tagger and perhaps even a 3gram tagger on
less training data if you do something similar to what I do with my
HMM training. The 3/4gram would be trained on the current word + just
the surrounding tags, rather than the actual words. I think really the
major thing that governs the POS of a word in context is the /tags/ of
the surrounding words, not the actual words themselves.
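A sketch of that feature idea, using a hypothetical helper (not an NLTK API): features for position i built from the current word plus the *tags* of its neighbors rather than the neighboring words themselves.

```python
# Hypothetical feature extractor for a classifier-based tagger: the
# current word plus the tags of its neighbors, not the neighbor words.
def context_features(words, tags, i):
    return {
        'word': words[i],
        'prev_tag': tags[i - 1] if i > 0 else '<s>',
        'next_tag': tags[i + 1] if i < len(words) - 1 else '</s>',
    }

words = ['the', 'vast', 'ocean', 'sounds', 'great']
tags = ['DT', 'JJ', 'NN', 'VBZ', 'JJ']
print(context_features(words, tags, 3))
# {'word': 'sounds', 'prev_tag': 'NN', 'next_tag': 'JJ'}
```

One caveat: a left-to-right tagger doesn't know the following tag at tagging time, which may be part of why lookahead features often use the following word, or a tag from a first pass, instead.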
It looks like you're arriving at a similar conclusion in your chunker
post, right? Though you do mention sometimes it is helpful to include
both the word and the POS tag in the training? Any ideas why this
might be?
> > I am working on a hobby project to build a soft-AI that models a
> > person's speech patterns from their twitter feeds, blog posts,
> > mailinglist posts, and irc logs using an HMM trained on POS Ngrams to
> > create a statistical grammar+speech model (subjectively N=4 seems to
> > be best). Right now I am using a chart parser called AGFL for my HMM
> > grammar labels and falling back to nltk.pos_tag when that fails to
> > completely parse sentences (which is fairly often).
>
> It's fairly difficult to parse normal structured English, and now
> you're trying to parse English in the wild? Good luck with that :)
> I think training and partial parsing (chunking) will be your best bet.
I think you misunderstand. My language model does not depend on a full
parse, or even chunks. I only require the POS tags. Thankfully, when
AGFL fails, it tends to fail somewhat gracefully, and gives you
partial parses of the sentence fragments it can still work with, and
leaves the rest untagged. I then take only the AGFL parse leaves and
fill in the missing tags with nltk's pos_tag. This seems to give me
significantly better HMM output text than nltk.pos_tag() alone. This
may actually have more to do with AGFL's more specific tags than
actual raw accuracy though.
I would love to be able to drop AGFL for a purely statistical
approach, though. I look forward to your future posts to help me along
with that.