As I understand it, nltk.pos_tag() is a tagger that's convenient to
use, and works in a lot of cases. If you take a look at the source for
that function, it loads up the "maxent treebank pos tagger" for
English, and uses that. So it might be useful for your application,
but you should test it out!
Possible issues:
- it's only useful if your application works with the same tag set
that this tagger produces. There are many different possible tag sets
-- you could make up your own -- but the ones used here are the ones
used in the Penn Treebank, so they're pretty common.
- It'll work much better if the text you want to tag is similar to the
text that this tagger was trained on.
Hope this helps!
--
-- alexr
On Mar 28, 12:34 pm, Alex Rudnick <alex.rudn...@gmail.com> wrote:
http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/tag/__init__.py#55
It loads the "currently recommended" tagger from a pickle file and
just uses that. And while it's a good tagger for some uses, it might
not be right for your particular application. But it should be a fine
default starting point, until/unless you find you need something
different.
On Wed, Mar 31, 2010 at 4:18 AM, Raymond <gunbus...@gmail.com> wrote:
> Well...actually I cannot find this maxent treebank pos tagger inside
> the documentation....
> So obviously I know nothing about it...
--
-- alexr
Something like the following should help you get an idea, and you may
also like to look at a post by Jacob Perkins here:
http://streamhacker.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/
import nltk
corpus = nltk.corpus.treebank.tagged_sents()
tagger = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')
tagger.evaluate(corpus)
Someone might be able to suggest a better corpus to evaluate against
though.
On Apr 1, 3:57 am, James Smith <jamessmith...@gmail.com> wrote:
> I believe the tagger was trained on the treebank corpus so it will be -
> very- accurate for that and similar texts.
interesting discussion...the question is: "How do I evaluate
pos_tag() on a corpus other than the treebank?"
When using pos_tag() on the treebank corpus, I got very good
results (clearly over 90%)... because pos_tag() was trained on the
treebank corpus.
But when using it on the brown corpus, the results are much weaker
(around 60%).
So...what is the reason for that? I thought about the tagsets used.
The treebank tagset is much smaller (46 tags) than the brown tagset
(472 tags), so it is clear that pos_tag() cannot give good results
when evaluated against the brown corpus.
An example:
- in the brown corpus we have these tags for singular nouns: NN, NN$,
NN+BEZ, NN+HVD, NN+HVZ, NN+IN, NN+MD, NN+NN
- in the treebank corpus we only have NN for singular nouns
Could that be the reason why pos_tag() didn't perform well on the
brown corpus? How can I find a "true" accuracy value for pos_tag()
on a corpus other than treebank?
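One rough way to compare across the two tagsets is to collapse Brown's compound tags down to a base tag before scoring. This is only a sketch: the splitting rule below is an assumption about Brown's tag notation, not an official mapping.

```python
# Hypothetical helper: collapse Brown compound tags (e.g. NN+BEZ, JJ-TL)
# to a base tag so they can be compared against Treebank-style tags.
# The splitting rule is an assumption, not an official NLTK mapping.
def simplify_brown_tag(tag):
    # Drop clitic parts ('+BEZ') and modifier parts ('-TL').
    return tag.split('+')[0].split('-')[0]

tagged = [('the', 'AT'), ('ocean', 'NN+BEZ'), ('great', 'JJ-TL')]
simplified = [(word, simplify_brown_tag(tag)) for word, tag in tagged]
print(simplified)  # [('the', 'AT'), ('ocean', 'NN'), ('great', 'JJ')]
```

Scoring against simplified tags would at least separate genuine mistakes from pure tagset mismatch.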
I hope you understand my question...my English is not the best :-)
thanks
Oli
Raymond - I recommend reading samples of each category in the brown
corpus to see if any of them are linguistically similar to your hotel
reviews. This can be more of an intuitive choice, or you could try
using grep to find key words or phrases you've identified in the hotel
reviews.
It's possible to train on just particular categories of the brown
corpus using the 'categories=[cat]' keyword argument to various tagger
functions. I hope this all underscores the need to train on text that
is similar to the text you want to tag. You can always bootstrap your
own corpus by training on an NLTK corpus, tagging your own text, then
going in and hand-correcting your tagged text to create a new training
corpus. Ideally, you can find interns or grad students to do the last
part :)
Hope that helps,
Jacob
---
http://streamhacker.com
http://twitter.com/japerk
I see there are many types of taggers.
If I want to train my own tagger, which tagger is the most
computationally economical?
I ask this because the data I have is quite huge.
And what type of tagger is pos_tag()?
Raymond
On Apr 2, 10:40 pm, Jacob Perkins <jap...@gmail.com> wrote:
> Oli - in addition to the tag differences, there's also language
> differences. pos_tag() should do well on conll2000 too, but brown is
> very different. The brown corpus is actually a collection of various
> small corpora, ranging from romance fiction to reviews published in
> newspapers. Each category can be quite different, and only some may be
> suitable for your purposes.
Do you mean which tagger will train the fastest, or which will tag
the fastest? Since training is a one-time cost, I think it's ok if it
takes a few minutes, or even hours. And for tagging, I believe
accuracy is far more important than speed. The method I recommend is
to combine many taggers to get higher accuracy, then do distributed
processing in order to speed things up. Here are some articles I
wrote that might help:
http://streamhacker.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/
http://streamhacker.com/2009/11/29/distributed-nltk-execnet/
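The combining idea is a backoff chain. Here's a minimal sketch with toy inline training data; a real setup would train on something like treebank.tagged_sents() instead:

```python
import nltk

# Toy tagged sentences just to make this runnable; substitute a real
# corpus (e.g. nltk.corpus.treebank.tagged_sents()) for actual use.
train = [
    [('the', 'DT'), ('ocean', 'NN'), ('sounds', 'VBZ'), ('great', 'JJ')],
    [('the', 'DT'), ('vast', 'JJ'), ('ocean', 'NN')],
]

# Each tagger defers to its backoff when it has no answer.
t0 = nltk.DefaultTagger('NN')               # last resort: guess noun
t1 = nltk.UnigramTagger(train, backoff=t0)  # most frequent tag per word
t2 = nltk.BigramTagger(train, backoff=t1)   # previous tag as extra context
print(t2.tag(['the', 'vast', 'ocean']))
```

Words the bigram tagger has never seen in context fall through to the unigram tagger, and completely unknown words get the default tag.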
> I ask this because the data I have is quite huge.
Sounds like a perfect candidate for distributed/parallel processing.
> And what type of tagger is pos_tag()?
I believe it's an instance of nltk.tag.ClassifierBasedPOSTagger using
nltk.classify.MaxentClassifier. Can someone else confirm?
I need faster tagging....
Raymond
The code you'll need is something along the lines of:
import pickle
tagger = ...  # code to train your tagger goes here
f = open('directory/tagger.pickle', 'wb')  # binary mode for pickled data
pickle.dump(tagger, f)
f.close()
And then when you need to load the tagger you use:
import pickle
f = open('directory/tagger.pickle', 'rb')  # binary mode again
tagger = pickle.load(f)
f.close()
If you save the tagger somewhere in the nltk_data hierarchy you can
also use the nltk.data.load() function.
This should alleviate your concerns about having to retrain your
tagger every time you run your program.
On Apr 3, 5:38 pm, Raymond <gunbuster...@gmail.com> wrote:
> No.......I don't know about parallel processing.
> I am doing a data mining project....
> each time I run my program, it has to tag 2000 reviews, which
> consumes a lot of time (1X minutes)
Hi Jacob, these are great posts!
I was curious why you didn't use the nltk.pos_tag() tagger as one of
your backoffs for this?
Also, do you (or anyone else?) have any intuition as to why a trigram
tagger might want only the two preceding words for tags? It would seem
to me that having the prior and following word might work better,
perhaps even using a 4Gram tagger for this too?
I am working on a hobby project to build a soft-AI that models a
person's speech patterns from their twitter feeds, blog posts,
mailing list posts, and IRC logs using an HMM trained on POS n-grams to
create a statistical grammar+speech model (subjectively N=4 seems to
be best). Right now I am using a chart parser called AGFL for my HMM
grammar labels and falling back to nltk.pos_tag when that fails to
completely parse sentences (which is fairly often).
I also use nltk.pos_tag() and TF-IDF to respond to messages from a
user, giving priority to nouns, and using nouns to govern when to
expire a contextual memory vector of the current conversation.
Do you or anyone else have any suggestions as to which type of
training set might work best, or if anyone has done work on marking up
training sets of twitter data with pos taggers? I saw in the word
sense disambiguation that twitter was a hot topic for this sort of
stuff.
The project lives at http://github.com/tdflatline/Resurrectron. When I
get it connected to twitter and polish it a bit more I plan on making
an official announcement to a few different lists to try to drum up
some more interest.
Right now I'm just looking for some suggestions though. You can see
the config file describing the capabilities at:
http://github.com/tdflatline/Resurrectron/blob/master/settings.cfg
and my ideas file at:
http://github.com/tdflatline/Resurrectron/blob/master/TODO.txt
--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To post to this group, send email to nltk-...@googlegroups.com.
To unsubscribe from this group, send email to nltk-users+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nltk-users?hl=en.
Did you empirically verify this conclusion with standard cross-
validation on training data? What sort of results did you get?
> As to why 3gram and not 4gram, I
> thought that a first order calculus was 3-gram so the use of 3 could capture
> that without complications. If the sentence was more complex it would be
> taken care by the training corpora anyway.
First order calculus is a property of the (assumed) underlying
reasoning engine, right? English is much more verbose than that, and
grammar dependencies can span multiple words until you get to the
actual words that are grounded in any reasoning model.
Simple example to illustrate that trigram position can influence
accuracy:
The vast ocean sounds great.
The vast ocean sounds can be wonderful.
How can a trigram using the two preceding terms disambiguate the fact
that 'sounds' is a verb in the first sentence but a noun in the
second?
If you use a trigram where you factor in the preceding and the
following term, the following counterexample quickly shows that you
may need 4-grams in some cases:
The vast ocean sounds birds from their nests at dawn.
The vast ocean sounds birds hear are wonderful.
Again, in the first sentence, sounds is a verb meaning "to announce,
order, or direct by or as a sound". In the second, it is again a noun.
> I took a look at your project and it looks really interesting. The
> accumulation of data may mean shorter topics. However the basic problem of
> text understanding remains. In Wilensky's work (1978), he concluded that
> only 60% of the necessary text was transmitted,
> the rest had to be inferred by the context. I think you may have the problem
> in spades here because a hastily penned sentence may assume more knowledge
> in the receiver. Wilensky's work is hard to get hold of. If you want a paper
> that addresses it you can go to my web-site and download an application
> paper entitled Planning Applied Module. This can be downloaded from the
> first page art. 3 of www.lifecyclevnv.com. It's 26 MB in length. The paper
> is the high point of the work on Conceptual dependency. I attach a summary
> of Conceptual Dependency that may help you understand it better.
Yeah, I'm aware of the problems of creating strong AI. I am interested
in taking a statistical short-cut around all the strong-AI reasoning
problems to build a chat bot that can be trained on a corpus of text,
and sound reasonably close to the source human being at least most of
the time, can respond to queries with relevant text, and most of all,
be fun to follow and interact with over twitter.
As far as I can tell, no one has attempted this. Maybe that's because
it's a crazy thing to attempt and everyone knows better. But I think
with enough polish and with the right source corpus, some amusing
results can be had (especially if you pick amusing and/or
controversial narrow-topic speakers to 'clone').
While it still has plenty of room for improvement, my code does
accomplish this, more or less. So far, training it on Morpheus from
the Matrix 1 and 2 has produced the most consistent and entertaining
output. I can't wait to try it on transcripts of Glenn Beck and Bill
O'Reilly.
I'm also eager to try to focus the HMM vocab on specific topics using
clustering (so I can actually create one HMM per clustered topic), but
am hitting bugs in the nltk clustering algorithms right now. I hope to
submit patches in a couple days.
--
Can I train my tagger on a big corpus such as brown, for
general-purpose use?
Because I cannot find text which is similar to the hotel reviews,
which are quite casual.
Thanks :)
>
> I was curious why you didn't use the nltk.pos_tag() tagger as one of
> your backoffs for this?
I'm pretty sure nltk.pos_tag() didn't exist when I wrote the
part-of-speech tagging series. But I do plan to do a part 4 sometime
this year (I'd like to say this month, but we'll see).
>
> Also, do you (or anyone else?) have any intuition as to why a trigram
> tagger might want only the two preceding words for tags? It would seem
> to me that having the prior and following word might work better,
> perhaps even using a 4Gram tagger for this too?
I've always thought so, but haven't done any testing to confirm. My
article on a classifier based chunker does imply that looking ahead
can help, but only if you also look behind:
http://streamhacker.com/2010/03/15/nltk-classifier-based-chunker-accuracy/
>
> I am working on a hobby project to build a soft-AI that models a
> person's speech patterns from their twitter feeds, blog posts,
> mailinglist posts, and irc logs using an HMM trained on POS Ngrams to
> create a statistical grammar+speech model (subjectively N=4 seems to
> be best). Right now I am using a chart parser called AGFL for my HMM
> grammar labels and falling back to nltk.pos_tag when that fails to
> completely parse sentences (which is fairly often).
It's fairly difficult to parse normal structured English, and now
you're trying to parse English in the wild? Good luck with that :)
I think training and partial parsing (chunking) will be your best bet.
>
> I also use nltk.pos_tag() and TF-IDF to respond to messages from a
> user, giving priority to nouns, and using nouns to govern when to
> expire a contextual memory vector of the current conversation.
>
> Do you or anyone else have any suggestions as to which type of
> training set might work best, or if anyone has done work on marking up
> training sets of twitter data with pos taggers? I saw in the word
> sense disambiguation that twitter was a hot topic for this sort of
> stuff.
There may be some categories of the brown corpus that might be
suitable, at least for initial training. Take a look at nltk_data/
corpora/brown/CONTENTS
>
> The project lives at http://github.com/tdflatline/Resurrectron. When I
I think specific categories of the brown corpus might be useful.
Definitely try the 'reviews' categories, and maybe check out some of
the others.
execnet essentially spawns new python interpreters, so there's no
shared memory or locking, and communication must be through messages
passed over channels. This may not be optimal for local parallel
processing, but since execnet can spawn over ssh, distributed
processing is just as easy. And execnet can spawn interpreters for
jython, ironpython, and other versions of python.
It's really an apples to oranges comparison, because you could still
run threads within a spawned execnet session. But personally, I find
it much simpler conceptually, though that's probably because I'm also
partial to erlang :)
Jacob
> > I was curious why you didn't use the nltk.pos_tag() tagger as one of
> > your backoffs for this?
>
> I'm pretty sure nltk.pos_tag() didn't exist when I wrote the part-of-
> speech tagging series. But I do plan to do a part-4 sometime this year
> (I'd like to say this month, but we'll see)
Ok. Awesome. I am very much looking forward to it.
> > Also, do you (or anyone else?) have any intuition as to why a trigram
> > tagger might want only the two preceding words for tags? It would seem
> > to me that having the prior and following word might work better,
> > perhaps even using a 4Gram tagger for this too?
>
> I've always thought so, but haven't done any testing to confirm. My
> article on a classifier based chunker does imply that looking ahead
> can help, but only if you also look behind:
http://streamhacker.com/2010/03/15/nltk-classifier-based-chunker-accuracy/
Yeah, after responding to Pedro above and taking the time to come up
with those 'ocean sounds' counterexamples to the prefix idea, I'm
pretty darn convinced that context can be really important to
accuracy.
One of the ideas that just struck me was that you might be able to
train a contextual 4gram tagger and perhaps even a 3gram tagger on
less training data if you do something similar to what I do with my
HMM training. The 3/4gram would be trained on the current word + just
the surrounding tags, rather than the actual words. I think really the
major thing that governs the POS of a word in context is the /tags/ of
the surrounding words, not the actual words themselves.
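A sketch of that feature idea, using a hypothetical helper (not an NLTK API): features for position i built from the current word plus the *tags* of its neighbors rather than the neighboring words themselves.

```python
# Hypothetical feature extractor for a classifier-based tagger: the
# current word plus the tags of its neighbors, not the neighbor words.
def context_features(words, tags, i):
    return {
        'word': words[i],
        'prev_tag': tags[i - 1] if i > 0 else '<s>',
        'next_tag': tags[i + 1] if i < len(words) - 1 else '</s>',
    }

words = ['the', 'vast', 'ocean', 'sounds', 'great']
tags = ['DT', 'JJ', 'NN', 'VBZ', 'JJ']
print(context_features(words, tags, 3))
# {'word': 'sounds', 'prev_tag': 'NN', 'next_tag': 'JJ'}
```

One caveat: a left-to-right tagger doesn't know the following tag at tagging time, which may be part of why lookahead features often use the following word, or a tag from a first pass, instead.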
It looks like you're arriving at a similar conclusion in your chunker
post, right? Though you do mention sometimes it is helpful to include
both the word and the POS tag in the training? Any ideas why this
might be?
> > I am working on a hobby project to build a soft-AI that models a
> > person's speech patterns from their twitter feeds, blog posts,
> > mailinglist posts, and irc logs using an HMM trained on POS Ngrams to
> > create a statistical grammar+speech model (subjectively N=4 seems to
> > be best). Right now I am using a chart parser called AGFL for my HMM
> > grammar labels and falling back to nltk.pos_tag when that fails to
> > completely parse sentences (which is fairly often).
>
> It's fairly difficult to parse normal structured English, and now
> you're trying to parse English in the wild? Good luck with that :)
> I think training and partial parsing (chunking) will be your best bet.
I think you misunderstand. My language model does not depend on a full
parse, or even chunks. I only require the POS tags. Thankfully, when
AGFL fails, it tends to fail somewhat gracefully, and gives you
partial parses of the sentence fragments it can still work with, and
leaves the rest untagged. I then take only the AGFL parse leaves and
fill in the missing tags with nltk's pos_tag. This seems to give me
significantly better HMM output text than nltk.pos_tag() alone. This
may actually have more to do with AGFL's more specific tags than
actual raw accuracy though.
I would love to be able to drop AGFL for a purely statistical
approach, though. I look forward to your future posts to help me along
with that.