POS tagger implementation


Tanton Gibbs

Jul 30, 2009, 2:00:19 PM
to NLP-r...@googlegroups.com
Ok, I have finally finished reading the assigned paper and have a few
questions (I'm a machine learning newbie).

1. The "beam search" they describe: is that just a way of determining
maximum likelihood? Where could I find more information on that? Note
I haven't searched for it yet, but sometimes people can point to a
more understandable, level-appropriate page than a search engine can.

2. They talk of using a beam size of N = 5. What is the beam size?
Is that the number of tag candidates? So, when you define a feature
on w_i = "about" and t_i = IN, do you then evaluate that feature on
t_i across the top 5 tags?

3. Are there additional optimizations that can be done on it such as
some sort of pruning? They describe it as being slow, so one would
expect various optimizations to have been developed.

4. Is there a reference implementation of a POS tagger in mahout or in
some other library? Especially one similar to the one in the paper -
code might help my understanding :)

Sorry about the newbie questions!

Tanton

Jeremy K

Jul 30, 2009, 2:11:29 PM
to Tanton Gibbs, NLP-r...@googlegroups.com
re: Tanton's questions 1 and 2:
http://en.wikipedia.org/wiki/Beam_search is actually a fairly good
description.
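
If it helps, the general shape of it fits in a few lines of Python.
This is just a generic illustration (the function and argument names
are mine), not the paper's code or OpenNLP's:

# Generic beam search; expand() and score() are whatever your problem needs.
def beam_search(initial, expand, score, steps, beam_size=5):
    """expand(candidate) yields successor candidates and score(candidate)
    ranks them; only the beam_size best survive each step."""
    beam = [initial]
    for _ in range(steps):
        successors = [s for c in beam for s in expand(c)]
        successors.sort(key=score, reverse=True)
        beam = successors[:beam_size]   # this cut-off is the "beam"
    return max(beam, key=score)

For question 2, the way I read the paper: a candidate is a partial tag
sequence for the sentence, expand() tries every possible tag for the
next word, score() multiplies the per-tag probabilities, steps is the
sentence length, and N = 5 means only the five best partial sequences
survive at each word.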

--Jeremy

Tanton Gibbs

Jul 30, 2009, 2:26:50 PM
to NLP-r...@googlegroups.com
Ah, so it is a pruning algorithm. Great!
Thanks for the link.

Tanton

Shashikant Kore

Jul 31, 2009, 2:57:39 AM
to Tanton Gibbs, NLP-r...@googlegroups.com
On Thu, Jul 30, 2009 at 11:30 PM, Tanton Gibbs <tanton...@gmail.com> wrote:
>
> 4. Is there a reference implementation of a POS tagger in mahout or in
> some other library?  Especially one similar to the one in the paper -
> code might help my understanding :)
>

I recently tried OpenNLP for noun phrase identification, which uses
this algorithm. The BeamSearch class can be found in opennlp.tools.util.

OpenNLP uses a beam size of 10. I changed it to 5. That didn't change
the output, but the performance improved quite a bit: the run time came
down from 50 minutes to 30 minutes with the reduced beam size. (This
was for text of 10,000 word documents.) The lack of change in the
output could be attributed to the kind of input data. Maybe I should
try it with an even smaller width.

BTW, I must confess that I have read the paper but have very little
understanding of it apart from the general sketch.

--shashi

--
http://www.bandhan.com/

Grant Ingersoll

Jul 30, 2009, 6:48:23 PM
to NLP-r...@googlegroups.com

On Jul 30, 2009, at 2:00 PM, Tanton Gibbs wrote:
> 4. Is there a reference implementation of a POS tagger in mahout or in
> some other library? Especially one similar to the one in the paper -
> code might help my understanding :)

I believe OpenNLP uses MaxEnt for POS tagging (and other things).

Scott Frye

Jul 31, 2009, 10:43:10 AM
to Natural Language Processing Virtual Reading Group
I've put together a summary of the paper based on my understanding. I
admit I had to read it several times and welcome any pointers on where
I may have gotten it wrong. I also probably focused on the parts that
were of interest to me and may have missed some important points.

Summary:
- This paper presents a model for Part-of-Speech tagging.
- The author claims 96.6% accuracy, which was state of the art at the
time the paper was written (1996).
- The experiments were conducted on the Wall Street Journal corpus of
the Penn Treebank project.
- The author feels that the Brill Tagger can be improved on by "better
use of context".
- The model proposed can be classified as a "Maximum Entropy Model".
- The model is trained on data in which every word is marked with a
part of speech (using the Penn Treebank tag set).
- The model works in this way (as far as I understand; see the rough
code sketch after this summary):
  - Every word in the training data is examined.
  - As each word is considered, the two words before and the two words
after are looked at as well. This "window" is considered the
"history" of the word.
  - This "window" of five words has a certain number of functions that
determine whether a "feature" is present on the window of words.
  - Each function takes the history of a word and a tag, and returns 0
or 1, indicating whether the feature is present or not.
- Example features:
  - The word is 'about' and the word two before is 'the'.
  - The word contains a hyphen and the word is tagged as
'JJ' (adjective).
  - The word ends in 'ing' and is tagged as 'VBG' (verb, gerund).
- The features are generated automatically from the tagged training
data by applying the questions in Table 1 of the paper.
- Features that occur fewer than 10 times are not included in the
finished model.
- The author has a feature for "rare" words. He considers a rare
word one that occurs 3 times or fewer in the training data. Unseen
words are considered "rare" words.
- In the trained model, every feature function has a weight (alpha)
assigned to it. For instance, one function might fire to indicate
that a word should be a JJ based on the history, while another would
indicate a VBG. If the VBG happened more often with that history in
the training data, then that feature function would have a higher
weight (alpha) applied to it because it would be considered more
significant.
- The alphas are determined during training by selecting them so they
"maximize the probability of the training data". The paper doesn't
describe how this is done, but I assume that the occurrence of each
function in the training data is counted and then divided by the
number of training examples.
- The goal is to generate the probability of a particular tag given a
history when running the model against unseen data. When a word is
encountered, each of the k features in the list will be either 0 or 1.
If a feature is present, its alpha is collected. All of the alphas
collected for a given word occurrence are multiplied together to give
a "probability" of that tag and history occurring.
- This is not a true probability because these numbers don't sum to
100% over all tags, so the normalization constants pi and mu must be
used. (How these are calculated is not explained in this paper, but
it is a common thing to deal with in this type of product.)
- When a sentence is examined in new data, instead of picking the
single most probable tag for each word, the most probable "tag
sequence" for the entire sentence is constructed.
- To figure out the most probable "tag sequence", each possible
tag/history is searched. A beam search algorithm is used to keep this
tractable: the N most probable tag sequences are analyzed instead of
ALL possible tag sequences. The author chose N = 5 for his
experiments.
- To speed up the calculation further, a tag dictionary is used. It is
built from the tags that each word has been known to appear with. If
a word has not been seen before, all tags in the tagset are
considered.
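
To check my own understanding, here is a very rough Python sketch of
how I picture the tagging step. Everything here is made up for
illustration (the feature names, the toy weights, the helper names);
it is not the paper's code or any library's. I also threw in a
previous-tag feature because I think the paper's "history" includes
the two preceding tags as well as the surrounding words.

def features(words, prev_tags, i, tag):
    """Binary features that 'fire' for this history/tag pair (a toy set)."""
    feats = ["word=%s tag=%s" % (words[i], tag)]
    if words[i].endswith("ing"):
        feats.append("suffix=ing tag=%s" % tag)
    if "-" in words[i]:
        feats.append("has-hyphen tag=%s" % tag)
    if i > 0:
        feats.append("prev-word=%s tag=%s" % (words[i - 1], tag))
    if prev_tags:
        feats.append("prev-tag=%s tag=%s" % (prev_tags[-1], tag))
    return feats

def p_tag(words, prev_tags, i, tag, alphas, candidates):
    """p(tag | history): multiply the alphas of the features that fire,
    then normalize over the candidate tags (the paper's normalization)."""
    def raw(t):
        s = 1.0
        for f in features(words, prev_tags, i, t):
            s *= alphas.get(f, 1.0)      # feature not in the model -> weight 1
        return s
    return raw(tag) / sum(raw(t) for t in candidates)

def beam_tag(words, alphas, tagset, tag_dict=None, beam_size=5):
    """Keep only the beam_size most probable partial tag sequences at each
    word (the paper's N = 5) instead of all of them."""
    beam = [([], 1.0)]                   # (tag sequence so far, probability)
    for i, word in enumerate(words):
        candidates = tag_dict.get(word, tagset) if tag_dict else tagset
        expanded = [(seq + [t],
                     prob * p_tag(words, seq, i, t, alphas, candidates))
                    for seq, prob in beam for t in candidates]
        expanded.sort(key=lambda pair: pair[1], reverse=True)
        beam = expanded[:beam_size]      # the pruning step
    return beam[0][0]

# Toy run: two hand-set weights are enough to see the mechanics.
alphas = {"suffix=ing tag=VBG": 4.0, "word=about tag=IN": 3.0}
tagset = ["DT", "IN", "NN", "PRP", "VBD", "VBG"]
print(beam_tag("He was talking about the weather".split(), alphas, tagset))

With beam_size=1 this degenerates to greedy left-to-right tagging; the
only thing the beam buys you is that a few alternative partial
sequences are kept alive in case a later word makes one of them win.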

Some notes of the experiments:
- A beam size of 5 was used. Higher beam sizes didn't make the tagger
more accurate but made it slower.
- The tag dictionary didn't improve accuracy but made it run faster.
- Training time is O(NTA): N = training set size, T = tagset size, A =
average number of features that are true.
- Execution time is O(NTAB): N = sentence length, T = tagset size, A =
average number of features that are true, and B = beam size.

Advantages according to author:
- better able to use diverse information than the Markov Models.
- requires fewer supporting techniques than the Statistical Decision
Tree.
- provides a 'probabilistic framework' unlike the Transformation Based
Learner (Brill Tagger).

The author feels that this tagger doesn't perform hugely better than
these other taggers because of either a missing predictor (a feature
function not yet found) or a limitation of the Penn Treebank test
data. Designing "specialized features" that address some of the
failure points of the algorithm should improve the performance.


- Scott Frye

Scott Frye

Jul 31, 2009, 10:44:51 AM
to Natural Language Processing Virtual Reading Group
My Questions:
- What is it about this model that allows it to be classified as a
Maximum Entropy Model?
- I got a bit lost in formula (2), where they explain how this can be
expressed in the Maximum Entropy formalization. I am not sure what the
advantage of expressing it this way is, and whether that expression is
used in practice or is just theory to tie the approach back to the
Maximum Entropy model.
- This paper states that it "tests the hypothesis that better use of
context will improve accuracy" but then states that the accuracy is
only comparable to other methods. It never indicates if the
hypothesis is considered proven or not. I think this paper doesn't
really prove the point one way or another. Are there other papers
that have shed more light on this?
- The paper states: "Maximum entropy doesn't impose any distributional
assumptions on the training data." Can anyone elaborate on what is
meant by this and its significance?
- The "window" that features are described over is +/- 2 words from
the current word under consideration. Is there any advantage to using
a larger window? How much worse does a smaller window perform (i.e.
+/- 1). I'm sure the training time would be very long for larger
windows.

Disadvantages I see:
- The model needs to be trained on a large corpus annotated with POS
tags. Sounds like a lot of man-hours!
-There might be a more probable tag sequence that the algorithm
didn't find because the beam search finished on a local maximum
instead of a true maximum. The author indicated that increasing the
beam size didn't show an improvement, but didn't say how much it was
increased to reach this conclusion.
- The algorithm doesn't seem to improve the more it is used so it
will need to have the model updated periodically or for different
domains.
- Tagset must be known in advance and be static. Different tagging
schemes might perform differently?
- Highly reliant on the training data set.

- Scott Frye

Grant Ingersoll

Aug 3, 2009, 8:06:22 PM
to Scott Frye, Natural Language Processing Virtual Reading Group

On Jul 31, 2009, at 10:44 AM, Scott Frye wrote:

>
> My Questions:
> - What is it about this model that allows it to be classified as a
> Maximum Entropy Model?
> - I got a bit lost in formula (2), where they explain how this can be
> expressed in the Maximum Entropy formalization. I am not sure what the
> advantage of expressing it this way is, and whether that expression is
> used in practice or is just theory to tie the approach back to the
> Maximum Entropy model.

I took it to be the latter.

Ted Dunning

Aug 5, 2009, 6:16:34 PM
to Natural Language Processing Virtual Reading Group


On Jul 31, 7:43 am, Scott Frye <scottf3...@aol.com> wrote:
> I've put together a summary of the paper based on my
> understanding.  

Nicely done, too.

> - The alphas are determined during training by selecting them so they
> "maximize the probability of the training data". The paper doesn't
> describe how this is done, but I assume that the occurrence of each
> function in the training data is counted and then divided by the
> number of training examples.

This makes some unsupported and strong assumptions.

The phrase "maximize the probability of the training data" is a very
vague statement of the maximum likelihood training procedure. What is
actually happening which MLE training is that you maximize the
probability of the training data *as*estimated*by*the*model*. This is
a good thing in an asymptotic since because you wind up with an
asymptotically optimum model.

The trouble is that you get really bad results on sparse data due to
over-fitting. Moreover, you want to allow the model to be as
elaborate as is justified by the data which exacerbates the over-
fitting problem. Overcoming this requires some sort of regularization
and the flavor of regularization that I prefer is Bayesian.
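
To make the objective concrete, here is a toy Python sketch of the two
pieces involved. None of this is from the paper (which, if I remember
right, actually fits the alphas with Generalized Iterative Scaling
rather than by simple counting); the names and the Gaussian penalty
are just one simple way to illustrate MLE versus a regularized fit.

import math

def log_likelihood(data, w, prob):
    """data is a list of (history, tag) pairs; prob(h, t, w) is the model's
    conditional probability of tag t given history h under weights w
    (roughly, w[j] plays the role of log(alpha_j))."""
    return sum(math.log(prob(h, t, w)) for h, t in data)

def penalized_objective(data, w, prob, sigma2=1.0):
    """Plain MLE maximizes log_likelihood alone.  A Gaussian prior on the
    weights (one simple, Bayesian-flavoured regularizer) subtracts
    sum(w_j^2) / (2 * sigma2), which keeps features seen only a handful
    of times from being pushed to extreme weights."""
    penalty = sum(wj * wj for wj in w.values()) / (2.0 * sigma2)
    return log_likelihood(data, w, prob) - penalty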

> ... - This is not a true probability because these numbers don't sum
> to 100% over all tags, so the normalization constants pi and mu must
> be used. (How these are calculated is not explained in this paper,
> but it is a common thing to deal with in this type of product.)

And this is where they lose the thread of what probability really is.
Modern approaches often avoid these problems by a form of Metropolis
sampling which lets us explore the distribution of the model parameters
(and possibly even the model form) in light of the data and our prior
assumptions. Such techniques often allow us to sample from the
posterior parameter distribution without having normalized
distributions.
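
For the curious, here is a toy Metropolis sketch (plain Python, nothing
to do with the tagger itself; all the names are mine) showing why no
normalizing constant is needed: the accept/reject test only ever looks
at a ratio of densities, so any unknown constant factor cancels out.

import math
import random

def metropolis(unnorm_log_density, start, proposal_sd=0.5, n_samples=1000):
    """Draw samples from a density known only up to a constant factor."""
    x, samples = start, []
    for _ in range(n_samples):
        candidate = x + random.gauss(0.0, proposal_sd)
        # log of the ratio p(candidate)/p(x); a normalizer would cancel here
        log_ratio = unnorm_log_density(candidate) - unnorm_log_density(x)
        if log_ratio >= 0 or random.random() < math.exp(log_ratio):
            x = candidate
        samples.append(x)
    return samples

# e.g. an unnormalized Gaussian: the 1/sqrt(2*pi) factor never appears
draws = metropolis(lambda z: -0.5 * z * z, start=0.0)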

> Advantages according to author:
> - better able to use diverse information than the Markov Models.

But they don't really demonstrate this.

> - requires fewer supporting techniques than the Statistical Decision
> Tree.

Nor this.

> - provides a 'probabilistic framework' unlike the Transformation Based
> Learner (Brill Tagger).

Nor this. The "maxEnt" approach described in this paper is only
vaguely probabilistic as is Brill's technique. In modern parlance, a
technique is probabilistic if it uses real probabilities in a
principled fashion, typically requiring a generative model. This
current paper doesn't do that.

>
> The author feels that this tagger doesn't perform hugely better than
> these other taggers because of either a missing predictor (a feature
> function not yet found) or a limitation of the Penn Treebank test
> data. Designing "specialized features" that address some of the
> failure points of the algorithm should improve the performance.

This is a common statement (if I just had some phlogiston, my model
would work sooo much better). Actually, the fact that this tagger
doesn't work better than simpler approaches is a strong indication
that the simpler approaches are preferable.

Ted Dunning

Aug 5, 2009, 6:31:12 PM
to Natural Language Processing Virtual Reading Group


On Jul 31, 7:44 am, Scott Frye <scottf3...@aol.com> wrote:
> - This paper states that it "tests the hypothesis that better use of
> context will improve accuracy" but then states that the accuracy is
> only comparable to other methods.  It never indicates if the
> hypothesis is considered proven or not.

Bingo!

> I think this paper doesn't really prove the point one way or another.  

Actually, if we assume that the authors did a good job of implementing
their technique, we have some weak negative evidence.

> - The paper states: "Maximum entropy doesn't impose any distributional
> assumptions on the training data."  Can anyone elaborate on what is
> meant by this and its significance?

This statement is just plain wrong. The maximum entropy technique
starts with an assumption about marginal invariance and the further
assumption that maximum entropy under that constraint is "optimal".
This has strong implicit implications regarding the distribution of
the data and models.

> Disadvantages I see:
> - The model needs to be trained on a large corpus annotated with POS
> tags. Sounds like a lot of man-hours!

Indeed. There are potentially more efficient approaches based on some
form of active learning where the system asks for specific examples to
be tagged.

>  -There might be a more probable tag sequence that the algorithm
> didn't find because the beam search finished on a local maximum
> instead of a true maximum.  

Possible, but not terribly likely given that the problem is actually
pretty simple.

>  - The algorithm doesn't seem to improve the more it is used so it
> will need to have the model updated periodically or for different
> domains.

You mean you would like the system to adapt to recent text
automagically? How would you expect it to understand how to do this?

>  - Tagset must be known in advance and be static.  Different tagging
> schemes might perform differently?

This is almost universal with taggers. The major alternative is to
let the system invent its own tag set intended to facilitate some
criterion other than tag "correctness". This leads to systems like
Brown and Mercer's class-based language model or Collobert's unified
mixed-task learning framework.

P. Arulmozhi

Aug 6, 2009, 4:21:10 AM
to Ted Dunning, Natural Language Processing Virtual Reading Group
Can somebody attach the paper and send it to me? I am not able to download this paper.

Scott Frye

Aug 6, 2009, 10:15:24 AM
to Natural Language Processing Virtual Reading Group

On Aug 5, 6:31 pm, Ted Dunning <ted.dunn...@gmail.com> wrote:
> You mean you would like the system to adapt to recent text
> automagically?  How would you expect it to understand how to do this?
>

Well, that's the problem... a person learns more about language the
more they read. These algorithms don't. If everything is based on
counts of occurrences of various things in the test data, why can't
running totals be kept?

If I had time to pursue a PhD, this would probably be where I would
focus my efforts :)

-Scott Frye