[VOTE] Part of Speech Tagging Proposed Papers


Grant Ingersoll

Jul 14, 2009, 9:27:00 PM
to Natural Language Processing Virtual Reading Group
Here are the proposed papers. Copying and pasting from two-column PDFs is a pain, so I'll leave it to you to read the abstracts. (There's a short sketch of what transformation-based learning looks like after the list.)

1. A Simple Rule-Based Part of Speech Tagger by Eric Brill - http://acl.ldc.upenn.edu/H/h92/H92-1022.pdf
       Cites: 1100+
       Description: Describes a rule-based tagger that uses Transformation-Based Learning to acquire rules. Claims accuracy similar to stochastic methods.
       Note: there's also an interesting follow-on by Brill, in the form of a case study, which you can find by searching for part-of-speech tagging on Google Scholar.

2. A Maximum Entropy Model for Part-of-Speech Tagging by Adwait Ratnaparkhi - http://acl.ldc.upenn.edu/W/W96/W96-0213.pdf
       Cites: 480+
       Description: Maximum entropy is often used for classification problems (which POS tagging is). For instance, OpenNLP uses Max. Ent. to do POS tagging, AIUI. This paper describes the max. ent. statistical approach to POS tagging, claiming over 96% accuracy.

3. TnT - A Statistical Part-of-Speech Tagger by Thorsten Brants - http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf
       Cites: 780+
       Description: Uses Markov models for predicting POS tags.

4. A Practical Part-of-Speech Tagger by Cutting et al. - http://eprints.kfupm.edu.sa/20079/1/20079.pdf
       Cites: 580+
       Description: An earlier paper on using Markov models.

To sum up, we have one paper on transformation-based learning, two on Markov models, and one on maximum entropy.
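
Since transformation-based learning may be the least familiar of these, here's a minimal sketch of the idea from paper 1: tag every word with its most frequent tag, then apply learned correction rules in order. The tiny lexicon and the single rule below are invented for illustration; Brill's learned rules have the same shape but come from data.

```python
# Sketch of Brill-style transformation-based tagging.
# The lexicon and rule below are invented for illustration only.

# Step 1: initial state -- tag each word with its most frequent tag.
most_frequent_tag = {"the": "DET", "can": "NOUN", "run": "VERB", "fish": "NOUN"}

# Step 2: ordered correction rules: (from_tag, to_tag, context test).
rules = [
    # A word tagged NOUN right after another NOUN is retagged VERB.
    ("NOUN", "VERB", lambda prev_tag: prev_tag == "NOUN"),
]

def brill_tag(words):
    # Baseline assignment; unknown words default to NOUN.
    tags = [most_frequent_tag.get(w, "NOUN") for w in words]
    # Apply each learned rule, in order, over the current tag sequence.
    for from_tag, to_tag, test in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and test(tags[i - 1]):
                tags[i] = to_tag
    return list(zip(words, tags))

print(brill_tag(["the", "fish", "can", "run"]))
# [('the', 'DET'), ('fish', 'NOUN'), ('can', 'VERB'), ('run', 'VERB')]
```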

Voting is open until July 18th.

Cheers,
Grant

lianiana

Jul 15, 2009, 10:21:32 AM
to Natural Language Processing Virtual Reading Group
I choose paper No. 2.


T.R. Fitz-Gibbon

Jul 15, 2009, 11:30:56 AM
to Natural Language Processing Virtual Reading Group
I would also vote for paper 2: Max Ent PoS

Thanks!
T.R.

Joan

Jul 15, 2009, 4:41:11 PM
to Natural Language Processing Virtual Reading Group
From what I hear, the Brill tagger is a classic. I vote for the first
paper, especially since it has the case study on Google Scholar.
-N


Ozgur Yilmazel

Jul 15, 2009, 9:32:19 PM
to Grant Ingersoll, Natural Language Processing Virtual Reading Group
I vote for number 2 as well.

Jason Adams

Jul 15, 2009, 9:45:23 PM
to Natural Language Processing Virtual Reading Group
I vote for #2 as well.


Mike

Jul 15, 2009, 11:15:23 PM
to Natural Language Processing Virtual Reading Group
#2

Valia Vaneeva

Jul 16, 2009, 9:11:39 AM
to Natural Language Processing Virtual Reading Group
Vote for #2.


Scott Frye

Jul 16, 2009, 9:34:30 AM
to Natural Language Processing Virtual Reading Group
Another vote for #2:

A Maximum Entropy Model for Part-of-Speech Tagging by Adwait
Ratnaparkhi


po po

Jul 16, 2009, 10:38:52 AM
to Natural Language Processing Virtual Reading Group
Vote no. 2

Ronald Hobbs

Jul 16, 2009, 1:51:50 PM
to Natural Language Processing Virtual Reading Group
My vote's for #1.

Elmer Garduno

Jul 16, 2009, 2:55:40 PM
to Ronald Hobbs, Natural Language Processing Virtual Reading Group
#1


sujit

Jul 16, 2009, 4:20:22 PM
to Natural Language Processing Virtual Reading Group
Voting for #2.


Ted Dunning

Jul 16, 2009, 4:32:15 PM
to Natural Language Processing Virtual Reading Group

I would vote no on the Brill paper, as much as I like Eric's work. It
was a classic at the time, but *nobody* does rule-based POS taggers
anymore.

Ted Dunning

Jul 16, 2009, 4:38:40 PM
to Natural Language Processing Virtual Reading Group

If we go for MaxEnt tagging, it should be paired with Thorsten's TnT
paper. Markov models are pretty much the dominant paradigm for POS
these days, largely because they perform at least as well as (and
usually better than) almost anything else.

If the desire is for some diversity of approaches, then I would
advocate Ronan Collobert's paper on multitask learning. It includes
a very competitive POS tagger as part of a unified architecture that
achieved state-of-the-art parsing results:

http://www.kyb.tuebingen.mpg.de/bs/people/weston/papers/unified_nlp.pdf

Ronald Hobbs

Jul 17, 2009, 5:53:55 AM
to Natural Language Processing Virtual Reading Group
This probably isn't the place for discussion... but how come?
According to the papers there's only a 0.1% accuracy difference between
the two, so why are rule-based taggers shunned?

Like I said, probably not the place for discussion; any suggestions on
where we can discuss differences, as opposed to just the paper of the
month?

Jason Adams

Jul 17, 2009, 7:23:30 AM
to Natural Language Processing Virtual Reading Group
One reason rule-based systems in general are shunned is the expert
human labor involved upfront. Another reason is that statistical
methods have become the vogue in NLP (in part because of the first
reason). And they have been shown to work just as well or better a lot
of the time, for a lot less work.

This is related to a larger discussion between rationalism and
empiricism in linguistics. Chomsky has been a proponent of the
rationalist perspective: that you can examine your own language
faculty and make generalizations about how language works. You make
grammaticality judgments, or ask a group of test subjects to make
grammaticality judgments, and then develop rules for some linguistic
construct. The empiricist view, on the other hand, looks at data and
asks not whether something is grammatical but what sort of probability
it has. Empiricists tend to view nearly every utterance as possible,
just not necessarily very probable.

Of course I'm just skimming the surface here in the five minutes
before I need to head to work; it's a lot more complicated.

Grant Ingersoll

Jul 17, 2009, 7:58:56 AM
to Natural Language Processing Virtual Reading Group

On Jul 17, 2009, at 5:53 AM, Ronald Hobbs wrote:

>
> This probably isn't the place for discussion... but how come?
> According to the papers there's only a 0.1% accuracy difference between
> the two, so why are rule-based taggers shunned?
>
> Like I said, probably not the place for discussion, any suggestions on
> where we can discuss differences as opposed to just the paper of the
> month?

This is the perfect place for this discussion, IMO, and is a great
question.

Grant Ingersoll

Jul 17, 2009, 8:00:19 AM
to Natural Language Processing Virtual Reading Group
I think we can also choose to extend the topic to the next month, as
well, with another related paper.

Scott Frye

Jul 17, 2009, 8:35:40 AM
to Natural Language Processing Virtual Reading Group
My understanding is also that rule-based stuff has a slower
execution time, whereas the statistical stuff has a longer training
time but executes relatively quickly. When machines were slower, this
caused a "drift" in research towards the statistical techniques that
everyone still seems to focus on.

I TOO think that not enough research has been done on the rule-based
side. I believe the statistical methods can only go so far and are not
flexible enough to change from domain to domain. I feel the only way
that we will ever be able to reach 100% accuracy is to merge both of
these techniques somehow.

Grant Ingersoll

Jul 17, 2009, 8:52:10 AM
to Natural Language Processing Virtual Reading Group

On Jul 17, 2009, at 8:35 AM, Scott Frye wrote:

>
> My understanding is also that rule-based stuff has a slower
> execution time, whereas the statistical stuff has a longer training
> time but executes relatively quickly. When machines were slower, this
> caused a "drift" in research towards the statistical techniques that
> everyone still seems to focus on.
>
> I TOO think that not enough research has been done on the rule-based
> side. I believe the statistical methods can only go so far and are not
> flexible enough to change from domain to domain. I feel the only way
> that we will ever be able to reach 100% accuracy is to merge both of
> these techniques somehow.

Rule-based systems are often brittle in light of changing domains,
too. I "grew up" in rule-based systems, and while it seems you often
could get higher accuracy in a domain due to fine-tuning by analysts,
it seemed to me the cost associated with hiring people to come up with
generalized rules, versus hiring people simply to annotate, was
significant. The skill level for the former was usually much higher
and thus more expensive. Not to say any Joe off the street can
annotate, either, but annotation doesn't usually require learning a
rule language on top of going through lots of data.

Just my two cents,
Grant

Alexandre Rafalovitch

Jul 17, 2009, 9:07:27 AM
to Natural Language Processing Virtual Reading Group
On Fri, Jul 17, 2009 at 8:52 AM, Grant Ingersoll <grant.i...@gmail.com> wrote:
> Rule based systems are often brittle in light of changing domains,
> too.
I think statistical systems are brittle when changing domains as well,
for a somewhat hidden reason: tokenization.

If your automatic tokenizer's boundaries do not match your
hand-annotated boundaries, you basically cannot proceed. And real-life
data tokenization is absolutely not a trivial exercise. Just see what
happened with newspaper-trained systems when they tried to parse
biological-domain data. Even with tokenization rules adjusted, they
still have major problems due to the frequent nested named entities
and such.
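
A tiny illustration of the point (both "tokenizers" and the example string are invented for this sketch): the same biomedical-looking phrase comes out with completely different boundaries depending on the tokenizer, so tags annotated over one segmentation cannot be aligned with the other.

```python
import re

text = "IL-2-mediated activation of p53(+) cells"

# Newswire-style tokenizer: split off all punctuation.
newswire_tokens = re.findall(r"\w+|[^\w\s]", text)

# Biomedical-leaning tokenizer: keep hyphens and parenthesized
# modifiers attached to the token.
bio_tokens = re.findall(r"[\w()+-]+", text)

print(newswire_tokens)
# ['IL', '-', '2', '-', 'mediated', 'activation', 'of', 'p53', '(', '+', ')', 'cells']
print(bio_tokens)
# ['IL-2-mediated', 'activation', 'of', 'p53(+)', 'cells']
```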

So, in the end, you have to hire that very expensive domain expert to
build you special-purpose regular expressions to do the tokenization.
Probably still not as expensive for large systems as having to do
everything with rules, but not as rosy as 'just annotate a lot of
data'. Google can afford to ignore the hard cases because they just
have so much data. Others - not so much.

Of course, this is all based on a statistical sample of 1... :-)

Regards,
Alex.


T.R. Fitz-Gibbon

Jul 17, 2009, 10:41:00 AM
to Natural Language Processing Virtual Reading Group
Minor, minor point, but 100% accuracy is probably impossible, as even
human experts disagree about the PoS in certain cases. Granted, these
are not your everyday cases and involve some pretty specific tags (not
just noun, verb, etc.), but I still find this interesting.

I'm not trying to nitpick; the spirit of your comment is definitely
true, in my opinion.

T.R.

Scott Frye

Jul 17, 2009, 11:05:28 AM
to Natural Language Processing Virtual Reading Group
I thought that as well; however, while reading Jurafsky and Martin's
"Speech and Language Processing" I came across a reference to a paper
by Voutilainen in 1995 (it is here http://acl.ldc.upenn.edu/P/P97/P97-1032.pdf
if you are interested, though I haven't read it myself yet) that
indicates that when human annotators are allowed to collaborate, they
can get almost 100% agreement. Unfortunately, I don't know which
corpora have been tagged to this high a level.

Of course, that means at least two separate tagging systems with
slightly different algorithms would have to be used, and then they
would have to have some way of determining which was right in each
case. Still, it is fascinating.

-Scott


Grant Ingersoll

Jul 17, 2009, 11:21:47 AM
to Natural Language Processing Virtual Reading Group
_almost_ being the key word, even with annotators collaborating. Some
people are just stubborn ;-)

Scott Frye

Jul 17, 2009, 11:40:05 AM
to Natural Language Processing Virtual Reading Group
Point taken :)

By the way, I looked further and the link to the paper I gave was
incorrect. However, it did mention the correct paper:
http://acl.ldc.upenn.edu/E/E95/E95-1029.pdf, and it showed results of
human collaborations between 99.1% and 99.3%. I guess we can use that
as a goal instead of 100%.


Paul Kalmar

Jul 17, 2009, 3:07:36 PM
to Natural Language Processing Virtual Reading Group
On Jul 17, 5:35 am, Scott Frye <scottf3...@aol.com> wrote:
> I TOO think that not enough research has been done on the rule-based
> side.  I believe the statistical methods can only go so far and are not
> flexible enough to change from domain to domain.  I feel the only way
> that we will ever be able to reach 100% accuracy is to merge both of
> these techniques somehow.

Statistical methods can adapt better between domains if they use
something like domain adaptation (for example,
http://acl.ldc.upenn.edu/P/P07/P07-1033.pdf). As far as I know, no
such technique for switching between domains exists for hand-created
rules.

Ted Dunning

Jul 18, 2009, 2:56:31 PM
to Natural Language Processing Virtual Reading Group
As you say, you can often build a rule engine that achieves comparable
accuracy, but the expert human effort involved is significant.
Moreover, each time you change domain, you have to redo that human
effort. Overall, the cost is much higher for rule-based POS taggers,
for no accuracy benefit. Some statistical methods can be extended to
automagically retrain on new domains similar to previous ones, with
minimal requirements for additional annotation.

It should also be kept in mind that POS taggers have no inherent
value: part of speech is an artificial construct with no practical
value of its own. It was chosen as a task simpler than parsing that
might illuminate general principles of language processing. The
illumination was faint.

As an intermediate step, POS tagging has some value in that it
provides a feature to use as input in other systems such as named
entity bracketers. Again, however, practice indicates that things
aren't so simple and that having the POS tagger as a completely
independent entity within a larger system does not seem to perform as
well as simply using a POS tagger as an initial estimate for a
component which is then trained in the context of the larger task.

This contextual training provides an opportunity for the domain-switch
retraining I mentioned above.

Ted Dunning

Jul 18, 2009, 3:05:54 PM
to Natural Language Processing Virtual Reading Group


On Jul 17, 5:35 am, Scott Frye <scottf3...@aol.com> wrote:
> My understanding is also that rule-based stuff has a slower
> execution time, whereas the statistical stuff has a longer training
> time but executes relatively quickly.  When machines were slower, this
> caused a "drift" in research towards the statistical techniques that
> everyone still seems to focus on.

I was there. I don't think that this was the case. Both systems are
plenty fast enough that there was no practical difference in speed for
POS tagging.

The drift was much more because rule-based systems did not seem to be
able to exceed the accuracy of statistical systems and were much
harder to build, especially given the large annotated corpora
that are/were available.

> I TOO think that not enough research has been done on the rule-based
> side.

I think that several man-millennia of effort were wasted on rule-based
systems. Damerau did his dissertation in the early '60s showing that
simple Markov models were as good as any known grammatical system.
That was a great beginning for statistical models of language, but
that entire branch of research was submarined by the rule-based
community. Three decades later, this was shown to be a huge mistake.

> I believe the statistical methods can only go so far and are not
> flexible enough to change from domain to domain.

The data tend to disagree with you here. Purely rule-based systems
tend to be far more brittle in the face of domain change. Statistical
systems can be designed in such a way as to be much more agile in this
regard.

> I feel the only way
> that we will ever be able to reach 100% accuracy is to merge both of
> these techniques somehow.

This is already being done. Most statistical systems use basic
features taken from linguistic intuitions. A good example is the
handling of unknown words in POS tagging or Chinese segmentation.
Typically, the unknown word is replaced by a small number of
rule-driven features (is-capitalized, is-all-upper-case, ends-in-s,
ends-in-ly, ends-in-ed, and so on). This is a great role for
intuitionist rules.
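
To make that concrete, here's a minimal sketch of such hand-written feature extractors; the feature names follow the examples above, and a statistical tagger would consume the resulting dictionary in place of the unknown word itself. The exact feature set is illustrative.

```python
# Rule-driven features for unknown words; the exact set is illustrative.
def unknown_word_features(word):
    return {
        "is_capitalized": word[:1].isupper(),
        "is_all_upper_case": word.isupper(),
        "ends_in_s": word.endswith("s"),
        "ends_in_ly": word.endswith("ly"),
        "ends_in_ed": word.endswith("ed"),
    }

print(unknown_word_features("Grumbled"))
# {'is_capitalized': True, 'is_all_upper_case': False,
#  'ends_in_s': False, 'ends_in_ly': False, 'ends_in_ed': True}
```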


Ted Dunning

Jul 18, 2009, 3:11:00 PM
to Natural Language Processing Virtual Reading Group


On Jul 17, 6:07 am, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> On Fri, Jul 17, 2009 at 8:52 AM, Grant Ingersoll <grant.ingers...@gmail.com> wrote:
> > Rule based systems are often brittle in light of changing domains,
> > too.
>
> I think statistical systems are brittle when changing domains as well
> for a somewhat hidden reason: tokenization.

This is a red herring. All systems that use tokens as input have this
problem; rule-based and statistical POS taggers both suffer here.

In my experience, the rule-based taggers are less tolerant of these
problems.

> If your automatic tokenizer's boundaries do not match your
> hand-annotated boundaries, you basically cannot proceed.

This is another red herring, because you typically hand-annotate the
output of your automatic tokenizer. Doing otherwise would be silly.

> And the real-life data tokenization is absolutely not a trivial exercise.

Amen. This is often overlooked. Moreover, real-life tokenization is
not even unambiguous. There are many cases where the same string
might be tokenized differently in different contexts.

That said, there are some excellent statistical methods for
automagically building tokenizers. See Carl de Marcken's dissertation
and the statistical Chinese segmenters.

> Google can afford to ignore the hard cases because they just
> have so much data. Others - not so much.

I wouldn't say that they can ignore the hard cases so much as that
they can afford to attack them statistically because of the
availability of data.

Ted Dunning

Jul 18, 2009, 3:13:44 PM
to Natural Language Processing Virtual Reading Group

I think that a better goal would be to demonstrate lift on a useful
task such as named entity recognition.

It is quite plausible to expect that a 90% accurate tagger with a good
trainable structure could easily be more useful than a 98% accurate
tagger that is frozen. The trainable structure would allow the first
tagger to be adapted to be of the most use to the higher-level system.
That adaptation might increase or decrease the POS accuracy, but
whichever it does is immaterial, because POS tagging has little value
per se.


Grant Ingersoll

Jul 19, 2009, 7:00:05 AM
to Natural Language Processing Virtual Reading Group
On Wed, Jul 15, 2009 at 8:27 AM, Grant Ingersoll <grant.i...@gmail.com> wrote:
> 2. A Maximum Entropy Model for Part-of-Speech Tagging by Adwait Ratnaparkhi - http://acl.ldc.upenn.edu/W/W96/W96-0213.pdf

OK, here's our winner. Enjoy reading! Remember, feel free to ask questions as you go through.



Scott Frye

Jul 20, 2009, 10:05:41 AM
to Natural Language Processing Virtual Reading Group


On Jul 18, 2:56 pm, Ted Dunning <ted.dunn...@gmail.com> wrote:
> As you say, you can often build a rule engine that achieves comparable
> accuracy, but the expert human effort involved is significant.
> Moreover, each time you change domain, you have to redo that human
> effort.  Overall, the cost is much higher for rule-based POS taggers
> for no accuracy benefit.  Some statistical methods can be extended to
> automagically retrain on new domains similar to previous ones with
> minimal requirements for additional annotation.

Clearly you have far more expertise in this area than I do.

However, I am often confused by statements like the one you made above.
All of the algorithms I've seen in papers that recommend rule-based
taggers (such as Brill's) use some sort of "learning method" that
requires very little expert human effort to start them. I haven't
seen any papers that advocate creating all these rules by hand. I
do, repeatedly, hear exactly what you are saying: that a LOT of people
spent a lot of time doing exactly that, for some reason. Any idea why
people spent so many hours hand-coding these rules instead of learning
them automatically or developing algorithms (better than Brill's) to
learn these rules dynamically?

-Scott Frye

Ronald Hobbs

Jul 20, 2009, 1:43:02 PM
to Natural Language Processing Virtual Reading Group
After running through the ME model paper once, I still need to grok
the initial probability summation, so I will be asking questions on
that. The thing that stood out for me so far has been the features
and their correlation to Brill's rule templates.

Brill defined 8 rule templates, and 4 of those occur directly in the
features; the rest of the features and rules look fairly complementary
too: suffixes, previous-word capitalisation, tags before and after the
current word, etc.

I guess this is continuing the debate a bit, but it seems to me that
the only real difference between the two methods is the initial
assignment, which is where ME is applied and where Brill just uses the
most common tag. Am I just seeing what I want to see here?
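
For anyone comparing side by side, here's a rough sketch of the kind of contextual features in question; the templates are illustrative, not the exact set from either paper, but the overlap with Brill's rule templates (suffixes, surrounding words and tags, capitalisation) is visible.

```python
# Ratnaparkhi-flavored contextual features for the word at position i.
# Templates are illustrative, not the paper's exact feature set.
def context_features(words, tags, i):
    w = words[i]
    return {
        "word": w,
        "suffix3": w[-3:],  # cf. Brill's suffix-based rule templates
        "prev_word": words[i - 1] if i > 0 else "<BOS>",
        "prev_tag": tags[i - 1] if i > 0 else "<BOS>",
        "next_word": words[i + 1] if i + 1 < len(words) else "<EOS>",
        "prev_is_capitalized": i > 0 and words[i - 1][:1].isupper(),
    }

words = ["the", "old", "man", "boats"]
tags = ["DET", "ADJ", "NOUN"]  # tags predicted so far, left to right
print(context_features(words, tags, 3))
# {'word': 'boats', 'suffix3': 'ats', 'prev_word': 'man', 'prev_tag': 'NOUN',
#  'next_word': '<EOS>', 'prev_is_capitalized': False}
```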

Ronald Hobbs

Jul 21, 2009, 6:11:24 AM
to Natural Language Processing Virtual Reading Group
I made a big assumption error: the rules in Brill's model are used as
a corrective measure, while the features in ME are used initially to
figure out the probability distribution (1). This is probably why I
shouldn't have read both papers: square pegs & round holes.

T.R. Fitz-Gibbon

Jul 21, 2009, 6:36:08 PM
to Natural Language Processing Virtual Reading Group
Yes, absolutely. Very interesting, and thanks for the paper!

What I actually meant was that I've read about cases where even human
experts disagree on the exact part-of-speech. Again, they are not
common cases and they are for very specific parts-of-speech. So, it
is probably impossible to design a system that will agree with all
experts 100% of the time. Not a big deal, but I find it interesting
to have that perspective on the difficulty of the problem.

What I also find interesting is that it is really about how you define
"accuracy" or "100% accuracy". You could easily have 3 experts tag a
corpus and take their majority vote. Then, "100% accuracy" is
possible (or, at least, it is not limited by the definition of
"accuracy"; it may be limited by the data set or the difficulty of the
task, of course), but that's not necessarily what others think of when
you say "100% accuracy".

Another good example of this is sentiment classification. Many people
report high accuracy for sentiment classification; I've even heard
numbers around 90%. What they actually mean is, "we had a set of
humans tag a data set, we took the majority vote, and our system
agrees with this majority vote about 90% of the time". What you don't
always hear is that no one human agreed with the majority vote more
than, say, 60% of the time. In this case, "accuracy" is not
"correctness"; it is simply "agreement with a somewhat arbitrarily
selected testing set".

I'm not criticizing such approaches, people, research, etc. I'm
simply pointing out that there's often more to the story than just an
accuracy number. And, let's not even start on precision and recall in
this context. :)
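
That "agreement with the majority" arithmetic is easy to demonstrate with toy numbers (all labels below are invented): a system can score higher against the majority-vote gold standard than any single annotator does.

```python
from collections import Counter

# Invented per-item labels from three annotators and one system.
annotators = {
    "A": ["pos", "pos", "neg", "neg", "neg", "pos"],
    "B": ["pos", "neg", "neg", "pos", "pos", "neg"],
    "C": ["neg", "pos", "pos", "pos", "neg", "neg"],
}
system = ["pos", "pos", "neg", "pos", "neg", "pos"]

# Gold standard = per-item majority vote of the annotators.
gold = [Counter(votes).most_common(1)[0][0]
        for votes in zip(*annotators.values())]

def agreement(labels):
    return sum(a == g for a, g in zip(labels, gold)) / len(gold)

print("system:", agreement(system))       # about 0.83
for name, labels in annotators.items():
    print(name + ":", agreement(labels))  # about 0.67 each
```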


Ted Dunning

Jul 22, 2009, 2:38:19 PM
to Natural Language Processing Virtual Reading Group

I know that you backtracked in a follow-up, but this comment is nearer
to the truth than you think.

The issue is that the features in a statistical tagger often look a
lot like the rules in a rule-based system.

This is absolutely a true observation, and an important one. Picking
good features is absolutely the key step in machine learning
problems. If you have a human expert who has an opinion, it is very
reasonable to include a mechanized version of that opinion in your
system.

That said, the real complication in rule-based systems is figuring out
precedence and interactions between rules. That is exactly what the
machine learning (aka statistical) approach avoids, and that is exactly
where human experts go down the rabbit hole. In general, humans
describe things by exceptions rather than by the norm, and thus taking
down what they say in the form of rules typically misses the big
picture. Resolving this takes enormous effort.
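
A toy demonstration of that interaction problem (both rules are invented): the very same two correction rules give different final tags depending on the order they're applied in, which is exactly what an expert writing rules by hand has to keep in their head.

```python
# Two invented correction rules over a tag sequence.
def noun_after_noun_becomes_verb(tags):
    return [("VERB" if t == "NOUN" and i > 0 and tags[i - 1] == "NOUN" else t)
            for i, t in enumerate(tags)]

def final_verb_becomes_noun(tags):
    return tags[:-1] + ["NOUN"] if tags and tags[-1] == "VERB" else tags

def apply_rules(rules, tags):
    # Apply each rule in the given order to the current tag sequence.
    for rule in rules:
        tags = rule(tags)
    return tags

tags = ["NOUN", "NOUN"]
print(apply_rules([noun_after_noun_becomes_verb, final_verb_becomes_noun], tags))
# ['NOUN', 'NOUN']  -- the second rule undoes the first
print(apply_rules([final_verb_becomes_noun, noun_after_noun_becomes_verb], tags))
# ['NOUN', 'VERB']
```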

Ted Dunning

Jul 22, 2009, 3:06:05 PM
to Natural Language Processing Virtual Reading Group


On Jul 20, 7:05 am, Scott Frye <scottf3...@aol.com> wrote:
> On Jul 18, 2:56 pm, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > As you say, you can often build a rule engine that achieves comparable
> > accuracy, but the expert human effort involved is significant. ...
>
> However, I am often confused by statements like the one you made above.
> All of the algorithms I've seen in papers that recommend rule-based
> taggers (such as Brill's) use some sort of "learning method" that
> requires very little expert human effort to start them.  I haven't
> seen any papers that advocate creating all these rules by hand.  ... Any idea why
> people spent so many hours hand-coding these rules instead of learning
> automatically or developing algorithms (better than Brill's) to learn
> these rules dynamically?

Well, to answer your last question first: you had to be there at the
time for it to make any sense at all at this point.

As for a totally hand-built, rule-based approach: such systems not only
existed, they dominated the field in the '80s. AT&T distributed a
program known as parts, which was just such a beast, and it was widely
used. The Wikipedia page on POS tagging captures the essence of the
situation in the early '90s when it says that the success of
data-driven methods was "surprisingly disruptive". I remember people
yelling at each other over the issue at conferences.

http://en.wikipedia.org/wiki/Part-of-speech_tagging

Brill's work was, as much as anything, an attempt to suggest that we
could all just get along: that the hard part of rule-driven systems
could be learned, but the perceived superiority of rule-based systems
could be preserved. Before long, even that pretence was dropped and
people started just using data-driven systems (with hand-coded simple
features) because it was just so much easier.

Lots of modern work now centers on doing (almost) without the
hand-annotated corpus, by using clustering methods to derive a useful
tagging scheme from untagged language. Since most tags are completely
non-controversial, and since these systems can often highlight points
where they are unsure of their output, this can be a much more
efficient way to develop language tools.

Adil Aijaz

Jul 23, 2009, 1:12:24 AM
to Natural Language Processing Virtual Reading Group
What is the deadline for finishing the reading? I can't seem to find it in earlier emails.

adil

po po

Jul 23, 2009, 1:34:39 AM
to Natural Language Processing Virtual Reading Group
2 weeks

Ronald Hobbs

Jul 23, 2009, 6:09:19 AM
to Natural Language Processing Virtual Reading Group
Yeah, my backtrack is that Brill's approach used the rules after the
fact, to correct the initial assignment, whereas the features are
observations obtained at the beginning of training. This has been
enlightening, simply because I had made the more or less naive
assumption that statistics-based methods are driven simply by the
words themselves rather than by relationships between the words.
Effectively, I was only thinking in terms of the Tag Dictionary
mentioned in the ME paper.

Ted Dunning

Jul 23, 2009, 2:59:48 PM
to Natural Language Processing Virtual Reading Group

Brill's tagger *is* a statistical, learned tagger. The model is a
sparsified one, with the weights encoded in the order of the rules and
the prevalence of different phenomena in the language.

At the time, however, it was a very big deal that his model looked
like rules. The orthodoxy was that rules, well, ruled, and that the
only satisfactory model was one that had rules in it. In retrospect,
his model was virtuous because you could explain (and thus implement)
the learning algorithm, because the final model was small (if you
discount the lexical resources), and because it could tag quickly.
Whether it looks like rules or decision trees under the hood has
little importance anymore.