>
> This probably isn't the place for discussion... but how come?
> According to the papers there's a 0.1% accuracy difference between the
> two, so why are rule-based taggers shunned?
>
> Like I said, this is probably not the place for discussion. Any
> suggestions on where we can discuss differences, as opposed to just
> the paper of the month?
This is the perfect place for this discussion, IMO, and is a great
question.
>
> My understanding is also that the rule-based stuff has a slower
> execution time, whereas the statistical stuff has a longer training
> time but executes relatively quickly. When machines were slower, this
> caused a "drift" in research towards the statistical techniques that
> everyone still seems to focus on.
>
> I TOO think that not enough research has been done on the rule-based
> side. I believe the statistical methods can only go so far and are not
> flexible enough to change from domain to domain. I feel the only way
> that we will ever be able to reach 100% accuracy is to merge both of
> these techniques somehow.
Rule-based systems are often brittle in the face of changing domains,
too. I "grew up" in rule-based systems, and while you often could get
higher accuracy in a domain thanks to fine-tuning by analysts, it
seemed to me that the cost of hiring people to come up with
generalized rules, versus hiring people simply to annotate, was
significant. The skill level required for the former was usually much
higher and thus more expensive. Not that any Joe off the street can
annotate either, but annotation doesn't usually require learning a
rule language on top of going through lots of data.
Just my two cents,
Grant
If your automatic tokenizer's boundaries do not match your
hand-annotated boundaries, you basically cannot proceed. And tokenizing
real-life data is absolutely not a trivial exercise. Just see what
happened with newspaper-trained systems when they tried to parse data
from the biological domain. Even with their tokenization rules
adjusted, they still had major problems due to frequent nested named
entities and such.
So, in the end, you have to hire that very expensive domain expert to
build special-purpose regular expressions for you to do the
tokenization. That is probably still not as expensive, for large
systems, as having to do everything with rules, but it is not as rosy
as 'just annotate a lot of data'. Google can afford to ignore the hard
cases because they just have so much data. Others - not so much.
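To make that concrete, here is a toy sketch (my own illustration in Java, not
taken from any real system) of the kind of special-purpose tokenization
pattern I mean, where a generic punctuation split would break names like
IL-2 or NF-kappaB apart:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BioTokenizer {
    // Try the domain-specific pattern before falling back to single characters.
    // "IL-2", "NF-kappaB", "p53" stay whole instead of being split on the
    // hyphen or digit boundary the way a generic newspaper tokenizer would.
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*" // words and hyphenated gene/protein names
        + "|\\d+(?:\\.\\d+)?"                    // numbers, including decimals
        + "|\\S"                                 // any other single non-space character
    );

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("IL-2 activates NF-kappaB in T-cells (p53+)."));
        // prints: [IL-2, activates, NF-kappaB, in, T-cells, (, p53, +, ), .]
    }
}

A real biomedical tokenizer needs many more such rules (chemical formulas,
mutation notation, etc.), which is exactly why that domain expert gets
expensive.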
Of course, this is all based on a statistical sample of 1... :-)
Regards,
Alex.
On Fri, Jul 17, 2009 at 8:52 AM, Grant Ingersoll <grant.i...@gmail.com> continued:
On Wed, Jul 15, 2009 at 8:27 AM, Grant Ingersoll <grant.i...@gmail.com> wrote:
Here are the proposed papers. Copying and pasting from two-column PDFs is a pain, so I'll leave it to you to read the abstracts.
2. A Maximum Entropy Model for Part-of-Speech Tagging by Adwait Ratnaparkhi
   http://acl.ldc.upenn.edu/W/W96/W96-0213.pdf
   Cites: 480+
   Description: Maximum entropy is often used for classification problems (which POS tagging is). For instance, OpenNLP uses max. ent. to do POS tagging, AIUI. This paper describes the max. ent. statistical approach to POS tagging, claiming over 96% accuracy.
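For anyone who wants to try the max. ent. approach hands-on, here is a rough
sketch of running OpenNLP's maximum entropy POS tagger over a sentence. The
class names and the en-pos-maxent.bin model file are what I believe recent
OpenNLP releases ship with, so treat this as an illustration rather than a
definitive recipe:

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class MaxEntTagDemo {
    public static void main(String[] args) throws Exception {
        // Load a pre-trained maximum entropy POS model; adjust the path
        // to wherever your copy of the model file lives.
        try (InputStream in = new FileInputStream("en-pos-maxent.bin")) {
            POSModel model = new POSModel(in);
            POSTaggerME tagger = new POSTaggerME(model);

            // The tagger expects a pre-tokenized sentence and returns one tag per token.
            String[] tokens = WhitespaceTokenizer.INSTANCE
                    .tokenize("Maximum entropy taggers pick the most probable tag sequence .");
            String[] tags = tagger.tag(tokens);

            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }
}

Note that the quality of the output hinges on the tokenization matching what
the model was trained on, which loops right back to Alex's point above.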