Here is my summary and thoughts ...
In this paper the authors present an HMM-based chunk tagger and use it
to build a named entity recognition system which, they claim, performs
better than other machine learning systems and even better than
handcrafted rules.
In the introduction, they claim that "rule based systems lack the
ability of coping with the problems of robustness and portability".
They claim that their system just needs to be retrained on new data
and will then carry over to any new domain. I find this amusing
because it is the same argument that authors of unsupervised systems
make against supervised systems like this one.
They mention other techniques, specifically another HMM, Maximum
Entropy, Decision Tree, and a system based on transformation-based
rules. They claim that the HMM's performance is higher than the
others' and indicate that the reason is that it captures the locality
of phenomena better. I am surprised, because I would have expected
this of the Maximum Entropy system.
They also claim that the performance of a machine learning system is
always about 2% poorer than that of a rule based system. They claim
their system does better than rule based systems, though.
They mention two kinds of evidence that can be used to address the
ambiguity, robustness and portability issues: internal evidence
(within the word itself) and external evidence (the surrounding
context).
The HMM model that is proposed uses mutual information between the
tags and the tokens instead of the Bayes rule used in traditional
HMM models. This allows them to directly generate the named entity
tags instead of modeling the original process that generates the
NE-class annotated words from the original words. (I was confused
here; what does this mean?) Because of this, mutual information
independence instead of conditional probability independence needs
to be assumed, but otherwise the formulas are similar.
The tokens in the formula are a structure consisting of a word
sequence and a word feature sequence. The word feature sequence
consists of 4 feature types:
1) internal - capitalization, numeric, etc. (77.6% performance)
2) semantic classification - month, Weekend, Quarter, etc. (10% extra
performance)
3) gazetteer feature - drawn from lists of names, places,
organizations, etc. (1.2% extra performance)
4) external macro context? I don't understand this one, but I think it
might be how the word fits a syntactic template? (5.5% extra
performance)
What they don't explain is whether the order of these made a
difference. For instance, if the gazetteer feature had been
implemented alone, would it have given only a 1.2% gain? Would the
internal feature have had such a large effect if it had been measured
incrementally, after another feature was already in use?
Back-off modeling is used as the smoothing technique. There is a
complex ordering to the back-off strategy.
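The paper's actual back-off chain is more elaborate than I can
reproduce here, but the general idea is to estimate a probability from
the most specific context that has counts, and to fall back to
progressively coarser contexts otherwise. A toy sketch (my own
simplification, not the paper's exact scheme):

def backoff_prob(tag, contexts, counts, discount=0.5):
    # `contexts` is ordered most specific -> least specific;
    # `counts[context]` maps tags to observed counts.
    # Toy simplification of back-off smoothing, not the paper's scheme.
    for context in contexts:
        tag_counts = counts.get(context)
        if tag_counts:
            total = sum(tag_counts.values())
            # Discounting reserves probability mass for unseen events.
            return max(tag_counts.get(tag, 0) - discount, 0) / total
    return 1e-6  # floor when every back-off level is unseen

counts = {
    ("InitCap", "said"): {"PER": 8, "ORG": 2},
    ("InitCap",): {"PER": 30, "ORG": 25, "LOC": 20},
}
contexts = [("InitCap", "said"), ("InitCap",)]
print(backoff_prob("PER", contexts, counts))  # (8 - 0.5) / 10 = 0.75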
Another issue is that they compare their mutual information model to
the earlier HMM study and conclude that the mutual information
formulation is what makes for better performance. However, there are a
lot of differences in the feature classes, and it's not clear whether
the improvement was a result of the features or of the algorithm.
-Scott Frye