Theoretical:
1. The paper is about maximum entropy. What is entropy in the context
of NLP, and why is it useful? Explain the formulas in layman's terms.
(See the entropy sketch after this list.)
2. "The generation of features for tagging unknown words relies on
the hypothesized distinction that 'rare' words in the training set are
similar to unknown words"
Is there any downside to this hypothesis?
3. Why do you think the features in Table 1 were chosen? How do they
help predict POS? What other features might be useful besides those in
Table 1? (See the feature-extraction sketch after this list.)
4. What advantages does the author claim over Markov models, TBL, and
other approaches? Are these claims accurate? What are the downsides
of the Max Ent approach?
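
For question 1, a minimal sketch of the entropy formula in Python:
H(p) = -sum over x of p(x) * log2 p(x), measured in bits. The tag
distributions below are invented for illustration. Absent any
constraints, the uniform distribution has maximum entropy, which is
the intuition behind the Max Ent principle: commit to nothing beyond
what the training data forces you to.

    import math

    def entropy(dist):
        """Shannon entropy (in bits) of a dict mapping outcome -> probability."""
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    uniform = {"NN": 0.25, "VB": 0.25, "JJ": 0.25, "DT": 0.25}
    skewed = {"NN": 0.85, "VB": 0.05, "JJ": 0.05, "DT": 0.05}
    print(entropy(uniform))  # 2.0 bits: maximum uncertainty over 4 tags
    print(entropy(skewed))   # ~0.85 bits: the model is much more certain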
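
For question 3, a rough sketch of extracting Table-1-style features
for a single (possibly rare) word. The feature names are my own and
the paper's exact templates differ in detail, but the idea is the
same: prefixes/suffixes and spelling cues stand in for the identity of
words too rare to have reliable counts.

    def word_features(word, prev_word, prev_tag):
        """Return a list of string-valued features for one word in context."""
        feats = ["prev_word=" + prev_word, "prev_tag=" + prev_tag]
        for n in range(1, min(4, len(word)) + 1):
            feats.append("prefix=" + word[:n])
            feats.append("suffix=" + word[-n:])
        if any(c.isdigit() for c in word):
            feats.append("contains_number")
        if any(c.isupper() for c in word):
            feats.append("contains_uppercase")
        if "-" in word:
            feats.append("contains_hyphen")
        return feats

    print(word_features("misquoted", "was", "VBD"))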
Practice:
1. The introduction mentions manually annotating corpora. What tools
have people used for annotation? What has been your experience with
them? How much time/money do you put into annotating?
2. What other smoothing algorithms are there? What are the benefits
of these algorithms? (See the smoothing sketch after this list.)
3. What are the main benefits of the tag dictionary? What data
structures would be effective for storing it, especially for a large
number of terms? What structures would be best for storing the
training data, etc.? (See the tag-dictionary sketch after this list.)
4. The Wall St. Journal corpus is one thing; how does Max Ent perform
on real-world text?
5. In what situations have you applied Max Ent approaches? How did
you determine it was the best approach?
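
For question 2, a sketch of add-one (Laplace) smoothing, the simplest
of the family; Good-Turing, Witten-Bell, and Kneser-Ney are common
alternatives. The counts below are invented for illustration.

    from collections import Counter

    def laplace_prob(counts, word, vocab_size):
        """P(word) with add-one smoothing: (count(word) + 1) / (N + V)."""
        total = sum(counts.values())
        return (counts[word] + 1) / (total + vocab_size)

    counts = Counter({"the": 5, "dog": 2, "barked": 1})
    print(laplace_prob(counts, "dog", 10))  # seen word: (2+1)/(8+10)
    print(laplace_prob(counts, "cat", 10))  # unseen word still gets > 0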
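
For question 3, a hash map from each word to the set of tags it
received in training is the simplest tag dictionary; for very large
vocabularies, a trie (prefix tree) can save memory by sharing common
prefixes. The toy corpus below is invented.

    from collections import defaultdict

    def build_tag_dict(tagged_corpus):
        """Map each word to the set of tags it received in training."""
        tag_dict = defaultdict(set)
        for word, tag in tagged_corpus:
            tag_dict[word].add(tag)
        return tag_dict

    corpus = [("the", "DT"), ("run", "VB"), ("run", "NN"), ("dog", "NN")]
    td = build_tag_dict(corpus)
    print(td["run"])    # {'VB', 'NN'}: the tagger's search is restricted to these
    print("cat" in td)  # False: unknown word, so all tags stay in play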
-Grant