[EOTM] POS Discussion Points


Grant Ingersoll

Jul 31, 2009, 3:29:08 PM
to Natural Language Processing Virtual Reading Group
While reading the paper, here are a few questions I thought would make
for good discussion; feel free to add your own. I thought I would break
them up into theory and practice.

Theoretical:

1. The paper is about maximum entropy. What is entropy in the context
of NLP, and why is it useful? Explain the formulas in layman's terms.
(A small sketch follows after this list.)

2. "The generation of features for tagging unknown words relies on
the hypothesized distinction that 'rare' words in the training set are
similar to unknown words"
Is there any downside to this hypothesis?

3. Why do you think the features in table 1 were chosen? How do they
help you predict POS? What other features might be useful besides
those in table 1?

4. What advantages does the author claim over Markov models, TBL, and
other approaches? Are these claims accurate? What are the downsides
of the MaxEnt approach?
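
As a starting point for question 1, here is a tiny sketch of what the
entropy formula computes for a tag distribution (the numbers are made
up, purely for illustration):

import math

def entropy(probs):
    # Shannon entropy in bits: H(p) = -sum_x p(x) * log2 p(x)
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two hypothetical tag distributions for a single word:
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits -- maximally uncertain
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits -- nearly certain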


Practice:

1. The introductory paragraph mentions manually annotating corpora.
What tools have people used for annotating? What has been your
experience in using them? How much time/money do you put into
annotating?

2. What other smoothing algorithms are there? What are the benefits
of these algorithms?

3. What are the main benefits of the tag dictionary? What data
structures would be effective for storing it, especially for a large
number of terms? What structures would be best for storing the
training data, etc.? (A small sketch follows after this list.)

4. The Wall St. Journal corpus is one thing; how does MaxEnt compare in
the real world?

5. What situations have you applied Max Ent approaches in? How did
you determine it was the best approach?
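
For practice question 3, here is a minimal sketch of the most obvious
structure: a hash map from each word to the tags (and counts) seen for
it in training. For a very large number of terms, a trie or a more
compact keyed structure might be worth discussing.

from collections import defaultdict

# Minimal tag-dictionary sketch: word -> {tag: count}, built from
# (word, tag) pairs observed in the training data.  Illustrative only.
tag_dict = defaultdict(lambda: defaultdict(int))

def add_observation(word, tag):
    tag_dict[word][tag] += 1

def allowed_tags(word):
    # Known word: restrict the tagger to tags seen in training.
    # Unknown word: return None, meaning consider the full tag set.
    return set(tag_dict[word]) if word in tag_dict else None

add_observation("flies", "NNS")
add_observation("flies", "VBZ")
print(allowed_tags("flies"))    # {'NNS', 'VBZ'}
print(allowed_tags("zzyzx"))    # None -> fall back to all tags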


-Grant

Scott Frye

Aug 3, 2009, 1:02:16 PM
to Natural Language Processing Virtual Reading Group

On Jul 31, 3:29 pm, Grant Ingersoll <grant.ingers...@gmail.com> wrote:
> 3. Why do you think the features in table 1 were chosen?  How do they  
> help you predict POS?  What other features might be useful besides  
> those in table 1?

Looking at the features, they seem to be based initially on the idea
of examining the two words before and the two words after the current
word and looking for a pattern. For instance, "The big" might
frequently be followed by a noun, or the 2nd word after a verb might
frequently be a noun (the object of the verb). This idea was clearly
expanded to include tags as well as individual words. This would
catch patterns such as DETERMINER ADJECTIVE NOUN instead of just
"the", "big" NOUN.

The lower half of the table seems to look at the previous tag, the
previous 2 tags, the previous word, the 2nd previous word, the
following word and the 2nd following word.

The upper half of the table seems to extend this thinking to various
properties of the word: suffix, prefix, contains a number, contains
uppercase, and contains a hyphen. Of course, there is a final (or
first) feature checking the most common tag for a specific word.
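
To make that concrete, here is a rough sketch of how I read the table
1 templates as feature-extraction code. This is just my paraphrase,
not the paper's implementation (and in the paper the spelling features
only fire for rare words):

def features(words, tags, i):
    # Table-1-style templates for position i: 'words' is the sentence,
    # 'tags' are the tags assigned so far.
    w = words[i]
    feats = {
        "word=" + w,
        "prev_tag=" + (tags[i-1] if i >= 1 else "BOS"),
        "prev_two_tags=" + (tags[i-2] + "," + tags[i-1] if i >= 2 else "BOS"),
        "prev_word=" + (words[i-1] if i >= 1 else "BOS"),
        "prev2_word=" + (words[i-2] if i >= 2 else "BOS"),
        "next_word=" + (words[i+1] if i + 1 < len(words) else "EOS"),
        "next2_word=" + (words[i+2] if i + 2 < len(words) else "EOS"),
        "contains_number=" + str(any(c.isdigit() for c in w)),
        "contains_upper=" + str(any(c.isupper() for c in w)),
        "contains_hyphen=" + str("-" in w),
    }
    for k in range(1, 5):              # prefixes and suffixes up to length 4
        feats.add("prefix=" + w[:k])
        feats.add("suffix=" + w[-k:])
    return feats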

Other features that might be useful?
Of course the distance examined could be extended beyond 2 words
before/after and two tags before to 3 or more. This would increase
the training time but would catch longer patterns of words, like
DETERMINER-NOUN-ACTION VERB-NOUN patterns.

Suffixes and prefixes up to length four are considered and this could
be extended as well. This might not be too helpful for English but
might be more helpful for other languages.

Finally, any number of properties could be looked at that are similar
to "contains upper case". Some that spring to mind are (a couple of
these are sketched below):
- previous word is upper case (2nd previous, following, 2nd following,
etc.)
- word is an abbreviation (from a known list of abbreviations)
(previous, following, etc.)
- person of a noun (if it's a noun) or tense of a verb (morphological
analysis of the word)
- number of the word (plural/singular)
- other counts of occurrence besides rare (frequent, uncommon but not
rare). Maybe look further than +/-2 words/tags on very common words
only?
- "power" of an adjective? Maybe a ranked list of adjectives could
be stored: (small, smaller, smallest, tiny, minuscule, etc.)
- embedded known words (e.g. believable, unbelievably)
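
A couple of those, sketched the same way (the abbreviation list and
the known-word list are placeholders I just made up):

ABBREVIATIONS = {"Dr.", "Mr.", "Inc.", "etc."}     # placeholder list
KNOWN_WORDS = {"believable", "think", "usual"}     # placeholder list

def extra_features(words, i):
    w = words[i]
    feats = set()
    if i >= 1 and words[i-1][:1].isupper():
        feats.add("prev_word_upper")       # previous word is upper case
    if w in ABBREVIATIONS:
        feats.add("is_abbreviation")       # from a known abbreviation list
    if any(k in w and k != w for k in KNOWN_WORDS):
        feats.add("contains_known_word")   # e.g. "believable" in "unbelievable"
    return feats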

-Scott

Arshad

Aug 4, 2009, 3:59:33 AM
to Natural Language Processing Virtual Reading Group
> 5. What situations have you applied Max Ent approaches in?  How did  
> you determine it was the best approach?

I am wondering whether MaxEnt can provide any help with clustering.
Are there any efficient MaxEnt-based clustering algorithms out there?

Arshad

Scott Frye

Aug 5, 2009, 2:29:06 PM
to Natural Language Processing Virtual Reading Group
Arshad,

Could you elaborate a bit? I'm not very familiar with clustering.
I've seen it mentioned a few times, and I did a quick scan of the
Wikipedia article on it (I didn't read it in depth). Are there any
seminal papers on the topic you could point me to?

Are you talking about clustering to determine the parts of speech
themselves, or just as a way to speed up the MaxEnt algorithm somehow?

-Scott Frye

Arshad

Aug 6, 2009, 1:52:53 AM
to Natural Language Processing Virtual Reading Group
Scott,

I am sorry I was not clear enough in my question. My question is not
related to POS tagging, but rather to the application of MaxEnt to
non-sequential classification.

To my understanding, MaxEnt is a classification algorithm, and POS
tagging is one of its applications. When we perform clustering, for
example with K-Means, we are also doing a kind of classification. So
my question is: can we use MaxEnt to perform clustering? The
difference I see is that POS tagging is a sequential classification
process, whereas general clustering is non-sequential.
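
To be concrete about what I mean, the core of MaxEnt classification is
just the conditional model below, and it needs labelled classes to fit
the weights, which is where it differs from something like K-Means
(the feature functions and weights here are made up):

import math

def p_class_given_x(x, classes, feature_fns, weights):
    # MaxEnt / log-linear form: p(c|x) = exp(sum_i w_i * f_i(x, c)) / Z(x)
    scores = {c: math.exp(sum(w * f(x, c) for f, w in zip(feature_fns, weights)))
              for c in classes}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

# Made-up example: two classes, one indicator feature per class.
fns = [lambda x, c: 1.0 if c == "A" and x > 0 else 0.0,
       lambda x, c: 1.0 if c == "B" and x <= 0 else 0.0]
print(p_class_given_x(2.5, ["A", "B"], fns, [1.3, 0.7]))   # p("A") ~ 0.79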

I could find these references related to MaxEnt and clustering:

1. http://www.springerlink.com/content/h162671084523748/
2. http://portal.acm.org/citation.cfm?id=1117276.

Arshad