On Sep 9, 8:27 am, Scott Frye <scottf3...@aol.com> wrote:
> ... this paper was a lot harder to understand
> than the last one. I was lost on a lot of the details.
Don't feel bad. I don't think that it was necessarily well expressed,
especially at the beginning.
Hopefully, I can clear up some of these issues.
> Here are some questions to stimulate discussion:
> -What is the difference between a parser and a POS tagger?
A parser constructs a grammatical structure for a sentence. A part of
speech (POS) tagger just labels each individual word with the
grammatical role it plays, in a very general sense, without reference
to the structure of the sentence or phrase.
For example, in "the red horse cart", a part of speech tagger would
tell us that "red" is serving as an adjective and that "horse" and
"cart" are nouns. Most POS taggers would not tell us that horse-cart
is a noun compound, nor that red probably applies to the cart (or
perhaps to the horse-cart compound), not the horse.
A parser, on the other hand, would give us structure such as
[NounPhrase [Article "the"] [AdjectivalPhrase "red" [NounCompound
[Noun "horse"] [Noun "cart"]]]]. Don't hold me to any high standards
on the actual labels that I use here; real grammarians tend to use
much more fine-grained distinctions. My labels here are just made up
to help explain the concepts.
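
If it helps to see the difference concretely, here is a small Python
sketch using NLTK (assuming it is installed and its tagger models have
been downloaded). The tree is built by hand just to show the shape of
a parser's output, with the same informal labels as above:

import nltk
from nltk import Tree

# POS tagging: a flat label for each word, no structure.
tokens = ["the", "red", "horse", "cart"]
print(nltk.pos_tag(tokens))
# roughly: [('the', 'DT'), ('red', 'JJ'), ('horse', 'NN'), ('cart', 'NN')]

# Parsing: a nested structure over the same words (hand-built here).
parse = Tree("NounPhrase", [
    Tree("Article", ["the"]),
    Tree("AdjectivalPhrase", ["red",
        Tree("NounCompound",
             [Tree("Noun", ["horse"]), Tree("Noun", ["cart"])])]),
])
parse.pretty_print()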
> -The authors say this parser is based on a probabilistic generative
> model. What is this exactly?
A generative model is one that specifies the complete joint
probability of the observable and hidden variables, conditional on any
parameters of the model. Typically, it is phrased in terms of
sub-models that each describe a simpler (more restricted) conditional
structure. This is very helpful in many inference problems where we
want to know what the hidden variables are. A non-generative model
would just tell us the probability of the observations without telling
us about the hidden variables. If there are no hidden variables, the
two are equivalent.
To be more explicit, the generative model used in Latent Dirichlet
Allocation is very nice and clear. It supposes that a creator of text
has a small number of topics on which they can speak. When creating a
document, the text creator has in mind a distribution of topics for
the document. They then pick a document length. Then they generate
words, one at a time until the document has the required length. To
generate a word, they pick a topic from the document topic
distribution. Each topic has a distribution of words. The text
creator then uses the topic of the word to pick a specific word from
the topic's word distribution.
In this model, the parameters are:
a) the distribution of document topic distributions
b) the word probabilities for each topic
c) the parameters of the length model
The hidden variables are:
a) the topic distribution for each document
b) the specific topic for each word
The observed variables are:
a) the length of each document
b) the words in the document.
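
To make that concrete, here is a rough Python sketch of that
generative story. The particular numbers, the Poisson length model,
and all variable names are illustrative assumptions on my part, not
anything taken from the paper:

import numpy as np

rng = np.random.default_rng(0)

num_topics, vocab_size, num_docs = 5, 1000, 3

# Parameters:
alpha = np.full(num_topics, 0.1)     # (a) governs document topic distributions
topic_word = rng.dirichlet(np.full(vocab_size, 0.01),
                           size=num_topics)  # (b) word probabilities per topic
mean_length = 100                    # (c) length model (Poisson, as an assumption)

documents = []
for _ in range(num_docs):
    doc_topics = rng.dirichlet(alpha)        # hidden (a): this document's topic mix
    length = rng.poisson(mean_length)        # observed (a): document length
    words = []
    for _ in range(length):
        z = rng.choice(num_topics, p=doc_topics)      # hidden (b): topic for this word
        w = rng.choice(vocab_size, p=topic_word[z])   # observed (b): the word itself
        words.append(w)
    documents.append(words)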
> -What is the difference between a tag, a label and a head?
A tag is usually a part of speech tag. A label is usually a label on
a node in a parsed structure. A head is the word within a phrase that
determines the phrase's grammatical character; in head-driven parsing,
each sub-tree is annotated with its head word (for "the red horse
cart", the head of the whole noun phrase would be "cart").
> -How is the probability of the expansion generated? (this section
> lost me)
That is the crux of the paper. I can't give you details on this yet
(haven't read hard enough), but schematically speaking, the generative
model gives a probability of the observed and hidden variables given
the parameters, p(observed, hidden | parameters). For training, we use the
observed variables, make up hidden variables and eventually get a
single value or distribution of values for the parameters.
For parsing, we have the observed variables and the parameters (from
training) and have to search for hidden variable values or
distributions of values that make us happy (have high probability).
That search is the trick that makes statistical parsing with a
generative model work. This search is usually done using a trimmed
best-first sort of algorithm called beam search.
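
To give a flavour of that search, here is a generic beam-search sketch
(not the paper's algorithm; expand, score, and is_complete are
hypothetical stand-ins for whatever the parsing model supplies):

import heapq

def beam_search(initial, expand, score, is_complete, beam_width=10):
    """Keep only the beam_width best partial hypotheses at each step."""
    beam = [initial]
    while True:
        candidates = []
        all_done = True
        for hyp in beam:
            if is_complete(hyp):
                # Finished hypotheses are carried along unchanged.
                candidates.append(hyp)
            else:
                all_done = False
                candidates.extend(expand(hyp))
        if all_done:
            return max(beam, key=score)
        # Prune: keep only the highest-scoring candidates.
        beam = heapq.nlargest(beam_width, candidates, key=score)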
> - The author refers to three parsing techniques:
> Charniak(2000) - maximum entropy "inspired"
> Collins(1999) - head driven statistical models
> Ratnaparkhi(1999) - maximum entropy models
> Are there others worth considering?
Yes. Collobert's approach is worth looking at. So are more general
Bayesian techniques (for which I don't have a reference handy).
These approaches differ mainly in how they come by the probability
distribution (or, in Collobert's case, a more general quality
estimate): in the dependency structure they assume, in how the parse
structure is described, and thus in how the search proceeds. There
are also some differences in how the parameters are estimated.
A much larger difference, not often discussed, is whether one uses a
maximum-likelihood approach for parameter estimation or instead
estimates a distribution over the parameters. Likewise, when parsing,
do we get a single parse structure or a distribution over parse
structures? Distributional techniques, naively applied, increase the
computational cost by five or more orders of magnitude, but there are
clever approaches that cost vastly less.
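
As a toy illustration of the point-estimate versus distribution
contrast, take a single categorical parameter, say the expansion
probabilities of one grammar rule (the counts and prior here are
invented for the example):

import numpy as np

rng = np.random.default_rng(0)
counts = np.array([40, 7, 3])   # hypothetical observed uses of three expansions

# Maximum likelihood: a single point estimate of the parameter.
mle = counts / counts.sum()

# Bayesian alternative: a posterior distribution over the same parameter,
# represented here by samples from a Dirichlet (uniform prior assumed).
posterior_samples = rng.dirichlet(counts + 1, size=1000)

print("MLE:", mle)
print("posterior mean:", posterior_samples.mean(axis=0))
print("posterior std:", posterior_samples.std(axis=0))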