On Tue, 18 Sep 2007, Rob Freeman wrote:
Quote:
I want to summarize some of the more practical aspects of those
solutions.
thanks for this summary; now I won't have to re-read the whole thread
to remind myself what's been discussed :-)
Quote:
..., we might use the context about a word or phrase to select, ad-
hoc, a class of words or phrases which are similar to that word or
phrase (in that context.) ... we can use these true/not true
distinctions to select both syntax, and meaning, specific to context,
in ways we have not been able up to now.
This suggests that corpus linguists should be interested in clustering
or unsupervised machine learning of words into classes according to
shared contexts; but they have been investigating this for some time,
see e.g. papers in Proceedings of ICAME'86, EACL'87. The main
difference between then and now is compute power: we can now use more
sophisticated clustering algorithms, and cluster according to more
complex context patterns, e.g. Roberts et al. in Corpora, vol. 1, pp.
39-57, 2006.
But my impression is that most Corpus Linguists are not really that
interested in unsupervised Machine Learning, i.e. letting the computer
work out the grammar/semantics "from scratch"; they prefer to examine
and analyse the corpus data "by hand" to select examples to back up
their own theories...
Eric Atwell, Leeds University
WWW/email: google Eric Atwell
On 9/18/07, Eric Atwell wrote:
> Quote:
> ..., we might use the context about a word or phrase to select, ad-
> hoc, a class of words or phrases which are similar to that word or
> phrase (in that context.) ... we can use these true/not true
> distinctions to select both syntax, and meaning, specific to context,
> in ways we have not been able up to now.
>
> This suggests that corpus linguists should be interested in clustering
> or unsupervised machine learning of words into classes according to
> shared contexts; but they have been investigating this for some time,
> see e.g. papers in Proceedings of ICAME'86, EACL'87. The main
> difference between then and now is compute power: we can now use more
> sophisticated clustering algorithms, and cluster according to more
> complex context patterns, e.g. Roberts et al. in Corpora, vol. 1, pp.
> 39-57, 2006.
Yes, people have been clustering words into classes according to
shared contexts for some time.
The point here is the idea that they need to cluster them into a
different class for each context in which they occur.
It is the goals of machine learning which I am suggesting need to
change (viz. a complete grammar), not the methods.
I think computational linguistics will get good results as soon as it
stops looking for global generalizations and clusters ad-hoc,
according to context.
Quote:
But my impression is that most Corpus Linguists are not really that
interested in unsupervised Machine Learning, i.e. letting the computer
work out the grammar/semantics "from scratch"; they prefer to examine
and analyse the corpus data "by hand" to select examples to back up
their own theories...
Whether they are working "by hand" or not, people are not used to
thinking of syntax as ad-hoc generalization according to shared
contexts. I'm suggesting this idea needs to be taken out of machine
learning (where it has only been seen as a means to find "grammar"
anyway, and not a principle of syntax in its own right) and given a
broader airing as a principle of syntax on its own merits.
It might explain why MWE's tend to have the same "slot fillers" for
instance. Detailed analyses of what slot fillers can occur in a given
MWE could be done on the basis of what other contexts two words share
and do not share.
Corpus analysis currently tends to be done in terms of lexicon, what
units are repeated, how often. Corpus style syntactic analyses could
be done on the basis of what words share what contexts, and how well
this predicts the range of combinations they participate in, how MWE's
change over time etc.
-Rob
Rob
(One feels nervous about continuing this now--but you are raising new
and important issues)
I think you may misunderstand what machine learning can do, though of
course it all depends on what you mean by learning/generalizing from
the "same context". Modesty normally forbids me citing, say,
http://citeseer.ist.psu.edu/stevenson01interaction.html, where
Stevenson and I combined learners for word-sense disambiguation over
quite a large corpus (The Red Badge of Courage), and one way of
interpreting what the learner was doing (and it is something some
would find distasteful) is that it was learning, for each sense of
each content word, what were the contexts and criteria that would
disambiguate it; there are other bits of contemporary and later
work that could also be described that way (and this was not at all
simple unsupervised learning, either).
Best
Yorick
PS On the on-going meta-issues, I fear the last paragraph of Eric
Atwell's message is very insightful as to what is really going on
here, under cloaks of "private fights", "abstract discussions",
"separate lists" etc.:
> But my impression is that most Corpus Linguists are not really that
> interested in unsupervised Machine Learning, i.e. letting the computer
> work out the grammar/semantics "from scratch"; they prefer to examine
> and analyse the corpus data "by hand" to select examples to back up
> their own theories...
I have a hunch most Corpus Linguists are not interested much in
computation in general, except as a secretarial/editing/retrieval
tool, but they have to pay lip service to it.
Paradoxically, I think, it is CL/NLP researchers who actually "trust
the text", in that they are experimenters who, by definition, don't
know what the results of computation/experiment will be. Many Corpus
Linguists, I suspect, and there are honourable exceptions, know
exactly where they are going and are as dependent on intuition and
judgement as Chomskyans, whom they still affect to criticize, and for
reasons not altogether clear to me. I have an on-going struggle with a
distinguished lexicographer friend and colleague, who uses
sophisticated KWIC indices to display contexts of a word, which he
then classifies by intuition. Suggestions as to how this last stage
could be automated, and I have made many over the years, are never
well received, so I have stopped making them.
Dear Yorick,
Yorick Wilks wrote, in PS:
> I have a hunch most Corpus Linguists are not interested much in
> computation in general, except as a secretarial/editing/retrieval
> tool, but they have to pay lip service to it.
Oh, I'm not so sure... I'm currently reading Linear Unit Grammar (by
John Sinclair and Anna Mauranen) and it seems to be more than paying
"lip service" to computational approaches: LUG reads as a theory of
grammar motivated by the success of Steven Abney's notion of
"chunking" (1991 "Parsing by chunks"), an influential approach to
parsing that was amenable to computers.
I'm sure I have not done justice to the subtleties of LUG, but I just
wanted to show that Corpus Linguists are open to the influence of
computational linguistics.
Regards,
David
Yorick
Modesty and nervousness initially stopped me replying to your thoughts
on 'machine learning' and unsupervised learning from corpora. But
these are risk-averse times.
We carried out a number of experiments on automatic categorisation of
texts in a corpus, using Kohonen Feature Maps or Self-Organising Maps
(SOMs). The results were encouraging in that our systems learnt to
classify newspaper stories (Reuters) and articles in specialist
journals. The point where introspection/intuition entered was when we
had to choose a 'feature vector' to train our large SOM.
Pensiri Manomaisupat, Bogdan Vrusias & Khurshid Ahmad, Categorization
of Large Text Collections: Feature selection for unsupervised and
supervised neural networks, 7th Int. Data Engineering and Automated
Learning Conf. (Lecture Notes in Computer Science, LNCS 4224),
Burgos, Spain, 20th-23rd September 2006, edited by E. Corchado, H.
Yin, V. Botti & C. Fyfe, Springer-Verlag, pp. 1003-1013.
Khurshid Ahmad
Professor of Computer Science
Department of Computer Science
Trinity College,
DUBLIN-2
IRELAND
Phone 00 353 1 896 8429
Web Page: http://people.tcd.ie/kahmad
Yorick,
the impression one gets from reading this thread and the ones related
to it is that people in corpus linguistics - and perhaps also a good
share of CL/NLP - are convinced that the meaning of words can be
recovered by means of KWIC, bigrams, and other similar lexically
based measures.
My "hunch" is that the semantics of words can only be drawn from the
semantics of the propositions in which the words are contained - not
just a topographical notion, but a syntax-related notion, if you like.
That is, syntax and semantics - in that sense, not just lexical - are
what ultimately determine meaning.
The trouble is that this is hard to approximate well with
statistically based methods alone. We all look forward to seeing a new
generation of CL/corpus linguists who will be able to couple
statistical methods, corpus measures and symbolic computation. After
all, I cannot give up the idea that it is our intuition as
theoreticians that must guide us...
Best
Rodolfo
P.S.: I know this has already been discussed between us, but the list
deserves more fuel...
Dear Rob,
> Yes, people have been clustering words into classes according to
> shared contexts for some time.
>
> The point here is the idea that they need to cluster them into a
> different class for each context in which they occur.
To my mind, this would not be possible because you will rarely (if
ever) see a word or phrase in the same context. Of course, you can
weaken this for "similar" contexts, but how you do that is an open
research problem.
I'd like to bring Distributional Analysis (DA) back in to this
discussion because I think the important points for your theory to
address are highlighted here.
Unsupervised word clustering has been the subject of much research, as
Eric Atwell reports. However, you are right to highlight a problem
with "completeness", because much of the research on word clustering
using DA relies on an assumption that each word type is unambiguous.
Even that work which has sought to address ambiguity directly, such as
that of Hinrich Schuetze (1997, "Ambiguity Resolution in Language
Learning") and Alexander Clark (2001, "Inducing syntactic categories
by context distribution clustering") includes this assumption to some
degree.
Clark's work attempts to cluster the N most-frequent words assuming no
ambiguity. He then uses this clustering as a bootstrap to classify
less frequent words allowing for ambiguity.
Schuetze's work attempts to cluster together word tokens by forming
distributions over both the contexts in which the word type occurs,
and contexts in which the token context occurs. However, the
distributions themselves only allow features from the list of N most
frequent words, which are of course subject to ambiguity that is not
resolved.
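A toy sketch of the bootstrap idea just described (this is not Clark's actual algorithm; the seed words and context sets are invented purely for illustration):

```python
# Invented toy data: context sets for two frequent "seed" words,
# assumed unambiguous after an initial clustering step, and for one
# rarer word we now want to classify.
seed_contexts = {
    "mat": {"the", "a", "this"},     # noun-like preceding contexts
    "run": {"to", "will", "they"},   # verb-like preceding contexts
}

rare_contexts = {"cat": {"the", "a"}}

def classify(word):
    """Bootstrap step: assign a rare word to the seed whose context
    set overlaps its own the most."""
    return max(seed_contexts,
               key=lambda s: len(rare_contexts[word] & seed_contexts[s]))

classify("cat")  # -> "mat", i.e. the noun-like class
```

The limitation D describes is visible even here: the seed contexts themselves ("the", "to", ...) are taken at face value, with their own ambiguity unresolved.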
This brings me to your proposal of "incompleteness". Similar to
Schuetze's work, you are suggesting treating word (or other linguistic
phenomenon) tokens independently, and attempting to create clusters in
an ad hoc manner. I'm not too sure of the details, so I'll ask:
(1) what does this have to do with syntax, in terms of arrangements of
symbols? A long-distance dependency like that addressed by yourself
and Mike Maxwell in earlier messages demonstrates a syntactic relation
between words that might be explained by a number of different
theories.
Even simple models (like which word occurs N words to the right/left
of the word being "clustered") have real problems with data-sparsity
(or poverty of the stimulus), and more complex models of syntax just
exacerbate the problem. If you are simply saying "co-occurrence", as
in occurs in the same context (be that a phrase, sentence, document,
etc.), you are not saying anything about the arrangement of symbols,
merely that they co-occur. (I may have misinterpreted "co-occurrence",
but I cannot see how a long-distance relationship would be modelled
otherwise because you have not stated what counts as context, other
than being in the corpus.) At no point have you suggested that (non-
lexical) arrangements be learned or used, so I just wanted
clarification.
(2) If a token is treated as incomplete, under the assumption that it
might represent a different sense (or syntactic class, or "meaning",
or whatever the term) to other instances of the type, is the same
assumption extended to the token's context? I would assume that you
have a circular definition here, unless at some point you bite the
bullet and assume some tokens are unambiguously defined. If so, which
bullets do you bite, and under which circumstances?
Cheers,
D
Hmm, not sure if it's best to post a reply to this thread, or start a
new one. As you can see, I decided to do the former.
Back when this discussion was on CorporaList, I (Mike Maxwell) had
written:
> One advantage of splitting apart syntax and semantics ... is
> that you begin to see how you can understand long distance
> dependencies of meaning, for example
> The support that John brought was helpful to making our argument.
> 'support' means one thing; whereas in the following sentence
> The support that John brought was helpful for bracing the building.
> it means something quite different. Syntax allows you to relate the
> choice of sense in each case to words at an arbitrary distance away,
> in a way that collocation *by itself* cannot do.
--to which Rob Freeman replied:
> You can model long distance dependencies using raw word associations
> (collocations), Mike. I did an implementation of this. All you need to
> do is carry the (collocation) features through until they are needed to
> make a selection.
We're both right, and I think this very nicely brings out the
difference between science and engineering.
Namely, I'm prepared to believe that Rob's solution will,
statistically speaking, come up with the right answer most of the
time; and that's all you can expect from an engineering solution.
(Strictly speaking, that's all you can expect from people, too, which
is what inter-annotator agreement is about. But I digress.)
But it is not hard to come up with examples where this doesn't work--
and by "doesn't work", I mean that given such examples, people will
come up with a different answer than what the engineering approach
will come up with, and do so with some consistency (reasonably high
inter-annotator agreement).
For example:
Code:
They supported the building's use as a church on weekends.
The sense of 'support' here is clearly that of agreeing with,
providing moral support, etc. But the nearest noun to the verb
'support' here is 'building'. So the engineering approach would select
the physical sense of support (as would be appropriate for "They
supported the office building").
And contrariwise:
Code:
They supported our organization's building.
the sense that people would pick out is that of physically supporting,
even though the nearest noun is 'organization', which having an
abstract sense would require some other sense of 'support' (as would
be appropriate for "They supported the organization of workers into a
union.")
Now these are clearly contrived sentences--I haven't pulled them out
of a corpus. So a person who believed the only true evidence was
corpus evidence might not be convinced. But since we're talking about
people's interpretation of language, I'm not sure what would convince
such a skeptic--since corpora don't come labeled with senses, and we
have to use people's interpretation of the meaning, regardless of
whether the sentence comes from a corpus or from my head.
There are still issues of directionality, sparse data, how one deals
with ambiguity in the context, and how strong the association has to
be--some words are much weaker indicators of the meaning than others.
But apart maybe from the question of directionality, these are
engineering issues, and there may be engineering solutions that work
most of the time.
In sum, I don't doubt that the raw word association approach will work
most of the time. But there are cases (rare though they may be in
corpora) where it gives significantly different answers from what
people would give, particularly in cases where the syntax and
morphology are at odds with the "nearest neighbor approach." So I call
the raw word association method engineering, and the grammar method
science. That isn't to disparage engineering; you'd be hard pressed to
get me on board a plane that flapped its wings. But 747s aren't
science, and neither is raw word association.
BTW, I suspect someone will bring up the fact that if grammar were
science, as I am claiming, then we would have 100% inter-annotator
agreement on word sense tagging. That obviously doesn't happen, but I
think there are other reasons for that. Go ahead, make my day :-)
Mike Maxwell
Hi Yorick,
I hope you make it over to this new forum.
> by YORICK WILKS ...
> I think you may misunderstand what machine learning can do---though of
> course it all depends on what you mean by learning/generalizing from
> the "same context" Modesty normally forbids me citing, say,http://citeseer.ist.psu.edu/stevenson01interaction.html, where
> Stevenson and I combined learners for word-sense disambiguation over
> quite a large corpus (The Red Badge of Courage), and one way of
> interpreting what the learner was doing (and it is something some
> would find distasteful) is that it was learning, for each sense of
> each content word, what were the contexts and criteria that would
> disambiguate it-----there are other bits of contemporary and later
> work that could also be described that way (and this was not at all
> simple unsupervised learning, either).
The term "machine learning" covers a lot of ground. In this case,
though, I'm not sure if you are saying I am under appreciating or over
appreciating what it can do. Are you saying that in this case machine
learning was in fact learning a different representation for each
context, as I suggest it should?
So are you saying I am under appreciating what machine learning can
do, and does do, already?
-Rob
Mike Maxwell wrote:
> Back when this discussion was on CorporaList, I (Mike Maxwell) had
> written:
>
> > One advantage of splitting apart syntax and semantics ... is
> > that you begin to see how you can understand long distance
> > dependencies of meaning, for example
> > The support that John brought was helpful to making our argument.
> > 'support' means one thing; whereas in the following sentence
> > The support that John brought was helpful for bracing the building.
> > it means something quite different. Syntax allows you to relate the
> > choice of sense in each case to words at an arbitrary distance away,
> > in a way that collocation *by itself* cannot do.
>
> --to which Rob Freeman replied:
>
> > You can model long distance dependencies using raw word associations
> > (collocations), Mike. I did an implementation of this. All you need to
> > do is carry the (collocation) features through until they are needed to
> > make a selection.
>
> We're both right, and I think this very nicely brings out the
> difference between science and engineering.
>
> ...
>
> But it is not hard to come up with examples where this doesn't work--
Mike, you've invented something called the "engineering approach",
attributed it to me, and then proceeded to argue that it does not work.
But you don't know what it is I am suggesting.
I said the method I use works much like feature unification in HPSG.
Is generative grammar implemented in HPSG then "engineering" or
"science"?
The bottom line is you can't argue something doesn't work without
knowing what it is.
I know you will dismiss what I suggest out of hand. But you are
getting ahead of yourself. Can you first look at what I am proposing
and _then_ dismiss it out of hand, thanks!
-Rob
It is by no means certain I completely understand the nuances of your
questions, so let me tell you what I have done and we can try to work
from there.
Distributional analysis is a good starting point.
Take just the immediate context. The word immediately preceding or
following will do. Traditionally you would form a vector consisting of
these immediately preceding or following contexts for each word in
your corpus, and then try to form classes based on such contexts. I
don't recall if that is what Hinrich Schuetze or Steve Finch did in
their early work, but I think it is. Schuetze's is the first I came
across (e.g. Schuetze "Dimensions of Meaning" http://citeseer.ist.psu.edu/23424.html).
So the way this works is that if two words have a couple of contexts
in common, say they both have the preceding contexts "a" and "the",
you are well on your way to guessing you're looking at words belonging
to the same syntactic class. In this case you might call that class
"noun".
Note it is one or other set of contexts within all the contexts of two
words which determine whether they are in the same class.
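The procedure described above can be sketched in a few lines of Python (a minimal illustration over a toy corpus, not any particular published system):

```python
from collections import defaultdict

def preceding_contexts(tokens):
    """Map each word to the set of words observed immediately before it."""
    contexts = defaultdict(set)
    for prev, word in zip(tokens, tokens[1:]):
        contexts[word].add(prev)
    return contexts

tokens = "the cat saw a dog and the dog saw a cat".split()
ctx = preceding_contexts(tokens)

# "cat" and "dog" share the preceding contexts {"the", "a"}; that
# shared set is the evidence for putting them in one class ("noun").
shared = ctx["cat"] & ctx["dog"]  # -> {"the", "a"}
```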
Now, the incompleteness argument is that there will be more than one
set like "a"and "the" which different words have in common. In general
word A might have set X in common with word B, but set Y in common
with word C. So which set is A in?
Well it is in both, but it can't be in both at the same time. That is
the problem. Either A is in a class with B and not C, or C and not B.
It can't be in both because the classes are different.
I say you should let context decide.
Remember, you just need to choose a set, X or Y. Context can do that.
Put two words together in sequence, the word you want to "classify"
and the context you want to "classify" it in. You can let the contexts
of one select the contexts of another according to combinations which
occur together in sequence.
Basically, instead of matching contexts directly to find a set two
words have in common (as you do when you cluster classes normally),
now you use one word to select a set in another, according to which
contexts of each occur together in sequence.
You just need to choose a set, and the observed sequences can do that.
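One crude reading of this selection mechanism, with invented toy data (the context names and observed bigram pairs below are made up purely for illustration):

```python
# Word A's contexts are the union of two sets, X = {"x1", "x2"} and
# Y = {"y1", "y2"}; neighbours B and C each have contexts of their own.
contexts = {
    "A": {"x1", "x2", "y1", "y2"},
    "B": {"bx"},
    "C": {"cy"},
}

# Context pairs observed together in sequence in the (toy) corpus.
bigram_contexts = {("x1", "bx"), ("x2", "bx"), ("y1", "cy"), ("y2", "cy")}

def select_set(word, neighbour):
    """Keep only those contexts of `word` that have been observed in
    sequence with some context of `neighbour`: the word it combines
    with picks out the relevant subset, ad hoc."""
    return {c for c in contexts[word]
            if any((c, d) in bigram_contexts for d in contexts[neighbour])}

# In the sequence "A B", B selects set X; in "A C", C selects set Y.
select_set("A", "B")  # -> {"x1", "x2"}
select_set("A", "C")  # -> {"y1", "y2"}
```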
Does that make sense?
-Rob
> It is by no means certain I completely understand the nuances of your
> questions, so let me tell you what I have done and we can try to work
> from there.
>
> Distributional analysis is a good starting point.
>
> Take just the immediate context. The word immediately preceding or
> following will do. Traditionally you would form a vector consisting of
> these immediately preceding or following contexts for each word in
> your corpus, and then try to form classes based on such contexts. I
> don't recall if that is what Hinrich Schuetze or Steve Finch did in
> their early work, but I think it is. Schuetze's is the first I came
> across (e.g. Schuetze "Dimensions of Meaning" http://citeseer.ist.psu.edu/23424.html).
Yes, this sounds like what I was trying to describe.
> So the way this works is that if two words have a couple of contexts
> in common, say they both have the preceding contexts "a" and "the",
> you are well on your way to guessing you're looking at words belonging
> to the same syntactic class. In this case you might call that class
> "noun".
>
> Note it is one or other set of contexts within all the contexts of two
> words which determine whether they are in the same class.
Agreed, this is the intuition. One looks for some substantial
similarity between the words, usually in terms of the size of the
intersection of these sets.
> Now, the incompleteness argument is that there will be more than one
> set like "a"and "the" which different words have in common. In general
> word A might have set X in common with word B, but set Y in common
> with word C. So which set is A in?
>
> Well it is in both, but it can't be in both at the same time. That is
> the problem. Either A is in a class with B and not C, or C and not B.
> It can't be in both because the classes are different.
Right, this is where we're getting a little fuzzy. I agree that if you
assume a word *type* to be ambiguous, you are necessarily suggesting
that the set of contexts is not a single set, but rather it should be
viewed as a set that is partitioned, with one partition for each
"sense" of a word. This is what Schuetze's later work attempted: it
started with the assumption that tokens of a type are similar, but may
constitute quite different senses dependent on context. I wouldn't
argue against his intuition or methodology, nor against your
description above.
This view is a "mixture model": we view the complete set of contexts
for a type (A is the type, and its complete set of contexts will at
least cover the union of X and Y) as a mixture of subsets, where the
correct mixture should find the sets X and Y exactly. However, in this
case it sounds easier than it is...
> I say you should let context decide.
>
> Remember, you just need to choose a set, X or Y. Context can do that.
> Put two words together in sequence, the word you want to "classify"
> and the context you want to "classify" it in. You can let the contexts
> of one select the contexts of another according to combinations which
> occur together in sequence.
Agreed again. This is the equivalent of using an established (trained)
classifier. If the observed contextual feature ("the context you want
to 'classify' it in") has some strong association with one
classification (A-X or A-Y) then the classifier will work. (I
mentioned data-sparsity in my earlier post, and this is where it kicks
in: not all features have strong associations; many feature-values are
infrequent; etc.)
> Basically, instead of matching contexts directly to find a set two
> words have in common (as you do when you cluster classes normally),
> now you use one word to select a set in another, according to which
> contexts of each occur together in sequence.
Right. I think this is precisely the intuition behind the works of
Schuetze and Clark I mentioned earlier. Their implementations go some
way towards what you are saying. Schuetze's work, for example, gives
equal weight to the relation of a token to its type, and the relation
to its context - he applies distributional analysis to each token,
considering vectors of type-contexts and context-contexts.
> Does that make sense?
It does, but the critical issue I think needs resolving is the
presence of ambiguity in both words and their contexts.
Hopefully I can clarify by example. Let's say we have a fragment:
"... Alfonso's strong desire to prospect in Bradford's land ...
" (nonsense, I know, though I'm sure I could find a corpus sentence
with equivalent properties of having an ambiguous function word
preceded/followed by ambiguous content words. Perhaps this is the
worst kind of introspection - made with faith that I could find an
example in a corpus ;) )
I'm particularly interested in "desire to prospect" here:
"to" can be either an infinitive marker (in this case) or a
preposition, and we need to know which sense we have.
"prospect" and "desire" may be too infrequent to give adequate support
for a cluster.
So let's go after the infrequent words first: "prospect". We might
want to know if it is a verb or a noun (or at least one of a number of
senses), but in order to do so we need to look at the context.
However, "to" is also ambiguous, so we now also have to decide upon a
sense for that!
Clearly, we would be unwise to make an assumption that the word is
unambiguous, since this is not the case here; equally if we assume
ambiguity we have a real problem, because we don't know which sense we
should use, and this notion extends to the contexts - they are also
subject to ambiguity, and *this* is, I think, the critical point of
both of our arguments.
The approaches of Clark and Schuetze are admirable in that they
approach ambiguity, but they are still forced to assume, at some
point, that words or contexts are unambiguous. (This is a great
improvement on making the assumptions throughout, as in earlier work
on unsupervised clustering.)
I think the best machine learning can hope for is that we can have
some degree of faith that the context belongs to some class or other,
but to keep the options open to allow for competing explanations.
Probabilistic parsing runs along these lines (reranking, etc.), but
sadly it is usually treated as an entirely separate step to word
clustering. (On an optimistic note, it is an aim of the UDOP
approach.) This is why I'm interested in your suggestion of
"incompleteness", because if it presents a workable solution to the
above ambiguity it would be very useful indeed.
Cheers,
D
I think from your description you are merely pointing out a flaw in
distributional modes of analysis -- completeness (in the logical
sense) is something utterly different.
Your examples above are surely just saying that observing two strings
in the same context does not imply they are syntactically congruent.
This limitation of the distributional approach was discussed in
Chomsky's LSLT a long time ago.
robf wrote:
> Well it is in both, but it can't be in both at the same time. That is the problem. Either A
> is in a class with B and not C, or C and not B. It can't be in both because the classes
> are different.
You are assuming that the classes are disjoint -- which is false,
since we know that words are lexically ambiguous. So 'can' could be a
noun or a modal auxiliary. Clearly the syntactic classes must overlap.
If you wish them not to overlap, then you have to expand out the
ambiguities by having a richer set of classes (a power-set
construction like for determinisation of FSAs).
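A toy illustration of that power-set style expansion (the lexicon is invented for the example):

```python
# Invented toy lexicon: each word maps to its (possibly overlapping)
# syntactic classes; 'can' is both a noun and a modal auxiliary.
lexicon = {
    "can":  {"noun", "modal"},
    "dog":  {"noun"},
    "will": {"noun", "modal"},
    "must": {"modal"},
}

# Power-set style construction (as in FSA determinisation): each
# distinct *set* of classes becomes a disjoint class of its own.
disjoint = {word: frozenset(classes) for word, classes in lexicon.items()}

# 'can' and 'will' now share the single class {noun, modal}, which is
# distinct from both {noun} and {modal}.
```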
I've avoided this discussion in the main forum, but now I feel able to
roll my sleeves up and put the gloves on...
As background, I'm an engineer which means I care more that things
work than knowing why they work (though the former generally requires
the latter).
For me there have been a number of points made.
1. Syntactical analysis (parsing) does not work well if considered in
isolation from lexis, semantics and pragmatics.
2. Language and lexis change over subject, style, person and time. Any
non-adaptive language analysis process will fail (perform poorly) over
time or in different environments (this is particularly close to my
heart, though mainly in respect of classification).
3. It is possible to extract patterns / grammar from unstructured text
through various machine learning processes. This is a well-researched
process for lexis, though 'grammar' induction is in its infancy.
For me, the problem with this approach is that we choose to ignore
pragmatics. The human approach to language appears to be to have a
message in one's head that one intends to give, and a matching
expectation of a message in the receiver's head that they expect to
receive. If the message arriving is significantly different from what
the receiver expects or has background to understand (which is broadly
the same thing), then communication stalls or fails.
In short, I think the alternative meanings are selected by the general
subject matter (social organisation vs architecture) in the
communicators' heads, and it will NEVER be possible to disambiguate
them without (somehow) taking that into account.
The question then arises as to the extent that the context can be
determined by the surrounding text (quite possible in the case of the
example of support for the Church) and the extent to which 'world
knowledge' is required.
For example, despite being an intelligent (well, I think so) and
reasonably well-read (in computational linguistics and related areas)
person, I frequently come across papers which are totally
incomprehensible at first. Comprehension requires me to acquire
something of the writer's mental processes through a combination of
external background reading (often Wikipedia these days) and an
assimilation of the style of the writer, usually by repeated reading
of the paper, and so on.
This is arguably an extreme example (and one devoid of corpus proof
;-) ), but publications in any niche area can be a struggle to read
until you gain the flavour of it.
So I think that the exercises to induce an adaptive grammar (which is
in effect what I think Rob is doing, though I'm not entirely sure he
would agree) from unstructured text are worthy and valuable. However,
I don't think the problem of language analysis will be solved until we
recognise that pragmatics, semantics, syntax and lexis are not
distinct.
by ALEXC
I think from your description you are merely pointing out a flaw in
distributional modes of analysis -- completeness (in the logical
sense) is something utterly different.
Your examples above are surely just saying that observing two strings
in the same context does not imply they are syntactically congruent.
This limitation of the distributional approach was discussed in
Chomsky's LSLT a long time ago.
> > Your examples above are surely just saying that observing two strings
> > in the same context does not imply they are syntactically congruent.
> > This limitation of the distributional approach was discussed in
> > Chomsky's LSLT a long time ago.
>
> What was Chomsky's discussion?
Simply that we have pairs of strings like "I am thin" and "I am an
Englishman", but "thin" and "an Englishman" are not syntactically
congruent (not his example).
> We must, that is, grasp some set of attributes that all games and only
> games have in common. Wittgenstein, however, concluded that, given the way
> we use language and the sort of world to which we apply it, there need be no
> such set of characteristics. Though a discussion of _some_ of the attributes
> shared by a _number_ of games or chairs or leaves often helps us learn how
> to employ the corresponding term, there is no set of characteristics that is
> simultaneously applicable to all members of the class and to them alone."
Right, but I think the field of machine learning appreciates this.
First, unsupervised clusterings of words are not always made on a
categorical basis (contrast hard- and soft-clustering models, the latter
of which allow you to say "with probability P1, word w1 belongs to class
C1; with probability P2 w1 belongs to C2, etc."), which allows for
uncertainty in class assignments. I don't know how this relates to your
notion of incompleteness.
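The hard/soft contrast can be illustrated with a toy sketch (invented co-occurrence counts and hand-picked centroids; a real soft-clustering model would of course learn these, e.g. by EM):

```python
from collections import Counter

# Toy word-context co-occurrence counts (made-up data for illustration).
contexts = {
    "bank":  Counter({"river": 3, "money": 5, "deposit": 4}),
    "shore": Counter({"river": 6, "sea": 4}),
    "loan":  Counter({"money": 7, "deposit": 2}),
}

# Two hand-picked "class centroids", also expressed as context counts.
centroids = {
    "GEO":     Counter({"river": 1, "sea": 1}),
    "FINANCE": Counter({"money": 1, "deposit": 1}),
}

def similarity(a, b):
    """Dot product of two sparse count vectors."""
    return sum(a[k] * b[k] for k in a)

def soft_assign(word):
    """P(class | word), by normalising similarity to each centroid."""
    sims = {c: similarity(contexts[word], cen) for c, cen in centroids.items()}
    total = sum(sims.values()) or 1
    return {c: s / total for c, s in sims.items()}

probs = soft_assign("bank")
# "bank" shares contexts with both classes, so its membership is split
# across them rather than forced into a single hard category.
```

A hard-clusterer would have to pick one class for "bank"; the soft version keeps both memberships, weighted.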
Second, soft-clustering is only possible because we assume that there
is no set of characteristics that determine precisely which
classification is right (a soft-clustering under such circumstances
would yield a hard-clustering, since there would be no overlap or
similarity between instances, only a clear division). Of course it helps
if your data has a nice split of characteristics that map exactly on to
classes, but this is rarely the case in interesting machine learning
problems. Where this is not the case, an approximate "rule" (however it
is embodied) is formed, considering more or less useful characteristics,
and this is what machine learning seeks.
The contextual feature-sets analysed by any technique are usually
constrained (in terms of how you specify the underlying model) to make
the problem tractable, so they only capture useful trends in the data
that are consistent with these constraints. This is why I earlier asked
you if you wanted to define your model a priori (and I said this was
equivalent to defining a syntax) or to learn your model somehow.
Finally, I think there is also an acknowledgement (by the fact that ML
techniques are usually scored on performance out of 100%, rather than
simply saying "it does/doesn't work") that machine learning only
attempts to account for /some/ of the regularities.
A framework like Memory-Based Learning (or k-Nearest Neighbour) will
account for more because it retains specific instances when
generalisations do not hold -- i.e. if we can't interpret according to
existing models, keep the instance in the hope we can form a model
later, or simply use it as an exception-case. Although this is similar
in intention to your suggestions, I don't think this is what you are
arguing for, but I might be wrong.
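For readers unfamiliar with Memory-Based Learning, a minimal 1-nearest-neighbour sketch (invented features, labels and distance measure) shows the instance-retention property: exceptions survive because nothing is ever generalised away.

```python
def overlap(a, b):
    """Similarity of two feature tuples: count of matching positions."""
    return sum(1 for x, y in zip(a, b) if x == y)

class MemoryBasedLearner:
    def __init__(self):
        self.memory = []  # every training instance is retained verbatim

    def train(self, features, label):
        self.memory.append((features, label))

    def classify(self, features):
        # The most similar stored instance wins; exception cases are kept
        # alongside regularities and can outvote them on exact matches.
        best = max(self.memory, key=lambda inst: overlap(inst[0], features))
        return best[1]

mbl = MemoryBasedLearner()
mbl.train(("det", "noun"), "NP")
mbl.train(("noun", "verb"), "S")
mbl.train(("det", "verb"), "EXCEPTION")   # an exception, stored as-is
```

Real MBL systems (e.g. TiMBL) use weighted feature metrics and k > 1, but the retain-everything principle is the same.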
D
> I think it says incompleteness explains why statistical language models
> appear to be random.
They may be random, but the aim of machine learning (at least the
generative modelling stuff) is to impose structure on that randomness in
useful ways.
> So you blur your model to take into account the fact it can't handle all
> your information.
>
> Once again, this under specifies the system and makes it appear random.
I wouldn't say "to take into account", I would say "as acknowledgement
of". It allows words to belong simultaneously to two-or-more classes,
where it has similarity with each. The most similar in any context wins,
presumably, but how you define "similarity" and "context" is a
limitation of your model, and cannot possibly account for all contextual
information. But then, I don't believe (yet!) that your suggestion will
either.
Soft-clustering allows uncertainty, but is certainly not random, as above.
> So you ignore features which don't fit your model?
Yes! I don't think it is a good thing either, but determining relevant
features is tantamount to proposing a syntax, which is why I think there
is a good deal of circularity in the approach. Simply storing instances
does not get around this, because you still need to specify *how you
then analyse them*, which again is a syntactic decision.
> If you can do that and still make useful predictions, I guess it is OK. But
> in fact our language models all turn out random.
Again, I don't think "random" quite describes it. Certainly, they
involve randomness, but there is a world of difference between uniform
distributions and other distributions. Depending on how well your model
fits your problem, "random" (selection within model parameters) can be
very good indeed!
> And such lowered expectations are a good thing?
No, but in the absence of a perfect model, the best approximation wins!
Furthermore, they lead to deepening insight and scientific progress
(because you can analyse and understand their flaws), which are good things.
> I like MBL and kNN. As you say they deal with more complexity by postponing
> generalization. The problem with MBL and kNN is that once again they assume
> a grammar. In this case the grammar they assume is their solution set.
I don't think that's necessarily true. You may have to start with a
model to use kNN, but what you do with exceptions (simply retain, or
reanalyse in an attempt to form another model/grammar that accounts for
exceptions) is a matter of choice, surely? My original interpretation of
your suggestion of ad hoc grammar runs along these lines:
[1] assume a simple model, and learn as best you can;
[2] find all things that don't fit the model;
[3] analyse these outliers and propose some new, more complex model
machinery to deal with them;
[4] go to [1] until convergence.
This is *incredibly* naive in that it is hard to define how you might
actually do any of the above (well, except [4]), but that was my
interpretation and caused my link to Subsumption Architecture. As you
can see, it is quite different to your suggestion because it
necessitates a commitment to model syntax at each stage, but I thought
"ad hoc" might mean "form a new model to fit the data", so you can
(hopefully) see the connection.
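The [1]-[4] loop can be caricatured in a few runnable lines (the corpus, the "model" and the step-[3] machinery are all invented; real proposals for step [3] are of course far harder than lowering a frequency threshold):

```python
from collections import Counter

def learn(bigrams, threshold):
    """Naive 'model': the set of adjacent bigrams seen often enough."""
    counts = Counter(bigrams)
    return {bg for bg, c in counts.items() if c >= threshold}

def refine(bigrams, threshold=2):
    """[1] learn; [2] find outliers; [3] extend the model to cover them;
    [4] repeat until nothing is left unexplained."""
    model = learn(bigrams, threshold)
    while True:
        outliers = [bg for bg in bigrams if bg not in model]
        if not outliers:
            return model
        # "Propose new machinery": here, just admit outliers individually.
        model |= learn(outliers, 1)

bigrams = [("the", "cat"), ("the", "cat"), ("cat", "sat"), ("on", "mat")]
model = refine(bigrams)
```

The point of the sketch is only the control flow: commitment to a model at each pass, with outliers driving the next pass.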
> MBL and kNN both work as classifiers. If you don't have a complete set of
> classes (and cannot get a complete set, can never have a complete set) they
> can never work (beyond the randomness inherent in their solution set.)
Right, so it isn't a great model for unsupervised learning...
> So, in general, I agree machine learning handles these kinds of things now,
> in a way, by accepting randomness. I'm just suggesting that the randomness
> comes from incompleteness in the underlying model. And looking at
> incompleteness in the underlying model might be a better way to handle it.
I agree with what you say here, which is why I am interested in /how/
you incorporate incompleteness, and what makes it better in practice.
D
This view is a "mixture model": we view the complete set of contexts
for a type (A is the type, and its complete set of contexts will at
least cover the union of X and Y) as a mixture of subsets, where the
correct mixture should find the sets X and Y exactly. However, in this
case it sounds easier than it is...
> I say you should let context decide.
>
> Remember, you just need to choose a set, X or Y. Context can do that.
> Put two words together in sequence, the word you want to "classify"
> and the context you want to "classify" it in. You can let the contexts
> of one select the contexts of another according to combinations which
> occur together in sequence.
Agreed again. This is the equivalent of using an established (trained)
classifier. If the observed contextual feature ("the context you want
to 'classify' it in") has some strong association with one
classification (A-X or A-Y) then the classifier will work. (I
mentioned data-sparsity in my earlier post, and this is where it kicks
in: not all features have strong associations; many feature-values are
infrequent; etc.)
... Schuetze's work, for example, gives
equal weight to the relation of a token to its type, and the relation
to its context - he applies distributional analysis to each token,
considering vectors of type-contexts and context-contexts.
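A minimal sketch of the second-order idea attributed to Schuetze above (the first-order vectors are invented; the real work builds them from large corpora and reduces them with SVD): a token's vector is the sum of the context vectors of its neighbours, so the same type gets different token vectors in different contexts.

```python
# Hypothetical first-order context vectors for a tiny vocabulary.
first_order = {
    "bank":  {"river": 1, "money": 1},
    "river": {"bank": 1, "water": 2},
    "money": {"bank": 1, "loan": 2},
}

def token_vector(sentence, i, window=1):
    """Second-order vector for the token at position i: sum the
    first-order vectors of the words within the window around it."""
    vec = {}
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    for j in range(lo, hi):
        if j == i:
            continue
        for dim, c in first_order.get(sentence[j], {}).items():
            vec[dim] = vec.get(dim, 0) + c
    return vec

v1 = token_vector(["river", "bank"], 1)   # "bank" next to "river"
v2 = token_vector(["money", "bank"], 1)   # "bank" next to "money"
# The same type "bank" gets different token vectors in the two contexts,
# which is what makes token-level (sense) discrimination possible.
```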
...the critical issue I think needs resolving is the
presence of ambiguity in both words and their contexts.
...
I think the best machine learning can hope for is that we can have
some degree of faith that the context belongs to some class or other,
but to keep the options open to allow for competing explanations.
Probabilistic parsing runs along these lines (reranking, etc.), but
sadly it is usually treated as an entirely separate step to word
clustering. (On an optimistic note, it is an aim of the UDOP
approach.) This is why I'm interested in your suggestion of
"incompleteness", because if it presents a workable solution to the
above ambiguity it would be very useful indeed.
Of course it is possible, /for any given corpus/, although it may be
unlikely. If you mean "sufficient for language not present in the
corpus" then that's a different matter, and one that no-one in machine
learning would claim to achieve.
Desirable in practical terms: if it works, we keep it.
> Has anyone looked into proving it is possible (or not)?
Don't know, I'm afraid.
> ... The only possible flaw I can see is that you are implementing a classifier.
> Why? What makes you think there is a complete set of classes to resolve to
> which will completely parametrize your problem?
I don't think there is necessarily a complete set of classes. I am
simply saying that is how these DA algorithms work, on my way towards
the "recursive ambiguity" problem (that you coin below).
>> ... Schuetze's work...
> He also tries to learn a complete set of classes this way does he?
Granted, it is a complete set of classes. Clark has a class of
unresolved entities in his model, though, if this appeals to the
"incompletist" in you. (Not much of a concession, since it is still a
class! But I'm sure some more complicated machinery could be brought in
to deal with the class members.)
> I can't see that UDOP will help. The entire content of DOP seems to me that
> it codes grammar in a tree-bank. There may be some subtle differences
> associated with that. But I can't see them. It is just grammar by another
> name.
I'll grant you that DOP has Tree Substitution Grammar at its heart,
which is clearly an a priori model. The form of UDOP is really
unimportant; it is the aims which are optimistic:
UDOP has been established to work with POS categories. The project
coordinators hope to:
[1] extend UDOP so it uses an unsupervised classification algorithm to
select POS categories (extending to words-only corpora)
[2] extend again so that syntactic constraints within the UDOP trees are
considered when making POS category assignments (thus allowing learned
syntactic constraints to interact with decisions that are traditionally
approached with only N-word windows of context features).
That was my note of optimism: feedback from non-adjacent, possibly
syntactically related words, into the word-classification process.
> I don't see a problem. I don't see why you need to assume something is
> unambiguous at some point so you can resolve recursive ambiguities.
Ok. Let's assume that all word tokens are completely independent of
their type, and that they are only associated with the type if there is
sufficient evidence, for which there is initially none. Our corpus can
then be rewritten by replacing each word with a unique symbol, to remove
any preconceptions about what the tokens mean. One such scheme might
just be to assign each token a positive integer starting from 1 - in
which case you have a sequence of numbers from 1 to N (size of corpus),
which is impossible to generalise over.
The alternative is to conflate tokens against their type, which makes an
assumption that leads towards mixtures. You then have to deal with the
problem of ambiguity. This pervades the tokens you look at, the tokens
you consider as contexts, etc. /That/ is why I think recursive ambiguity
is a problem.
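The thought-experiment above is easy to make concrete (toy corpus): unique relabelling leaves nothing that repeats, while conflation by type restores repetition, and with it the ambiguity problem.

```python
corpus = ["the", "cat", "sat", "on", "the", "mat"]

# Scheme 1: every token gets a fresh integer -- no symbol ever repeats,
# so there is nothing to generalise over.
unique_ids = list(range(1, len(corpus) + 1))

# Scheme 2: conflate tokens against their type -- repetition returns
# ("the" occurs twice as the same symbol), and so does ambiguity.
types = sorted(set(corpus))
by_type = [types.index(w) for w in corpus]
```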
> Why can't the two select each other uniquely? Why can't an entire sequence
> of words select a unique set of elements to express their combination, and
> that combination alone?
I'm not saying they cannot, but have you tried doing this on a large scale?
Can you give a simple example of a complete corpus, a procedure for
performing your combination, and the end result? I think it would really
help me understand. Your sentences in earlier messages/threads all
assume they are drawn from a corpus containing more information, where
identified noun-phrases are used as contexts without describing how you
arrive at their identification.
Have you seen the EMILE algorithm (Adriaans, 1992, "Language Learning
from a Categorial Perspective")? It may be relevant - discusses
"support" and "characteristic sets" in a distributional setting.
> Why does it have to be a question of assuming either the first word is a
> noun or the second, so you can resolve the sequence to be either NV or VN?
> Why shouldn't the case be that "this word" + "that word" _always_ means NV
> or VN, uniquely specified, no arguments, and no need to guess one first (and
> if they don't add some more context until they do.)
A probabilistic treatment that leaves open all possibilities does this,
and affords the power of a metric of preference between possibilities.
It's just very hard to achieve.
I'd also like to clarify my position that even the assumption of linear
order between pairs of words is of a syntactic nature, and vital to your
model. We are always committing to such syntactic models. In ML cases,
this commitment is where the model to be learnt is defined; in your case
the commitment is made when you analyse a word to find its meaning -
/where/ you look in the corpus *is* your definition of syntax. Are you
saying you make no such commitments?
Cheers,
D
On Sep 19, 11:43 am, "Rob Freeman" <gro...@chaoticlanguage.com> wrote:
> Ian,
> > For me, the problem with this approach is that we choose to ignore
> > pragmatics.
>
> I don't see pragmatics as a problem. John Sowa raised this point. As I said
> to him, it is one of the nice features of the ad-hoc generalization approach
> I'm suggesting, that a basis for meaning in context provides a natural path
> to extend the kind of context we regard as "meaningful" beyond text, as far
> as we like.
>
> And within text or without, there is no reason why we should exclude
> pragmatic information. What we mean by pragmatics might emerge naturally
> from patterns in contexts.
>
The point I was trying to make about pragmatics (or context) is that
some can most certainly be gleaned from the text immediately around
phrases. Some may be available from documents which are in some way
associated with the one under consideration. Some may be in text
somewhere, but not even google may have access to it and some, perhaps
really is tacit. It is purely a model inside our heads, perhaps
originating in another medium (vision, smell, touch) or simply (!)
being the digest of many years of living in the world.
In short I don't think we can ever achieve language processing purely
by bootstrapping text.
The problem I see with implementing a system which treats each
phrase / sentence in terms of the totality of all the other phrases in
your collection is simply the vastness of it. We need to have some
means of reducing the vastness of it to some kind of model. Even if
the model is just an index to the original data.
> In short, I think the alternative meanings are selected by the general
>
> > subject matter (social organisation vs architecture) in the
> > communicators heads and it will NEVER be possible to disambiguate that
> > without (somehow) taking that into account.
>
> It depends whether you think what is in their heads is intrinsically
> different from what comes out of their mouths. I suggest the one models the
> other.
There is clearly a relationship. Though tenuous sometimes in my
experience ;)... However, human communication is much richer than
speech so it seems to me that our internal models are richer than
speech. Unfortunately, here we run into philosophy (or perhaps
religion) as it is extremely hard to experiment with our internal
models!
>
> I think it is worth exploring as a model, anyway, right through to
> pragmatics. Why not? Analyse pragmatics in terms of patterns in contexts,
> social contexts, etc.
Broadly, I agree. In fact is that not what science does? Analyses
'things' in terms of patterns in contexts?
>
> Meaning is an organization of examples. If you don't have access to the same
> examples, you will not understand. When you say that doing some background
> reading on a subject enables you to understand, you are supporting a corpus
> model of meaning.
I find myself unable to believe in a one to one correspondence between
words (however processed) and meaning. Sorry!
On Sep 19, 11:13 am, "Rob Freeman" < gro...@chaoticlanguage.com> wrote:
>
> Two words can have contexts in common, and yet not have all contexts in
> common. This creates a paradox from the point of view of trying to form a
> complete grammatical description. Sometimes the words are the same, and
> sometimes they are different.
...
I think your use of the word paradox here is misleading: this is not
a logical issue. It's a problem for naive learning algorithms, not a
paradox.
This is just a fact about certain classes of languages -- and is quite
different to the issue of lexical ambiguity.
So given a sequence of words ABC, if you first select
combinations between AB or BC, and then select from among the results of that
combination with what was left ((AB)C) or (A(BC)), that naturally gives a best
order of combination for ABC. That most natural "order of
combination" turns out to look a lot like parse or constituent structure.
As part of that, halfway through combining all these groups of words into
partial combinations, you naturally arrive at a point where a group which looks
a lot like a noun-phrase, or other, is acting as a context to select something
else, as you mention above.
It happens, but all you have to do to make it happen is model selections between
immediately adjacent tokens, and iterate that.
As I say, it is all really down to the power of selections to find new sets.
That means you can always find a new set of contexts which uniquely specifies
the behaviour of each partial combination AB, BC, ((AB)C)... etc, and can
iterate the selection.
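For concreteness, one naive way to "iterate selections between immediately adjacent tokens" (a caricature for illustration, not necessarily Rob's procedure) is greedy pairwise merging: fuse the most strongly associated adjacent pair, here simply the most frequent, and the nesting that accumulates looks like constituent structure.

```python
from collections import Counter

def merge_step(sequences):
    """Find the most frequent adjacent pair across all sequences and
    fuse each occurrence into a nested tuple (a bracket)."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    if not pairs:
        return sequences, None
    best, _ = pairs.most_common(1)[0]
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append((seq[i], seq[i + 1]))   # nested tuple = bracket
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged, best

seqs = [["the", "cat", "sat"], ["the", "cat", "ran"]]
seqs, best = merge_step(seqs)
# ("the", "cat") is the most frequent adjacent pair, so both sequences
# become [("the", "cat"), verb] -- a bracketing like [[NP] V].
```

Iterating merge_step until nothing fuses yields a full bracketing; this is essentially the mechanism behind byte-pair-style segmenters, applied to words.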
[[Iain]] This seems to be very similar to the methods used to perform unsupervised grammar induction. Both Alignment Based Learning (van Zaanen) and ADIOS (Solan) have the idea of selecting partial combinations and iterating from there. Both claim some success at identifying remote connections. How does your approach differ?
Iain
On Sep 20, 3:23 am, "Rob Freeman" <gro...@chaoticlanguage.com> wrote:
The example I gave does not use lexical ambiguity -- that should be
argument enough!
I can look into it, but that's probably not going to be provided by
google groups ;)
> It turns out that power is enough to deal with any structural issues by
> itself. All you have to do is iterate selections between immediately
> adjacent tokens.
Right, so you're talking about a greedy learning framework, such as
those described by Iain (ABL-incr is the only instantiation of ABL that
behaves in this way, however). There have been many attempts to model
language learning in this way: ADIOS, ABL, EMILE, the work of J. Gerard
Wolff, Stolcke and Omohundro, and the more ad hoc work of Smith and
Witten or my own thesis. Alex Clark and Dan Klein have both discussed
the problems of such an approach.
> I'm not quite sure what you mean by "committing to ... syntactic models". My
> syntactic model is the corpus. If a combination of words occurs in the
> corpus, it is in the syntactic model. (If two words share a lot of contexts
> they are hypothesized to belong to one or other same syntactic class, and
> allowed to make predictions about each other.) Up to now I've only looked at
> immediate contexts. That is a commitment of sorts. I don't see why it
> shouldn't be extended though.
I'm not explaining this very well (evidently). I'm trying to say that by
introducing a constraint that says "initially we only consider relations
between adjacent words", you are introducing a syntactic constraint.
Granted, as in a greedy/iterative learning algorithm, it is /possible/
that *if* you make all the correct choices for adjacent words, you
*might* end up being able to consider relations between syntactically
related non-adjacent words. But the problem with greed is that your
success at a high level is entirely dependent on success at local
levels, and if you get it wrong locally then long-distance dependencies
will not be considered.
This is why it is a syntactic constraint: if your algorithm rules out
/any/ analyses at the beginning (like the relation between non-adjacent
words), it is making a syntactic commitment. That your algorithm has
only loose specification of syntax (adjacency), and that the overall
syntax is an emergent property, does not exempt it from my observation.
We can only see how well it works by comparing it empirically to other
models, though such comparison is a difficult problem in itself! (Again,
Alex Clark, among others, have discussed the problems of evaluation.)
To try to convince you that your commitment to adjacency is a commitment
to syntax, I'd like to consider English constructions where adjacency
doesn't properly account for the relationships. Cross-serial
dependencies in some sentences with "respectively" are an example:
[from Mark Davies' View/BNC interface:]
"71 per cent and 75 per cent in the grammatical and oral examinations
respectively"
Here the relationship is much more than can be captured with head-words
and adjacency, and in fact an adjacency model will not capture the
correct relationships. To correctly interpret the above sentence
requires indexing of the order of (conjunction) participants, and
duplication of the coupling relation ("in" in the above example).
(However you specify the syntax, it must at least capture these relations.)
Cross-serial constructions are a problem for adjacency (as the name
implies): if one of "71 per cent" or "75 per cent" is selected as a
head, then the other cannot be related to "in" or to its partner in the
complement. In other languages the situation will be different because
there may be a different incidence of such constructions.
Under my misinterpreted view of "ad hoc generalisation", I think that
structures like this (and like the lists that John Sowa mentioned way
back in the old thread) are treatable: you said yourself that
uninterpretable constructions may be reanalysed as a syntactic oddity,
which may then be reinforced by other instances of the pattern, until
eventually it becomes an ossified pattern.
Cheers,
D
--
David Brooks
Teaching Instructor
http://www.cs.bham.ac.uk/~djb
I didn't say anything about a complete set of classes here. Some of
these algorithms do attempt to find complete classes; some of them
explicitly do not (original EMILE and ADIOS); some of them attempt more
greedy merges of classes where enough similarity exists (practical
implementations of EMILE that reduce the constraint of "characteristic
support" to a similarity measure, and my work).
These algorithms primarily learn structure, and not necessarily
classifications.
In my approach, I wouldn't say that there is a predefined idea of
what/how many classes there should be -- they are just formed where
enough similarity is present. I'm not sure you'd accept this as "ad
hoc", though, because my choice of similarity threshold effectively
specifies the number of classes in advance, although not in such a
transparent way. The motivation for this was that there simply were not
enough instances of any observed combination to support further
learning. I think this is a problem for all such approaches, and I
merely made a choice in one particular direction, and your criticisms
are of course valid.
> Firstly let me say I am not stuck on immediate contexts.
In which case your openness to other syntactic relations removes the
syntactic constraint I described. You can see my point, and are aware of
my concern. (It's still hard to do the combinations without defining the
syntax, but I'll roll with it for now...)
> What I get is a representation which depends on both.
Ok, more reasonable. I've seen plenty of theories that don't even allow
this.
> How you would then tease that apart to get the "respective" associations,
> frankly, I have no idea. But I think the complexity I am talking about will
> help.
Agreed. Bolting on a more complex mechanism is a possibility, and the
details are not immediately important.
> Maybe. It would be the word "learning" which I would be afraid of in that
> summarization. I don't think it is possible to "learn" a complete set of
> classes to summarize all the information there is in the distributions.
I didn't say anything about a complete set of classes here. Some of
these algorithms do attempt to find complete classes; some of them
explicitly do not (original EMILE and ADIOS)...
These algorithms primarily learn structure, and not necessarily
classifications.
Rob Freeman wrote:
> On 9/20/07, David J Brooks <D.J.B...@cs.bham.ac.uk> wrote:
>>
>>> Maybe. It would be the word "learning" which I would be afraid of in
>> that
>>> summarization. I don't think it is possible to "learn" a complete set of
>>> classes to summarize all the information there is in the distributions.
>> I didn't say anything about a complete set of classes here. Some of
>> these algorithms do attempt to find complete classes; some of them
>> explicitly do not (original EMILE and ADIOS)...
>>
>> These algorithms primarily learn structure, and not necessarily
>> classifications.
>
>
> If they just try to structure text and not learn a set of classes to
> parametrize that structure, that could be good.
>
> Structure is also ultimately subjective, e.g. you get some branches where it
> does not matter which you choose as "head", and some will be sensitive to
> your exact corpus. But if you are letting the full complexity of selection
> between sets of contexts govern structure, then that is the same as me, and
> I think you have the complexity you need.
Just to clarify: in my work, I found that *without* the generalisation
afforded by fitting to (albeit ad hoc) classification, I couldn't
leverage larger units because there was so little contextual support for
potential units. I resorted to classification, and while I don't view
that as a bad thing per se, it does raise issues of the sort you are
trying to address. I think this is the problem, as Alex said, with naive
learning algorithms...
However, others (ADIOS and EMILE) are closer; certain instantiations of
ABL likewise.
I am quite keen to see a working implementation of your program now. Can
it be run on an Apache web-server on a local machine?
> It turns out that power is enough to deal with any structural issues by
> itself. All you have to do is iterate selections between immediately
> adjacent tokens.
Right, so you're talking about a greedy learning framework, such as
those described by Iain (ABL-incr is the only instantiation of ABL that
behaves in this way, however). There have been many attempts to model
language learning in this way: ADIOS, ABL, EMILE...
I'm happy to admit that I don't really understand how your algorithm is
intended to work.
> I'm having more and more problems with the equivalence you have drawn
> between my approach and ADIOS and EMILE, as well as with the general
> characterization as a "a greedy learning framework".
>
> Firstly, I don't think what I am describing is a "greedy" algorithm.
> "Greedy" algorithms involve making decisions early, right? I don't make
> decisions early at all.
So you don't make any commitments? So a phrase like "the man with the
stick" is not treated as a unit of some kind? Then how can it act as
syntactic context as if it were a unit?
Maybe I misunderstood - are you saying that this interpretation is just
*one of an enormous space of possibilities* that is considered in
parallel? Would you therefore also consider "the man with the", "the man
with", "the man * the stick" etc., all at the same time? If so, I don't
see how it is possible to cover all the possible relations.
Perhaps I'm not clear what "iterate" means in this context. I took it to
mean "iteratively committing to current best interpretations", hence I
used the term "greedy".
I interpreted greed because the explosion of alternatives seems so
massive that I assumed you'd have to do something to make processing
tractable. You seem to be saying that it is this enormous space of
possibilities that defines your approach. Coming from an AI background I
can't really see how (especially in 200MB of RAM!), but that's why I
want to learn more about it.
Can you send me (or link off this group) some operational code for your
system so I can try it out on my local server? (And some instructions
for use.)
D
...
I interpreted greed because the explosion of alternatives seems so
massive that I assumed you'd have to do something to make processing
tractable. You seem to be saying that it is this enormous space of
possibilities that defines your approach. Coming from an AI background I
can't really see how (especially in 200MB of RAM!), but that's why I
want to learn more about it.
[[Iain]] I'm on David's side here, I can't see how you can perform what amounts to an exhaustive search as the historic data increases. I assume the 200MB is just for your indexes.
I also presume that you DO have some kind of index structure - an exhaustive search over all priors is just never going to work in real life!
Can you provide more information on your algorithm?
Iain
The 200MB is just for the indexes, yes.
The search space is big too, yes, because the potential number of
parses goes up fast as sentence length increases. There are some
optimizations you can do which keep the time polynomial, but if you
search over all of them in serial, it does take time.
The solution would be to do searches in parallel.
> Can you provide more information on your algorithm?
It is not so hard. If Google can do ad-hoc searches across the whole
Web, I can do ad-hoc searches across 40 million words.
I start with words indexed on lists of contexts. Then I have contexts
in these lists select each other according to the words in the search
query.
This is very much like what happens in indexed search now.
A trick that is then unique to me is that I take each observed
combination between contexts in these lists, and expand them out on
the contexts _they_ occur in. So I get a list of contexts representing
the combination of words in the search query, in the order they are
presented. This is also a candidate "constituent", and the list of
contexts is its representation.
That is something search engines don't currently do. Which means the
order of search keys does not currently affect results, not beyond
crude exact phrase matches.
In this system the order of search keys makes a big difference. From
the point of view of language modeling it is the whole game.
So you iteratively combine lists of contexts associated with search
keys in this way, and end up with a list representing the whole
presented sentence, in the order it was presented (and also according
to the order "constituents" were found--the parse structure.)
The size of the list gives you a measure for how well that sequence of
search keys fits combinations of contexts in your data-base.
The actual list looks a lot like it is collected on the meaning of the
search query. Just as search results do now. The only difference is
that the order of the search keys in the query has selected the
contexts too.
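As I read this description, a minimal sketch might look like the following. The corpus, the choice of one-word left/right contexts, and the scoring are my own simplifications of what's described above, not the actual system:

```python
from collections import defaultdict

# Toy sketch of a context index over a tiny corpus: words are indexed on
# the contexts they occur in, and an ordered combination of words gets its
# own context list, whose size measures how well that order fits the data.

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Index each word on the set of (left neighbour, right neighbour) contexts
# it was observed in.
contexts = defaultdict(set)
for i, w in enumerate(corpus):
    left = corpus[i - 1] if i > 0 else "<s>"
    right = corpus[i + 1] if i < len(corpus) - 1 else "</s>"
    contexts[w].add((left, right))

def combine(a, b):
    """Collect the contexts that the ordered pair 'a b' itself occurs in.
    A large result means the ordered combination fits the database well;
    an empty result means that order is unattested."""
    pair_contexts = set()
    for i in range(len(corpus) - 1):
        if corpus[i] == a and corpus[i + 1] == b:
            left = corpus[i - 1] if i > 0 else "<s>"
            right = corpus[i + 2] if i + 2 < len(corpus) else "</s>"
            pair_contexts.add((left, right))
    return pair_contexts

print(len(combine("sat", "on")))  # attested order: non-empty context list
print(len(combine("on", "sat")))  # reversed order: empty, so a poor fit
```

Iterating `combine` over successively larger units would then yield a context list for the whole sentence, with key order mattering at every step.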
-Rob
> It is a parallel search problem, so not difficult in theory. But if you
> can't do parallel search efficiently, yes, it is a problem.
Ok, I'm getting a better handle on this.
I've just googled you and found your website
(http://www.chaoticlanguage.com/), containing the quote:
"Essentially we do the same, with the twist that we replace the classes
with vectors or lists of examples."
I think this makes the concept much clearer for me (as do subsequent
descriptions about "infinite numbers of [virtual] rules at runtime").
The implementation you have is presumably one you've restricted only to
vectors of adjacent words (cf. my earlier discussion of syntactic
constraints). This is just so I understand this instantiation, not so I
can criticise the approach in general - I think you've allayed my earlier
concerns, and I just want to be clear before I start thinking harder
about it.
FYI: My comparison to ADIOS was made on the basis that this is what I
believe ADIOS does. I think the same pertains for EMILE *only* in the
case of "characteristic contexts" (defined by Pieter Adriaans).
I know EMILE has been extended to perform similarity-based
generalisations (in effect making it context-free, at least to some
degree), but I don't know if ADIOS has been extended similarly.
I'm not suggesting that they are doing the same thing necessarily; I
just wanted to clarify the basis for my comparison.
I don't wish to imply that I doubt the method, but the same can be said
for a lot of computationally hard problems, and this is why I (and maybe
also Iain) asked.
> It is not so hard. If Google can do ad-hoc searches across the whole
> Web, I can do ad-hoc searches across 40 million words.
But google doesn't index function words, which I think is an important
difference in terms of effectiveness (but not in terms of problem-scale,
where your observation is quite true).
Two corner-cases that you must be able to argue are:
[1] extremely frequent words. These offer little information about
co-occurrence - in terms of traditional measures like Mutual Information
that allow you to constrain the problem.
[2] hapax legomena (because their information is too limited to
generalise effectively).
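The two corner cases can be seen in a small pointwise-mutual-information calculation (the corpus here is invented for illustration): a very frequent word's association with any particular neighbour is diluted, while a hapax gets a high score from a single chance co-occurrence - far too little evidence to generalise from.

```python
import math
from collections import Counter

# Toy illustration of the two corner cases for association measures like
# (pointwise) mutual information, over an invented corpus:
# [1] a very frequent word ("the") co-occurs with almost everything, so its
#     association with any one neighbour is weak;
# [2] a hapax legomenon ("zyzzyva") gets a high score from one co-occurrence.

corpus = ("the cat sat on the mat the dog sat on the rug "
          "the cat ate the fish a zyzzyva bit the dog").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n = len(corpus)

def pmi(a, b):
    """Pointwise mutual information of the adjacent bigram (a, b)."""
    p_ab = bigrams[(a, b)] / (n - 1)
    p_a, p_b = unigrams[a] / n, unigrams[b] / n
    return math.log2(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")

# The frequent word's association with a real collocate is modest...
print(round(pmi("the", "cat"), 2))
# ...while the hapax tops it on the strength of a single observation.
print(round(pmi("zyzzyva", "bit"), 2))
```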
Google doesn't do too well with either, AFAIK. The fact that google
works is a product of relations between content words, which select each
other syntactically only as a by-product of common semantic associations.
I suspect that the effects of lexical ambiguity and "syntactic
incongruence despite contextual congruence" (an abstraction of Alex
Clark's description that is probably inaccurate; Menno van Zaanen
describes this as "dual-level", but I don't know what the correct term
is?) are particularly important in these corner cases.
I'm not saying you can't argue these corner-cases, and I'm sure you've
thought about this before and have an argument prepared. But perhaps I
should think more carefully about the workings before we enter into this
discussion.
> (i.e. it should assume B is in a class with D _only_ in the context of A_C.)
FYI: My comparison to ADIOS was made on the basis that this is what I
believe ADIOS does. I think the same pertains for EMILE *only* in the
case of "characteristic contexts" (defined by Pieter Adriaans).
Emile's approach is slightly different, but I am not 100% sure.
Google doesn't do too well with either, AFAIK. The fact that google
works is a product of relations between content words, which select each
other syntactically only as a by-product of common semantic associations.
I suspect that the effects of lexical ambiguity and "syntactic incongruence despite contextual congruence" ... are particularly important in these corner cases.
I'm not saying you can't argue these corner-cases, and I'm sure you've
thought about this before and have an argument prepared. But perhaps I
should think more carefully about the workings before we enter into this
discussion.
I think there is an important distinction: the phenomenon Alex
highlights is present *even if we managed to resolve all lexical
ambiguity*. (I know we're in a discussion about whether or not that is
necessary/possible, but I think it's especially important for someone
adopting a position like Rob's - if your approach copes with this where
others fail, then it is an advantage you should highlight.)
I guess that Rob conflates the issue because from a set-overlap
perspective, he's saying his system can handle:
A _B_ (term B observed in the context of A)
even where:
[1] instances of B may be syntactically incongruent (B is ambiguous)
[2] A may /occur with/ syntactically incongruent terms.
For [2] to occur, it is possible that A is simply ambiguous, and this
relates to my earlier discussion of contextual ambiguity; however,
ambiguity only accounts for a subset of these observations, not
including Alex's example.
As Alex also pointed out, this is a problem for naive learning
algorithms: if an algorithm that forms classifications successfully
discerns between syntactically incongruent instances of B (on the basis
of the surface form of B), then it is quite reasonable to say it
accounts for the problem. Data-sparsity in the surface form is one
reason why this may not be possible, and it is particularly relevant
when considering raw text (as opposed to POS-tagged texts).
In my view, the arguments distinguishing ambiguity from the phenomenon
Alex described are important as they define capacities of learning
algorithms, even though the approaches to resolving them might be similar.
For example, van Zaanen's thesis, Section 7.3 [p.101-102] suggests that
clustering is a viable approach to resolving the following case (from
ATIS corpus):
show me the _nonstop_ flights
show me the _morning_ flights
He argues that distributional analysis over the words in question would
lead us to think that they are in separate classes. My earlier point
about data-sparsity (especially hapax legomena) is relevant here: if
these are the only occurrences of the words, how do you determine that
they belong to different syntactic types? I can't see how a set-based
view such as that expounded by Rob would resolve this problem, unless
there is much more data about the words in question.
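The data-sparsity point is easy to demonstrate (this tiny two-sentence setup is my own reconstruction of the ATIS example, not van Zaanen's code): with one occurrence each, "nonstop" and "morning" share their entire one-element context set, so any set-overlap measure judges them interchangeable.

```python
# Toy illustration of the sparsity problem: with a single occurrence each,
# "nonstop" and "morning" have identical context sets, so a set-overlap
# view has no evidence for assigning them different syntactic types.

sentences = ["show me the nonstop flights",
             "show me the morning flights"]

contexts = {}
for s in sentences:
    words = s.split()
    for i, w in enumerate(words):
        left = words[i - 1] if i > 0 else "<s>"
        right = words[i + 1] if i < len(words) - 1 else "</s>"
        contexts.setdefault(w, set()).add((left, right))

# Both words were seen only in the context ("the", "flights"), so their
# context sets are indistinguishable.
print(contexts["nonstop"] == contexts["morning"])  # True
```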
D
can you briefly explain how your demo (the picture on your website)
arrives at constituent bracketings from sets?
cheers,
D