Discussion on corpus syntax and how we can use it to code meaning


D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 3:49:20 AM9/19/07
to Grammatical Incompleteness
by ERIC ATWELL

On Tue, 18 Sep 2007, Rob Freeman wrote:

Quote:
I want to summarize some of the more practical aspects of those
solutions.


thanks for this summary; now I won't have to re-read the whole thread
to remind myself what's been discussed :-)

Quote:
..., we might use the context about a word or phrase to select, ad-
hoc, a class of words or phrases which are similar to that word or
phrase (in that context). ... we can use these true/not true
distinctions to select both syntax, and meaning, specific to context,
in ways we have not been able to up to now.


This suggests that corpus linguists should be interested in clustering
or unsupervised machine learning of words into classes according to
shared contexts; but they have been investigating this for some time,
see e.g. papers in Proceedings of ICAME'86, EACL'87. The main
difference between then and now is compute power: we can now use more
sophisticated clustering algorithms, and cluster according to more
complex context patterns, e.g. Roberts et al. in Corpora, vol. 1, pp.
39-57, 2006.
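
(The basic recipe, for anyone who has not tried it, can be sketched in a few lines -- a toy corpus and the crudest possible context features, purely for illustration:)

Code:
from collections import Counter, defaultdict
from math import sqrt

corpus = "the cat sat on the mat the dog sat on the rug".split()  # placeholder corpus

# context features: the immediately preceding (L:) and following (R:) word
contexts = defaultdict(Counter)
for i, w in enumerate(corpus):
    if i > 0:
        contexts[w]["L:" + corpus[i - 1]] += 1
    if i < len(corpus) - 1:
        contexts[w]["R:" + corpus[i + 1]] += 1

def cosine(a, b):
    num = sum(a[f] * b[f] for f in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# words that share contexts ("cat"/"dog", "mat"/"rug") come out most similar,
# and a clustering algorithm would then group them into classes
for w1 in ("cat", "mat"):
    ranked = sorted(((cosine(contexts[w1], contexts[w2]), w2)
                     for w2 in contexts if w2 != w1), reverse=True)
    print(w1, "->", ranked[:3])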

But my impression is that most Corpus Linguists are not really that
interested in unsupervised Machine Learning, i.e. letting the computer
work out the grammar/semantics "from scratch"; they prefer to examine
and analyse the corpus data "by hand" to select examples to back up
their own theories...

Eric Atwell, Leeds University
WWW/email: google Eric Atwell

D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 3:54:08 AM9/19/07
to Grammatical Incompleteness
by ROB FREEMAN

On 9/18/07, Eric Atwell wrote:

> Quote:
> ..., we might use the context about a word or phrase to select, ad-
> hoc, a class of words or phrases which are similar to that word or
> phrase (in that context). ... we can use these true/not true
> distinctions to select both syntax, and meaning, specific to context,
> in ways we have not been able to up to now.
>
> This suggests that corpus linguists should be interested in clustering
> or unsupervised machine learning of words into classes according to
> shared contexts; but they have been investigating this for some time,
> see e.g. papers in Proceedings of ICAME'86, EACL'87. The main
> difference between then and now is compute power: we can now use more
> sophisticated clustering algorithms, and cluster according to more
> complex context patterns, e.g. Roberts et al in Corpora, vol. 1, pp.
> 39-57. 2006.

Yes, people have been clustering words into classes according to
shared contexts for some time.

The point here is the idea that they need to cluster them into a
different class for each context in which they occur.

It is the goals of machine learning which I am suggesting need to
change (viz. a complete grammar), not the methods.

I think computational linguistics will get good results as soon as it
stops looking for global generalizations and clusters ad-hoc,
according to context.

Quote:


But my impression is that most Corpus Linguists are not really that
interested in unsupervised Machine Learning, i.e. letting the computer
work out the grammar/semantics "from scratch"; they prefer to examine
and analyse the corpus data "by hand" to select examples to back up
their own theories...


Whether they are working "by hand" or not, people are not used to
thinking of syntax as ad-hoc generalization according to shared
contexts. I'm suggesting this idea needs to be taken out of machine
learning (where it has only been seen as a means to find "grammar"
anyway, and not a principle of syntax in its own right) and given a
broader airing as a principle of syntax on its own merits.

It might explain why MWEs tend to have the same "slot fillers", for
instance. Detailed analyses of what slot fillers can occur in a given
MWE could be done on the basis of what other contexts two words share
and do not share.

Corpus analysis currently tends to be done in terms of lexicon: what
units are repeated, and how often. Corpus-style syntactic analyses could
be done on the basis of what words share what contexts, and how well
this predicts the range of combinations they participate in, how MWEs
change over time, etc.

-Rob

D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 3:56:46 AM9/19/07
to Grammatical Incompleteness
by YORICK WILKS

Rob
(One feels nervous about continuing this now--but you are raising new
and important issues)

I think you may misunderstand what machine learning can do -- though of
course it all depends on what you mean by learning/generalizing from
the "same context". Modesty normally forbids me citing, say,
http://citeseer.ist.psu.edu/stevenson01interaction.html, where
Stevenson and I combined learners for word-sense disambiguation over
quite a large corpus (The Red Badge of Courage), and one way of
interpreting what the learner was doing (and it is something some
would find distasteful) is that it was learning, for each sense of each
content word, what were the contexts and criteria that would
disambiguate it -- there are other bits of contemporary and later
work that could also be described that way (and this was not at all
simple unsupervised learning, either).

Best
Yorick

PS On the on-going meta-issues, I fear the last paragraph of Eric
Atwell's message is very insightful as to what is really going on
here, under cloaks of "private fights", "abstract discussions",
"separate lists" etc.:

> But my impression is that most Corpus Linguists are not really that
> interested in unsupervised Machine Learning, i.e. letting the computer
> work out the grammar/semantics "from scratch"; they prefer to examine
> and analyse the corpus data "by hand" to select examples to back up
> their own theories...

I have a hunch most Corpus Linguists are not interested much in
computation in general, except as a secretarial/editing/retrieval
tool, but they have to pay lip service to it.

Paradoxically, I think, it is CL/NLP researchers who actually "trust
the text", in that they are experimenters who, by definition, don't know
what the results of computation/experiment will be. Many Corpus
Linguists, I suspect, and there are honourable exceptions, know
exactly where they are going and are as dependent on intuition and
judgement as the Chomskyans they still affect to criticize, for
reasons not altogether clear to me. I have an on-going
struggle with a distinguished lexicographer friend and colleague, who
uses sophisticated KWIC indices to display contexts of a word, which
he then classifies by intuition. Suggestions as to how this last stage
could be automated, and I have made many over the years, are never
well received, and I have stopped making them.

D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 3:58:06 AM9/19/07
to Grammatical Incompleteness
by DAVID BROOKS

Dear Yorick,

Yorick Wilks wrote, in PS:

> I have a hunch most Corpus Linguists are not interested much in
> computation in general, except as a secretarial/editing/retrieval
> tool, but they have to pay lip service to it.

Oh, I'm not so sure... I'm currently reading Linear Unit Grammar (by
John Sinclair and Anna Mauranen) and it seems to be more than paying
"lip service" to computational approaches: LUG reads as a theory of
grammar motivated by the success of Steven Abney's notion of
"chunking" (1991 "Parsing by chunks"), an influential approach to
parsing that was amenable to computers.

I'm sure I have not done justice to the subtleties of LUG, but I just
wanted to show that Corpus Linguists are open to the influence of
computational linguistics.

Regards,
David

D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 3:58:52 AM9/19/07
to Grammatical Incompleteness
by Khurshid Ahmad

Yorick

Modesty and nervousness initially stopped me replying to your thoughts
on 'machine learning', and unsupervised learning, from corpora. But
these are risk-averse times.

We had carried out a number of experiments on automatic categorisation
of texts, in a corpus, using Kohonen Feature Maps or Self Organising
Maps. The results were encouraging in that our systems learnt to
classify newspaper stories (Reuters) and articles in specialist
journals. The point where introspection/intuition came in was when we
had to choose a 'feature vector' to train our large SOM.
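
(For what it is worth, the sort of decision I mean looks like this in miniature -- Python, with placeholder documents; choosing the vocabulary cut-off and the weighting is exactly where intuition crept in:)

Code:
from collections import Counter

docs = ["shares fell as oil prices rose",
        "the striker scored twice in the final",
        "oil exports rose sharply last month"]        # placeholder documents

N = 8                                                 # vocabulary cut-off: an intuition-laden choice
tokens = [d.split() for d in docs]
vocab = [w for w, _ in Counter(w for t in tokens for w in t).most_common(N)]

def feature_vector(doc_tokens):
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocab]                 # raw term frequency; the weighting scheme is another choice

vectors = [feature_vector(t) for t in tokens]         # these vectors would then be fed to the SOM
for d, v in zip(docs, vectors):
    print(v, d)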

Pensiri Manomaisupat, Bogdan Vrusias & Khurshid Ahmad, "Categorization
of Large Text Collections: Feature selection for unsupervised and
supervised neural networks", 7th Int. Conf. on Intelligent Data
Engineering and Automated Learning (Lecture Notes in Computer Science,
LNCS 4224), Burgos, Spain, 20th-23rd September, edited by E. Corchado,
H. Yin, V. Botti & C. Fyfe, Springer-Verlag, 2006, pp. 1003-1013.

Khurshid Ahmad

Professor of Computer Science
Department of Computer Science
Trinity College,
DUBLIN-2
IRELAND
Phone 00 353 1 896 8429

Web Page: http://people.tcd.ie/kahmad

D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 4:01:58 AM9/19/07
to Grammatical Incompleteness
by RODOLFO DELMONTE

Yorick,

the impression one gets from reading this thread and the ones related
to it is that not just people in corpus linguistics - but perhaps also
a good share of CL/NLP - are convinced that the meaning of words can be
recovered by means of KWIC, bigrams, and other similar lexically
related measures.

My "hunch" is that the semantics of words can only be drawn from the
semantics of the propositions in which the words are contained: not
just a topographical notion, but a syntax-related notion if you like.
That is, syntax and semantics - in that sense, not just lexical - are
what determine meaning in the end.
The problem is that this is hard to approximate well with purely
statistically based methods. We all look forward to seeing a new
generation of CL/corpus linguists who will be able to couple
statistical methods, corpus measures and symbolic computation. After
all, I cannot give up the idea that it is our intuition as
theoreticians that must guide us...
Best
Rodolfo

P.S.: I know this has already been discussed between us, but the list
deserves more fuel...

D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 4:03:41 AM9/19/07
to Grammatical Incompleteness
by DAVID BROOKS

Dear Rob,

> Yes, people have been clustering words into classes according to
> shared contexts for some time.
>
> The point here is the idea that they need to cluster them into a
> different class for each context in which they occur.

To my mind, this would not be possible, because you will rarely (if
ever) see a word or phrase twice in exactly the same context. Of
course, you can weaken this to "similar" contexts, but how you do that
is an open research problem.

I'd like to bring Distributional Analysis (DA) back into this
discussion, because I think the important points for your theory to
address are highlighted here.

Unsupervised word clustering has been the subject of much research, as
Eric Atwell reports. However, you are right to highlight a problem
with "completeness", because much of the research on word clustering
using DA relies on an assumption that each word type is unambiguous.
Even that work which has sought to address ambiguity directly, such as
that of Hinrich Schuetze (1997, "Ambiguity Resolution in Language
Learning") and Alexander Clark (2001, "Inducing syntactic categories
by context distribution clustering") includes this assumption to some
degree.

Clark's work attempts to cluster the N most-frequent words assuming no
ambiguity. He then uses this clustering as a bootstrap to classify
less frequent words allowing for ambiguity.

Schuetze's work attempts to cluster together word tokens by forming
distributions over both the contexts in which the word type occurs,
and contexts in which the token context occurs. However, the
distributions themselves only allow features from the list of N most
frequent words, which are of course subject to ambiguity that is not
resolved.

This brings me to your proposal of "incompleteness". Similar to
Schuetze's work, you are suggesting treating word (or other linguistic
phenomenon) tokens independently, and attempting to create clusters in
an ad hoc manner. I'm not too sure of the details, so I'll ask:

(1) what does this have to do with syntax, in terms of arrangements of
symbols? A long-distance dependency like that addressed by yourself
and Mike Maxwell in earlier messages demonstrates a syntactic relation
between words that might be explained by a number of different
theories.

Even simple models (like which word occurs N words to the right/left
of the word being "clustered") have real problems with data-sparsity
(or poverty of the stimulus), and more complex models of syntax just
exacerbate the problem. If you are simply saying "co-occurrence", as
in occurs in the same context (be that a phrase, sentence, document,
etc.), you are not saying anything about the arrangement of symbols,
merely that they co-occur. (I may have misinterpreted "co-occurrence",
but I cannot see how a long-distance relationship would be modelled
otherwise because you have not stated what counts as context, other
than being in the corpus.) At no point have you suggested that (non-
lexical) arrangements be learned or used, so I just wanted
clarification.

(2) If a token is treated as incomplete, under the assumption that it
might represent a different sense (or syntactic class, or "meaning", or
whatever the term) to other instances of the type, is the same
assumption extended to the token's context? I would assume that you
have a circular definition here, unless at some point you bite the
bullet and assume some tokens are unambiguously defined. If so, which
bullets do you bite, and under which circumstances?

Cheers,
D


D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 4:06:36 AM9/19/07
to Grammatical Incompleteness
by MIKE MAXWELL

Hmm, not sure if it's best to post a reply to this thread, or start a
new one. As you can see, I decided to do the former.

Back when this discussion was on CorporaList, I (Mike Maxwell) had
written:
> One advantage of splitting apart syntax and semantics ... is
> that you begin to see how you can understand long distance
> dependencies of meaning, for example
> The support that John brought was helpful to making our argument.
> 'support' means one thing; whereas in the following sentence
> The support that John brought was helpful for bracing the building.
> it means something quite different. Syntax allows you to relate the
> choice of sense in each case to words at an arbitrary distance away,
> in a way that collocation *by itself* cannot do.

--to which Rob Freeman replied:
> You can model long distance dependencies using raw word associations
> (collocations), Mike. I did an implementation of this. All you need to
> do is carry the (collocation) features through until they are needed to
> make a selection.

We're both right, and I think this very nicely brings out the
difference between science and engineering.

Namely, I'm prepared to believe that Rob's solution will,
statistically speaking, come up with the right answer most of the
time; and that's all you can expect from an engineering solution.
(Strictly speaking, that's all you can expect from people, too, which
is what inter-annotator agreement is about. But I digress.)

But it is not hard to come up with examples where this doesn't work--
and by "doesn't work", I mean that given such examples, people will
come up with a different answer than what the engineering approach
will come up with, and do so with some consistency (reasonably high
inter-annotator agreement).

For example:
Code:
They supported the building's use as a church on weekends.

The sense of 'support' here is clearly that of agreeing with,
providing moral support, etc. But the nearest noun to the verb
'support' here is 'building'. So the engineering approach would select
the physical sense of support (as would be appropriate for "They
supported the office building").

And contrariwise:
Code:
They supported our organization's building.

the sense that people would pick out is that of physically supporting,
even though the nearest noun is 'organization', which having an
abstract sense would require some other sense of 'support' (as would
be appropriate for "They supported the organization of workers into a
union.")

Now these are clearly contrived sentences--I haven't pulled them out
of a corpus. So a person who believed the only true evidence was
corpus evidence might not be convinced. But since we're talking about
people's interpretation of language, I'm not sure what would convince
such a skeptic--since corpora don't come labeled with senses, and we
have to use people's interpretation of the meaning, regardless of
whether the sentence comes from a corpus or from my head.

There are still issues of directionality, sparse data, how one deals
with ambiguity in the context, and how strong the association has to
be--some words are much weaker indicators of the meaning than others.
But apart maybe from the question of directionality, these are
engineering issues, and there may be engineering solutions that work
most of the time.

In sum, I don't doubt that the raw word association approach will work
most of the time. But there are cases (rare though they may be in
corpora) where it gives significantly different answers from what
people would give, particularly in cases where the syntax and
morphology are at odds with the "nearest neighbor approach." So I call
the raw word association method engineering, and the grammar method
science. That isn't to disparage engineering; you'd be hard pressed to
get me on board a plane that flapped its wings. But 747s aren't
science, and neither is raw word association.

BTW, I suspect someone will bring up the fact that if grammar were
science, as I am claiming, then we would have 100% inter-annotator
agreement on word sense tagging. That obviously doesn't happen, but I
think there are other reasons for that. Go ahead, make my day :-)

Mike Maxwell

D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 4:08:34 AM9/19/07
to Grammatical Incompleteness
by ROB FREEMAN

Hi Yorick,

I hope you make it over to this new forum.

> by YORICK WILKS ...


> I think you may misunderstand what machine learning can do---though of
> course it all depends on what you mean by learning/generalizing from

> the "same context" Modesty normally forbids me citing, say,http://citeseer.ist.psu.edu/stevenson01interaction.html, where


> Stevenson and I combined learners for word-sense disambiguation over
> quite a large corpus (The Red Badge of Courage), and one way of
> interpreting what the learner was doing (and it is something some
> would find distasteful) is that it was learning, for each sense of each
> content word, what were the contexts and criteria that would
> disambiguate it -- there are other bits of contemporary and later
> work that could also be described that way (and this was not at all
> simple unsupervised learning, either).


The term "machine learning" covers a lot of ground. In this case,
though, I'm not sure if you are saying I am under appreciating or over
appreciating what it can do. Are you saying that in this case machine
learning was in fact learning a different representation for each
context, as I suggest it should?

So are you saying I am under appreciating what machine learning can
do, and does do, already?

-Rob

D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 4:12:08 AM9/19/07
to Grammatical Incompleteness
by ROB FREEMAN

Mike Maxwell wrote:
> Back when this discussion was on CorporaList, I (Mike Maxwell) had
> written:
>
> > One advantage of splitting apart syntax and semantics ... is
> > that you begin to see how you can understand long distance
> > dependencies of meaning, for example
> > The support that John brought was helpful to making our argument.
> > 'support' means one thing; whereas in the following sentence
> > The support that John brought was helpful for bracing the building.
> > it means something quite different. Syntax allows you to relate the
> > choice of sense in each case to words at an arbitrary distance away,
> > in a way that collocation *by itself* cannot do.
>
> --to which Rob Freeman replied:
>
> > You can model long distance dependencies using raw word associations
> > (collocations), Mike. I did an implementation of this. All you need to
> > do is carry the (collocation) features through until they are needed to
> > make a selection.
>
> We're both right, and I think this very nicely brings out the
> difference between science and engineering.
>

> ...


>
> But it is not hard to come up with examples where this doesn't work--

Mike. You've invented something called the "engineering approach",
given it to me, and then proceeded to argue it does not work.

But you don't know what it is I am suggesting.

I said the method I use works much like feature unification in HPSG.
Is generative grammar implemented in HPSG then "engineering" or
"science"?

The bottom line is you can't argue something doesn't work without
knowing what it is.

I know you will dismiss what I suggest out of hand. But you are
getting ahead of yourself. Can you first look at what I am proposing
and _then_ dismiss it out of hand, thanks!

-Rob

D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 4:13:02 AM9/19/07
to Grammatical Incompleteness
by ROB FREEMAN
David,

It is by no means certain I completely understand the nuances of your
questions, so let me tell you what I have done and we can try to work
from there.

Distributional analysis is a good starting point.

Take just the immediate context. The word immediately preceding or
following will do. Traditionally you would form a vector consisting of
these immediately preceding or following contexts for each word in
your corpus, and then try to form classes based on such contexts. I
don't recall if that is what Hinrich Schuetze or Steve Finch did in
their early work, but I think it is. Schuetze's is the first I came
across (e.g. Schuetze "Dimensions of Meaning" http://citeseer.ist.psu.edu/23424.html).

So the way this works is that if two words have a couple of contexts
in common, say they both have the preceding contexts "a" and "the",
you are well on your way to guessing you're looking at words belonging
to the same syntactic class. In this case you might call that class
"noun".

Note it is one or other set of contexts, within all the contexts of two
words, which determines whether they are in the same class.

Now, the incompleteness argument is that there will be more than one
set like "a" and "the" which different words have in common. In general
word A might have set X in common with word B, but set Y in common
with word C. So which set is A in?

Well it is in both, but it can't be in both at the same time. That is
the problem. Either A is in a class with B and not C, or C and not B.
It can't be in both because the classes are different.

I say you should let context decide.

Remember, you just need to choose a set, X or Y. Context can do that.
Put two words together in sequence, the word you want to "classify"
and the context you want to "classify" it in. You can let the contexts
of one select the contexts of another according to combinations which
occur together in sequence.

Basically, instead of matching contexts directly to find a set two
words have in common (as you do when you cluster classes normally),
now you use one word to select a set in another, according to which
contexts of each occur together in sequence.

You just need to choose a set, and the observed sequences can do that.
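
Here is a toy sketch of the kind of selection I mean (Python, with a tiny made-up corpus and a deliberately crude notion of "shared context" -- an illustration of the procedure only, not my actual implementation):

Code:
from collections import defaultdict

corpus = ("the dog barks . the dog bites . the cat bites . "
          "the cat sleeps . a hound barks . the man walks .").split()   # toy corpus

# immediate contexts of each word: the words seen directly before and after it
before, after = defaultdict(set), defaultdict(set)
for i, w in enumerate(corpus):
    if i > 0:
        before[w].add(corpus[i - 1])
    if i < len(corpus) - 1:
        after[w].add(corpus[i + 1])

attested = {(corpus[i], corpus[i + 1]) for i in range(len(corpus) - 1)}

def adhoc_class(word, context_word):
    """Words substitutable for `word` in the context of `context_word`:
    they share an immediate context with `word`, and the substitution is
    itself attested in sequence with `context_word`."""
    share = {w for w in before
             if w != word and (before[w] & before[word] or after[w] & after[word])}
    return {w for w in share if (w, context_word) in attested}

# the class selected for "dog" depends on the context it is combined with
print(adhoc_class("dog", "barks"))   # {'hound'}
print(adhoc_class("dog", "bites"))   # {'cat'}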

Does that make sense?

-Rob

D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 5:02:31 AM9/19/07
to Grammatical Incompleteness
by DAVID BROOKS
Hi Rob,

> It is by no means certain I completely understand the nuances of your
> questions, so let me tell you what I have done and we can try to work
> from there.
>
> Distributional analysis is a good starting point.
>
> Take just the immediate context. The word immediately preceding or
> following will do. Traditionally you would form a vector consisting of
> these immediately preceding or following contexts for each word in
> your corpus, and then try to form classes based on such contexts. I
> don't recall if that is what Hinrich Schuetze or Steve Finch did in
> their early work, but I think it is. Schuetze's is the first I came
> across (e.g. Schuetze "Dimensions of Meaning" http://citeseer.ist.psu.edu/23424.html).

Yes, this sounds like what I was trying to describe.

> So the way this works is that if two words have a couple of contexts
> in common, say they both have the preceding contexts "a" and "the",
> you are well on your way to guessing you're looking at words belonging
> to the same syntactic class. In this case you might call that class
> "noun".
>
> Note it is one or other set of contexts, within all the contexts of two
> words, which determines whether they are in the same class.

Agreed, this is the intuition. One looks for some substantial
similarity between the words, usually in terms of the size of the
intersection of these sets.
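
(In the simplest case, something like this -- toy context sets, obviously:)

Code:
contexts_A = {"a_", "the_", "this_", "_is"}    # made-up context features for word A
contexts_B = {"a_", "the_", "_was"}            # ... and for word B

overlap = contexts_A & contexts_B
jaccard = len(overlap) / len(contexts_A | contexts_B)
print(overlap, jaccard)                        # shared contexts {'a_', 'the_'} and similarity 0.4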

> Now, the incompleteness argument is that there will be more than one
> set like "a"and "the" which different words have in common. In general
> word A might have set X in common with word B, but set Y in common
> with word C. So which set is A in?
>
> Well it is in both, but it can't be in both at the same time. That is
> the problem. Either A is in a class with B and not C, or C and not B.
> It can't be in both because the classes are different.

Right, this is where we're getting a little fuzzy. I agree that if you
assume a word *type* to be ambiguous, you are necessarily suggesting
that the set of contexts is not a single set, but rather it should be
viewed as a set that is partitioned, with one partition for each
"sense" of a word. This is what Schuetze's later work attempted: it
started with the assumption that tokens of a type are similar, but may
constitute quite different senses dependent on context. I wouldn't
argue against his intuition or methodology, nor against your
description above.

This view is a "mixture model": we view the complete set of contexts
for a type (A is the type, and its complete set of contexts will at
least cover the union of X and Y) as a mixture of subsets, where the
correct mixture should find the sets X and Y exactly. However, in this
case it sounds easier than it is...

> I say you should let context decide.
>
> Remember, you just need to choose a set, X or Y. Context can do that.
> Put two words together in sequence, the word you want to "classify"
> and the context you want to "classify" it in. You can let the contexts
> of one select the contexts of another according to combinations which
> occur together in sequence.

Agreed again. This is the equivalent of using an established (trained)
classifier. If the observed contextual feature ("the context you want
to 'classify' it in") has some strong association with one
classification (A-X or A-Y) then the classifier will work. (I
mentioned data-sparsity in my earlier post, and this is where it kicks
in: not all features have strong associations; many feature-values are
infrequent; etc.)

> Basically, instead of matching contexts directly to find a set two
> words have in common (as you do when you cluster classes normally),
> now you use one word to select a set in another, according to which
> contexts of each occur together in sequence.

Right. I think this is precisely the intuition behind the works of
Schuetze and Clark I mentioned earlier. Their implementations go some
way towards what you are saying. Schuetze's work, for example, gives
equal weight to the relation of a token to its type, and the relation
to its context - he applies distributional analysis to each token,
considering vectors of type-contexts and context-contexts.
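
(Roughly, as I understand it -- a toy sketch of the second-order idea, with made-up numbers, not Schuetze's actual implementation:)

Code:
import numpy as np

# first-order vectors: each word type represented by co-occurrence counts
# with a small fixed feature vocabulary (toy numbers, not real data)
type_vectors = {
    "bank":  np.array([4.0, 2.0, 0.0, 0.0]),
    "river": np.array([3.0, 1.0, 0.0, 1.0]),
    "loan":  np.array([2.0, 3.0, 0.0, 1.0]),
    "water": np.array([3.0, 0.0, 1.0, 0.0]),
}

def token_vector(context_words):
    """Second-order vector for a token: the centroid of the type vectors
    of the words in its context (the 'context-contexts' idea, roughly)."""
    vs = [type_vectors[w] for w in context_words if w in type_vectors]
    return np.mean(vs, axis=0)

# the same type, 'bank', gets different token vectors in different contexts,
# and those token vectors can then be clustered into senses
print(token_vector(["river", "water"]))
print(token_vector(["loan"]))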

> Does that make sense?

It does, but the critical issue I think needs resolving is the
presence of ambiguity in both words and their contexts.

Hopefully I can clarify by example. Let's say we have a fragment:

"... Alfonso's strong desire to prospect in Bradford's land ..."

(Nonsense, I know, though I'm sure I could find a corpus sentence
with equivalent properties of having an ambiguous function word
preceded/followed by ambiguous content words. Perhaps this is the
worst kind of introspection - made with faith that I could find an
example in a corpus ;) )

I'm particularly interested in "desire to prospect" here:
"to" can be either an infinitive marker (in this case) or a
preposition, and we need to know which sense we have.
"prospect" and "desire" may be too infrequent to give adequate support
for a cluster.

So let's go after the infrequent words first: "prospect". We might
want to know if it is a verb or a noun (or at least one of a number of
senses), but in order to do so we need to look at the context.
However, "to" is also ambiguous, so we now also have to decide upon a
sense for that!

Clearly, we would be unwise to make an assumption that the word is
unambiguous, since this is not the case here; equally, if we assume
ambiguity, we have a real problem, because we don't know which sense we
should use. And this notion extends to the contexts - they are also
subject to ambiguity, and *this*, I think, is the critical point of
both of our arguments.

The approaches of Clark and Schuetze are admirable in that they
approach ambiguity, but they are still forced to assume, at some
point, that words or contexts are unambiguous. (This is a great
improvement on making the assumptions throughout, as in earlier work
on unsupervised clustering.)

I think the best machine learning can hope for is that we can have
some degree of faith that the context belongs to some class or other,
but to keep the options open to allow for competing explanations.
Probabilistic parsing runs along these lines (reranking, etc.), but
sadly it is usually treated as an entirely separate step to word
clustering. (On an optimistic note, it is an aim of the UDOP
approach.) This is why I'm interested in your suggestion of
"incompleteness", because if it presents a workable solution to the
above ambiguity it would be very useful indeed.

Cheers,
D

D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 5:18:30 AM9/19/07
to Grammatical Incompleteness
by ALEXC

I think from your description you are merely pointing out a flaw in
distributional modes of analysis -- completeness (in the logical
sense) is something utterly different.

Your examples above are surely just saying that observing two strings
in the same context does not imply they are syntactically congruent.
This limitation of the distributional approach was discussed in
Chomsky's LSLT a long time ago.

D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 5:19:39 AM9/19/07
to Grammatical Incompleteness
by ALEXC

robf wrote:

> Well it is in both, but it can't be in both at the same time. That is the problem. Either A
> is in a class with B and not C, or C and not B. It can't be in both because the classes
> are different.

You are assuming that the classes are disjoint -- which is false,
since we know that words are lexically ambiguous. So 'can' could be a
noun or a modal auxiliary. Clearly the syntactic classes must overlap.
If you wish them not to overlap, then you have to expand out the
ambiguities by having a richer set of classes (a power-set
construction, like the one used for determinising FSAs).
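
(In miniature, with made-up lexical assignments: each distinct set of possible classes becomes a class of its own, so the expanded classes are disjoint even though the original ones overlap.)

Code:
# ambiguous class assignments (made up)
lexicon = {
    "can":   {"NOUN", "MODAL"},
    "fish":  {"NOUN", "VERB"},
    "the":   {"DET"},
    "swims": {"VERB"},
}

# expand out the ambiguity: treat each distinct *set* of classes as a new class
expanded_classes = {frozenset(cs) for cs in lexicon.values()}
for c in sorted(expanded_classes, key=sorted):
    print(set(c))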

D.J.B...@cs.bham.ac.uk

unread,
Sep 19, 2007, 5:21:40 AM9/19/07
to Grammatical Incompleteness
by IainD

I've avoided this discussion in the main forum, but now I feel able to
roll my sleeves up and put the gloves on...

As background, I'm an engineer, which means I care more that things
work than about knowing why they work (though the former generally
requires the latter).

For me there have been a number of points made.

1. Syntactic analysis (parsing) does not work well if considered in
isolation from lexis, semantics and pragmatics.

2. Language and lexis change over subject, style, person and time. Any
non-adaptive language analysis process will fail (perform poorly) over
time or in different environments (this is particularly close to my
heart, though mainly in respect of classification).

3. It is possible to extract patterns / grammar from unstructured text
through various machine learning processes. This is a well-researched
process for lexis, though 'grammar' induction is in its infancy.

For me, the problem with this approach is that we choose to ignore
pragmatics. The human approach to language appears to be to have a
message in one's head that one intends to give, and a matching
expectation of a message in the receiver's head that they expect to
receive. If the message arriving is significantly different from what
the receiver expects or has background to understand (which is broadly
the same thing), then communication stalls or fails.

In short, I think the alternative meanings are selected by the general
subject matter (social organisation vs architecture) in the
communicators' heads, and it will NEVER be possible to disambiguate them
without (somehow) taking that into account.

The question then arises as to the extent to which the context can be
determined by the surrounding text (quite possible in the case of the
example of support for the Church) and the extent to which 'world
knowledge' is required.

For example, despite being an intelligent (well I think so) and
reasonably well read (in computational linguistics and related areas)
person, I frequently come across papers which are totally
incomprehensible at first. Comprehension requires me to acquire
something of the writer's mental processes through a combination of
external background reading (often wikipedia these days) and an
assimilation of the style of the writer, usually by repeated reading
of the paper and so on.

This is arguably an extreme example (and one devoid of corpus proof
;-) ), but publications in any niche area can be a struggle to read
until you gain the flavour of them.

So I think that the exercises to induce an adaptive grammar (which is
in effect what I think Rob is doing, though I'm not entirely sure he
would agree) from unstructured text are worthy and valuable. However,
I don't think the problem of language analysis will be solved until we
recognise that pragmatics, semantics, syntax and lexis are not
distinct.

Rob Freeman

unread,
Sep 19, 2007, 5:48:12 AM9/19/07
to grammatical-i...@googlegroups.com

by ALEXC:

> I think from your description you are merely pointing out a flaw in
> distributional modes of analysis -- completeness (in the logical
> sense) is something utterly different.

I'm interested to know about this difference. I think they are the same, but I am not well versed in how this is seen in logical theory. If you can elaborate, that would be good.

> Your examples above are surely just saying that observing two strings
> in the same context does not imply they are syntactically congruent.
> This limitation of the distributional approach was discussed in
> Chomsky's LSLT a long time ago.

What was Chomsky's discussion?

-Rob


alexs...@googlemail.com

unread,
Sep 19, 2007, 6:01:41 AM9/19/07
to Grammatical Incompleteness

On Sep 19, 10:48 am, "Rob Freeman" <gro...@chaoticlanguage.com> wrote:

Simply that we have pairs of strings like "I am thin" and "I am an
Englishman", but "thin" and "an Englishman" are not syntactically
congruent (not his example).

Alex

Rob Freeman

unread,
Sep 19, 2007, 6:02:56 AM9/19/07
to grammatical-i...@googlegroups.com

I'm not assuming they are disjoint. I am assuming they are not the same.

Sure, if there is overlap you can take the intersection as your class defining set. That would be something like what machine learning does now. But then you are throwing away all the information about properties which different sub-classes of words have in common, but which are not common to them all.

It is exactly this last information which I think is important.

Here's what Wittgenstein had to say about this, as a matter of interest:

This is Kuhn citing Wittgenstein:

Kuhn, The Structure of Scientific Revolutions, pp. 44-45 (citing Ludwig Wittgenstein, Philosophical Investigations, trans. G. E. M. Anscombe, pp. 31-36):

"What need we know, Wittgenstein asked, in order that we apply terms like 'chair', or 'leaf', or 'game' unequivocally and without provoking argument?

That question is very old and has generally been answered by saying that we must know, consciously or intuitively, what a chair, or a leaf, or game _is_. We must, that is, grasp some set of attributes that all games and only games have in common. Wittgenstein, however, concluded that, given the way we use language and the sort of world to which we apply it, there need be no such set of characteristics. Though a discussion of _some_ of the attributes shared by a _number_ of games or chairs or leaves often helps us learn how to employ the corresponding term, there is no set of characteristics that is simultaneously applicable to all members of the class and to them alone."

-Rob

Rob Freeman

unread,
Sep 19, 2007, 6:13:47 AM9/19/07
to grammatical-i...@googlegroups.com

> > Your examples above are surely just saying that observing two strings
> > in the same context does not imply they are syntactically congruent.
> > This limitation of the distributional approach was discussed in
> > Chomsky's LSLT a long time ago.
>
> What was Chomsky's discussion?

> Simply that we have pairs of strings like "I am thin" and "I am an
> Englishman", but "thin" and "an Englishman" are not syntactically
> congruent (not his example).

Yes, this does seem to be the same.

Two words can have contexts in common, and yet not have all contexts in common. This creates a paradox from the point of view of trying to form a complete grammatical description. Sometimes the words are the same, and sometimes they are different.

I don't know what the rest of Chomsky's analysis was. Any more details you can give would be welcome.

How has machine learning dealt with this problem for that matter?

-Rob


David Brooks

unread,
Sep 19, 2007, 6:41:00 AM9/19/07
to grammatical-i...@googlegroups.com
Rob,

> We must, that is, grasp some set of attributes that all games and only
> games have in common. Wittgenstein, however, concluded that, given the way
> we use language and the sort of world to which we apply it, there need be no
> such set of characteristics. Though a discussion of _some_ of the attributes
> shared by a _number_ of games or chairs or leaves often helps us learn how
> to employ the corresponding term, there is no set of characteristics that is
> simultaneously applicable to all members of the class and to them alone."

Right, but I think the field of machine learning appreciates this.
First, unsupervised clusterings of words are not always made on a
categorical basis (contrast hard- and soft-clustering models, the latter
of which allow you to say "with probability P1, word w1 belongs to class
C1; with probability P2 w1 belongs to C2, etc."), which allows for
uncertainty in class assignments. I don't know how this relates to your
notion of incompleteness.
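
(By way of illustration -- toy numbers, and a deliberately simple similarity-based soft assignment rather than any particular published model:)

Code:
import numpy as np

# toy context vector for a word, and two cluster centroids (made-up numbers)
w1 = np.array([2.0, 1.0, 0.0])
centroids = {"C1": np.array([3.0, 1.0, 0.0]), "C2": np.array([0.0, 1.0, 2.0])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# turn similarities into soft membership probabilities (here just by normalising)
sims = {c: cosine(w1, v) for c, v in centroids.items()}
total = sum(sims.values())
memberships = {c: s / total for c, s in sims.items()}
print(memberships)   # C1 gets most of the mass, but C2 keeps a non-zero share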

Second, soft-clustering is only possible because we assume that there
is no set of characteristics that determines precisely which
classification is right (a soft-clustering under such circumstances
would yield a hard-clustering, since there would be no overlap or
similarity between instances, only a clear division). Of course it helps
if your data has a nice split of characteristics that map exactly on to
classes, but this is rarely the case in interesting machine learning
problems. Where this is not the case, an approximate "rule" (however it
is embodied) is formed, considering more or less useful characteristics,
and this is what machine learning seeks.

The contextual feature-sets analysed by any technique are usually
constrained (in terms of how you specify the underlying model) to make
the problem tractable, so they only capture useful trends in the data
that are consistent with these constraints. This is why I earlier asked
you if you wanted to define your model a priori (and I said this was
equivalent to defining a syntax) or to learn your model somehow.

Finally, I think there is also an acknowledgement (by the fact that ML
techniques are usually scored on performance out of 100%, rather than
simply saying "it does/doesn't work") that machine learning only
attempts to account for /some/ of the regularities.

A framework like Memory-Based Learning (or k-Nearest Neighbour) will
account for more because it retains specific instances when
generalisations do not hold -- i.e. if we can't interpret according to
existing models, keep the instance in the hope we can form a model
later, or simply use it as an exception-case. Although this is similar
in intention to your suggestions, I don't think this is what you are
arguing for, but I might be wrong.
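
(A bare-bones k-nearest-neighbour classifier over stored instances, just to show what I mean by retaining instances -- toy features and labels:)

Code:
from collections import Counter

# memory: stored (feature-tuple, label) instances, all made up
memory = [((1, 0, 1), "N"), ((1, 0, 0), "N"), ((0, 1, 1), "V"), ((0, 1, 0), "V")]

def knn(features, k=3):
    # rank stored instances by Hamming distance to the query, vote among the top k
    ranked = sorted(memory, key=lambda m: sum(a != b for a, b in zip(features, m[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn((1, 1, 1)))   # classified by a vote among its nearest stored neighbours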

D

Rob Freeman

unread,
Sep 19, 2007, 6:43:12 AM9/19/07
to grammatical-i...@googlegroups.com
Ian,

by IainD:

> ...
>
> For me, the problem with this approach is that we choose to ignore
> pragmatics.

I don't see pragmatics as a problem. John Sowa raised this point. As I said to him, it is one of the nice features of the ad-hoc generalization approach I'm suggesting, that a basis for meaning in context provides a natural path to extend the kind of context we regard as "meaningful" beyond text, as far as we like.

And within text or without, there is no reason why we should exclude pragmatic information. What we mean by pragmatics might emerge naturally from patterns in contexts.

> In short, I think the alternative meanings are selected by the general
> subject matter (social organisation vs architecture) in the
> communicators' heads, and it will NEVER be possible to disambiguate them
> without (somehow) taking that into account.

It depends whether you think what is in their heads is intrinsically different from what comes out of their mouths. I suggest the one models the other.

I think it is worth exploring as a model, anyway, right through to pragmatics. Why not? Analyse pragmatics in terms of patterns in contexts, social contexts, etc.

> The question then arises as to the extent to which the context can be
> determined by the surrounding text (quite possible in the case of the
> example of support for the Church) and the extent to which 'world
> knowledge' is required.
>
> For example, despite being an intelligent (well I think so) and
> reasonably well read (in computational linguistics and related areas)
> person, I frequently come across papers which are totally
> incomprehensible at first. Comprehension requires me to acquire
> something of the writer's mental processes through a combination of
> external background reading (often wikipedia these days) and an
> assimilation of the style of the writer, usually by repeated reading
> of the paper and so on.

I think your experience fits the model I am suggesting very well.

Meaning is an organization of examples. If you don't have access to the same examples, you will not understand. When you say that doing some background reading on a subject enables you to understand, you are supporting a corpus model of meaning.

And your stipulation of "world knowledge" is no problem at all. A basis for meaning in context does not imply "context" should be limited to text. I don't think it is. There is no reason we should limit context in our models to text. We should index it with text, with language anyway. But I'm sure it will prove necessary to extend what we think of as context in all kinds of ways.

One nice thing about this model is that it provides a natural way to do this. More general context can be handled in exactly the same way as text context.

As a first step on the path to broadening context from text we might work directly with speech and capture the "meaning" implicit in the context of pitch, stress, intonation etc.

> So I think that the exercises to induce an adaptive grammar (which is
> in effect what I think Rob is doing, though I'm not entirely sure he
> would agree) from unstructured text are worthy and valuable. However,
> I don't think the problem of language analysis will be solved until we
> recognise that pragmatics, semantics, syntax and lexis are not
> distinct.

A common definition in terms of context lets us integrate them.

-Rob

Rob Freeman

unread,
Sep 19, 2007, 7:01:33 AM9/19/07
to grammatical-i...@googlegroups.com
David,

There's a reply to your earlier message in the works.


On 9/19/07, David Brooks <d.j.b...@cs.bham.ac.uk> wrote:

> We must, that is, grasp some set of attributes that all games and only
> games have in common. Wittgenstein, however, concluded that, given the way
> we use language and the sort of world to which we apply it, there need be no
> such set of characteristics. Though a discussion of _some_ of the attributes
> shared by a _number_ of games or chairs or leaves often helps us learn how
> to employ the corresponding term, there is no set of characteristics that is
> simultaneously applicable to all members of the class and to them alone."

> Right, but I think the field of machine learning appreciates this.
> First, unsupervised clusterings of words are not always made on a
> categorical basis (contrast hard- and soft-clustering models, the latter
> of which allow you to say "with probability P1, word w1 belongs to class
> C1; with probability P2 w1 belongs to C2, etc."), which allows for
> uncertainty in class assignments. I don't know how this relates to your
> notion of incompleteness.

I think it says incompleteness explains why statistical language models appear to be random.

> Second, soft-clustering is only possible because we assume that there
> is no set of characteristics that determines precisely which
> classification is right (a soft-clustering under such circumstances
> would yield a hard-clustering, since there would be no overlap or
> similarity between instances, only a clear division). Of course it helps
> if your data has a nice split of characteristics that map exactly on to
> classes, but this is rarely the case in interesting machine learning
> problems. Where this is not the case, an approximate "rule" (however it
> is embodied) is formed, considering more or less useful characteristics,
> and this is what machine learning seeks.

So you blur your model to take into account the fact it can't handle all your information.

Once again, this under-specifies the system and makes it appear random.

> The contextual feature-sets analysed by any technique are usually
> constrained (in terms of how you specify the underlying model) to make
> the problem tractable, so they only capture useful trends in the data
> that are consistent with these constraints. This is why I earlier asked
> you if you wanted to define your model a priori (and I said this was
> equivalent to defining a syntax) or to learn your model somehow.

So you ignore features which don't fit your model?

If you can do that and still make useful predictions, I guess it is OK. But in fact our language models all turn out random.

> Finally, I think there is also an acknowledgement (by the fact that ML
> techniques are usually scored on performance out of 100%, rather than
> simply saying "it does/doesn't work") that machine learning only
> attempts to account for /some/ of the regularities.

And such lowered expectations are a good thing?

> A framework like Memory-Based Learning (or k-Nearest Neighbour) will
> account for more because it retains specific instances when
> generalisations do not hold -- i.e. if we can't interpret according to
> existing models, keep the instance in the hope we can form a model
> later, or simply use it as an exception-case. Although this is similar
> in intention to your suggestions, I don't think this is what you are
> arguing for, but I might be wrong.

I like MBL and kNN. As you say they deal with more complexity by postponing generalization. The problem with MBL and kNN is that once again they assume a grammar. In this case the grammar they assume is their solution set.

MBL and kNN both work as classifiers. If you don't have a complete set of classes (and cannot get a complete set, can never have a complete set) they can never work (beyond the randomness inherent in their solution set.)

So, in general, I agree machine learning handles these kinds of things now, in a way, by accepting randomness. I'm just suggesting that the randomness comes from incompleteness in the underlying model. And looking at incompleteness in the underlying model might be a better way to handle it.

-Rob

David Brooks

unread,
Sep 19, 2007, 7:36:13 AM9/19/07
to grammatical-i...@googlegroups.com
Rob Freeman wrote:

> I think it says incompleteness explains why statistical language models
> appear to be random.

They may be random, but the aim of machine learning (at least the
generative modelling stuff) is to impose structure on that randomness in
useful ways.

> So you blur your model to take into account the fact it can't handle all
> your information.
>
> Once again, this under-specifies the system and makes it appear random.

I wouldn't say "to take into account", I would say "as acknowledgement
of". It allows words to belong simultaneously to two-or-more classes,
where it has similarity with each. The most similar in any context wins,
presumably, but how you define "similarity" and "context" is a
limitation of your model, and cannot possibly account for all contextual
information. But then, I don't believe (yet!) that your suggestion will
either.

Soft-clustering allows uncertainty, but is certainly not random, as above.

> So you ignore features which don't fit your model?

Yes! I don't think it is a good thing either, but determining relevant
features is tantamount to proposing a syntax, which is why I think there
is a good deal of circularity in the approach. Simply storing instances
does not get around this, because you still need to specify *how you
then analyse them*, which again is a syntactic decision.

> If you can do that and still make useful predictions, I guess it is OK. But
> in fact our language models all turn out random.

Again, I don't think "random" quite describes it. Certainly, they
involve randomness, but there is a world of difference between uniform
distributions and other distributions. Depending on how well your model
fits your problem, "random" (selection within model parameters) can be
very good indeed!

> And such lowered expectations are a good thing?

No, but in the absence of a perfect model, the best approximation wins!
Furthermore, they lead to deepening insight and scientific progress
(because you can analyse and understand their flaws), which are good things.

> I like MBL and kNN. As you say they deal with more complexity by postponing
> generalization. The problem with MBL and kNN is that once again they assume
> a grammar. In this case the grammar they assume is their solution set.

I don't think that's necessarily true. You may have to start with a
model to use kNN, but what you do with exceptions (simply retain, or
reanalyse in an attempt to form another model/grammar that accounts for
exceptions) is a matter of choice, surely? My original interpretation of
your suggestion of ad hoc grammar runs along these lines:

[1] assume a simple model, and learn as best you can;
[2] find all things that don't fit the model;
[3] analyse these outliers and propose some new, more complex model
machinery to deal with them;
[4] go to [1] until convergence.

This is *incredibly* naive in that it is hard to define how you might
actually do any of the above (well, except [4]), but that was my
interpretation and caused my link to Subsumption Architecture. As you
can see, it is quite different to your suggestion because it
necessitates a commitment to model syntax at each stage, but I thought
"ad hoc" might mean "form a new model to fit the data", so you can
(hopefully) see the connection.
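
In runnable form, the naive loop amounts to something like this (a deliberately crude caricature, where the "model" is just a set of word bigrams, outliers are sentences containing unseen bigrams, and the "new machinery" is nothing cleverer than absorbing them):

Code:
corpus = ["the dog barks", "the cat sleeps", "a dog sleeps", "a cat barks"]

def bigrams(sentence):
    ws = sentence.split()
    return set(zip(ws, ws[1:]))

model = bigrams(corpus[0])                                    # [1] start from a trivially simple model
while True:
    outliers = [s for s in corpus if not bigrams(s) <= model] # [2] find what doesn't fit
    if not outliers:                                          # [4] stop at convergence
        break
    model |= bigrams(outliers[0])                             # [3] "extend" the model (here, trivially)

print(len(model), "bigrams learned")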

> MBL and kNN both work as classifiers. If you don't have a complete set of
> classes (and cannot get a complete set, can never have a complete set) they
> can never work (beyond the randomness inherent in their solution set.)

Right, so it isn't a great model for unsupervised learning...

> So, in general, I agree machine learning handles these kinds of things now,
> in a way, by accepting randomness. I'm just suggesting that the randomness
> comes from incompleteness in the underlying model. And looking at
> incompleteness in the underlying model might be a better way to handle it.

I agree with what you say here, which is why I am interested in /how/
you incorporate incompleteness, and what makes it better in practice.

D

Rob Freeman

unread,
Sep 19, 2007, 7:50:18 AM9/19/07
to grammatical-i...@googlegroups.com
David,

A reply to your earlier message.


> This view is a "mixture model": we view the complete set of contexts
> for a type (A is the type, and its complete set of contexts will at
> least cover the union of X and Y) as a mixture of subsets, where the
> correct mixture should find the sets X and Y exactly. However, in this
> case it sounds easier than it is...

I think the problem of finding a single sufficient mixture is impossible. I'm not sure why we assume it could be possible, or even desirable.

Has anyone looked into proving it is possible (or not)?

> I say you should let context decide.
>
> Remember, you just need to choose a set, X or Y. Context can do that.
> Put two words together in sequence, the word you want to "classify"
> and the context you want to "classify" it in. You can let the contexts
> of one select the contexts of another according to combinations which
> occur together in sequence.

> Agreed again. This is the equivalent of using an established (trained)
> classifier. If the observed contextual feature ("the context you want
> to 'classify' it in") has some strong association with one
> classification (A-X or A-Y) then the classifier will work. (I
> mentioned data-sparsity in my earlier post, and this is where it kicks
> in: not all features have strong associations; many feature-values are
> infrequent; etc.)

This would be good.

The only possible flaw I can see is that you are implementing a classifier. Why? What makes you think there is a complete set of classes to resolve to which will completely parametrize your problem?

> ... Schuetze's work, for example, gives
> equal weight to the relation of a token to its type, and the relation
> to its context - he applies distributional analysis to each token,
> considering vectors of type-contexts and context-contexts.

He also tries to learn a complete set of classes this way does he?

> ...the critical issue I think needs resolving is the
> presence of ambiguity in both words and their contexts.
>
> ...
>
> I think the best machine learning can hope for is that we can have
> some degree of faith that the context belongs to some class or other,
> but to keep the options open to allow for competing explanations.
> Probabilistic parsing runs along these lines (reranking, etc.), but
> sadly it is usually treated as an entirely separate step to word
> clustering. (On an optimistic note, it is an aim of the UDOP
> approach.) This is why I'm interested in your suggestion of
> "incompleteness", because if it presents a workable solution to the
> above ambiguity it would be very useful indeed.

I can't see that UDOP will help. The entire content of DOP seems to me to be
that it codes grammar in a tree-bank. There may be some subtle differences
I'm glad Rens Bod has moved away from the extreme expression of this in DOP-LFG and is now looking at doing machine learning at least. But I just can't see how coding grammar in trees is going to make much difference.

What will make a difference, I think, is accepting that the solution set is too big to be summarized completely.

And that brings us to the problem you have raised. I think the problem you see is somewhere in the poverty of the solution set you assume.

I don't see a problem. I don't see why you need to assume something is unambiguous at some point so you can resolve recursive ambiguities.

Why can't the two select each other uniquely? Why can't an entire sequence of words select a unique set of elements to express their combination, and that combination alone?

Why does it have to be a question of assuming either the first word is a noun or the second, so you can resolve the sequence to be either NV or VN? Why shouldn't the case be that "this word" + "that word" _always_ means NV or VN, uniquely specified, no arguments, and no need to guess one first (and if they don't, add some more context until they do)?

-Rob

David Brooks

unread,
Sep 19, 2007, 8:51:51 AM9/19/07
to grammatical-i...@googlegroups.com
Rob Freeman wrote:
> I think the problem of finding a single sufficient mixture is impossible.
> I'm not sure why we assume it could be possible, or even desirable.

Of course it is possible, /for any given corpus/, although it may be
unlikely. If you mean "sufficient for language not present in the
corpus" then that's a different matter, and one that no-one in machine
learning would claim to achieve.

Desirable in practical terms: if it works, we keep it.

> Has anyone looked into proving it is possible (or not)?

Don't know, I'm afraid.

> ... The only possible flaw I can see is that you are implementing a classifier.


> Why? What makes you think there is a complete set of classes to resolve to
> which will completely parametrize your problem?

I don't think there is necessarily a complete set of classes. I am
simply saying that is how these DA algorithms work, on my way towards
the "recursive ambiguity" problem (that you coin below).

>> ... Schuetze's work...


> He also tries to learn a complete set of classes this way does he?

Granted, it is a complete set of classes. Clark has a class of
unresolved entities in his model, though, if this appeals to the
"incompletist" in you. (Not much of a concession, since it is still a
class! But I'm sure some more complicated machinery could be brought in
to deal with the class members.)

> I can't see that UDOP will help. The entire content of DOP seems to me to
> be that it codes grammar in a tree-bank. There may be some subtle
> differences associated with that. But I can't see them. It is just grammar
> by another name.

I'll grant you that DOP has Tree Substitution Grammar at its heart,
which is clearly an a priori model. The form of UDOP is really
unimportant; it is the aims which are optimistic:

UDOP has been established to work with POS categories. The project
coordinators hope to:
[1] extend UDOP so it uses an unsupervised classification algorithm to
select POS categories (extending to words-only corpora)
[2] extend again so that syntactic constraints within the UDOP trees are
considered when making POS category assignments (thus allowing learned
syntactic constraints to interact with decisions that are traditionally
approached with only N-word windows of context features).

That was my note of optimism: feedback from non-adjacent, possibly
syntactically related words, into the word-classification process.

> I don't see a problem. I don't see why you need to assume something is
> unambiguous at some point so you can resolve recursive ambiguities.

Ok. Let's assume that all word tokens are completely independent of
their type, and that they are only associated with the type if there is
sufficient evidence, for which there is initially none. Our corpus can
then be rewritten by replacing each word with a unique symbol, to remove
any preconceptions about what the tokens mean. One such scheme might
just be to assign each token a positive integer starting from 1 - in
which case you have a sequence of numbers from 1 to N (size of corpus),
which is impossible to generalise over.

The alternative is to conflate tokens against their type, which makes an
assumption that leads towards mixtures. You then have to deal with the
problem of ambiguity. This pervades the tokens you look at, the tokens
you consider as contexts, etc. /That/ is why I think recursive ambiguity
is a problem.
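To make the contrast concrete, here is a toy Python sketch of the two
treatments; the corpus and names are invented, purely for illustration:

# Toy illustration of the two treatments of tokens just described.
# The corpus is made up; nothing here is tied to any real system.

corpus = "the dog saw the cat the cat saw the dog".split()

# Treatment 1: every token is a unique symbol, with no appeal to type.
# The result is just the sequence 1..N, which supports no generalisation.
unique_symbols = list(range(1, len(corpus) + 1))
print(unique_symbols)    # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Treatment 2: conflate tokens against their type. Recurring symbols now
# appear, so distributional patterns exist, but every decision about a
# token is mediated by its (possibly ambiguous) type.
type_ids = {}
conflated = []
for token in corpus:
    type_ids.setdefault(token, len(type_ids) + 1)
    conflated.append(type_ids[token])
print(conflated)         # [1, 2, 3, 1, 4, 1, 4, 3, 1, 2]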

> Why can't the two select each other uniquely? Why can't an entire sequence
> of words select a unique set of elements to express their combination, and
> that combination alone?

I'm not saying they cannot, but have you tried doing this on a large scale?

Can you give a simple example of a complete corpus, a procedure for
performing your combination, and the end result? I think it would really
help me understand. Your sentences in earlier messages/threads all
assume they are drawn from a corpus containing more information, where
identified noun-phrases are used as contexts without describing how you
arrive at their identification.

Have you seen the EMILE algorithm (Adriaans, 1992, "Language Learning
from a Categorial Perspective")? It may be relevant - discusses
"support" and "characteristic sets" in a distributional setting.

> Why does it have to be a question of assuming either the first word is a
> noun or the second, so you can resolve the sequence to be either NV or VN?
> Why shouldn't the case be that "this word" + "that word" _always_ means NV
> or VN, uniquely specified, no arguments, and no need to guess one first (and
> if they don't add some more context until they do.)

A probabilistic treatment that leaves open all possibilities does this,
and affords the power of a metric of preference between possibilities.
It's just very hard to achieve.

I'd also like to clarify my position that even the assumption of linear
order between pairs of words is of a syntactic nature, and vital to your
model. We are always committing to such syntactic models. In ML cases,
this commitment is where the model to be learnt is defined; in your case
the commitment is made when you analyse a word to find its meaning -
/where/ you look in the corpus *is* your definition of syntax. Are you
saying you make no such commitments?

Cheers,
D

Iain

unread,
Sep 19, 2007, 8:52:10 AM9/19/07
to Grammatical Incompleteness

On Sep 19, 11:43 am, "Rob Freeman" <gro...@chaoticlanguage.com> wrote:
> Ian,


> > For me, the problem with this approach is that we choose to ignore
> > pragmatics.
>
> I don't see pragmatics as a problem. John Sowa raised this point. As I said
> to him, it is one of the nice features of the ad-hoc generalization approach
> I'm suggesting, that a basis for meaning in context provides a natural path
> to extend the kind of context we regard as "meaningful" beyond text, as far
> as we like.
>
> And within text or without, there is no reason why we should exclude
> pragmatic information. What we mean by pragmatics might emerge naturally
> from patterns in contexts.
>

The point I was trying to make about pragmatics (or context) is that
some can most certainly be gleaned from the text immediately around
phrases. Some may be available from documents which are in some way
associated with the one under consideration. Some may be in text
somewhere, but not even Google may have access to it, and some perhaps
really is tacit. It is purely a model inside our heads, perhaps
originating in another medium (vision, smell, touch) or simply (!)
being the digest of many years of living in the world.

In short I don't think we can ever achieve language processing purely
by bootstrapping text.

The problem I see with implementing a system which treats each phrase
or sentence in terms of the totality of all the other phrases in your
collection is simply the vastness of it. We need to have some means of
reducing that vastness to some kind of model, even if the model is
just an index to the original data.

> In short, I think the alternative meanings are selected by the general
>
> > subject matter (social organisation vs architecture) in the
> > communicators heads and it will NEVER be possible to disambiguate that
> > without (somehow) taking that into account.
>
> It depends whether you think what is in their heads is intrinsically
> different from what comes out of their mouths. I suggest the one models the
> other.

There is clearly a relationship. Though tenuous sometimes in my
experience ;)... However, human communication is much richer than
speech so it seems to me that our internal models are richer than
speech. Unfortunately, here we run into philosophy (or perhaps
religion) as it is extremely hard to experiment with our internal
models!

>
> I think it is worth exploring as a model, anyway, right through to
> pragmatics. Why not? Analyse pragmatics in terms of patterns in contexts,
> social contexts, etc.

Broadly, I agree. In fact is that not what science does? Analyses
'things' in terms of patterns in contexts?


>
> Meaning is an organization of examples. If you don't have access to the same
> examples, you will not understand. When you say that doing some background
> reading on a subject enables you to understand, you are supporting a corpus
> model of meaning.

I find myself unable to believe in a one-to-one correspondence between
words (however processed) and meaning. Sorry!


alexs...@googlemail.com

unread,
Sep 19, 2007, 2:23:42 PM9/19/07
to Grammatical Incompleteness

On Sep 19, 11:13 am, "Rob Freeman" <gro...@chaoticlanguage.com> wrote:


I think your use of the word paradox here is misleading: this is not
a logical issue. It's a problem for naive learning algorithms, not a
paradox.
This is just a fact about certain classes of languages -- and is quite
different to the issue of lexical ambiguity.


Rob Freeman

unread,
Sep 19, 2007, 10:19:01 PM9/19/07
to grammatical-i...@googlegroups.com
On 9/19/07, David Brooks <d.j.b...@cs.bham.ac.uk> wrote:

Rob Freeman wrote:
> Why can't the two select each other uniquely? Why can't an entire sequence
> of words select a unique set of elements to express their combination, and
> that combination alone?

I'm not saying they cannot, but have you tried doing this on a large scale?

The most complete implementation I have done used a corpus of some 40 million words. I did not attempt to label words in any way, but it naturally finds a kind of phrase structure in any sentence you present to it.

A Chinese student applied the algorithm to Chinese, and got bracketing precision and recall results which he claimed matched state-of-the-art symbolic parsers for that language.

Can you give a simple example of a complete corpus, a procedure for
performing your combination, and the end result? I think it would really
help me understand.

There was a Web demo which is a bit old now. It includes flawed assumptions. Still, it demonstrates some basic ideas.

If you have a server with enough RAM to spare (about 200Mb resident) we could probably put that old demo back on the Web within a day or so.

Your sentences in earlier messages/threads all
assume they are drawn from a corpus containing more information, where
identified noun-phrases are used as contexts without describing how you
arrive at their identification.

My earlier messages don't assume structure. I just didn't mention it. That was mostly to keep things simple. I first wanted to point out the representational power you get by selecting sets ad-hoc among contexts.

It turns out that power is enough to deal with any structural issues by itself. All you have to do is iterate selections between immediately adjacent tokens.

So given a sequence of words ABC, if you first select combinations between AB or BC, and then select from among the results of that combination with what was left ((AB)C) or (A(BC)), that naturally gives a best order of combination for ABC. That most natural "order of combination" turns out to look a lot like parse or constituent structure.

As part of that, halfway through combining all these groups of words into partial combinations, you naturally arrive at a point where a group which looks a lot like a noun-phrase, or other, is acting as a context to select something else, as you mention above.

It happens, but all you have to do to make it happen is model selections between immediately adjacent tokens, and iterate that.

As I say, it is all really down to the power of selections to find new sets. That means you can always find a new set of contexts which uniquely specifies the behaviour of each partial combination AB, BC, ((AB)C)... etc, and can iterate the selection.
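Just as a toy illustration (the corpus below is invented, and the crude
pair-count score stands in for the real selection between sets of
contexts, so don't read it as my actual implementation), the choice
between the two orders of combination for a three-word input looks
something like this:

# Toy sketch only: score the two bracketings of a three-word input by
# corpus support for the inner pair, and keep the better one. The raw
# adjacent-pair count used here is a placeholder for the real
# context-set selection.

from collections import Counter

def pair_counts(tokens):
    """Count every immediately adjacent pair in the corpus."""
    return Counter(zip(tokens, tokens[1:]))

def best_bracketing(a, b, c, counts):
    """Return ((A B) C) or (A (B C)), whichever inner pair is better attested."""
    return ((a, b), c) if counts[(a, b)] >= counts[(b, c)] else (a, (b, c))

corpus = "the dog barked at the cat and the dog slept near the old cat".split()
counts = pair_counts(corpus)
print(best_bracketing("the", "dog", "barked", counts))
# (('the', 'dog'), 'barked') -- "the dog" is the better-supported unit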

I'd also like to clarify my position that even the assumption of linear
order between pairs of words is of a syntactic nature, and vital to your
model. We are always committing to such syntactic models. In ML cases,
this commitment is where the model to be learnt is defined; in your case
the commitment is made when you analyse a word to find its meaning -
/where/ you look in the corpus *is* your definition of syntax. Are you
saying you make no such commitments?

Perhaps I shouldn't have mentioned so early the way you can interpret sets of contexts as "meaning". We can forget about meaning completely if you like. Talk only about syntax and selection between sets of immediate contexts.

I'm not quite sure what you mean by "committing to ... syntactic models". My syntactic model is the corpus. If a combination of words occurs in the corpus, it is in the syntactic model. (If two words share a lot of contexts they are hypothesized to belong to one and the same syntactic class, and allowed to make predictions about each other.) Up to now I've only looked at immediate contexts. That is a commitment of sorts. I don't see why it shouldn't be extended though.

-Rob

Rob Freeman

unread,
Sep 19, 2007, 10:23:55 PM9/19/07
to grammatical-i...@googlegroups.com

On Sep 19, 11:13 am, "Rob Freeman" < gro...@chaoticlanguage.com> wrote:
>
> Two words can have contexts in common, and yet not have all contexts in
> common. This creates a paradox from the point of view of trying to form a
> complete grammatical description. Sometimes the words are the same, and
> sometimes they are different.
...


I think your use of the word paradox here is misleading:  this is not
a logical issue. It's a problem for naive learning algorithms, not a
paradox.
This is just a fact about certain classes of languages -- and is quite
different to the issue of lexical ambiguity.

I think the issues are the same, but I would be interested to hear arguments distinguishing them.

-Rob


Rob Freeman

unread,
Sep 19, 2007, 11:16:33 PM9/19/07
to grammatical-i...@googlegroups.com
On 9/19/07, Iain <ia...@idcl.co.uk> wrote:

In short I don't think we can ever achieve language processing purely
by bootstrapping text.

Well, to make distinctions based on broader information we would need to store that broader information: vision, smell, touch, you name it. I just don't see why that broader information, once stored, should not be analysed in the way I am suggesting for text, and indexed on text (that which we call a rose... etc.)

The problem I see with implementing a system which treats each phrase
or sentence in terms of the totality of all the other phrases in your
collection is simply the vastness of it.  We need to have some means of
reducing that vastness to some kind of model, even if the model is just
an index to the original data.

Potentially it is very big. I don't think we store everything. It is just that I believe what we do store, we store in the form of examples.

I find myself unable to believe in a one-to-one correspondence between
words (however processed) and meaning.  Sorry!

It is essential to my point that I agree! Words do not have a one-to-one correspondence with meaning. Sets of words, however, I believe do have enough power to represent the seemingly infinite gradations of meaning. That is why I suggest we represent meaning with sets of words (or any sets of examples.)

-Rob

Iain Downs

unread,
Sep 20, 2007, 3:57:46 AM9/20/07
to grammatical-i...@googlegroups.com

So given a sequence of words ABC, if you first select combinations between AB or BC, and then select from among the results of that combination with what was left ((AB)C) or (A(BC)), that naturally gives a best order of combination for ABC. That most natural "order of combination" turns out to look a lot like parse or constituent structure.

As part of that, halfway through combining all these groups of words into partial combinations, you naturally arrive at a point where a group which looks a lot like a noun-phrase, or other, is acting as a context to select something else, as you mention above.

It happens, but all you have to do to make it happen is model selections between immediately adjacent tokens, and iterate that.

As I say, it is all really down to the power of selections to find new sets. That means you can always find a new set of contexts which uniquely specifies the behaviour of each partial combination AB, BC, ((AB)C)... etc, and can iterate the selection.

[[Iain]] This seems to be very similar to the methods used to perform unsupervised grammar induction.  Both Alignment Based Learning (van Zaanen) and ADIOS (Solan) have the idea of selecting partial combinations and iterating from there.  Both claim some success at identifying remote connections.  How does your approach differ?

 

Iain

alexs...@googlemail.com

unread,
Sep 20, 2007, 4:18:10 AM9/20/07
to Grammatical Incompleteness

On Sep 20, 3:23 am, "Rob Freeman" <gro...@chaoticlanguage.com> wrote:

The example I gave does not use lexical ambiguity -- that should be
argument enough!

David J Brooks

unread,
Sep 20, 2007, 6:01:26 AM9/20/07
to grammatical-i...@googlegroups.com
> If you have a server with enough RAM to spare (about 200Mb resident) we
> could probably put that old demo back on the Web within a day or so.

I can look into it, but that's probably not going to be provided by
google groups ;)

> It turns out that power is enough to deal with any structural issues by
> itself. All you have to do is iterate selections between immediately
> adjacent tokens.

Right, so you're talking about a greedy learning framework, such as
those described by Iain (ABL-incr is the only instantiation of ABL that
behaves in this way, however). There have been many attempts to model
language learning in this way: ADIOS, ABL, EMILE, the work of J. Gerard
Wolff, Stolcke and Omohundro, and the more ad hoc work of Smith and
Witten or my own thesis. Alex Clark and Dan Klein have both discussed
the problems of such an approach.

> I'm not quite sure what you mean by "committing to ... syntactic models". My
> syntactic model is the corpus. If a combination of words occurs in the
> corpus, it is in the syntactic model. (If two words share a lot of contexts
> they are hypothesized to belong to one and the same syntactic class, and
> allowed to make predictions about each other.) Up to now I've only looked at
> immediate contexts. That is a commitment of sorts. I don't see why it
> shouldn't be extended though.

I'm not explaining this very well (evidently). I'm trying to say that by
introducing a constraint that says "initially we only consider relations
between adjacent words", you are introducing a syntactic constraint.

Granted, as in a greedy/iterative learning algorithm, it is /possible/
that *if* you make all the correct choices for adjacent words, you
*might* end up being able to consider relations between syntactically
related non-adjacent words. But the problem with greed is that your
success at a high level is entirely dependent on success at local
levels, and if you get it wrong locally then long-distance dependencies
will not be considered.

This is why it is a syntactic constraint: if your algorithm rules out
/any/ analyses at the beginning (like the relation between non-adjacent
words), it is making a syntactic commitment. That your algorithm has
only a loose specification of syntax (adjacency), and that the overall
syntax is an emergent property, does not exempt it from my observation.

We can only see how well it works by comparing it empirically to other
models, though such comparison is a difficult problem in itself! (Again,
Alex Clark, among others, has discussed the problems of evaluation.)

To try to convince you that your commitment to adjacency is a commitment
to syntax, I'd like to consider English constructions where adjacency
doesn't properly account for the relationships. Cross-serial
dependencies in some sentences with "respectively" are an example:

[from Mark Davies' View/BNC interface:]
"71 per cent and 75 per cent in the grammatical and oral examinations
respectively"

Here the relationship is much more than can be captured with head-words
and adjacency, and in fact an adjacency model will not capture the
correct relationships. To correctly interpret the above sentence
requires indexing of the order of (conjunction) participants, and
duplication of the coupling relation ("in" in the above example).
(However you specify the syntax, it must at least capture these relations.)

Cross-serial constructions are a problem for adjacency (as the name
implies): if one of "71 per cent" or "75 per cent" is selected as a
head, then the other cannot be related to "in" or to its partner in the
complement. In other languages the situation will be different because
there may be a different incidence of such constructions.

Under my misinterpreted view of "ad hoc generalisation", I think that
structures like this (and like the lists that John Sowa mentioned way
back in the old thread) are treatable: you said yourself that
uninterpretable constructions may be reanalysed as a syntactic oddity,
which may then be reinforced by other instances of the pattern, until
eventually it becomes an ossified pattern.

Cheers,
D
--
David Brooks
Teaching Instructor
http://www.cs.bham.ac.uk/~djb

Rob Freeman

unread,
Sep 20, 2007, 6:40:19 AM9/20/07
to grammatical-i...@googlegroups.com
I wasn't familiar with ADIOS. Thanks for the tip.

Alignment based learning I am aware of. Rens Bod told me about it in 2000. I had extensive discussions with Menno about it in 2004.

In principle it is good. The problem I have with it is the problem I find common to all current machine learning. I think ADIOS will prove to have the same problem.

As I have said many times, it is the goals of current machine learning which I think are wrong, not their methods.

In a word, current machine learning methods do not accept that the grammars they find must be incomplete.

That said I don't know how Menno has developed ABL in the last three years. I know he also had a post-doc working on it in 2006.

I also had extensive correspondence with his post-doc, and gave him my demo to play with. Still, as I recall, I could not persuade him that it was impossible to learn a complete set of classes to summarize all the distributional information.

That was the key difference, and remains the key difference as far as I know.

-Rob

Rob Freeman

unread,
Sep 20, 2007, 7:21:44 AM9/20/07
to grammatical-i...@googlegroups.com
On 9/20/07, David J Brooks <D.J.B...@cs.bham.ac.uk> wrote:

> It turns out that power is enough to deal with any structural issues by
> itself. All you have to do is iterate selections between immediately
> adjacent tokens.

Right, so you're talking about a greedy learning framework, such as
those described by Iain (ABL-incr is the only instantiation of ABL that
behaves in this way, however). There have been many attempts to model
language learning in this way: ADIOS, ABL, EMILE, the work of J. Gerard
Wolff, Stolcke and Omohundro, and the more ad hoc work of Smith and
Witten or my own thesis. Alex Clark and Dan Klein have both discussed
the problems of such an approach.

Maybe. It would be the word "learning" which I would be afraid of in that summarization. I don't think it is possible to "learn" a complete set of classes to summarize all the information there is in the distributions.

To try to convince you that your commitment to adjacency is a commitment
to syntax, I'd like to consider English constructions where adjacency
doesn't properly account for the relationships. Cross-serial
dependencies in some sentences with "respectively" are an example:

[from Mark Davies' View/BNC interface:]
"71 per cent and 75 per cent in the grammatical and oral examinations
respectively"

Here the relationship is much more than can be captured with head-words
and adjacency, and in fact an adjacency model will not capture the
correct relationships. To correctly interpret the above sentence
requires indexing of the order of (conjunction) participants, and
duplication of the coupling relation ("in" in the above example).
(However you specify the syntax, it must at least capture these relations.)

Cross-serial constructions are a problem for adjacency (as the name
implies): if one of "71 per cent" or "75 per cent" is selected as a
head, then the other cannot be related to "in" or to its partner in the
complement. In other languages the situation will be different because
there may be a different incidence of such constructions.

Firstly let me say I am not stuck on immediate contexts. I think much broader context information could be added, and should be. I chose immediate contexts because I was lazy, and eager to demonstrate the general complexity result.

But I don't like the idea that any syntactic principle will be built into the context information. Context information may be extensive, but I think it should be naive.

Anyway, I think we should use what context information we can, so I'm not going to argue that my immediate contexts are the definitive choice.

Beyond that, I would guess that problems like your cross-serial construction _might_ be handled by this complexity I'm talking about again (or there might be some selection principle which acted on broader contexts).

You say you can't select "71 per cent" or "75 per cent" as the head and attach it alone to either half of the complement, because you need to attach each to the other half too.

The thing is that when I combine tokens, one or the other head need not dominate. What I get is a representation which depends on both. You have enough power to have a separate representation for every possible combination of words.

So by my guess you could get a representation which looked something like "both" in "both" respectively, where each "both" depended on both of its parts in ways which could be pulled apart.

How you would then tease that apart to get the "respective" associations, frankly, I have no idea. But I think the complexity I am talking about will help.

So my point for now begins and ends with complexity: grammatical incompleteness.

What I want to get across is that you have a tremendous amount of power in all the ways you can select patterns among sets of contexts. There is far more power than we have suspected. In particular there is more power than it is possible to summarize in any complete set of classes.

There may be detailed problems in the ways contexts select each other, but for now the big problem I see is that we have no idea of the complexity combinations of contexts give us, and how we could be using it.

By all means extend the information captured in contexts. But the first thing I think we need to do is start looking at these contexts in terms of ad-hoc sets, and stop trying to summarize them in grammars.

-Rob

David J Brooks

unread,
Sep 20, 2007, 8:00:25 AM9/20/07
to grammatical-i...@googlegroups.com
> Maybe. It would be the word "learning" which I would be afraid of in that
> summarization. I don't think it is possible to "learn" a complete set of
> classes to summarize all the information there is in the distributions.

I didn't say anything about a complete set of classes here. Some of
these algorithms do attempt to find complete classes; some of them
explicitly do not (original EMILE and ADIOS); some of them attempt more
greedy merges of classes where enough similarity exists (practical
implementations of EMILE that reduce the constraint of "characteristic
support" to a similarity measure, and my work).

These algorithms primarily learn structure, and not necessarily
classifications.

In my approach, I wouldn't say that there is a predefined idea of
what/how many classes there should be -- they are just formed where
enough similarity is present. I'm not sure you'd accept this as "ad
hoc", though, because my choice of similarity threshold effectively
specifies the number of classes in advance, although not in such a
transparent way. The motivation for this was that there simply were not
enough instances of any observed combination to support further
learning. I think this is a problem for all such approaches, and I
merely made a choice in one particular direction, and your criticisms
are of course valid.

> Firstly let me say I am not stuck on immediate contexts.

In which case your openness to other syntactic relations removes the
syntactic constraint I described. You can see my point, and are aware of
my concern. (It's still hard to do the combinations without defining the
syntax, but I'll roll with it for now...)

> What I get is a representation which depends on both.

Ok, more reasonable. I've seen plenty of theories that don't even allow
this.

> How you would then tease that apart to get the "respective" associations,
> frankly, I have no idea. But I think the complexity I am talking about will
> help.

Agreed. Bolting on a more complex mechanism is a possibility, and the
details are not immediately important.

Rob Freeman

unread,
Sep 20, 2007, 8:39:34 AM9/20/07
to grammatical-i...@googlegroups.com
On 9/20/07, David J Brooks <D.J.B...@cs.bham.ac.uk> wrote:

> Maybe. It would be the word "learning" which I would be afraid of in that
> summarization. I don't think it is possible to "learn" a complete set of
> classes to summarize all the information there is in the distributions.

I didn't say anything about a complete set of classes here. Some of
these algorithms do attempt to find complete classes; some of them
explicitly do not (original EMILE and ADIOS)...


These algorithms primarily learn structure, and not necessarily
classifications.

If they just try to structure text and not learn a set of classes to parametrize that structure, that could be good.

Structure is also ultimately subjective, e.g. you get some branches where it does not matter which you choose as "head", and some will be sensitive to your exact corpus. But if you are letting the full complexity of selection between sets of contexts govern structure, then that is the same as me, and I think you have the complexity you need.

I want to emphasize the complexity angle of it, that it is necessary.

I also want to emphasize that this complexity angle is good. It means you can have a unique set of contexts to represent every combination of words. This has implications both for syntax and meaning.

Though it does mean grammar must always be incomplete.

Let's bring this out into the open and discuss it. What is this telling us? Why is it so? Just what is the complexity of restriction you can get between different sets of contexts in a distributed representation?

I think this is the single most ignored fact in the study of language today. If some work is now using the power of it by default then that is good. But lets talk about it, and raise awareness that it is necessary.

-Rob

David J Brooks

unread,
Sep 20, 2007, 9:27:09 AM9/20/07
to grammatical-i...@googlegroups.com

Rob Freeman wrote:
> On 9/20/07, David J Brooks <D.J.B...@cs.bham.ac.uk> wrote:
>>
>>> Maybe. It would be the word "learning" which I would be afraid of in
>> that
>>> summarization. I don't think it is possible to "learn" a complete set of
>>> classes to summarize all the information there is in the distributions.
>> I didn't say anything about a complete set of classes here. Some of
>> these algorithms do attempt to find complete classes; some of them
>> explicitly do not (original EMILE and ADIOS)...
>>
>> These algorithms primarily learn structure, and not necessarily
>> classifications.
>
>
> If they just try to structure text and not learn a set of classes to
> parametrize that structure, that could be good.
>
> Structure is also ultimately subjective, e.g. you get some branches where it
> does not matter which you choose as "head", and some will be sensitive to
> your exact corpus. But if you are letting the full complexity of selection
> between sets of contexts govern structure, then that is the same as me, and
> I think you have the complexity you need.

Just to clarify: in my work, I found that *without* the generalisation
afforded by fitting to (albeit ad hoc) classification, I couldn't
leverage larger units because there was so little contextual support for
potential units. I resorted to classification, and while I don't view
that as a bad thing per se, it does raise issues of the sort you are
trying to address. I think this is the problem, as Alex said, with naive
learning algorithms...

However, others (ADIOS and EMILE) are closer; certain instantiations of
ABL likewise.

I am quite keen to see a working implementation of your program now. Can
it be run on an Apache web-server on a local machine?

Rob Freeman

unread,
Sep 20, 2007, 9:47:33 AM9/20/07
to grammatical-i...@googlegroups.com
On 9/20/07, David J Brooks <D.J.B...@cs.bham.ac.uk> wrote:

I am quite keen to see a working implementation of your program now. Can
it be run on an Apache web-server on a local machine?

I don't see why not. I run it on my laptop with this nice, install anywhere, perl webserverette: http://www.openpsp.org/.

There's a binary for the engine which should run on any linux machine.

There's a perl wrapper which runs the engine and provides a socket that some cgi scripts talk to. The cgi scripts give you the interface.

There's a data file about 160Mb big which you need. That sits in memory and is what takes all the room.

-Rob




Rob Freeman

unread,
Sep 20, 2007, 10:18:06 PM9/20/07
to grammatical-i...@googlegroups.com
David,


On 9/20/07, David J Brooks <D.J.B...@cs.bham.ac.uk> wrote:

> It turns out that power is enough to deal with any structural issues by
> itself. All you have to do is iterate selections between immediately
> adjacent tokens.

Right, so you're talking about a greedy learning framework, such as
those described by Iain (ABL-incr is the only instantiation of ABL that
behaves in this way, however). There have been many attempts to model
language learning in this way: ADIOS, ABL, EMILE...

I'm having more and more problems with the equivalence you have drawn between my approach and ADIOS and EMILE, as well as with the general characterization as a "a greedy learning framework".

Firstly, I don't think what I am describing is a "greedy" algorithm. "Greedy" algorithms involve making decisions early, right? I don't make decisions early at all.

Yes, I iterate, but the iteration causes a proliferation of possibilities which are selected among when all the information has been added. It doesn't make early decisions.

Secondly, looking more closely at ADIOS, and EMILE. I don't see the parallel, beyond the common basis in distributional analysis. Both strike me as still trying to learn grammars.

Maybe they don't label classes. But they fix clusters and stick with them.

This is a description of ADIOS in an article off the ADIOS homepage:

http://www.scienceblog.com/cms/node/8802

"The algorithm ... can take a body of text, **abstract from it a collection of recurring patterns or rules**..."

Equally, in a paper comparing ABL and EMILE, the abstract specifically says:

"EMILE's learned string language **converges to a string language** from which samples are taken"

By incompleteness I am specifically conjecturing that natural language strings _do not_ converge to a complete string language.

The emphasis is completely different.

I am taking a corpus of fragments, and constantly seeking new ways they can be combined (like a search engine, I let each new sentence try to force new combinations among indexed contexts.) They are seeking to summarize the ways fragments can be combined by generalizing them into a formal language.

It is as if I have an indexed search engine and let people search for different combinations of text fragments limited only by combinations of search keys (syntax). They, by contrast, try to find the results for possible searches first, and store only those which are most common.

The difference is a difference of complexity. But that is the point I am trying to get across.

-Rob

Rich Cooper

unread,
Sep 20, 2007, 11:08:46 PM9/20/07
to grammatical-i...@googlegroups.com

Greedy algorithms defined:

http://en.wikipedia.org/wiki/Greedy_algorithm

 

 

Sincerely,

Rich Cooper

http://www.EnglishLogicKernel.com

 


David Brooks

unread,
Sep 21, 2007, 4:10:18 AM9/21/07
to grammatical-i...@googlegroups.com
Rob,

I'm happy to admit that I don't really understand how your algorithm is
intended to work.

> I'm having more and more problems with the equivalence you have drawn
> between my approach and ADIOS and EMILE, as well as with the general
> characterization as a "a greedy learning framework".
>
> Firstly, I don't think what I am describing is a "greedy" algorithm.
> "Greedy" algorithms involve making decisions early, right? I don't make
> decisions early at all.

So you don't make any commitments? So a phrase like "the man with the
stick" is not treated as a unit of some kind? Then how can it act as
syntactic context as if it were a unit?

Maybe I misunderstood - are you saying that this interpretation is just
*one of an enormous space of possibilities* that is considered in
parallel? Would you therefore also consider "the man with the", "the man
with", "the man * the stick" etc., all at the same time? If so, I don't
see how it is possible to cover all the possible relations.

Perhaps I'm not clear what "iterate" means in this context. I took it to
mean "iteratively committing to current best interpretations", hence I
used the term "greedy".

I interpreted greed because the explosion of alternatives seems so
massive that I assumed you'd have to do something to make processing
tractable. You seem to be saying that it is this enormous space of
possibilities that defines your approach. Coming from an AI background I
can't really see how (especially in 200MB of RAM!), but that's why I
want to learn more about it.

Can you send me (or link off this group) some operational code for your
system so I can try it out on my local server? (And some instructions
for use.)

D

Rob Freeman

unread,
Sep 21, 2007, 5:55:46 AM9/21/07
to grammatical-i...@googlegroups.com
On 9/21/07, David Brooks <D.J.B...@cs.bham.ac.uk> wrote:

...

I interpreted greed because the explosion of alternatives seems so
massive that I assumed you'd have to do something to make processing
tractable. You seem to be saying that it is this enormous space of
possibilities that defines your approach. Coming from an AI background I
can't really see how (especially in 200MB of RAM!), but that's why I
want to learn more about it.

The explosion of possibilities in the parse tree is a problem. There are some optimizations you can make, but yes, at the moment I essentially generate a representation for each possible parse before I select between them.

It is a parallel search problem, so not difficult in theory. But if you can't do parallel search efficiently, yes, it is a problem.

I've sent you an off-line note about code.

-Rob

Iain Downs

unread,
Sep 21, 2007, 6:46:45 AM9/21/07
to grammatical-i...@googlegroups.com

[[Iain]] I'm on David's side here, I can't see how you can perform what amounts to an exhaustive search as the historic data increases.  I assume the 200MB is just for your indexes.

 

I also presume that you DO have some kind of index structure - an exhaustive search over all priors is just never going to work in real life!

 

Can you provide more information on your algorithm?

 

Iain

Rob Freeman

unread,
Sep 21, 2007, 7:51:02 AM9/21/07
to grammatical-i...@googlegroups.com
On 9/21/07, Iain Downs <ia...@idcl.co.uk> wrote:
>
> [[Iain]] I'm on David's side here, I can't see how you can perform what amounts
> to an exhaustive search as the historic data increases. I assume the 200MB is
> just for your indexes.

The 200Mb is just for the indexes, yes.

The search space is big, yes too, because the potential number of
parses goes up fast as sentence length increases. There are some
optimizations you can do which keep the time polynomial, but if you
search over all of them in serial, it does take time.
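For a sense of the growth I mean (assuming, as a simplification, that
the space searched is exactly the set of binary bracketings, as in the
((AB)C) / (A(BC)) picture earlier), the count is the Catalan numbers:

# Number of binary bracketings of an n-word sequence = Catalan(n - 1).
# This assumes the search space is exactly the binary bracketings.

from math import comb  # Python 3.8+

def catalan(k):
    return comb(2 * k, k) // (k + 1)

for n in (3, 5, 10, 20):
    print(n, "words:", catalan(n - 1), "bracketings")
# 3 words: 2, 5 words: 14, 10 words: 4862, 20 words: 1767263190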

The solution would be to do searches in parallel.

> Can you provide more information on your algorithm?

It is not so hard. If Google can do ad-hoc searches across the whole
Web, I can do ad-hoc searches across 40 million words.

I start with words indexed on lists of contexts. Then I have contexts
in these lists select each other according to the words in the search
query.

This is very much like what happens in indexed search now.

A trick that is then unique to me is that I take each observed
combination between contexts in these lists, and expand them out on
the contexts _they_ occur in. So I get a list of contexts representing
the combination of words in the search query, in the order they are
presented. This is also a candidate "constituent", and the list of
contexts is its representation.

That is something search engines don't currently do, which means the
order of search keys does not currently affect results beyond crude
exact phrase matches.

In this system the order of search keys makes a big difference. From
the point of view of language modeling it is the whole game.

So you iteratively combine lists of contexts associated with search
keys in this way, and end up with a list representing the whole
presented sentence, in the order it was presented (and also according
to the order "constituents" were found--the parse structure.)

The size of the list gives you a measure for how well that sequence of
search keys fits combinations of contexts in your data-base.

The actual list looks a lot like it is collected on the meaning of the
search query. Just as search results do now. The only difference is
that the order of the search keys in the query has selected the
contexts too.
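Very roughly, and with an invented toy corpus standing in for the real
40-million-word indexes (the function names are just for this sketch),
the core combination step can be pictured like this. The real thing
also lets contexts which share distributions stand in for each other;
this sketch only matches literal occurrences.

# Crude sketch of the indexing-and-combination step described above:
# a unit is represented by the corpus spans that realise it; combining
# two adjacent units means finding a span of the first immediately
# followed by a span of the second; the contexts of the resulting spans
# represent the combination, and the size of that set scores it.

corpus = "<s> the old dog slept <s> the dog slept <s> a dog barked <s>".split()

def spans_for_word(word):
    """All single-token spans (i, i+1) where `word` occurs."""
    return {(i, i + 1) for i, tok in enumerate(corpus) if tok == word}

def combine(spans_a, spans_b):
    """Spans where a realisation of A is immediately followed by one of B."""
    return {(i, k) for (i, j) in spans_a for (j2, k) in spans_b if j == j2}

def contexts(spans):
    """The (left, right) contexts in which these spans occur."""
    return {(corpus[i - 1], corpus[j]) for (i, j) in spans
            if i > 0 and j < len(corpus)}

the, dog, slept = map(spans_for_word, ["the", "dog", "slept"])
the_dog = combine(the, dog)
print(contexts(the_dog))                          # {('<s>', 'slept')}
print(len(contexts(combine(the_dog, slept))))     # 1 -- support for ((the dog) slept)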

-Rob

David J Brooks

unread,
Sep 21, 2007, 7:57:06 AM9/21/07
to grammatical-i...@googlegroups.com
Rob,

> It is a parallel search problem, so not difficult in theory. But if you
> can't do parallel search efficiently, yes, it is a problem.

Ok, I'm getting a better handle on this.

I've just googled you and found your website
(http://www.chaoticlanguage.com/), containing the quote:

"Essentially we do the same, with the twist that we replace the classes
with vectors or lists of examples."

I think this makes the concept much clearer for me (as do subsequent
descriptions about "infinite numbers of [virtual] rules at runtime").

The implementation you have is presumably one you've restricted only to
vectors of adjacent words (cf. my discussion of syntactic constraints).
This is just so I understand this instantiation, not so I can criticise
the approach in general - I think you've allayed my earlier
concerns, and I just want to be clear before I start thinking harder
about it.

Rob Freeman

unread,
Sep 21, 2007, 8:47:22 AM9/21/07
to grammatical-i...@googlegroups.com

Yes. So far I've restricted my implementation to vectors of adjacent words. More exactly the examples on the website used vectors of "similar words" made by clustering on the same prior and post context. So ABC and ADC => {B,D} because of common context A_C.

That is an old implementation. It is wrong because pre-clustering in this way actually throws away some of the context information (i.e. it should assume B is in a class with D _only_ in the context of A_C.) I should not have clustered so early. This smudged some distinctions the system should have been able to make.
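To show what I mean by the smudging, here is a toy sketch of that
pre-clustering step (the corpus is invented, purely illustrative):

# Toy sketch of the old pre-clustering step: group words that share a
# prior-and-post frame A_C.

from collections import defaultdict

corpus = "a big dog ran a small dog ran a big problem arose".split()

frames = defaultdict(set)
for prev, word, nxt in zip(corpus, corpus[1:], corpus[2:]):
    frames[(prev, nxt)].add(word)

print(frames[("a", "dog")])    # {'big', 'small'} -- shared frame a _ dog
# Merging 'big' and 'small' into one global class on this evidence lets
# 'small' stand in for 'big' in every context, including a _ problem,
# where only 'big' was observed. That is the smudging: the class should
# hold only in the context a _ dog.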

The implementation I'm looking at now deals directly with contexts. This makes the convolutions of the vectors a bit more mind bending to contemplate, but it keeps all the relevant information until it is needed.

-Rob


David J Brooks

unread,
Sep 21, 2007, 8:57:20 AM9/21/07
to grammatical-i...@googlegroups.com
> (i.e. it should assume B is in a class with D _only_ in the context of A_C.)

FYI: My comparison to ADIOS was made on the basis that this is what I
believe ADIOS does. I think the same pertains for EMILE *only* in the
case of "characteristic contexts" (defined by Pieter Adriaans).

I know EMILE has been extended to perform similarity-based
generalisations (in effect making it context-free, at least to some
degree), but I don't know if ADIOS has been extended similarly.

I'm not suggesting that they are doing the same thing necessarily; I
just wanted to clarify the basis for my comparison.

David J Brooks

unread,
Sep 21, 2007, 9:08:23 AM9/21/07
to grammatical-i...@googlegroups.com
> The solution would be to do searches in parallel.

I don't wish to imply that I doubt the method, but the same can be said
for a lot of computationally hard problems, and this is why I (and maybe
also Iain) asked.

> It is not so hard. If Google can do ad-hoc searches across the whole
> Web, I can do ad-hoc searches across 40 million words.

But google doesn't index function words, which I think is an important
difference in terms of effectiveness (but not in terms of problem-scale,
where your observation is quite true).

Two corner-cases that you must be able to argue are:
[1] extremely frequent words. These offer little information about
co-occurrence - in terms of traditional measures like Mutual Information
that allow you to constrain the problem.
[2] hapax legomena (because their information is too limited to
generalise effectively).

Google doesn't do too well with either, AFAIK. The fact that google
works is a product of relations between content words, which select each
other syntactically only as a by-product of common semantic associations.
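To be concrete about corner case [1], here is a toy sketch of the kind
of traditional measure I had in mind; the corpus and counts are
invented, and the point is only that a very frequent word drags down
its association scores almost everywhere:

# Pointwise mutual information (PMI) between adjacent words, as a
# rough stand-in for the Mutual Information measures mentioned above.

from collections import Counter
from math import log2

corpus = "the cat sat on the mat the dog sat on the rug".split()
N = len(corpus)

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def pmi(w1, w2):
    p_xy = bigrams[(w1, w2)] / (N - 1)
    p_x, p_y = unigrams[w1] / N, unigrams[w2] / N
    return log2(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

print(round(pmi("sat", "on"), 2))   # ~2.71: two less frequent words
print(round(pmi("on", "the"), 2))   # ~1.71: the frequent word lowers the score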

I suspect that the effects of lexical ambiguity and "syntactic
incongruence despite contextual congruence" (an abstraction of Alex
Clark's description that is probably inaccurate; Menno van Zaanen
describes this as "dual-level", but I don't know what the correct term
is?) are particularly important in these corner cases.

I'm not saying you can't argue these corner-cases, and I'm sure you've
thought about this before and have an argument prepared. But perhaps I
should think more carefully about the workings before we enter into this
discussion.

Rob Freeman

unread,
Sep 21, 2007, 9:15:43 AM9/21/07
to grammatical-i...@googlegroups.com
On 9/21/07, David J Brooks <D.J.B...@cs.bham.ac.uk> wrote:

> (i.e. it should assume B is in a class with D _only_ in the context of A_C.)

FYI: My comparison to ADIOS was made on the basis that this is what I
believe ADIOS does. I think the same pertains for EMILE *only* in the
case of "characteristic contexts" (defined by Pieter Adriaans).

A lot of people do this. The whole SourceForge Senseclusters project does this.

For me it was a pre-processing stage.

I now think it was a mistake. The system should assume B is in a class with D _only_ in the context of A_C. It is wrong to cluster on contexts early. It means your system starts thinking B should act like D in all contexts, when that is not true. Instead I should index on contexts, and select directly between contexts, not similar words.

-Rob


alexs...@googlemail.com

unread,
Sep 21, 2007, 9:52:33 AM9/21/07
to Grammatical Incompleteness
Yes, this is one of the innovations of the ADIOS system; that they
have a rather ad hoc method for only applying reduction rules in a
context.

Emile's approach is slightly different, but I am not 100% sure.

Rob Freeman

unread,
Sep 21, 2007, 9:54:17 AM9/21/07
to grammatical-i...@googlegroups.com
On 9/21/07, David J Brooks <D.J.B...@cs.bham.ac.uk> wrote:
Google doesn't do too well with either, AFAIK. The fact that google
works is a product of relations between content words, which select each
other syntactically only as a by-product of common semantic associations.

I'm reassured! If you are questioning my algorithm in the same breath as you're questioning Google's, I don't think I need worry too much :-)

Google works pretty well!

Search is already arguably the most commercially successful natural language processing application of all time. And it _is_ NLP. It does better at matching words to meaning than all those rules we play with.

There must be a reason for that. Why has indexed search proven much more practical than anything based on rules? It wasn't the only model. Yahoo started off with hand crafted hierarchies (hence: Yet Another Hierarchical Officious Oracle.)

I think the success of indexed search at modeling meaning is telling us something. We should take the hint and see what it can tell us about the complexity of syntax.

And maybe what we find out about syntax, can tell us how to do indexed search better.

I suspect that the effects of lexical ambiguity and "syntactic incongruence despite contextual congruence" ... are particularly important in these corner cases.

Syntactic incongruence despite contextual congruence is exactly what I am capturing. No-one else does this.

I make it the cornerstone of my method.

It gives us so many more sets to use. All those syntactic incongruences. Each one with its own set. Each set specifying a different collocation, or making a different meaning distinction.

I'm not saying you can't argue these corner-cases, and I'm sure you've
thought about this before and have an argument prepared. But perhaps I
should think more carefully about the workings before we enter into this
discussion.

Some of them I've worked on. Others need work.

But there is a basic idea there which I think is worth looking at.

The basic idea is that all those possible sets you can find, all different possible clusterings of all those odd "syntactic incongruences", give you much more power to describe syntax, and semantics.

Maybe it will all unravel. But right now it looks to me like the power to describe the idiosyncrasy (and incompleteness) of grammar, we've been looking for.

-Rob

David Brooks

unread,
Sep 26, 2007, 9:00:26 AM9/26/07
to grammatical-i...@googlegroups.com
> The example I gave does not use lexical ambiguity -- that should be
> argument enough!

I think there is an important distinction: the phenomenon Alex
highlights is present *even if we managed to resolve all lexical
ambiguity*. (I know we're in a discussion about whether or not that is
necessary/possible, but I think it's especially important for someone
adopting a position like Rob's - if your approach copes with this where
others fail, then it is an advantage you should highlight.)

I guess that Rob conflates the issue because from a set-overlap
perspective, he's saying his system can handle:

A _B_ (term B observed in the context of A)

even where:
[1] instances of B may be syntactically incongruent (B is ambiguous)
[2] A may /occur with/ syntactically incongruent terms.

For [2] to occur, it is possible that A is simply ambiguous, and this
relates to my earlier discussion of contextual ambiguity; however,
ambiguity only accounts for a subset of these observations, not
including Alex's example.

As Alex also pointed out, this is a problem for naive learning
algorithms: if an algorithm that forms classifications successfully
discerns between syntactically incongruent instances of B (on the basis
of the surface form of B), then it is quite reasonable to say it
accounts for the problem. Data-sparsity in the surface form is one
reason why this may not be possible, and it is particularly relevant
when considering raw text (as opposed to POS-tagged texts).

In my view, the arguments distinguishing ambiguity from the phenomenon
Alex described are important as they define capacities of learning
algorithms, even though the approaches to resolving them might be similar.

For example, van Zaanen's thesis, Section 7.3 [p.101-102] suggests that
clustering is a viable approach to resolving the following case (from
ATIS corpus):

show me the _nonstop_ flights
show me the _morning_ flights

He argues that distributional analysis over the words in question would
lead us to think that they are in separate classes. My earlier point
about data-sparsity (especially hapax legomena) is relevant here: if
these are the only occurrences of the words, how do you determine that
they belong to different syntactic types? I can't see how a set-based
view such as that expounded by Rob would resolve this problem, unless
there is much more data about the words in question.

D

David Brooks

unread,
Sep 26, 2007, 9:09:30 AM9/26/07
to grammatical-i...@googlegroups.com
Rob,

can you briefly explain how your demo (the picture on your website)
arrives at constituent bracketings from sets?

cheers,
D
