Constituent bracketing from inconsistent sets

Rob Freeman

Sep 27, 2007, 2:33:05 AM
to grammatical-i...@googlegroups.com
On 9/26/07, David Brooks <d.j.b...@cs.bham.ac.uk> wrote:
>
> Rob,
>
> can you briefly explain how your demo (the picture on your website)
> arrives at constituent bracketings from sets?

The brief explanation is that it probably works in much the same way
grammatical induction does it now.

Aren't ADIOS and EMILE doing this?

The big difference with my model is that I don't assume the sets I
find converge to a single complete grammar.

The only thing that might give you trouble is the way I generate a
unique representation for each intermediate constituent.

Here is how I described this before:

"A trick that is then unique to me is that I take each observed
combination between contexts in these lists, and expand them out on
the contexts _they_ occur in."

So my representation for a candidate intermediate constituent AB, is
the expansion of contexts observed for pairs between the contexts of A
and B.

This gives me a unique representation for AB in terms of contexts. I
can then iterate the combination to generate an intermediate
constituent representation for ((AB)C) etc.

You could say I use the power of my model (lots more ad-hoc sets) to
generate a unique representation for each intermediate constituent
(solving your "recursive ambiguity" problem), and postpone the overall
parse decision until all the information has been folded in.

I don't know what ADIOS or EMILE do for these intermediate constituent
representations, but I find that the intermediate representation I get
for ABC is different according to whether I combine ((AB)C) or
(A(BC)), and that is what gives me a parse.

How do ADIOS and EMILE do this? Do they just list possible
constituents and not bother to disambiguate ((AB)C) or (A(BC))?

Note: the example on my website is in terms of similar words (by
clustered contexts), so is a bit different, but the principle is the
same. In that example a representation for "foreign exchange" is
generated by finding observed pairs between words similar to "foreign"
and "exchange", and expanding a representation for "foreign exchange"
using words similar to that observed pair. So if "currency" is similar
to "foreign", "exchange" is trivially similar to "exchange", and
"currency exchange" is observed, then you can add the representation
for "currency exchange" to the new representation for "foreign
exchange". The combination of representations for all the observed
pairs, highlighted in red on the right, gives you the "NP" list on the
left.

The intuition here is that a combination between similar words will be
similar to their combination (c.f. Dagan et al. "similarity
modeling".)

As I say, I was wrong to generalize context early and use lists of
"similar words" in this way. It should all be done directly in terms
of contexts. But the generation of the representation for a new pair,
from an expansion of representations for observed pairs, is the same.

-Rob

David Brooks

Sep 27, 2007, 5:57:52 AM
to grammatical-i...@googlegroups.com
Rob,

> The brief explanation is that it is probably much in the same way
> grammatical induction does it now.

I don't think there is a universal rule for how it works.

> Aren't ADIOS and EMILE doing this?

ADIOS doesn't. ADIOS creates a graph where all sentences are initially
one path through the graph, then attempts to find subpaths that can be
merged, where the context is the same and there is a paradigm covering
word-sequences that occur in that context. Although this may be
interpreted as a grammar rule, it is strictly context-sensitive. ADIOS
does not, I believe, induce constituent bracketings, at least in part
because patterns are allowed to overlap which creates a problem for
constituent bracketings. In changing the representation (distilling the
graph, as Solan et al. term it) they are clearly seeking a single
representation, which on your terms would make it an attempt to create a
complete grammar.

ABL, EMILE, and other systems (Alex Clark's; Dan Klein's; my own) make
explicit choices about where constituent brackets are inserted, but the
methods used for these decisions are different. They do rely on
substitutability to some degree, though. I can't see substitutability as
being made fundamental in your approach, which is why I asked.

> "A trick that is then unique to me is that I take each observed
> combination between contexts in these lists, and expand them out on
> the contexts _they_ occur in."
>
> So my representation for a candidate intermediate constituent AB, is
> the expansion of contexts observed for pairs between the contexts of A
> and B.

This is ambiguous. Do you mean you look for observed pairs that occur in
the context "A*B", where * is the place at which "a pair" occurs?

Or do you mean AB is a combination of two sets - *B (all things
occurring to the left of B) and A* (all things occurring to the right of
A), or:

{*B}{A*}

Or something different?

If it is as I describe above, you still need to be more explicit about
how you consider combinations, because you are left with a number of
possible interpretations, and each is just a combination of sets (and
indistinguishable as such). For example, in considering ABC, there are
possibilities:

{*B}{A*^*C}{B*}

(Where A*^*C is the intersection of A* and *C). The combinations can
include relating {*B}{A*^*C}, or {A*^*C}{B*}, or {*B}...{B*}, however.
All of these just yield sets, and don't offer an obvious criterion for
how combination is achieved.
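
For concreteness, here is one literal reading of that notation on a tiny
invented corpus (just a sketch of the sets themselves, not of your
procedure):

corpus = [
    "a b c",
    "a d c",
    "x b c",
    "a b y",
]

def left_of(word):
    """*word : everything observed immediately to the left of word."""
    out = set()
    for sent in corpus:
        toks = sent.split()
        out.update(toks[i - 1] for i, t in enumerate(toks)
                   if t == word and i > 0)
    return out

def right_of(word):
    """word* : everything observed immediately to the right of word."""
    out = set()
    for sent in corpus:
        toks = sent.split()
        out.update(toks[i + 1] for i, t in enumerate(toks)
                   if t == word and i + 1 < len(toks))
    return out

print(left_of("b"))                   # {*B}       -> {'a', 'x'}
print(right_of("a") & left_of("c"))   # {A* ^ *C}  -> {'b', 'd'}
print(right_of("b"))                  # {B*}       -> {'c', 'y'}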

What properties of these sets do you look for when proposing a
constituent? What makes one set of sets more likely to represent a
constituent than another?

Can I suggest you make a small artificial corpus of sentences from a
very limited alphabet, and show how you intend it to work by example? It
would be much easier to discuss properly if we could fully understand
what you are intending.

> How do ADIOS and EMILE do this? Do they just list possible
> constituents and not bother to disambiguate ((AB)C) or (A(BC))?

ADIOS is driven by a heuristic towards certain combinations, so only
considers a subset of possible paradigms, but it does not attempt
constituent induction.

EMILE enumerates all possible combinations within a limited form
(splitting sentences into three parts - left context, "expression", and
right context - and attempting to find paradigms covering expressions). I
couldn't really say how this would apply to the above example since it
doesn't fit the scheme, but I would imagine that probability of
combination accounts for the disambiguation.

ABL-incr does something like "Not bother to disambiguate" - it just
accepts the first bracketing it encounters in a very naive way (i.e. the
order in which bracketings are discovered is dependent on the order of
sentences presented, and can be quite wrong). My approach does something
similar, except the order of bracket proposal is determined by
heuristic. These are greedy algorithms because the decisions are
irreversible.

The probabilistic versions of ABL are more reasonable, and consider the
alternatives (but only if they are proposed). First, they explode all
possible interpretations (in an Alignment phase), then combine those
interpretations into possible parses, preferring the most probable
combination (Selection phase). There are issues though, since not all
alternatives are necessarily created during Alignment, and therefore may
not be available to Selection.

In all the approaches described above, concessions are made to the sheer
size of the space of possible combinations. You seem to be suggesting
that this isn't necessary or desirable. The latter I might be inclined
to agree with, but I don't think that even parallelisation would help
overly with the former - it doesn't appear to be an "embarrassingly
parallel" problem, which in itself would only lead to an N-fold
reduction in complexity (for N parallel processors). But that is really
an aside (to which you might care to return at a later point), because
I'm more interested in the choice of constituent structure at present.

Cheers,
D

Rob Freeman

Sep 27, 2007, 8:57:51 AM
to grammatical-i...@googlegroups.com
On 9/27/07, David Brooks <d.j.b...@cs.bham.ac.uk> wrote:
>
> ...I can't see substitutability as being made fundamental in your

> approach, which is why I asked.

Substitutability really means "having a context in common". Which is
also fundamental to my approach.

But substitutability in the case of ABL can mean as few as one context
in common. In my approach I work with sets of contexts. I think these
sets are a better criterion for "substitutability" or grammatical
class, whatever you want to call it. But in the limit these "sets"
could be one context, in which case my approach would reduce to a kind
of ABL (with particularly simple contexts.)

So substitutability is fundamental to my approach.

It is probably also fundamental to my idea of a constituent. (I still
assess my constituents on the basis of the number of common contexts;
it is only how I estimate the contexts that is different.)

The thing is that having extended single contexts to sets, I do things
with these sets. I argue their complexity means we must search for
them ad-hoc, and I combine them to find sets for combinations, etc.

> > "A trick that is then unique to me is that I take each observed
> > combination between contexts in these lists, and expand them out on
> > the contexts _they_ occur in."
> >
> > So my representation for a candidate intermediate constituent AB, is
> > the expansion of contexts observed for pairs between the contexts of A
> > and B.
>
> This is ambiguous. Do you mean you look for observed pairs that occur in
> the context "A*B", where * is the place at which "a pair" occurs?
>
> Or do you mean AB is a combination of two sets - *B (all things
> occurring to the left of B) and A* (all things occurring to the right of
> A), or:
>
> {*B}{A*}
>
> Or something different?

The second one.

"AB is a combination of two sets - *B (all things occurring to the
left of B) and A* (all things occurring to the right of A)"

Except that I don't stop there. I then expand out this combination by
replacing all the observed *BA* with _their_ contexts *(*BA*) and
(*BA*)* (to extend your notational convention.)

This gives me a representation for AB in terms of contexts, and not
just a representation in terms of combinations of contexts, which is
all you have initially on combining *B and A*.
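
If it helps, here is a rough sketch of that expansion on a small invented
corpus (a toy in Python, not my implementation; the sentence-boundary
markers and the final scoring by set size are just illustrative choices,
and I come back to the scoring below):

from itertools import product

corpus = [
    "we saw the big dog today",
    "we saw a big cat today",
    "they fed the small dog yesterday",
    "they fed a small cat yesterday",
    "we saw the cat today",
]
sentences = [("<s> " + s + " </s>").split() for s in corpus]

def contexts_of(seq):
    """All (left, right) contexts the word-sequence seq is observed in."""
    seq, out = list(seq), set()
    for toks in sentences:
        for i in range(1, len(toks) - len(seq)):
            if toks[i:i + len(seq)] == seq:
                out.add((toks[i - 1], toks[i + len(seq)]))
    return out

def word(w):
    """Initial representation of a word: the contexts it occurs in."""
    return contexts_of((w,))

def combine(rep_left, rep_right):
    """Representation for the candidate constituent (left right): things
    seen to the left of right-like material (*B, generalised) stand in
    for the left slot, things seen to the right of left-like material
    (A*, generalised) stand in for the right slot, and every such pair
    that is actually observed contributes _its_ contexts."""
    left_slot = {l for (l, _) in rep_right}
    right_slot = {r for (_, r) in rep_left}
    rep = set()
    for x, y in product(left_slot, right_slot):
        rep |= contexts_of((x, y))   # empty if the pair "x y" is unobserved
    return rep

# Two bracketings of "the big dog"; more pooled observed contexts is
# taken (crudely) as more grammatical.
left_branching = combine(combine(word("the"), word("big")), word("dog"))
right_branching = combine(word("the"), combine(word("big"), word("dog")))
print(len(left_branching), len(right_branching))

On this toy data (the (big dog)) pools five observed contexts against
four for ((the big) dog), so the right-branching bracketing would win,
and that is the kind of difference I mean.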

> If it is as I describe above, you still need to be more explicit about
> how you consider combinations, because you are left with a number of
> possible interpretations, and each is just a combination of sets (and
> indistinguishable as such). For example, in considering ABC, there are
> possibilities:
>
> {*B}{A*^*C}{B*}
>
> (Where A*^*C is the intersection of A* and *C). The combinations can
> include relating {*B}{A*^*C}, or {A*^*C}{B*}, or {*B}...{B*}, however.
> All of these just yield sets, and don't offer an obvious criterion for
> how combination is achieved.

I think the expansion I describe above resolves these possibilities.
By the time I get to ABC, A* and *B have been subsumed into (AB)* etc.
The various possibilities you describe above no longer exist.

The combination *C and (AB)* will be different from the combination
*(BC) and A*.

> What properties of these sets do you look for when proposing a
> constituent? What makes one set of sets more likely to represent a
> constituent than another?

Crudely put, the more observed contexts, the higher I score
grammaticality. Just substitutability again, really. Only the way I
estimate contexts for an intermediate constituent is different.

> Can I suggest you make a small artificial corpus of sentences from a
> very limited alphabet, and show how you intend it to work by example? It
> would be much easier to discuss properly if we could fully understand
> what you are intending.

I know. But it becomes a notational nightmare.

The most concise description I have is due to a colleague who
summarized the successive estimation of constituent representations as
a kind of cross product.

Something like this:

phi_k(t):= sum_{i,j} phi_k(g(i,j))

Where g(i,j) -> {0,...,N} maps an observed pair of elements to their
contexts, and g(i,j)=0 means that no such pair is observed. (So
phi_k(g(i,j)) is the k'th component of a context representation for
the pair of words i followed by j.)

t is a putative constituent, or "tree". phi_k(t) is the k'th component
of the context representation for t, etc.
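
Written out with the ranges explicit, what I mean is roughly

  \phi_k(t) := \sum_{i \in S_L(t), \; j \in S_R(t)} \phi_k(g(i,j)),
  \qquad \phi_k(0) := 0

where S_L(t) and S_R(t) are the stand-in sets for the left and right
halves of t (the *B and A* of the earlier notation), so pairs that are
never observed (g(i,j) = 0) contribute nothing.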

That you can do this all follows from the basic idea of
substitutability (if two constituents have the same contexts they are
the same.) Menno calls it Harris's Principle, or some such.

With this "cross product" I'm reversing it and saying if two
constituents are the same, they will have the same contexts.

And before that, to get constituents which are the same, I am saying
they are the same because they have the same context (they are
combinations formed by matching lists of contexts.)

All this has probably always been possible, and follows from Harris's
Principle. It is the complexity argument which makes it desirable.
Otherwise people would assume all this behaviour could be more
conveniently captured in classes.

It is not difficult, just a bit convoluted. Try describing a
cross-product in words and you will get some idea of what I mean.

> In all the approaches described above, concessions are made to the sheer
> size of the space of possible combinations. You seem to be suggesting
> that this isn't necessary or desirable. The latter I might be inclined

> to agree with...

My argument is indeed for the latter.

If the former turns out to be true we are stuck. But up to now we
haven't even considered the possibility that the possible combinations
are not governed by a single complete grammar.

-Rob

David Brooks

Sep 27, 2007, 9:43:33 AM
to grammatical-i...@googlegroups.com
Rob,

> Substitutability really means "having a context in common". Which is
> also fundamental to my approach.
>
> But substitutability in the case of ABL can mean as few as one context
> in common. In my approach I work with sets of contexts. I think these
> sets are a better criterion for "substitutability" or grammatical
> class, whatever you want to call it.

Right - the difference between a commutation test and proper
distributional analysis. I would agree that we'd always like to work
with distributions because commutation is mostly unreliable. However, I
would also argue the nature of text corpora precludes it, because there
are just so many infrequent events!

I'll pull a reference from later in your email up to here: I also think
that the original criticisms of Zellig Harris' distributional test
revolve around this. Harris said something along the lines of "if two
fragments can occur in all the same contexts, they can be said to have
the same syntactic type", but the trouble is in exhaustively determining
the sets of contexts that they *can* occur in is difficult (and there
wouldn't be a poverty of the stimulus if we had such a set). In
contrast, this is clearly distinct from observing the set of contexts in
which two fragments *do* occur, which can be subject to the
aforementioned poverty. As I see it, this procedure for syntactic
discovery is unworkable because you need a linguist to decide on the set
of all contexts - so they might as well just tell you that two fragments
are the same type!

> So substitutability is fundamental to my approach.

ok.

> The second one.
>
> "AB is a combination of two sets - *B (all things occurring to the
> left of B) and A* (all things occurring to the right of A)"
>
> Except that I don't stop there. I then expand out this combination by
> replacing all the observed *BA* with _their_ contexts *(*BA*) and
> (*BA*)* (to extend your notational convention.)
>
> This gives me a representation for AB in terms of contexts, and not
> just a representation in terms of combinations of contexts, which is
> all you have initially on combining *B and A*.
>

> I think the expansion I describe above resolves these possibilities.
> By the time I get to ABC, A* and *B have been subsumed into (AB)* etc.
> The various possibilities you describe above no longer exist.

Yes, I was asking what the next step was and I think you've described it
here.

> The combination *C and (AB)* will be different from the combination
> *(BC) and A*.

quite, and I was asking how you decide between them for the purpose of
bracketing...

> Crudely put, the more observed contexts, the higher I score
> grammaticality. Just substitutability again, really. Only the way I
> estimate contexts for an intermediate constituent is different.

... and this allows me to make some headway!

> That you can do this all follows from the basic idea of
> substitutability (if two constituents have the same contexts they are
> the same.) Menno calls it Harris's Principle, or some such.
>
> With this "cross product" I'm reversing it and saying if two
> constituents are the same, they will have the same contexts.

See my earlier point. In the limit of course you are correct because
then you can guarantee that constituents of the same type *do* (or do
not) share a context set. In corpus data, however, this is not true
simply because of data-sparseness.

> And before that, to get constituents which are the same, I am saying
> they are the same because they have the same context (they are
> combinations formed by matching lists of contexts.)

Right, a kind of circular definition. Again, in the limit this may be
possible to resolve, but I just can't see how it will work well in
noisy, sparse corpus data. Classifications and context-free
generalisations are a response to this.

You really should look at EMILE, because many of its terms are described
in this way. (The EMILE manual is probably your best bet in terms of
clarity of description, and the comparison paper for ABL is certainly
too short to do the considerable complexities justice.)

In EMILE, "Characteristic types" describe roughly what you are talking
about: all members of the associated characteristic expression (the
paradigm) occur with all members of the associated characteristic
context-set. (I think the ideas are similar though if you look carefully
at how it is applied to sentences it breaks from your idea quite
dramatically.) EMILE was "weakened" to allow for "mostly overlapping"
sets ("primary" and "secondary" types, I believe) because of
data-sparseness.

> All this has probably always been possible, and follows from Harris's
> Principle. It is the complexity argument which makes it desirable.
> Otherwise people would assume all this behaviour could be more
> conveniently captured in classes.

I don't think the classes are a matter of convenience or preference (at
least not in my view) - they are a concession to leverage more
generalisation. It'd be much neater to work with complete sets.

> If the former turns out to be true we are stuck. But up to now we
> haven't even considered the possibility that the possible combinations
> are not governed by a single complete grammar.

This is for my other email...

Cheers,
D

Rob Freeman

Sep 27, 2007, 11:19:23 AM
to grammatical-i...@googlegroups.com
On 9/27/07, David Brooks <d.j.b...@cs.bham.ac.uk> wrote:
>
> ...I would agree that we'd always like to work

> with distributions because commutation is mostly unreliable. However, I
> would also argue the nature of text corpora precludes it, because there
> are just so many infrequent events!

Well, of course I disagree. I think you are talking yourself out of a solution.

Rather than being slain by data sparseness I think distributional
analysis did quite well on its diet of available data. It was slain
instead by Chomsky's observation that it resulted in incoherent and
incomplete representations. Exactly the point I am trying to make.

Alex's "syntactic incongruence" may be Chomsky making the same point
in a different place. Thanks to Alex for the reference.

Sparseness is a problem, but it need not be insuperable.

Anyway, you keep trying to take my argument that the distributions we
_do_ observe are incomplete, and turn it into an argument that we
don't observe enough distributions.

I think the two points deserve to be considered separately.

It may well be that grammatical incompleteness solves some problems
with syntactic idiosyncrasy and ambiguity, and gives us the
motivation, or the inspiration, to go on and figure out where to get
the extra information to solve sparseness.

Perhaps broadening our idea of context, and making a firmer
identification of sets of contexts with "meaning". That might mean
that a single "meaningful" use of a word implies a range of contexts
etc.

There are lots of possibilities. But first let's see where grammatical
incompleteness leads us. No-one has even considered it, while the
other subjects have been thrashed to death with little to show for it.

> > That you can do this all follows from the basic idea of
> > substitutability (if two constituents have the same contexts they are
> > the same.) Menno calls it Harris's Principle, or some such.
> >
> > With this "cross product" I'm reversing it and saying if two
> > constituents are the same, they will have the same contexts.
>

> ...


>
> > And before that, to get constituents which are the same, I am saying
> > they are the same because they have the same context (they are
> > combinations formed by matching lists of contexts.)
>
> Right, a kind of circular definition.

More iterative than circular. The contexts which specify similarity
are not the ones I expand out on the basis of similarity.

But sure, it is iterative estimation, and my errors will grow.
Question is whether the signal grows faster.

My point is just that no-one has bothered to look and see. Everyone
has assumed the signal reduces to a grammar and that they don't need to
look.

> You really should look at EMILE, because many of its terms are described
> in this way. (The EMILE manual is probably your best bet in terms of
> clarity of description, and the comparison paper for ABL is certainly
> too short to do the considerable complexities justice.)
>
> In EMILE, "Characteristic types" describe roughly what you are talking
> about: all members of the associated characteristic expression (the
> paradigm) occur with all members of the associated characteristic
> context-set. (I think the ideas are similar though if you look carefully
> at how it is applied to sentences it breaks from your idea quite
> dramatically.) EMILE was "weakened" to allow for "mostly overlapping"
> sets ("primary" and "secondary" types, I believe) because of
> data-sparseness.

I looked at his thesis. But of course that was too big. It is a book.

My main objection was that on a scan I could not see any mention of
the complexity issue. And in that abstract, brief or not, they
specifically state that they reduce everything to a grammar.

If he contradicts my main premise in the abstract, it is hard to see
how a closer reading is going to change anything.

Perhaps some details of how he forms his classes might be interesting.
But such details are not the point I am trying to make.

> > All this has probably always been possible, and follows from Harris's
> > Principle. It is the complexity argument which makes it desirable.
> > Otherwise people would assume all this behaviour could be more
> > conveniently captured in classes.
>
> I don't think the classes are a matter of convenience or preference (at
> least not in my view) - they are a concession to leverage more
> generalisation. It'd be much neater to work with complete sets.

I think you are finding excuses not to consider the possibility that
these sets don't generalize.

Not generalizing globally means you have more data to work with, not
less. At the moment you have to throw away all the incongruent stuff
because it does not fit.

How does ignoring "syntactic incongruity" help?

-Rob

David Brooks

Sep 27, 2007, 2:13:03 PM
to grammatical-i...@googlegroups.com
Rob,

> More iterative than circular. The contexts which specify similarity
> are not the ones I expand out on the basis of similarity.

yeah, a bad choice of words on my part.

> I looked at his thesis. But of course that was too big. It is a book.

Right, and the EMILE manual is a handful of pages, which is why I
recommended it.

> Perhaps some details of how he forms his classes might be interesting.
> But such details are not the point I am trying to make.

A strictly context-sensitive class is an ad hoc generalisation, as far
as I can see. The same is true in ADIOS. I'm just saying that what you
are suggesting occurs elsewhere in the literature, and I thought this
might be encouraging.

> I think you are finding excuses not to consider the possibility that
> these sets don't generalize.

No, I have said on a few occasions that the generalisation is a
dangerous thing because it often leads to errors. I've also said that I
had reasons to assume it, but that it was against my better judgement. I
am willing to be convinced that it needn't be the case, however, if you
will present the evidence.

> Not generalizing globally means you have more data to work with, not
> less. At the moment you have to throw away all the incongruent stuff
> because it does not fit.

No, you can leave incongruence as an uncertainty to be resolved when
more information is found.

> How does ignoring "syntactic incongruity" help?

On this specific point I would say that when two syntactically
incongruent sequences occur in the same context we would be foolish to
consider only the context. But I doubt very much that you would
disagree, due to the iterative definition above. The natural thing to do
would be to look at the distribution of the observed terms themselves -
we might hope that "an Englishman" or "thin" occurs elsewhere, and
enough times that we might make a decision as to what sort of thing it is.

However, even to take the point a little further: what if I said "I am
zzrbtifins"? Since I just made it up (and it is clearly meaningless) you
could have little prior evidence to judge it by. But you could still
make an educated guess: if most of the time the context surrounds an
adjectival phrase we might tentatively suggest that the new
word has these properties, but I don't see how your approach handles this.

Since corpora are full of very rare events, I am expressing a concern
that this situation is a real difficulty, hence I keep bringing you back
to data-sparseness. I don't think I'm hiding here, I think I'm asking
questions of your approach.

This is not because I think your approach is wrong, it's because I don't
understand it!

D

Rob Freeman

Sep 28, 2007, 3:07:04 AM
to grammatical-i...@googlegroups.com
On 9/28/07, David Brooks <d.j.b...@cs.bham.ac.uk> wrote:
>
> A strictly context-sensitive class is an ad hoc generalisation, as far
> as I can see. The same is true in ADIOS. I'm just saying that what you
> are suggesting occurs elsewhere in the literature, and I thought this
> might be encouraging.

I guess it depends what you mean by "context-sensitive class". All of
these distributionally defined classes are "context sensitive" in the
sense that they are specified using (sets of) contexts. The question
is do they allow sets of contexts to be selected and recombined in the
ways I am suggesting are necessary.

In particular the description that they take a body of text, and
"abstract from it a collection of recurring patterns or rules..."
seems to contradict the complexity argument which is what
differentiates my approach from prior attempts at distributional
classification.

In all this grammatical induction work I see distributional analysis.
I just don't see the complexity argument behind grammatical
incompleteness.

> > Not generalizing globally means you have more data to work with, not
> > less. At the moment you have to throw away all the incongruent stuff
> > because it does not fit.
>
> No, you can leave incongruence as an uncertainty to be resolved when
> more information is found.

Then you must store examples verbatim until you do that.

> > How does ignoring "syntactic incongruity" help?
>
> On this specific point I would say that when two syntactically
> incongruent sequences occur in the same context we would be foolish to
> consider only the context. But I doubt very much that you would
> disagree, due to the iterative definition above. The natural thing to do
> would be to look at the distribution of the observed terms themselves -
> we might hope that "an Englishman" or "thin" occurs elsewhere, and
> enough times that we might make a decision as to what sort of thing it is.
>
> However, even to take the point a little further: what if I said "I am
> zzrbtifins"? Since I just made it up (and it is clearly meaningless) you
> could have little prior instance to judge it by. But you could still
> make an educated guess: if most of the time the context surrounds an
> adjectival phrase we might be able to uncertainly suggest that the new
> word has these properties, but I don't see how your approach handles this.

I don't see why you think I will have a problem with this. On the
contrary, it is handled naturally. Since I represent words using their
contexts, an utterance sharing a context with adjectives will
automatically be seen by the system as adjectival.

For instance the system I described earlier (AB in terms of *BA*)
would evaluate the novel utterance as similar to {"I am hot", "I am
cold", "I am thirsty", ...} This set (or contexts of these
combinations) would then form the putative constituent representation,
and go on to be combined with the rest of the sentence.
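
Roughly, in code (a toy Python sketch; the corpus, and the stipulation
that the new word's only known context is "am _", are invented for
illustration):

corpus = [
    "<s> i am hot </s>",
    "<s> i am cold now </s>",
    "<s> i am thirsty </s>",
    "<s> i am hot today </s>",
]
sentences = [s.split() for s in corpus]

def right_of(w):
    """w* : everything observed immediately to the right of w."""
    return {t[i + 1] for t in sentences
            for i, x in enumerate(t) if x == w and i + 1 < len(t)}

def pair_contexts(x, y):
    """(left, right) contexts of the observed bigram "x y"."""
    out = set()
    for t in sentences:
        for i in range(1, len(t) - 2):
            if t[i] == x and t[i + 1] == y:
                out.add((t[i - 1], t[i + 2]))
    return out

# Novel sentence "i am zzrbtifins": the new word itself contributes
# nothing; its slot is filled by everything observed after "am".
rep = set()
for y in right_of("am"):            # hot, cold, thirsty
    rep |= pair_contexts("am", y)   # contexts of "am hot", "am cold", ...
print(rep)

The representation for "am zzrbtifins" comes entirely from the observed
pairs that can stand in for it, which is what I mean by the novel word
being handled naturally.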

Of course if zzrbtifins occurs in other contexts the expansion *BA*
will be a bit richer. At the moment *B is just one context. If A* is
similarly poor you are a bit stuck. And with particular reference to
the syntactic incongruence argument above it might be difficult to
argue lexical ambiguity for zzrbtifins based on this one use :-) But
certainly a distributional system will squeeze all possible
information from the contexts it is given.

Note also that new uses like this are difficult to handle in a system
based on a-priori generalizations. In an a-priori system, by the time you
hear a new word you've already thrown away the actual contexts words
occurred in. That means it is difficult to know exactly how this new
word might have related to them. Perhaps a word previously assigned to
some other class had a use parallel to the new one, but that
"incongruous" usage was thrown away as not statistically significant.
You have no way of knowing.

Ad-hoc generalization means novel words, and novel uses, are handled
seamlessly, and incrementally.

In general, data-sparseness is a problem for us all (anyone working
from corpora), but I'm not at all convinced it is a greater problem
for ad-hoc generalization. Ad-hoc generalization means you don't have
to say anything about a word until you feel you can. You can wait
until you have more evidence before you say anything about zzrbtifins
at all. The lack of any need to "learn" anything about it beyond the
context we already know means you can just let it be, and let the rest
of the sentence break up around it, if you like.

Truth is, I had problems with data-sparseness with my earlier
implementation when I tried to work with lists of similar words (e.g.
the example on my website.) That was because I tried to state
definitively that words were similar to each other before I performed
the selection and recombination process I use to find a parse. Working
directly with contexts I don't need to make definitive statements of
similarity before I start processing. It may be that data-sparseness
is much easier to handle that way.

Data-sparseness is a problem when you try to make statements early, on
partial evidence, e.g. when you try to learn an a-priori grammar
without exposure to all possible sentences. Ad-hoc generalization lets
you postpone any decisions until they are needed. It may be that any
context which makes a distinction necessary, always provides the
information needed to make it. Sparseness may cease to be a problem if
we cease looking for a-priori grammars. You may always be able to let
information from the rest of the sentence tell you what you need.

-Rob
