The tricky part of clustering is that the dataset
contains tens of thousands of "disjuncts" (in
the above example, "Jp- D*u- Mp+" is one
disjunct; the above example uses three disjuncts).
For the above example, the other 9997 disjuncts
have a value of infinity (since -log 0 = infinity, and
the probability of observing the other disjuncts was
zero).
So the clustering problem is to find clusters in a
ten-thousand-dimensional space, where each
disjunct is an axis/dimension, and 9990 of the
coordinate values are infinite.
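To make the shape of that space concrete, a minimal Python sketch of the word-by-disjunct coordinates might look like the following; the words, disjuncts and counts here are made up purely for illustration:

import math

# Hypothetical observation counts: (word, disjunct) -> number of times seen.
counts = {
    ("gave", "Ss- Os+ MVp+"): 40,
    ("gave", "Ss- Os+"): 10,
    ("sang", "Ss- Os+"): 25,
}

disjuncts = sorted({dj for (_, dj) in counts})   # each disjunct is one axis
words = sorted({w for (w, _) in counts})

def coordinate(word, disjunct):
    """-log p(disjunct|word); infinite when the pair was never observed."""
    total = sum(c for (w, _), c in counts.items() if w == word)
    c = counts.get((word, disjunct), 0)
    return math.inf if c == 0 else -math.log(c / total)

for w in words:
    print(w, [round(coordinate(w, dj), 3) for dj in disjuncts])
# gave [1.609, 0.223]   sang [0.0, inf]  -- the unseen disjunct sits at infinity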
> 1. Having disjunct as an axis
>
> One way is direct case using the constraints as implemented in
> /opencog/nlp/wsd-post/cluster.scm.
Well, the above is a very crude clustering algorithm, and is not
ideal in a large number of ways. It did, however,
demonstrate to me that the concept is sound.
> In this case each word represents a point along the axis. The thing to note
> is two different words which have different disjuncts lie on different axis
> and can never be clustered into a common group since each axis is othogonal
> to each other.
Yes. I thought that this was an important feature!
> For eg:
>
> (gave, "Ss- Os+ MVp+") and (gave, "Ss- Os+ MVp+ MVp+") cannot be in a common
> cluster whereas I think it is valid to have both of them in the same
> cluster.
Why?
1) A priori, there is no reason to believe that these should
be in the same cluster; I don't see why you would want to
lump them together.
2) A posteriori, *after* looking at actual results, one could
decide that, for the case of repeated MV+ links, these
could be collapsed together into a "one or more
MV+ link" class.
The point is that one must be careful with a priori assumptions, as I
think there are counterexamples
involving transitive verbs, direct and indirect objects,
and the like.
> Also (gave, "Ss- Os+ MVp+") and (gave, "Ss- Op+ MVp+") cannot be in a common
> cluster.
Same comments as above. You can say "I sang songs"
but not "I sang her". But "I sang her songs" is OK. I don't
want to start making a priori assumptions about what's
grammatical and what's not: I want to discover this in the
data.
Once a pattern is discovered, the parse rules themselves
can be modified either to strengthen it or to weaken it.
I not only want to learn clusters, I want to learn how to modify
the grammar rules themselves, i.e. to modify what sorts
of linkages a word can take.
> Also how do we know that log_cond_prob condition is a good constraint. Many
> times a word can have many senses and may have many disjuncts possible. In
> such a case, we cannot discard those words and log_cond_prob will not be a
> good constraint to decide in such cases.
I don't understand what you are trying to say here.
> The other way of viewing is that each word represents a point with disjuncts
> as its co-ordinates. But this will be completely wrong since a word may
> represent many senses and clustering words of different senses with another
> word does not make sense.
I don't get it. You say "the other way" but this is exactly
what cluster.scm implemented. So we're somehow
talking past each other; I don't understand what this
"other way" is, nor why it's wrong. (What's wrong with
clustering different senses?) Again, I don't want to
assume that anything is "wrong" -- if clustering lumps
things together, then so be it -- and the scientific question
becomes "why did these things get lumped together?"
> 2. Link as an axis
>
> Each tuple is represented by a vector with its links as its co-ordinates.
> Tuple represents a point and the clustered group represents the region of
> points in the space behaving syntactically similar.
>
> So each point is a vector which looks like this
>
> (Ss-, Ss+, Sp+, Sp-, Os-, Os+, Op-, Op+, MVp+, MVp-, Ox+, Ox-, ..... blah,
> blah)
I don't quite get it: given a disjunct, say, for example,
(Ss- & Os+), then the actual vector has the value
(1,0,0,0,0,1,0,0,0,0,0,0,0,0,... blah)
right? So the coordinate is 1 if the link is present, and 0
if the link is absent, right? Or are you somehow
assigning coordinates in some other way?
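For concreteness, a tiny sketch of that 0/1 encoding, using just the first few links from the list quoted above (and writing the disjunct without the "&"):

# The fixed inventory of links defining the axes (truncated for illustration).
links = ["Ss-", "Ss+", "Sp+", "Sp-", "Os-", "Os+", "Op-", "Op+", "MVp+", "MVp-"]

def link_vector(disjunct):
    """1 if the link appears in the disjunct, 0 if it is absent."""
    present = set(disjunct.split())
    return [1 if link in present else 0 for link in links]

print(link_vector("Ss- Os+"))   # -> [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]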
> or each link can be further divided like this
>
> (S, s, p, -, + , O, s, p, -, +, MV, s, p, -, +, .... .... ..... )
This too needs an example, as I don't understand what
you are trying to do.
> The feature scores can be the log conditional probabilities or any other
> which we need to discuss.
I don't know what you mean by "feature score".
> We can also use methods used in previous works like [1]
I'll read that shortly.
> The difference is that they have (child_word, parent_word,
> link_name) tuples.
I believe I have a table of those as well. No matter, even if
not, it can be generated readily enough.
> But all the above works are based on the assumption that words behaving
> syntactically similar are semantically similar. But in our case this may not
> be true.
Well, I know explicitly that it's not true. The collection of
transitive verbs is large, and they behave in a syntactically similar fashion,
yet are semantically diverse. I am hoping that
clustering will reveal a more fine-grained structure to the
class of all transitive verbs. Just how "semantic" this
structure will be, I don't know.
> We are trying to cluster words based on seeing only its disjuncts
> and nothing more than that. (like using the child_word, parent_word
> inforamtions). What do you think ?.
I have a wide variety of statistical information. I can
generate even more types of statistics fairly easily,
although it takes a lot of CPU time (weeks, typically).
> What does link clusters represent ?
I don't know; the goal is to find out.
--linas
Yes. I like to think of it as a very high-dimensional
vector space; the number of dimensions is equal
to the number of possible disjuncts, which is 5K or 10K
or so. Rather than just thinking in terms of a "point"
on an axis, think instead "vector space" -- the vector
has a projection down to each axis, and the projection
is just some number, e.g. the mutual information, or entropy, or whatever.
> Again, two tuples lying on different axes cannot be clustered into one
> group.
? If you have points in a vector space, then a natural
similarity measure is the Euclidean distance between
the points.
When in doubt, envision everything as being
two-dimensional.
> If this is done, we
> can learn unseen disjuncts for some words.
Yes, exactly!
>> > (gave, "Ss- Os+ MVp+") and (gave, "Ss- Os+ MVp+ MVp+") cannot be in a
>> > common
>> > cluster whereas I think it is valid to have both of them in the same
>> > cluster.
>>
>> Why?
>> 1) a-priori, there is no reason to beleive that these should
>> be in the same cluster, I don't see why you would want to
>> lump them together.
>>
>> 2) a posteriori, *after* looking at actual results, one could
>> descide that, for the case of repeated MV+ links, that
>> these could be collapsed together into a "one or more
>> MV+ link" class.
>>
>> The point is that one must be careful with a-priori assumptions, as I
>> think there are counterexamples,
>> involving transitive verbs, direct and indirect objects,
>> and the like.
>>
>
> Ahh. This makes clustering more complex.
I don't think so: I was trying to assert something simpler:
when looking for patterns in data, it is faulty to assume
that some particular pattern must surely be there, and
then build that assumption into your methodology.
Don't assume. Let the data show you what is true.
> So what you are trying to say is if
> a word 'w' does not appear with a disjunct 'Dj' in our collected statistics,
> then the word 'w' cannot have 'Dj' as its disjunct.
No. If it does not appear, then maybe our sample set
was too small, or was skewed. It might appear (with
low probability) if the data set was larger.
>> > Also (gave, "Ss- Os+ MVp+") and (gave, "Ss- Op+ MVp+") cannot be in a
>> > common
>> > cluster.
>>
>> Same comments as above. You can say "I sang songs"
>> but not "I sang her". But "I sang her songs" is OK. I don't
>> want to start making a-prior assumptions about what's
>> grammatical, and what's not: I want to discover this in the
>> data.
>
> Then what is the use of a cluster. We are not learning anything new from our
> clusters.
? I didn't really understand what you meant by the above,
or what followed.
> If a word has many disjuncts possible, then p(D/w) is distributed over many
> disjuncts such that summation_over_D p(D/w) = 1 making the scores of each
> disjunct low.
Yes.
> Since our source of decision comes from the scores only, we
> may miss out these kind of words.
I don't know what you mean by "miss out". It is possible
that these might not be assigned to the same cluster,
when they should have been. So if we have a word
"asdf" which apears with many many disjuncts, and
another word "pqrs" which appears with many, but
different, disjuncts, then each may be assigned to its own,
unique cluster.
At some point, we print out the list of clusters, and look
at them. We'll see "asdf" in one, and "pqrs" in another,
and we might say "gee, shouldn't these have been in
the same cluster? Why weren't they? Were our similarity
measures incorrect? Our cluster diameters too narrow?
Our dataset too small? Or perhaps these words are
more different than they seem?" I don't know what the
answers to any of these questions are: and more, I don't
even want to assume or guess what the answers might
be: I'd rather run the experiment, look at the data, and
find out!
> Currently, I developed a method of clustering. Once I finish implementing
> it, I will explain it in detail.
Well, rather than implementing it, I'd like you to explain
what it is ... What is the space? How are coordinates assigned? What
is the similarity measure?
Give a simple example, with 2 or 3 or 4 data points, and
show how two are close by, and thus clustered, and how
the 3rd is not.
--linas
Well, you can certainly draw the tree you drew above,
but I don't understand what it means or what it's good
for.
> Also, do you have any distance functions defined between two links.
I don't understand what you mean. Given the word on the
left and the word on the right, I can tell you how often a link occurs.
Or do you mean "the length of a link" -- i.e. how many
words appear in between the two connected words?
If you are asking "how similar is Sp to Os" then I don't
know ... and I can't guess what that means or what it's
good for.
Given a long list of disjuncts, I guess you could ask
questions like "what happens if I replace all occurrences
of Sp+ by Os- ?" but I have some difficulty guessing why
this would be an interesting question.
I guess you could say that a "G" link is kind-of-like a
"GN" link, in a hand-waving kind of way. Perhaps,
due to a broken dictionary, we are incorrectly using
a G link when a GN should have been used. Then it
might make sense to ask "what happens if we lumped
these two together into one".
However, I'd rather see this question formulated as
a hypothesis against the data: if the hypothesis is true,
then there should be some pattern in the data. Then we
look at the data: Is the pattern there, or not?
(Certainly, the dictionary is "broken" in many many
ways, with all sorts of incorrect links being used.
How to find and correct these in a semi-automated
fashion is an open question.)
--linas
I agree.
But what is more important, some disjuncts may occur
far more frequently than others. So for example:
word1 occurs with A+ & B- with p=0.9
word1 occurs with C+ & D- with p=0.1
word2 occurs with A+ & B- with p=0.8
word2 occurs with C+ & D- with p=0.1
word2 occurs with A+ & D- with p=0.1
It is probably reasonable to assign
word 1 and 2 to the same cluster. Note that
word1 has an implicit coordinate at zero:
word1 occurs with A+ & D- with p=0.0
which is not all that far away from word2.
If we use Euclidean distance, these are
sqrt ((0.9-0.8)^2 + (0.1-0.1)^2 + (0.1)^2) = 0.1414
Compare now word3:
word3 occurs with A+ & B- with p=0.15
word3 occurs with C+ & D- with p=0.8
word3 occurs with A+ & D- with p=0.05
It is probably incorrect to find word3 in the
same cluster as words 1 & 2.
The distance is greater:
the Euclidean distance between word3 and
the center of the 1-2 cluster is
sqrt((0.85-0.15)^2 + (0.8-0.1)^2 + 0^2) = 0.99
Instead of using p, it is interesting to think about
using -ln p as the coordinate value. Instead of using
Euclidean distance, other distances (or divergences)
might be reasonable.
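The arithmetic above is easy to check in a few lines of Python, with the coordinates taken straight from the example:

import math

# Coordinates along the axes (A+ & B-), (C+ & D-), (A+ & D-).
word1 = (0.9, 0.1, 0.0)
word2 = (0.8, 0.1, 0.1)
word3 = (0.15, 0.8, 0.05)

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

center12 = tuple((x + y) / 2 for x, y in zip(word1, word2))

print(round(dist(word1, word2), 4))     # 0.1414 -- close, same cluster
print(round(dist(word3, center12), 4))  # 0.9899 -- far, its own cluster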
--linas
Cluto is not; it is proprietary. See e.g.
http://glaros.dtc.umn.edu/gkhome/node/332
I think that rules it out for general use.
Weka might be the way to go.
There is a long list of open-source clustering s/w at
http://mloss.org/software/tags/clustering/
I have not read through this list myself.
> Is it
> possible for a word to go to multiple clusters.
Not usually, but that depends on the clustering algorithm.
Most clustering algorithms try to assign any given item to
only one cluster.
> Clustering Algorithms:
> 1. Direct (already implemented by you)
Badly...
> 2. K-means clustering.
I am concerned that the K-means algo is very slow, and will be overwhelmed.
> 3. agglomerative (bottom up clustering)
> 4. repeated bisections (top down clustering)
> 5. EM Algorithm (I will do this only after completing above algorithms)
OK.
> I am still afraid of clustering whenever I think of word senses.
The initial attempt should be to cluster words by grammatical
properties, and not word-senses. The word-senses problem
is considerably harder.
> I feel it
> would be better to cluster word senses rather than the words.
Again, this is harder. For example, pick some common verb
that can be used in a transitive and intransitive way. For example:
give -- see http://www.thefreedictionary.com/give
Notice it has sections for v.tr and v.intr and the definitions in each
section are distinct.
Now, link-grammar does not by itself distinguish between these
senses: it just has one entry for "give", and provides parsing
rules that allow it to be used in a transitive or intransitive fashion.
Likewise, the statistical tables that I have list the frequency of
how often this word was used in different syntactical constructions.
How should we "split apart" these frequency counts, so as to
assign them to distinct senses?
One might try to do this a priori, e.g. tag them with WordNet senses
using some other, entirely different algorithm (e.g. one of the
usual WSD algorithms).
Alternately, one might try to look at how some synonymous words
were used - e.g. synonyms for "to make a present of, to deliver,
to hand over" these synonyms will account for *some* of the
observed syntactic usage of "give", but not all. What is left over
is presumably senses like "to collapse from force or pressure".
However, doing this kind of data-mining is likely to be quite
difficult, since I suspect the dataset is quite noisy, and it will
be hard to pick these apart. This is why it's important to get
the simpler cases well-understood and working.
> How are we going
> to evaluate the clusters formed.
At this point, "seat of the pants".
> One naive way of evaluation would be using
> the already existing dict files in link parser.
Well, my intent was to turn this process on its head: I'm hoping
to find that the results of clustering will be *more* accurate than
the lg dicts, and I wanted to use those to make corrections
and fixes to LG.
> Also how are we going to use the clusters ? One of the use is that since we
> have same set of disjuncts for all the words in cluster it will be possible
> to learn unseen disjuncts for a word. But we need to be careful since we do
> not know if this assumption is correct.
Yes, we'd have to feel our way around and see what happens.
--linas
Hi Linas,
From the dump you gave me, I got 379442 distinct disjuncts. They cannot all be valid disjuncts. Some of the disjuncts are very long. Since we represent a word as a vector of disjuncts, we need to choose certain disjuncts from among all of these.
I am using the following criterion for choosing a disjunct to be a dimension. I sorted disjuncts based on their word frequency, i.e. how many distinct words the disjunct occurred with. Shall I select the top 10000 disjuncts in the sorted list to be the dimensions? Is this ok? I am attaching the disjunct frequency file. The first and second columns represent (disjunct, word_frequency) respectively.
I generated one more frequency list by generalizing disjuncts, removing the additional attributes of each link. For eg: link Ss- is changed to S-. I am also attaching this file. The number of distinct disjuncts is 67982, which is far fewer than in the previous list. I think it would be better if we select a disjunct based on its frequency. Please see the list; can you comment on this? Clustering using these disjuncts as vector co-ordinates would help us in designing a hierarchical clustering algorithm. Also the dimensional space would be small.
Next steps:
---------------
1. Generating words as vectors
2. Applying clustering (Is there any way of guessing the number of clusters to be formed?)
I am using Python. I generated the attached files using it. Database connectivity in Python is very easy. Also it has all the functions present in Perl, and I am very comfortable using Python.
Thanks,
Siva
> The main problem with many of the existing clustering algorithms is that we
> should know the number of clusters a priori. Is there any way to know this ??
>
> I found that if we know the number of clusters to be formed, K-means
> performance is better than other clustering algorithms.
I can provide an approximate answer:
link-grammar currently has about 71 "large" clusters
and another 1500 or so "small" clusters, where
"large" means 30 or 50 or more members, and
"small" is in the 1-50 range. I get these numbers
by counting the number of semi-colons in
data/en/4.0.dict, and for the "large" clusters, the
number of files in the "data/en/words" directory.
So, for initial runs, try K=1500 or K=2000 and see
what happens.
My long-term goal is to refine this into even more, smaller,
tighter clusters.
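(For what it's worth, those two counts can be recomputed mechanically from a link-grammar source tree; a rough sketch, using the paths mentioned above:)

import os

# One ';' terminates each entry in the dictionary; one file per large word class.
dict_text = open("data/en/4.0.dict").read()
small_clusters = dict_text.count(";")
large_clusters = len(os.listdir("data/en/words"))

print(small_clusters, large_clusters)   # roughly 1500-ish and roughly 71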
--linas
2009/7/5 Siva Reddy <gvs....@gmail.com>:
> From the dump you gave me, I got 379442 distinct disjuncts. All these cannot
> be valid disjuncts.
? Well, they were all generated by the parser, and so they
are "valid" in that sense. The parse itself may have been
invalid, but that's a different issue.
> Some of the disjuncts are very long. Since we represent
> word as a vector of disjuncts, we need to choose certain disjuncts among all
> these disjuncts.
Why? What's wrong with using all of them?
> I am using the following criteria for choosing a disjunct to be a dimension.
> I sorted disjuncts based on their word frequency i.e with how many distinct
> words did the disjunct occur.
There's an alternate way of counting, which is different,
and maybe more correct, and that is to count how often
a disjunct occurred in real text, instead of how often it was
recorded in the file. i.e. instead of adding +1 for each line
in the file, add +N where N is the count value in the table.
N is the actual number of times that disjunct-word-pair was
seen in real text.
> Shall I select top 10000 disjuncts in the
> sorted list to be the dimensions. Is this ok ?
Sure, but realize that this cutoff is purely arbitrary.
I'm attaching a graph which shows that the frequency
of disjuncts is very nearly scale-free -- I plotted your
raw data against Zipf's law on a log-log graph. It's
well-known that scale-free data simply doesn't have
any natural cutoff. (Scale-free data obeying Zipf's law
is seen everywhere, from sand-piles and shoreline
coast lengths and critical exponents in physics,
to the sizes of cities, the number of hits on web sites,
and the number of citations an academic paper gets).
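That graph takes only a few lines of matplotlib to redo; the file name, and the assumption of one "disjunct count" pair per line, are guesses about how the raw data was saved:

import matplotlib.pyplot as plt

counts = []
for line in open("disjunct-counts.txt"):   # "disjunct  count" per line (format assumed)
    fields = line.split()
    if fields:
        counts.append(float(fields[-1]))
counts.sort(reverse=True)

ranks = range(1, len(counts) + 1)
zipf = [counts[0] / r for r in ranks]      # Zipf's law: frequency ~ 1/rank

plt.loglog(ranks, counts, label="observed")
plt.loglog(ranks, zipf, label="Zipf's law")
plt.xlabel("rank"); plt.ylabel("count"); plt.legend()
plt.savefig("disjunct-zipf.png")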
> I generated one more frequency list by generalizing distincts by removing
> additional attributes of each link. For eg: link Ss- is changed to S-.
I don't like this. It's losing important data.
A better idea would be to collapse identical links,
but only if they occur next to each-other. So, for
example:
MX- Xd- G- MX+ MX+ MX+ Xca+
can be converted into
MX- Xd- G- @MX+ Xca+
so that @MX+ means "zero or more occurrences of MX+"
However, even here, care should be taken: the above
can only be safely done if the actual data contains
MX- Xd- G- MX+ MX+ MX+ Xca+
MX- Xd- G- MX+ MX+ Xca+
MX- Xd- G- MX+ Xca+
MX- Xd- G- Xca+
which justifies collapsing this into one.
Doing this would be nice, but is not strictly necessary
for the first round of data analysis.
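The string-rewriting half of this is only a few lines; a sketch that collapses a run of three or more identical adjacent connectors (the check that the shorter variants also occur in the data, as described above, is left out here):

from itertools import groupby

def collapse_repeats(disjunct, min_run=3):
    """Collapse runs of >= min_run identical adjacent connectors into @connector."""
    out = []
    for connector, run in groupby(disjunct.split()):
        n = len(list(run))
        if n >= min_run:
            out.append("@" + connector)    # e.g. MX+ MX+ MX+ -> @MX+
        else:
            out.extend([connector] * n)    # shorter runs are left untouched
    return " ".join(out)

print(collapse_repeats("MX- Xd- G- MX+ MX+ MX+ Xca+"))
# -> MX- Xd- G- @MX+ Xca+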
> select a disjunct based on its frequency.
Yes, reducing the number of disjuncts considered will
make it easier to work with the data. As the scale-free
graph shows, it's reasonable to put the cutoff just about anywhere,
although I expect better accuracy by using
more data.
--linas
2009/7/5 Siva Reddy <gvs....@gmail.com>:
>
> I sorted disjuncts based on their word frequency
The most frequent one was blank. What does that mean?
--linas
Yes, that's exactly it. Thanks.
You should leave these out during clustering, they'll
mess up the data.
--linas
p.s. I blogged the frequency counts at
http://opencog.wordpress.com/2009/07/06/frequency-of-grammatical-disjuncts/
Actually, the above link is slightly out of date,
there have been some minor modifications and
revisions. The current copy is located at:
http://www.abisource.com/projects/link-grammar/dict/index.html
--linas
2009/7/6 Siva Reddy <gvs....@gmail.com>:
> Attached files are the collected statistics of disjuncts. Please have a
> view. Column 1 represents the disjunct, Column 2 represents the number of
> distinct words which occurred with this disjunct, Column 3 represents the
> total count of this disjunct with all the words.
The more I look at this, the more I dislike column 2.
It's clear that G+ and AN+ are vastly over-represented.
This is in part because the corpus is wikipedia, and
every new article introduces a new given name of
one sort or another, frequently a foreign word at that.
A large portion of wikipedia articles are biographies,
and this shows in the frequency of the G and AN links.
(that is, the database has a large number of unique,
distinct names in it, and these names connect with G
and AN links. Because there are a lot of names, they
contribute to your counts. By contrast, words like "the"
"a" "of" "in" "there", etc. are strongly under-represented
in your counts -- but would show up in the column-3 counts.)
By contrast, compare the counts of MVp and MVa links;
these connect verbs to various things -- you can see how
much more popular they are when you consider the
floating-point frequency in column 3, which is what should
be used for ranking.
> Type0: Statistics are collected without modifying disjuncts
>
> Type1: Statistics are collected by removing additional attributes of the
> disjunct links. For eg: Sp- Os+ MVp+ is changed to S- O+ MV+
I really dislike this; it loses important information.
To trim down the size of the dataset, it would be better
to eliminate everything which contains a capital letter,
number or punctuation. A reasonable first start would
be to accept words that consist entirely of ASCII lower-case,
or dot, or dash (dot and dash are used in word subscripts).
That should drastically reduce the number of words.
Another simplification is to ignore all disjuncts that
contain ID*** connectors. There are thousands of these,
they are used only for idiomatic phrases, and so by
definition will always sit in a cluster by themselves.
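Both filters are a few lines of Python; in this sketch, the idiom connector in the last example is made up, and the exact ID naming convention should be checked against the dict:

import re

lowercase_word = re.compile(r"^[a-z.-]+$")   # ASCII lower-case, dot, dash only

def keep(word, disjunct):
    """Drop capitalized/numeric/punctuated 'words' and idiom-only disjuncts."""
    if not lowercase_word.match(word):
        return False
    if any(conn.startswith("ID") for conn in disjunct.split()):
        return False
    return True

print(keep("abducted.v", "Pv- MVp+"))   # True
print(keep("Smith", "G- AN+"))          # False: capital letter
print(keep("kick", "IDXY+"))            # False: idiom (ID) connector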
-- linas
2009/7/6 Siva Reddy <gvs....@gmail.com>:
> count(disjunct_with_this_word)/actual_count(disjunct_with_this_word)
>
> where actual_count is the number of times the disjunct occurred with this
> word (This is obtained by adding +1 everytime rather than adding parse score
> every time)
>
> If it is less than a threshold, we can eliminate that entry but we don't
> have the actual counts in the current tables.
The parse-ranker itself is not perfect: it may well be giving
a low score to correct parses, and a high score to
incorrect ones. It's dangerous to second-guess this stuff
too much.
>> A better idea would be to collapse identical links,
>> but only if they occur next to each-other. So, for
>> example:
>>
>> MX- Xd- G- MX+ MX+ MX+ Xca+
>>
>> can be converted into
>>
>> MX- Xd- G- @MX+ Xca+
>>
>> so that @MX+ means "zero or more occurrences of MX+"
>>
>> However, even here, care should be taken: the above
>> can only be safely done if the actual data contains
>>
>> MX- Xd- G- MX+ MX+ MX+ Xca+
>> MX- Xd- G- MX+ MX+ Xca+
>> MX- Xd- G- MX+ Xca+
>> MX- Xd- G- Xca+
>>
>> which justifies collapsing this into one.
>>
>> Doing this would be nice, but is not strictly necessary
>> for the first round of data analysis.
>
> Why can't we write it as @MX- @Xd- @G- @Xca+ . I think even this is valid.
No, because such a rule allows grammatically incorrect sentences: X
connects words to punctuation, so the
above would allow a sentence ,,,,,,,,, with lots of ;;;;;
bogus punctuation.
By contrast, @MX and @A occur frequently in the dict.
--linas
>> so that @MX+ means "zero or more occurrences of MX+"
>
> I rather took it as one or more instances.
Yes, right.
> S-
> S- O+
> S- O+ O+
>
> If we generalize the above disjuncts to S- @O+, then S- is also covered by
> this disjunct. Intransitive verbs have S- as their disjunct and Transitive
> verbs have S- @O+ as thier disjunct. I think combing Transitive and
> Intrasitive verbs may not be a good idea. Several such cases arise if we use
> @ for 0 or more.
Yes, it should be for "one or more". Also, we should
probably use it only if three or more links were actually
observed. This is because I don't believe that there are
real-life @O+ linkages. (there can't ever be three objects)
> I am attaching the updated disjuncts files. In the generalized disjuncts
> file, @A represents { A, A A , A A A, ... N times A} where maximum value of
> N is atleast 2. Is this ok or shall I change it to three.
It should probably be 3, per above.
--linas
> I am attaching the updated disjuncts files.
Much better, thank you!
Now the top entries make sense:
Ds+ 19 950275.635843
Xp- 1 838569.90527
A+ 6040 616522.664867
AN+ 7904 566658.997313
MVp- Js+ 63 563082.649325
MVp- Jp+ 63 446487.310222
Ds+ connects the determiner "the" to nouns: and of course,
"the" is the most frequent word in the English language.
Xp- connects the period at the end of the sentence to the
start of the sentence, so of course it's frequently observed.
A+ connects adjectives to nouns, AN+ connects
noun modifiers.
MV connects verbs to modifying phrases, and J
connects prepositions to objects, so that MV- J+ is
the disjunct that most prepositions will get.
The rank-ordering of these disjuncts is much less
of a surprise.
Attached is a graph. As can be seen, it's got a much
sharper falloff, and it's even less of a straight line ...
It's less Zipfian, less scale-free. I can't explain these
details.
--linas
2009/7/7 Siva Reddy <gvs....@gmail.com>:
>> The rank-ordering of these disjuncts is much less
>> of a surprise.
>> Attached is a graph. As can be seen, it's got a much
>> sharper falloff, and it's even less of a straight line ...
>> It's less Zipfian, less scale-free. I can't explain these
>> details.
>
> This may be because of our decision to cut words other than [a-z.-]* . The
> rank positions in the new file may not be their actual positions. But I
> don't know. Just a guess.
Hmm. I doubt it; but just in case .. can you send me data
that does include all words, but skips the idiom (ID*) links?
--linas
> Now coming to feature selection part, I am going to use the log cond
> probability of disjuncts as features (dimensions).
Just keep in mind that log_cond_prob = infinity
for any disjuncts that are missing. I.e. when p=0 then
-log p = infinity, and so if you use this as the coordinate,
you have to set the value of missing disjuncts to either
infinity or some large number (e.g. 1000) and NOT to zero!
It might be better to use the entropy S, which is defined
as
S = -p log p
where log_cond_prob = -log p
This is because entropy plays a special role in information
theory; it maximizes the uncertainty of unknown relations.
I should probably take the time to make a stronger
argument as to why entropy would be best for clustering.
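For comparison, the two candidate coordinates side by side; the point is that at p=0 the -log p coordinate blows up to infinity, while the entropy coordinate goes smoothly to zero:

import math

def neg_log_p(p):
    """log_cond_prob coordinate: -log p (infinite for an unseen disjunct)."""
    return math.inf if p == 0.0 else -math.log(p)

def entropy(p):
    """Entropy coordinate: S = -p log p (zero for an unseen disjunct)."""
    return 0.0 if p == 0.0 else -p * math.log(p)

for p in (0.8, 0.1, 0.0):
    print(p, neg_log_p(p), entropy(p))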
> For the generalized disjuncts we generated, log_cond_prob_scores need to be
> computed. To compute them, shall I use only the counts of our refined
> disjunct lists
Yes. Add the probabilities together.
Since l.c.p. = -log p, first undo the log, then add,
then redo the log.
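In code, that merge looks something like this; the two input scores are just illustrative -log p values for two disjuncts being generalized into one:

import math

def combine_log_cond_probs(lcps):
    """Merge several -log p scores: undo the log, add the probabilities, redo the log."""
    total_p = sum(math.exp(-lcp) for lcp in lcps)
    return -math.log(total_p)

print(combine_log_cond_probs([2.579, 3.827]))   # ~2.33, i.e. a higher combined probability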
>or shall I include the ones which are excluded too (other
> than [a-z.-]* )
Not needed, we can leave the total denormalized
without any harm.
> Coming to the number of disjuncts(features) that need to be chosen, when we
> observe the disjunct frequency files, we observe that more than 3/4 of the
> disjuncts in the list have very low scores. I think these disjuncts are very
> word specific.
Well, yes, that's the point!
> Including them or excluding them will not affect large sets
> of words.
The large sets are arguably the most boring sets.
But I guess we'll find out.
> For the time being I will use 4000, 6000, 8000 and 10000 disjuncts
> as different feature sets and perfom clustering. And then we can analyze the
> effect of number of disjuncts chosen on the clustering.
OK.
> Shall I use part of speech category as a dimension too.
No.
The correct way of thinking of a disjunct is as a kind
of highly-detailed part-of-speech. Thus,
MVp- Js+ and MVp- Jp+ both identify a word as a
preposition, .. and more: they distinguish prepositions
that take singular vs. plural objects.
In short, disjuncts already encode part-of-speech
information.
> Many clustering
> tools allow strings to be features (apart from numerical values).
Err? Yes, the disjuncts are the features. Not sure what
you mean here. Each feature is assigned a numerical
value, and that numerical value is the location of the
item on the axis defined by that feature ...
My gut instinct is that we should probably use the
entropy as the coordinate, I'll try to think of a more
rational argument.
> For stop
> words, I will have a new tag.
?? Why are stop words important to think about ??
> For unknown words, I will use unknown tag
?? what's an unknown word?
--linas
Much better. The first file made no difference at all, that I could
make out. The second flattens out the line very nicely! I like it!
Now, if I could just understand the shape ...
--linas
2009/7/7 Linas Vepstas <linasv...@gmail.com>:
> I am attaching some statistics of our data that needs to be clustered. I
> collected the frequency of words having one, two, three .. N disjuncts.
Ah, well, this does present an alternate way of shrinking the dataset
to a size that is manageable by existing tools.
For the case of words having only one disjunct, clustering is almost
trivial. Could you create a list of the words and the clusters they're
in?
To get fancy, you would also consider those words with multiple
disjuncts, but for which the probability of one of the disjuncts
exceeds, say, 95%.
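A quick sketch of that trivial first pass, grouping words by a single dominant disjunct; the probability table below is made up, and 0.95 is the cutoff suggested above:

from collections import defaultdict

# word -> {disjunct: p(disjunct|word)}  (hypothetical values)
profiles = {
    "abject.a":   {"@A+": 1.0},
    "abnormal.a": {"@A+": 0.97, "Pa-": 0.03},
    "doubts.n":   {"Sp+": 0.5, "Op-": 0.3, "Jp-": 0.2},
}

clusters = defaultdict(list)
leftover = []
for word, djs in profiles.items():
    disjunct, p = max(djs.items(), key=lambda kv: kv[1])
    if p >= 0.95:
        clusters[disjunct].append(word)   # one disjunct dominates: trivial cluster
    else:
        leftover.append(word)             # needs real clustering later

print(dict(clusters))   # {'@A+': ['abject.a', 'abnormal.a']}
print(leftover)         # ['doubts.n']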
> What I am trying to say from the observed statistics is that from our
> feature dimensionality of 10000,
I think it is a mistake to limit the number of feature dimensions.
> only a small fraction of dimensions are
> used for each word.
Yes. As your data shows, most words occur with just
a handful of disjuncts.
Because of this, a simple clustering algo does suggest itself,
which would just classify words via simple lookup.
--linas
It's actually more like 50K or 100K words.
> 10K*4 =40K for 4 features per
> word
Well, maybe a max of ten features per word. But yes, sparse
in this way, which indeed should make it far more tractable.
> Maybe it would help to review what is being
> classified and what it is being applied to. I missed that part of this
> conversation. What is being classified into what classes?
The goal is to improve the link-grammar dictionary in several ways:
-- (automatically) assign words to narrower, "tighter" categories
-- (automatically) loosen parse rules to allow parsing of a broader
range of text
-- (automatically) learn new words
-- progressively iterate on the above three steps.
Clustering addresses the first and the third points, although right now,
it will be "manual" not "automatic". I'll have to do some hacking to
enable the second point.
Linas.
> Simcluster is able to handle our data if the number of words given for
> clustering is in the range of 10000. So what I did is removed the words with
> exactly one disjunct.
Yes, good.
> These words can be put into clusters to be formed
> from the remaining words at later stages. Also it is found that majority of
> the words fall in two or three clusters.
I'd be interested in knowing what these are. Can you send a copy?
They don't have to be fully populated, I'd just like to see what they
look like.
The number and size of these clusters depend on the
cluster radius ...
> My current next goal is to refine the word list so that clustering is
> possible using the above procedure. Once clusters are formed, some of the
> words in each custer are choosen along with the left out words in the
> previous step and agian clustering is done to determine the clusters to
> which left out words went.
Why is this necessary? Just to keep computational times lower?
> Papers read by me:
You should be focusing on getting results at this point; time
is running out.
> Clusters will be ready in two/three days.
I'd like to review partial results, if any, sooner than that.
> Our next step is to determine the
> corresponding disjuncts to the clusters formed.
? I assume that the output of the clustering program includes
some average or centroid position of the cluster, right?
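Assuming it does (or that we recompute the centroids ourselves), the disjuncts of a cluster could be read off the centroid by keeping the axes with the largest average weight; a rough sketch, with made-up member vectors and an arbitrary cutoff:

import numpy as np

axes = ["Ss- Os+", "Ss- Os+ MVp+", "Pv-"]   # the disjunct dimensions
members = {                                  # word -> coordinate vector (illustrative)
    "gave": np.array([0.5, 0.4, 0.1]),
    "took": np.array([0.6, 0.3, 0.1]),
}

centroid = np.mean(list(members.values()), axis=0)
cluster_disjuncts = [dj for dj, w in zip(axes, centroid) if w >= 0.2]

print(centroid)            # [0.55 0.35 0.1 ]
print(cluster_disjuncts)   # ['Ss- Os+', 'Ss- Os+ MVp+']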
--linas
2009/7/24 Siva Reddy <gvs....@gmail.com>:
>> > Simcluster is able to handle our data if the number of words given for
>> > clustering is in the range of 10000. So what I did is removed the words
>> > with
>> > exacltly one disjunct.
>>
>> Yes, good.
>
> Words having only one disjunct are under the heading "Number of Main
> Disjuncts --> 1" in the attached file.
I'm confused. The file starts with:
Number of Main Disjuncts --> 1
Main Disjuncts --> '@A+'
Actual Number of disjuncts --> 1
a.m.
abducted.v
abject.a
abnormal.a
abolitionists.n
but a.m. has 25 different disjuncts associated with it, and
abducted.v has 80, etc. For example:
inflected_word | disjunct | log_cond_probability
----------------+----------------------------------+----------------------
abducted.v | Pv- | 1.82876319641111
abducted.v | Pv- MVp+ | 2.57902834757274
abducted.v | A+ | 3.36777993354544
abducted.v | Pv- MVp+ MVp+ | 3.82672257304021
abducted.v | E- A+ | 4.95246679079432
abducted.v | Ss- Os+ | 5.40579043433208
abducted.v | Pv- MVx+ | 5.47621429250552
abducted.v | Pv- MVp+ MVi+ | 5.54739398423103
abducted.v | Pv- MVs+ | 5.81114387203035
abducted.v | Pv- E- | 5.95107333357787
abducted.v | Ss- E- O+ | 5.99225171310795
etc.
so I'm unclear as to what your list is illustrating.
--linas
2009/7/28 Siva Reddy <gvs....@gmail.com>:
> I am planning to commit my programs in opencog bzr repo. Is this ok ?
You should create your own branch.
It makes more sense to associate that branch either with RelEx
or better yet, LexAt, which is where the other statistical work is.
https://launchpad.net/relex-statistical
Just create a new branch there.
--linas
aka a "dendrogram" .. A casual look seems to indicate its
a binary tree .. is that so?
> is not able to generate an
> image (reason might be due to large number of nodes in the tree which makes
> the size of image very large).
Yes, this is a recurring problem; none of the open-source visualization
tools seem to be able to handle large datasets (even when they claim
that they can.) We've had the same experience with when using
various different packages to visualize opencog graphs.
> To see the final formed clusters in human readable form, do you have any
> suggestions for the output format (how should it look like).
Not right now. I'm thinking of creating some simple tools to crawl
over what you've sent me ...
> I am planning
> to form a seperate file for each cluster.
It would be nicer to have everything in one file -- dealing with 1500
files can be painful.
> Current status is that we have clusters of words behaving syntactically
> similar.
Well, my goal here was to be able to automatically refine the
link-grammar dicts from this data. This remains an important,
unrealized goal ...
> These clusters are formed by Weka (K means k=1500) and Simcluster
> (hierarchical clustering). Next step is to assign disjuncts to these
> clusters. Do you have any suggestions how to do these.
?? Well, presumably Weka already "knows" this; it's just not printing
it out. By writing some extra code, we could take the newick tree,
and compute centers for the graph nodes.
What is the meaning of the numerical value attached to the data?
The newick page says "distance", implying that this can be used for
graphical layout, but is this indeed the case?
--linas
> To Do: Identify Disjuncts of each cluster.
Yes, this would be interesting.
I wrote some scripts to post-process the files you sent me, and
I notice that the clusters are grouping together words from different
link-grammar classes, while also splitting apart existing classes.
I'm curious to know why.
--linas
They're very simple, ad-hoc. For example, I notice that your
cluster469 has 24 words, mostly from words.n.2.s but also
has doubts.n, warnings.n which have their own distinct
collection of disjuncts, and also excuses.n which is from
a different grouping.
This suggests the basic mechanism for broadening parse coverage
as explained in the other email.
--linas