Word similarity results; database almost ready


Linas Vepstas

May 6, 2017, 4:11:31 PM
to Ben Goertzel, Ruiting Lian, opencog, link-grammar
Ben, Ruiting,

For your enjoyment: I have some very preliminary results on word similarity.  They look pretty nice, even though they are based on a fairly small number of observations.

If you've been watching TV instead of reading email, here's the story so far: Starting from a large text corpus, word-pairs are counted, and the mutual information (MI) of each pair is computed. This MI is used to perform a maximum spanning-tree (MST) parse of (a different subset of) the corpus. From each parse, a pseudo-disjunct is extracted for each word.  The pseudo-disjunct is like a real LG disjunct, except that each connector in the disjunct is the word at the far end of the link.
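To make the MI step concrete, here is a minimal Python sketch. It assumes the MI in question is the usual pointwise MI of an ordered word pair, computed from raw pair counts; the function name `pair_mi` and the toy counts are made up for illustration, and this is not the actual OpenCog counting code.

```python
import math
from collections import Counter

def pair_mi(pair_counts):
    """Return {(left, right): MI in bits} from ordered word-pair counts.

    Assumes pointwise MI: log2( N(l,r) * N(*,*) / (N(l,*) * N(*,r)) ).
    """
    total = sum(pair_counts.values())
    left = Counter()    # N(l,*): how often a word appears on the left
    right = Counter()   # N(*,r): how often a word appears on the right
    for (l, r), n in pair_counts.items():
        left[l] += n
        right[r] += n
    return {
        (l, r): math.log2(n * total / (left[l] * right[r]))
        for (l, r), n in pair_counts.items()
    }

# Hypothetical toy counts, just to exercise the function.
counts = Counter({("Ben", "ate"): 2, ("ate", "pizza"): 2,
                  ("Ben", "pizza"): 1, ("the", "pizza"): 3})
mi = pair_mi(counts)
```

An MST parse would then pick, for each sentence, the tree of links that maximizes the summed MI; that step is not shown here.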

So, for example, in an idealized world, the MST parse of the sentence "Ben ate pizza" would produce the parse Ben <--> ate <--> pizza and from this, we can extract the pseudo-disjunct (Ben- pizza+) on the word "ate".  Similarly, the sentence "Ben puked pizza" should produce the disjunct (Ben- pizza+) on the word "puked".  Since these two disjuncts are the same, we can conclude that the two words "ate" and "puked" are very similar to each other.  Considering all of the other disjuncts that arise in this example, we can conclude that these are the only two words that are similar.
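The extraction step can be sketched in a few lines of Python. This is an illustration only: the parse is assumed to be given as a word list plus a list of index pairs, and the connector ordering (all left connectors by position, then all right connectors) is an assumption of this sketch, not necessarily what the real code does. The name `pseudo_disjuncts` is hypothetical.

```python
def pseudo_disjuncts(words, links):
    """Map each word index to (word, pseudo-disjunct string).

    words: list of tokens in sentence order.
    links: iterable of (i, j) index pairs, one per link in the parse.
    """
    left = {k: [] for k in range(len(words))}
    right = {k: [] for k in range(len(words))}
    for i, j in links:
        i, j = min(i, j), max(i, j)
        right[i].append(j)   # word i links rightwards to word j
        left[j].append(i)    # word j links leftwards to word i
    result = {}
    for k, word in enumerate(words):
        # Each connector is the word at the far end of the link,
        # tagged with "-" (to the left) or "+" (to the right).
        conns = [words[i] + "-" for i in sorted(left[k])]
        conns += [words[j] + "+" for j in sorted(right[k])]
        result[k] = (word, "(" + " ".join(conns) + ")")
    return result

ate = pseudo_disjuncts(["Ben", "ate", "pizza"], [(0, 1), (1, 2)])
puked = pseudo_disjuncts(["Ben", "puked", "pizza"], [(0, 1), (1, 2)])
# ate[1] is ("ate", "(Ben- pizza+)"); puked[1] carries the same disjunct.
```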

Note that a given word may have very many pseudo-disjuncts attached to it. Each disjunct has a count of the number of times it has been observed.  Thus, this set of disjuncts can be imagined to be a vector in a high-dimensional vector space, with each disjunct being a single basis element.  The similarity of two words can be taken to be the cosine-similarity between the disjunct-vectors (or pick another, different metric, as you please.)
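As a rough illustration of the vector picture, here is a Python sketch of cosine similarity where each word's vector is a dict from disjunct string to observation count. The counts below are invented, and the function is a stand-in for illustration, not the `cset-vec-cosine` implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as dicts."""
    # Only disjuncts present on both words contribute to the dot product.
    dot = sum(n * b[d] for d, n in a.items() if d in b)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical disjunct counts for three words.
ate = {"(Ben- pizza+)": 3, "(Ben- pasta+)": 1}
puked = {"(Ben- pizza+)": 2}
the = {"(pizza+)": 5}

sim_verbs = cosine(ate, puked)   # large: they share a disjunct
sim_mixed = cosine(ate, the)     # 0.0: no disjunct in common
```

Words with no disjunct in common come out at exactly 0.0, which is why exact zeros are unsurprising on a small dataset.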

Below is a set of examples, for English, on a somewhat small dataset. Collected over a few days, it contains just under half-a-million observations of disjuncts, distributed across about 30K words. Thus, most words will have only a couple of disjuncts on them, which may have been seen only a couple of times. It's important, at this stage, to limit oneself to only the most popular words.

We expect the determiners "the" and "a" to be similar, and they are:
(cset-vec-cosine (Word "the") (Word "a")) = 0.1554007744026141

Even more similar:
(cset-vec-cosine (Word "the") (Word "this")) = 0.3083725359820755

Not very similar at all:
(cset-vec-cosine (Word "the") (Word "that")) = 0.01981486048876119

Oh hey, "this" and "that" are similar. Notice the triangle with "the".
(cset-vec-cosine (Word "this") (Word "that")) = 0.14342403062507977

Some more results
 (cset-vec-cosine (Word "this") (Word "these")) = 0.23100101197144984
 (cset-vec-cosine (Word "this") (Word "those")) = 0.1099725424243773
 (cset-vec-cosine (Word "these") (Word "those")) = 0.13577971016706158

We expect determiners, nouns and verbs to all be very different
from one another. And they are:
 (cset-vec-cosine (Word "the") (Word "ball")) = 2.3964597196461594e-4
 (cset-vec-cosine (Word "the") (Word "jump")) = 0.0
 (cset-vec-cosine (Word "ball") (Word "jump")) = 0.0

We expect verbs to be similar, and they sort-of are.
 (cset-vec-cosine (Word "run") (Word "jump")) = 0.05184758473652128
 (cset-vec-cosine (Word "run") (Word "look")) = 0.05283524652572603

Since this is a sampling from Wikipedia, there will be very few "action" verbs, unless the sample accidentally contains articles about sports. A "common sense" corpus, or a corpus that talks about what people do, could/should improve the results for the above verbs.  These are very basic to human behavior, but are rare in most writing.

I'm thinking that a corpus of children's lit, and young-adult-lit would be much better for these kinds of things.

An adjective.
 (cset-vec-cosine (Word "wide") (Word "narrow")) = 0.06262242910851494
 (cset-vec-cosine (Word "wide") (Word "look")) = 0.0
 (cset-vec-cosine (Word "wide") (Word "ball")) = 0.02449979787750126
 (cset-vec-cosine (Word "wide") (Word "the")) = 0.04718158900583385

 (cset-vec-cosine (Word "heavy") (Word "wide")) = 0.05752237416355278

Here's a set of antonyms!
 (cset-vec-cosine (Word "heavy") (Word "light")) = 0.16760038078849773

A pronoun
 (cset-vec-cosine (Word "ball") (Word "it")) = 0.009201177048960233
 (cset-vec-cosine (Word "wide") (Word "it")) = 0.005522960959398417

 (cset-vec-cosine (Word "the") (Word "it")) = 0.01926824360790382

Wow!! In English, "it" is usually a male!
 (cset-vec-cosine (Word "it") (Word "she")) = 0.1885493638629482
 (cset-vec-cosine (Word "it") (Word "he")) = 0.4527656594627214
 (cset-vec-cosine (Word "he") (Word "she")) = 0.1877589816088902

I can post the database on Monday; let me know when you're ready to
receive it.

--linas


Ben Goertzel

May 6, 2017, 11:25:57 PM
to Linas Vepstas, Ruiting Lian, opencog, link-grammar
Very cool!

Ruiting should be ready to start playing w/ this data on Tuesday, I think...
--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin

Linas Vepstas

May 7, 2017, 3:23:18 AM
to Ben Goertzel, Ruiting Lian, opencog, link-grammar
OK. Close coordination will be needed. I'm planning on creating a database with several different kinds of distance measures precomputed.  This is feasible because the database is currently small enough.

Any favorite distance measures you might recommend, besides the cosine distance?

--linas

Ben Goertzel

May 7, 2017, 3:27:36 AM
to link-grammar, Ruiting Lian, opencog
Hmm... it's hard to know what distance measure is best without playing
w/ the data first

Jaccard and Tanimoto similarity (the latter not quite corresponding to
a metric) may be useful, I dunno...

https://en.wikipedia.org/wiki/Jaccard_index#Generalized_Jaccard_similarity_and_distance

Some clustering methods will be able to use these precomputed
distances; others (like NN-based methods) sorta compute distances in
the midst of doing their other stuff anyway...
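For reference, a minimal Python sketch of the two measures on sparse count vectors (dicts from disjunct to count). It assumes non-negative counts, uses the sum-of-minima over sum-of-maxima form for the generalized Jaccard similarity, and uses the dot-product form for Tanimoto. The function names and sample counts are made up for illustration.

```python
def weighted_jaccard(a, b):
    """Generalized Jaccard similarity: sum of minima over sum of maxima."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0

def tanimoto(a, b):
    """Tanimoto similarity: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(n * b[k] for k, n in a.items() if k in b)
    den = (sum(n * n for n in a.values())
           + sum(n * n for n in b.values()) - dot)
    return dot / den if den else 0.0

# Hypothetical disjunct counts for two words.
ate = {"(Ben- pizza+)": 3, "(Ben- pasta+)": 1}
puked = {"(Ben- pizza+)": 2}

j = weighted_jaccard(ate, puked)   # 2 / (3 + 1) = 0.5
t = tanimoto(ate, puked)           # 6 / (10 + 4 - 6) = 0.75
```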
> --
> You received this message because you are subscribed to the Google Groups
> "link-grammar" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to link-grammar...@googlegroups.com.
> To post to this group, send email to link-g...@googlegroups.com.
> Visit this group at https://groups.google.com/group/link-grammar.
> For more options, visit https://groups.google.com/d/optout.

Nil Geisweiller

May 7, 2017, 4:21:12 AM
to link-g...@googlegroups.com, Ruiting Lian, opencog

Linas Vepstas

May 9, 2017, 5:41:27 PM
to opencog, Nil Geisweiller, link-grammar, Ruiting Lian
Hi Nil,

This is very off-topic, but it illustrates the problem of coding in C++: it's hard and sometimes impossible to untangle algorithm and data structure.  This point is made clearly in the Bondi language, which is an experimental language that tries to clearly separate these two.

Historically, Lisp/Scheme were much better at separating algo from data, which is why they were popular in early AI attempts, and in early web shopping-carts.  The whole point of adding templates to C++ was to at least partly solve this problem, but C++ templates remain hard to use in any but the very simplest situations. Basically, C++ templates are like a badly-broken, hard-to-use version of Lisp (but with types, so I guess like Haskell/Caml).

Another common solution for OO programming in Python and C++ is the "visitor pattern", which you don't see in Lisp or Scheme, because everything is a visitor there. Visitors show up in chapter 2 of SICP; they are so basic that they are not even given a special name.

In my case, the different counts for different atoms are stored in different places, and they're never vectors; they're usually just random sets of gorp that earlier layers generated.

My quick-hack, non-generic-programming approach is here:

Banach lp-distance:

https://github.com/opencog/opencog/blob/master/opencog/nlp/learn/pseudo-csets.scm#L217-L234

vector product:
https://github.com/opencog/opencog/blob/master/opencog/nlp/learn/pseudo-csets.scm#L386-L414

Both are more complicated than they need to be, because the indicated data item might not exist. E.g. only one in a trillion possible disjuncts will ever exist, so these "vectors" are actually unordered sets, and they are extremely sparse.
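To show what "the data item might not exist" means in practice, here is a hedged Python sketch of an lp-distance and a dot product over such sparse sets, stored as dicts, where a missing entry is treated as a count of zero. This is an illustration of the idea only; it is not a translation of the Scheme code linked above, and the names `lp_distance` and `sparse_dot` are hypothetical.

```python
def lp_distance(a, b, p=2.0):
    """l_p distance between sparse count sets; missing entries count as 0."""
    # Walk the union of keys: an item may exist on either side only.
    keys = set(a) | set(b)
    total = sum(abs(a.get(k, 0) - b.get(k, 0)) ** p for k in keys)
    return total ** (1.0 / p)

def sparse_dot(a, b):
    """Dot product; only items present on both sides contribute."""
    if len(b) < len(a):
        a, b = b, a          # iterate over the smaller set
    return sum(n * b[k] for k, n in a.items() if k in b)

# Hypothetical counts: the two words share only one disjunct.
x = {"(Ben- pizza+)": 3, "(Ben- pasta+)": 4}
y = {"(Ben- pizza+)": 3, "(the- is+)": 2}

d1 = lp_distance(x, y, p=1)   # |4 - 0| + |0 - 2| = 6
dp = sparse_dot(x, y)         # 3 * 3 = 9
```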

--linas


--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+unsubscribe@googlegroups.com.
To post to this group, send email to ope...@googlegroups.com.
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/c4100a00-4dc0-5391-1ebe-20969f4e7e8b%40gmail.com.