Linas Vepstas
May 6, 2017, 4:11:31 PM
to Ben Goertzel, Ruiting Lian, opencog, link-grammar
Ben, Ruiting,
For your enjoyment: I have some very preliminary results on word similarity. They look pretty nice, even though they are based on a fairly small number of observations.
If you've been watching TV instead of reading email, here's the story so far: Starting from a large text corpus, word-pairs are counted, and the mutual information (MI) of each pair is computed. This MI is used to perform a maximum spanning-tree (MST) parse of (a different subset of) the corpus. From each parse, a pseudo-disjunct is extracted for each word. The pseudo-disjunct is like a real LG disjunct, except that each connector in the disjunct is the word at the far end of the link.
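The first two steps can be sketched roughly as follows. This is only an illustration in Python, not the actual pipeline code: the function names `pair_mi` and `mst_parse`, the toy counts, and the use of plain pointwise MI with an unconstrained Prim-style spanning tree are all assumptions made for the sketch. The real system may count pairs differently and may put extra constraints on the tree.

```python
import math
from collections import Counter

def pair_mi(pair_counts):
    """Pointwise MI (in bits) of each ordered word pair, from raw pair counts.

    Assumed formula: MI(x,y) = log2( N(x,y) N(*,*) / (N(x,*) N(*,y)) ).
    """
    total = sum(pair_counts.values())      # N(*,*)
    left = Counter()                       # N(x,*)
    right = Counter()                      # N(*,y)
    for (x, y), n in pair_counts.items():
        left[x] += n
        right[y] += n
    return {
        (x, y): math.log2(n * total / (left[x] * right[y]))
        for (x, y), n in pair_counts.items()
    }

def mst_parse(words, mi, default=-1e9):
    """Maximum spanning tree over word positions (Prim's algorithm).

    Unseen pairs get the very low `default` score. Returns a sorted list
    of links (i, j) with i < j. No other constraints are imposed.
    """
    if not words:
        return []
    in_tree = {0}
    links = []
    while len(in_tree) < len(words):
        best = None
        for i in in_tree:
            for j in range(len(words)):
                if j in in_tree:
                    continue
                lo, hi = min(i, j), max(i, j)
                score = mi.get((words[lo], words[hi]), default)
                if best is None or score > best[0]:
                    best = (score, lo, hi, j)
        links.append((best[1], best[2]))
        in_tree.add(best[3])
    return sorted(links)

# Made-up pair counts, purely for illustration.
counts = {("Ben", "ate"): 4, ("ate", "pizza"): 4, ("Ben", "pizza"): 1,
          ("Sue", "pizza"): 4, ("Ben", "ran"): 4}
mi = pair_mi(counts)
links = mst_parse(["Ben", "ate", "pizza"], mi)   # [(0, 1), (1, 2)]
```

With these toy counts, the high-MI pairs ("Ben", "ate") and ("ate", "pizza") are chosen as links, and the low-MI pair ("Ben", "pizza") is left out.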
So, for example, in an idealized world, the MST parse of the sentence "Ben ate pizza" would produce the parse Ben <--> ate <--> pizza and from this, we can extract the pseudo-disjunct (Ben- pizza+) on the word "ate". Similarly, the sentence "Ben puked pizza" should produce the disjunct (Ben- pizza+) on the word "puked". Since these two disjuncts are the same, we can conclude that the two words "ate" and "puked" are very similar to each other. Considering all of the other disjuncts that arise in this example, we can conclude that these are the only two words that are similar.
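A small Python sketch of the extraction step, for the example above. The function name `pseudo_disjuncts` and the representation of a parse as a list of position pairs are assumptions made for illustration. So is the ordering of connectors within a disjunct (left to right by position of the far word).

```python
def pseudo_disjuncts(words, links):
    """Return a (word, disjunct) pair for every word in the sentence.

    Each connector is named after the word at the far end of a link,
    with '-' meaning the far word is to the left, '+' to the right.
    """
    conns = {i: [] for i in range(len(words))}
    for i, j in links:
        lo, hi = min(i, j), max(i, j)
        conns[lo].append((hi, words[hi] + "+"))   # link going right
        conns[hi].append((lo, words[lo] + "-"))   # link going left
    result = []
    for i, word in enumerate(words):
        # Sort by position of the far word, so '-' connectors come first.
        ordered = tuple(c for _, c in sorted(conns[i]))
        result.append((word, ordered))
    return result

chain = [(0, 1), (1, 2)]                 # w0 <--> w1 <--> w2
ate = pseudo_disjuncts(["Ben", "ate", "pizza"], chain)
puked = pseudo_disjuncts(["Ben", "puked", "pizza"], chain)
```

Here `ate[1]` is `("ate", ("Ben-", "pizza+"))` and `puked[1]` is `("puked", ("Ben-", "pizza+"))`: the same disjunct on two different words.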
Note that a given word may have very many pseudo-disjuncts attached to it. Each disjunct has a count of the number of times it has been observed. Thus, this set of disjuncts can be imagined to be a vector in a high-dimensional vector space, with each disjunct being a single basis element. The similarity of two words can be taken to be the cosine similarity between the disjunct-vectors (or pick a different metric, as you please).
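In Python terms, the cosine similarity of two such count vectors looks roughly like this. It is a stand-in sketch, not the code behind `cset-vec-cosine`; the name `cosine` and all of the counts below are invented for illustration.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse vectors mapping disjunct -> count."""
    dot = sum(n * v[d] for d, n in u.items() if d in v)
    norm_u = math.sqrt(sum(n * n for n in u.values()))
    norm_v = math.sqrt(sum(n * n for n in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0          # a word with no observations matches nothing
    return dot / (norm_u * norm_v)

# Invented counts: each key is one disjunct, i.e. one basis element.
ate = Counter({("Ben-", "pizza+"): 3, ("Sue-", "pasta+"): 1})
puked = Counter({("Ben-", "pizza+"): 1})
the = Counter({("cat+",): 5})

sim_verbs = cosine(ate, puked)   # 3 / sqrt(10), about 0.95
sim_mixed = cosine(ate, the)     # no shared disjuncts, so 0.0
```

Two words score above zero only if they share at least one identical disjunct, which is why sparse data gives so many exact 0.0 values.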
Below is a set of examples, for English, on a somewhat small dataset. Collected over a few days, it contains just under half-a-million observations of disjuncts, distributed across about 30K words. Thus, most words will have only a couple of disjuncts on them, which may have been seen only a couple of times. It's important, at this stage, to limit oneself to only the most popular words.
We expect the determiners "the" and "a" to be similar, and they are:
(cset-vec-cosine (Word "the") (Word "a")) = 0.1554007744026141
Even more similar:
(cset-vec-cosine (Word "the") (Word "this")) = 0.3083725359820755
Not very similar at all:
(cset-vec-cosine (Word "the") (Word "that")) = 0.01981486048876119
Oh hey, "this" and "that" are similar. Notice the triangle with "the".
(cset-vec-cosine (Word "this") (Word "that")) = 0.14342403062507977
Some more results:
(cset-vec-cosine (Word "this") (Word "these")) = 0.23100101197144984
(cset-vec-cosine (Word "this") (Word "those")) = 0.1099725424243773
(cset-vec-cosine (Word "these") (Word "those")) = 0.13577971016706158
We expect determiners, nouns and verbs to all be very different
from one another. And they are:
(cset-vec-cosine (Word "the") (Word "ball")) = 2.3964597196461594e-4
(cset-vec-cosine (Word "the") (Word "jump")) = 0.0
(cset-vec-cosine (Word "ball") (Word "jump")) = 0.0
We expect verbs to be similar, and they sort-of are.
(cset-vec-cosine (Word "run") (Word "jump")) = 0.05184758473652128
(cset-vec-cosine (Word "run") (Word "look")) = 0.05283524652572603
Since this is a sampling from Wikipedia, there will be very few "action" verbs, unless the sample accidentally contains articles about sports. A "common sense" corpus, or a corpus that talks about what people do, could/should improve the results for the above verbs. Such verbs are very basic to human behavior, but are rare in most writing.
I'm thinking that a corpus of children's lit, and young-adult-lit would be much better for these kinds of things.
An adjective.
(cset-vec-cosine (Word "wide") (Word "narrow")) = 0.06262242910851494
(cset-vec-cosine (Word "wide") (Word "look")) = 0.0
(cset-vec-cosine (Word "wide") (Word "ball")) = 0.02449979787750126
(cset-vec-cosine (Word "wide") (Word "the")) = 0.04718158900583385
(cset-vec-cosine (Word "heavy") (Word "wide")) = 0.05752237416355278
Here's a set of antonyms!
(cset-vec-cosine (Word "heavy") (Word "light")) = 0.16760038078849773
A pronoun
(cset-vec-cosine (Word "ball") (Word "it")) = 0.009201177048960233
(cset-vec-cosine (Word "wide") (Word "it")) = 0.005522960959398417
(cset-vec-cosine (Word "the") (Word "it")) = 0.01926824360790382
Wow!! In English, "it" is usually a male!
(cset-vec-cosine (Word "it") (Word "she")) = 0.1885493638629482
(cset-vec-cosine (Word "it") (Word "he")) = 0.4527656594627214
(cset-vec-cosine (Word "he") (Word "she")) = 0.1877589816088902
I can post the database on Monday; let me know when you're ready to
receive it.
--linas