Language-learning status


Linas Vepstas

Apr 28, 2017, 8:21:56 AM
to Ben Goertzel, opencog, link-grammar
Ben,

Just got out of surgery for my broken leg; this email attempts to prove that the general anesthesia didn't kill too many brain cells.  It's a report on some of the language-learning results.

Let's dive in.
(mst-parse-text "this is a test")

Raw hard-to-read result below.  Easier-to-read versions later.
ctv holds the raw count of how many times the word was observed.

((2.862118287646645 ((2 (WordNode "is" (ctv 1 0 8165736))) (3 (WordNode "a" (ctv 1 0 14691104))
))) (2.1880378875282904 ((1 (WordNode "this" (ctv 1 0 1300681))) (2 (WordNode "is" (ctv 1 0 8165736))
))) (2.8103625339100944 ((1 (WordNode "this" (ctv 1 0 1300681))) (4 (WordNode "test" (ctv 1 0 60328))
))))

The floating point number above and below is the Yuret MI of the word pair.
I've amended https://github.com/opencog/opencog/tree/master/opencog/nlp/learn/learn-lang-diary/learn-lang-diary.pdf pages 2-5 so that it's less confusing and the formulas are accurate. Basically, it derives Yuret's formulas in a more rigorous
way; if I recall, his argument was scattered, and he just asserted the result without deriving it.  So the PDF derives it.
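For concreteness, here's a rough sketch of how the pairwise MI falls out of raw counts like the ctv values above. This is an illustrative Python reconstruction, not the actual scheme code; the function and argument names are made up, and the real code of course reads the counts out of the atomspace.

```python
from math import log2

def pair_mi(n_xy, n_x, n_y, n_total):
    """Pairwise mutual information of a word pair, in bits.

    n_xy    -- observation count of the ordered pair (x, y)
    n_x     -- marginal count of x (as the left word of a pair)
    n_y     -- marginal count of y (as the right word of a pair)
    n_total -- total number of pair observations

    MI(x, y) = log2( p(x, y) / (p(x, *) p(*, y)) )
    """
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return log2(p_xy / (p_x * p_y))
```

So a pair seen 4 times, out of 64 total pair observations, with each word's marginal count at 8, gets an MI of 2 bits: the pair is seen four times more often than independence would predict.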

((2.862118287646645 ((2 (WordNode "is" )) (3 (WordNode "a" ))))
 (2.1880378875282904 ((1 (WordNode "this" )) (2 (WordNode "is" ))))
 (2.8103625339100944 ((1 (WordNode "this" )) (4 (WordNode "test" )))))

Simplifying further:

(((2 (WordNode "is" )) (3 (WordNode "a" )))
 ((1 (WordNode "this" )) (2 (WordNode "is" )))
 ((1 (WordNode "this" )) (4 (WordNode "test" ))))

The integer is the ordinal of the word.  Note that the linkage "is-a" was selected over "a test" -- that's because "a test" has an MI of 2.0935.  This is not terribly surprising; any MI of less than four is pretty crappy, and these four words occur so commonly that the correlation between them really is quite weak -- they're almost drowning in noise.  Extracting disjuncts should strongly sharpen the results.  Next email.
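For the curious, the tree-selection step can be sketched as a greedy maximum-spanning-tree construction over the word-pair MI scores. This is an illustrative Python reconstruction, not the actual scheme code; in particular it omits the no-link-crossing constraint that the real parser enforces.

```python
def mst_parse(words, mi):
    """Greedy maximum-spanning-tree parse sketch.

    words -- list of tokens, in sentence order
    mi    -- dict mapping ordered word pairs (left, right) to their MI;
             unobserved pairs default to the -1000 sentinel.

    Returns the chosen links as ((i, word_i), (j, word_j)) pairs,
    mirroring the simplified mst-parse-text output (1-based ordinals).
    """
    n = len(words)
    # All candidate links, highest MI first.
    candidates = sorted(
        ((mi.get((words[i], words[j]), -1000), i, j)
         for i in range(n) for j in range(i + 1, n)),
        reverse=True)
    # Union-find forest, to keep the link set a tree (no cycles).
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    links = []
    for score, i, j in candidates:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            links.append(((i + 1, words[i]), (j + 1, words[j])))
    return links
```

Feeding it the MI values quoted above for "this is a test" reproduces the same three links: the 2.0935-MI "a test" link loses out because "is-a", "this-test" and "this-is" already span the sentence.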

Here's a better example:

(mst-parse-text "cats eat cat food")

((7.329 ((2 (WordNode "eat" (ctv 20938))) (4 (WordNode "food" (ctv 73924)))))
 (4.992 ((3 (WordNode "cat" (ctv 18408))) (4 (WordNode "food" (ctv 73924)))))
 (-1000 ((1 (WordNode "cats" (ctv 5902))) (4 (WordNode "food" (ctv 73924))))))

So "eat food" has a decent MI, as expected, and "cat food" is decent too. The minus-1000 means that the word pair "cats food" was never observed. Likewise, (get-pair-mi-str "cats" "eat") = -1000 means that "cats eat" was never observed!  Bummer! The word "cats" was observed almost 6000 times, and that was not enough to discover a sentence with "cats eat" in it.  These statistics are from a relatively smallish sample of WP articles, so the lack of such a sentence is maybe not surprising. Here, children's & young-adult lit may be better.

Anyway, clustering that reveals that cats, dogs, etc are similar should help with this, or so goes the hypothesis.

The word-pair "cats cat" does occur and has an MI of 5, but is prevented from linking by the link-crossing constraint.  I have not attempted to figure out whether the Dick Hudson landmark-transitivity idea can be mutated to apply to this situation. I suppose I should think about things before writing about them, but not thinking is faster.
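The link-crossing constraint itself is easy to state: with words numbered left to right, two links cross when one starts strictly inside the other but ends outside it. A tiny check (a hypothetical helper, not from the actual code):

```python
def crosses(a, b):
    """True if links a = (i, j) and b = (k, l), over 1-based word
    positions, cross each other; nested and disjoint links are fine."""
    (i, j), (k, l) = sorted(a), sorted(b)
    return (i < k < j < l) or (k < i < l < j)
```

In "cats eat cat food", the candidate "cats"-"cat" link is (1, 3), which crosses the already-chosen "eat"-"food" link (2, 4), so it is rejected despite its MI of 5.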

Let's try again:

(mst-parse-text "dogs eat dog food")
((7.329 ((2 (WordNode "eat" (ctv 20938))) (4 (WordNode "food" (ctv 73924)))))
 (7.047 ((3 (WordNode "dog" (ctv 41896))) (4 (WordNode "food" (ctv 73924)))))
 (5.050 ((1 (WordNode "dogs" (ctv 14852))) (2 (WordNode "eat" (ctv 20938))))))

Well, that's much better. Let's try something harder:

(mst-parse-text "It is not uncommon to discover strange things")

((7.515 ( (WordNode "not" ) (WordNode "uncommon" )))
 (4.142 ( (WordNode "is" ) (WordNode "uncommon" )))
 (4.412 ( (WordNode "It" ) (WordNode "is" )))
 (2.739 ( (WordNode "uncommon" ) (WordNode "to" )))
 (3.529 ( (WordNode "to" ) (WordNode "discover" )))
 (0.822 ( (WordNode "to" ) (WordNode "things" )))
 (6.171 ( (WordNode "strange" ) (WordNode "things" ))))

Almost right -- the stinker in there is "to things" and it has a terrible MI.  The correct link would have been "discover things" but this word-pair was never ever observed.

That's it for now, more later.

p.s. The above is obtained with code that uses values in full
generality; so, for example, the normalized word frequency is stored as

(Valuation
    (WordNode "foo")
    (PredicateNode "*-Frequency Key-*")
    (FloatValue 0.1234567 3.018))

Note that "Valuation" is like an EvaluationLink but different.
The first number is the normalized frequency of observation N(foo) / N(all words),
and the second number is the log-base-2 of the first (it's easier to read
than counting zeros in a frequency).
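As a sanity check on those two numbers: judging from the example values, the stored log appears to be the magnitude of the log2 (0.1234567 is roughly 2^-3.018), so the computation would look something like this hypothetical helper (for illustration only):

```python
from math import log2

def frequency_value(n_word, n_total):
    """Sketch of the stored frequency pair: (p, -log2 p).

    Mirrors the FloatValue in the Valuation above: the normalized
    frequency N(word) / N(all words), together with its sign-dropped
    log-base-2, which is easier to eyeball than counting leading zeros.
    """
    p = n_word / n_total
    return p, -log2(p)
```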

I had to fix a dozen bugs in the brand-new SQL backend code to get this to
work right. It all seems stable now.

--linas

Ben Goertzel

Apr 28, 2017, 11:05:45 AM
to Linas Vepstas, opencog, link-grammar
Hi,

> Just got out of surgery for my broken leg; this email attempts to prove that
> the general anesthesia didn't kill too many brain cells.

Oh shit... so that little ankle-twist you got in Lalibela was actually
a broken bone??!! ;o that would explain why it was taking so f**king
long to heal ... ;/

> Its a report on
> some the language-learning results.
...
> (mst-parse-text "dogs eat dog food")
> ((7.329 ((2 (WordNode "eat" (ctv 20938))) (4 (WordNode "food" (ctv
> 73924)))))
> (7.047 ((3 (WordNode "dog" (ctv 41896))) (4 (WordNode "food" (ctv 73924)))))
> (5.050 ((1 (WordNode "dogs" (ctv 14852))) (2 (WordNode "eat" (ctv 20938))
> ))))
...
> (mst-parse-text "It is not uncommon to discover strange things")
>
> ((7.515 ( (WordNode "not" ) (WordNode "uncommon" )))
> (4.142 ( (WordNode "is" ) (WordNode "uncommon" )))
> (4.412 ( (WordNode "It" ) (WordNode "is" )))
> (2.739 ( (WordNode "uncommon" ) (WordNode "to" )))
> (3.529 ( (WordNode "to" ) (WordNode "discover" )))
> (0.822 ( (WordNode "to" ) (WordNode "things" )))
> (6.171 ( (WordNode "strange" ) (WordNode "things" ))))
>
> Almost right -- the stinker in there is "to things" and it has a terrible
> MI. The correct link would have been "discover things" but this word-pair
> was never ever observed.

Yeah, cool! We are definitely getting somewhere. These results are
all with an overly small corpus (as well as being the first stage of
the iterative process) so some dopey mistakes are to be expected...

***
Anyway, clustering that reveals that cats, dogs, etc are similar
should help with this, or so goes the hypothesis.
***

Yes. Ruiting is ready to start playing with clustering approaches,
as soon as you give her an Atomspace of mst-parsed sentences that you
think is a passable corpus for clustering experimentation...

-- Ben

Ben Goertzel

Apr 28, 2017, 12:24:27 PM
to Linas Vepstas, opencog, link-grammar
On Fri, Apr 28, 2017 at 8:21 PM, Linas Vepstas <linasv...@gmail.com> wrote:
> The integer is the ordinal of the word. Note that the linkage "is-a" was
> selected over "a test" -- that's because "a test" has an MI of 2.0935. This
> is not terribly surprising;

I wonder if this example would work better if you used asymmetric
information rather than symmetric mutual information, as I suggested
before, though...

The problem with symmetric MI is: "a" is strongly attracted to "test",
but "test" is not strongly attracted to "a"

Whereas both "is" and "a" are strongly attracted to each other

So I sorta suspect that if you used asymmetric information here, and
then found the msdag rather than the mstree, you would get this (and
other similar) examples right even without a huge corpus...

-- Ben


--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin

Linas Vepstas

Apr 28, 2017, 5:50:53 PM
to Ben Goertzel, opencog, link-grammar
On Fri, Apr 28, 2017 at 10:05 AM, Ben Goertzel <b...@goertzel.org> wrote:
Hi,

> Just got out of surgery for my broken leg; this email attempts to prove that
> the general anesthesia didn't kill too many brain cells.

Oh shit... so that little ankle-twist you got in Lalibela was actually
a broken bone??!!  ;o  that would explain why it was taking so f**king
long to heal ... ;/

Yeah, and I think it was even healing correctly; then one day, when it was actually getting into pretty good shape, I was hurrying and made it worse, and it deteriorated after that.


Yeah, cool!  We are definitely getting somewhere.  These results are
all with an overly small corpus (as well as being the first stage of
the iterative process) so some dopey mistakes are to be expected...

Well, but this is all old ground.  I was seeing this years ago, and besides, it's what all the old MST parsers from a decade ago were all about. It's also why they were abandoned: no one figured out how to make them better by extracting link types.  So the next step will be a first.

Yes.   Ruiting is ready to start playing with clustering approaches,
as soon as you give her an Atomspace of mst-parsed sentences that you
think is a passable corpus for clustering experimentation...

OK, "real soon now".  There are always a few hiccups. The MST parser, being scheme code that calls the atomspace, is at least an order of magnitude slower than the LG parser (which has gotten really fast).  Note, however, that it is possible to export all of the MST data into LG, and use LG to perform MST.  This has been in planning for a while, and the ground has been cleared for it, but it remains to be implemented.

--linas

Linas Vepstas

Apr 28, 2017, 6:03:21 PM
to Ben Goertzel, opencog, link-grammar
On Fri, Apr 28, 2017 at 11:24 AM, Ben Goertzel <b...@goertzel.org> wrote:
On Fri, Apr 28, 2017 at 8:21 PM, Linas Vepstas <linasv...@gmail.com> wrote:
> The integer is the ordinal of the word.  Note that the linkage "is-a" was
> selected over "a test" -- that's because "a test" has an MI of 2.0935.  This
> is not terribly surprising;

I wonder if this example would work better if you used asymmetric
information rather than symmetric mutual information, as I suggested
before, though...

The problem with symmetric MI is: "a" is strongly attracted to "test",
but "test" is not strongly attracted to "a"

Whereas both "is" and "a" are strongly attracted to each other

So I sorta suspect that if you used asymmetric information here, and
then found the msdag rather than the mstree, you would get this (and
other similar) examples right even without a huge corpus...

Pronto. This is "easy".  At least, if I use the definition I(X,Y) / H(X).  Any other favorite or suggested forms?  I mean, we can do crazy things like squaring these quantities, too.  I sort of want to avoid playing too many strange games, because it does take more time, but it would be a worthwhile exercise to at least write down all plausible forms.
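For illustration, a minimal sketch of that I(X,Y) / H(X) form, taking H(x) here to be the surprisal -log2 p(x). That reading is an assumption on my part; a marginal entropy would be another reasonable interpretation, and the function name is made up.

```python
from math import log2

def asymmetric_info(p_xy, p_x, p_y):
    """Asymmetric variant I(x, y) / H(x) of the pairwise MI.

    I(x, y) = log2( p(x, y) / (p(x) p(y)) )   -- pairwise MI
    H(x)    = -log2 p(x)                       -- surprisal of x (assumed)

    Dividing by H(x) makes the score directional: how strongly x
    "points at" y, relative to how informative x is on its own.
    """
    mi = log2(p_xy / (p_x * p_y))
    return mi / -log2(p_x)
```

Note the asymmetry: with the same joint probability, a common left word (like "a") and a rare left word (like "test") score differently, which is exactly the "a test" vs "test a" distinction being discussed.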
 
--linas

Ben Goertzel

Apr 28, 2017, 9:44:51 PM
to link-grammar, opencog
On Sat, Apr 29, 2017 at 6:02 AM, Linas Vepstas <linasv...@gmail.com> wrote:
> Pronto. This is "easy". At least, if I use the definition I(X,Y) / H(X).


That's what I was thinking... if the results from this are
unexpectedly pathological then we can brainstorm something else...

--
Ben Goertzel, PhD
http://goertzel.org
