Re: Testing the same unsupervisedly learned grammars on different kinds of corpora


Linas Vepstas

Apr 22, 2019, 11:40:49 PM
to Anton Kolonin @ Gmail, Ben Goertzel, lang-learn, link-grammar, opencog
Hi Anton,

On Mon, Apr 15, 2019 at 11:18 AM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:
Ben, Linas,

Let me comment on the latest results, where LG-English parses are given as input to the Grammar Learner using the Identical Lexical Entries (ILE) algorithm, and the learned grammar is compared against the same input LG-English parses - for the Gutenberg Children corpus with direct speech removed, using only complete LG-English parses for testing and training.

MWC - Minimum Word Count: test only on the sentences where every word in the sentence occurs the given number of times or more.

MSL - Maximum Sentence Length: test only on the sentences which have the given number of words or fewer.

MWC(GT) MSL(GT) PA      F1
0       0        61.69%   0.65 - all input sentences are used for test
5       0       100.00%   1.00 - sentences with each word occurring 5+
10      0       100.00%   1.00 - sentences with each word occurring 10+
50      0       100.00%   1.00 - sentences with each word occurring 50+
That is:

1) With words occurring 5 or more times, recall=1.0 and precision=1.0;

Thank you!  This is fairly impressive: it says that if the algo heard a word five or more times, that was sufficient for it to deduce the correct grammatical form!  This is something that is considered to be very important when people compare machine learning to human learning -- it is said that "humans can learn from very few examples and machines cannot", yet here we have an explicit demonstration of an algorithm that can learn with perfect accuracy from only five examples!  I think that is absolutely awesome, and is the kind of news that can be shouted from the rooftops!  It's a "we did it! success!" kind of story.

The fact that the knee of the curve occurs at or below 5 is huge -- very, very different from what it would be if it occurred at 50.

However, just to be clear -- it would be very useful if you or Alexey provided examples of words that were seen only 2 or 3 times, and the kinds of sentences they appeared in.
 
2) Shorter sentences provide better recall and precision.

0       5        70.06%   0.72 - sentences of 5 words and shorter
0       10       66.60%   0.69 - sentences of 10 words and shorter
0       15       63.87%   0.67 - sentences of 15 words and shorter
0       25       61.69%   0.65 - sentences of 25 words and shorter

This is meaningless - a nonsense statistic.  It just says "the algo encountered a word only once or twice or three times, and fails to use that word correctly in a long sentence. It also fails to use it correctly in a short sentence." Well, duhhh -- if I invented a brand-new word you had never heard before, and gave you only one or two examples of its use, of course you would be lucky to reach 60% or 70% accuracy in using that word!!  The above four data-points are mostly useless and meaningless.

--linas
 

Note:

1) The Identical Lexical Entries (ILE) algorithm is in fact "over-fitting", so there is still a way to go toward being able to learn "generalized grammars";
2) The same kind of experiment is still to be done with MST-Parses, and the results are not expected to be that glorious, given what we know about the Pearson correlation between F1-s on different parses ;-)

Definitions of PA and F1 are in the attached paper.

Cheers,
-Anton


--------


*Past Week:*
1. Provided data for GC for ALE and dILEd.
2. Fixed GT to allow parsing sentences starting with numbers in ULL mode.
3. Ended up with Issue #184, ran several tests for different corpora
with different settings of MWC and MSL:
- Nothing interesting for POC-English;
- CDS seems to depend on the ratio of incompletely parsed sentences to completely parsed sentences in the corpus subset defined by the MWC/MSL restriction.
http://langlearn.singularitynet.io/data/aglushchenko_parses/CDS-dILEd-MWC-MSL-2019-04-13/CDS-dILEd-MWC-MSL-2019-04-13-summary.txt
- Much more reliable result is obtained on GC corpus with no direct speech.
http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-13/GCB-NQ-dILEd-MWC-MSL-summary.txt
4. Small improvements to the pipeline code were made.

*Next week:*
1. Resolve Issue #188
2. Resolve Issue #198
3. Resolve Issue #193
4. Pipeline improvements along the way.

Alexey



--
cassette tapes - analog TV - film cameras - you

Linas Vepstas

Apr 22, 2019, 11:46:12 PM
to Anton Kolonin @ Gmail, Ben Goertzel, lang-learn, link-grammar, opencog
On Mon, Apr 15, 2019 at 11:18 AM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:


1) Identical Lexical Entries (ILE) algorithm is "over-fitting" in fact,
so there is still way to go being able to learn "generalized grammars";

Can you explain in detail what "Identical lexical entries" are? I can guess, but I would like to know if my guess is what you are actually doing. The attached paper did not say.

-- Linas 
 

Ben Goertzel

Apr 22, 2019, 11:48:42 PM
to Linas Vepstas, Anton Kolonin @ Gmail, lang-learn, link-grammar, opencog
***
Thank you! This is fairly impressive: it says that if the algo heard
a word five or more times, that was sufficient for it to deduce the
correct grammatical form!
***

Yes. What we can see overall is that, with the current algorithms
Anton's team is using: If we have "correct" unlabeled dependency
parses, then we can infer "correct" parts-of-speech and POS-based
grammatical rules... for words that occur often enough (5 times with
current corpus and parameters)

So the problem of unsupervised grammar induction is, in this sense,
reduced to the problem of getting correct-enough unlabeled dependency
parses ...

The current MST parser, on corpora of the sizes we have been able to
feed it, does not produce correct-enough unlabeled dependency parses.
One thread of current research is to see if using info from modern
DNN models, in place of simple mutual information, can cause an
MST-type parser to produce correct-enough unlabeled dependency
parses.... (where "correct" means agreement w/ human-expert
grammatical judgments, in this case)

ben
--
Ben Goertzel, PhD
http://goertzel.org

"Listen: This world is the lunatic's sphere, / Don't always agree
it's real. / Even with my feet upon it / And the postman knowing my
door / My address is somewhere else." -- Hafiz

Linas Vepstas

Apr 23, 2019, 12:13:55 AM
to Anton Kolonin @ Gmail, Ben Goertzel, Alexei Glushchenko, lang-learn, link-grammar, opencog
On Mon, Apr 15, 2019 at 9:02 PM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:
Ben,
I'd be curious to see some examples of the sentences used in

***
5       0       100.00%   1.00 - sentences with each word occurring 5+
10      0       100.00%   1.00 - sentences with each word occurring 10+
50      0       100.00%   1.00 - sentences with each word occurring 50+
***
Alexey, please provide.
So if I understand right, you're doing grammar inference here, but
using link parses (with the hand-coded English grammar) as data ...
right?   So it's a test of how well the grammar inference methodology
works if one has a rather good set of dependency linkages to work with
...?

Yes.

Oh.  Well, that is not at all what I thought you were describing in the earlier emails. If you have perfect parses to begin with, then extracting dependencies from perfect parses is ... well, not exactly trivial, but also not hard. So getting 100% accuracy is actually a kind-of unit-test; it proves that your code does not have any bugs in it. 

2) To which extent "the best of MST" parses will be worse than what we have above (in progress)

3) If we can get quality of "the best of MST" parses close to that (DNN-MI-lking, etc.)

What does "the best of MST" mean?  The goal is to use MST-provided parses, discard all words/sentences in which a word occurs less than N times, and see what the result is.  I am still expecting a knee at N=5 and not N=50.

4) If we can learn grammar in more generalized way (hundreds of rules instead of thousands)

The size of your grammar depends strongly on the size of your vocabulary. For a child's corpus, I think it's "impossible" to get an accurate grammar with fewer than 800 or 1000 rules.  The current English LG dictionary has approximately 8K rules.

I do not have a good way of estimating a "reasonable" dictionary size -- again, Zipf's law means that only a small number of rules are used frequently, and that three-quarters of all rules are used to handle corner cases.  To be clear: for the children's corpus, if you learned 1000 rules total, then I would expect that 250 rules would be triggered 5 or more times, while the remaining 750 rules would trigger only 1, 2, 3 or 4 times.   That is my guess.

Actually creating this graph, and seeing what it looks like -- that would be very interesting.  It would reveal something important about language. Zipf's law says something very important -- that, hiding behind an apparent regularity, the exceptions and corner-cases are frequent and common. I expect this to hold for the learned grammars.

What it means, in practice, is that the size of your grammar is determined by the size of your training set -- specifically, by the integral under the curve, from 1 or more observations of a word.
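Just to illustrate the shape of that claim, here is a toy Zipf calculation (the rule count and total number of rule firings are made-up numbers; only the shape of the curve matters, and the 5+/under-5 split depends entirely on the assumed corpus size):

# Toy Zipf model: the rule at rank r fires proportionally to 1/r.
n_rules, total_firings = 1000, 5000
weights = [1.0 / r for r in range(1, n_rules + 1)]
scale = total_firings / sum(weights)
firings = [w * scale for w in weights]

frequent = sum(1 for f in firings if f >= 5)
print("rules fired 5+ times:", frequent, " fired fewer than 5 times:", n_rules - frequent)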

-- Linas

Linas Vepstas

Apr 23, 2019, 12:36:59 AM
to Ben Goertzel, Anton Kolonin @ Gmail, lang-learn, link-grammar, opencog
On Mon, Apr 22, 2019 at 10:48 PM Ben Goertzel <b...@goertzel.org> wrote:
***
Thank you!  This is fairly impressive: it says that if the algo heard
a word five or more times, that was sufficient for it to deduce the
correct grammatical form!
***

Yes.   What we can see overall is that, with the current algorithms
Anton's team is using: If we have "correct" unlabeled dependency
parses, then we can infer "correct" parts-of-speech and POS-based
grammatical rules... for words that occur often enough (5 times with
current corpus and parameters)

Ah, well, hmm. It appears I had misunderstood. I did not realize that the input was 100% correct but unlabelled parses. In this case, obtaining 100% accuracy is NOT surprising; it's actually just a proof that the code is reasonably bug-free. Such proofs are good to have, but it's not theoretically interesting. It's kind of like saying "we proved that our radio telescope is pointed in the right direction", which is an important step.


So the problem of unsupervised grammar induction is, in this sense,
reduced to the problem of getting correct-enough unlabeled dependency
parses ...

Oh, no, not at all! Exactly the opposite!! Now that the telescope is pointed in the right direction, what is the actual signal?

My claim is that this mechanism acts as an "amplifier" and a "noise filter" -- that it can take low-quality MST parses as input,  and still generate high-quality results.   In fact, I make an even stronger claim: you can throw *really low quality data* at it -- something even worse than MST, and it will still return high-quality grammars.

This can be explicitly tested now:  Take the 100% perfect unlabelled parses, and artificially introduce 1%, 5%, 10%, 20%, 30%, 40% and 50% random errors into them. What is the accuracy of the learned grammar?  I claim that you can introduce 30% errors, and still learn a grammar with greater than 80% accuracy.  I claim this, and I think it is a very important point -- a key point -- but I cannot prove it.

It is a somewhat delicate experiment -- the corpus has to be large enough.  If you introduce a 30% error rate into the unlabelled parses, then certain rare words (seen 6 or fewer times) will be used incorrectly, reducing the effective count to 4 or less ... So the MWC "minimum word count" would need to get larger, the greater the number of errors.  But if the MWC is large enough (maybe 5 or 10, less than 20) and the corpus is large enough, then you should still get high-quality grammars from low-quality inputs.
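For what it's worth, the corruption step itself is easy to script; here is a rough sketch, assuming an unlabelled parse is just a list of word-index links for one sentence (an assumption for illustration, not the actual file format the pipeline uses):

import random

def corrupt_parse(links, n_words, error_rate, rng=random):
    """Replace a fraction of the links in an unlabelled parse with
    random links between other word pairs in the same sentence."""
    links = list(links)
    n_bad = round(error_rate * len(links))
    for idx in rng.sample(range(len(links)), n_bad):
        i = rng.randrange(n_words)
        j = rng.randrange(n_words)
        while j == i:
            j = rng.randrange(n_words)
        links[idx] = (min(i, j), max(i, j))
    return links

# e.g. a 5-word sentence parsed as a chain, with 30% of the links randomized
perfect = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(corrupt_parse(perfect, n_words=5, error_rate=0.3))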

-- Linas

Linas Vepstas

Apr 23, 2019, 12:44:36 AM
to Anton Kolonin @ Gmail, Ben Goertzel, lang-learn, link-grammar, opencog


On Mon, Apr 22, 2019 at 11:18 PM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:

We are going to repeat the same experiment with MST-Parses during this week.

The much more interesting experiment is to see what happens when you give it a known percentage of intentionally-bad unlabelled parses. I claim that this step provides natural error-reduction, error-correction, but I don't know how much.

--linas
 

Ben Goertzel

Apr 23, 2019, 6:00:28 AM
to Linas Vepstas, Anton Kolonin @ Gmail, lang-learn, link-grammar, opencog
> On Mon, Apr 22, 2019 at 11:18 PM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:
>>
>>
>> We are going to repeat the same experiment with MST-Parses during this week.
>
>
> The much more interesting experiment is to see what happens when you give it a known percentage of intentionally-bad unlabelled parses. I claim that this step provides natural error-reduction, error-correction, but I don't know how much.


If we assume roughly that "insufficient data" has a similar effect to
"noisy data", then the effect of adding intentionally-bad parses may
be similar to the effect of having insufficient examples of the words
involved... which we already know from Anton's experiments. Accuracy
degrades smoothly but steeply as number of examples decreases below
adequacy.

***
My claim is that this mechanism acts as an "amplifier" and a "noise
filter" -- that it can take low-quality MST parses as input, and
still generate high-quality results. In fact, I make an even
stronger claim: you can throw *really low quality data* at it --
something even worse than MST, and it will still return high-quality
grammars.

This can be explicitly tested now: Take the 100% perfect unlaballed
parses, and artificially introduce 1%, 5%, 10%, 20%, 30%, 40% and 50%
random errors into it. What is the accuracy of the learned grammar? I
claim that you can introduce 30% errors, and still learn a grammar
with greater than 80% accuracy. I claim this, I think it is a very
important point -- a key point - but I cannot prove it.
***

Hmmm. So I am pretty sure you are right given enough data.

However, whether this is true given the magnitudes of data we are now looking at (the Gutenberg Children's Corpus, for example) is less clear to me

Also the current MST parses are much worse than "30% errors" compared
to correct parses. So even if what you say is correct, it doesn't
remove the need to improve the MST parses...

But you are right -- this will be an interesting and important set of
experiments to run. Anton, I suggest you add it to the to-do list...

-- Ben

Ben Goertzel

Apr 23, 2019, 6:09:14 AM
to Linas Vepstas, Anton Kolonin @ Gmail, lang-learn, link-grammar, opencog
***
Ah, well, hmm. It appears I had misunderstood. I did not realize that
the input was 100% correct but unlaballed parses. In this case,
obtaining 100% accuracy is NOT suprising, its actually just a proof
that the code is reasonably bug-free.
***

It's a proof that the algorithms embodied in this portion of the code are actually up to the task -- not just a proof that the code is relatively bug-free, unless you take "bug" in the broad sense of "algorithm that doesn't fulfill the intended goals".

(I know you understand this, I'm just clarifying for the rest of the
audience...)

***
Such proofs are good to have, but its not theoretically interesting.
***

I think it's theoretically somewhat interesting, because there are a
lot of possible ways to do clustering and grammar rule learning, and
now we know a specific combination of clustering algorithm and grammar
rule learning algorithm that actually works (if the input dependency
parses are good)

But it's not yet the conceptual breakthrough we are chasing...

***
Its kind of like saying "we proved that our radio telescope is pointed
in the right direction". Which is an important step.
***

I think it's more like saying "Yay! our telescope works and is pointed
in the right direction" ;-) ....

But yeah, it means a bunch of the "more straightforward" parts of the
grammar-induction task are working now, so all we have to do is
finally solve the harder part, i.e. making decent unlabeled dependency
trees in an unsupervised way

Of course one option is that this clustering/rule-learning process is
part of a feedback process that produces said decent unlabeled
dependency trees

Then the approach would be

-- shitty MST parses
-- shitty inferred grammar
-- use shitty inferred grammar to get slightly less shitty parses
-- use slightly less shitty parses to get slightly less shitty inferred grammar
-- etc. until most of the shit disappears and you're left with just
the same level of shit as in natural language...

Another option is to use DNNs to get nicer parses and just do

-- nice MST parses guided by DNNs
-- nice inferred grammar from these parses

Maybe what will actually work is more like

-- semi-shitty MST parses guided by DNNs
-- semi-shitty inferred grammar
-- use semi-shitty inferred grammar together with DNNs to get less
shitty parses
-- use less shitty parses to get even less shitty inferred grammar
-- etc. until most of the shit disappears and you're left with just
the same level of shit as in natural language...


.. ben

Linas Vepstas

Apr 23, 2019, 5:13:51 PM
to Ben Goertzel, Anton Kolonin @ Gmail, lang-learn, link-grammar, opencog
On Tue, Apr 23, 2019 at 5:00 AM Ben Goertzel <b...@goertzel.org> wrote:
> On Mon, Apr 22, 2019 at 11:18 PM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:
>>
>>
>> We are going to repeat the same experiment with MST-Parses during this week.
>
>
> The much more interesting experiment is to see what happens when you give it a known percentage of intentionally-bad unlabelled parses. I claim that this step provides natural error-reduction, error-correction, but I don't know how much.


If we assume roughly that "insufficient data" has a similar effect to
"noisy data", then the effect of adding intentionally-bad parses may
be similar to the effect of having insufficient examples of the words
involved... which we already know from Anton's experiments.   Accuracy
degrades smoothly but steeply as number of examples decreases below
adequacy.

They are effects that operate at different scales.  In my experience, a word has to be seen at least five times before it gets linked mostly/usually accurately. The reason for this is simple: if it is seen only once, it has an equal co-occurrence with all of its nearby neighbors: any neighbor is equally likely to be the right link (so for N neighbors, a 1/N chance of guessing correctly).  When a word is seen five times, the collection of nearby neighbors has grown to several dozen, and of those several dozen, only 1 or 2 or 3 will have been seen repeatedly.  The correct link is to one of the repeats.  And so, "from first principles", I can guess that 5 is the minimum number of observations needed to arrive at an MST parse that is better than random chance.  This effect operates at the word-pair level, and determines the accuracy of MST.
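Here is a toy Monte-Carlo version of that argument. The generative model is entirely made up (Zipf-distributed bystander words in the window, the true collocate present 80% of the time), so the exact numbers mean nothing; the point is just that picking the most-repeated neighbor only starts to beat chance after a handful of observations:

import random
from collections import Counter

def link_accuracy(n_obs, window=6, vocab=1000, p_true=0.8, trials=3000):
    """Toy model: each observation of a target word fills its window with
    Zipf-distributed bystander words, plus (usually) the one true collocate.
    The guess is a neighbor with the highest co-occurrence count."""
    rng = random.Random(1)
    ranks = list(range(1, vocab + 1))
    zipf = [1.0 / r for r in ranks]
    hits = 0
    for _ in range(trials):
        seen = Counter()
        for _ in range(n_obs):
            if rng.random() < p_true:
                seen["TRUE"] += 1
            for w in rng.choices(ranks, weights=zipf, k=window - 1):
                seen[w] += 1
        best = max(seen.values())
        hits += rng.choice([w for w, c in seen.items() if c == best]) == "TRUE"
    return hits / trials

for k in (1, 2, 5, 10, 20):
    print(k, round(link_accuracy(k), 2))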

The other effect is operating at the disjunct level.  Consider a single word, and 10 sentences containing that word.  Assume each sentence has an unlabelled parse, which might be wrong. Assume that word is linked correctly 7 times, and incorrectly 3 times. Of those 3 times, only some of the links will be incorrect (typically, a word has more than one link going to it). When building disjuncts, this leads to 7 correct disjuncts, and 3 that are (partly) wrong.

Consider an 11th "test sentence" containing that word.  If you weight each disjunct equally, then you have a 7/10 chance of using good disjuncts and a 3/10 chance of using bad ones.  Solution: do not weight them equally!  But how to do this?  Short answer: the MI mechanism, w/ clustering, means that on average, the 7 correct disjuncts will have a high MI score, the 3 bad ones will have a low MI score, and thus, on the test sentence, it will be far more likely that the correct disjuncts get used.  The final accuracy should be better than 7/10.

This depends on a key step: correctly weighting disjuncts, so that this discrimination kicks in. Without discrimination, the resulting LG dictionary will have accuracy that is no better than MST (and maybe a bit worse, due to other effects).  
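A trivial worked example of the "do not weight them equally" point (observation counts stand in here for whatever weight -- MI, surprisingness, something else -- is finally used):

# Suppose that at parse time a word can connect using either a good disjunct
# (observed 7 times) or a bad one (observed once), as in the 7/10 story above.
good, bad = 7, 1

p_wrong_uniform  = 1 / 2                # unweighted: both candidates equally likely
p_wrong_weighted = bad / (good + bad)   # weighted by observation count

print(p_wrong_uniform, p_wrong_weighted)   # 0.5 versus 0.125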



***
My claim is that this mechanism acts as an "amplifier" and a "noise
filter" -- that it can take low-quality MST parses as input,  and
still generate high-quality results.   In fact, I make an even
stronger claim: you can throw *really low quality data* at it --
something even worse than MST, and it will still return high-quality
grammars.

This can be explicitly tested now:  Take the 100% perfect unlaballed
parses, and artificially introduce 1%, 5%, 10%, 20%, 30%, 40% and 50%
random errors into it. What is the accuracy of the learned grammar?  I
claim that you can introduce 30% errors, and still learn a grammar
with greater than 80% accuracy.  I claim this, I think it is a very
important point -- a key point - but I cannot prove it.
***

Hmmm.   So I am pretty sure you are right given enough data.

However, whether this is true given the magnitudes of data we are now
looking at (Gutenberg Childrens Corpus for example) is less clear to
me

It's a fairly large corpus - what, 750K sentences? and 50K unique words? (of which only 5K or 8K were seen more than five times!!)  So I expect accuracy to depend on word frequency: if the test sentences only contain words from that 5K vocabulary, they will have (much) higher accuracy than sentences that contain words that were seen 1-2 times.

I also expect the disjuncts on the most frequent 1K words to be of much higher accuracy than those on the next 4K -- so, for test sentences containing only words from the top 1K, I expect high accuracy.   For longer sentences containing infrequent words, I expect most of the sentence to be linked correctly, except for the portion near the infrequent word, where the error rate goes up.

One of the primary reasons to perform clustering is to "amplify frequency" - by grouping together words that are similar, the grand-total counts go up, the probably-correct disjunct counts shoot way up, while the maybe-wrong disjunct counts stay scattered and low, never coalescing.
 
Also the current MST parses are much worse than "30% errors" compared
to correct parses.   

Did Deniz Yuret falsify his thesis data? He got better than 80% accuracy; we should too.
 
So even if what you say is correct, it doesn't
remove the need to improve the MST parses...

Actually, one of my proposals from the previous block of emails was to make MST worse!  I'm so sick of hearing about MST that I proposed getting rid of it, replacing it with something of lower quality, and focusing on the clustering and disjunct-weighting schemes to improve accuracy.

I'm fairly certain that replacing MST with something lower-quality will still work well. If that is not the case, then  that means that the disjunct-processing stages are somehow being done wrong.  The final result should not depend very much on the accuracy of MST. And this does not require a huge corpus, either. If there is a strong dependence on MST, something is seriously wrong, seriously broken in the disjunct-processing stages.  We need to spend energy on fixing that brokenness and not on making MST better.

(And I would not be surprised that the disjunct-processing stages are broken, mostly because I have not seen any detailed description of how they are being performed.  The details there really matter, they really affect outcomes, but those details are not being discussed.)

To repeat myself-- these later stages are where all the action is -- if these later stages are weak, nothing can be built on them. 

--linas


But you are right -- this will be an interesting and important set of
experiments to run.   Anton, I suggest you add it to the to-do list...

-- Ben

Linas Vepstas

Apr 23, 2019, 5:45:42 PM
to Ben Goertzel, Anton Kolonin @ Gmail, lang-learn, link-grammar, opencog
Hi Ben,

On Tue, Apr 23, 2019 at 5:09 AM Ben Goertzel <b...@goertzel.org> wrote:
***

Ah, well, hmm. It appears I had misunderstood. I did not realize that
the input was 100% correct but unlaballed parses. In this case,
obtaining 100% accuracy is NOT suprising, its actually just a proof
that the code is reasonably bug-free.
***

 It's a proof that the algorithms embodied in this portion of the code
are actually up to the task.   Not just a proof that the code is
relatively bug-free, except in a broad sense of "bug" as "algorithm
that doesn't fulfill the intended goals"

Recently, one week of my time was sucked into a black hole.  I read all six papers from the latest Event Horizon Telescope announcement. Five and a half of those papers are devoted to describing the EHT, and proving that it works correctly.  The actual results are just one photo, and a few paragraphs explaining the photo.  And that is what made it into the mainstream press.

I'd like to see the same mind-set here: a lot more effort put into characterizing exactly what it is that is being done, and proving that it works as expected, where "expected==intuitive explanation of why it works".  So, yes, characterizing the stage that moves from unlabeled parses to labeled parses is really important.  If you want to sound like a professional scientist, then write that up in detail, i.e. prove that your experimental equipment works.  That's what the EHT people did, we can do it too. 
 

***
 Such proofs are good to have, but its not theoretically interesting.
***

I think it's theoretically somewhat interesting, because there are a
lot of possible ways to do clustering and grammar rule learning, and
now we know a specific combination of clustering algorithm and grammar
rule learning algorithm that actually works (if the input dependency
parses are good)

Yes.  Despite all the spread-sheets, PDFs and github issues that Anton has aimed my way, I still do not understand what this "specific combination of clustering algorithm and grammar rule learning algorithm" actually is.  I've got a vague impression, but not enough of one to be able to reproduce the work.  Which is funny, because as an insider, I wrote half the code that is being used as ingredients.  So I should be in a prime position to understand what is being done ... but I don't.  This still needs to be fixed.  It should be written up with EHT-level quality.
 

Then the approach would be

I don't want to comment on this part, because I've already commented on it before.  If there is an accuracy problem, it's got nothing to do with the accuracy of MST.  The accuracy of MST should NOT affect the final results!  If the accuracy of MST is impacting the final results, then some other part of the pipeline is not working correctly!

In a real radio telescope, the very first transistor in the antenna dominates the signal-to-noise ratio, and provides about 3 dB of amplification. 3 dB is equal to one binary bit: 10^0.3 ≈ 2^1, a factor-of-two decrease in entropy. All the data processing happens after that first transistor.

MST is like that first transistor. It's gonna be shitty.  If the downstream stages -- the disjunct processing -- aren't working right, then you get no worthwhile results.   Focus on the downstream, characterize the operation of the downstream. Quit obsessing over MST; it's a waste of time.

--linas

Sarah Weaver

Apr 26, 2019, 12:57:55 PM
to ope...@googlegroups.com
Hey did my last message show up in spam again? :P


Linas Vepstas

May 1, 2019, 7:43:23 PM
to Anton Kolonin @ Gmail, Ben Goertzel, lang-learn, link-grammar, opencog
Hi Anton, sorry for the very late reply.

On Tue, Apr 23, 2019 at 8:25 PM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:

Linas, how would you "weight the disjuncts"?

We know how to weight the words (by frequency), and word pairs (by MI).

But how would you weight the disjuncts?


That is a very good question. There are several (many) different kinds of weighting schemes. I do not know which is best.  That is the point where I last left things, half a year ago now.  But first, some theoretical preliminaries.

Given any ordered pair at all -- thing-a and thing-b -- you can compute the MI for the pair. Thing-a does not have to be the same type as thing-b. In this case, the pair of interest is (word, one-of-the-disjuncts-on-that-word).  Write it as (w,d) for short.  The MI is defined the same as always:  MI(w,d) = log2 [ p(w,d) / ( p(w,*) p(*,d) ) ]  where p is the frequency of observation: p(w,d) = N(w,d)/N(*,*) as always; N is the observation count and * the wild-card sum.
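Spelled out as a sketch over sparse (w,d) counts (plain Python dictionaries standing in for the atomspace matrix API; the counts and disjunct names are toy numbers, not real data):

import math
from collections import defaultdict

# N[(w, d)] = number of times disjunct d was observed on word w (toy counts).
N = {("dog", "D- & S+"): 12, ("cat", "D- & S+"): 9,
     ("dog", "O- & D-"): 3,  ("ran", "S- & MV+"): 8}

N_total = sum(N.values())
N_w = defaultdict(int)   # N(w,*)  wild-card sums
N_d = defaultdict(int)   # N(*,d)
for (w, d), c in N.items():
    N_w[w] += c
    N_d[d] += c

def MI(w, d):
    """MI(w,d) = log2( p(w,d) / ( p(w,*) p(*,d) ) ), with p(x) = N(x) / N(*,*)."""
    p_wd = N[(w, d)] / N_total
    return math.log2(p_wd / ((N_w[w] / N_total) * (N_d[d] / N_total)))

print(round(MI("dog", "D- & S+"), 3))
# For Link Grammar, larger MI is better, so the LG cost would be -MI(w,d).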

The pipeline code already computes this; I'm not sure if you use it or not. It's in the `(use-modules (opencog matrix))` module; it computes MI for pairs-of-anythings in the atomspace.  It's generic, in that one can set up thing-a to be some/any collection of atoms in the atomspace, and thing-b can be any other collection of atoms, and it will start with the counts N(thing-a, thing-b) and compute probabilities, marginal probabilities, conditional probabilities, MI, entropies -- "the whole enchilada" of statistics you can do on pairs of things.  It's called "matrix" because "pairs of things" looks like an ordinary matrix [M]_ij.

Sounds boring, but here's the kicker: `(opencog matrix)` is designed to work with extremely sparse matrices, which every other package (e.g. scipy) will choke on.  For example: if thing-a = thing-b = words, and there are 100K words, then M_ij potentially has 100K x 100K = 10 giga-entries, which will blow up RAM if you try to store the whole matrix. In practice, 99.99% of them are zero (the observation count N(left-word, right-word) is zero for almost all word pairs).  So the atomspace is being used as storage for hyper-sparse matrices, and you can layer the matrix onto the atomspace any way that you want. It's like a linear cross-section through the atomspace: linear, vector, etc.
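The sparsity point, in miniature (a plain dictionary of observed pairs, storing nothing for the zero entries; the toy corpus is made up):

from collections import Counter

# Store only the observed word-pairs, never the dense vocabulary x vocabulary matrix.
pairs = Counter()
corpus = [["the", "dog", "ran"], ["the", "cat", "sat"]]
for sent in corpus:
    for i, left in enumerate(sent):
        for right in sent[i + 1:]:
            pairs[(left, right)] += 1

vocab = {w for s in corpus for w in s}
print(len(pairs), "stored entries, versus", len(vocab) ** 2, "entries in the dense matrix")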

OK, so .. the existing language pipeline computes MI(w,d) already, and given a word and a disjunct on that word, you can just look it up.  ... but if you are clustering words into word-classes, then the current code does not recompute MI(g,d) for some word-group ("grammatical class") g.  Or maybe it does recompute, but it might be incomplete or untested, or different, because maybe your code is different. For the moment, let me ignore clustering....

So, for link-grammar, just take -MI(w,d) and make that the link-grammar "cost".  Minus sign because larger-MI==better.

How well will that work? I dunno. This is new territory for me. Ben has long insisted on "surprisingness" as a better number to work with. I have not implemented surprisingness in the matrix code; nothing computes it yet.  Besides using MI, one can invent other things.  I strongly believe that MI is the correct choice, but I do not have any concrete proof.

If you do have grammatical clusters g, then perhaps one should use MI(w,g)+MI(g,d)  or maybe just use MI(g,d) by itself.  Likewise, if the disjunct 'd' is the result of collapsing-together a bunch of single-word disjuncts, maybe you should add MI(disjunct-class, single-disjunct) to the cost. I dunno.  I was half-way through these experiments when Ben re-assigned me, so this is all new territory.

 -- Linas



 

-Anton


24.04.2019 4:13, Linas Vepstas wrote:

Linas Vepstas

May 1, 2019, 8:07:26 PM
to Anton Kolonin @ Gmail, Ben Goertzel, lang-learn, link-grammar, opencog


On Wed, Apr 24, 2019 at 9:31 PM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:

Ben, Linas, here is full set of results generated by Alexey:

Results update:


My gut intuition is that the most interesting numbers would be this:
 
MWC(GT) MSL(GT) PA      F1

5       2        
5       3        
5       4        
5       5        
5       10       
5       15       
5       25       

because I think that "5" gets you over the hump for central-limit.  But, per earlier conversation: the disjuncts need to be weighted by something, as otherwise, you will get accuracy more-or-less exactly equal to MST accuracy. Without weighting, you cannot improve on MST.   The weighting is super-important to do, and discovering the best weighting scheme is one major task (is it MI, surprisingness, something else?)
 

I just thought that "Currently, Identical Lexical Entries (ILE) algorithm builds single-germ/multi-disjunct lexical entries (LE) first, and then aggregates identical ones based on unique combinations of disjuncts" is sufficient.

OK, so, by "lexical entry", I guess you mean "a single word-disjunct pair",  where he disjunct connectors have not been clustered? So, yes, if they are identical, then yes, you should add together the observation counts.  (It's important to keep track of observation counts; this is needed for computing MI.)

Note that, in principle, a "lexical entry" could also be a (grammatical-class, disjunct) pair, or a (word, disjunct-class) pair, or a (grammatical-class, disjunct-class) pair, where "grammatical-class" is a cluster, and "disjunct-class" is a disjunct with connectors to such classes (instead of connectors to individual words).  And please note: what I mean by "disjunct class" might not be the same thing as what you think it means, and so, without a lot of extra explanation, it gets confusing again.

At any rate, if you keep the clusters and aggregates in the atomspace, then the "matrix" code can compute MI's for them all.  Else, you have to redesign that from scratch.

Side note: one reason I wanted everything in the atomspace was so that I could apply the same class of algos -- computing MI, joining collections of atoms into networks, MST-like, then clustering, then recomputing MI again, etc. -- and leverage that to obtain synonyms, word-senses, synonymous phrases, pronoun referents, etc., all without having to have a total redesign.  To basically treat networks generically -- not just networks of words, but networks of anything, expressed as atoms.

--linas

In meantime, it is in the code:

https://github.com/singnet/language-learning/blob/master/src/grammar_learner/clustering.py#L276

Cheers,

-Anton


23.04.2019 16:54, Ben Goertzel wrote:

Linas Vepstas

May 1, 2019, 8:09:05 PM
to opencog, Sarah Weaver
On Fri, Apr 26, 2019 at 11:57 AM Sarah Weaver <lwfl...@gmail.com> wrote:
Hey did my last message show up in spam again? :P

The above is the full text of what I received from you, and nothing more.

--linas
 

Ben Goertzel

May 6, 2019, 1:30:35 AM
to Anton Kolonin @ Gmail, Linas Vepstas, lang-learn, link-grammar, opencog


On Sun, May 5, 2019 at 10:15 PM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:

Hi Linas, I am re-reading your emails and updating our TODO issues from some of them.

Not sure about this one:

>Did Deniz Yuret falsify his thesis data? He got better than 80% accuracy; we should too.

I don't recall Deniz Yuret comparing MST-parses to LG-English-grammar-parses.


Linas: Where does the > 80% figure come from?

This paper of Yuret's

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.5016&rep=rep1&type=pdf

cites 53% accuracy compared against "dependency parses derived from dependency-grammar-izing Penn Treebank parses on WSJ text" ....   It was written after his PhD thesis.  Is there more recent work by Yuret that gives massively better results?  If so I haven't seen it.

Spitkovsky's more recent work on unsupervised grammar induction seems to have gotten better statistics than this, but it used radically different methods.



a) Seemingly "worse than LG-English" "sequential parses" provide seemingly better "LG grammar" - that may be some mistake, so we will have to double-check this.

Anton -- Have you looked at the inferred grammar for this case, to see how much sense it makes conceptually?

Using sequential parses is basically just using co-occurrence rather than syntactic information

I wonder what would happen if you used *both* the sequential parse *and* some fancier hierarchical parse as inputs to clustering and grammar learning?   I.e. don't throw out the information of simple before-and-after co-occurrence, but augment it with information from the statistically inferred dependency parse tree...




-- Ben

Andres Suarez

May 6, 2019, 3:37:03 PM
to Ben Goertzel, Anton Kolonin @ Gmail, Linas Vepstas, lang-learn, link-grammar, opencog


On Mon, May 6, 2019, 13:30 Ben Goertzel <b...@goertzel.org> wrote:


On Sun, May 5, 2019 at 10:15 PM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:

Hi Linas, I am re-reading your emails and updating our TODO issues from some of them.

Not sure about this one:

>Did Deniz Yuret falsify his thesis data? He got better than 80% accuracy; we should too.

I don't recall Deniz Yuret comparing MST-parses to LG-English-grammar-parses.


Linas: Where does the > 80% figure come from?

This paper of Yuret's 


cites 53% accuracy compared against "dependency parses derived from dependency-grammar-izing Penn Treebank parses on WSJ text" ....   It was written after his PhD thesis.  Is there more recent work by Yuret that gives massively better results?  If so I haven't seen it.

Ben, what's the title of this paper? The link gives me a message about exceeding my daily paper download limit (??). I'm not sure how they evaluate there, but just wanted to mention that Yuret's thesis evaluates only links between "content-words", so that should indeed be considered when comparing against the ULL results.


Spitkovsky's more recent work on unsupervised grammar induction seems to have gotten better statistics than this, but it used radically different methods.



a) Seemingly "worse than LG-English" "sequential parses" provide seemingly better "LG grammar" - that may be some mistake, so we will have to double-check this.
Anton, Sergey, the attached image comes from another Yuret paper ( http://www.denizyuret.com/2006/06/dependency-parsing-as-classification.html?m=1 ). Although English is missing, it could help explain why sequential parses score well against MST-parses, related to our discussion during today's call.



Anton -- Have you looked at the inferred grammar for this case, to see how much sense it makes conceptually?

Using sequential parses is basically just using co-occurrence rather than syntactic information

I wonder what would happen if you used *both* the sequential parse *and* some fancier hierarchical parse as inputs to clustering and grammar learning?   I.e. don't throw out the information of simple before-and-after co-occurrence, but augment it with information from the statistically inferred dependency parse tree...




-- Ben


a.
Screenshot_20190507-031931.jpg

Matthew Ikle

May 6, 2019, 3:49:13 PM
to suarez...@gmail.com, Ben Goertzel, Anton Kolonin @ Gmail, Linas Vepstas, lang-learn, link-grammar, 'Nil Geisweiller' via opencog
Andres,

I just happened to see your post. The title of Yuret’s paper is: Lexical Attraction Models of Language

Also I have had luck with removing cookies on some websites to reset download limits so you can try that as well — only works for certain websites though.

—matt


Ben Goertzel

May 7, 2019, 12:01:44 AM
to Andres Suarez, Anton Kolonin @ Gmail, Linas Vepstas, lang-learn, link-grammar, opencog
>> This paper of Yuret's
>>
>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.5016&rep=rep1&type=pdf
>>
>> cites 53% accuracy compared against "dependency parses derived from dependency-grammar-izing Penn Treebank parses on WSJ text" .... It was written after his PhD thesis. Is there more recent work by Yuret that gives massively better results? If so I haven't seen it.
>
>
> Ben, what's the title of this paper?


Lexical Attraction Models of Language , by Deniz Yuret

Ben Goertzel

May 7, 2019, 4:10:03 AM
to Anton Kolonin @ Gmail, Andres Suarez, lang-learn, link-grammar, opencog
I don't think we want an arithmetic average of distance and MI, maybe more like

f(1) = C >1
f(1) > f(2) > f(3) > f(4)
f(4) = f(5) = ... = 1

and then

f(distance) * MI

i.e. maybe we count the MI significantly more if the distance is
small... but if MI is large and distance is large, we still count the
MI a lot...

(of course the decreasing function f becomes the thing to tune here...)
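Something like this, as a sketch (the breakpoints in f are placeholder values to be tuned, and MI is assumed non-negative here, as in the usual positive-MI MST setting):

def f(distance, c=2.0):
    """Decreasing weight: f(1) = C > 1, falling to 1 at distance 4 and beyond."""
    table = {1: c, 2: 1.6, 3: 1.3}
    return table.get(distance, 1.0)

def link_score(mi, distance):
    # count the MI more when the two words are close,
    # but still count a large MI a lot even at long range
    return f(distance) * mi

print(link_score(3.0, 1), link_score(3.0, 8))   # 6.0 versus 3.0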



On Tue, May 7, 2019 at 12:58 AM Anton Kolonin @ Gmail
<akol...@gmail.com> wrote:
>
> Andres, can you upload the sequential parses that you have evaluated and provide them in the comments to the cells?
>
> Ben, I think the 0.67-0.72 corresponds to naive impression that 2/3-3/4 of word-to-word connections in English is "sequential" and the rest is not. For Russian and Portuguese, it would be somewhat less, I guess.
>
> What you suggest here ("used *both* the sequential parse *and* some fancier hierarchical parse as inputs to clustering and grammar learning? I.e. don't throw out the information of simple before-and-after co-occurrence, but augment it with information from the statistically inferred dependency parse tree") can be simply (I guess) implemented in existing MST-Parser given the changes that Andres and Claudia have done year ago.
>
> That could be tried with "distance_vs_MI" blending parameter in the MST-Parser code which accounts for word-to-word distance. So that if the distance_vs_MI=1.0 we would get "sequential parses", distance_vs_MI=0.0 would produce "Pure MST-Parses", distance_vs_MI=0.7 would provide "English parses", distance_vs_MI=0.5 would provide "Russian parses", does it make sense, Andres?
>
> Ben, do you want let Andres to try this - get parses with different distance_vs_MI in range 0.0-1.0 an see what happens?
>
> This could be tried both ways using traditional MI or DNN-MI, BTW.
>
> Cheers,
>
> -Anton
>
>
> 06.05.2019 12:30, Ben Goertzel :



andres

May 9, 2019, 12:30:56 PM
to Anton Kolonin @ Gmail, Ben Goertzel, Andres Suarez, lang-learn, link-grammar, opencog

Anton, sequential and random parses are in D56 and D57. Or do you want specifically the ones for GS and SS? If so, please tell me where you want them, to avoid messing with your file structure.

Yes, the mix of distance and MI is what we have been doing when we use the distance weighting in MST parsing. But as I noted before, we should find a good tuning for each case, because the MI values vary over about two orders of magnitude.
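For example, something along these lines -- a rough guess at the blending, not the actual distance_vs_MI code in the MST-parser; the main point is that raw MI has to be rescaled before mixing it with a distance term:

def blended_score(mi, distance, mi_min, mi_max, distance_vs_mi=0.5):
    """Blend a normalized MI with a simple distance term."""
    mi_norm = (mi - mi_min) / (mi_max - mi_min)   # squash MI into [0, 1]
    dist_term = 1.0 / distance                    # 1 for neighbors, decays with separation
    return (1.0 - distance_vs_mi) * mi_norm + distance_vs_mi * dist_term

# distance_vs_mi = 1.0 would reproduce "sequential" parses (nearest links always win),
# 0.0 would give pure-MI, MST-style parses; intermediate values blend the two.
print(blended_score(4.2, 1, mi_min=0.0, mi_max=12.0))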

a.

On 07/05/19 15:58, Anton Kolonin @ Gmail wrote:

Andres, can you upload the sequential parses that you have evaluated and provide them in the comments to the cells?

Ben, I think the 0.67-0.72 corresponds to naive impression that 2/3-3/4 of word-to-word connections in English is "sequential" and the rest is not. For Russian and Portuguese, it would be somewhat less, I guess.

What you suggest here ("used *both* the sequential parse *and* some fancier hierarchical parse as inputs to clustering and grammar learning?   I.e. don't throw out the information of simple before-and-after co-occurrence, but augment it with information from the statistically inferred dependency parse tree") can be simply (I guess) implemented in the existing MST-Parser, given the changes that Andres and Claudia made a year ago.

That could be tried with "distance_vs_MI" blending parameter in the MST-Parser code which accounts for word-to-word distance. So that if the distance_vs_MI=1.0 we would get "sequential parses", distance_vs_MI=0.0 would produce "Pure MST-Parses", distance_vs_MI=0.7 would provide "English parses", distance_vs_MI=0.5 would provide "Russian parses", does it make sense, Andres?

Ben, do you want to let Andres try this - get parses with different distance_vs_MI in the range 0.0-1.0 and see what happens?

This could be tried both ways using  traditional MI or DNN-MI, BTW.

Cheers,

-Anton


06.05.2019 12:30, Ben Goertzel :

-- 
-Anton Kolonin
skype: akolonin
cell: +79139250058

Linas Vepstas

Jun 15, 2019, 4:23:43 PM
to Anton Kolonin @ Gmail, Ben Goertzel, lang-learn, link-grammar, opencog
On Mon, May 6, 2019 at 12:15 AM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:

Hi Linas, I am re-reading your emails and updating our TODO issues from some of them.

Not sure about this one:

>Did Deniz Yuret falsify his thesis data? He got better than 80% accuracy; we should too.

I don't recall Deniz Yuret comparing MST-parses to LG-English-grammar-parses.

Of course he didn't. He compared his results to something else. However, given that LG compares well with other things, I'm sure he would have gotten a similar level of accuracy.  This is foundational.  If this is not the case, it is important to understand why.


>Actually, one of my proposals from the previous block of emails was to make MST worse!  I'm so sick of hearing about MST that I proposed getting rid of it, and replacing it with something of lower-quality, and focus on the clustering and disjunct weighting schemes to improve accuracy.

What do you mean by "something"?

I've talked about this earlier in the email chain. There are a variety of ways of creating parse trees and parse graphs that are looser, simpler and of lower-quality than MST.

b) We have been studying the Pearson(parses,grammar) correlation for MWC=1, and it may happen that MWC>1 will change the pattern,

In my experiments, I take MWC > 500 for my small datasets, and MWC > 40 for large ones.  Again, any MWC less than 5 is "insane" -- it is impossible to conclude anything at all from a single observation of a word.  You need to observe a word, used in context, at least 5 times, and maybe 50 times, and maybe 500 times, before you can make any kind of accurate judgment as to its grammatical behavior.

I do not know how accuracy depends on MWC, but I am certain that an MWC of at least 5 is strongly required.  I know that I personally get good results when I use MWC in the 50 to 500 range.


>I'm fairly certain that replacing MST with something lower-quality will still work well. If that is not the case, then  that means that the disjunct-processing stages are somehow being done wrong.  The final result should not depend very much on the accuracy of MST. And this does not require a huge corpus, either. If there is a strong dependence on MST, something is seriously wrong, seriously broken in the disjunct-processing stages.  We need to spend energy on fixing that brokenness and not on making MST better.

Again, if you "get rid of MST", where do you get the disjuncts? Just use "all possible combinations of high-MI word pairs in a sentence"?

Does this issue seem like a spec for that?

Yes. That is one example. One can be creative, and have others.

The explanation, link to the current code and suggestion to improve are here:

https://github.com/singnet/language-learning/issues/207

I'm looking now.

The definitions that we use based on your "Sheaves" work can be found here:

From what I can tell, the processing pipeline that you use does not actually do any sheaf-based calculations.  (Neither does my pipeline -- I did start work on that last summer, got part-way done, and then put it on hold. I am now trying to restart that work.  I get the feeling that no one ever actually understood what the point of the sheaf was, so on my to-do list is to try to explain that, again.)

-- Linas

Linas Vepstas

Jun 15, 2019, 4:39:43 PM
to Ben Goertzel, Anton Kolonin @ Gmail, lang-learn, link-grammar, opencog
On Mon, May 6, 2019 at 12:30 AM Ben Goertzel <b...@goertzel.org> wrote:


On Sun, May 5, 2019 at 10:15 PM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:

Hi Linas, I am re-reading your emails and updating our TODO issues from some of them.

Not sure about this one:

>Did Deniz Yuret falsify his thesis data? He got better than 80% accuracy; we should too.

I don't recall Deniz Yuret comparing MST-parses to LG-English-grammar-parses.


Linas: Where does the > 80% figure come from?

I am looking at his PhD thesis, page 42, section "4.3 Results" and specifically figure 4-2 -- I see best-case precision of 75%, "typical" precision of 65% and recall in the 40% to 50% range.

-- Linas

Linas Vepstas

Jun 15, 2019, 4:48:00 PM
to Ben Goertzel, Anton Kolonin @ Gmail, lang-learn, link-grammar, opencog
Attached is a screenshot from the relevant parts of Yuret's thesis, in case you have trouble finding this.

-- Linas

Screenshot at 2019-06-15 15-46-16.png

Linas Vepstas

Jun 21, 2019, 6:38:29 PM
to Anton Kolonin @ Gmail, Ben Goertzel, lang-learn, link-grammar, opencog, Alexei Glushchenko, Andres Suarez, Matthew Ikle
Anton,

It's not clear if you fully realize this yet, or not, but you have not just one
but two major breakthroughs here. I will explain them shortly, but first,
can you send me your MST dictionary?  Of the three that you'd sent earlier,
none had the MST results in them.

OK, on to the major breakthroughs... I describe exactly what they are in the attached PDF.  It supersedes the PDF I had sent out earlier, which contained invalid/incorrect data. This new PDF explains exactly what works, and what you've found. Again, it's important, and I'm very excited by it.  I hope Ben is paying attention; he should understand this.  This really paves the way to forward motion.

BTW, your datasets that "rock"? Actually, they suck, when tested out-of-training-set.
This is probably the third but more minor discovery: the Gutenberg training set
offers poor coverage of modern English, and also your training set is wayyyy too small.
All this is fixable, and is overshadowed by the important results.

Let me quote myself for the rest of this email.  This is quoted from the PDF. 
Read the whole PDF, it makes a few other points you should understand.

ull-lgeng

Based on LG-English parses: obtained from http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-FULL-ALE-dILEd-2019-04-10/context:2_db-row:1_f1-col:11_pa-col:6_word-space:discrete/

I believe that this dictionary was generated by replacing the MST step with a parse where linkages are obtained from LG; these are then busted up back into disjuncts. This is an interesting test, because it validates the fidelity of the overall pipeline. It answers the question: “If I pump LG into the pipeline, do I get LG back out?” and the answer seems to be “yes, it does!” This is good news, since it implies that the overall learning process does keep grammars invariant. That is, whatever grammar goes in, that is the grammar that comes out!

This is important, because it demonstrates that the apparatus is actually working as designed, and is, in fact, capable of discovering grammar in data! This suggests several ideas:

* First, verify that this really is the case, with a broader class of systems. For example, start with the Stanford Parser, pump it through the system. Then compare the output not to LG, but to the Stanford parser. Are the resulting linkages (the F1 scores) at 80% or better? Is the pipeline preserving the Stanford grammar? I'm guessing it does...

* The same, but with Parsey McParseface.

* The same, but with some known-high-quality HPSG system.

If the above two bullet points hold up, then this is a major breakthrough, in that it solves a major problem. The problem is that of evaluating the quality of the grammars generated by the system. To what should they be compared? If we input MST parses, there is no particular reason to believe that they should correspond to LG grammars. One might hope that they would, based, perhaps, on some a-priori hand-waving about how most linguists agree about what the subject and object of a sentence is. One might in fact find that this does hold up to some fair degree, but that is all. Validating grammars is difficult, and seems ad hoc.

This result offers an alternative: don't validate the grammar; validate the pipeline itself. If the pipeline is found to be structure-preserving, then it is a good pipeline. If we want to improve or strengthen the pipeline, we now have a reliable way of measuring, free of quibbles and argumentation: if it can transfer an input grammar to an output grammar with high fidelity, with low loss and low noise, then it is a quality pipeline. It instructs one how to tune a pipeline for quality: work with these known grammars (LG/Stanford/McParse/HPSG) and fiddle with the pipeline, attempting to maximize the scores. Build the highest-fidelity, lowest-noise pipeline possible.

This allows one to move forward. If one believes that probability and statistics are the correct way of discerning reality, then that's it: if one has a high-fidelity corpus-to-grammar transducer, then whatever grammar falls out is necessarily, a priori a correct grammar. Statistics doesn't lie. This is an important breakthrough for the project.

ull-sequential

Based on "sequential" parses: obtained from http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-FULL-SEQ-dILEd-2019-05-16-94/GL_context:2_db-row:1_f1-col:11_pa-col:6_word-space:discrete/

I believe that this dictionary was generated by replacing the MST step with a parse where there are links between neighboring words, and then extracting disjuncts that way. This is an interesting test, as it leverages the fact that most links really are between neighboring words. The sharp drawback is that it forces each word to have an arity of exactly two, which is clearly incorrect.

ull-dnn-mi

Based on "DNN-MI-lked MST-Parses": obtained from http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-GUCH-SUMABS-dILEd-2019-05-21-94/GL_context:2_db-row:1_f1-col:11_pa-col:6_word-space:discrete/

I believe that this dictionary was generated by replacing the MST step with a parse where some sort of neural net is used to obtain the parse.

Comparing either of these to the ull-sequential dictionary indicates that precision is worse, recall is worse, and F1 is worse. This vindicates some statements I made earlier: the quality of the results at the MST-like step of the process matters relatively little for the final outcome. Almost anything that generates disjuncts with slightly-better-than-random accuracy will do. The key to learning is to accumulate many disjuncts: just as in radio signal processing, or any kind of frequentist statistics, integrate over a large sample, hoping that the noise will cancel out while the invariant signal is repeatedly observed and boosted.

On Thu, Jun 20, 2019 at 11:11 PM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:

It turns out the difference between applying MWC to both GL and GT (lower block) and applying it to GT only (upper block) is marginal - applying it to GL makes results 1% better.

So far, testing on the full set of LG-English parses (including partially parsed ones) as a reference:


As we know, MWC=2 is much better than MWC=1, with no further improvement beyond that.

"Sequential parses" rock, MST and "random" parses suck.

Pearson(parses,grammar) = 1.0

Alexey is running this with "silver standard" for MWC=1,2,3,4,5,10

-Anton 

[Attachments: gc-gl_on_fully_parsed-mwc_1-5.png, grammar-report.pdf]

Linas Vepstas

unread,
Jun 22, 2019, 11:18:02 AM6/22/19
to Anton Kolonin @ Gmail, Ben Goertzel, lang-learn, link-grammar, opencog, Alexei Glushchenko, Andres Suarez, Matthew Ikle
Hi Anton,

On Sat, Jun 22, 2019 at 2:32 AM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:

CAUTION: *** parses in the folder with dict files are not the inputs, but outputs - they are produced on the basis of the grammar in the same folder; I am listing the input parses below !!! ***

I did not look either at your inputs or your outputs; they are irrelevant for my purposes. It is enough for me to know that you trained on some texts from Project Gutenberg.  When I evaluate the quality of your dictionaries, I do not use your inputs, or outputs, or software; I have an independent tool for the evaluation of your dictionaries.

It would be very useful if you kept track of how many word-pairs were counted during training.  There are two important statistics to track: the number of unique word-pairs, and the total number observed, with multiplicity.  These two numbers are important summaries of the size of the training set.  There are two other important numbers: the number of *unique* words that occurred on the left side of a pair, and the number of unique words that occurred on the right side of a pair. These two will be almost equal, but not quite.  It would be very useful for me to know these four numbers: the first two characterize the *size* of your training set; the second two characterize the size of the vocabulary.
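
For example, something along these lines would suffice (a sketch; the variable and function names are made up here):

from collections import Counter

def pair_statistics(observed_pairs):
    """observed_pairs: every (left_word, right_word) pair counted during
    training, with repetitions."""
    counts = Counter(observed_pairs)
    return {
        "unique word-pairs": len(counts),
        "total pairs observed (with multiplicity)": sum(counts.values()),
        "unique left-side words": len({l for l, _ in counts}),
        "unique right-side words": len({r for _, r in counts}),
    }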

- row 63, learned NOT from parses produced by DNN, BUT from honest MST-Parses; however, the MI-values for that were extracted from the DNN and made specific to the context of every sentence, so each pair of words could have different MI-values in different sentences:

OK, look: MI has a very precise definition. You cannot use some other number you computed, and then call it "MI". Call it something else.  Call it "DLA" -- Deep Learning Affinity. Affinity, because the word "information" also has a very precise definition: it is a (negative) log-base-2 of a probability.  If it is not that, then it cannot be called "information".   Or call it "BW" -- Bertram Weights, if I understand correctly.
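
For the record, the definition in question, estimated from word-pair counts (a sketch; the real code presumably keeps these counts elsewhere, e.g. in the atomspace):

import math
from collections import Counter

def mutual_information(pair_counts):
    """Pointwise MI of each word pair (l, r):
           MI(l, r) = log2( p(l, r) / ( p(l, *) * p(*, r) ) )
    where the probabilities are frequencies taken from the observed pair counts."""
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()
    for (l, r), n in pair_counts.items():
        left[l] += n
        right[r] += n
    return {(l, r): math.log2((n / total) / ((left[l] / total) * (right[r] / total)))
            for (l, r), n in pair_counts.items()}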

So, if I understand correctly, you computed some kind of DLA/BW number for word-pairs, and then performed an MST parse using those numbers?

 exported in new "ull" format invented by Man Hin:

Side-comment -- you guys seem to be confused about what the atomspace is, and what it is good for.  The **whole idea** of the atomspace is that it is a "one size fits all" format, so that you do not have to "invent" new formats.  There is a reason why databases, and graph databases are popular. Inventing new file formats is a well-paved road to hell.

Regarding what you call "the breakthroughs":

>Results from the ull-lgeng dataset indicate that the ULL pipeline is a high-fidelity transducer of grammars. The grammar that is pushed in is effectively the same as the grammar that falls out. If this can be reproduced for other grammars, e.g. Stanford, McParseface or some HPSG grammar, then one has a reliable way of tuning the pipeline. After it is tuned to maximize fidelity on known grammars, then, when applied to unknown grammars, it can be assumed to be working correctly, so that whatever comes out must in fact be correct.

That has been worked on according to the plan set up way back in 2017. I am glad that you accept the results. Unfortunately, the MST-Parser is not built into the pipeline yet, but it is on the way.

If someone like you could help with the outstanding work items, it would be appreciated, because we are short-handed now.

>The relative lack of differences between the ull-dnn-mi and the ull-sequential datasets suggests that the accuracy of the so-called "MST parse" is relatively unimportant. Any parse giving better-than-random outputs can be used to feed the pipeline. What matters is that a lot of observation counts need to be accumulated so that junky parses cancel each other out, on average, while good ones add up and occur with high frequency. That is, if you want a good signal, then integrate long enough that the noise cancels out.

I would disagree (and I guess Ben may disagree as well) given the existing evidence with "full reference corpus".

I think you are mis-interpreting your own results. The "existing evidence" proves the opposite of what you believe. (I suspect Ben is too busy to think about this very deeply).

If you compare F1 for LG-English parses with MST > 2 on tab "MWC-Study", you will find the F1 on LG-English parses is decent, so it is not that "parses do not matter", it is rather just "MST-Parses are even less accurate than sequential".

You are mis-understanding what I said; I think you are also mis-understanding what your own data is saying.

The F1-for-LG-English is high because of two reasons: (1) natural language grammar has the "decomposition property" (aka "lexical property"), and (2) You are comparing the decomposition provided by LG to LG itself.

The "decomposition property" states that "grammar is lexical".  Natural language is "lexical" when it's structure can be described by a "lexis" -- a dictionary, whose dictionary headings are words,  end whose dictionary entries are word-definitions of some kind -- disjuncts for LG; something else for Stanford/McParseface/HPSG/etc.

If you take some lexical grammar (Stanford/McParseface/whatever), generate a bunch of parses, run them through the ULL pipeline, and learn a new lexis, then, ideally, if your software works well, that *new* lexis should come close to the original input lexis. And indeed, that is what you are finding with F1-for-LG-English.

Your F1-for-LG-English results indicate that if you use LG as input, then ULL correctly learns the LG lexis. That is a good thing.  I believe that ULL will also be able to do this for any lexis... provided that you take enough samples.  (There is a lot of evidence that your sample sizes are much too small.)

Let's assume, now, that you take Stanford parses, run them through ULL, learn a dict, and then measure F1-for-Stanford against parses made by Stanford. The F1 should be high. Ideally, it should be 1.0.  If you measure that learned lexis against LG, it will be lower - maybe 0.9, maybe 0.8, maybe as low as 0.65. That is because Stanford is not LG; there is no particular reason for these two to agree, other than in some general outline: they probably mostly agree on subjects, objects and determiners, but will disagree on other details (aux verbs, "to be", etc.)

Do you see what I mean now? The ULL pipeline should preserve the lexical structure of language.  If you use lexis X as input, then ULL should generate something very similar to lexis X as output.   You've done this for X==LG. Do it for X=Stanford, McParseface, etc. If you do, you should see F1=1.0 for each of these (well, something close to F1=1.0)

Now for part two:  what happens when X==sequential, what happens when X==DNN-MI (aka "bertram weights") and what happens when X=="honest MI" ?

Let's analyze X==sequential first. First of all, this is not a lexical grammar. Second of all, it is true that for English, and for just about *any* language, "sequential" is a reasonably accurate approximation of the "true grammar".  People have actually measured this. I can give you a reference that gives numbers for the accuracy of "sequential" for 20 different languages. One paper measures "sequential" for Old English, Middle English, 17th, 18th, 19th and 20th century English, and finds that English becomes more and more sequential over time! Cool!

If you train on X==sequential and learn a lexis, and then compare that lexis to LG, you might find that F1=0.55 or F1=0.6 -- this is not a surprise.  If you compare it to Stanford, McParseface, etc. you will also get F1=0.5 or 0.6 -- that is because English is kind-of sequential.

If you train on X==sequential and learn a lexis, and then compare that lexis to "sequential", you will get ... kind-of-crap, unless your training dataset is extremely large, in which case you might approach F1=1.0  However, you will need to have an absolutely immense training corpus size to get this -- many terabytes and many CPU-years of training.  The problem is that "sequential" is not lexical.  It can be made approximately lexical, but that lexis would have to be huge.

What about X==DNN-Bert  and X==MI?  Well, neither of those are lexical, either.  So you are using a non-lexical grammar source, and attempting to extract a lexis out of it.  What will you get?  Well -- you'll get ... something. It might be kind-of-ish LG-like. It might be kind-of-ish Stanford-like. Maybe kind-of-ish HPSG-like. If your training set is big enough (and your training sets are not big enough) you should get at least 0.65 or 0.7 maybe even 0.8 if you are lucky, and I will be surprised if you get much better than that.

What does this mean?  Well, the first claim  is "ULL preserves lexical grammars" and that seems to be true. The second claim is that "when ULL is given a non-lexical input, it will converge to some kind of lexical output".

The third claim, "the Linas claim", that you love to reject, is that "when ULL is given a non-lexical input, it will converge to the SAME lexical output, provided that your sampling size is large enough".  Normally, this is followed by a question "what non-lexical input makes it converge the fastest?" If you don't believe the third claim, then this is a non-sense question.  If you do believe the third claim, then information theory supplies an answer: the maximum-entropy input will converge the fastest.  If you believe this answer, then the next question is "what is the maximum entropy input?" and I believe that it is honest-MI+weighted-clique. Then there is claim four: the weighted clique can be approximated  by MST.

It is now becoming clear to me that MST is a kind-of mistake, and that a weighted clique would probably be better, faster-converging. Maybe. The problem with all of this is rate-of-convergence, sample-set-size, amount-of-computation.  It is easy to invent a theoretically ideal NP-complete algorithm; it's much harder to find something that runs fast.
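
For clarity, here is the MST step in sketch form: a maximum spanning tree over word pairs scored by MI, or by whatever affinity is on hand. This is illustrative only; the actual parser also imposes constraints (e.g. planarity) that are ignored here.

def mst_parse(words, score):
    """Greedy Prim-style maximum spanning tree over word positions.
    score(w1, w2) -> float, e.g. the MI of the word pair."""
    n = len(words)
    in_tree = {0}
    links = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree:
                    s = score(words[i], words[j])
                    if best is None or s > best[0]:
                        best = (s, i, j)
        _, i, j = best
        links.append((i, j))
        in_tree.add(j)
    return links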

Anyway, since you don't believe my third claim, I have a proposal. You won't like it. The proposal is to create a training set that is 10x bigger than your current one, and one that is 100x bigger than your current one.  Then run "sequential", "honest-MI" and "DNN-Bert" on each.  All three of these will start to converge to the same lexis. How quickly? I don't know. It might take a training set that is 1000x larger.  But that should be enough; larger than that will surely not be needed. (famous last words. Sometimes, things just converge slowly...)

-- Linas
 

Still, we have got a "surprise-surprise" with the "gold reference corpus". Note, it still says "parses do matter, but MST-Parses are as bad or as good as sequential, and both are still not good enough". Also note that it has been obtained on just 4 sentences, which is not reliable evidence.

Now, we are working full-throttle on proving your claim with the "silver reference corpus" - stay tuned...

Cheers,

-Anton

-- 
-Anton Kolonin
skype: akolonin
cell: +79139250058

Ben Goertzel

unread,
Jun 22, 2019, 11:49:54 AM6/22/19
to opencog, Anton Kolonin @ Gmail, lang-learn, link-grammar, Alexei Glushchenko, Andres Suarez, Matthew Ikle
Hi,

I think everyone understands that

***
The third claim, "the Linas claim", that you love to reject, is that
"when ULL is given a non-lexical input, it will converge to the SAME
lexical output, provided that your sampling size is large enough".
***

but it's not clear for what cases feasible-sized corpora are "large enough" ...

ben




Linas Vepstas

unread,
Jun 22, 2019, 12:13:57 PM6/22/19
to Ben Goertzel, opencog, Anton Kolonin @ Gmail, lang-learn, link-grammar, Alexei Glushchenko, Andres Suarez, Matthew Ikle
Hi Ben,

On Sat, Jun 22, 2019 at 10:49 AM Ben Goertzel <b...@goertzel.org> wrote:
Hi,

I think everyone understands that

***
The third claim, "the Linas claim", that you love to reject, is that
"when ULL is given a non-lexical input, it will converge to the SAME
lexical output, provided that your sampling size is large enough".
***

but it's not clear for what cases feasible-sized corpora are "large enough" ...

Yes, this is the magic question, but there is now a way of answering it.  If the input parses come from LG, then how much training is needed to approach F1=1.0 for the learned lexis? If the input parses come from Stanford, then how much training is needed to approach F1=1.0 for the learned lexis? Ditto McParseface... these are your "calibrators"; they allow you to calibrate the speed of convergence of the ULL pipeline ... they allow you to experiment with various stages (i.e. clustering) with the goal of minimizing the training set, maximizing the score, minimizing the CPU-hours.

Let's now suppose that a corpus of N sentences is enough to get to F1=0.95 on three of these systems.   Then one might expect that N is enough to approximately converge to the "true" lexical structure of English, starting with MI, or with Bert-weights, or something else.

Also: the way to measure convergence is not to compare the learned lexis to a "golden text hand-created by a linguist", but to compare these various dictionaries to each-other.  As the training size increases, do they become more and more similar?   Yes, of course, ideally, they should come darned close to the human-linguist-created parses, and disagreements should be examined under a microscope.

Here: when I measure Anton's dictionaries against Anton's golden corpus, I find that HALF of the sentences in the golden corpus contain words that are NOT in the dictionary!  (There are 229 sentences in the golden reference; of these, only 113 sentences have all words in the dictionary.  They contain only 418 unique vocabulary words.  This is typical not only of Anton's dicts, but also my own: the vocabulary of the training set overlaps poorly with the vocabulary of the test set -- any test set, not just the golden one.  This is Zipf's law in spades.)
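
That coverage gap is easy to measure; a sketch (naive whitespace tokenization, illustrative names):

def coverage(test_sentences, dictionary_words):
    """A test sentence is usable only if every one of its words is in the
    learned dictionary; return (usable, total)."""
    vocab = set(dictionary_words)
    usable = sum(1 for s in test_sentences if all(w in vocab for w in s.split()))
    return usable, len(test_sentences)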

--linas

