Re: Fwd: Unsupervised Lang... - This comes from works of +linasvepsta...


Ben Goertzel

Aug 4, 2018, 2:00:49 AM8/4/18
to Linas Vepstas, Anton Kolonin @ Gmail, Claudia Castillo, mur...@iis.nsk.su, Tatiana Batura, Andres Suarez, Oleg Baskov, lang-learn, opencog
Hi Linas,

Nice stuff!

A quick comment regarding

***

More precisely, the factorization of the language model appears to require three important ingredients:

  • A way of decomposing word-vectors into sums of word-sense vectors,

  • A way of performing biclustering, so as to split the bipartite graph p (w, d) into left, central and right components, holding the left and right parts to be sparse,

  • Using an information-theoretic similarity metric, to preserve the probabilistic interpretation of the contingency table p (w, d).

***

The first of these is, of course, what Adagram attempts to do ... and Andres has experimented with a variety of Adagram that replaces standard SkipGram with "SkipGram-in-a-parse" (to be done after a round of e.g. MST parsing) ....  But improvements on Adagram that incorporate broader context will be valuable...

The third of these makes total sense and is fortunately not a huge deal, as most clustering algorithms can work with whatever similarity metric you throw at them...

The second of these is the most interesting to me... basically it seems you are wanting to cluster (word, disjunct) pairs in a way that has high "clustering quality" both in the word dimension and in the disjunct dimension [i.e. so that both words are divided into meaningful clusters, and disjuncts are divided into meaningful clusters, even though the words and disjuncts are distinct and are stuck onto each other in various combinations]

This is interesting and could be attempted via many possible algorithms including of course k-means-like iterative algorithms or EM-like estimation algorithms....   (Or, as you note, evolutionary learning methods) ... Oleg may have some views on this...
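To make the biclustering idea concrete, here is a tiny sketch (not from the paper; the block matrix and the 2x2 split are invented for illustration) of spectral co-clustering, which partitions the rows and columns of a contingency matrix simultaneously, in the spirit described above:

```python
import numpy as np

def bicluster_2x2(counts):
    """Toy 2x2 spectral biclustering: split rows and columns by the sign
    of the second left/right singular vectors of the normalized matrix."""
    # Normalize to a contingency table p(w, d).
    P = counts / counts.sum()
    # Rescale by the row/column marginals, as in spectral co-clustering.
    r, c = P.sum(1), P.sum(0)
    A = P / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(A)
    # The second singular vectors carry the bipartition information.
    return (U[:, 1] > 0).astype(int), (Vt[1] > 0).astype(int)

# A block-structured toy matrix: two "word" groups, two "disjunct" groups.
M = np.array([[9, 8, 0, 1],
              [8, 9, 1, 0],
              [0, 1, 9, 8],
              [1, 0, 8, 9]], float)
rows, cols = bicluster_2x2(M)
```

On this toy matrix the row split and the column split recover the two blocks together, which is the "meaningful clusters in both dimensions" property in miniature.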

-- Ben

On Sat, Aug 4, 2018 at 12:05 PM, Linas Vepstas <linasv...@gmail.com> wrote:
Hi Anton,

Attached please find this week's new-and-improved version of the neural-nets-vs-symbolic-parsing document.  Improvements over the last version include:

More words spent explaining why:
  • k-means clustering is identical to matrix factorization -- this is not my result, it's an old result; I'm recapping it because you need to understand it to understand the next step, which is bicategorization.
  • Why bicategorization is the right thing to do -- because that is what the link-grammar dicts already do!  Bicategorization is also an old algo, from 2003 -- but if you look at it carefully, you can see that the Link Grammar dicts are ***exactly*** bicategorized contingency matrices.  That is to say: ordinary old-school linguists who manually write dependency grammars do so in a format that is naturally the same format as the output of a k-means bicategorization.
  • Why an information-theoretic divergence is much better than a cosine distance.  This is a lot more subtle, I suppose, because it requires you to see a vector dot-product as something that is invariant not under rotations (well, it is, but that misses the point), but rather under Markov transformations, which preserve probabilities.  This is because all of the vectors are rows and columns in a probability distribution.  Thus, cosine distance is "wrong", and Kullback-Leibler divergence is correct.  Again -- this is an old result, from 2003, but all of the people doing ordinary off-the-shelf k-means are oblivious to it, because their data is not a joint probability.  I try to spell this out in great detail, and to provide all of the explicit formulas you need to do this.
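The Markov-invariance point can be illustrated numerically (the distributions and stochastic matrix below are made-up toy numbers): pushing two distributions through a Markov transformation never increases their KL divergence -- the data-processing inequality -- which is the sense in which the divergence respects the probabilistic structure that cosine distance ignores.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) between probability vectors."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def markov(M, p):
    """Push distribution p through a column-stochastic matrix M."""
    return [sum(M[i][j] * p[j] for j in range(len(p))) for i in range(len(M))]

p = [0.5, 0.3, 0.2]          # toy conditional distributions p(d|w)
q = [0.2, 0.3, 0.5]
M = [[0.7, 0.2, 0.1],        # each column sums to 1: a Markov transformation
     [0.2, 0.6, 0.3],
     [0.1, 0.2, 0.6]]
# Data-processing inequality: D(Mp||Mq) <= D(p||q).
before, after = kl(p, q), kl(markov(M, p), markov(M, q))
```

No analogous guarantee holds for the cosine distance between the same vectors, which is the point being made above.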

This paper is still not yet done, but I think it lays out the groundwork much more nicely than before.  I am hoping that it is not hard to read -- again, I tried to mostly simplify everything.  I hope it's not oversimplified.

Anyway, I think it's a lot more promising, a better direction to go in than triadic k-means.  It's probably simpler too.

--linas


On Sun, Jul 22, 2018 at 12:01 AM, Anton Kolonin @ Gmail <akol...@gmail.com> wrote:

Hi Linas, thanks, I will look into that.

In the meantime, below, the guys are getting close with "triadic K-means":

http://aclweb.org/anthology/P18-2010

They use "FrameNet 1.7" and the "dataset of polysemous verb classes by Korhonen" for evaluation.

If we get these, we can compare to what extent we are doing better.

Cheers,

-Anton


On 20.07.2018 at 1:46, Linas Vepstas wrote:
Due to the obvious confusion that the sheaves paper caused for everyone,
I have started work on a different way of explaining it.  This one takes, as
its starting point, a description of the word2vec algorithms, and explains how
word2vec can be viewed as a sheaf.  So, if you are more comfortable with
that viewpoint, this might be a better way of grokking the concept.

The paper is very much an early draft; I've already re-written the final 2-3
pages since last night.  The title is likely to change. The introduction will change.
But maybe the middle bits will help clarify these issues.

Here:
https://github.com/opencog/opencog/raw/master/opencog/nlp/learn/learn-lang-diary/skippy.pdf

-- Linas

---------- Forwarded message ----------
From: Anton Kolonin (Google Docs) <d+MTEyNjExODQyMzA2NTk3MDYxMzEx-MTEzMTc0Njc2MTkyODczODc3MjIx@docs.google.com>
Date: Thu, Mar 1, 2018 at 4:16 AM
Subject: Unsupervised Lang... - This comes from works of +linasvepsta...
To: linasv...@gmail.com


Anton Kolonin mentioned you in a comment on Unsupervised Language Learning (ULL) Design Draft

Anton Kolonin
Section - collection of adjacent Seeds from a single sentence, a series of adjacent sentences, or an entire single text
Sheaf - unclearly defined combination of Sections and Lexical Entries representing a particular corpus

This comes from works of +linasv...@gmail.com - a clearer definition may be required and the potential use should be explored further




--
cassette tapes - analog TV - film cameras - you

-- 
-Anton Kolonin
skype: akolonin
cell: +79139250058
akol...@aigents.com
https://aigents.com
https://www.youtube.com/aigents
https://www.facebook.com/aigents
https://plus.google.com/+Aigents
https://medium.com/@aigents
https://steemit.com/@aigents
https://golos.blog/@aigents
https://vk.com/aigents






--
Ben Goertzel, PhD
http://goertzel.org

"The dewdrop world / Is the dewdrop world / And yet, and yet …" -- Kobayashi Issa

Linas Vepstas

Aug 4, 2018, 12:06:51 PM8/4/18
to Ben Goertzel, link-grammar, Anton Kolonin @ Gmail, Claudia Castillo, mur...@iis.nsk.su, Tatiana Batura, Andres Suarez, Oleg Baskov, lang-learn, opencog
Hi,

On Sat, Aug 4, 2018 at 1:00 AM, Ben Goertzel <b...@goertzel.org> wrote:
Hi Linas,

Nice stuff!

A quick comment regarding

***

More precisely, the factorization of the language model appears to require three important ingredients:

  • A way of decomposing word-vectors into sums of word-sense vectors,

  • A way of performing biclustering, so as to split the bipartite graph p (w, d) into left, central and right components, holding the left and right parts to be sparse,

  • Using an information-theoretic similarity metric, to preserve the probabilistic interpretation of the contingency table p (w, d).

***

The first of these is, of course, what Adagram attempts to do ... and Andres has experimented with a variety of Adagram that replaces standard SkipGram with "SkipGram-in-a-parse" (to be done after a round of e.g. MST parsing) ....  But improvements on Adagram that incorporate broader context will be valuable...

 
I guess I need to spend more time reviewing Adagram. What I was trying to say is that you don't need Adagram; you do need vectors; the vectors have to come from somewhere. Sure, the NN codes give you vectors; but you want to use vectors that encode grammatical information (dependencies) rather than vectors that are whizzy N-grams.

What MST does is two things:
1) vectors
2) dependency information.

The NN algos always give 1); they don't properly give 2).  If there's an algo that gives both 1) and 2) together, then the MST step can be replaced by that alternative algo.
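For readers unfamiliar with the MST step, the idea can be sketched like this (made-up mutual-information scores and a greedy Prim-style tree builder, for illustration only, not the actual opencog code): link the word pairs of a sentence into the spanning tree that maximizes total mutual information, yielding both vectors of context counts and unlabeled dependencies.

```python
def mst_parse(words, mi):
    """Maximum spanning tree over a sentence: greedily (Prim-style) grow
    the tree that maximizes total pairwise mutual information.
    Returns a list of (word, word) dependency links."""
    in_tree = {words[0]}
    links = []
    while len(in_tree) < len(words):
        # Pick the highest-MI edge crossing from the tree to the rest.
        best = max(((a, b) for a in in_tree for b in words if b not in in_tree),
                   key=lambda e: mi.get(e, mi.get((e[1], e[0]), float('-inf'))))
        links.append(best)
        in_tree.add(best[1])
    return links

# Made-up mutual-information scores for a toy sentence.
sent = ['the', 'cat', 'sat']
scores = {('the', 'cat'): 3.2, ('cat', 'sat'): 2.7, ('the', 'sat'): 0.1}
parse = mst_parse(sent, scores)
```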
 
The third of these makes total sense and is fortunately not a huge deal, as most clustering algorithms can work with whatever similarity metric you throw at them...
 
Yes, but:

a) there's a pairwise information metric I give towards the end.  If off-the-shelf clustering software is being used, then someone would have to rip into it and encode that specific metric.  Because of the way it's defined, it's fastest if certain sub-portions are pre-computed and pre-cached.  I've written that code for the scheme-based infrastructure, but I can't imagine that it exists in any off-the-shelf clustering package, anywhere on Anton's side of things.

b) the gradient descent algos do not usually have a place where you can plug in a custom pair-wise similarity metric.  When you have a pair-wise metric, it makes more sense to do agglomerative clustering; that runs at O(N) timescales, as opposed to O(N^2).

c) anything with the word "means" in it is going to be taking "arithmetic means" and I'm trying to explain why you don't want to take arithmetic means. One reason is that it ruins word-sense disambiguation.

d) k-means is still hard clustering, when using off-the-shelf software.  You don't want hard clustering; it is important to split words into word-senses.  In addition, you don't want "arithmetic means".  The earlier emails and the other PDF explain various tactical moves to accomplish this.  The point here is that it's a mistake to just dump raw words into k-means and do hard clustering.  It's probably a mistake to just dump raw words into fuzzy clustering and do post-factoring decomposition.  It's best to decompose vectors into word senses **during** clustering, and hard-cluster the word-senses.  I seriously doubt that any off-the-shelf blob of software is capable of this.

e) for every word, there is also a connector with that word in it.  So when you cluster a word, you also have to cluster all of the disjuncts that have that connector somewhere inside of them.  It makes no sense to place two words in the same cluster when clustering words, but place them into different clusters when they appear inside a connector.  You want the same cluster in both locations; they are dual to one another.

In essence, when you cluster together a pair of words, the vectors underneath you change as well.  Off-the-shelf software will not do this.  You can kind-of work around this by iterating, but that's quite inefficient.
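For point (b), a minimal sketch of agglomerative clustering with a caller-supplied similarity (the `overlap` metric and the toy rows below are stand-ins for illustration, not the information metric from the paper):

```python
def agglomerate(items, sim, threshold):
    """Naive average-linkage agglomerative clustering with a plug-in
    pairwise similarity: repeatedly merge the most similar pair of
    clusters until no pair exceeds the threshold."""
    clusters = [[x] for x in items]
    def link(a, b):
        # Average pairwise similarity between members of two clusters.
        return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))
    while len(clusters) > 1:
        (i, j), best = max((((i, j), link(clusters[i], clusters[j]))
                            for i in range(len(clusters))
                            for j in range(i + 1, len(clusters))),
                           key=lambda t: t[1])
        if best < threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters

# Toy rows standing in for word-disjunct counts; `overlap` is a stand-in
# similarity, not the paper's information-theoretic formula.
def overlap(u, v):
    return sum(min(a, b) for a, b in zip(u, v))

data = [(5, 1, 0), (4, 2, 0), (0, 1, 5), (0, 2, 4)]
out = agglomerate(data, overlap, threshold=3)
```

Since the metric is an argument, any pairwise similarity, including a pre-cached information-theoretic one, can be dropped in without touching the clustering loop.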


The second of these is the most interesting to me... basically it seems you are wanting to cluster (word, disjunct) pairs in a way that has high "clustering quality" both in the word dimension and in the disjunct dimension [i.e. so that both words are divided into meaningful clusters, and disjuncts are divided into meaningful clusters, even though the words and disjuncts are distinct and are stuck onto each other in various combinations]

It's not so much that "I want to"; it's rather that this is how professional linguists actually behave when they manually author a parse-rule lexis for a language.  That parse-rule lexis has the formal structure of a sheaf.  What I am trying to illustrate is how to write an algorithm that will extract a sheaf structure from raw text.

The point of the sheaf paper was to try to explain how human domain experts (linguists, biochemists, etc.) actually think about their problem domains.  When you look at what these human beings actually do, when they create models of bio-chemical interactions or models of language, what they are doing is (subconsciously) creating sheaves.  The reason they do this is that the actual data in nature has the structure of a sheaf -- the domain experts are not being silly; they are factoring data in the same way that nature factors it.

The goal here is to automate the process that humans use, to mimic and reproduce the structures that they typically hand-author, but to do so in an automated fashion.

-- Linas