Hi Anton,

Attached please find this week's new-and-improved version of the neural-nets-vs-symbolic-parsing document. Improvements over the last version include:
More words spent explaining why:
- k-means clustering is identical to matrix factorization -- this is not my result, it's an old result; I'm recapping it because you need to understand it to understand the next step, which is bicategorization.
- Why bicategorization is the right thing to do -- because that is what the Link Grammar dicts already do! Bicategorization is also an old algorithm, from 2003 -- but if you look at it carefully, you can see that the Link Grammar dicts are ***exactly*** bicategorized contingency matrices. That is to say: ordinary old-school linguists who manually write dependency grammars do so in a format that is naturally the same format as the output of a k-means bicategorization.
- Why an information-theoretic divergence is much better than a cosine distance. This is more subtle, I suppose, because it requires you to see a vector dot-product as something that is invariant not under rotations (well, it is, but that misses the point), but rather under Markov transformations, which preserve probabilities. This is because all of the vectors are rows and columns of a probability distribution. Thus, cosine distance is "wrong", and Kullback-Leibler divergence is correct. Again -- this is an old result, from 2003, but all of the people doing ordinary off-the-shelf k-means are oblivious to it, because their data is not a joint probability. I try to spell this out in great detail, and to provide all of the explicit formulas you need to do this.
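To make the k-means-as-factorization point concrete, here is a minimal sketch (the toy data and cluster assignments are invented for illustration): the usual within-cluster sum of squares is exactly the squared Frobenius norm of X - ZC, where Z is the one-hot cluster-assignment matrix and C holds the centroids.

```python
# Sketch: k-means objective == constrained matrix factorization X ~ Z C.
# Toy data: 4 points in 2-D, with a fixed assignment to 2 clusters.

X = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]  # data points
z = [0, 0, 1, 1]                                       # cluster of each point
k = 2

# Centroids: mean of the points assigned to each cluster
C = []
for j in range(k):
    members = [x for x, zi in zip(X, z) if zi == j]
    C.append([sum(col) / len(members) for col in zip(*members)])

# (a) the usual k-means objective: sum of squared distances to centroids
obj_kmeans = sum((xi - ci) ** 2
                 for x, zi in zip(X, z)
                 for xi, ci in zip(x, C[zi]))

# (b) the factorization view: squared Frobenius norm of X - Z C,
#     with Z the 4x2 one-hot assignment matrix built from z
Z = [[1.0 if zi == j else 0.0 for j in range(k)] for zi in z]
ZC = [[sum(Z[i][j] * C[j][d] for j in range(k)) for d in range(2)]
      for i in range(len(X))]
obj_factor = sum((X[i][d] - ZC[i][d]) ** 2
                 for i in range(len(X)) for d in range(2))

print(abs(obj_kmeans - obj_factor) < 1e-12)  # the two objectives coincide
```

Running k-means is then just alternating minimization of this same factorization objective over Z (assignments) and C (centroids).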
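A tiny sketch of the divergence point (the co-occurrence counts here are invented): each row of the contingency table, once normalized, is itself a conditional distribution p(d|w), and KL divergence compares it as such; a small epsilon is added only to avoid log(0) on the zero counts.

```python
# Sketch: rows of a joint count table N(w, d) are probability
# distributions, so an information-theoretic divergence (KL) is the
# natural comparison.  Toy counts, 3 words x 4 disjuncts.
from math import log

counts = {
    "dog":  [8, 2, 0, 0],
    "cat":  [7, 3, 0, 0],
    "runs": [0, 0, 6, 4],
}

def p(row, eps=1e-9):
    """Normalize a count row into a distribution; eps avoids log(0)."""
    row = [c + eps for c in row]
    s = sum(row)
    return [c / s for c in row]

def kl(pp, qq):
    """Kullback-Leibler divergence D(p || q)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(pp, qq))

d_close = kl(p(counts["dog"]), p(counts["cat"]))
d_far   = kl(p(counts["dog"]), p(counts["runs"]))
print(d_close < d_far)  # "dog" is far nearer to "cat" than to "runs"
```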
This paper is still not done, but I think it lays out the groundwork much more nicely than before. I am hoping that it is not hard to read -- again, I tried to mostly simplify everything. I hope it's not oversimplified.

Anyway, I think it's a much more promising direction to go in than triadic k-means. It's probably simpler, too.
--linas
On Sun, Jul 22, 2018 at 12:01 AM, Anton Kolonin @ Gmail <akol...@gmail.com> wrote:

Hi Linas, thanks, I will look into that.
In the meantime, below, the guys are getting close with "triadic K-means":
http://aclweb.org/anthology/P18-2010
They use "FrameNet 1.7" and the "dataset of polysemous verb classes by Korhonen" for evaluation.
If we get these, we can compare to see to what extent we are doing better.
Cheers,
-Anton
20.07.2018 1:46, Linas Vepstas wrote:
Due to the obvious confusion that the sheaves paper caused for everyone, I have started work on a different way of explaining it. This one takes, as its starting point, a description of the word2vec algorithms, and explains how word2vec can be viewed as a sheaf. So, if you are more comfortable with that viewpoint, this might be a better way of grokking the concept. But maybe the middle bits will help clarify these issues. The paper is very much an early draft; I've already re-written the final 2-3 pages since last night. The title is likely to change. The introduction will change. Here:

https://github.com/opencog/opencog/raw/master/opencog/nlp/learn/learn-lang-diary/skippy.pdf
-- Linas
---------- Forwarded message ----------
From: Anton Kolonin (Google Docs) <d+MTEyNjExODQyMzA2NTk3MDYxMzEx-MTEzMTc0Njc2MTkyODczODc3MjIx@docs.google.com>
Date: Thu, Mar 1, 2018 at 4:16 AM
Subject: Unsupervised Lang... - This comes from works of +linasvepsta...
To: linasv...@gmail.com
Anton Kolonin mentioned you in a comment on Unsupervised Language Learning (ULL) Design Draft
--
cassette tapes - analog TV - film cameras - you
Hi Linas,

Nice stuff! A quick comment regarding:

*** More precisely, the factorization of the language model appears to require three important ingredients:

A way of decomposing word-vectors into sums of word-sense vectors,

A way of performing biclustering, so as to split the bipartite graph p(w, d) into left, central and right components, holding the left and right parts to be sparse,

Using an information-theoretic similarity metric, to preserve the probabilistic interpretation of the contingency table p(w, d). ***

The first of these is, of course, what Adagram attempts to do ... and Andres has experimented with a variety of Adagram that replaces standard SkipGram with "SkipGram-in-a-parse" (to be done after a round of e.g. MST parsing) .... But improvements on Adagram that incorporate broader context will be valuable...
The third of these makes total sense and is fortunately not a huge deal, as most clustering algorithms can work with whatever similarity metric you throw at them...
The second of these is the most interesting to me... basically it seems you want to cluster (word, disjunct) pairs in a way that has high "clustering quality" in both the word dimension and the disjunct dimension [i.e. so that the words are divided into meaningful clusters, and the disjuncts are divided into meaningful clusters, even though words and disjuncts are distinct and are stuck onto each other in various combinations].
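The simultaneous-partition idea can be sketched on a toy contingency table (the counts and the seed partition below are invented for illustration): each word is assigned to the disjunct-block holding most of its mass, then each disjunct is re-assigned to the word-block holding most of its mass -- one alternating step of a biclustering loop, recovering meaningful clusters on both sides at once.

```python
# Minimal biclustering sketch: toy counts N(w, d) with an obvious
# two-block structure.  One alternating pass partitions words and
# disjuncts simultaneously.

N = [  # rows = words w1..w4, cols = disjuncts d1..d4
    [4, 3, 0, 0],
    [5, 4, 0, 0],
    [0, 0, 6, 5],
    [0, 0, 4, 6],
]

# Seed the column partition: first two disjuncts vs. the rest
col_cluster = [0, 0, 1, 1]

# Assign each word to the column-block holding most of its counts
row_cluster = []
for row in N:
    mass = [0, 0]
    for d, n in enumerate(row):
        mass[col_cluster[d]] += n
    row_cluster.append(mass.index(max(mass)))

# Re-assign each disjunct to the row-block holding most of its counts
new_col_cluster = []
for d in range(4):
    mass = [0, 0]
    for w, row in enumerate(N):
        mass[row_cluster[w]] += row[d]
    new_col_cluster.append(mass.index(max(mass)))

print(row_cluster)      # words fall into two coherent clusters
print(new_col_cluster)  # disjuncts fall into the matching clusters
```

A real implementation would iterate these two steps to convergence, and would use an information-theoretic objective rather than raw mass, but the alternation between the word partition and the disjunct partition is the core of it.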