And now, time for some theory: when two words have many ConnectorSet "in
common", they are judged to be "similar". There are several different
ways to define "in common": jaccard similarity, cosine distance, mutual
information, several others.
Ok.
That is, given a word, the collection of all of the ConnectorSeq's on
that word can be thought of as being a vector, and so one can take
vector products of all kinds.
Ok, but do we have a use for such vector products?
I do. Writing the vector dot product as v.w (vector v dot-product vector w) one then has:
the cosine distance = v.w / |v||w|
which is popular, but turns out to be low-quality.
the (fractional) mutual information MI = log v.w / v.* w.*
the variational MI = log v.w / sqrt v.* w.*
both of which work much better than the cosine.
A visualization challenge is then: how can one compare vectors? How
can one visualize the vectors? How can one visualize their
similarities?
I don't know if this can help, but... Vectors have a set of values in different dimensions, right?
Perhaps a practical example.
dog == fur + legs + tail + eyes + pet
cat == fur + legs + tail + eyes + pet
horse == fur + legs + tail + eyes + big
worm == round + long + underground
These are vectors. I wrote an implicit 1.0 for the relevant dimensions, and 0.0 for the missing dimensions. So, in greater detail:
vector dog == 1.0 * fur + 1.0 * legs + 1.0* tail + 1.0 * eyes + 1.0 * pet + 0.0 * round + 0.0 * underground
Clearly, cats and dogs are very similar (you can compute the cosine product ... or the MI or the VMI, they will all tell you that they are similar.)
Clearly horses and dogs are sort-of similar, and clearly the dot product of worm and dog is zero.
vectors are imperfect for common-sense reasoning, but they are fast and practical, and you can use them in many many places.
Jaccard distance is great, too.
The bubbles system can also be interpreted as a possibly infinite set of dimensions. Top bubble may represent the zeroth dimension. Bubbles orbiting the top bubble may represent the first dimension. Bubbles orbiting bubbles orbiting the top bubble may represent the second dimension, and so on.
How can I show that dog & cat are similar? well, today, the following works (almost works, the dataset is missing dog and cat):
(sim-vmi-nav 'edge-score (Word "dog") (WordNode "cat"))
This returns one floating point number. But that number is just the vmi, built from the dot-product of vectors.
How can the vector for dog, cat be visualized, and how can it be done to show that they are very similar?
But a system of dimensions is useful only to show magnitudes of each dimension. So, how to relate different dimensions regardless of what values they hold? I feel that vector products lay beneath of what you are searching for, not above. Once you figure out the above thing, you may be in position to compare those aboves using vectors.
Yes, I am trying to bootstrap my way up. There are lots of vectors, everywhere, at different layers of abstraction. Some are "below", some are "above", depending on the layer.
And what is that above? Simply, a programming language. It may be a general Turing machine, but it is hard to reason about its programs. It may be an imperative programming language, a functional programming language, a logic programming language, category theory programming language, and all the other correspondent languages we haven't even heard about yet.
The raw data that I have are:
* millions of books
* millions of photos
* millions of sound recordings.
I know how to turn these into vectors, and get GPT-like results (but using a completely different system). I do not know how to turn these into a "programming language".
I do know about Curry-Howard correspondance: that proofs are the same things as programs. I also know that steps-of-a-proof are jigsaw-pieces, and that jigsaw-pieces are the elements of a grammar, which are composed into a language. Is it possible to call this language a "programming language"? Kind of. But it is NOT a human-friendly programming language. It's very abstract and machine-centric. Not human-centric.
Atomese is a programming language, it's human-readable and human writable, but not as nice and clean as python or java; it's not meant for humans. It's meant for machines. The vectors of ConnectorSeq are in fact short little programs that are simple elements of a language. vectors and dot products are how I compose the programs. For example, "union type" and "product type" in type theory. So, just like CaML or Haskell: each Atom is *exactly* like a type in CaML or Haskell or ConcurrentML, and combine in all the same ways. Atomese is dynamically typed, CaML and haskell are static-typed. Vectors are just ... cartesian products. Same thing as lists, in lisp, scheme.
Programming is about data structures (those structures may also hold programs that may be interpreted in a processor). Once you manage to visualize different data structures, you will get an answer to your question. But how general you want to go? Do you want to bound yourself to domain specific data structures, or to bee free with general data structures? I believe that the most general data structures are the simplest ones. Domain specific ones should be just restrictions you need to additionally define to achieve some specific useful uniformities.
Atoms are the "most general" data structures, and so your earliest visualizations are "the most correct". But visualizing a single Atom is like visualizing a single neuron. You won't understand how the brain works, if your web-browser window can only draw 5 or 10 neurons at a time.
So, the question is: how do you visualize a cortical column, which consists of about 100 neurons?
Think of a single Atom as a neuron, and a single edge as a connection between neurons. Although you draw only ten connections at a time, most of those Atoms have hundreds or thousands of edges, and (nav-obj 'forward) only returns the ten most strongest connections. The hundreds of others are ignored...
And that's just the most basic network. The disjunct network is more complex. And, yes, eash disjunct is a jigsaw-puzzle piece, which means each disjunct is a part of a grammar, that defines a "programming" language. The vector products are simple programs. How can they be visualized?
This really is like slicing up a brain, and looking at it, and trying to figure out what you're seeing...
In the beginning of this story, I mentioned a system of dimensions. But those are zeroth order structures, just like there exists zeroth order logic (propositional calculus) compared to first and higher order logic. Maybe you can try to invent a first or higher order dimensional system? Some variables, or something would apply, but how?
Already have. You need to read the various papers I've sent you about jigsaw pieces. I'm not the first -- mathematicians were writing about jigsaw pieces decades ago, and linguistics even earlier. There is a broad and deep and wide foundation for all of this stuff, its the foundations for computer science. There is no quick or easy way to learn this stuff. There's oceans of it.
Read a little bit about type theory and proof theory and curry-howard correspondence, Read about grammar and parsing. Read about category theory. There are "jigsaw pieces" in all of them ... unfortunately, they are fairly hidden, Sometimes, they're mentioned explicitly, but not all that often.
But again: visualizing one jigsaw piece (one disjunct) is like visualizing one neuron. The trick is to visualize hundreds or thousands of them, all at once.
-- linas
You can go specific, or you can go general, but once you define data structures you'd like to visualize, that means that more than 99% of work is done. And if we are talking about general dimensional systems, we already know how to visualize them. We "just" lack a definition of those possibly higher order dimensional systems.
- ivan -
For visualization, the basic idea would be: every query has input and output types (say, `WordNode` or `ConnectorSet`). Queries come in pairs, branching from top to bottom. If upper query output matches lower query input, those queries can be combined and will be offered in the dropdown menu.
What does this typing mean for combining `cset-stars` queries? It means that they can be combined in a flip-flop manner. We start from a `WordNode`, then combine the first query `'forward` to attain `ConnectorSet`, than combine the second query `'backward` to attain `WordNode`, then again `'forward` to attain `ConnectorSet`, and so on. How does this sound?
That's not a bad idea. It certainly gives a different way of navigating. We can try it, see if it's interesting.
And now, time for some theory: when two words have many ConnectorSet "in common", they are judged to be "similar". There are several different ways to define "in common": jaccard similarity, cosine distance, mutual information, several others.
That is, given a word, the collection of all of the ConnectorSeq's on that word can be thought of as being a vector, and so one can take vector products of all kinds.
A visualization challenge is then: how can one compare vectors? How can one visualize the vectors? How can one visualize their similarities?
--linas
I'm very sorry, I unintentionally posted my ssh connection parameters in the `READ.me` file to GitHub.
The
parameters, or your keys? Without keys, no one can do anything. Maybe
some people might figure out that there is some ssh server on port 224,
but this is not much of a threat.
ssh is secure, unless someone uses a weak password. I'll disable password logins, just to be safe.
I unintentionally published just the shell command.
Keys are safe, thanks God for that..
The disjuncts are already in the current dataset. They are pairs,
where the left dist is a word, and the right side is ConnectorSeq as
described before.
The question is: how do
visualize this structure? I do not have any simple, easy answer to
this. ... for this, I need your help. What can you do to visualiize
this?
I may have some ideas, but I can't manage the `make-navigator` wrapper to work on disjuncts, and I'm not versed in Scheme.
In the meanwhile, I am preparing a file with word-similarities in it ..
actually, all three .. words-pairs, disjuncts and word-similarities.
With luck I'll have it done in 12-36 hours. The similarities will be
easy to visualize.
Cool, take your time...
- ivan -
I'm very sorry, I unintentionally posted my ssh connection parameters in the `READ.me` file to GitHub.
The
parameters, or your keys? Without keys, no one can do anything. Maybe
some people might figure out that there is some ssh server on port 224,
but this is not much of a threat.
ssh is secure, unless someone uses a weak password. I'll disable password logins, just to be safe.
I unintentionally published just the shell command.
Linas,
I'm very sorry, I unintentionally posted my ssh connection parameters in the `READ.me` file to GitHub.
The parameters, or your keys? Without keys, no one can do anything. Maybe some people might figure out that there is some ssh server on port 224, but this is not much of a threat.
ssh is secure, unless someone uses a weak password. I'll disable password logins, just to be safe.
I updated the `READ.me` immediately to exclude this information, but I guess the changes are left buried inside the GitHub repo diff tracking. I hope you still are passing only me through proxy, but if you find this as a potential threat, please update your server configuration as soon as possible. The last thing I would want is to see your server being hacked.
Please forgive me.
Very sorry for inconvenience,
Ivan
Linas,
Ok, clear enough, word-pairs it is for a start (forward and backward), configurable parameters coming next, and let's see what happens then. I may need some help with documentation.
I'm interested in disjuncts too, so when you decide to do them, let me know. I may have some more free time.
Fine regards,
Ivan
Hi Linas,
Making what work? The more specific the question, the easier it is for me to answer it ...
I'm sorry, there seems to be a short circuit between us, I guess I have to be more clear, I think it is my fault.
Mine as well. These things take time.
Let me try again posing my question from the start. You supplied me with make-navigator object I'm trying to use. Right now, I can successfully run only query patterns like these three:
- ((make-navigator pair-stars 'right-duals 'left-duals pair-freq 'pair-fmi 10) 'forward (Word "start"))
- ((make-navigator pair-stars 'right-duals 'left-duals pair-freq 'pair-fmi 10) 'backward (Word "start"))
- ((make-navigator pair-stars 'right-duals 'left-duals pair-freq 'pair-fmi 10) 'edge-score (Word "start") (Word "end"))
The `make-navigator` is an object constructor, as in "object oriented programming". It would be more efficient if you made this only once. So for example:
`(define navi (make-navigator pair-stars 'right-duals 'left-duals pair-freq 'pair-fmi 10))` and then use this object over and over:
(navi 'forward (Word "start"))
(navi 'backward (Word "start"))
It will run a little bit faster, if you just create the object once, and re-use it. The difference is not that large, so you can do it either way, but it can make the code more readable.
The user interface should contain a configuration page/penel, which allows allows all of these arguments to be configurable: pair-stars 'right-duals 'left-duals pair-freq 'pair-fmi 10
So, for example, the user should be able to type in 'pair-logli instead of 'pair-fmi. Now, 'pair-logli is less interesting than 'pair-fmi, but still ... It might be fun to play with.
pair-stars and pair-freq are just two objects I create when I load this dataset. It might make more sense for you to create these objects, but, for now, it's OK as-as.
I have other datasets that contain word-similarities. For those datasets, the two vertexes joined by an edge are words, as before. The edge itself holds three of four different kinds of similarity scores. In those datasets, the "stars" object would have some different name -- maybe word-sim-stars. I don't care. Instead of pair-freq, you would use `word-sims` and instead of 'pair-fmi you would use any one of 3 or 4 other methods that report different kinds of similarity scores.
For word-similarity, there is no difference between left-stars and right-stars (or left-duals and right-duals) and so there is no difference between forward and backward.
I guess I should try to create a dataset that has both word-pairs, and also similarity scores in it. Then you would need to present the user with a menu choice of three things: (a) view raw atomspace contents, (b) view word-pair chains, or (c) view similar words.
A completely different project is to visualize the disjuncts. But this is a completely different project, and so you would have to ask questions about that.
This all works fine, and I already made some code out of it, but I suspected (re-reading all the previous mails, especially the 8'th mail from the top - the technical one) that you (right now) wanted in the visualization more than the above three query pattern examples.
Well, the disjuncts.
If the above three examples are not enough, please provide me with all the concrete examples of other queries (in the above form) you'd like to have in the visualization because I'm having trouble making them up. Other things I try in scm shell return errors.
That is because this dataset only has word-pairs, and csets in it. You get errors for other things, because the dataset does not have those other things in it. The only things it has are the things explained in (pair-stars 'describe) and (cset-stars 'describe) which are almost identical, but not quite, and (pair-freq 'describe) and (cset-freq 'describe) which are completely identical, except that the second uses disjuncts, not words.
But if the above three examples are all you wanted for now, then everything is clear, and I made all the trouble out of nothing.
disjuncts. I think that visualizing disjuncts would be interesting.
You don't have to upload into the dataset anything you didn't plan,
I plan to have "everything"; just that today, I don't have high-quality word-pair similarity datasets that are small enough to fit on the server I exposed to you. It would take me a few days of hard work to create them ... but given how you talk, I guess I should do this as soon as possible.
I'm just trying to find out everything you expected to have available in the first visualization iteration.
I guess word-pairs is enough. I will try to create something for word-similarities.
I think disjuncts are important, but they are harder.
--linas
There were emails about these and those queries, what is working now, what later, what's impressive and what's common, I re-read it all several times, and I guess I got lost, probably posing questions about the wrong things and making the confusion. I hope I managed to be more concise and understandable this time. If not, I seriously have to work on my communication skills.
Thank you, and sorry for misunderstandings,
Ivan