Re: digging into MST-Parser code


Linas Vepstas

Jan 11, 2019, 9:31:46 PM
to akol...@aigents.com, Alexei Glushchenko, Leung Man Hin, Amen Belayneh, b...@goertzel.org, lang-learn, opencog, Michael Duncan, Zarathustra Goertzel
Hi Anton!

On Thu, Jan 10, 2019 at 11:45 PM Anton Kolonin @ Aigents <akol...@aigents.com> wrote:
Hi Linas, while digging into the MST-Parser code, we have found that some
of the NLP Scheme code resides in singnet/opencog and some is in
singnet/atomspace.

My goal is for all developers to agree that there is a development branch and a stable branch, to know which one is which, to work so that all development goes into the development branch, and to keep the stable branch stable according to industry-standard definitions of stability.

I am concerned that there continues to be confusion about this, and that it will just lead to wasted time, bad design, and bad code that is buggy and inoperable.

I spend vast amounts of my time being "the janitor" who cleans up messes. It is a thankless job that I don't enjoy, and I get concerned whenever I read something that suggests a big cleanup job is waiting for me in the future.
 
I wonder if the idea of having Scheme code in the AtomSpace
layer has some conceptual justification, or if it is just a historical matter?

There has always been scheme code in the atomspace. The atomspace provides all of the core infrastructure for the scheme bindings. 

For instance we were unraveling the uses and implementation of
add-symmetric-mi-compute.

That function is provided as a part of the "matrix" package. That package provides a way for looking at subsets of the atomspace as if they were (sparse) matrices. Please recall that a matrix (a 2-tensor) is an N x N grid of values.  There are many, many things one can do with a matrix. Almost all of the code in the matrix directory is focused on treating the matrix as a probability P(X,Y) of two random processes.

Whenever one has a probability like that, one is typically interested in the marginals (the P(X), which is P(X,Y) summed over all Y), the conditional probability P(X|Y), the entropy H(X,Y) and the mutual information MI(X,Y). Another very important quantity is the product P(X,Y) P^T(Y,Z), where ^T denotes the matrix transpose. This product can be used to build the cosine distance between X and Z; it can also be used to build the symmetric-MI, which is like the cosine distance, but has sums over logarithms in strategic places.
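As an illustration (in plain Python, not the AtomSpace matrix API; the word-pair counts below are made up), marginals and pairwise MI can be computed from joint counts like this:

```python
# Illustrative sketch: given joint counts N(x, y) for word pairs, compute
# the marginals P(x), P(y) and the pairwise mutual information.
# The counts are invented; this is not the AtomSpace "matrix" package.
import math

counts = {("the", "dog"): 8, ("the", "cat"): 6, ("a", "dog"): 2}
total = sum(counts.values())

# Marginals: P(x) = sum over y of P(x, y); P(y) = sum over x of P(x, y).
px, py = {}, {}
for (x, y), n in counts.items():
    px[x] = px.get(x, 0) + n / total
    py[y] = py.get(y, 0) + n / total

def mi(x, y):
    """Pointwise mutual information: log2 of P(x,y) / (P(x) P(y))."""
    pxy = counts[(x, y)] / total
    return math.log2(pxy / (px[x] * py[y]))

print(round(mi("a", "dog"), 3))  # → 0.678
```

The symmetric-MI and cosine distance mentioned above are built from exactly these ingredients, via the transpose product.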

You may wonder "why not use an ordinary linear algebra package?" or "why not use Gnu R?"  (or SPSS or SciPy, or whatever) There are three reasons for this:

1) The atomspace matrices are extremely sparse: for the NLP data, only one in a million entries are non-zero.

2) The NxN matrix has N=100K to 1 million for NLP data; stored densely, that is more RAM than computers can easily provide. The matrix package has to be optimized for sparse data. Genomic data might have even larger N.

3) It would be marvelous if someone wrote an R wrapper for this stuff. It's not hard. Someone needs to do this. I have been urging the agi-bio guys to do this, because their genomic/proteomic data is also extremely sparse,  and because they like to use R for data analysis.
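A back-of-the-envelope check of reason (2), using the sizes stated above:

```python
# Dense vs. sparse storage for an N x N matrix with one-in-a-million
# non-zero entries, per the sizes quoted above. The 16-bytes-per-entry
# figure assumes a COO-style (row, col, value) layout.
N = 1_000_000
dense_bytes = N * N * 8              # every cell stored as an 8-byte float
nnz = N * N // 1_000_000             # "only one in a million entries are non-zero"
sparse_bytes = nnz * (4 + 4 + 8)     # (row index, col index, value) per entry

print(dense_bytes // 10**12, "TB dense")   # → 8 TB dense
print(sparse_bytes // 10**6, "MB sparse")  # → 16 MB sparse
```

Eight terabytes versus sixteen megabytes is why a generic dense linear-algebra package is a non-starter here.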

The general justification is that every atom is like a tensor index, and the value attached to that atom is the value of that tensor at that index. Since a collection of atoms is conceptually the same thing as a set of sparse tensors, let's acknowledge that fact, and provide an API that allows ordinary users to access the tensor data as tensors. By "ordinary users" I mean anyone who has ever done statistical analysis, or more generally, any user who uses SciPy or Gnu R to mangle their data.

The atomspace is not for everyone: it's only for those people who have very sparse data with a Zipfian distribution. But if that is what they have, let them access it in a "normal" data-analytics kind of way, like they would in other packages.
 

If we keep extending the MST-Parser code

The MST parser code is in a different directory; it is not a part of the matrix code. It is much more experimental. The MST parser is meant to be a part of a generic parsing and theorem-proving infrastructure. Among both the academics and the readers of this mailing list, there is some general understanding that theorem proving, natural deduction, Hilbert-style deduction, sequent calculus, parsing and constraint-solving are all kind-of-ish "the same thing". The goal here is to try to make them actually be "the same thing", by providing something that accomplishes all of the above with the same code base.

To be more explicit: we have the URE, which performs forward and backward chaining. If you look at the chaining algorithm, you promptly realize that it is a certain kind of parsing algorithm. This insight, that parsing and theorem proving are "the same thing", is what prompted the URE to be created. It has been used for the proving side, i.e. for PLN, but it has not yet been used for parsing. No one has ever attempted to import link-grammar into the URE. (There was also the intent that open-psi would run on top of the URE, but the current URE does not support that mode of operation, and so open-psi exists as a distinct, separate code base.)

If the URE had been sufficiently powerful and robust, we would have been able to import the full English link-grammar dictionary into the URE, run it, and get ordinary LG parses coming out. This is not possible with the current URE design.
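The "chaining is a kind of parsing" observation can be made concrete with a toy forward chainer (an illustration only, not the URE; the rules are invented, and word order is ignored):

```python
# A toy forward chainer. Rules are (premises, conclusion) pairs; the loop
# that repeatedly fires rules over known facts is structurally the same
# loop a chart parser runs over grammar rules. Not the URE; word order
# and structure are deliberately ignored here.
rules = [
    ({"Det", "Noun"}, "NP"),     # read as grammar: NP <- Det Noun
    ({"NP", "Verb"}, "S"),       # or as logic:     S  <- NP & Verb
]
facts = {"Det", "Noun", "Verb"}

changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print("S" in facts)  # → True
```

Read the rules as grammar productions and the loop derives a parse; read them as inference rules and the same loop proves a theorem.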

Given that the current URE is unable to support open-psi, and unable to support LG, it seemed like it was time to redesign it from the ground up. Thus, the code in the "sheaf" directory is an attempt to re-imagine how theorem-proving and parsing can be accomplished in a fashion that is much faster, easier and more usable than the current URE, with a simpler API and a stronger toolset. The paper on sheaves was an attempt to explain how this could be done.
 
for account for word, link and
disjunct frequency

Please understand that disjuncts are a general concept. They occur not only in natural language, but they also occur in biology, and they also occur in theorem proving.
 
and provide integration with DNN-s

The paper on skip-grams is an attempt to explain how theorem proving is just like deep learning in neural nets. It attempts to explain how these two different systems are really variations on the same theme.

Ideally, the API provided by the code in the "sheaf" directory will provide a common API to deep-learning systems, as well as to parsing systems, to PLN, and to open-psi, so that one could choose between different algorithms and implementations to process your data.

Currently, this dream is pre-pre-alpha, and it contains only a generic MST parser.
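For concreteness, here is a minimal sketch of what an MST parse over MI-scored word pairs looks like (illustrative only; the sentence and scores are made up, and this is not the sheaf code):

```python
# A minimal maximum-spanning-tree parser sketch: given MI scores for word
# pairs, greedily grow a spanning tree, always attaching the unconnected
# word with the highest-scoring link. Scores are invented for the example.
sentence = ["the", "cat", "sat"]
score = {("the", "cat"): 4.2, ("cat", "sat"): 3.7, ("the", "sat"): 0.3}

def pair_score(a, b):
    return score.get((a, b), score.get((b, a), float("-inf")))

# Prim-style greedy growth of the maximum spanning tree.
in_tree = {sentence[0]}
links = []
while len(in_tree) < len(sentence):
    best = max(
        ((a, b) for a in in_tree for b in sentence if b not in in_tree),
        key=lambda ab: pair_score(*ab),
    )
    links.append(best)
    in_tree.add(best[1])

print(links)  # → [('the', 'cat'), ('cat', 'sat')]
```

The real parser additionally has to worry about link-crossing (planarity) constraints and much larger score tables, but the core idea is this spanning-tree search.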
 
and add
incremental/iterative learning capabilities to it,

The goal of tracking disjunct statistics is that this *is* the learning system.  Yes, there are other ways of learning. Again, the paper on skip-grams attempts to explain all the different ways in which learning can be accomplished.
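A toy of what "tracking disjunct statistics" means (illustrative only; real Link Grammar disjuncts carry typed connectors, not bare neighbor words):

```python
# Given the links of one parsed sentence, each word's "disjunct" is the
# set of connectors it uses: here, just its linked neighbors tagged with
# a direction (+ for a link to the right, - for a link to the left).
# Counting these over a corpus is the learning step.
from collections import Counter

sentence = ["the", "cat", "sat"]
links = [("the", "cat"), ("cat", "sat")]   # pairs ordered left-to-right

disjunct_counts = Counter()
for word in sentence:
    connectors = []
    for a, b in links:
        if word == a:
            connectors.append(b + "+")     # linked to a word on the right
        elif word == b:
            connectors.append(a + "-")     # linked to a word on the left
    disjunct_counts[(word, tuple(connectors))] += 1

print(disjunct_counts[("cat", ("the-", "sat+"))])  # → 1
```

Accumulated over many sentences, these counts become the frequencies from which the disjunct probabilities and MI values are computed.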
 
should the changes be
done both to singnet/opencog and singnet/atomspace following the same
pattern?

See comments at top about stable and development branches.

Or, we should better pull all NLP code out from singnet/atomspace to
singnet/opencog or even place them in separate project?

The MST parsing code is intended to be a part of a generic learning system that can be applied to NLP or genetics or to logical induction or to robotic motion control.  It is not specific to natural language. 

-- Linas

Ben, Man Hin, Amen - any insights on this?

Thanks,

--
-Anton Kolonin
skype: akolonin
cell: +79139250058



--
cassette tapes - analog TV - film cameras - you

Michael Duncan

Jan 16, 2019, 11:04:57 AM
to Linas Vepstas, akol...@aigents.com, Alexei Glushchenko, Leung Man Hin, Amen Belayneh, b...@goertzel.org, lang-learn, opencog, Zarathustra Goertzel
hi Linas, given your assertion that the atomspace is fundamentally better than the other graph databases out there, wouldn't it be strategic, from an ecosystem point of view, to reconceptualize the functional distinction between the atomspace repo and the opencog repo as a graph database and (proto-)AGI infrastructure, respectively?

Linas Vepstas

Jan 16, 2019, 2:38:21 PM
to Michael Duncan, akol...@aigents.com, Alexei Glushchenko, Leung Man Hin, Amen Belayneh, b...@goertzel.org, lang-learn, opencog, Zarathustra Goertzel
On Wed, Jan 16, 2019 at 10:04 AM Michael Duncan <mjsd...@gmail.com> wrote:
hi Linas, given your assertion that the atomspace is fundamentally better than the other graph databases out there,

I don't think I said that. I think I said that it has more advanced features.  However, the competition is moving faster, and is catching up. I fear the atomspace will be a forgotten historical footnote in not too many more years.
 
wouldn't it be strategic from an ecosystem point of view to reconceptualize the functional distinction between the atomspace repo and the opencog repo as a graph database and (proto) agi infrastructure, respectively?

I think that was always the case. 

In practice, I tried to make sure that the atomspace was the part of opencog that was stable, well-thought-out, finished, reliable, dependable, whereas the other repos were experimental prototyping sandboxes. Exactly where to draw that line is not always clear; for example, the rule-engine is a part of the atomspace, even though it is less than finished, and can be re-imagined in several alternative ways (for example, openpsi is also a kind of rule engine, except it's totally different). Should the rule-engine live in its own repo, instead of the atomspace? Maybe. Should openpsi live in its own repo, instead of opencog? Maybe. I don't think I would resist such a split-up; do it in a conscientious, thoughtful fashion, and sure, why not?

-- Linas

Linas Vepstas

Jan 16, 2019, 3:54:49 PM
to Anton Kolonin @ Gmail, Michael Duncan, Alexei Glushchenko, Leung Man Hin, Amen Belayneh, b...@goertzel.org, lang-learn, opencog, Zarathustra Goertzel, Cassio Pennachin, Alexey Potapov, Сергей Шаляпин
Hi Anton

On Wed, Jan 16, 2019 at 11:10 AM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:

 From strategic product/service delivery perspective having AtomSpace as
hyper-graph database (storage) layer isolated from application (business
logic) layer would make a lot of sense, if that is possible.

I think this was always the case. The choice of words "business logic" is kind-of funny. It's quite accurate, but I've found that there is a class of catch-phrases that Ben thinks are boring (he will literally walk out of the room), and I think this is one of them.

We've never tried to make the AtomSpace a business product. We've never found a way to talk about the atomspace in a business-obvious, developer-obvious kind of way (compare, again, to grakn.ai). This means that almost all developers struggle mightily to figure out how to use it, and almost always fail. Compare, again, to grakn.ai: it's not just tutorials and demos and examples and documentation; it's the need to create an example that *everyone* can relate to, and to make that the primary example.
 
As I see it, one of the problems preventing this nice architectural
isolation between the layers is atom type hierarchy which is bound to
low-level implementation of the AtomSpace (storage) concepts on one end
and to high-level (business logic) aspects like NLP on the other end.

What's the problem? Per a different thread, there is a need for a better FFI (or rather, a "foreign type interface" rather than a "foreign function interface"), but I think this is a well-defined, easily solvable problem that no one has been interested in solving, until now.

Ideally, I would imagine having AtomSpace as a C/C++ graph database
loadable with any atom type hierarchy, isolated from any specific atom
type hierarchies like one used in OpenCog.

But that is the case already, is it not? For example, agi-bio has its own hierarchy; the atomspace does not know about it, but it can store it, load it, pattern-match it, and backward-chain with it just fine.

My question might be rhetorical; I know of many things that are wrong, incomplete, poorly implemented ... but it is hard to figure out which of these are important, and which are not.  So you'd have to make clear which ones are the important ones.
 
On top of this, there could be separate projects and applications
and scripts in any language, such as Scheme or C/C++ or whatever,
loading any atom type hierarchies, with any AGI/NLP/etc. applications.

I think that has been possible since about "forever", so you would have to give a more detailed example.

Now, there are many things that one could do to make the atomspace better/easier for "ordinary" users. I've thought a lot about these. But doing so takes focus and effort.  The historical focus has been on PLN and various conceptions of AGI, and essentially zero focus on "normal" applications.

But there is more important thing that is concerning me regarding the
architecture. To my understanding, unlike any conventional database
used in industry, OpenCog is not supposed to work in multi-user
environments. For instance if you have SQL table about animals, you may
have multiple users querying different segments of the table related to
different animals.

Do you mean "access permissions"? Read-write? There's some minimal support for that; you can have a read-only atomspace (e.g. some huge genome dataset) and then a read-write layer on top of it (so that some scientist can modify portions of the dataset, without screwing up the total, and without having to make a private, personal copy of the huge dataset.) This works now, but it's minimal; no fancy features.
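The read-only base plus read-write overlay can be sketched with a plain ChainMap (the AtomSpace layering is of course richer than a dictionary; the dataset here is invented):

```python
# Sketch of the "read-only base, read-write overlay" pattern: writes land
# in the per-user overlay, reads fall through to the shared base, and the
# base itself is never mutated. The keys and values are hypothetical.
from collections import ChainMap
from types import MappingProxyType

genome_base = MappingProxyType({"GENE:ABC1": 0.9, "GENE:XYZ2": 0.1})

# Each scientist gets a private overlay on the shared, immutable base.
session = ChainMap({}, genome_base)
session["GENE:XYZ2"] = 0.5      # local modification, shadows the base
session["GENE:NEW9"] = 0.7      # local addition

print(session["GENE:XYZ2"])     # → 0.5 (overlay wins)
print(genome_base["GENE:XYZ2"]) # → 0.1 (base untouched)
```

The scientist sees a modified dataset; everyone else still sees the pristine one, and nobody had to copy the huge base.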

If you mean "atomic update", then no; the atomspace is more BASE-like than ACID-like. This could be interesting to talk about.

If you mean "table schemas", we've got a prototype of that, called "deep types".  In SQL you must always have a table schema.  In prolog/datalog, you never need a schema (and I'm not sure it is even possible to specify a schema in those languages). I promise not to mention XML schema. Ooops. Same idea - you can write XML without a schema, but there are people who insist that their app has to have one, and so -- XML schemas.  We have a sketch for that in the atomspace -- see the wiki page on type constructors.  The basics work. No one actually uses it for anything.

If you mean inner and outer joins, the pattern matcher already does that, automatically.

Seemingly, it does not work the same way in OpenCog - if two independent
users start MST-Parsing on two different corpora, they will have data
messed up together,

Why? Open a bug; this should work perfectly.  Once, long long ago, I've run parsing in parallel on 3 different machines; the data was not "messed", it summed up very nicely.

I have not actually tried this (or even thought about it) with the current pipeline in opencog/nlp, so yes, there may be bugs. They should be fixed.  There's a potential performance penalty from syncing too often.  There might be issues with atomic updates; I don't think we have atomic counters fully implemented (work for that was started, but not finished) but you can certainly do language learning without atomic counters.
 
if they start inference or pattern matching activity
on different topics, the topics will be messed up together.

Huh? This should work perfectly. Open a bug.
 
The way it
is supposed to get solved is having different AtomSpaces for each corpus

??
or for each inference process but the AtomSpaces are really heavyweight
and you can not create AtomSpaces dynamically for the user sessions.

?? Why can't you create atomspaces dynamically?
cog-new-atomspace
cog-push-atomspace
cog-pop-atomspace
cog-atomspace-readonly?
cog-set-atomspace!

and 8 more of these kinds of functions.

 
Well, we may have pool of N AtomSpaces serving queue of M users, so if N
< M then M-N users are staying in queue. But I anticipate that context
AtomSpace initialization for every user coming from the queue could be
as expensive as creation of the new AtomSpace for every user...

I don't understand. You can do this just fine. There's already a pool of temporary atomspaces. You'd have to define what "expensive" means.  I think you can create an atomspace in milliseconds or maybe tens of milliseconds at most.
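The pool-of-N-spaces pattern under discussion can be sketched with a blocking queue (the "spaces" here are plain strings standing in for AtomSpaces):

```python
# A fixed pool of N workspaces served through a queue: when all N are
# checked out, the remaining users block until one is returned. The
# workspaces are stand-in strings, not real AtomSpaces.
import queue
import threading

N = 2
pool = queue.Queue()
for i in range(N):
    pool.put(f"space-{i}")       # pre-created workspaces

results = []

def serve(user):
    space = pool.get()           # blocks while all N spaces are in use
    try:
        results.append((user, space))
    finally:
        pool.put(space)          # return the space to the pool

threads = [threading.Thread(target=serve, args=(u,)) for u in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # → 5
```

Five users are served by two spaces; whether pooling beats creating a fresh space per session depends on how expensive per-user initialization really is, which is the open question above.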

Destroying a database with millions of atoms in it is slow ... but that is a different issue.
 

Something to get addressed before considering exposure of OpenCog-based
services to SingularityNET.

There are 1001 things that generic databases do that the atomspace does not do, or, at least, not efficiently, quickly, easily. It would be nice to have those features. Up until now, there has been a very low demand for these. Because the user base is tiny.

There is one very, very important issue that is being ignored: we need to be able to load the ghost rules for Sophia much more quickly than we do. Last time someone measured, it was unacceptably slow. I'm not sure what the status of that is.

-- Linas
 
