New AtomSpace Architecture


Ramin Barati

Jul 14, 2013, 11:37:16 AM
to Benjamin Goertzel, opencog, Linas Vepstas
Hi Ben, Linas, etc.,

Now that we're starting to implement PLN from scratch (or maybe evolve the current design), I can't help thinking about persistence and the method of reading/writing the atoms and links. As far as I know, the AtomSpace is in the class of Big Data, and we all know that RAM isn't big enough and an RDBMS is too slow. I know for a fact that I'm not the first person to stumble upon this problem, but bear with me. I apologize if the contents of this e-mail are redundant.

1) NoSQL solutions are fast at reading big chunks of data, but they're still slow when there is something to write. The OpenCog architecture requires very fast read/write operations with small data, so I believe that using NoSQL databases wouldn't boost performance much over the current structure of the data.

2) NoSQL solutions are fast because they are eventually consistent, meaning that the data isn't guaranteed to be consistent at any given moment. So as the agents discover more facts about the world (adding atoms and links), they will discover redundant data (because they don't know that it has been discovered by their fellow agents). As a consequence the info/data ratio will be small, resulting in slower (even wrong?!) inferences; this also forces us to program garbage-collector agents. Wouldn't this make OpenCog slower?

3) Although key-value pairs seem like the right choice for graph representations, they are not. Apart from nodes, graphs have links, which should be considered first-class citizens in any practical graph data structure. Key-value pairs are good at representing nodes, but adding links to the mix results in data that is hard to read and hard to write.

Considering the above, I don't think the implementation of the new design on Redis will satisfy the given criteria in the "Context and Motivation" part of the article. I also believe that there is no commercial or open source db project that will satisfy all of the needs of the OpenCog data. But if we combine them, we can definitely address all of its needs. OpenCog is already polyglot and I don't see a reason why it can't have more than one db technology as well. 

As much as I hate to just nag, I did some digging around and found out about graph databases and NewSQL. They are not general-purpose dbs; they each specialize in one field. Graph databases, as the name suggests, specifically store and retrieve graphs the way it should be done. I read somewhere that a graph db is 1000x faster than a conventional db when the content can be best represented with a graph (although that seems a little exaggerated). VoltDB is the leading force in in-memory databases and a part of the NewSQL movement. It's fast, partitioned, and ACID. IMHO combining these two technologies will be the best approach that OpenCog can take for the atomspace architecture.

Regards
Ramin

Ben Goertzel

Jul 14, 2013, 11:45:19 AM
to Ramin Barati, opencog, Linas Vepstas
Hi Ramin,

> Considering the above, I don't think the implementation of the new design on
> Redis will satisfy the given criteria in the "Context and Motivation" part
> of the article. I also believe that there is no commercial or open source db
> project that will satisfy all of the needs of the OpenCog data. But if we
> combine them, we can definitely address all of its needs. OpenCog is already
> polyglot and I don't see a reason why it can't have more than one db
> technology as well.

Sure, in principle that seems fine...

> As much as I hate to just nag, I did some digging around and found out
> about graph databases and NewSQL.

I have studied graph databases a fair bit already; and a while ago I played
around with Neo4j ...

> They are not general-purpose dbs; they each specialize
> in one field. Graph databases, as the name suggests, specifically store and
> retrieve graphs the way it should be done. I read somewhere that a graph db
> is 1000x faster than a conventional db when the content can be best
> represented with a graph (although that seems a little exaggerated). VoltDB
> is the leading force in in-memory databases and a part of the NewSQL
> movement. It's fast, partitioned, and ACID. IMHO combining these two
> technologies will be the best approach that OpenCog can take for the
> atomspace architecture.

This is an interesting suggestion. It would be worthwhile for you to elaborate
in more detail on what led you to this specific conclusion...

Thanks
Ben

Ramin Barati

Jul 14, 2013, 12:27:30 PM
to Ben Goertzel, opencog, Linas Vepstas
  Well, first I wanted to make sure that I'm not doing something that has been done before, so I didn't work on it that much. The idea is to use a graph db as the long-term memory and VoltDB (or some other in-memory NewSQL db) as the short-term memory. VoltDB is ACID: agents put their guesses there, and their info will be consistent. As more and more evidence is found backing a guess, it becomes a fact and is delivered to the graph db back end. This way we can query the long-term memory as fast as Redis, without the fear of inconsistency and the write overhead, and also have the speed and consistency that a human has in his/her short-term memory.

  Of course it's not that mature a solution, but maybe it can be transformed into something useful.
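
  To make the idea concrete, here is a minimal sketch of the promotion flow I have in mind, in Python; ShortTermStore and GraphStore are hypothetical stand-ins for VoltDB and the graph db, not real client libraries, and the threshold is made up:

    PROMOTION_THRESHOLD = 0.9  # made-up confidence at which a guess becomes a "fact"

    class ShortTermStore:
        """Consistent scratch space where agents record their guesses."""
        def __init__(self):
            self.guesses = {}  # atom -> best confidence seen so far

        def add_evidence(self, atom, confidence):
            self.guesses[atom] = max(self.guesses.get(atom, 0.0), confidence)
            return self.guesses[atom]

    class GraphStore:
        """Stand-in for the long-term graph database back end."""
        def __init__(self):
            self.facts = set()

        def persist(self, atom):
            self.facts.add(atom)

    def record_observation(short_term, long_term, atom, confidence):
        # Guesses accumulate in the consistent short-term store; only atoms
        # whose confidence crosses the threshold are delivered to the
        # long-term graph db.
        if short_term.add_evidence(atom, confidence) >= PROMOTION_THRESHOLD:
            long_term.persist(atom)
            del short_term.guesses[atom]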

Linas Vepstas

Jul 14, 2013, 12:28:17 PM
to Ramin Barati, Benjamin Goertzel, opencog
Hi Ramin,

On 14 July 2013 10:37, Ramin Barati <rek...@gmail.com> wrote:
Hi Ben, Linas, etc.,

Now that we're starting to implement PLN from scratch (or maybe evolve the current design), I can't help thinking about persistence and the method of reading/writing the atoms and links. As far as I know, the AtomSpace is in the class of Big Data, and we all know that RAM isn't big enough and an RDBMS is too slow.

It's very poor form to start a discussion by making preposterous and false statements. I'll read the rest and try to respond, but it's just poor form to say things that aren't true and then expect the audience to follow you wherever you go.

--linas

Ramin Barati

Jul 14, 2013, 12:36:19 PM
to linasv...@gmail.com, Benjamin Goertzel, opencog
Hi Linas,

It's very poor form to start a discussion by making preposterous and false statements. I'll read the rest and try to respond, but it's just poor form to say things that aren't true and then expect the audience to follow you wherever you go.

 I made three statements; which ones are preposterous and false?

Ben Goertzel

Jul 14, 2013, 12:45:10 PM
to Ramin Barati, linasv...@gmail.com, opencog
On Mon, Jul 15, 2013 at 12:36 AM, Ramin Barati <rek...@gmail.com> wrote:
> Hi Linas,
>
>> It's very poor form to start a discussion by making preposterous and
>> false statements. I'll read the rest and try to respond, but it's just poor
>> form to say things that aren't true and then expect the audience to follow
>> you wherever you go.
>
>
> I made three statements; which ones are preposterous and false?

It is not clear whether the same statement was classified as both
preposterous and false, versus one statement being preposterous and a
different statement being false ;)

Given " As far as I know atomspace is in the class of Bigdata and we
all know that RAM isn't big enough and RDBMS is too slow. " I would
guess that Linas must be taking issue with "RDBMS" is too slow, since
the other statements are true...

For applications like unsupervised language learning, Web search or
robotics, the Atomspace will be in the class of Big Data ... and
indeed given current technology and $$, RAM isn't big enough...

Regarding RDBMS being fast enough, I guess it comes down to fast
enough for what? It all depends what queries you need to execute,
right? Linas found the postgres backing store for OpenCog fast
enough for the particular things he was doing with OpenCog and NLP
when he created it...

FWIW, I have my doubts that -- when you really dig into the details --
the technological solution you suggest is gonna work for OpenCog.
After I dug fairly deep into various related technologies a year or so
ago, I reluctantly came to the conclusion that for the in-RAM,
**distributed** AtomSpace, we would have to roll our own, as no
existing tech really did the trick.... I did not research
persistence solutions extensively though, so have no strong opinion on
that part...

-- Ben

TheNinthWave

Jul 14, 2013, 1:11:24 PM
to ope...@googlegroups.com, Ramin Barati, Benjamin Goertzel, linasv...@gmail.com

It's very poor form to start a discussion by making preposterous and false statements. I'll read the rest and try to respond, but it's just poor form to say things that aren't true and then expect the audience to follow you wherever you go.

--linas


Linas,

It would be more constructive if you would provide some reasoning for such a statement.  Otherwise, this just sounds cranky and dismissive.

Best,
Duane

Ramin Barati

Jul 14, 2013, 1:25:24 PM
to Ben Goertzel, linasv...@gmail.com, opencog
Regarding RDBMS being fast enough, I guess it comes down to fast
enough for what?

  I'm considering data as large as Twitter's, Facebook's, etc. If it was too slow for them, why should it be fast enough for OpenCog?

FWIW, I have my doubts that -- when you really dig into the details --
the technological solution you suggest is gonna work for OpenCog.
After I dug fairly deep into various related technologies a year or so
ago, I reluctantly came to the conclusion that for the in-RAM,
**distributed** AtomSpace, we would have to roll our own, as no
existing tech really did the trick....

Well, I don't believe that anything revolutionary has happened in the past year, so I accept that. But I'm not talking about an in-RAM atomspace. The atomspace lives in the graph db; VoltDB is a relational db, so it's not suitable for representing graphs. I'm thinking that maybe, if it's possible to keep records of the mind agents' actions, we can stop them from doing the same discovery twice, and only things that become strong enough will be delivered to the atomspace. I didn't mean to give the impression that VoltDB would be a temporary atomspace. It's just a consistent medium between the eventually consistent graph db and the dynamic nature of learning.

Ben Goertzel

Jul 14, 2013, 1:28:14 PM
to ope...@googlegroups.com, linasv...@gmail.com
ah, ic...

I think the records of the MindAgents' actions should be stored as
Atoms, and thus either

-- kept in the Atomspace for reasoning, learning from

or

-- if they get too voluminous, persisted along with the rest of the
on-disk Atomspace

...

ben



--
Ben Goertzel, PhD
http://goertzel.org

"My humanity is a constant self-overcoming" -- Friedrich Nietzsche

Linas Vepstas

Jul 14, 2013, 1:32:27 PM
to Ramin Barati, Benjamin Goertzel, opencog
Hi,


On 14 July 2013 10:37, Ramin Barati <rek...@gmail.com> wrote:
 OpenCog architecture requires very fast read/write operations with small data, 

This is correct.  Atoms are small.
 
2) NoSQL solutions are fast because they are eventually consistent,

Not exactly true. It is not so much that they are fast, but rather that, compared to SQL, a certain synchronization lock was removed. That lock can *sometimes* be a bottleneck when writing a lot of data.  "Sometimes" is the key word in the sentence above. 

meaning that the data isn't guaranteed to be consistent at any given moment. So as the agents discover more facts about the world (adding atoms and links) they will discover redundant data (because they don't know that it has been discovered by their fellow agents).

Unless you have a specific example, I don't believe this at all.  So, for example, some vision agent will not be finding redundant data -- redundant with what? Something that was heard or touched or smelled before?

OK, so maybe you have two vision agents, or one optical, one infrared, one radar, and one ultrasonic.  Then yes, you have data management problems, but simply dumping it all into the atomspace and hoping it all becomes magically "eventually consistent" can't possibly work.

You have to design something that explicitly reconciles different data.

As a consequence the info/data ratio will be small

Huh? Sensors usually provide too much data ... 
 
resulting in slower (even wrong?!) inferences,

Well, this is now talking about PLN or some other inferencing agent.  Yes, an interesting scenario is where we may have 2 or 3 or 100 PLN agents each making inferences on the same or on similar data, and now we have to reconcile all of these different inferences. 

However, 1) right now we have zero working PLN agents, and 2) "eventual consistency" won't solve any problems here, either.  There will need to be some specific way of harnessing together 100's of PLN agents to work together coherently.


3) Although key-value pairs seem like the right choice for graph representations, they are not.

Please be careful with the words that you use. For ordinary graphs, key-value pairs might provide a reasonable representation, maybe; however, opencog consists of hypergraphs, not graphs.  There's more; the opencog hypergraphs are typed; ordinary graphs are never typed.

The real question seems to be "what is the best way to *represent* a typed hypergraph (i.e. the atomspace) within some other, existing storage system (e.g. sql, nosql, whatever)?"
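
For concreteness, here is one toy answer to that question: a typed hypergraph flattened into two relational tables, sketched with Python's sqlite3 purely for illustration (the table names and columns are invented, not the actual backend schema):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE atoms (
        handle INTEGER PRIMARY KEY,
        type   TEXT NOT NULL,  -- the atom's type, stored as an explicit column
        name   TEXT            -- node name; NULL for links
    );
    CREATE TABLE outgoing (
        link_handle INTEGER REFERENCES atoms(handle),
        position    INTEGER,   -- order within the outgoing set
        target      INTEGER REFERENCES atoms(handle),
        PRIMARY KEY (link_handle, position)
    );
    """)
    # SimilarityLink(ConceptNode A, ConceptNode B) becomes three atom rows
    # plus two outgoing rows:
    conn.executemany("INSERT INTO atoms VALUES (?, ?, ?)",
                     [(1, "ConceptNode", "A"),
                      (2, "ConceptNode", "B"),
                      (3, "SimilarityLink", None)])
    conn.executemany("INSERT INTO outgoing VALUES (?, ?, ?)",
                     [(3, 0, 1), (3, 1, 2)])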

I also believe that there is no commercial or open source db project that will satisfy all of the needs of the OpenCog data. But if we combine them, we can definitely address all of its needs. OpenCog is already polyglot

Please be careful to distinguish opencog from atomspace. They are related but very different things. 

The atomspace, which hold typed hypergraphs, should have a good persistence architecture. It should also be distributed. 

Whether this or that or some other part of opencog wants to use the atomspace is a completely different question.  So, for example, the space/time parts of embodiment have already decided that the plain-old atomspace is not good enough for 3D data, and there are new, distinct space/time servers created just for this.   These servers have a very different set of requirements for persistence.


As much as I hate to just nag, I did some digging around and found out about graph databases and NewSQL. They are not general-purpose dbs; they each specialize in one field. Graph databases, as the name suggests, specifically store and retrieve graphs the way it should be done.
 
Hmm, could be interesting... anything that gets us closer to typed hypergraphs is a good thing.

I read somewhere that a graph db is 1000x faster than a conventional db when the content can be best represented with a graph (although that seems a little exaggerated).

Heh. To say the least. Given that "conventional dbs" store something called "relational data", and given that any and all relations are fundamentally graphs, the claim is absurd.

In fact, relational DB's store more than just relation graphs: the relations are typed.
 
VoltDB is the leading force in in-memory databases

Well, you started out by claiming that things don't fit in-memory!  If you want an in-memory database, hell, gnucash has one that was built about 13 years ago. It can do any kind of search and query.  Really cool. We should have split it out as a distinct project; the NewSQL revolution could have started a decade earlier!

==========
Here are my major complaints:

1) Opencog already has persistence.  Absolutely no one is using it.

2) The existing opencog persistence is distributed, and should scale just fine to maybe 10 or 20 or 100 machines. It works; I once ran it on 3 machines. 

Until people start using this, and start discovering ways in which it is inadequate, I think that having discussions about persistence is ridiculous and absurd.

3) The existing persistence design is one certain kind of *representation* of typed hypergraphs as typed relational graphs.  It is a particularly simple one, but perhaps not the best or fastest one.  There are certainly other representations that are possible that may have much better performance properties.  I can think of several fairly easy ones, right now, that could be very interesting.

4) The particular relational graph that is used is mapped to SQL (and postgres, in particular).  It could be mapped to a different SQL, or to NoSQL. Or a different representation could be mapped to NoSQL. It is "well known" that SQL and NoSQL are adjoint functors in the sense of category theory; any representation you have in one has an adjoint representation in the other.  Microsoft published a paper on this, maybe 5 years ago now.
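
As a toy illustration of that duality: the same atom can be read as one relational row or as a bundle of key-value pairs, and the two forms are mechanically interconvertible (the key scheme below is invented for the example):

    row = {"handle": 42, "type": "ConceptNode", "name": "cat", "tv": 0.8}

    def row_to_kvp(row):
        # "atom:42:type" -> "ConceptNode", "atom:42:name" -> "cat", ...
        h = row["handle"]
        return {f"atom:{h}:{col}": val
                for col, val in row.items() if col != "handle"}

    def kvp_to_row(kvp):
        out = {}
        for key, val in kvp.items():
            _, h, col = key.split(":")
            out["handle"] = int(h)
            out[col] = val
        return out

    assert kvp_to_row(row_to_kvp(row)) == row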

So we have not even begun to scratch the surface of what can be done.  I think it's completely premature to start talking about which technology will be used for the solution, before we have outlined what the problem is, and how to solve it.   Saying things like "let's use Redis" or "let's use MongoDB" or whatever is like saying "let's use a Brand X steering wheel, and build a whole car around it; but no matter what the car is, or what the car does, it must have a Brand X steering wheel in it!"

-- Linas

Linas Vepstas

Jul 14, 2013, 2:30:47 PM
to Ramin Barati, Benjamin Goertzel, opencog
I think perhaps I can clarify what it is I'm talking about.

First, anyone who is interested in atomspace persistence should use the existing persistence infrastructure, and gain some experience with it.  If, after using it, you still decide that persistence is a problem, and that the most important problems are not elsewhere (e.g. the current atomspace wrapper being stunningly slow), then all thinking on persistence should be focused on *representations*.

What is a "representation"?  It is a generic mathematical term, which means that "structure A is implemented using a specific technique B, such that everything that can be legally done with A works exactly the same way with B."

For the atomspace, structure A is a typed hypergraph. To figure out what B is, one must answer questions such as these: 

1) Where is an atom type stored?  Is it a field in a record?  Is it a key in a KVP?  Maybe there is a different table for each different atom type? (so that the atom type is implicit, given the table, rather than explicit, by being a value in a table?).

2) Are links and nodes stored in the same table, or in different tables?

3) Are truth values stored with the atoms, or are they in a separate table? Same question for attention values?

4) Should the outgoing set of a link be considered to be part of a link type, or not?   Should links with different outgoing types be stored in a different table, or the same table? That is, if I have 

  SimilarityLink
     ConceptNode A
     ConceptNode  B

and 

   SimilarityLink
      ConceptNode A
      NumberNode B

should these be stored in the same table (the table of all similarity links) or in two different tables?

5) Does the outgoing set get stored as one single record, or is it split across many records?  One table, or many tables?

6) PLN and pattern matching, and other inferencing systems, will want the set of all incoming links.  There needs to be a fast way of finding these.  How will this be done? By using an index? By using a table? Some other way?

7) In the above, perhaps "the set of all incoming links" is not a requirement, perhaps "the set of most incoming links, in an eventually-consistent fashion" is enough.  How would this affect the design?

8) After finishing the above design, what are all the steps needed to create a new atom?  How many tables or records need to be updated? one, two, three? More?  How many indexes need to be recomputed?  Does anything need to be locked while this is being done, or is there a lock-free solution?

9) After finishing the above design, what are all the steps needed to change the truth value of an atom? How many tables & indexes need to be updated?  Are locks needed?

10) Same question for STI/LTI.  Note that the current AV infrastructure in the atomspace is very rigorous about keeping a certain set of indexes exactly in sync.  Can this be relaxed?
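
One compact way to record a candidate proposal is to write the design choices down next to the question each one answers; the choices below are hypothetical placeholders to show the notation, not a recommendation:

    # Each entry tags one design choice with the question it answers.
    REPRESENTATION = {
        "atom_type":    "explicit column on the atoms table",   # Q1
        "nodes_links":  "single shared table",                  # Q2
        "truth_values": "inline columns (mean, confidence)",    # Q3
        "outgoing_set": "one row per (link, position)",         # Q4/Q5
        "incoming_set": "index on outgoing.target",             # Q6
    }

    def writes_to_create_link(arity):
        """Q8: count the writes this particular proposal needs per new link."""
        atom_rows  = 1          # one row in the atoms table
        out_rows   = arity      # one outgoing row per target
        index_hits = 1 + arity  # primary key plus incoming-set index updates
        return atom_rows + out_rows + index_hits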
 
Only AFTER doing all of the above should you start asking yourself "should I use SQL or NoSQL or Redis or whatever, to implement the above".  Well, I exaggerate .. clearly, some of the answers to the questions above will be different, depending on whether you use SQL or NoSQL or hypertable or something else.

In the current persistence design, question 5) was answered in such a way that it works for Postgres, but fails for MySQL. So, yes, technology choice does limit possible answers.

There is no one unique "best" answer to the above questions.  Each answer will have a different performance profile. Clearly, some "representations" will be bad. But there will be many that are "pretty good, except for problem X which is really horrible". That is when things get hard.

The above is not an exhaustive list of questions; there are more.  We have not yet even talked about "distributed" and "synchronization" and "consistency" and "concurrency" and "write-back".  These are even harder questions, best left for later.

We also have different user profiles. The way that PLN might access the atomspace will be very different from how ... I dunno .. something else might.  Some parts of embodiment might be creating and deleting vast numbers of atoms. Others might be creating and deleting very few, but changing truth values on everything.  Different users will see different performance profiles for different representations.

IF you can make one or two or three proposals for how to answer the above questions, THEN we can have productive conversations, including having an SQL vs. NoSQL vs. BigTable vs. Hadoop vs. Hibernate vs. god-knows-what technology choices.  

I think it would be best if at least two, maybe three representations were proposed. Then these can be compared and discussed.  Don't make one good proposal, and two obviously bad ones. Make all three proposals be each very good, and very, very different.

-- Linas

Ramin Barati

Jul 14, 2013, 3:13:04 PM
to ope...@googlegroups.com, Benjamin Goertzel
 @ben
 
I think the records of the MindAgents' actions should be stored as
Atoms, and thus either...

  As long as the structure doesn't converge to a graph, it can be easily represented in a relational model. If it's only an Atom, then it can be represented as a row in a table.

@linas

As a consequence the info/data ratio will be small

Huh? Sensors usually provide too much data ... 
  
  I meant data/garbage ratio. No idea how info/data got written. :p

VoltDB is the leading force in in-memory databases

Well, you started out by claiming that things don't fit in-memory!  If you want an in-memory database, hell, gnucash has one that was built about 13 years ago. It can do any kind of search and query.  Really cool. We should have split it out as a distinct project; the NewSQL revolution could have started a decade earlier!
  
  As I stated earlier, VoltDB should be used to store other types of structural data, like the data needed to coordinate the agents, not the Atomspace. This kind of data should be write-intensive, thereby using the advantages of VoltDB. If it's not like that, we have no reason to use VoltDB. We should use the technology that best fits the data's needs. There are just so many newborn db technologies these days, one of them probably fits.

Here are my major complaints:

1) Opencog already has persistence.  Absolutely no one is using it.

2) The existing opencog persistence is distributed, and should scale just fine to maybe 10 or 20 or 100 machines. It works; I once ran it on 3 machines. 

Until people start using this, and start discovering ways in which it is inadequate, I think that having discussions about persistence are ridiculous and absurd. 

  Yes, OpenCog already has persistence, and I'm sure that it's gonna work great for now. But I can't help noticing that there is zero activity in those aspects of the code. I'm not sure, but the last time anyone touched the persistence code may have been 2009. Most of the contributors are attracted to more flashy parts like MOSES, Embodiment, etc. Let's face it, DB work isn't fun. The code will eventually get obsolete and will lag behind the new technologies (here is the first sign). There are people out there doing, for free, what we don't have the manpower to do. Why should we be so partial about our persistence method?

So we have not even begun to scratch the surface of what can be done.  I think it's completely premature to start talking about which technology will be used for the solution, before we have outlined what the problem is, and how to solve it.

  Then that is exactly what we should do! Or completely decouple the persistence methods from the core modules. Suppose we had the perfect PLN right now; it would be heavily coupled with the persistence layer (I don't think anyone wants to upload the ConceptNet core every time he/she wants to test or debug the PLN code; it's a needed feature). If we then suddenly reach the conclusion that we should use technology X for persistence, it will be a maintenance nightmare!

--

Ramin Barati

Jul 14, 2013, 3:59:01 PM
to ope...@googlegroups.com, Benjamin Goertzel
So, for example, the space/time parts of embodiment have already decided that the plain-old atomspace is not good enough for 3D data, and there are new, distinct space/time servers created just for this.   These servers have a very different set of requirements for persistence.

  That is exactly what I'm talking about: use different db technologies to deal with the different technological needs. E.g. for spatial data, a good candidate would be PostGIS. There is no need to stick to RDBMS, NoSQL, or NewSQL alone. We can use all of them and have the best of all of them together.

Joel Pitt

Jul 14, 2013, 4:02:31 PM
to ope...@googlegroups.com, Benjamin Goertzel
I think Linas makes good points, but I also think persistence, as well as swapping between memory and the persistence store, should be a concern of the AtomSpace. Persistence should also be a first-class citizen in any design, so that if you are working with OpenCog everything is saved unless you explicitly disable it.

Related: I'm making some progress on my hypergraph db, which does this using pages of data (like SQLite does) but is optimised to store (small) atom properties inline, so that accessing and hopefully traversing atoms maintains some sort of data locality. Partitioning graphs is an unsolved hard problem, but there are a few heuristics that do okay, and I intend to allow rebalancing the graph so that more-connected data is more likely to be in the same page.

This probably shouldn't be taken into account for future OpenCog developments, though. I'm not sure how long it will be before it's in a stable state API-wise.

Ramin Barati

Jul 14, 2013, 4:46:51 PM
to ope...@googlegroups.com, Benjamin Goertzel
@linas

  I can't possibly answer all of these questions correctly right now, but I'll try.
  Questions 1,2,3,4,5,6: If we use a hypergraph db (HGDB), for example HyperGraphDB, the data will be stored somehow and will be represented to us as a graph. The retrieval of the data may be slow in an HGDB, so we store the atoms and links in the HGDB but use a KVP to store the metadata, resulting in fast traversal of the graph nodes and fast retrieval of atom data. All we have to do is synchronize the atomic operations of saving, something we will be doing anyway no matter what technology we use.
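
  A minimal sketch of the split I mean, with a plain dict standing in for the KVP and a hypothetical GraphSide class standing in for the hypergraph db client:

    class GraphSide:
        """Stand-in for the hypergraph db holding atoms and links."""
        def __init__(self):
            self.edges = {}  # link handle -> tuple of target handles

        def add_link(self, handle, targets):
            self.edges[handle] = tuple(targets)

    metadata_kvp = {}  # handle -> {"tv": ..., "av": ...}

    def save_atom(graph, handle, targets, tv, av):
        # These two writes are what must be kept atomic (the synchronization
        # mentioned above) if the stores are separate.
        graph.add_link(handle, targets)
        metadata_kvp[handle] = {"tv": tv, "av": av}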
 
  I'm not in a position to answer Question 7.

  Questions 8, 9, 10: These questions are specific to relational-model dbs, except the locks and their mechanism (which I can't answer right now).

  From my point of view, the ten questions above can be reduced to:
  
  1) Which HGDB best fits the needs of atomspace?
  2) Should we use RDB, NewSQL or KVP for storing atom metadata?
  3) Which RDB/NewSQL/KVP best fits atomspace needs?

  

--

Ramin Barati

Jul 14, 2013, 5:03:34 PM
to ope...@googlegroups.com, Benjamin Goertzel
@ joel

  Maybe you can check the available hypergraph db projects and decide which one is (or will be) better suited to the atomspace?

Joel Pitt

Jul 14, 2013, 5:23:17 PM
to opencog, Benjamin Goertzel
HyperGraphDB is a Java project, so I'm not particularly inclined to
suggest integrating it with OpenCog, even though things like RelEx are
in Java.

My project will be a dynamic library in C/C++, with Python bindings at
the minimum.

However, HyperGraphDB is fully functional and mature, whereas mine is
very much a prototype and work in progress. So if you need something
now, then HyperGraphDB is the clear winner in that battle!

J

Linas Vepstas

Jul 14, 2013, 9:42:07 PM
to opencog, Benjamin Goertzel
OK,


On 14 July 2013 15:46, Ramin Barati <rek...@gmail.com> wrote:
@linas

  I can't possibly answer all of these questions correctly right now, but I'll try.

You shouldn't even try. If you don't spend at least a couple of days on it, you won't get a good answer; you probably won't even understand what the questions are really asking.  These are not easy questions. Spend a few weeks thinking about them. A month, a couple of months. Keep coming back to it.  Architecture is not something you can dash off in an afternoon or a week. Professionals spend entire careers on this stuff.

  Questions 1,2,3,4,5,6: If we use a hypergraph db (HGDB), for example HyperGraphDB, the data will be stored somehow

How?
 
and will be represented to us as a graph.

How?
 
The retrieval of the data may be slow in an HGDB,

Why would it be slow?
 
so we store the atoms and links in the HGDB but use a KVP to store the metadata,

What metadata?  The atomspace has metadata ??
 
resulting in fast traversal of the graph nodes and fast retrieval of atom data.

Two sentences ago, you said it was slow; now you say it's fast? Which is it?
 
All we have to do is synchronize the atomic operations of saving,
 
Why?
How? 

something we will be doing anyway no matter what technology we use.

Doing what?
 
  Questions 8, 9, 10: These questions are specific to relational-model dbs,

No, they're not; I think you misunderstood the questions. 

 
except the locks and their mechanism (which I can't answer right now).

Why are locks needed? The current sql persistence backend does not need locks ... 
 
  From my point of view, the ten questions above can be reduced to:

Again -- spend at least a few days or maybe a LOT longer, before you try to answer these questions.  Do some research.  Google stuff when you get stuck. Sketch out on a piece of paper how you think you will organize things; what points where, how; the names of all the tables that you'll need, what each table will contain.

Write down a sketch of the algorithms that might be needed.  Run the algos by pencil & paper; think about where the data is, where it's moving, what's being copied, saved, fetched, stored, indexed, offset, and where all these operations are happening, and how expensive each of them might be.

These were not meant to be easy questions, and if you think they are easy, then you didn't really understand the questions.

-- Linas

Linas Vepstas

Jul 14, 2013, 10:30:33 PM
to opencog, Benjamin Goertzel
The following might help solidify thinking:

http://queue.acm.org/detail.cfm?id=1961297

A co-Relational Model of Data for Large Shared Data Banks

by Erik Meijer, Gavin Bierman | March 18, 2011

It starts out real easy; it gets a bit more complicated towards the end.

--linas


Ben Goertzel

Jul 14, 2013, 11:32:49 PM
to ope...@googlegroups.com, Ramin Barati
Linas,

> What is a "representation"? It is a generic mathematical term, which means
> that "structure A is implemented using a specific technique B, such that
> everything that can be legally done with A works exactly the same way with
> B."
>
> For the atomspace, structure A is a typed hypergraph. To figure out what B
> is, one must answer questions such as these:

These are good questions, and I will tell y'all the answers I have tentatively
concluded for *some* of them... with the overall OpenCog / Cog Prime design
in mind, rather than only the current or very-near-future applications

> 1) Where is an atom type stored? Is it a field in a record? Is it a key in
> a KVP? Maybe there is a different table for each different atom type? (so
> that the atom type is implicit, given the table, rather than explicit, by
> being a value in a table?).

I'm not sure, but one comment I'd make is that the list of Atom types will
probably ultimately be dynamic.

For instance, let's say we had an AtomSpace with a very large number
of relations such as

EvaluationLink
    CommunicatesWith
    A
    B

because the system has been fed a lot of data from repositories of
email and mobile phone
communications [just a random example off the top of my head]

Then we may want to have a CommunicatesWithLink, to save memory, and
simplify the
encoding of rules regarding communication in the system...

If the system automagically identifies that CommunicatesWithLink
should be added as a new
link type, you'd like this change to be able to happen, without some
sort of wholesale
restructuring of the knowledge store

Of course, previously persisted Atomspaces will not use
CommunicatesWithLink, so some way of
loading these in and recasting them in terms of CommunicatesWithLink
will also be needed...
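
Just to illustrate the recasting step, here's a toy sketch where atoms are modeled as nested (type, ...) tuples; the tuple encoding and the PredicateNode wrapping are assumptions made for the example, not the real loader:

    def recast(atom):
        # EvaluationLink(CommunicatesWith, A, B) -> CommunicatesWithLink(A, B)
        if (atom[0] == "EvaluationLink"
                and atom[1] == ("PredicateNode", "CommunicatesWith")):
            return ("CommunicatesWithLink",) + atom[2:]
        return atom

    old = ("EvaluationLink", ("PredicateNode", "CommunicatesWith"),
           ("ConceptNode", "A"), ("ConceptNode", "B"))
    assert recast(old) == ("CommunicatesWithLink",
                           ("ConceptNode", "A"), ("ConceptNode", "B"))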

But the point is, it would be *better* if the list of Atom types were
not hard-wired in, in some very
inflexible way...

> 2) Are links and nodes stored in the same table, or in different tables?


One comment is that, since we can have links pointing to links, the
link/node distinction is not really that fundamental...

This is an argument in favor of "same table", but not a decisive one
of course...

> 3) Are truth values stored with the atoms, or are they in a separate table?
> Same question for attention values?

The current attention allocation subsystem uses a big matrix of
attention values for carrying out the mechanics of
importance spreading.... This suggests that, in RAM, it can
sometimes be useful to store AVs separately from Atoms...

In the case of persistent storage, though, I can't currently think of
an important case where you'd need to grab a TV or AV
without also grabbing the Atom..... Of course, if you're grabbing a
network of Atoms from disk, then at the edges of the network,
you may be grabbing the TV, AV and other basic information of certain
Atoms and not grabbing their links. But this case doesn't really
justify having a separate store for TVs or AVs...

> 4) Should the outgoing set of a link be considered to be part of a link
> type, or not? Should links with different outgoing types be stored in a
> different table, or the same table? That is, if I have
>
> SimilarityLink
> ConceptNode A
> ConceptNode B
>
> and
>
> SimilarityLink
> ConceptNode A
> NumberNode B
>
> should these be stored in the same table (the table of all similarity links)
> or in two different tables?

Often there are multiple ways to represent the same thing, and the
choice between them is somewhat arbitrary

For instance,

SimilarityLink
    ConceptNode A
    ConceptNode B

and

SimilarityLink
    ConceptNode A
    SatisfyingSet
        PredicateNode isB

are the same thing...

Part of the process of cognition is transforming between these
different equivalent representations..... This is part of PLN
inference, for example...

> 5) Does the outgoing set get stored as one single record, or is it split
> across many records? One table, or many tables?

Different queries will tend to filter the outgoing set in different
ways, and there seems no way to split it up across many records in a
way that will make sense for the variety of common queries. So, if
it's split up, one has to assume that a significant percentage of
queries will need to be executed across multiple parts...

> 6) PLN and pattern matching, and other inferencing systems, will want the
> set of all incoming links.

Yes

> There needs to be a fast way of finding these.
> How will this be done? By using an index? By using a table? Some other way?

...

> 7) In the above, perhaps "the set of all incoming links" is not a
> requirement, perhaps "the set of most incoming links, in an
> eventually-consistent fashion" is enough. How would this affect the design?

In that case, there should be a way for the system to increase its
confidence of having
a complete version of the set of incoming links toward certainty, via
exerting extra
effort/resources

If you really really try, you can remember *almost* everything you
know about your
first girlfriend, even though you may never get to completeness...

> 8) After finishing the above design, what are all the steps needed to create
> a new atom? How many tables or records need to be updated? one, two, three?
> More? How many indexes need to be recomputed? Does anything need to be
> locked while this is being done, or is there a lock-free solution?
>
> 9) After finishing the above design, what are all the steps needed to change
> the truth value of an atom? How many tables & indexes need to be updated?
> Are locks needed?

No comments on these...

> 10) Same question for STI/LTI. Note that the current AV infrastructure in
> the atomspace is very rigorous about keeping a certain set of indexes
> exactly in sync. Can this be relaxed?

It should be possible to relax this requirement. Attention
allocation is a highly
fuzzy, fault-tolerant sort of process

> The above is not an exhaustive list of questions; there are more. We have
> not yet even talked about "distributed" and "synchronization" and
> "consistency" and "concurrency" and "write-back". These are even harder
> questions, best left for later.

About the requirements for a distributed Atomspace, I did think this
through pretty
carefully about a year ago ... those who don't remember the
particulars of that long
discussion can look at the document here:

http://wiki.opencog.org/w/DistributedAtomspace

-- Ben

David Hart

Jul 15, 2013, 12:24:54 AM
to ope...@googlegroups.com
>> 2) Are links and nodes stored in the same table, or in different tables?
>
> One comment is that, since we can have links pointing to links, the
> link/node distinction is not really that fundamental...
>
> This is an argument in favor of "same table", but not a decisive one
> of course...
>
>> 3) Are truth values stored with the atoms, or are they in a separate table?
>> Same question for attention values?

On data structure concepts, I kinda assumed that a future roll-your-own
persistence would use either some existing KVP library (nearly all RDB &
noSQL systems use them, as does HyperGraphDB) or a completely home-grown
low-level store, rather than an RDB, so discussion of 'nodes stored in
tables' sounds a bit anachronistic to my ear.

> In the case of persistent storage, though, I can't currently think of
> an important case where you'd need to grab a TV or AV
> without also grabbing the Atom..... Of course, if you're grabbing a
> network of Atoms from disk, then at the edges of the network,
> you may be grabbing the TV, AV and other basic information of certain
> Atoms and not grabbing their links. But this case doesn't really
> justify having a separate store for TVs or AVs...

I can envision use cases where a KVP persistence layer with separate
instances for each atom attribute outperforms an RDB-style system where
all atom data is lumped into single records or views, because of
parallelism. Think of it like RAID striping; if each atom attribute is
stored in a separate storage instance, a single-atom fetch may be slower
than the RDB case, but a stream of them will likely fetch faster.
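
A toy model of what I mean, with plain dicts standing in for separate per-attribute store instances (e.g. separate LMDB environments) and a thread pool overlapping the lookups; all names here are invented for the illustration:

    from concurrent.futures import ThreadPoolExecutor

    stores = {
        "type": {42: "ConceptNode"},
        "name": {42: "cat"},
        "tv":   {42: (0.8, 0.9)},
    }

    def fetch_atom(handle):
        # Each attribute lookup proceeds independently, so a stream of
        # fetches can overlap I/O across the per-attribute instances.
        with ThreadPoolExecutor() as pool:
            futures = {attr: pool.submit(store.get, handle)
                       for attr, store in stores.items()}
            return {attr: f.result() for attr, f in futures.items()}

    print(fetch_atom(42))  # {'type': 'ConceptNode', 'name': 'cat', 'tv': (0.8, 0.9)}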

(see my message of last week about LMDB, the Lightning Memory-Mapped
Database)

-dave

Ramin Barati

Jul 15, 2013, 6:33:26 AM
to ope...@googlegroups.com
@linas

  Questions 1,2,3,4,5,6: If we use a hypergraph db (HGDB), for example HyperGraphDB, the data will be stored somehow

How?
 
and will be represented to us as a graph.

How?

  I've read the article. Although it addressed KVPs, nothing was said about graphs. It's pretty clear to me why the relational model can model any data structure, but will it be an efficient model too? Is it the best-fit model for graphs? Graphs consist of nodes, edges, and properties/metadata. In a relational model everything is persisted in tables. This adds complexity to the solution, because we need to map the graph exactly onto tables. OK, so we introduce relations between tables, and then we have a naive model of a graph. But the time complexity of traversing the graph will be high, because we need to iterate through the rows and find the next link (the related row in the table) every time we want to travel to a neighboring node. OK, so we introduce indexes and speed things up. Indexes take space and also need to be updated from time to time, so they become a new complexity introduced through the need for speed.
 
  All of the above complexities are the direct consequence of mapping nodes/edges/properties to tables/relations. It can be done, but it just doesn't feel natural to me (or anyone else, for that matter). Graph databases are designed from the ground up to address these needs. I quote from Wikipedia:

 A graph database is a database that uses graph structures with nodes, edges, and properties to represent and store data. By definition, a graph database is any storage system that provides index-free adjacency. This means that every element contains a direct pointer to its adjacent element and no index lookups are necessary.

  As you can see from the above, we DON'T have tables/relations/indexes in a graph database, so there will be NO NEED to design tables/relations. It just works the way it does. I hope that I've answered the first two 'How?'s.
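
  Here is a toy contrast of the two traversal styles, just to illustrate the point; the classes are invented for the example:

    # Relational style: every hop is a lookup in an edges table.
    edges = [("A", "B"), ("B", "C"), ("A", "C")]

    def neighbours_relational(node):
        return [dst for src, dst in edges if src == node]

    # Index-free adjacency: the node object points straight at its neighbours.
    class Node:
        def __init__(self, name):
            self.name = name
            self.adjacent = []  # direct pointers, no table scan or index lookup

    a, b, c = Node("A"), Node("B"), Node("C")
    a.adjacent = [b, c]
    b.adjacent = [c]

    assert neighbours_relational("A") == [n.name for n in a.adjacent]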

The retrieval of the data may be slow in an HGDB,

Why would it be slow?

  They won't be slow (they are designed to be fast); I just like to have a Plan B.

so we store the atoms and links in the HGDB but use a KVP to store the metadata,

What metadata?  The atomspace has metadata ??

  The atomspace itself doesn't have metadata; nodes and links do (TVs, AVs, etc.).

resulting in fast traversal of the graph nodes and fast retrieval of atom data.

Two sentences ago, you said it was slow; now you say it's fast? Which is it?

  I said it MAY be slow. I'm defining the Plan B.

All we have to do is synchronize the atomic operations of saving,
 
Why?

  If we had to resort to plan B, then we would need to address the atomicity of storing nodes/links and their properties (a complexity introduced by using two separate dbs for storing the graph and the metadata, my plan B).

How?

  We should develop a transaction mechanism. It's pretty straightforward.
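
  For example, a compensating transaction would be the simplest form of it; here plain dicts stand in for the graph db and the KVP (a real store would expose its own begin/commit/rollback primitives instead):

    graph_edges = {}  # link handle -> targets (the graph db side)
    metadata    = {}  # link handle -> properties (the KVP side)

    def atomic_save(handle, targets, props):
        graph_edges[handle] = tuple(targets)  # write 1: graph structure
        try:
            metadata[handle] = dict(props)    # write 2: node/link properties
        except Exception:
            del graph_edges[handle]           # compensate: undo write 1
            raise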

something we will be doing anyway no matter what technology we use.

Doing what?

  Addressing atomicity, in the case of Plan B.

  Questions 8, 9, 10: These questions are specific to relational-model dbs,

No, they're not; I think you misunderstood the questions. 

  Considering the graph db definition, yes they are.  

except the locks and their mechanism (which I can't answer right now).

Why are locks needed? The current sql persistence backend does not need locks ... 

  If the current design doesn't have locks, then we don't need locks. I meant that migrating to graph dbs won't free the world of the need for locking mechanisms.

  From my point of view, the ten questions above can be reduced to:

Again -- spend at least a few days or maybe a LOT longer, before you try to answer these questions.  Do some research.  Google stuff when you get stuck. Sketch out on a piece of paper how you think you will organize things; what points where, how; the names of all the tables that you'll need, what each table will contain.

  A generally true statement for any pragmatic person. Although, if I ever took the job, I wouldn't be connecting tables. The relational model is a mature technology, and it would have been my first choice if I weren't modeling a graph. The relational model just wasn't designed with graphs in mind; I don't mean to call it an obsolete methodology.

These were not meant to be easy questions, and if you think they are easy, then you didn't really understand the questions.

  I never said they are. But I believe that migrating to a graph db will erase most of these questions. They simply cease to exist.

Ben Goertzel

Jul 15, 2013, 8:22:05 AM
to ope...@googlegroups.com
FWIW, my intuition agrees with Ramin's that a graph database would be
a good choice for OpenCog persistence

A hypergraph can be pretty straightforwardly mapped into a graph, in
such a way that queries that would be fast over a hypergraph are also
fast over that graph...
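
A toy version of that mapping: each hyperedge becomes an ordinary node, with edges out to its members labeled by position, so incoming-set queries stay one hop away (the encoding below is invented just for illustration):

    hyperedge = ("SimilarityLink", ["A", "B", "C"])  # link type + outgoing set

    def to_graph(name, hyperedge):
        kind, members = hyperedge
        nodes = {name: kind}  # the hyperedge itself becomes a node
        edges = [(name, m, pos) for pos, m in enumerate(members)]
        return nodes, edges

    nodes, edges = to_graph("h1", hyperedge)
    # "which links contain B?" is now a plain edge scan:
    assert [src for src, dst, _ in edges if dst == "B"] == ["h1"]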

Whether existing graph DB technologies are scalable in the ways we'd
need for OpenCog down the road is a different question...
-- Ben G

David Hart

Jul 15, 2013, 9:00:49 AM
to ope...@googlegroups.com, Borislav Iordanov
On 15/07/13 22:22, Ben Goertzel wrote:
> FWIW, my intuition agrees with Ramin's that a graph database would be
> a good choice for OpenCog persistence
>
> A hypergraph can be pretty straightforwardly mapped into a graph, in
> such a way that queries that would be fast over a hypergraph are also
> fast over that graph...
>
> Whether existing graph DB technologies are scalable in the ways we'd
> need for OpenCog down the road is a different question...

HyperGraphDB, as I understand it, implements a graph DB store using KVPs
with BerkeleyDB - I'm hoping that Boris can chime in with his experience.

Probably shoehorning a graph database into a bunch of KVPs isn't optimal,
but it's probably also much less effort than writing a more 'native'
persisting graph database from scratch (whatever 'native' means with
respect to implementing them on digital computers), and it can take
advantage of other KVP library features, like the memory-mapped
zero-copy lookup found in LMDB.

That one feature alone, combined with solid state storage, raises some
fundamental design questions, like does it make sense to continue to
mostly use a query + batch-load + batch-process model? When would
pattern matching and graph transformation operating directly on stored
atoms make sense over query-load-process?

-dave
David Hart
OpenCog Foundation

Linas Vepstas

Jul 15, 2013, 1:34:07 PM
to opencog, Ramin Barati
On 14 July 2013 22:32, Ben Goertzel <b...@goertzel.org> wrote:
 the list of Atom types will
probably ultimately be dynamic.

FWIW, the current backend already supports dynamic types (even though the atomspace currently does not) in the following sense: you can store a bunch of stuff, stop opencog, add a bunch of new atom types, start opencog, and restore from the database.  The mapping of old to new is handled automatically, and has been tested and works.  However, this is a different concept from what you describe below.
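
Roughly, the old-to-new mapping works like the following toy sketch (not the actual backend code): type *names* are persisted alongside the numeric codes, so a restore just rebuilds the code table by name:

    saved_type_table = {1: "ConceptNode", 2: "SimilarityLink"}  # as persisted
    runtime_types    = ["Atom", "ConceptNode", "NumberNode",
                        "SimilarityLink"]                       # after adding types

    # old numeric code -> new numeric code, matched by name
    remap = {old_code: runtime_types.index(name)
             for old_code, name in saved_type_table.items()}

    saved_atom = {"handle": 7, "type": 2}  # as stored on disk
    restored   = dict(saved_atom, type=remap[saved_atom["type"]])
    assert runtime_types[restored["type"]] == "SimilarityLink"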
 

For instance, let's say we had an AtomSpace with a very large number
of relations such as

EvaluationLink
    CommunicatesWith
    A
    B

because the system has been fed a lot of data from repositories of
email and mobile phone
communications [just a random example off the top of my head]

Then we may want to have a CommunicatesWithLink, to save memory, and
simplify the
encoding of rules regarding communication in the system...

If the system automagically identifies that CommunicatesWithLink
should be added as a new
link type, you'd like this change to be able to happen, without some
sort of wholesale
restructuring of the knowledge store

Good one. The traditional solution to this is to use triggers and stored procedures, thus hiding the actual data representation from the programmers (this is partly due to the lawyers, and partly due to govt regulation (HIPAA, Sarb-Ox): you can't have some green, inexperienced yahoo programmers screwing up the data, so you have the DBAs control and guarantee the data integrity).

So, if this were SQL, not opencog, the DBA would write a stored procedure that gets used whenever a programmer asks for a CommunicatesWithLink, and it would return what was asked for, never mind how it's actually stored in the DB.  Or, if they asked for the other variant, the other variant would get returned.  How the database *actually* stores this info is hidden from the programmer, and is controlled entirely by the DBA, who reports to a different management chain than the developers, culminating in a different executive.

We could implement something like this (i.e. stored procedures and triggers) within the atomspace.  My initial reaction is that I think it's orthogonal to the persistence question.   (Persistence is "how do we store stuff", while the above is "how do we query stuff?")   An initial, simple-minded implementation wouldn't be that hard.
 
But the point is, it would be *better* if the list of Atom types were
not hard-wired in,

It's not.

Note that the original question was not about storing atom types, it was about storing the type of an atom.  If every different atom type is stored in its own unique table, there's no reason that new tables couldn't be created on the fly ...

 
> 2) Are links and nodes stored in the same table, or in different tables?

One comment is that, since we can have links pointing to links, the
link/node distinction is not really that fundamental...

Sure, but nodes must save strings, and links have outgoing sets. So perhaps one can save storage space by keeping them separate.  None of my questions had to do with "fundamentals"; they had everything to do with making practical, detailed design decisions which have real, actual repercussions on speed, memory used, compactness on disk, and query efficiency.  At the low level, details matter, even if at the high conceptual level they all blur away.
 
This is an argument in favor of "same table", but not a decisive one
of course...

Don't you think it would take twice as long to search a single large table as two smaller ones?  Or do you think that an index can make up for this?  These are the kinds of questions one needs to actually think about, if you want performance.
 

> 3) Are truth values stored with the atoms, or are they in a separate table?
> Same question for attention values?

The current attention allocation subsystem uses a big matrix of
attention values for carrying out the mechanics of
importance spreading....   This suggests that, in RAM, it can
sometimes be useful to store AVs separately from Atoms...

We are not talking about RAM, we are talking about disk (SQL/noSQL). The way things are stored in RAM is determined by the upper reaches of the atomspace, and has nothing to do with persistence.
 
In the case of persistent storage, though, I can't currently think of
an important case where you'd need to grab a TV or AV
without also grabbing the Atom.....  

I think you misunderstood the question. I was asking "how do you store it?"  and not "how do you query it?"
 
Of course, if you're grabbing a
network of Atoms from disk, then at the edges of the network,
you may be grabbing the TV, AV and other basic information of certain
Atoms and not grabbing their links.   But this case doesn't really
justify having a separate store for TVs or AVs...

You entirely missed the point of the question. Having separate tables has only a marginal relation to how things get queried.  I'm asking for a low-level, specific, concrete proposal, and you are responding with a very high-level, abstract contemplation.  The one has very little to do with the other ... I don't disagree with your answer, it's just answering a different question.

 
> 4) Should the outgoing set of a link be considered to be part of a link
> type, or not?   Should links with different outgoing types be stored in a
> different table, or the same table? That is, if I have
>
>   SimilarityLink
>      ConceptNode A
>      ConceptNode  B
>
> and
>
>    SimilarityLink
>       ConceptNode A
>       NumberNode B
>
> should these be stored in the same table (the table of all similarity links)
> or in two different tables?

Often there are multiple ways to represent the same thing, and the
choice between them is somewhat arbitrary

Again, this entirely misses the point of the question.  Yes, in some abstract sense, these all map to one-another.  This has nothing at all to do with how they would actually be written to a database!
 

> 5) Does the outgoing set get stored as one single record, or is it split
> across many records?  One table, or many tables?

Different queries will tend to filter the outgoing set in different
ways, and there seems no way to split it up across many records in a
way that will make sense for the variety of common queries.  

ok, well, that's more interesting.  Realize, though, that in KVP stores there is no alternative, and an outgoing set must be split across multiple records; that's a fundamental restriction, right? One key can have only one value.  Unless you plan to have that one value be a list of atom handles, in which case someone has to parse that list ... that is, if your storage system does not support lists natively (which, for example, is something that MySQL does not support, and which is why the current persistence backend doesn't work on MySQL...)

Which segues to the question about incoming links, because if you store it as a list that has to be parsed, then how do you perform the query that returns the incoming set?
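
Concretely, the situation looks something like this toy sketch, with a dict standing in for the KVP store: the outgoing set is one encoded value that someone must parse, and fast incoming-set queries force you to maintain a second, reverse index on every write:

    import json

    kvp = {}       # "out:<link>" -> JSON-encoded list of target handles
    incoming = {}  # target handle -> set of link handles (the reverse index)

    def store_link(link, targets):
        kvp[f"out:{link}"] = json.dumps(targets)  # must be parsed on read
        for t in targets:
            # one extra write per target is the price of fast incoming queries
            incoming.setdefault(t, set()).add(link)

    store_link(3, [1, 2])
    assert incoming[1] == {3}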
 
So, if
it's split up, one has to assume that a significant percentage of
queries will need to be executed across multiple parts...

Sure. That would be the Map step of a MapReduce.  Part of the noSQL territory, if you want to use a noSQL design.
 
> 7) In the above, perhaps "the set of all incoming links" is not a
> requirement, perhaps "the set of most incoming links, in an
> eventually-consistent fashion" is enough.  How would this affect the design?

In that case, there should be a way for the system to increase its
confidence of having
a complete version of the set of incoming links toward certainty, via
exerting extra
effort/resources

How? Details matter. The point is that if you have many agents crunching data, there is a constant flux in the incoming set: new atoms getting added, others getting deleted.  There is a good chance that between the time that you grab an incoming set, and the time that you are done using it, one of the atoms in that set is deleted by some other agent.   Now what?

You can try to lock everything down, but using locks would probably be a bottleneck.  Unless you made the locks very fine-grained... but then they get hard to manage.   What does "exerting extra effort/resources" really mean? You could broadcast a global message to all agents: "please stop what you are doing, I need to get the incoming set, don't do anything till I tell you it's OK again ..." but this "solution" to the problem would be horrible.

If you really really try, you can remember *almost* everything you
know about your
first girlfriend, even though you may never get to completeness...

Err, no. If the set of incoming links is static and unchanging, then "eventual consistency" means that by the time you query, you will almost certainly get everything.  That's what the E in BASE stands for.

The problem is what to do before "eventual consistency" happens:  while the incoming set is still rapidly changing, because it's in the here-and-now, and not some old, cobwebbed memory.

> The above is not an exhaustive list of questions; there are more.  We have
> not yet even talked about "distributed" and "synchronization" and
> "consistency" and "concurrency" and "write-back".  These are even harder
> questions, best left for later.

About the requirements for a distributed Atomspace, I did think this
through pretty
carefully about a year ago ... those who don't remember the
particulars of that long
discussion can look at the document here:

http://wiki.opencog.org/w/DistributedAtomspace

Which hopefully concludes with the statement that "we have had a working, debugged, operational distributed atomspace since 2009, which no one is actually using."   I rather resent this idea that we have to redesign something that no one bothers to actually use.  While it's likely that what we currently have will eventually be found lacking and deficient, doing blue-sky designs without learning how to fly first is a recipe for waste.

Unless you want to redesign the atomspace from scratch, which is ... something else again.  I have started implementing a brand-new, from-scratch atomspace. I don't know if I like it yet, but I think it's a lot faster and slimmer than the current one.  (It's in the link-grammar codebase. Conceptually identical to the current atomspace, just that everything is done very very differently.)

-- Linas

Linas Vepstas

Jul 15, 2013, 2:02:43 PM
to opencog
On 14 July 2013 23:24, David Hart <dh...@opencog.org> wrote:


I can envision use cases where a KVP persistence layer with separate
instances for each atom attribute outperforms an RDB-style system were
all atom data is lumped into single records or views, because of
parallelism. Think of it like RAID striping; if each atom attribute is
stored in a separate storage instance, a single-atom fetch may be slower
than the RDB case, but a stream of them will likely fetch faster.

I think you are building a false analogy.  Even with raid striping, disk block sizes are 512 bytes; the disk drive industry is experimenting with much larger block sizes.  Atoms are about 100 bytes or so.  Trying to stripe 100 bytes across dozens of records sounds like the kiss-of-death to me.

I just don't see how a dozen cpu cores, each pulling 4 or 8 bytes out of a dozen records, can possibly outperform a dozen cpu cores, each pulling out one atom each.   Think about cache locality: if an atom is 100 bytes, and those bytes are contiguous, then a single atom can fit in one or two cache lines.  But if you split the atom up into a dozen pieces, then you need a dozen cache lines: which eventually leads to a 5x or 10x higher cache miss rate.
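To put rough numbers on the locality claim, here's a sketch -- the field layout is invented, it's not actual opencog code:

#include <cstdint>

struct PackedAtom {                 // ~100 bytes, all contiguous
    uint64_t handle;
    uint16_t type;
    uint16_t arity;
    float    sti, lti;              // attention values
    float    strength, confidence;  // truth value
    uint64_t outgoing[8];           // small fixed outgoing set, just for illustration
};
static_assert(sizeof(PackedAtom) <= 2 * 64, "fits in two 64-byte cache lines");

Split those fields across a dozen KVP records, and the same read touches a dozen cache lines instead of two.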

Again, if you want to make a convincing argument, you have to go out and actually measure these things.  In 2009, when I wrote the current persistence backend, we had EXACTLY the same conversation. I installed the fastest KVP database of the day, stuck a million atoms into it, and measured.  I got about 25K to 30K atoms/second.  I did the same thing with postgres and got 5x or 10x that.

Why?  After some time on the mailing lists, it became clear: the KVP stores are optimized for storing web-pages, javascript, png's, gif's, mp3's, flash animations.   These are 50K to 5MB in size, typically.  Maybe it's true that a KVP would be faster than postgres for this kind of data. Maybe, I dunno, did not measure.  When I told them that my data was 50 bytes or 100 bytes or so, they kind of laughed, the "what's wrong with you" laugh.  The response was dismissive, because no ordinary person would ever store something ridiculous like that.

That is why the current persistence backend is on SQL, not on KVP.  The KVP code is still in the git repo, if you want to look. It's broken, though, and never fully worked.

The problem is that 90% of the noSQL fanboys have no clue of what they are talking about. There are legitimate uses for noSQL and map-reduce, but I think that the current atomspace design is so tremendously far away from that, that the gap can't be bridged without a total over-haul, and I don't think we have enough smart people here to be able to perform that overhaul.  We simply don't have the manpower or the experience on tap.

If you think you know how to map-reduce within the atomspace, and get high performance, I'd like to hear it.  But make it convincing, not hand-waving.

-- Linas
 

Linas Vepstas

Jul 15, 2013, 2:54:23 PM
to opencog
Ramin,

I don't know what to say. I don't know how to respond to what you wrote.  I'm sorry that I upset you; I think you need to get some rest, take a vacation.  Lets just take a break, and resume this conversation in a week.

-- Linas


Charles Beyer

Jul 15, 2013, 3:49:48 PM
to ope...@googlegroups.com
Microsoft has been working on a distributed computing architecture called Orleans.  While it may not be an exact fit for AtomSpace, you might find insight in how they chose to distribute and synchronize Grains (their Atom equivalents).



Charles Beyer,
cjb...@gmail.com
Seattle, WA

Linas Vepstas

Jul 15, 2013, 3:57:52 PM
to opencog, Borislav Iordanov
Hi Dave,

On 15 July 2013 08:00, David Hart <dh...@opencog.org> wrote:

Probably shoehorning a graph database into a bunch of KVPs isn't optimal
but it's probably also much less effort than writing a more 'native'
persisting graph database from scratch (whatever 'native' means with
respect to implementing them on digital computers),

:-)

and it can take
advantage of other KVP library features, like the memory-mapped
zero-copy lookup found in LMDB.

In May 2008 (5 years ago) I measured 219K atoms/second through the SQL backend, and 40K atoms/second through the AtomTable.  (see the opencog/persist/sql/README file)    So, 5 years ago, the SQL accounted for less than 20% of the overhead.

I did not measure the Atomspace performance back then.

This spring, I measured the AtomSpace performance; see opencog/benchmark/diary.txt  --  the current atomspace is 8x or 10x SLOWER than the atomtable.    If I multiply numbers together, I naively conclude that the SQL portion of the atomspace performance is about 2% of the grand total.

We are wasting a lot of electrons talking about how to optimize this 2% ... We have a persistence backend, it works, and it's not the bottleneck.

It's not perfect; it's flawed, badly flawed, even.  I could write a book about everything that is wrong with it. But redesigning it does not seem like the correct plan of action at this time.
 

That one feature alone, combined with solid state storage, raises some
fundamental design questions, like does it make sense to continue to
mostly use a query + batch-load + batch-process model?

This is a valid question; however, first, I would like to point out that the current SQL back-end does on-the-fly as well as batch. 
 
When would
pattern matching and graph transformation operating directly on stored
atoms make sense over query-load-process?
 
Possibly.  This question can be taken in several ways, and one of them seems to be "should we re-design the atomspace?"

Philosophically, I am an evolutionary incrementalist. This means that I usually prefer small, incremental steps that change the codebase in small steps, rather than one big revolutionary, boil-the-ocean approaches. So I would propose this:

-- Who are the primary users of the atomspace today? 
-- What performance bottlenecks are they experiencing?  (I would prefer hard numbers rather than vague assertions and accusations)
-- What can we do to fix those bottlenecks? (viz, what can we do with a person-week or a person-month effort? I don't want a 5 man-year project to redesign everything.)

So, in the context of your question: "When would pattern matching and graph transformation operating directly on stored atoms make sense over query-load-process?"  my answer(s) would be:

-- Sure, we could move pattern matching inside of the atomspace, instead of having it outside. It would run a lot faster that way.  And, although it would not yet be operating directly on stored atoms, it would be one step closer to that.

-- 'graph transformation operating directly on stored atoms': if this is what I think it is, then we don't have graph transformations at all, yet, but it would make sense to put it inside the atomspace, not outside, since that would have much higher performance.

Linas

Borislav Iordanov

Jul 15, 2013, 6:38:11 PM
to David Hart, opencog
Hi,

On Mon, Jul 15, 2013 at 9:00 AM, David Hart <dh...@opencog.org> wrote:
> On 15/07/13 22:22, Ben Goertzel wrote:
>> FWIW, my intuition agrees with Ramin's that a graph database would be
>> a good choice for OpenCog persistence
>>
>> A hypergraph can be pretty straightforwardly mapped into a graph, in
>> such a way that queries that would be fast over a hypergraph are also
>> fast over that graph...
>>
>> Whether existing graph DB technologies are scalable in the ways we'd
>> need for OpenCog down the road, is a different questions...
>
> HyperGraphDB, as I understand it, implements a graph DB store using KVPs
> with BerkeleyDB - I'm hoping that Boris can chime in with his experience.
>
> Probably shoehorning a graph database into a bunch of KVPs isn't optimal
> but it's probably also much less effort than writing a more 'native'
> persisting graph database from scratch (whatever 'native' means with
> respect to implementing them on digital computers), and it can take
> advantage of other KVP library features, like the memory-mapped
> zero-copy lookup found in LMDB.

I think mapping the Opencog hypergraph model to a classical graph
database is not so trivial. Unless the actual data lends itself to it,
e.g. if most links are really arity 2, then it would make more sense.
Otherwise, it's problematic. A natural mapping from a hypergraph to a
graph could be to represent a (hyper)link L as a (classical) node N
and create a classical edge between N and each of L's targets. Then to
avoid losing the order of the original tuple, you'd have to store it
as a property of the edge connecting the hyperlink and its target. My
guess is this leads to a lot of wasted space and more processing time.
Maybe there are better ideas, but I haven't seen anything easy to work
with.
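To make that concrete, the mapping would look roughly like this (a C++ sketch with invented types):

#include <cstdint>
#include <vector>

struct GraphEdge {
    uint64_t from;      // the classical node N standing in for hyperlink L
    uint64_t to;        // one of L's targets
    uint32_t position;  // edge property preserving the order of the tuple
};

// A hyperlink with n targets costs one node plus n edges plus n position
// properties -- the wasted space and extra hops mentioned above.
std::vector<GraphEdge> lower_hyperlink(uint64_t link_node,
                                       const std::vector<uint64_t>& targets) {
    std::vector<GraphEdge> edges;
    for (uint32_t i = 0; i < targets.size(); i++)
        edges.push_back({link_node, targets[i], i});
    return edges;
}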

What distinguishes graph dbs from KV stores is that graph dbs are
pretty much like object-oriented dbs: each entity has a unique
identifier that also happens to be a direct pointer in storage to
where the entity is located. So lookup or insertion are not log(n)
tree operations as in KV stores, but really O(1) operations. That's
how some of the graph dbs boast about great traversal performance. But
then they start having problems (1) with queries that are more
set-oriented, more SQL-like and (2) with data partitioning across many
machines, because then the identifiers are no longer unique and to make
them unique you go back to the KV strategy. Problem (1) can be
solved with some extra storage techniques on top of the graph
structure like the DEX database does (and I don't know of any other
that does it as well), and I don't know how problem (2) is solved or
alleviated.

A native HyperGraphDB storage could follow the idea of identifiers
(i.e. handles in OpenCog hypergraph lingo) being disk offsets and, to
solve the distribution issue, it would split an ID into two parts: a
long identifying the origin (the DB instance where the atom was
created) and a long which is a local disk offset. Each node in a HGDB
cluster would then have a separate file for each peer it replicates
atoms for. So all atoms originating at a given peer instance will be
stored in a dedicated file with the local offsets somehow mapped from
one peer to another (so that the file size doesn't have to be
replicated).
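In code, such an identifier would be just a pair of longs -- a sketch, names invented:

#include <cstdint>

struct SplitHandle {
    uint64_t origin;   // the DB instance where the atom was created
    uint64_t offset;   // disk offset, local to that instance's file
};

// Resolving a handle on any peer: pick the file dedicated to 'origin',
// seek to 'offset' (possibly through the per-peer offset mapping), read.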

There are also some newer algorithms for KV storage to experiment
with, e.g. fractal trees, and I just found out TokuDB is becoming more
open (http://en.wikipedia.org/wiki/TokuDB), which is nice.

Boris

Linas Vepstas

Jul 15, 2013, 7:58:14 PM
to opencog, David Hart
Hi Boris,

Some casual thinking-aloud below.

On 15 July 2013 17:38, Borislav Iordanov <borislav...@gmail.com> wrote:

A native HyperGraphDB storage could follow the idea of identifiers
(i.e. handles in OpenCog hypergraph lingo) being disk offsets

Hmm. Interesting idea. But offsets are kind-of-like pointers, and all of the comp-sci issues with pointers show up with offsets as well, right?   So either you have to alloc/free the offset, and make sure no one is using it when you free it, or you use reference counting, or you garbage collect unused disk offsets.

By contrast, ID's that are not offsets have some but not all of these problems. So, for example, you can't free the storage associated with an ID until you are sure no one is using it.  But you are not under any pressure to re-use an ID, since you don't waste any space if an ID is never used again.

Finding out if an ID is used is "easy"  for a user of a relational DB: you just ask it "is it being used?" and you get an answer.  It's someone else's problem (viz the DB author's) to make this actually happen.  But if we go native, then we have to solve this alloc/free/refcnt/gc mess ourselves. 

I noticed long ago that there are people out there who really, really want to write their own OS kernel, damn the rest of the world.  Then there are people who roll their own compilers and languages.  I suppose there are people who really really want to roll their own DB (are you one such?).  We need to find one of these people, and get them to make one for opencog.  Maybe what we really need to do is to get on DB discussion forums and start pleading and advertising and begging and hyping and trying to capture someone's imagination.  How to do this?
 
and, to
solve the distribution issue, it would split an ID into two parts: a
long identifying the origin (the DB instance where the atom was
created) and a long which is a local disk offset.

The current opencog sql persistence backend solves this in a very simple, but I think clever way. Basically, each instance requests a block of ID's, which it can issue however it wants. When these run out, it just asks for more.  There are no ID collisions, and I don't have to issue them in blocks of 2^32 or 2^64, but can issue tiny blocks of 1M each.  Issuing is trivial: increment some counter by += 1M (or whatever). If one instance needs more than 2^64 ID's, that's also not a problem. And since ID's never have to be returned or freed, that's it. Real easy.
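In sketch form (an illustration of the scheme, not the actual backend code):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> global_next{1};   // stands in for the counter kept with the DB
constexpr uint64_t BLOCK = 1000000;     // grant size; tune as desired

struct IdIssuer {                       // one per opencog instance
    uint64_t next = 0, end = 0;
    uint64_t issue() {
        if (next == end) {              // block exhausted: ask for more
            next = global_next.fetch_add(BLOCK);
            end  = next + BLOCK;
        }
        return next++;
    }
};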
 
Each node in a HGDB
cluster would then have a separate file for each peer it replicates
atoms for. So all atoms originating at a given peer instance will be
stored in a dedicated file with the local offsets somehow mapped from
one peer to another (so that the file size doesn't have to
replicated).

Replica management seems like it can become very non-trivial.  So again, I'm wondering: can we convince some DB geek(s) out there to invent a brand new hypergraph DB?

Now, I am utterly clueless as to how to think about your HGDB. How hard would it be to map opencog atoms into it?  How hard would it be to do this from C++?  What might the performance be?

-- Linas

Joel Pitt

Jul 15, 2013, 9:32:24 PM
to opencog, David Hart
On 16 July 2013 11:58, Linas Vepstas <linasv...@gmail.com> wrote:
> On 15 July 2013 17:38, Borislav Iordanov <borislav...@gmail.com>
> wrote:
> Hmm. Interesting idea. But offsets are kind-of-like pointers, and all of the
> comp-sci issues with pointers show up with offsets as well, right? So
> either you have to alloc/free the offset, and make sure no one is using it
> when you free it, or you use reference counting, or you garbage collect
> unused disk offsets.
>
> By contrast, ID's that are not offsets have some but not all of these
> problems. So, for example, you can't free the storage associated with an ID
> until you are sure no one is using it. But you are not under any pressure
> to re-use an ID, since you don't waste any space if an ID never used again.

The way I'm approaching these challenges is to use "chunks" of storage
(essentially 4-64k pages... to be tuned later) which will have their
own non-pointer ID. An actual Atom ID consists of both a chunk ID and
an offset.

This has the benefit of not having to store the entire ID for each
atom, since the ID is just a combination of the chunk it's in and the
offset.

Thus, if you have a subgraph that can be completely stored within a
chunk, all references will be offsets, otherwise references between
chunks require chunk_id + atom_id.
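Roughly, with made-up field widths (a sketch, not my actual code):

#include <cstdint>

using AtomId = uint64_t;
constexpr unsigned OFFSET_BITS = 16;   // 4-64k pages => small in-chunk offsets

inline AtomId make_id(uint64_t chunk, uint64_t offset) {
    return (chunk << OFFSET_BITS) | offset;
}
inline uint64_t chunk_of(AtomId id)  { return id >> OFFSET_BITS; }
inline uint64_t offset_of(AtomId id) { return id & ((1ull << OFFSET_BITS) - 1); }

References that stay inside one chunk store only the small offset; only cross-chunk references pay for the full id.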

The db will be a persistent data store, in that no atom can be
explicitly deleted. Atoms can however be marked as stale, and
read-repair will optionally propagate this stale-marker to links that
connect to stale atoms. Unchecked database storage size growth will be
mitigated by a compaction process that can remove stale atoms and by
providing chunks that can be flagged as "temporary". The limitation of
these temporary chunks is that no other chunk's atoms can reference an
atom in this temporary chunk.

Chunks can be distributed, and the graph can repartition itself
between chunks, but that won't be a major focus for me for a while
yet.

I'm aiming for around 90-140 bytes per OpenCog-like atom *including*
indexes and everything else... but that's very dependent on the number
of properties stored and average outgoing set.

I've spent a fair amount of time learning about distributed databases,
and intend to follow through with my implementation. But I'm hesitant
to suggest it for any OpenCog plans... because its progress will
depend on my time and it will be done "when it's ready".

Ben Goertzel

Jul 15, 2013, 11:04:09 PM
to ope...@googlegroups.com, David Hart
Boris,

Currently, nearly all the Atoms in the Atomspace are either nodes,
binary links or ternary links...

Ternary because

EvaluationLink
    PredicateNode: eat
    List
        cat
        mouse

is generally represented in the code as a ternary EvaluationLink....

-- Ben G

Ben Goertzel

Jul 15, 2013, 11:13:44 PM
to ope...@googlegroups.com, Ramin Barati
Thanks for the responses to my reply, Linas...

In my reply, I did not intend to be answering all your difficult
questions.... Rather, I was just pointing out some related facts
about the Atomspace and its usage for folks to think about.... You
are probably already aware of all the factors I mentioned, but not
everybody else is.

>> About the requirements for a distributed Atomspace, I did think this
>> through pretty
>> carefully about a year ago ... those who don't remember the
>> particulars of that long
>> discussion can look at the document here:
>>
>> http://wiki.opencog.org/w/DistributedAtomspace
>
>
> Which hopefully concludes with the statement that "we have had a working,
> debugged, operational distributed atomspace since 2009, which no one is
> actually using." I rather resent this idea that we have to redesign
> something that no one bothers to actually use. While it's likely that what
> we currently have will eventually be found lacking and deficient, doing
> blue-sky designs without learning how to fly first is a recipe for waste.

If you take half an hour to re-read the PDF linked above, which is here

http://wiki.opencog.org/wikihome/images/e/ea/Distributed_AtomSpace_Design_Sketch_v6.pdf

you will remember that your own comments and suggestions played a
large role in shaping the document

You will also note that it was heavily guided by the requirements of
gaming and robotics applications. We are not yet doing those
applications, because of other needed code being incomplete.
However, once we start doing those applications seriously (which I
think will be pretty soon in the case of gaming, though I don't want
to argue that point since it's been delayed many times already), we
will start feeling the pain of running out of RAM, and will want
something like the design given in the PDF...

> Unless you want to redesign the atomspace from scratch, which is ...
> something else again. I have started implementing a brand-new, from scratch
> atomspace. I don't know if I like it yet, but I think it's a lot faster and
> slimmer than the current one. (It's in the link-grammar codebase.
> Conceptually identical to the current atomspace, just that everything is
> done very very differently).

The distributed Atomspace design I proposed in the above-referenced
PDF is largely independent of

-- how the Atomspace is implemented within RAM on a single machine

-- what kind of backing store is used

Anyway I am happy to discuss the distributed Atomspace proposal, but
it seems not worthwhile to do so unless you feel like refreshing your
memory regarding what it actually says ;)

... ben

David Hart

Jul 16, 2013, 4:56:52 AM
to ope...@googlegroups.com
On 16/07/13 05:57, Linas Vepstas wrote:

> We are wasting a lot of electrons talking about how to optimize this
> 2% ... We have a persistence backend, it works, and its not the
> bottleneck.
>
> Its not perfect, its flawed, its badly flawed, even. I could write a
> book about everything that is wrong with it. But redesigning it does
> not seem like the correct plan of action at this time.

If I understand the requirements correctly as outlined in Ben's design
sketch, the system needs to accommodate AtomSpaces much larger than
available physical RAM and be able to query+copy sets of atoms between
nodes, and both of these things could be accomplished using your
persistence backend with some additional work, such as a well-tuned
ForgettingAgent and a new TelepathyAgent that uses remote connections to
atomspaces and postgres instances.

With that said, reducing explicit memory management (i.e. reducing
complexity & overheads of Forgetting) may be a good reason for trying
experimental design changes that use memory-mapped storage for
persistence, effectively offloading the memory management to the
dumb(er) persistence and OS layers. I believe that could be accomplished
by replacing all memory allocation/deallocation semantics in the current
AtomSpace design with those of a low-level persistence library. Doing
that may actually slow down the AtomSpace even further, but the memory
management benefits *might* be worth the tradeoff in certain use cases
(for certain kinds of nodes). Of course that assertion can be validated
only by building it and measuring it.
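As a very rough sketch of what that could look like (POSIX mmap, invented names, growth and most error handling omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Atoms placed in this region live in the file; the kernel pages them
// in and out, instead of a MindAgent deciding what to forget.
void* open_atom_arena(const char* path, size_t bytes) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, bytes) != 0) { close(fd); return nullptr; }
    void* base = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);    // writes persist to the file
    close(fd);                               // the mapping keeps the file alive
    return base == MAP_FAILED ? nullptr : base;
}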

> Philosophically, I am an evolutionary incrementalist. This means that
> I usually prefer small, incremental steps that change the codebase in
> small steps, rather than one big revolutionary, boil-the-ocean
> approaches. So I would propose this:
>
> -- Who are the primary users of the atomspace today?
> -- What performance bottlenecks are they experiencing? (I would
> prefer hard numbers rather than vague assertions and accusations)
> -- What can we do to fix those bottlenecks? (viz, what can we do with
> a person-week or a person-month effort? I don't want a 5 man-year
> project to redesign everything.)

I think that's a good approach to the stable/development branches, but
we're also discussing potential experimental branches where a 5 man-year
redesign might be the most sensible answer. All these kinds of
discussions should be valid so long as we're clear about which branches
we're discussing at any given moment. E.g. my first paragraph above was
development branch focused, while the second was experimental branch
focused.

> So, in the context of your question: "When would pattern matching and
> graph transformation operating directly on stored atoms make sense
> over query-load-process?" my answer(s) would be:
>
> -- Sure, we could move pattern matching inside of the atomspace,
> instead of having it outside. It would run a lot faster that way.
> And, although it would not yet be operating directly on stored atoms,
> it would be one step closer to that.

If the atomspace used memory-mapped storage semantics for all memory
allocation/deallocation, pattern matching could operate on in-RAM atoms
and stored atoms in unison. The parts of stored atoms subject to the
pattern matching (assuming it's expressed as a series of masks) would,
at least temporarily, be fetched into RAM (or perhaps just a register).
This is one case where a property-segregated store would be beneficial,
as the parts of atoms not subject to matching could be kept in storage.
How common or how fringe this case might be, I have no idea.

note, here I'm thinking of 'storage' as PCI-attached flash that
completely bypasses all block and filesystem semantics. Segregated atom
properties stored across physical storage devices may operate as a type
of striping. I'll guess that things like CPU-cache lines aren't a
factor, since the pattern match events are likely to be one-offs or
separated by relatively large intervals in time. Stretching a bit
further into optimization speculation, it should be possible to flag
data fetches for these ephemeral pattern matches as do-not-cache,
keeping cache lines usable for other purposes.

> -- 'graph transformation operating directly on stored atoms': if this
> is what I think it is, then we don't have graph transformations at
> all, yet, but it would make sense to put it inside the atomspace, not
> outside, since that would have much higher performance.

This also got me thinking about indexes which are, in the context of
hypergraph semantics, just the results of pattern matches that are
beneficial to keep around, perhaps because the pattern match is repeated
frequently, or because it's expensive. In a system with dynamic
indexing, each pattern match expression (graphs themselves) could be
optionally stored for statistical analysis for building and maintaining
a list of indexes to keep. The indexes themselves, if kept as memory
mapped structures, could be huge and swapped out to storage when not in
use (this assumes the case where creating the index is much more
expensive than simply fetching it, in whole or in part, from storage).
Of course there's still the overhead involved in updating indexes at
each atom add/change/remove, but such a system makes it generalizable
rather than hard-coded for each index to be updated.

-dave

Ramin Barati

Jul 16, 2013, 5:23:06 AM
to Ben Goertzel, ope...@googlegroups.com
@linas

  I agree, the discussion is not constructive anymore and is getting more like a presidential debate. 

  FWIW, my first email wasn't about using a particular technology. I quote myself:

 I also believe that there is no commercial or open source db project that will satisfy all of the needs of the OpenCog data. But if we combine them, we can definitely address all of its needs.

  My proposal was to use each technology where it best performs. After some research on the matter, I've reached the following conclusions (summary):
  1) RDBs should be used where there is a need for ad-hoc queries or when complex data needs to be aggregated, e.g. summation, average, etc.
  2) If it's simple structured data, best use KVPs. They're easier to query and need far less design/develop time than relational models.
  3) GDBs are best used where networks need to be queried, for example "How can I get from X to Y?" or "Which nodes are most strongly related, but not directly connected to X?"
  4) As of right now, OpenCog doesn't have an immediate need for a complete and perfect persistence model, the current model is functional and ready. But it's not clear to me whether a "virtualization layer" exists somewhere in the codebase which one could use without heavily coupling with the persistence methodology.

Ben Goertzel

Jul 16, 2013, 5:30:04 AM
to ope...@googlegroups.com
Ramin,

> But it's not clear to me whether a "virtualization layer" exists somewhere
> in the codebase which one could use without heavily coupling with the
> persistence methodology.

Do you mean something like

https://github.com/opencog/opencog/blob/master/opencog/atomspace/BackingStore.h

?

ben

Ramin Barati

Jul 16, 2013, 5:35:32 AM
to ope...@googlegroups.com
Yeah, that's what I was looking for!

Ben Goertzel

Jul 16, 2013, 5:59:11 AM
to ope...@googlegroups.com
That is how Linas's postgres backing store is accessed, according to
his own design ;)

Keyvan Mir Mohammad Sadeghi

Jul 16, 2013, 4:19:36 PM
to opencog

> Ramin,
>
> I don't know what to say. I don't know how to respond to what you wrote.  I'm sorry that I upset you; I think you need to get some rest, take a vacation.  Lets just take a break, and resume this conversation in a week.

Hmmm, Linas,

I think Ramin has a valid point that using a graph db eliminates the need for thinking about the design choices you've mentioned (since they're mostly concerned with relational representations). I also tend to think that graph dbs would be a better choice for persistence, maintain-wise.

However, for a new AtomSpace design, I think, the question would be: are there any sufficiently good in-RAM graph db implementations out there which could feasibly address the requirements that Ben talks about in his proposal?

Cheers,
--K

Linas Vepstas

Jul 16, 2013, 4:54:54 PM
to opencog
Hi David,


On 16 July 2013 03:56, David Hart <dh...@opencog.org> wrote:

If I understand the requirements correctly as outlined in Ben's design
sketch, the system needs to accommodate AtomSpaces much larger than
available physical RAM

Well, let's do some math. Let's assume a machine with 72GB RAM (cellphones and laptops have 4 or 8GB, so this seems reasonable). Let's assume 100 bytes/atom.  This means we can fit 720M atoms into RAM.  The current AtomSpace API can access atoms at the rate of about 40K atoms/sec.  Thus, a single MindAgent running on a single CPU core would take 720M/40K seconds == 18K seconds == 5 hours to touch every atom just once.

A ForgettingAgent would need to touch an atom twice: once to read its AV, once to write it.  (And a third time, actually: to find out about the atom from some other link, but I'll ignore this.  And a fourth time, to delete it if it's ready to be deleted.)  So a forgetting agent can scan RAM no faster than once every 10 hours.
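Spelling out the arithmetic (same assumptions as above):

#include <cstdio>
int main() {
    double ram = 72e9, bytes_per_atom = 100, rate = 40e3;  // the assumptions above
    double atoms = ram / bytes_per_atom;                   // 720M atoms
    double hours = atoms / rate / 3600;                    // one touch each: ~5 hours
    printf("%.0fM atoms, %.1f h per pass, %.1f h per forgetting scan\n",
           atoms / 1e6, hours, 2 * hours);                 // two touches to forget
}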

Instead of a Forgetting agent, let's assume some inference agent, e.g. PLN.  I assume an inference would touch dozens of atoms, maybe hundreds, and touch each atom 3-5-10 times: e.g. get TV, compute, set new TV, create some new atoms, adjust TV of old atoms, adjust AV, revisit places it's been before, as it back-tracks.  It would be a busy beaver hitting a rather tiny portion of RAM (maybe a dozen KB) a lot, before moving on to some other part of the atomspace.  So I figure that a single PLN agent could spend several days, at least, before it exhausted 72GB of RAM.

And that is assuming we are 'forgetting' zero atoms.  If we assume that maybe 90% or 95% or 99% of all atoms are forgotten shortly after being created and used, thus reclaiming that RAM, we would now have an inferencing agent that could run for weeks or *months* before it ever needed to go to disk.

So, sure, although we can envision more than 72GB of atoms, and thus a need for disk storage, it could be days before an opencog server would *actually* need to go to disk to get something.   The maximum access rate of 40K/sec puts a very minor load on any modern-day disk i/o subsystem: this is a tiny fraction of what modern-day DB's (relational or non-relational) are capable of.  You could even put the DB at the far end of a crappy, slow ethernet line, and it would *still* not be a bottleneck.  You could put a bunch of opencog instances on a somewhat mediocre ethernet, and probably not notice a slowdown. Right now, today, for the opencog/atomspace, the i/o architecture is simply not a bottleneck.  Lack of distribution is not holding opencog architecture back.

Things will stay that way until we obtain an atomspace performance that is 10x or 20x faster.  Simply waiting N years for CPU's to get faster won't do the trick, because in N years, RAM will get bigger too, and the math above won't change.  In order to change the calculation above, we need to dramatically improve access time to Atoms in the Atomspace, and I don't see any way of doing this except to provide direct access.

and be able to query+copy sets of atoms between
nodes, and both of these things could be accomplished using your
persistence backend with some additional work, such as a well-tuned
ForgettingAgent

Yes.
 
and a new TelepathyAgent that uses remote connections to
atomspaces and postgres instances.

?? The current backend connects just fine to remote instances, no agent needed.  If one agent wants something in some remote atomspace, it just asks the backend about it. **(see footnote)
 
With that said, reducing explicit memory management (i.e. reducing
complexity & overheads of Forgetting)

Forgetting seems to be an integral part of inferencing; attention values are an explicit part of the atomspace architecture.  You can't get rid of explicit memory management until you get rid of explicit attention values.

(for certain kinds of nodes). Of course that assertion can be validated
only by building it and measuring it.

Or by doing back-of-the-envelope calculations, like the above.  You don't actually have to build it to get some idea of how well it could or should work -- that is pretty much 100% of the whole point of "doing architecture".
 

> Philosophically, I am an evolutionary incrementalist. This means that
> I usually prefer small, incremental steps that change the codebase in
> small steps, rather than one big revolutionary, boil-the-ocean
> approaches. So I would propose this:
>
> -- Who are the primary users of the atomspace today?
> -- What performance bottlenecks are they experiencing?  (I would
> prefer hard numbers rather than vague assertions and accusations)
> -- What can we do to fix those bottlenecks? (viz, what can we do with
> a person-week or a person-month effort? I don't want a 5 man-year
> project to redesign everything.)

I think that's a good approach to the stable/development branches, but
we're also discussing potential experimental branches where a 5 man-year
redesign might be the most sensible answer.

The number of successful redesign projects out there are tiny.  Remember the re-write of netscape that was mozilla?  It was supposed to be a 6 month project, a year, tops, that took a decade.  This is very typical of re-design projects.  More or less *every* successful project is an incremental evolution of some pre-existing code base.
 
> So, in the context of your question: "When would pattern matching and
> graph transformation operating directly on stored atoms make sense
> over query-load-process?"  my answer(s) would be:
>
> -- Sure, we could move pattern matching inside of the atomspace,
> instead of having it outside. It would run a lot faster that way.
> And, although it would not yet be operating directly on stored atoms,
> it would be one step closer to that.

If the atomspace used memory-mapped storage semantics for all memory
allocation/deallocation, pattern matching could operate on in-RAM atoms
and stored atoms in unison.

There is always a cost for accessing things on disk.  The question is who pays the bill:  which subsystem is the one that is blamed for incurring the overhead?
 
The parts of  stored atoms subject to the
pattern matching (assuming it's expressed as a series of masks) would,
at least temporarily, be fetched into RAM (or perhaps just a register).

An atom is about 100 bytes, and fits in one or two cache lines.  Millions of cache lines can be filled in the amount of time that a single disk access completes. Knock off a factor of 10x for solid-state drives.  

note, here I'm thinking of 'storage' as PCI-attached flash that
completely bypasses all block and filesystem semantics.

Well, you can't, not without inventing a new kernel extension that would get laughed off of the linux kernel mailing list.

To put it more kindly, the overhead of using block and filesystem semantics in your database is maybe at most 0.2% of performance, and probably closer to 0.0001%  so you gain nothing by trying to avoid it.

Segregated atom
properties stored across physical storage devices may operate as a type
of striping. I'll guess that things like CPU-cache lines aren't a
factor, since the pattern match events are likely to be one-offs or
separated by relatively large intervals in time.

??  Please be aware that fetching data from a cache line is 20x to 100x faster than fetching it from RAM.  Modern day x86 and powerpc CPU's can perform 200 or 300 instructions in the time that it takes to fetch something from RAM.  Cache misses are expensive.
 
Stretching a bit
further into optimization speculation, it should be possible to flag
data fetches for these ephemeral pattern matches as do-not-cache,
keeping cache lines usable for other purposes.

What other purposes are there?  What could we possibly be storing in RAM that are not atoms?
 

> -- 'graph transformation operating directly on stored atoms': if this
> is what I think it is, then we don't have graph transformations at
> all, yet, but it would make sense to put it inside the atomspace, not
> outside, since that would have much higher performance.

This also got me thinking about indexes which are, in the context of
hypergraph semantics, just the results of pattern matches that are
beneficial to keep around, perhaps because the pattern match is repeated
frequently, or because it's expensive.

Yes, both.
 
In a system with dynamic
indexing, each pattern match expression (graphs themselves) could be
optionally stored for statistical analysis for building and maintaining
a list of indexes to keep.

Yes!
 
The indexes themselves, if kept as memory
mapped structures, could be huge and swapped out to storage when not in
use

Maybe. I imagine that indexes take up 1%, or maybe 10%, at most 20%, of what the raw data takes.  It's a question I'm willing to punt to the future.
 
(this assumes the case where creating the index is much more
expensive than simply fetching it, in whole or in part, from storage).
Of course there's still the overhead involved in updating indexes at
each atom add/change/remove,

It's a lot cheaper to perform a pattern match at the time of atom creation, because you have a very small hypergraph to explore.  So, yes.

but such a system makes it generalizable
rather than hard-coded for each index to be updated.
Yes, have "user-defined indexes". 

-- linas

Linas Vepstas

Jul 16, 2013, 5:08:43 PM
to opencog, Ben Goertzel
On 16 July 2013 04:23, Ramin Barati <rek...@gmail.com> wrote:

  My proposal was to use each technology where it best performs. After some research on the matter, I've reached the following conclusions (summary):
  1) RDBs should be used where there is a need for ad-hoc queries or when complex data needs to be aggregated, e.g. summation, average, etc.
  2) If it's simple structured data, best use KVPs. They're easier to query and need far less design/develop time than relational models.
  3) GDBs are best used where networks need to be queried, for example "How can I get from X to Y?" or "Which nodes are most strongly related, but not directly connected to X?"
  4) As of right now, OpenCog doesn't have an immediate need for a complete and perfect persistence model, the current model is functional and ready. But it's not clear to me whether a "virtualization layer" exists somewhere in the codebase which one could use without heavily coupling with the persistence methodology.

You seem to be missing an important point: there already is a working backend. It is less than 3KLOC total: three .cc files and three .h files. Per other emails, it adds no more than 20% to the atomspace overhead.  So all of the protestations about complexity and performance ring hollow.  The code has already been written; there is no mandatory extra work.  The 3KLOC is just not a lot: it's not complex.  The performance overhead is not ideal, but good enough for now.

The only real downside to the current interfaces is that it's not fully automatic; the application programmer has to supply hints about what to save, when.  There are deep, complicated reasons for this, and those reasons would not change for any KVP or graphical approach.

I'm somewhat surprised that no one suggested a mapreduce approach yet, but that would be so radically different from today's world that it's hard to envision.

-- Linas
 

Joel Pitt

Jul 16, 2013, 5:32:59 PM
to opencog, Ben Goertzel
MapReduce is only really useful for narrow aspects of OpenCog's
functionality. Forgetting could be done trivially that way, but
anything requiring traversal of atoms kind of breaks the paradigm...

It's not impossible though. In particular, Riak has a modification of
M/R that allows other stages in the M/R pipeline... such as link
traversal: http://docs.basho.com/riak/latest/references/appendices/concepts/Links/#Link-walking

The map reduce development tools were also somewhat lacking and
annoying to debug when I used them... although that's more a complaint
about the various nosql databases that provide this. There's no
inherent reason it needs to be annoying.

There's also not a lot of benefit gained UNLESS the atomspace is
already distributed.

On the other hand, general map and reduce functions in the OpenCog
codebase would be useful. I vaguely recall these already exist though,
at least in the Scheme bindings.

Linas Vepstas

Jul 16, 2013, 5:38:00 PM
to opencog, Ben Goertzel
Hi Joel,

Linas Vepstas

Jul 16, 2013, 6:01:00 PM
to opencog, Ben Goertzel
Hi Joel,
Lets try again ...

On 16 July 2013 16:32, Joel Pitt <joel...@gmail.com> wrote:
MapReduce ... but

anything requiring traversal of atoms kind of breaks the paradigm...

Yeah, but I vaguely wonder: if there was some way to quasi-predict everything that might be traversed, and fetch that, and then do the traversal....

if there was some indicator that this large group of atoms are all 'near' each other, and thus should be fetched together.

So: batch things: fetch 100K atoms at a time, with 99% probability that a traversal would not leave the boundaries of that batch.  This could be very economical:
-- 100K atoms is only 10MBytes; disks and networks and OS'es are pretty efficient when dealing with chunks that size.  Heck, some modern-day CPU 3rd-level caches are about 10MBytes in size (well, OK, Xeon-class), so maybe the traversal could be done without leaving the cache.

Viz, a single Map step would materialize 100K atoms into RAM.

Not at all clear to me, but something like this might actually run pretty fast.  Or really slow if done wrong. Hard to say.
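As a sketch of what I mean -- fetch_neighborhood and store_batch are hypothetical bulk operations, nothing we actually have:

#include <cstdint>
#include <unordered_map>
#include <vector>

struct Atom { uint64_t handle; std::vector<uint64_t> outgoing; };
using Batch = std::unordered_map<uint64_t, Atom>;  // the traversal's little universe

Batch fetch_neighborhood(uint64_t seed, size_t max_atoms) {
    return Batch{{seed, {seed, {}}}};              // stub standing in for one bulk read
}
void store_batch(const Batch&) { /* stub: one bulk write-back */ }

void traverse_locally(uint64_t seed) {
    Batch batch = fetch_neighborhood(seed, 100000);   // ~10MB in one gulp
    for (auto& kv : batch)
        for (uint64_t tgt : kv.second.outgoing)
            if (!batch.count(tgt)) { /* the 1% case: traversal left the batch */ }
    store_batch(batch);                               // the write-back step
}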


On the other hand, general map and reduce functions in the OpenCog
codebase would be useful. I vaguely recall these already exist though,
at least in the Scheme bindings

Surely python has these too. Don't be fooled: they are only map-reduce in the abstract comp-sci sense, not in the data+storage-management sense. 

When I say MapReduce here, what I'm really thinking is "some system that will fetch large blocks of data from some distributed store, allow me to perform work on those blocks, and then write something back".  For opencog, I know the data is "a bunch of atoms" and the work to be performed is "traverse them".  By "traverse them" I mean "run PLN on them" or "run pattern matching on them" or run some other agent on them.  Let's pretend we can throw away the current atomspace API completely.  Let's pretend each traversal agent was single-threaded, and never needed to touch any atoms outside of its little universe that were just fetched.  (Or maybe something like the microsoft facets paper that Charles Beyer posted earlier on this thread). How could all of this be made to work?

--linas



David Hart

Jul 16, 2013, 11:52:14 PM
to ope...@googlegroups.com
On 17/07/13 06:54, Linas Vepstas wrote:
 
and a new TelepathyAgent that uses remote connections to
atomspaces and postgres instances.

?? The current backend connects just fine to remote instances, no agent needed.  If one agent wants something in some remote atomspace, it just asks the backend about it. **(see footnote)

Is it possible for pattern matching to transparently span multiple nodes if requested, or do the semantics need to switch to perform remote queries/fetches? In the TelepathyAgent case, I can imagine a multi-node pattern match fanning out between nodes, where one TelepathyAgent asks another to perform the local pattern match of atoms in-RAM and in local storage.


With that said, reducing explicit memory management (i.e. reducing
complexity & overheads of Forgetting)

Forgetting seems to be an integral part of inferencing; attention values are an explicit part of the atomspace architecture.  You can't get rid of explicit memory management until you get rid of explicit attention values.

I'm only suggesting reducing it, not eliminating it. Joel's approach of marking atoms as stale, then batch-deleting them in a compaction operation sounds very sensible. I'm suggesting that for the case of "we're working near real-time on a big subgraph and we're out of memory, so let's quickly swap some atoms outside of our sub-graph to storage" using some fast paging algorithm at the persistence layer to do that, rather than some on-the-fly but expensive ECAN determinations by a MindAgent.


note, here I'm thinking of 'storage' as PCI-attached flash that completely bypasses all block and filesystem semantics.

Well, you can't, not without inventing a new kernel extension that would get laughed off of the linux kernel mailing list.

Well, it's already been done. http://www.theregister.co.uk/2012/04/19/fusion_io_sdk/

I dunno if Fusion-io's code in particular will ever be proposed for merging, but with *every* block storage vendor today making PCI-attached flash devices, some industry consensus and open source code is likely coming down the pipe. Linux will adapt.


To put it more kindly, the overhead of using block and filesystem semantics in your database is maybe at most 0.2% of performance, and probably closer to 0.0001%  so you gain nothing by trying to avoid it.

I think the database and Btrfs developers developing to Fusion-io APIs would disagree. Sure it won't be a consideration for us until after the other AtomSpace design bottlenecks are addressed, but it's at least worth using as a factor when thinking up new designs and choosing new tools & libraries.

Stretching a bit
further into optimization speculation, it should be possible to flag
data fetches for these ephemeral pattern matches as do-not-cache,
keeping cache lines usable for other purposes.

What other purposes are there?  What could we possibly be storing in RAM that are not atoms?

I'm thinking here of parallel operations, where one operation is working on a graph that's just RAM-sized, with chunks or sub-graphs just fitting into CPU caches, and the other is doing some infrequent background maintenance task on stored atoms (e.g. importance updating). The background task has no need to populate or keep data around in CPU caches.

The indexes themselves, if kept as memory
mapped structures, could be huge and swapped out to storage when not in
use

Maybe. I imagine that indexes take up 1% or maybe 10% at most 20% of what the raw data takes.  It's a question I'm willing to punt to the future.

Sure. When it comes to squeezing the most out of some time-rented cloud compute nodes for a MMOG application, 20% of RAM can be a big deal!

-dave

boris

Jul 16, 2013, 11:55:01 PM
to ope...@googlegroups.com, David Hart, linasv...@gmail.com


On Monday, July 15, 2013 7:58:14 PM UTC-4, linas wrote:
Hi Boris,

Some casual thinking-aloud below.

On 15 July 2013 17:38, Borislav Iordanov <borislav...@gmail.com> wrote:

A native HyperGraphDB storage could follow the idea of identifiers
(i.e. handles in OpenCog hypergraph lingo) being disk offsets

Hmm. Interesting idea. But offsets are kind-of-like pointers, and all of the comp-sci issues with pointers show up with offsets as well, right?   So either you have to alloc/free the offset, and make sure no one is using it when you free it, or you use reference counting, or you garbage collect unused disk offsets.

Yes, that's correct. Neo4J for example maintains a list of free identifiers to use. When somebody deletes an entity, they add that offset to the list.  The trick is then to have the main data entry for an entity be of fixed size, so freeing and reusing identifiers/pointers is easy.
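In sketch form (illustrative only):

#include <cstdint>
#include <vector>

struct OffsetAllocator {
    static constexpr uint64_t RECORD = 64;  // fixed record size makes reuse trivial
    uint64_t file_end = 0;                  // next never-used slot
    std::vector<uint64_t> free_list;        // offsets of deleted records

    uint64_t alloc() {
        if (!free_list.empty()) {           // reuse a hole before growing the file
            uint64_t off = free_list.back();
            free_list.pop_back();
            return off;
        }
        uint64_t off = file_end;
        file_end += RECORD;
        return off;
    }
    void release(uint64_t off) { free_list.push_back(off); }  // "delete" an entity
};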
 
By contrast, ID's that are not offsets have some but not all of these problems. So, for example, you can't free the storage associated with an ID until you are sure no one is using it.  But you are not under any pressure to re-use an ID, since you don't waste any space if an ID is never used again.

Finding out if an ID is used is "easy"  for a user of a relational DB: you just ask it "is it being used?" and you get an answer.  It's someone else's problem (viz the DB author's) to make this actually happen.  But if we go native, then we have to solve this alloc/free/refcnt/gc mess ourselves. 

I think the problem is not so hard to solve in the sense that the resulting system works - what's hard is making a scalable, distributed system out of it. Yes, you end up having to solve the same problems as RAM memory management, but then you can apply similar known techniques to solve those problems. It has been done before many times. Graph databases have been around for a while. It's funny how a lot of people buying into the NoSQL hype think they are a new invention.

But the key point here is that using a KV store is overkill for those identifiers. Be it graph node/edge ids or hypergraphdb handles or OODB oids, the point is that they are all small, fixed-size, unique things, and the database system is the one creating them, not the user. They are not arbitrary by any means, so there's no need to use a general map (b-tree or hash or whatever) data structure for them.
 
I noticed long ago that there are people out there who really, really want to write their own OS kernel, damn the rest of the world.  Then there are people who roll their own compilers and languages.  I suppose there are people who really really want to roll their own DB (are you one such?).  We

I'm more the compiler/language type actually :) Seriously, I would enjoy writing the low-level storage from scratch, I actually started one, but there are always other priorities that come up, and I have very little time to dedicate to HyperGraphDB.  Incidentally, I've talked to HyperGraphDB users about it (there are people who actually use it in very data-intensive projects for big commercial clients) and the feedback I get is that BerkeleyDB is good enough, but there are other things that can be improved first. Things such as framework improvements, or query parallelization. Much of the effort that went into HGDB and much of the value is that it's really a framework to create various data models, and even much of the way storage is managed is exposed through interfaces. For example, I was able to relatively quickly create a "plugin" that customizes how incidence sets are represented: to store the link type alongside its handle in an incidence set, so that you don't have to load the link atom from disk in order to know its type. This sort of thing is fun. I don't know if you call it rolling your own DB, since I've delegated all of the low-level headaches to BerkeleyDB, but it does have to do with organizing data in a way that's efficient to store and access.

need to find one of these people, and get them to make one for opencog.  Maybe what we really need to do is to get on DB discussion forums and start pleading and advertising and begging and hyping and trying to capture someone's imagination.  How to do this?

Well, I guess that's how I got hooked into doing HyperGraphDB, thanks to David. Except I made the mistake of starting it in Java, thinking it would be just a quick way to learn more about the ideas and that I would not go too far before switching to C++. Then a couple of years ago, I almost jumped into a C++ rewrite for Opencog, but realized that this would (well, should, at least) involve a redesign of Opencog's atomspace, so it would have been too intrusive and it would have met a lot of resistance from the core Opencog developers. So I backed off. If you remember, we had a discussion where I was arguing that attention allocation was analogous to garbage collection and that handles should be implemented as smart pointers etc. Anyway, I'm only bringing this up because I think the decision of how much long-term storage management and storage concerns have to be part of Opencog internals is also important in deciding whether you want a dedicated db implementation, or what SQL or NoSQL you want to use. For example, where are the transaction boundaries? In general, how transparent are db operations to Opencog API users? Or, how much storage and/or data distribution algorithms would be driven, or modulated to some extent, by higher-level processes within Opencog?

 
and, to
solve the distribution issue, it would split an ID into two parts: a
long identifying the origin (the DB instance where the atom was
created) and a long which is a local disk offset.

The current opencog sql persistence backend solves this in a very simple, but I think clever way. Basically, each instance requests a block of ID's, which it can issue however it wants. When these run out, it just asks for more.  There are no ID collisions, and I don't have to issue them in blocks of 2^32 or 2^64, but can issue tiny blocks of 1M each.  Issuing is trivial: increment some counter by += 1M (or whatever). If one instance needs more than 2^64 ID's, that's also not a problem. And since ID's never have to be returned or freed, that's it. Real easy.
 
Each node in a HGDB
cluster would then have a separate file for each peer it replicates
atoms for. So all atoms originating at a given peer instance will be
stored in a dedicated file with the local offsets somehow mapped from
one peer to another (so that the file size doesn't have to
replicated).

Replica management seems like it can become very non-trivial.  So again, I'm wondering: can we convince some DB geek(s) out there to invent a brand new hypergraph DB?  
 
Now, I am utterly clueless as to how to think about your HGDB. How hard would it be to map opencog atoms into it? 

Trivial, almost immediate.
 
How hard would it be to do this from C++? 

Easy, but with a big overhead if you want the DB to be in-process. Otherwise, it would involve writing a REST interface that's specific for Opencog atoms. The main thing is deciding on the "wire" format for atoms, and since I believe you have a Scheme-like representation already in use, essentially the work would be to parse that in Java and convert to however Opencog atoms are represented in HGDB.
 
What might the performance be?

Depends on what. As I said, traversal won't be as quick as in a graph database, because they use direct disk references as IDs. Until I implement the same thing in HGDB, I can't beat that. On the other hand, finding sets of atoms meeting certain conditions should be reasonably fast.

Best,
Boris

boris

Jul 17, 2013, 12:02:52 AM
to ope...@googlegroups.com, David Hart
There is also the edges-pointing-to-other-edges limitation that classical graph dbs have. That makes it hard to represent binary links directly as classical graph edges, because then you can't point to them. I guess you could have some sort of hybrid representation of hypergraphs into a classical graph, where only certain links need special treatment. Perhaps the best way is to actually try it: get some data into a graph db and implement one Opencog algorithm that works directly with the db instead of the in-memory Atomspace. Could be quite informative for devising the larger architecture for long-term storage of atoms.

Linas Vepstas

Jul 17, 2013, 3:02:19 PM
to boris, opencog, David Hart
Hi Boris,


On 16 July 2013 22:55, boris <borislav...@gmail.com> wrote:

Well, I guess that's how I got hooked into doing HyperGraphDB, thanks to David. Except I made the mistake of starting it in Java, thinking it would be just a quick way to learn more about the ideas and that I would not go too far before switching to C++. Then a couple of years ago, I almost jumped into a C++ rewrite for Opencog, but realized that this would (well, should, at least) involve a redesign of Opencog's atomspace, so it would have been too intrusive and it would have met a lot of resistance from the core Opencog developers. So I backed off.

Sorry. You're right the atomspace needs a re-design.
 
If you remember, we had a discussion where I was arguing that attention allocation was analogous to garbage collection

Yes it is.
 
and that handles should be implemented as smart pointers etc.

I read somewhere that smart pointers and garbage collection are "dual" in some mathematical sense.  Which makes me feel better I suppose, but don't know what practical effect that has.
 
 Anyway, I'm only bringing this up because I think the decision of how much long-term storage management and storage concerns have to be part of Opencog internals is also important in deciding whether you want a dedicated db implementation, or what SQL or NoSQL you want to use.

Per other email, I decided that the issue is moot until performance improves.  The DB is not the bottleneck...

For example, where are the transaction boundaries? In general, how transparent are db operations to Opencog API users? Or, how much storage and/or data distribution algorithms would be driven, or modulated to some extent, by higher-level processes within Opencog?

No idea. Don't know how to even begin answering these questions, without some muddy trial-n-error.

You mentioned "parallel query" in HGDB.  What sort of queries do you support?  Any enlightening comments/lessons?

I keep thinking of the pattern matcher as the generic query for the atomspace, but, as is obvious from the other email thread, for simple queries it is overkill and too hard to set up and use.  I've no vision of what a limited but simple-to-use query interface would be ...

--linas

Borislav Iordanov

unread,
Jul 18, 2013, 1:58:09 AM7/18/13
to Linas Vepstas, opencog, David Hart
Hi,

> You mentioned "parallel query" in HGDB. What sort of queries do you
> support? Any enlightening comments/lessons?

Queries are set-oriented: given an expression with some criteria, find
all atoms that match the criteria. The criteria lets you specify
constraints on:

- the type of an atom, e.g. it's exactly of type T, or it's of type T or
any of its sub-types
- the value of an atom, e.g. "equal to x" or "having a property p equal to y"
- the way it's linked to other atoms, e.g. "it's a link pointing to X
and Y" or it's the target of link L at position n

And you can combine those with the logical and, or, not operators. For
example, you can query for all links of type MyType, pointing to atom
A at position 2, and having some property called "weight" that's
greater than 0.5, like this ('hg' is the namespace used for query
factory methods):

graph.findAll(hg.and(hg.type(MyType), hg.incidentAt(2), hg.gt("weight", 0.5)));

If you have an index on the property "weight" for the type "MyType",
it will be used. You can also create indices on specific link targets.
That's analogous to indexing on a specific column in a relational DB.
And in general, when I explain HGDB to people, I tend to say that it's
not really a graph db, but it's kind of in-between a graph and a
relational db.

There are some others, like querying for members of a subgraph (not so
long ago I implemented the notion of a subgraph). There is also an API to do
traversals of course, and there are implementations of traversals that
can be used in the set-oriented queries. For instance, to find all
atoms of type Foo that are connected to atom A, you could do:

graph.findAll(hg.and(hg.bfs(A), hg.type(Foo)));

So that's about it, it's pretty limited. Those query expressions go
through a "compilation" phase to construct a query plan. This could be
relatively time consuming, so it's possible to "pre-compile" queries
and parameterize with variables like this:

HGQuery q = HGQuery.make(HGHandle.class, graph).compile(
    hg.and(hg.type(MyType),
           hg.incidentAt(2),
           hg.gt("weight", hg.var("weight"))));

// set desired value of weight and execute
q.var("weight", 0.7).execute();

The one parallelization that I've done so far is on logical
disjunctions. Often there are queries of the form:

and(link(x,y), or(type(Foo), type(Bar)))

That is..give me all links that point to both x and y and that are
either of type Foo or type Bar. Before execution, this becomes:

or(and(type(Foo), incident(x), incident(y)), and(type(Bar),
incident(x), incident(y)))

Those are two independent intersections that can be done separately
and then the union constructed. There are other opportunities of
course, but I haven't explored them yet.
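The execution side of that rewrite is easy to picture. A sketch with stand-in types (this is not HGDB's actual executor): each AND-branch computes its result set independently, and the OR is just the union.

#include <cstdint>
#include <functional>
#include <future>
#include <set>
#include <vector>

using AtomId = uint64_t;

// Each conjunct, e.g. and(type(Foo), incident(x), incident(y)), is
// assumed to evaluate to a set of atom IDs.
using Conjunct = std::function<std::set<AtomId>()>;

// Evaluate the AND-branches of a DNF query in parallel, then union them.
std::set<AtomId> runDisjunction(const std::vector<Conjunct>& branches)
{
    std::vector<std::future<std::set<AtomId>>> parts;
    for (const Conjunct& c : branches)
        parts.push_back(std::async(std::launch::async, c));

    std::set<AtomId> result;
    for (auto& p : parts) {
        std::set<AtomId> s = p.get();
        result.insert(s.begin(), s.end());
    }
    return result;
}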

If Opencog doesn't have such a simple API, I'm sure it won't be that
hard to create and it will probably help people a lot.

Best,
Boris

Ramin Barati

unread,
Jul 19, 2013, 3:29:00 AM7/19/13
to ope...@googlegroups.com, Linas Vepstas, David Hart
Hi Boris,

If Opencog doesn't have such a simple API, I'm sure it won't be that
hard to create and it will probably help people a lot.

Yes, that would make life a lot easier (for me at least).

BTW, is it possible to index nodes? I mean something like the direction signs in a city: for example, landmark X is a tourist attraction, so there are signs in the streets guiding people to it. IMO this would make graph traversals a lot faster.

Regards
Ramin

Ben Goertzel

unread,
Jul 19, 2013, 3:57:26 AM7/19/13
to ope...@googlegroups.com, Linas Vepstas, David Hart
> Yes, that would make life a lot easier (for me at least).
>
> BTW, is it possible to index nodes? I mean something like the direction signs in a
> city: for example, landmark X is a tourist attraction, so there are
> signs in the streets guiding people to it. IMO this would make graph
> traversals a lot faster.

But in that metaphor, we have a teleporter, which gives us the ability
to hop directly to any given location based on its GPS coordinates....

So, given this, we don't really need street signs, all we need are
signs with (landmark name, GPS coordinate) pairs, placed where people
potentially interested in the landmark will see them...

;)
ben

Ramin Barati

unread,
Jul 19, 2013, 4:12:47 AM7/19/13
to ope...@googlegroups.com, Linas Vepstas, David Hart
But in that metaphor, we have a teleporter, which gives us the ability
to hop directly to any given location based on its GPS coordinates....

So, given this, we don't really need street signs, all we need are
signs with (landmark name, GPS coordinate) pairs,  placed where people
potentially interested in the landmark will see them...

;)

Hm, hadn't thought of it that way...

Joel Pitt

unread,
Jul 19, 2013, 8:13:30 PM7/19/13
to opencog, Ben Goertzel
On 17 July 2013 10:01, Linas Vepstas <linasv...@gmail.com> wrote:
> Yeah, but I vaguely wonder: if there was some way to quasi-predict everything
> that might be traversed, and fetch that, and then do the traversal....
>
> if there was some indicator, that this large group of atoms are all 'near'
> each other, and thus should be fetched together.
>
> So: batch things: fetch 100K atoms at a time, with 99% probability that a
> traversal would not leave the boundaries of that batch. This could be very
> economical:
> -- 100K atoms is only 10MBytes; disks and networks and OS'es are pretty
> efficient when dealing with chunks that size. Heck, some modern-day CPU
> 3rd-level caches are about 10MBytes in size (well, OK, Xeon-class), so maybe
> the traversal could be done without leaving the cache.
>
> Viz, a single Map step would materialize 100K atoms into RAM.
>
> Not at all clear to me, but something like this might actually run pretty
> fast. Or really slow if done wrong. Hard to say.

That's a neat idea: making the "map" stage work on a higher level than
individual elements.

I also like it because it ties into my discussion of "chunks" in my
current project.

Many existing NoSql databases are key-value based, but the value is
often an entire document (often in JSON or some binary equivalent), so
they already use a higher level than individual values.

>> On the other hand, general map and reduce functions in the OpenCog
>> codebase would be useful. I vaguely recall these already exist though,
>> at least in the Scheme bindings
>
> Surely python has these too. Don't be fooled: they are only map-reduce in
> the abstract comp-sci sense, not in the data+storage-management sense.

Python most definitely does!

The latter data+storage-management sense you mention is essentially
just the same principles as the abstract one, applied. There's job
synchronization and redundancy handling, along with the idea of
needing a "re-reduce" mechanism, but in terms of what you can
practically do with map-reduce it's essentially the same as using
these functional methods.

> When I say MapReduce here, what I'm really thinking is "some system that
> will fetch large blocks of data from some distributed store, allow me to
> perform work on those blocks, and then write something back". For opencog,
> I know the data is "a bunch of atoms" and the work to be performed is
> "traverse them". By "traverse them" I mean "run PLN on them" or "run
> pattern matching on them" or run some other agent on them. Lets pretend we
> can throw away the current atomspace API completely. Lets pretend each
> traversal agent was single-threaded, and never needed to touch any atoms
> outside of its little universe that were just fetched. (Or maybe something
> like the microsoft facets paper that Chris Beyer posted earlier on this
> thread). How could all of this be made to work?

I think all those scenarios are worth considering, at the minimum:

- query using something like the pattern matcher, return and aggregate results
- apply a graph transformation rule (match and create/delete/modify)
- some kind of traversal, using localised variables to save state
(e.g. to record path length/cost)

There are also a number of graph query languages, e.g. Cypher

http://docs.neo4j.org/chunked/stable/cypher-introduction.html

There's also Gremlin and a number of other interesting approaches.
It's been on my list to one day review all of these for their
applicability to hypergraphs and Atomspace-like structures.

J

boris

unread,
Jul 20, 2013, 10:39:50 PM7/20/13
to ope...@googlegroups.com, Linas Vepstas, David Hart
Hi,

I'm sorry, I'm not sure I understand the type of indexing you are talking about. 

During traversal, at each step you have a bunch of choices for what the next atom could be. So you want some sort of index that helps you make that choice given a known end-goal? If I'm understanding correctly, that's an interesting problem...

Boris

Ramin Barati

unread,
Jul 21, 2013, 3:34:12 AM7/21/13
to ope...@googlegroups.com, Linas Vepstas, David Hart
I'm sorry, I'm not sure I understand the type of indexing you are talking about. 

During traversal, at each step you have a bunch of choices what the next atom would be. So you want some sort of index that helps you make that choice given a known end-goal? If I'm understanding correctly, that's an interesting problem...

Suppose the following is a sub-graph of a much bigger graph:

a --> b --> c --> d

Now the question we want to answer is: "Is there a relation between d and a?". The question is analogous to: "Is there a path between d and a?". We will need a heuristic to find the path itself, but finding the actual path is expensive. Can an index be made in a way that will:
  -- tell us if there is a path between d and a (not the path itself)
or
  -- make our heuristic better and faster, so it's more feasible for trial and error
?
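To make the first option concrete: if "related" is treated as undirected, one cheap structure that answers the yes/no question is a union-find over connected components, updated as links are added. This is just a sketch with made-up types, not an existing atomspace feature:

#include <cstdint>
#include <unordered_map>

using AtomId = uint64_t;

// Incrementally maintained connectivity index: near-constant-time
// "is there a path?" answers, without producing the path itself.
class ConnectivityIndex
{
    std::unordered_map<AtomId, AtomId> parent_;

    AtomId find(AtomId x)
    {
        auto it = parent_.find(x);
        if (it == parent_.end()) { parent_[x] = x; return x; }
        if (it->second != x)
            it->second = find(it->second);   // path compression
        return it->second;
    }
public:
    // Call this whenever a link connecting a and b is added.
    void addLink(AtomId a, AtomId b)
    {
        AtomId ra = find(a), rb = find(b);
        if (ra != rb) parent_[ra] = rb;      // merge the two components
    }
    // "Is there a path between d and a?" -- yes/no only.
    bool connected(AtomId a, AtomId b) { return find(a) == find(b); }
};

It can't handle link deletion or directionality, so it only covers the weakest form of the question, but it shows the flavor of index I mean.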

bfrs

unread,
Jul 22, 2013, 6:35:52 AM7/22/13
to ope...@googlegroups.com, Ben Goertzel, linasv...@gmail.com
Hi Linas,

On Tuesday, July 16, 2013 10:01:00 PM UTC, linas wrote:
Hi Joel,

Surely python has these too. Don't be fooled: they are only map-reduce in the abstract comp-sci sense, not in the data+storage-management sense. 

When I say MapReduce here, what I'm really thinking is "some system that will fetch large blocks of data from some distributed store, allow me to perform work on those blocks, and then write something back".  For opencog, I know the data is "a bunch of atoms" and the work to be performed is "traverse them".  By "traverse them" I mean "run PLN on them" or "run pattern matching on them" or run some other agent on them.  Lets pretend we can throw away the current atomspace API completely.  Lets pretend each traversal agent was single-threaded, and never needed to touch any atoms outside of its little universe that were just fetched.  (Or maybe something like the microsoft facets paper that Chris Beyer posted earlier on this thread). How could all of this be made to work?

--linas

Is the Bulk Synchronous Parallel model used in Google's Pregel graph processing system what you are looking for?

Linas Vepstas

unread,
Jul 22, 2013, 1:29:08 PM7/22/13
to Borislav Iordanov, opencog, David Hart
Hi,

Thanks, this is a very enlightening response.  I add some notes for Ramin, below.


On 18 July 2013 00:58, Borislav Iordanov <borislav...@gmail.com> wrote:

> You mentioned "parallel query" in HGDB.  What sort of queries do you
> support?  Any enlightening comments/lessons?

Queries are set-oriented: given an expression with some criteria, find
all atoms that match the criteria. The criteria lets you specify
constraints on:

- the type of an atom, e.g. it's exactly of type T, or it's of type T or
any of its sub-types
- the value of an atom, e.g. "equal to x" or "having a property p equal to y"
- the way it's linked to other atoms, e.g. "it's a link pointing to X
and Y" or it's the target of link L at position n

Ramin,

The atomspace does have C++ interfaces for these: see atomspace.h, lines 571 to 1048 and lines 1125 to 1193.  They should be relatively fast, as most/all of these routines are backed by an index, so that the query doesn't dig deeper than it needs to.

Some of these are exposed in the scheme API. Not sure about python. 

And you can combine those with the logical and, or, not operators.

The atomspace does not have this, not in the C++ API.  It does have this, in the native API: this is the pattern matcher.
 
graph.findAll(hg.and(hg.type(MyType), hg.incidentAt(2), hg.gt("weight", 0.5)));

Boris: so we don't do this in C++, but we can build a hypergraph that encodes the above.  So all complex queries are hypergraphs, and that's why I call it a "pattern matcher" instead of a "query engine": we're matching hypergraphs to hypergraphs to perform a query.
 
can be used in the set-oriented queries. For instance, to find all
atoms of type Foo that are connected to atom A, you could do:

graph.findAll(hg.and(hg.bfs(A), hg.type(Foo)));

Yeah, this is where things get weird.  Writing a query like the above is trivial in scheme; difficult and dicey in C++, at least, not without practice.  The flip-side is that calling scheme will be a fair bit slower than C++.  Oh well.

Those query expressions go
through a "compilation" phase to construct a query plan.

What do you compile them into? Oh, wait, see below?

The one parallelization that I've done so far is on logical
disjunctions. Often there are queries of the form:

and(link(x,y), or(type(Foo), type(Bar)))

That is..give me all links that point to both x and y and that are
either of type Foo or type Bar. Before execution, this becomes:

or(and(type(Foo), incident(x), incident(y)), and(type(Bar),
incident(x), incident(y)))

I guess that's the compile step... not entirely clear to me what this accomplishes, but ... OK, save that for a rainy day.

If Opencog doesn't have such a simple API, I'm sure it won't be that
hard to create and it will probably help people a lot.

Yeah, but then it's the language wars: C++ vs python vs native. And the atomspace has other problems right now, such as terrible performance.

I'm very much in the mind-set of wanting to do everything in "native" mode: express everything in hypergraphs.  Unfortunately, de facto, a lot of coding must still take place in C++ or another traditional language ...

-- Linas

Linas Vepstas

unread,
Jul 22, 2013, 1:45:49 PM7/22/13
to Ramin Barati, opencog, David Hart
Hi Ramin,

On 21 July 2013 02:34, Ramin Barati <rek...@gmail.com> wrote:
I'm sorry, I'm not sure I understand the type of indexing you are talking about. 

During traversal, at each step you have a bunch of choices what the next atom would be. So you want some sort of index that helps you make that choice given a known end-goal? If I'm understanding correctly, that's an interesting problem...

Suppose the following is a sub-graph of a much bigger graph:

a --> b --> c --> d

Now the question we want to answer is: "Is there a relation between d and a?". The question is analogous to: "Is there a path between d and a?". We will need a heuristic to find the path itself, but finding the actual path is expensive. Can an index made in a way that will:
  -- tell us if there is a path between d and a (not the path itself)

Well, yes. 

Look, in the other discussion thread, given what you are doing, I was saying that you should just brute-force search the above: it's just three nested for-loops in C++, maybe a few dozen lines of code; it's just not hard to code up.
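Roughly this, with an assumed adjacency map standing in for the real atomspace calls (a sketch, not actual opencog code):

#include <cstdint>
#include <unordered_map>
#include <vector>

using AtomId = uint64_t;
using Graph = std::unordered_map<AtomId, std::vector<AtomId>>;

// Brute-force check for a chain a -> b -> c -> d: three nested loops
// over successors, nothing clever.
bool hasChain(const Graph& g, AtomId a, AtomId d)
{
    static const std::vector<AtomId> none;
    auto succ = [&g](AtomId x) -> const std::vector<AtomId>& {
        auto it = g.find(x);
        return it == g.end() ? none : it->second;
    };
    for (AtomId b : succ(a))
        for (AtomId c : succ(b))
            for (AtomId last : succ(c))
                if (last == d) return true;
    return false;
}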

However, if you are certain that you will be needing to do this a lot, with rather specific and fixed a and d, then we could create a "user-defined index" for this.

I've long, long had a to-do list of "user-defined indexes for the atomspace". Basically, you'd specify a query of some kind and give it to the atomspace.  Then, every time an atom was added to the atomspace, the query would be run, to see if it's satisfied. If it is, that particular result is added to the index.

Later on, when you actually want all a->b->c->d you just iterate over the index, so it becomes fast.

There is a price for this: every time an atom is added, the query (or a portion of it) is run.  This could be a very hefty cost, if lots of atoms are being added and removed, and yet the result a->b->c->d is required only infrequently.

Even if I never get around to finishing a user-defined index, you can kind-of cobble one together yourself: there is a signal that is emitted for every atom insertion/deletion.  You could catch that signal, decide if the atom is interesting to you (e.g. because it completes the above chain) and then memorize it.
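Something along these lines; the Handle type and the signal signature here are stand-ins, not the actual atomspace declarations:

#include <boost/signals2.hpp>
#include <functional>
#include <vector>

using Handle = unsigned long;
using AddAtomSignal = boost::signals2::signal<void (Handle)>;

// A hand-rolled "user-defined index": pay the predicate cost at
// insertion time, so that later lookups are a plain iteration.
class UserIndex
{
    std::function<bool (Handle)> interesting_;
    std::vector<Handle> matches_;
    boost::signals2::scoped_connection conn_;
public:
    UserIndex(AddAtomSignal& sig, std::function<bool (Handle)> pred)
        : interesting_(pred),
          conn_(sig.connect([this](Handle h) {
              if (interesting_(h)) matches_.push_back(h);
          }))
    {}
    // Everything that ever satisfied the predicate, e.g. completed chains.
    const std::vector<Handle>& matches() const { return matches_; }
};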
 
-- Linas

Linas Vepstas

unread,
Jul 22, 2013, 1:58:05 PM7/22/13
to bfrs, opencog, Ben Goertzel
Hi,


!!
I suppose so, yes!  The abstract seems to be saying the right things. I'll quibble: we have hypergraphs, not graphs, but no matter.

The task would be to write out the pseudo-code for something like PLN running on something like Pregel, and then estimating performance and bottlenecks.  If it looks like it scales correctly, then yes, that would be the answer.  Unfortunately, writing down the pseudocode and estimating its performance is a difficult task... and implementing something is a far, far larger task.  I'm feeling overwhelmed at the moment.

--linas

boris

unread,
Jul 23, 2013, 12:29:58 AM7/23/13
to ope...@googlegroups.com, Linas Vepstas, David Hart
Hi,

Ok, so essentially you'd like to have a connectivity index: every atom would be indexed by all atoms it is connected to.  HGDB itself doesn't have such an index. It allows you to create custom indices with whatever you want in them, so you could do that on your own of course, but my guess is that it would be very expensive to maintain. Perhaps maintaining this sort of information would be better done with some other, dedicated data structure than a conventional db index. Interesting problem...

Boris

boris

unread,
Jul 23, 2013, 1:03:02 AM7/23/13
to ope...@googlegroups.com, Borislav Iordanov, David Hart, linasv...@gmail.com
Hi,

Yeah, I put "compilation" in quotes because it's not really compilation that I'm doing. There's no target language, though the idea of eventually having a virtual machine for queries is not foreign to me. What happens is that a query plan is constructed which is very much like an SQL engine query plan. But what I do is much simpler than SQL query engines, which are able to score different possibilities for a query plan quite accurately and select the best one. What I do is (1) first the query expression goes through an "expand" phase where more complex conditions are translated into collections of simpler ones. Then (2) the whole thing is transformed into a disjunctive normal form (an OR of ANDs). Then (3) it goes trough a "contraction" phase where each AND is simplified to remove redundant conditions, replace conditions with the equivalent use of indices and finally (4) the resulting query expression is converted into an executable query object where order of joins is decided etc. For every primitive condition, there's some associated meta data that helps guide the process. It consists of a size (how big is the set corresponding to that condition), whether it's ordered, whether it can be queried by key, whether it can be evaluated as a predicate. It's not very sophisticated, but it has served well so far. 

Representing query expressions as hypergraphs is a good idea, but in HGDB everything is persisted; there aren't RAM-only atoms, so it would be inefficient to make use of the hypergraph API for something of this sort.

For Opencog, it seems to me that the hard part is having a working query engine. Exposing it via a C++ or Python or whatever API should be easy and mostly irrelevant to how the engine works. If your pattern matcher is powerful enough, and if it can also do simpler queries efficiently, then it should be exposed as an API in C++ and Python. The API could hide the fact that a graph is constructed underneath to represent the query, if that would make it easier for users. Not that designing clean and user-friendly APIs is an easy art, but in this case it seems less of a challenge than writing the query engine itself.

Best,
Boris

Linas Vepstas

unread,
Jul 23, 2013, 2:39:09 PM7/23/13
to boris, opencog, David Hart
On 23 July 2013 00:03, boris <borislav...@gmail.com> wrote:


For Opencog, it seems to me that the hard part is having a working query engine


Hmm. Well, the pattern matcher is a completely general query engine; or a "subgraph isomorphism solver", as Ramin put it.  It's fully recursive, it's got a stack so it can backtrack, etc.

The code is rather old, it predates the newer C++ standards, and changes to the atomspace API mean it now runs much slower than it used to.  I would like to clean it up and modernize it someday, but right now it's not in anyone's critical path.

One of the neater things is that it's configurable, at the C++ level, so that callbacks are used to decide if/when two nodes match, etc., so you can get clever with it.  The default callbacks implement a crisp-logic query.

--linas

Ramin Barati

unread,
Jul 24, 2013, 11:44:46 AM7/24/13
to ope...@googlegroups.com, Borislav Iordanov, David Hart, linasv...@gmail.com
Well, yes.  

Look, in the other discussion thread, given what you are doing, I was saying that you should just brute-force search the above: it's just three nested for-loops in C++, maybe a few dozen lines of code; it's just not hard to code up.

That's what I'm doing currently. I start from a node and traverse up to 2 nodes away.

However, if you are certain that you will be needing to do this a lot, with rather specific and fixed a and d, then we could create a "user-defined index" for this.

I don't think there is an immediate need for these indexes (that would be a welcome change though), but I have a feeling that something like this would come in handy for a back-chaining agent.

I've come up with a list of features that I think would benefit the atomspace's functionality (I don't know whether these features are already available or not, though):
 - Sub Spaces
This is similar to the discussion I had with Linas and Ben a while ago: a mechanism for categorizing the atoms. It's not a "Concept"; it's similar to the user-defined indexes (if not the same). These spaces can be created on the fly: the user defines a condition and an agent fills the space with the appropriate atoms. The AttentionalFocus is a special case of a Sub Space. A feasible mechanism for implementing this, IMO, would be tags: an Atom could have the tag "Rule", so when one wants to fetch all the rules, one would just query all the atoms with this tag.

 - Events
An event is something that someone can hook into, to apply a piece of code based on specific changes that happen to atoms. I wonder whether similar functionality is already present in the atomspace, but as per previous discussions, it would be useful to have a fully functional, well-documented feature ready for OpenCog developers.
This should be something that is available in all the application layers (e.g. python, scm, c++, etc.).

 - Agent's Depositories
It would be nice to have a custom KVP store for agents so they can save the data needed for their processes and share it if needed.

--

Linas Vepstas

unread,
Jul 24, 2013, 11:59:57 AM7/24/13
to Ramin Barati, opencog, Borislav Iordanov, David Hart
Hi,


On 24 July 2013 10:44, Ramin Barati <rek...@gmail.com> wrote:

 - Sub Spaces
This is similar to the discussion I had with Linas and Ben a while ago, a mechanism for categorizing the atoms. It's not a "Concept", it's similar to the user-defined indexes (if not the same). These spaces can be created on the fly. The user defines a condition and an agent fills the space with the appropriate atoms. The AttentionalFocus is a special case of a Sub Space. So for example, a feasible mechanism for implementing this IMO would be tags. An Atom can have the tag "Rule", so when one wants to fetch all the rules, she/he would just query all the atoms with this tag.

More & more, this seems to be a good idea.  However, I think there a set of conflicting requirements about how these would get used and handled; these would need to get hashed out first.

 - Events
An event is something that someone can hook to and apply a piece of code based on the specific changes that happen to atoms. I wonder whether a similar functionality is already present in the atomspace, but as per previous discussions, it would be useful to have a fully functional/well-documented feature ready for OpenCog developers.
This should be something that is available in all the application layers (e.g. python, scm, c++, etc.)

There are addAtomSignal(), removeAtomSignal() and mergeAtomSignal(); there's nothing for changing TV or AV, presumably due to laziness or lack of demand.

 - Agent's Depositories
It would be nice to have a custom KVP store for agents so they can save the data needed for their processes and share it if needed

No, I think that if an agent needs to save or encode something, it should do so as a hypergraph.  So for example,

ListLink
   AnchorNode "MyAgent Special SaveStuff"
   ConceptNode "whoopee!"
   NumberNode 42

etc.

or something like that.  I've been using AnchorNodes as a kind of place to attach stuff so that it can be found later.
 
--linas

Ramin Barati

unread,
Jul 31, 2013, 7:34:58 AM7/31/13
to linasv...@gmail.com, opencog, Borislav Iordanov, David Hart
Hi,

There are addAtomSignal(), removeAtomSignal() and mergeAtomSignal(); there's nothing for changing TV or AV, presumably due to laziness or lack of demand.

Are these interfaces exposed to python/cython or scm?

No, I think that if an agent needs to save or encode something, it should do so as a hypergraph.  So for example,

ListLink
   AnchorNode "MyAgent Special SaveStuff"
   ConceptNode "whoopee!"
   NumberNode 42

etc.

You're right, it would be better to save these things as hypergraphs, but, as far as I know, ListLinks are static. Are there any other alternatives (which are dynamic)?

I've a design in mind for a PLN ForwardInference Agent (FIA) which incrementally does forward chaining as atoms get added to the atomspace. And for that, I need those three features to be present (or simulated) in the atomspace. Here is the algorithm I have in mind:

1. In the event of a link being added, the atoms in its outgoing set are put into a queue; wake the agent up if needed.
2. The FIA does all the possible inferences on the first atom in the queue, adding new links if needed (which in turn, by firing the "link added" event, populates the queue with new atoms).
3. Repeat the procedure until the queue is empty (I don't mean for the FIA to empty the queue in one cog loop).
4. If the queue is emptied, the FIA goes to an "idle" state.
5. If the FIA stays in the "idle" state for too long, put it to sleep.
6. If possible, release the resources that the sleeping agent had used.

Sub Spaces are needed for storing the rules, Events for the events of course, and Agent's Depositories for storing/persisting the queue somewhere, in case something goes wrong, e.g. a power outage. The queue's length isn't static, so the FIA needs a depository that is able to grow and shrink.
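In code, the per-cycle step I'm imagining looks roughly like this (hypothetical types, and the inference step is a stub; the real MindAgent interface differs):

#include <cstdint>
#include <deque>
#include <vector>

using Handle = uint64_t;

class ForwardInferenceAgent
{
    std::deque<Handle> queue_;   // fed by the "link added" event (step 1)
    int idleCycles_;
    bool asleep_;

    void inferOn(Handle /*h*/)
    {
        // Step 2: run all applicable PLN rules on h. Any links this adds
        // fire the event again, and so repopulate the queue.
    }
public:
    ForwardInferenceAgent() : idleCycles_(0), asleep_(false) {}

    void onLinkAdded(const std::vector<Handle>& outgoing)
    {
        for (Handle h : outgoing) queue_.push_back(h);
        idleCycles_ = 0;
        asleep_ = false;                 // wake the agent if needed (step 1)
    }

    void runCycle()                      // called once per cog loop (step 3)
    {
        if (asleep_) return;
        if (queue_.empty()) {
            if (++idleCycles_ > 100) asleep_ = true;   // steps 4-5
            return;                      // step 6 (freeing resources) elided
        }
        Handle h = queue_.front();
        queue_.pop_front();
        inferOn(h);
    }
};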

Regards,
Ramin

Ramin Barati

unread,
Jul 31, 2013, 7:46:14 AM7/31/13
to linasv...@gmail.com, opencog, Borislav Iordanov, David Hart
Oh, I forgot: if the queue gets too big, create more FIAs. That's the whole logic behind putting them to sleep and releasing resources.

Joel Pitt

unread,
Jul 31, 2013, 10:27:04 AM7/31/13
to opencog


On Jul 31, 2013 1:35 PM, "Ramin Barati" <rek...@gmail.com> wrote:
>
> Hi,
>
>> There are  addAtomSignal() and removeAtomSignal() and mergeAtomSignal()  there's nothing for changing TV, AV, presumably due to laziness or lack of demand.
>
>
> Are these interfaces exposed to python/cython or scm?

I did not implement this for python.

J

Linas Vepstas

unread,
Jul 31, 2013, 11:13:49 AM7/31/13
to Ramin Barati, opencog, Borislav Iordanov, David Hart
Hi,

On 31 July 2013 06:34, Ramin Barati <rek...@gmail.com> wrote:
Hi,

There are addAtomSignal(), removeAtomSignal() and mergeAtomSignal(); there's nothing for changing TV or AV, presumably due to laziness or lack of demand.

Are these interfaces exposed to python/cython or scm?
 
No.

No, I think that if an agent needs to save or encode something, it should do so as a hypergraph.  So for example,

ListLink
   AnchorNode "MyAgent Special SaveStuff"
   ConceptNode "whoopee!"
   NumberNode 42

etc.

You're right, it would be better to save these things as hypergraphs, but, as far as I know, ListLinks are static. Are there any other alternatives (which are dynamic)?

The atomspace is "static" in this sense, by design: it makes all objects immutable, (except for tv, av). There are several reasons for this; a big one is for thread-safety, or more generally, multi-processing safety: once you have an object, you have it, it won't change while you're watching.   There's a wikipedia article on this, and its famously done by clojure.

I've a design in mind for a PLN ForwardInference Agent (FIA) which incrementally does forward chaining as atoms get added to atomspace. And for that, I need those three features to be present (or simulated) in atomspace. Here is the algorithm I've in mind:

1. In the event of a link being added, the atoms in its outgoing set are put into a queue; wake the agent up if needed.
2. The FIA does all the possible inferences on the first atom in the queue, adding new links if needed (which in turn, by firing the "link added" event, populates the queue with new atoms).
3. Repeat the procedure until the queue is empty (I don't mean for the FIA to empty the queue in one cog loop).
4. If the queue is emptied, the FIA goes to an "idle" state.
5. If the FIA stays in the "idle" state for too long, put it to sleep.
 
Well, this seems silly; we need to implement a proper blocking read or something similar to take care of agent sleep/wake semantics in general.  Right now, all agents are designed to poll every so often; someday, we need to fix this.

6. If possible, release the resources that the sleeping agent had used.

Sub Spaces are needed for storing the rules,

subspaces are unlikely to get implemented any time soon, due to a fairly large number of issues that would need to be pondered, debated, and implemented.

In the meanwhile, you could use the anchor node!
 
Events for the events of course, and Agent's depositories for storing/persisting the queue somewhere,

Use the anchor node!
 
In case something goes wrong, e.g. a power outage. The queue's length isn't static, so the FIA needs a depository that is able to grow and shrink.

Use the anchor node!
 

Ben Goertzel

unread,
Jul 31, 2013, 12:39:52 PM7/31/13
to ope...@googlegroups.com, linasv...@gmail.com, Borislav Iordanov, David Hart
Hi Ramin,

> I've a design in mind for a PLN ForwardInference Agent (FIA) which
> incrementally does forward chaining as atoms get added to atomspace.

That is one interesting kind of FIA

However, it's not a substitute for an FIA that selects Atoms based on
STI and reasons on them...

In general an FIA that's based on STI is a better idea, because it's
much more general purpose.... As new Atoms come into the Atomspace,
they should be assigned high STI in most cases; so an STI-based FIA
would carry out what your new-Atom-based FIA does anyway...

However, I can see your attraction to playing with an FIA that doesn't
rely on attention allocation, just to simplify your initial task....
So for that reason, I guess it's OK for you to build a MindAgent along
the lines you describe ;)

... ben

Ben Goertzel

unread,
Jul 31, 2013, 12:40:22 PM7/31/13
to ope...@googlegroups.com
But it does make sense to add them to the python interface...



Ben Goertzel

unread,
Jul 31, 2013, 12:42:57 PM7/31/13
to ope...@googlegroups.com, linasv...@gmail.com, Borislav Iordanov, David Hart
> 4. If the queue emptied, the FIA goes to an "idle" state
> 5. If the FIA stays in the "idle" state for too long, put it to sleep.
> 6. If possible, release the resources that the sleep agent had used.

I think you can ignore 4-6 for the current time... it's OK to have the
FIA activated each cognitive cycle. If its queue is empty it will
take very little time to execute. We don't need intensive efficiency
optimization of PLN at this stage, IMO...

ben

Ramin Barati

unread,
Aug 1, 2013, 12:47:16 PM8/1/13
to ope...@googlegroups.com, linasv...@gmail.com, Borislav Iordanov, David Hart
However, I can see your attraction to playing with an FIA that doesn't
rely on attention allocation, just to simplify your initial task....
So for that reason, I guess it's OK for you to build a MindAgent along
the lines you describe ;)

Actually, it makes the initial task harder, I believe. Anyway, I'm not planning on implementing the proposed agent anytime soon, not until the requirements are met at least (the Cython bindings for signals/events are pretty crucial to the above algorithm). Currently I'm working on getting the PLN functional and the atomspace's evolution visualized; I guess we can always improve the agents later.
 
However, it's not a substitute for an FIA that selects Atoms based on
STI and reasons on them...

In general an FIA that's based on STI is a better idea, because it's
much more general purpose....  As new Atoms come into the Atomspace,
they should be assigned high STI in most cases; so an STI-based FIA
would carry out what your new-Atom-based FIA does anyway...

Correct me if I'm wrong, but trying to infer on an atom that is not in the queue would be fruitless. Inferring on an atom, regardless of how much it's in our attention, without having new info/links on the atom is just a waste of time. The queue can easily be kept sorted by the Atoms' STI, making selection proportional to STI, and it doesn't need a "random draw function" to be implemented in the atomspace. Also, incremental inference is cheaper than random draws IMHO. Sorting a list will take O(n*log(n)) time in the worst case, and 'n' is almost independent of the atomspace's size; it's more or less proportional to the rate of discovery. OTOH, choosing a random atom will put more stress on the atomspace's indexing services (fitting to the new STIs), and the agent will spend a significant portion of its time on reasoning that wouldn't result in discovering something new.
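Concretely, the kind of queue I mean is just a max-heap keyed on STI (stand-in types, not atomspace code), so each push/pop is O(log n) and the highest-STI atom is always inferred on first:

#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

using Handle = uint64_t;
using Entry = std::pair<short, Handle>;   // (STI, atom); compared by STI first

int main()
{
    std::priority_queue<Entry> queue;     // max-heap: highest STI on top
    queue.push(Entry(120, 1));
    queue.push(Entry(35, 2));
    queue.push(Entry(300, 3));

    while (!queue.empty()) {
        Entry e = queue.top();            // pops atom 3, then 1, then 2
        queue.pop();
        // ... infer on e.second here ...
        (void) e;
    }
    return 0;
}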

Maybe my understanding of the FIA's job or of the attentional focus is wrong, but I just cannot see a reason not to choose this approach over random draws.

Linas Vepstas

unread,
Aug 1, 2013, 1:25:45 PM8/1/13
to Ramin Barati, opencog, Borislav Iordanov, David Hart
Hi,


On 1 August 2013 11:47, Ramin Barati <rek...@gmail.com> wrote:
However, it's not a substitute for an FIA that selects Atoms based on
STI and reasons on them...

In general an FIA that's based on STI is a better idea, because it's
much more general purpose....  As new Atoms come into the Atomspace,
they should be assigned high STI in most cases; so an STI-based FIA
would carry out what your new-Atom-based FIA does anyway...

Correct me if I'm wrong, but trying to infer on an atom that is not in the queue would be fruitless. Inferring on an atom, regardless of how much it's in our attention, without having new info/links on the atom is just a waste of time. The queue can easily be kept sorted by the Atoms' STI, making selection proportional to STI, and it doesn't need a "random draw function" to be implemented in the atomspace. Also, incremental inference is cheaper than random draws IMHO. Sorting a list will take O(n*log(n)) time in the worst case, and 'n' is almost independent of the atomspace's size; it's more or less proportional to the rate of discovery. OTOH, choosing a random atom will put more stress on the atomspace's indexing services (fitting to the new STIs), and the agent will spend a significant portion of its time on reasoning that wouldn't result in discovering something new.

Maybe my understanding of the FIA's job or of the attentional focus is wrong, but I just cannot see a reason not to choose this approach over random draws.

I'm with Ramin on this one.

There's nothing wrong with attention in general, but we do know for a fact that all new, incoming atoms have, by definition, a high STI.

This implies a pipelined kind of processing: new atoms get picked up and processed.

One can go further: for my question answering code, I very explicitly needed a pipeline.  I invented the anchor node to solve this problem: all new atoms were attached to anchor A.  Then process 1 would process them, attach results to anchor B.  Process 2 would look for new stuff at anchor B, process them, put output at C. Process 3 would look for new stuff at C ...

I just don't see how to do this kind of layering with attention values. The output atoms of each stage would have to have a high STI, but certainly, process 1 would not know what to do with atoms that were output by process 2... and although it could check and ignore those, it would be a waste of time to do so.

I think STI would work best when used with anchors: it would rank all the different crap hanging off of an anchor, so that, instead of randomly fetching something, process N could fetch the highest-priority atom first (or fetch randomly with weighted Boltzmann statistics, whatever).
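The pattern, stripped to its bones (stand-in types; in the real atomspace the "mailbox" is the anchor's incoming set, not a deque):

#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

using Handle = uint64_t;

// Stands in for an AnchorNode: a place where results get attached.
struct Anchor { std::deque<Handle> mail; };

// One pipeline stage: drain the inbox, process each item, and attach
// the results to the next stage's anchor.
void runStage(Anchor& in, Anchor& out,
              const std::function<std::vector<Handle> (Handle)>& process)
{
    while (!in.mail.empty()) {
        Handle h = in.mail.front();
        in.mail.pop_front();
        for (Handle r : process(h))
            out.mail.push_back(r);
    }
}

// Usage: runStage(A, B, stage1); runStage(B, C, stage2); ...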

-- Linas

p.s. in fact, anchors are not actually an ideal solution; there is a better but more complex way to do this.  But I've been waiting for the world to catch up and get to this stage before discussing that.

Ben Goertzel

unread,
Aug 1, 2013, 7:38:44 PM8/1/13
to ope...@googlegroups.com, linasv...@gmail.com, Borislav Iordanov, David Hart
> Correct me if I'm wrong, but trying to infer on an atom that is not in the
> queue would be fruitless. Inferring on an atom, regardless of how much it's
> in our attention, without having new info/links on the atom is just a waste
> of time.

According to my understanding, this queue consists only of Atoms that have been
newly added to the Atomspace (via inference or perception or whatever)

However, it does not contain Atoms that have had their truth values
*changed* by some process (e.g. inference itself)...

So, you say, "trying to infer on an Atom that's not in a queue would
be fruitless", but I think it's not true, because many Atoms may have
had their truth values altered since they were last inferred on, and
this would not cause them to get put in the queue...

If the queue consisted of everything that had had its truth value
changed, along with everything that had been newly created, then I'd
agree with you....

-- Ben

Ben Goertzel

unread,
Aug 1, 2013, 7:41:39 PM8/1/13
to ope...@googlegroups.com, Ramin Barati, Borislav Iordanov, David Hart
In the case of STI ... well, one could create an AnchorNode that
basically linked to everything with STI above the
AttentionalFocusBoundary threshold ... this AnchorNode would be an
index to the AttentionalFocus.... But I'm not sure this is better
than some other sort of index...

ben



Ramin Barati

unread,
Aug 2, 2013, 5:37:41 AM8/2/13
to ope...@googlegroups.com, linasv...@gmail.com, Borislav Iordanov, David Hart
However, it does not contain Atoms that have had their truth values
*changed* by some process (e.g. inference itself)...

If you mean changed by e.g. the revision rule, it would trigger the event. But if there's something that would change the TVs without going through "adding atoms", it won't get in the queue.

If it's a consistency problem, then it's better handled by the atomspace/other agents, not the FIA. For example, suppose we had a->b->c and we inferred a->c; now, if something changed the TV of a->b, the TV of a->c should be updated. If that's the case, I don't think the FIA is the best place for implementing it; this would make the agent too complex IMO.

If the queue consisted of everything that had had its truth value
changed, along with everything that had been newly created, then I'd
agree with you....

We can always introduce new events to the atomspace (e.g. changedTvSignal()), but I don't know if that's the best approach. Both signals may trigger each other in inference steps and may cause an infinite loop.

Ben Goertzel

unread,
Aug 2, 2013, 5:43:26 AM8/2/13
to ope...@googlegroups.com, linasv...@gmail.com, Borislav Iordanov, David Hart
On Fri, Aug 2, 2013 at 5:37 PM, Ramin Barati <rek...@gmail.com> wrote:
>> However, it does not contain Atoms that have had their truth values
>> *changed* by some process (e.g. inference itself)...
>
>
> If you mean changed by e.g. the revision rule, it would trigger the event.
> But if there's sth that would change the TVs without going through "adding
> atoms", it won't get in the queue.

Hmmm... no, I guess it would always be the revision rule...

I didn't realize that the revision rule would trigger an addAtom type
event... if this
is necessarily the case, then I suppose that every inference-relevant change
would indeed be reflected in a new Atom addition event...

ben

Linas Vepstas

unread,
Aug 2, 2013, 7:50:20 PM8/2/13
to opencog, Ramin Barati, Borislav Iordanov, David Hart
The anchor nodes are not designed to replace indexes; they are designed to provide an easy way for different mind-agents to communicate with one another in an asynchronous fashion. You can think of them as "mailboxes"; anyone can drop a "letter" there (link a hypergraph to it), and anyone else can look at the mail, or remove individual messages (by removing the link to the anchor).

Some specific agent might want to look at the "mail" sorted according to STI, but that is up to the agent; there is no explicit or implicit sorting at the anchor.

-- Linas

Linas Vepstas

unread,
Aug 2, 2013, 7:55:36 PM8/2/13
to Ben Goertzel, opencog, Borislav Iordanov, David Hart
Hi,

On 2 August 2013 04:43, Ben Goertzel <b...@goertzel.org> wrote:
On Fri, Aug 2, 2013 at 5:37 PM, Ramin Barati <rek...@gmail.com> wrote:
>> However, it does not contain Atoms that have had their truth values
>> *changed* by some process (e.g. inference itself)...
>
>
> If you mean changed by e.g. the revision rule, it would trigger the event.
> But if there's sth that would change the TVs without going through "adding
> atoms", it won't get in the queue.

Hmmm... no, I guess it would always be the revision rule...

I didn't realize that the revision rule would trigger an addAtom type
event...

Me neither. Why would this be?  The whole atomspace is designed around the idea that adding and removing atoms is expensive/slow, while changing their TV and AV is cheap/fast.

I'm working on a brand-new design that should make everything much, much faster, and most things simpler, but this split between fast/slow doesn't change in the new design.

--linas

Ramin Barati

unread,
Aug 3, 2013, 4:56:10 AM8/3/13
to ope...@googlegroups.com, Ben Goertzel, Borislav Iordanov, David Hart
On 2 August 2013 04:43, Ben Goertzel <b...@goertzel.org> wrote:
On Fri, Aug 2, 2013 at 5:37 PM, Ramin Barati <rek...@gmail.com> wrote:
>> However, it does not contain Atoms that have had their truth values
>> *changed* by some process (e.g. inference itself)...
>
>
> If you mean changed by e.g. the revision rule, it would trigger the event.
> But if there's sth that would change the TVs without going through "adding
> atoms", it won't get in the queue.

Hmmm... no, I guess it would always be the revision rule...

I didn't realize that the revision rule would trigger an addAtom type
event...

Me neither. Why would this be?  The whole atomspace is designed around the idea that adding and removing atoms is expensive/slow, while changing their TV and AV is cheap/fast.

The revision rule gets triggered if and only if an atom gets added to the atomspace, so you can change the atom's TV and do whatever is needed afterwards, e.g. start a process, activate some agents, etc. That's what I meant by going through "adding atoms".

Ben Goertzel

unread,
Aug 3, 2013, 12:45:24 PM8/3/13
to Ramin Barati, ope...@googlegroups.com, Borislav Iordanov, David Hart
> The revision rule get triggered if and only if an atom gets added to
> atomspace, so you can change the atom's TV and do what ever needed after
> e.g. start a process, active some agents, etc. That's what I meant by going
> through "adding atoms".

Suppose

Inheritance A B

and

Inheritance B C

both have their truth values changed by some inference process...
Suppose one of them
has high STI.

Then, deduction may be triggered, leading to the conclusion

Inheritance A C

If this Atom already exists, then it will have its truth value changed
as well...
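
(For concreteness, the strength update deduction would apply here; a sketch of the independence-based first-order formula from the PLN book, where sAB is the strength of Inheritance A B, and sB, sC are the term probabilities of B and C:)

#include <cassert>

// sAC = sAB*sBC + (1 - sAB)*(sC - sB*sBC)/(1 - sB)
double deductionStrength(double sAB, double sBC, double sB, double sC)
{
    assert(sB < 1.0);    // the formula is undefined when P(B) == 1
    return sAB * sBC + (1.0 - sAB) * (sC - sB * sBC) / (1.0 - sB);
}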

My suggestion is that, if this queue-based idea is to be used, we will
need to keep a queue or other structure of Atoms that have recently
gotten their truth value modified...

OTOH, if we introduced confidence decay for a broad class of Atoms,
this would result in a lot of Atoms getting their truth values
modified at each point in time.....

-- Ben

Linas Vepstas

unread,
Aug 3, 2013, 3:04:58 PM8/3/13
to opencog, Ramin Barati, Borislav Iordanov, David Hart
Hi,

Well, the issue here is one of scalability, and I can't help but think of subspaces again.  STI, as currently designed, is global, and thus allows for only one user: PLN.  If there were multiple concurrently running subsystems, say PLN, a vision system, an action-planning system, a 3D location system, and a speech system, and they were all jiggling around STI values, then poor old PLN could very well end up digging out atoms that are completely irrelevant to it, irrelevant to whatever it's supposed to be cogitating about.

Perhaps each of these subsystems is supposed to use a different set of atom types?  That way, STI indexes could keep them all separated. But it's hard to imagine that there wouldn't be leakage at the boundaries of the systems ...

I'm not saying 'queues are better'; rather, I'm pointing out that this is an issue. It does not need to be solved right now.  I suspect that subspaces might be a good solution.  (In the atomspace redesign I've been coding, there are some (other, completely unrelated) technical reasons favoring multiple small subspaces, but this is still rather vague right now; I'm still very much in the experimentation phase. But I thought I'd mention it.)

For now, for Ramin, I suggest only the KISS principle.

--linas

Ben Goertzel

unread,
Aug 3, 2013, 7:18:12 PM8/3/13
to ope...@googlegroups.com, Ramin Barati, Borislav Iordanov, David Hart
Linas,

Here we run into the dual nature of OpenCog: as a toolkit for AI
development, versus as a platform for implementing one particular AGI
design (the one I've called CogPrime and have outlined in various
writings...)

From the standpoint of OpenCog as a toolkit, you're right: There are
many possible uses of OpenCog for which a single STI value is not good
enough, because there might be multiple processes running
simultaneously on the same Atomspace, each wanting/needing their own
way of judging importance. To achieve this, one could have some sort
of subscripted STI and LTI values. Cognitive process set 1 could
boost, and select according to, STI_1; cognitive process set 2 could
boost, and select according to, STI_2 ... etc. .... The
ImportanceSpreading MindAgent could propagate all of these. Various
subtleties ensue, such as whether one builds subscripted HebbianLinks
based on these different subscripted STI values...

From the standpoint of OpenCog as a vehicle for implementing CogPrime
in particular, on the other hand, things are different. The specific
intention within CogPrime is to have multiple cognitive processes
collectively use the *same* STI value, as a means of cooperation. In
other words, what might be a bug with respect to some cognitive
architectures one could (meaningfully and usefully) implement within
OpenCog, is from the CogPrime perspective intended as a feature... ;)

As you intimate at the end of your message, for the present time it's
fine for Ramin to keep working with the simple (non-subscripted) STI
values.... When there is a practical purpose for someone to move to
subscripted STI values, this can be done... and will very likely be
able to make use of any mechanisms or tuning Ramin does with the
single STI values...

-- Ben



Keyvan Mir Mohammad Sadeghi

unread,
Sep 9, 2013, 4:22:07 PM9/9/13
to opencog, Ramin Barati, David Hart, Borislav Iordanov

Hi,

It occurred to me in the midst of another discussion on this list that "tags are a special case of hypergraphs".

So I'm now wondering why we don't have a node in the AtomSpace named "AttentionalFocus" and have atoms linked to it, maybe via MemberLinks?

-- K

--
Keyvan Mir Mohammad Sadeghi
MSc AI

"One has to pay dearly for immortality; one has to die several times while one is still alive." -- Friedrich Nietzsche

Ben Goertzel

unread,
Sep 10, 2013, 1:12:17 AM9/10/13
to opencog, Ramin Barati, David Hart, Borislav Iordanov

the reason that design was not taken is purely a matter of computational efficiency... adding and removing all those MemberLinks, as the system's attention flows, would be really computationally costly... ben g