Re: Atomspace RAM & CPU usage ... was Re: [opencog-dev] Indexing in the AtomSpace

17 views
Skip to first unread message

Linas Vepstas

unread,
Sep 4, 2020, 1:22:58 AM9/4/20
to opencog, Xabush Semrie, Kasim Ebrahim, Amir P, link-grammar
My apologies for the incessant email-bombing. Not sure how else to communicate. Here is version 0.2 of the doc. I think (I hope) that the pages 4-5-6 clearly demonstrate how hypergraphs  offer a RAM-usage performance over ordinary graph stores. If you think you see flawed arguments there, please let me know.  The new pages 8-9 on indexing then explain how indexing on a conventional SQL/noSQL DB just happens to look identical to a hypergraph. Bet no one ever told you that, before!

The key point: for SQL/noSQL DB's, the indexes are always hidden away from the user, and cannot be directly accessed, walked, manipulated; they are internal-use-only by the DB.When you use a graph DB, you get direct access to "indexes" as user-visible and user-controllable objects.   If you use a hypergraph instead of a graph, then you also get a (large) space savings.

To be written: how pattern-matching compares to tinkerpop/gremlin. The theory of recursive algorithms (which is what pattern matching actually is, under the covers). Specifying recursive algorithms with context free grammars. Finally, parsing with context-free grammars. The connectors/sheaves thing is the cornerstone for parsing  with context-free grammars. The "secret sauce" here is (I think) that this can be done with RAM/CPU consumption as good as hypergraph DB's. So the goal of writing this is to spell out in painful detail all these steps.

The source material for this is in https://github.com/linas/atomspace/blob/sheaf-ram-cpu/opencog/sheaf/docs/ which I'll merge when a bit farther along.

--linas

p.s. I am cc'ing the link-grammar mailing list because (1) all this sheaf stuff is inspired by link-grammar, (2) perhaps I can convince Amir to offer critical feedback. Amir, I suspect that if you read this, you might think, "oh that Linas, he's kooky and is playing fast-and-loose with the facts" but I think that spotting weaknesses and outright errors is best done early & is something you're good at.

On Thu, Sep 3, 2020 at 2:55 PM Linas Vepstas <linasv...@gmail.com> wrote:
As threatened, I started writing up the ram/cpu usage for storing sheaves, hypergraphs and ordinary graphs. Unfortunately, it's like boiling the ocean, and is spiralling out of control.  Rather than presenting an overbearingly long text, here is draft version 0.1, it is seven pages long; you already know the first five pages; I draw your attention to page 6, where I compare RAM usage for hypergraph stores to ordinary graph stores, and then specialize it to the special case of genomics/proteomics datasets, which have that square-root-Zipf profile (which is also what wikipedia article connectivity has ... the square-root-Zipf is not biology-specific, its generic.)

The conclusion is that graph stores are bloated and heavy and inefficient, compared to hypergraph stores. If you think I made a mistake somewhere, please call it out.   I often make mistakes or outright blunders. The point is, again, this stuff has real implications for usability and computability, e.g. for Ben's latest Hyperion ideas, and so it would be nice to get the facts and the numbers correct, so that it doesn't look like we're blowing fumes.

-- Linas

On Wed, Sep 2, 2020 at 1:01 PM Linas Vepstas <linasv...@gmail.com> wrote:


On Tue, Sep 1, 2020 at 11:13 AM Abdulrahman Semrie <hsam...@gmail.com> wrote:



So, without knowing the type, but only knowing the string name? My knee-jerk reaction is you're doing something wrong, if you feel you need to do that.  You've mis-designed some data representation, somehow

I don’t understand how this can be due to a “mis-design” issue. It’s fairly common to index graph dbs using the name of the nodes (or other properties on the vertices/edges) and retrieve a node based on its name only. e.g - http://s3.thinkaurelius.com/docs/titan/1.0.0/indexes.html

I don't understand what you don't understand :-) If you need to search something by name, why don't you put it all into the same name-space? Just make everything a ConceptNode, then you're done.  The whole point of having namespaces, is that they're namespaces, but now you want to not have namespaces. -- You've got to kind of figure out whether you want to have them and use them, or whether you don't want to have them. So I'm back to the initial point: what are you actually trying to do?

I think there are three approaches to add indexing to the atomspace, each with its own pros and cons.

  1. The first potential solution , as you and Kasim have suggested, is to use MemberLinks to “index” the atoms. This has the benefit that it doesn’t require any additional work to add to the current atomspace. But it has the following disadvantages:

    a. It could result in doubling the number of atoms in atomspace - i.e more RAM usage. It is infeasible for large atomspaces

Why do you think it doubles? Where do you think all of that RAM usage is going? Where do you think indexes are kept?  The MemberLink maintains indexes in the incoming/outgoing sets, those are just c++ std::set and std::vector, respectively. If you create some other index, you are just moving around where the RAM is being used.  You're talking about shifting around the internal representation; you are not proposing anything that will actually decrease RAM usage.

Take a look at RAM and disk usage in one of these commercial databases. Create a table. Put a milllion things into it. Place an index on the table; RAM usage will double. Place two indexes on it. RAM usage will triple. It's quite linear, and it's a worth-while experiment to perform, if you are interested in databases.  It will provide some hints into how these systems are actually designed, under the covers.

(And not just RAM, disk, too, when stored to disk. Disk usage doubles, triples ... as you add indexes.)

  1. b. As you have explained earlier, using MemberLink means more graph search -> more cpu usage

    c. This solution requires the user of the atomspace to manage the index. The user has to take care of adding/deleting MemberLinks when a node is added/deleted.

Nothing in life is free. It is not the "user" who is adding these "by hand", it is a block of code. That block of code requires the expenditure of CPU cycles. You can re-arrange the location where those CPU cycles are spent, but you can't remove it. 

Adding/deleting MemberLinks is trivial, so the complaint is unfounded. 

Only you know what the structure of your data needs to be.  You have to manage the structure of your data.

As a general rule, you cannot force the AtomSpace to guess what you wanted to have.  Trying to automate indexes is particularly damaging: for every atom insertion/deletion, the "automatic" code would have to trigger the indexer, analyze the kind of atom being inserted, and then make a decision as to which indexes need to be updated.

Again: take a commercial database, pick one, any one, and place a million items into it. Then measure insert/delete performance.  Then place an index on it; repeat. Put two indexes on it, repeat. As before, you will discover that CPU time will double, triple, quadruple for each new index that you add. It is very linear.  It's O(N) where N is the number of indexes.

The AtomSpace is very unique -- it offers you indexes that run at O(1) instead of O(N).  This is because the atomspace design has a trick that few other (maybe no other) databases offer: it has fractional indexes. aka incoming/outgoing sets.  When you add a MemberLink, you update only a handful of very small, per-atom indexes (three, to be precise: for (Member A B) you update incoming-A, incoming-B and outgoing Member) Because they are tiny, they fit on one CPU cache-line. Because they are tiny, there is no tree-rebalancing. Because they are tiny, there are no hash-buckets. 

But wait, there's more -- the most important point is that if you add some other atom -- say (Concept "foo") -- then exactly ZERO index updates are needed! Compare this to the commercial databases: if you have a commercial database with N=5 indexes, and K=1000000 entries, you have to update N indexes of size K, when you add "foo" to some table. CPU time will be typically be O(N log K) on those commercial databases, vs O(1) for the AtomSpace.

Seriously, this is a big deal. The AtomSpace literally does something that other SQL/noSQL databases literally cannot do. This is something that "graph databases" can kind-of do, because (in principle) graph databases can also maintain fractional indexes (that is what makes them graphs)  Whether they actually do this, I don't know.

This is one reason I'm working so hard on the "sheaf" theory -- the sheaving shows you precisely how to manage RAM and CPU for the connectivity of a graph.   I guess I now have to write yet another PDF in this series, talking about indexes. The sheaves aren't just some willy-nilly abstraction: they have real-life implications for RAM and CPU.

This is just email, so if you are uncertain of what I write or don't believe me, you can actually perform these experiments yourself, make the measurements. Prove me wrong -- I'm not always right. There's always room for improvement.  But I'm certain that, in this case, I'm right, because I've, uhh, "been around the block a few times". Or, uhh, "been there, done that".

Having well-done, well-documented measurements would be nice: we could publish a white-paper comparing atomspace performance to performance of database X. I'm fairly confident we'll come out looking pretty good.

  1. d. We can’t add index atoms by their values using this approach. For example, if we want to retrieve all atoms that have values with the key key1, it is not possible with this approach

  2. The second solution is adding support for indices that are managed by the atomspace similar to that of TypeIndex and having a way for users to define custom indices on atoms. This has the benefit that the user of the atomspace doesn’t have to manually handle insertion and removal of atoms for the indices. It also allows indexing atoms by their Values. The cons with this solution, in addition to the extra work required, is as we add more indices inserting and deleting atoms will be slower.

  3. The third solution is using external search engines/dbs (such as ElasticSearch or Apache Solr) to store the indices of the atomspaces. This moves managing the indices from the atomspace to the dbs and will improve the search time without having slower write speeds. But this requires to have some interface code to connect the atomspace with external index store.

Combination 2 & 3 is what I have seen being mostly used in other databases. For example, in the variant annotation work I did couple of months ago, I used mongodb to store the genomic data which indexed data by their id and used elasticsearch to store the location and gene indices of the variants.


I've nothing more to say here. This is why "other databases" suck. Seriously -- you should try to actually use some of them, before you tell me how great they are. Spend a few weeks, spend a month. You'll be back. And if you don't come back, .. well, hey, then clearly, the Atomspace was not the solution for whatever problem you are trying to solve.  The AtomSpace does not solve every problem. But the ones it does solve, it solves better than anything else out there.

--linas


Regards,

Abdulrahman Semrie

On Thursday, Aug 27, 2020 at 9:43 PM, Linas Vepstas <linasv...@gmail.com> wrote:


On Thu, Aug 27, 2020 at 12:55 PM Abdulrahman Semrie <hsam...@gmail.com> wrote:
> A second is to create a UniProtNode and use that; queries are then simple because you just ask for all UniprotNodes.

We are already using this approach. We have added new, data-source specific types to the atomspace and we use those types in pattern matching query.

> A third (recommended) way is to write  (MemberLink (Node "Uniprot: 1234") (Concept "the-set-of-all-uniprots"))

can you please explain why this approach is recommended compared to the second one? Doesn't using this approach add many links that can be avoided by having a specific type?

The second approach is problematic, because it requires a human being to create, edit and publish an "atom_types.script" file to github ... and then download, compile, install. That's a hassle.

It is possible to dynamically add new Atom types at run-time, but the API's for this are incomplete; this is not encouraged, because it raises a variety of other technical issues (which I can review, but is a bit off-topic).

The third approach (of using MemberLinks and/or EvaluationLinks) is "recommended"  because it can be done at run-time, and has a giant raft of neat features, bells and whistles for doing all kinds of stuff. It's very flexible.

The disadvantage of using MemberLinks is slightly larger RAM usage: about 100 bytes for the link itself, and maybe another 50 bytes for the incoming-set index. (roughly; it's hard to measure because these bytes are not all in one place; some are in c++ std::set internal  nodes, etc.)  This implies that one million MemberLinks require 150MB extra RAM which ... well, that seems reasonable, these days.  Defining a UniProtNode does save you this RAM.

Another disadvantage of using MemberLinks during search is that it requires one extra graph-walk step during pattern matching -- i.e. walking from (Node "Uniprot: 1234") to the (Concept "the-set-of-all-uniprots")) via the MemberLink. This walk does take many thousands of cycles and lord-knows how many cpu-cache misses.   Based on my measurements from January with the actual bio-agi/mozi datasets in use at that time, I had the impression that this extra step accounts for 5% to 25% extra CPU, depending on the query.  Something like that. And since the queries you are doing are extremely common, frequent and cpu-intensive, it makes sense to hand-optimize and performance-tune, i.e. to invent a new UniProtNode.

But both of these considerations are very highly specific to the agi-bio/mozi project, where you are willing to put in the human labor costs to optimize for performance. More generally, one must recognize that human labor is expensive, and CPU-cycles/RAM is cheap, so the generic answer has to be "buy more RAM, buy more CPU" because that is cheaper than paying humans.  That's why the third way is the recommended way.


> . unless you mean "can I ask if (Node "uniprot: 1234") exists, without accidentally creating it if it does not?"

More like "can I ask if any node with name "uniprot:1234" exists? If so, can you return that node."

> you can do this from the C++, scheme and python API's, but you cannot do this in Atomese.

If I know the type and the name, yes I can do this from the C++, scheme and python - I'm actually doing this in the C++ code for the rpc server. But in the case I'm describing, I only know the name and not the type. And to create a Handle to retrieve the atom, I need both the type and the name.

So, without knowing the type, but only knowing the string name? My knee-jerk reaction is you're doing something wrong, if you feel you need to do that.  You've mis-designed some data representation, somehow.

But if you really, really want to ... you can get a list of all atom types, and then ask the atomspace, one type at a time, if it has a node with that name.

Another possibility is to "create an index" via the third (recommended) way:

(MemberLink (ProteinNode "unprot: 1234") (Concept "things named 1234"))
(MemberLink (GoNode "structure: 1234") (Concept "things named 1234"))
(MemberLink (PathwayNode "pathway: 1234") (Concept "things named 1234"))

And then there's RegexNode .... which is theoretically ugly, because it is a crutch, a work-around to a mis-designed data representation. The theoretical ugliness in a nut-shell: if you really want to use strings, then don't use the atomspace, use perl. That' is what perl is for. Regex's are finite state machines, and the comp-sci industry has well-developed theories of how finite-state machines work, and what you can do with them. More generally, there is a well-defined theory of string manipulation and string-rewriting... adding this to the atomspace is .... its ...  well, it's like trying to surgically attach that machinery ... it's like, .. I dunno, building a car with three wheels powered by a gasoline engine and one wheel powered by an electric motor. You can do it, but why?

--linas
 

On Thursday, August 27, 2020 at 8:33:29 PM UTC+3 linas wrote:
I just provided three different solutions to that task... -- linas

On Thu, Aug 27, 2020 at 11:14 AM Ben Goertzel <b...@goertzel.org> wrote:

I think perhaps what Xabush wants is to be able to query

" Find me all Atoms whose name string contains the substring "ABDPDQ".  "

even if he doesn't know what types these Atoms may be ?

ben

On Thu, Aug 27, 2020 at 9:09 AM Linas Vepstas <linasv...@gmail.com> wrote:
This statement I find confusing: "I can’t write a pattern matching query to retrieve an atom using its id/name" There is one and only one such atom, ever, by definition... There is nothing to query; if you know the name, you know the atom.

There was talk previously about "substring matching", for example, you have atoms  named "Uniprot: 1234" and "Uniprot: 5678" and you want to find all atoms that start with the eight characters "Uniprot:". There are (at least) three solutions for this. One is to create a RegexNode, but this is ugly from a theoretical standpoint. A second is to create a UniProtNode and use that; queries are then simple because you just ask for all UniprotNodes.  A third (recommended) way is to write  (MemberLink (Node "Uniprot: 1234") (Concept "the-set-of-all-uniprots"))

This third way is recommended because, in a sense, the atomspace is nothing but one giant network of interconnected partial indexes. There is an index from (Node "Uniprot: 1234") to everything that makes use of it -- its called "the incoming set" and it is a real index - a c++ std::set  if I recall. Same for (Concept "the-set-of-all-uniprots") and what the pattern matcher "actually does" is to stitch together these partial indexes into a whole, and then prune away the irrelevant parts.

-- Linas

... unless you mean "can I ask if (Node "uniprot: 1234") exists, without accidentally creating it if it does not?" ... you can do this from the C++, scheme and python API's, but you cannot do this in Atomese.




On Thu, Aug 27, 2020 at 4:07 AM Abdulrahman Semrie <hsam...@gmail.com> wrote:



TL;DR: you can already do that.  It's already supported.

It’s partially supported. As you’ve described, we can cache the result of a pattern matching query and it is already supported. However, since I can’t write a pattern matching query to retrieve an atom using its id/name from the atomspace, there is no way to cache/index. If there was some ExistsLink that inherits from QueryLink where you can use to retrieve an atom by its name if it exists or return a false truth value, then what you’ve described can be done.


Regards,

Abdulrahman Semrie

On Thursday, Aug 27, 2020 at 2:46 AM, Linas Vepstas <linasv...@gmail.com> wrote:
TL;DR: you can already do that.  It's already supported.

Please follow me on this train of thought.

1) What is an "index"? Well, its a pre-defined cache of all atoms of some shape or pattern.

2) How can one specify an index?  Well, if its a pattern, then a pattern query can be used.

3) Where should the index be stored, or kept? Well, it can be stored or kept with the pattern that defines the shape of the index.

Before I move on to the next thought, let me point out that 1-2-3 can be directly solved today. Define a pattern, e.g. a query link. Run it. Store the results on the query, as a value. You can "do this yourself", today, its easy, but it becomes even easier if you are willing to read the docs for `cog-execute-cache!` (appended below)

4) How should the index be updated? Ah, well, that is actually the tricky question, the hard question, the place where all of the interesting technology debates and thinking are centered.  One strategy is to update the index every single time an Atom is added to/removed from the atomspace. But recomputing the index every time is wildly inefficient, burning through vast quantities of CPU time. What else can one do? Well, maybe recompute on demand. Or recompute every few minutes. Or maybe once a night. (aka "eventually consistent")  Maybe store a time-stamp on the index, to tell you how old it is. Or maybe have an append-only log of atomspace changes... I can propose many different kinds of solutions. They all have space and time-overhead, and/or assorted usability issues. Which of these best suits your needs, I have trouble guessing, so you would have to explain what the problem is (if any).

--linas

Here's the docs:
 cog-execute-cache! EXEC KEY [METADATA [FRESH]]

   Execute or return cached execution results. This is a caching version
   of the `cog-execute!` call.

   If the optional FRESH boolean flag is #f, then if there is a Value
   stored at KEY on EXEC, return that Value. The default value of FRESH
   is #f, so the default behavior is always to return the cached value.
   If the optional FRESH boolean flag is #t, or if there is no Value
   stored at KEY, then the `cog-execute!` function is called on EXEC,
   and the result is stored at KEY.

   The METADATA Atom is optional.  If it is specified, then metadata
   about the execution is placed on EXEC at the key METADATA.
   Currently, this is just a timestamp of when this execution was
   performed. The format of the meta-data is subject to change; this
   is currently an experimental feature, driven by user requirements.

   At this time, execution is synchronous. It may be worthwhile to have
   an asynchronous version of this call, where the execution is performed
   at some other time. This has not been done yet.

On Wed, Aug 26, 2020 at 7:41 AM Abdulrahman Semrie <hsam...@gmail.com> wrote:

In the current atomspace, atoms are indexed by their type, i.e given a type we can retrieve all the atoms that have that type. But there is no other away of adding custom indices in the atomspace. For example, if we want to index nodes by their name, there is no way of doing this.

As discussed in this issue, we plan to expand the annotation-service, which uses the AtomSpace to store genomics data, to support the annotation of more types in addition to genes. Currently, when I user submits a list of ids to the service, it is assumed that these ids/symbols represent `GeneNode`s. But in the case where the input can be a protein, a drug molecule, pathway or a gene, there is no direct way of retrieving what type of the atom with the given name is unless we iterate through all atoms searching for that particular id. This isn't be a good approach from performance standpoint. But if we had a custom index - e.g `name_index`, on the ids/names of the atoms, it will be easier to search the atoms by name and identify the type that the atom belongs to.

Hence, if there is a way to add custom indices to the atomspace, it will greatly simplify some searches. Or maybe there is a way to do what I described above without the need for an index. If so, please share it.

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/27892502-0dfb-4042-a805-30a1520f6250n%40googlegroups.com.


--
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

--
You received this message because you are subscribed to a topic in the Google Groups "opencog" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/opencog/5uE2lw6b-5E/unsubscribe.
To unsubscribe from this group and all its topics, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA34qoTA90pcSC3GwXsGy8xpK5yn-1U7k%2Ba10nuDTWcrBLQ%40mail.gmail.com.

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/2a5214b7-c083-40c0-801d-0a3595783046%40Canary.


--
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA37N%3Dbjr7QDQzS-uUpcwaSP%3D44QEYfkmUXQC9mrVEZATEQ%40mail.gmail.com.


--
Ben Goertzel, PhD
http://goertzel.org

“The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.” -- Jack Kerouac

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.


--
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/8e6d763a-9b4d-4a68-810e-d6f16e80e118n%40googlegroups.com.


--
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

--
You received this message because you are subscribed to a topic in the Google Groups "opencog" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/opencog/5uE2lw6b-5E/unsubscribe.
To unsubscribe from this group and all its topics, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA35Z%3DH1oSVFZ%3D-WTTATf4U9jhmfhMMAF6jNO1daTrbDXJg%40mail.gmail.com.

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/e4efdf67-88d3-4163-9e98-78363fc6ed0a%40Canary.


--
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva



--
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva



--
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

ram-cpu.pdf
Reply all
Reply to author
Forward
0 new messages