Indexing in the AtomSpace

Abdulrahman Semrie

unread,

Aug 26, 2020, 8:41:49 AM8/26/20

to opencog

In the current atomspace, atoms are indexed by their type, i.e given a type we can retrieve all the atoms that have that type. But there is no other away of adding custom indices in the atomspace. For example, if we want to index nodes by their name, there is no way of doing this.

As discussed in this issue, we plan to expand the annotation-service, which uses the AtomSpace to store genomics data, to support the annotation of more types in addition to genes. Currently, when I user submits a list of ids to the service, it is assumed that these ids/symbols represent `GeneNode`s. But in the case where the input can be a protein, a drug molecule, pathway or a gene, there is no direct way of retrieving what type of the atom with the given name is unless we iterate through all atoms searching for that particular id. This isn't be a good approach from performance standpoint. But if we had a custom index - e.g `name_index`, on the ids/names of the atoms, it will be easier to search the atoms by name and identify the type that the atom belongs to.

Hence, if there is a way to add custom indices to the atomspace, it will greatly simplify some searches. Or maybe there is a way to do what I described above without the need for an index. If so, please share it.

Linas Vepstas

unread,

Aug 26, 2020, 7:46:13 PM8/26/20

to opencog

TL;DR: you can already do that. It's already supported.

Please follow me on this train of thought.

1) What is an "index"? Well, its a pre-defined cache of all atoms of some shape or pattern.

2) How can one specify an index? Well, if its a pattern, then a pattern query can be used.

3) Where should the index be stored, or kept? Well, it can be stored or kept with the pattern that defines the shape of the index.

Before I move on to the next thought, let me point out that 1-2-3 can be directly solved today. Define a pattern, e.g. a query link. Run it. Store the results on the query, as a value. You can "do this yourself", today, its easy, but it becomes even easier if you are willing to read the docs for `cog-execute-cache!` (appended below)

4) How should the index be updated? Ah, well, that is actually the tricky question, the hard question, the place where all of the interesting technology debates and thinking are centered. One strategy is to update the index every single time an Atom is added to/removed from the atomspace. But recomputing the index every time is wildly inefficient, burning through vast quantities of CPU time. What else can one do? Well, maybe recompute on demand. Or recompute every few minutes. Or maybe once a night. (aka "eventually consistent") Maybe store a time-stamp on the index, to tell you how old it is. Or maybe have an append-only log of atomspace changes... I can propose many different kinds of solutions. They all have space and time-overhead, and/or assorted usability issues. Which of these best suits your needs, I have trouble guessing, so you would have to explain what the problem is (if any).

--linas

Here's the docs:

cog-execute-cache! EXEC KEY [METADATA [FRESH]]

Execute or return cached execution results. This is a caching version
of the `cog-execute!` call.

If the optional FRESH boolean flag is #f, then if there is a Value
stored at KEY on EXEC, return that Value. The default value of FRESH
is #f, so the default behavior is always to return the cached value.
If the optional FRESH boolean flag is #t, or if there is no Value
stored at KEY, then the `cog-execute!` function is called on EXEC,
and the result is stored at KEY.

The METADATA Atom is optional. If it is specified, then metadata
about the execution is placed on EXEC at the key METADATA.
Currently, this is just a timestamp of when this execution was
performed. The format of the meta-data is subject to change; this
is currently an experimental feature, driven by user requirements.

At this time, execution is synchronous. It may be worthwhile to have
an asynchronous version of this call, where the execution is performed
at some other time. This has not been done yet.

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/27892502-0dfb-4042-a805-30a1520f6250n%40googlegroups.com.

--

Verbogeny is one of the pleasurettes of a creatific thinkerizer.
--Peter da Silva

Abdulrahman Semrie

unread,

Aug 27, 2020, 5:07:38 AM8/27/20

to ope...@googlegroups.com

TL;DR: you can already do that. It's already supported.

It’s partially supported. As you’ve described, we can cache the result of a pattern matching query and it is already supported. However, since I can’t write a pattern matching query to retrieve an atom using its id/name from the atomspace, there is no way to cache/index. If there was some ExistsLink that inherits from QueryLink where you can use to retrieve an atom by its name if it exists or return a false truth value, then what you’ve described can be done.

—

Regards,

Abdulrahman Semrie

You received this message because you are subscribed to a topic in the Google Groups "opencog" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/opencog/5uE2lw6b-5E/unsubscribe.
To unsubscribe from this group and all its topics, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA34qoTA90pcSC3GwXsGy8xpK5yn-1U7k%2Ba10nuDTWcrBLQ%40mail.gmail.com.

Linas Vepstas

unread,

Aug 27, 2020, 12:09:04 PM8/27/20

to opencog

This statement I find confusing: "I can’t write a pattern matching query to retrieve an atom using its id/name" There is one and only one such atom, ever, by definition... There is nothing to query; if you know the name, you know the atom.

There was talk previously about "substring matching", for example, you have atoms named "Uniprot: 1234" and "Uniprot: 5678" and you want to find all atoms that start with the eight characters "Uniprot:". There are (at least) three solutions for this. One is to create a RegexNode, but this is ugly from a theoretical standpoint. A second is to create a UniProtNode and use that; queries are then simple because you just ask for all UniprotNodes. A third (recommended) way is to write (MemberLink (Node "Uniprot: 1234") (Concept "the-set-of-all-uniprots"))

This third way is recommended because, in a sense, the atomspace is nothing but one giant network of interconnected partial indexes. There is an index from (Node "Uniprot: 1234") to everything that makes use of it -- its called "the incoming set" and it is a real index - a c++ std::set if I recall. Same for (Concept "the-set-of-all-uniprots") and what the pattern matcher "actually does" is to stitch together these partial indexes into a whole, and then prune away the irrelevant parts.

-- Linas

... unless you mean "can I ask if (Node "uniprot: 1234") exists, without accidentally creating it if it does not?" ... you can do this from the C++, scheme and python API's, but you cannot do this in Atomese.

To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/2a5214b7-c083-40c0-801d-0a3595783046%40Canary.

Ben Goertzel

unread,

Aug 27, 2020, 12:14:28 PM8/27/20

to opencog

I think perhaps what Xabush wants is to be able to query

" Find me all Atoms whose name string contains the substring "ABDPDQ". "

even if he doesn't know what types these Atoms may be ?

ben

To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA37N%3Dbjr7QDQzS-uUpcwaSP%3D44QEYfkmUXQC9mrVEZATEQ%40mail.gmail.com.

--

Ben Goertzel, PhD
http://goertzel.org

“The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.” -- Jack Kerouac

Linas Vepstas

unread,

Aug 27, 2020, 1:33:29 PM8/27/20

to opencog

I just provided three different solutions to that task... -- linas

To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CACYTDBeqdq0vixYq1M0kceBqyywkAvQMPsMOd51X-0V5Oagr2Q%40mail.gmail.com.

Abdulrahman Semrie

unread,

Aug 27, 2020, 1:55:53 PM8/27/20

to opencog

> A second is to create a UniProtNode and use that; queries are then simple because you just ask for all UniprotNodes.

We are already using this approach. We have added new, data-source specific types to the atomspace and we use those types in pattern matching query.

> A third (recommended) way is to write (MemberLink (Node "Uniprot: 1234") (Concept "the-set-of-all-uniprots"))

can you please explain why this approach is recommended compared to the second one? Doesn't using this approach add many links that can be avoided by having a specific type?

> . unless you mean "can I ask if (Node "uniprot: 1234") exists, without accidentally creating it if it does not?"

More like "can I ask if any node with name "uniprot:1234" exists? If so, can you return that node."

> you can do this from the C++, scheme and python API's, but you cannot do this in Atomese.

If I know the type and the name, yes I can do this from the C++, scheme and python - I'm actually doing this in the C++ code for the rpc server. But in the case I'm describing, I only know the name and not the type. And to create a Handle to retrieve the atom, I need both the type and the name.

Linas Vepstas

unread,

Aug 27, 2020, 2:43:24 PM8/27/20

to opencog

On Thu, Aug 27, 2020 at 12:55 PM Abdulrahman Semrie <hsam...@gmail.com> wrote:

> A second is to create a UniProtNode and use that; queries are then simple because you just ask for all UniprotNodes.

We are already using this approach. We have added new, data-source specific types to the atomspace and we use those types in pattern matching query.

> A third (recommended) way is to write (MemberLink (Node "Uniprot: 1234") (Concept "the-set-of-all-uniprots"))

can you please explain why this approach is recommended compared to the second one? Doesn't using this approach add many links that can be avoided by having a specific type?

The second approach is problematic, because it requires a human being to create, edit and publish an "atom_types.script" file to github ... and then download, compile, install. That's a hassle.

It is possible to dynamically add new Atom types at run-time, but the API's for this are incomplete; this is not encouraged, because it raises a variety of other technical issues (which I can review, but is a bit off-topic).

The third approach (of using MemberLinks and/or EvaluationLinks) is "recommended" because it can be done at run-time, and has a giant raft of neat features, bells and whistles for doing all kinds of stuff. It's very flexible.

The disadvantage of using MemberLinks is slightly larger RAM usage: about 100 bytes for the link itself, and maybe another 50 bytes for the incoming-set index. (roughly; it's hard to measure because these bytes are not all in one place; some are in c++ std::set internal nodes, etc.) This implies that one million MemberLinks require 150MB extra RAM which ... well, that seems reasonable, these days. Defining a UniProtNode does save you this RAM.

Another disadvantage of using MemberLinks during search is that it requires one extra graph-walk step during pattern matching -- i.e. walking from (Node "Uniprot: 1234") to the (Concept "the-set-of-all-uniprots")) via the MemberLink. This walk does take many thousands of cycles and lord-knows how many cpu-cache misses. Based on my measurements from January with the actual bio-agi/mozi datasets in use at that time, I had the impression that this extra step accounts for 5% to 25% extra CPU, depending on the query. Something like that. And since the queries you are doing are extremely common, frequent and cpu-intensive, it makes sense to hand-optimize and performance-tune, i.e. to invent a new UniProtNode.

But both of these considerations are very highly specific to the agi-bio/mozi project, where you are willing to put in the human labor costs to optimize for performance. More generally, one must recognize that human labor is expensive, and CPU-cycles/RAM is cheap, so the generic answer has to be "buy more RAM, buy more CPU" because that is cheaper than paying humans. That's why the third way is the recommended way.

> . unless you mean "can I ask if (Node "uniprot: 1234") exists, without accidentally creating it if it does not?"

More like "can I ask if any node with name "uniprot:1234" exists? If so, can you return that node."

> you can do this from the C++, scheme and python API's, but you cannot do this in Atomese.

If I know the type and the name, yes I can do this from the C++, scheme and python - I'm actually doing this in the C++ code for the rpc server. But in the case I'm describing, I only know the name and not the type. And to create a Handle to retrieve the atom, I need both the type and the name.

So, without knowing the type, but only knowing the string name? My knee-jerk reaction is you're doing something wrong, if you feel you need to do that. You've mis-designed some data representation, somehow.

But if you really, really want to ... you can get a list of all atom types, and then ask the atomspace, one type at a time, if it has a node with that name.

Another possibility is to "create an index" via the third (recommended) way:

(MemberLink (ProteinNode "unprot: 1234") (Concept "things named 1234"))

(MemberLink (GoNode "structure: 1234") (Concept "things named 1234"))

(MemberLink (PathwayNode "pathway: 1234") (Concept "things named 1234"))

And then there's RegexNode .... which is theoretically ugly, because it is a crutch, a work-around to a mis-designed data representation. The theoretical ugliness in a nut-shell: if you really want to use strings, then don't use the atomspace, use perl. That' is what perl is for. Regex's are finite state machines, and the comp-sci industry has well-developed theories of how finite-state machines work, and what you can do with them. More generally, there is a well-defined theory of string manipulation and string-rewriting... adding this to the atomspace is .... its ... well, it's like trying to surgically attach that machinery ... it's like, .. I dunno, building a car with three wheels powered by a gasoline engine and one wheel powered by an electric motor. You can do it, but why?

--linas

To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/8e6d763a-9b4d-4a68-810e-d6f16e80e118n%40googlegroups.com.

Abdulrahman Semrie

unread,

Sep 1, 2020, 12:13:02 PM9/1/20

to ope...@googlegroups.com, Kasim Ebrahim

So, without knowing the type, but only knowing the string name? My knee-jerk reaction is you're doing something wrong, if you feel you need to do that. You've mis-designed some data representation, somehow

I don’t understand how this can be due to a “mis-design” issue. It’s fairly common to index graph dbs using the name of the nodes (or other properties on the vertices/edges) and retrieve a node based on its name only. e.g - http://s3.thinkaurelius.com/docs/titan/1.0.0/indexes.html

I think there are three approaches to add indexing to the atomspace, each with its own pros and cons.

The first potential solution , as you and Kasim have suggested, is to use MemberLinks to “index” the atoms. This has the benefit that it doesn’t require any additional work to add to the current atomspace. But it has the following disadvantages:

a. It could result in doubling the number of atoms in atomspace - i.e more RAM usage. It is infeasible for large atomspaces

b. As you have explained earlier, using MemberLink means more graph search -> more cpu usage

c. This solution requires the user of the atomspace to manage the index. The user has to take care of adding/deleting MemberLinks when a node is added/deleted.

d. We can’t add index atoms by their values using this approach. For example, if we want to retrieve all atoms that have values with the key key1, it is not possible with this approach
The second solution is adding support for indices that are managed by the atomspace similar to that of TypeIndex and having a way for users to define custom indices on atoms. This has the benefit that the user of the atomspace doesn’t have to manually handle insertion and removal of atoms for the indices. It also allows indexing atoms by their Values. The cons with this solution, in addition to the extra work required, is as we add more indices inserting and deleting atoms will be slower.
The third solution is using external search engines/dbs (such as ElasticSearch or Apache Solr) to store the indices of the atomspaces. This moves managing the indices from the atomspace to the dbs and will improve the search time without having slower write speeds. But this requires to have some interface code to connect the atomspace with external index store.

Combination 2 & 3 is what I have seen being mostly used in other databases. For example, in the variant annotation work I did couple of months ago, I used mongodb to store the genomic data which indexed data by their id and used elasticsearch to store the location and gene indices of the variants.

—

Regards,

Abdulrahman Semrie

To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA35Z%3DH1oSVFZ%3D-WTTATf4U9jhmfhMMAF6jNO1daTrbDXJg%40mail.gmail.com.

Linas Vepstas

unread,

Sep 2, 2020, 2:02:12 PM9/2/20

to opencog, Kasim Ebrahim

On Tue, Sep 1, 2020 at 11:13 AM Abdulrahman Semrie <hsam...@gmail.com> wrote:

So, without knowing the type, but only knowing the string name? My knee-jerk reaction is you're doing something wrong, if you feel you need to do that. You've mis-designed some data representation, somehow

I don’t understand how this can be due to a “mis-design” issue. It’s fairly common to index graph dbs using the name of the nodes (or other properties on the vertices/edges) and retrieve a node based on its name only. e.g - http://s3.thinkaurelius.com/docs/titan/1.0.0/indexes.html

I don't understand what you don't understand :-) If you need to search something by name, why don't you put it all into the same name-space? Just make everything a ConceptNode, then you're done. The whole point of having namespaces, is that they're namespaces, but now you want to not have namespaces. -- You've got to kind of figure out whether you want to have them and use them, or whether you don't want to have them. So I'm back to the initial point: what are you actually trying to do?

I think there are three approaches to add indexing to the atomspace, each with its own pros and cons.

The first potential solution , as you and Kasim have suggested, is to use MemberLinks to “index” the atoms. This has the benefit that it doesn’t require any additional work to add to the current atomspace. But it has the following disadvantages:

a. It could result in doubling the number of atoms in atomspace - i.e more RAM usage. It is infeasible for large atomspaces

Why do you think it doubles? Where do you think all of that RAM usage is going? Where do you think indexes are kept? The MemberLink maintains indexes in the incoming/outgoing sets, those are just c++ std::set and std::vector, respectively. If you create some other index, you are just moving around where the RAM is being used. You're talking about shifting around the internal representation; you are not proposing anything that will actually decrease RAM usage.

Take a look at RAM and disk usage in one of these commercial databases. Create a table. Put a milllion things into it. Place an index on the table; RAM usage will double. Place two indexes on it. RAM usage will triple. It's quite linear, and it's a worth-while experiment to perform, if you are interested in databases. It will provide some hints into how these systems are actually designed, under the covers.

(And not just RAM, disk, too, when stored to disk. Disk usage doubles, triples ... as you add indexes.)

b. As you have explained earlier, using MemberLink means more graph search -> more cpu usage

c. This solution requires the user of the atomspace to manage the index. The user has to take care of adding/deleting MemberLinks when a node is added/deleted.

Nothing in life is free. It is not the "user" who is adding these "by hand", it is a block of code. That block of code requires the expenditure of CPU cycles. You can re-arrange the location where those CPU cycles are spent, but you can't remove it.

Adding/deleting MemberLinks is trivial, so the complaint is unfounded.

Only you know what the structure of your data needs to be. You have to manage the structure of your data.

As a general rule, you cannot force the AtomSpace to guess what you wanted to have. Trying to automate indexes is particularly damaging: for every atom insertion/deletion, the "automatic" code would have to trigger the indexer, analyze the kind of atom being inserted, and then make a decision as to which indexes need to be updated.

Again: take a commercial database, pick one, any one, and place a million items into it. Then measure insert/delete performance. Then place an index on it; repeat. Put two indexes on it, repeat. As before, you will discover that CPU time will double, triple, quadruple for each new index that you add. It is very linear. It's O(N) where N is the number of indexes.

The AtomSpace is very unique -- it offers you indexes that run at O(1) instead of O(N). This is because the atomspace design has a trick that few other (maybe no other) databases offer: it has fractional indexes. aka incoming/outgoing sets. When you add a MemberLink, you update only a handful of very small, per-atom indexes (three, to be precise: for (Member A B) you update incoming-A, incoming-B and outgoing Member) Because they are tiny, they fit on one CPU cache-line. Because they are tiny, there is no tree-rebalancing. Because they are tiny, there are no hash-buckets.

But wait, there's more -- the most important point is that if you add some other atom -- say (Concept "foo") -- then exactly ZERO index updates are needed! Compare this to the commercial databases: if you have a commercial database with N=5 indexes, and K=1000000 entries, you have to update N indexes of size K, when you add "foo" to some table. CPU time will be typically be O(N log K) on those commercial databases, vs O(1) for the AtomSpace.

Seriously, this is a big deal. The AtomSpace literally does something that other SQL/noSQL databases literally cannot do. This is something that "graph databases" can kind-of do, because (in principle) graph databases can also maintain fractional indexes (that is what makes them graphs) Whether they actually do this, I don't know.

This is one reason I'm working so hard on the "sheaf" theory -- the sheaving shows you precisely how to manage RAM and CPU for the connectivity of a graph. I guess I now have to write yet another PDF in this series, talking about indexes. The sheaves aren't just some willy-nilly abstraction: they have real-life implications for RAM and CPU.

This is just email, so if you are uncertain of what I write or don't believe me, you can actually perform these experiments yourself, make the measurements. Prove me wrong -- I'm not always right. There's always room for improvement. But I'm certain that, in this case, I'm right, because I've, uhh, "been around the block a few times". Or, uhh, "been there, done that".

Having well-done, well-documented measurements would be nice: we could publish a white-paper comparing atomspace performance to performance of database X. I'm fairly confident we'll come out looking pretty good.

d. We can’t add index atoms by their values using this approach. For example, if we want to retrieve all atoms that have values with the key key1, it is not possible with this approach

The second solution is adding support for indices that are managed by the atomspace similar to that of TypeIndex and having a way for users to define custom indices on atoms. This has the benefit that the user of the atomspace doesn’t have to manually handle insertion and removal of atoms for the indices. It also allows indexing atoms by their Values. The cons with this solution, in addition to the extra work required, is as we add more indices inserting and deleting atoms will be slower.

The third solution is using external search engines/dbs (such as ElasticSearch or Apache Solr) to store the indices of the atomspaces. This moves managing the indices from the atomspace to the dbs and will improve the search time without having slower write speeds. But this requires to have some interface code to connect the atomspace with external index store.

Combination 2 & 3 is what I have seen being mostly used in other databases. For example, in the variant annotation work I did couple of months ago, I used mongodb to store the genomic data which indexed data by their id and used elasticsearch to store the location and gene indices of the variants.

I've nothing more to say here. This is why "other databases" suck. Seriously -- you should try to actually use some of them, before you tell me how great they are. Spend a few weeks, spend a month. You'll be back. And if you don't come back, .. well, hey, then clearly, the Atomspace was not the solution for whatever problem you are trying to solve. The AtomSpace does not solve every problem. But the ones it does solve, it solves better than anything else out there.

--linas

To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/e4efdf67-88d3-4163-9e98-78363fc6ed0a%40Canary.

Linas Vepstas

unread,

Sep 3, 2020, 3:56:09 PM9/3/20

to opencog, Kasim Ebrahim

As threatened, I started writing up the ram/cpu usage for storing sheaves, hypergraphs and ordinary graphs. Unfortunately, it's like boiling the ocean, and is spiralling out of control. Rather than presenting an overbearingly long text, here is draft version 0.1, it is seven pages long; you already know the first five pages; I draw your attention to page 6, where I compare RAM usage for hypergraph stores to ordinary graph stores, and then specialize it to the special case of genomics/proteomics datasets, which have that square-root-Zipf profile (which is also what wikipedia article connectivity has ... the square-root-Zipf is not biology-specific, its generic.)

The conclusion is that graph stores are bloated and heavy and inefficient, compared to hypergraph stores. If you think I made a mistake somewhere, please call it out. I often make mistakes or outright blunders. The point is, again, this stuff has real implications for usability and computability, e.g. for Ben's latest Hyperion ideas, and so it would be nice to get the facts and the numbers correct, so that it doesn't look like we're blowing fumes.

-- Linas

ram-cpu.pdf

Linas Vepstas

unread,

Sep 4, 2020, 1:23:00 AM9/4/20

to opencog, Xabush Semrie, Kasim Ebrahim, Amir P, link-grammar

My apologies for the incessant email-bombing. Not sure how else to communicate. Here is version 0.2 of the doc. I think (I hope) that the pages 4-5-6 clearly demonstrate how hypergraphs offer a RAM-usage performance over ordinary graph stores. If you think you see flawed arguments there, please let me know. The new pages 8-9 on indexing then explain how indexing on a conventional SQL/noSQL DB just happens to look identical to a hypergraph. Bet no one ever told you that, before!

The key point: for SQL/noSQL DB's, the indexes are always hidden away from the user, and cannot be directly accessed, walked, manipulated; they are internal-use-only by the DB.When you use a graph DB, you get direct access to "indexes" as user-visible and user-controllable objects. If you use a hypergraph instead of a graph, then you also get a (large) space savings.

To be written: how pattern-matching compares to tinkerpop/gremlin. The theory of recursive algorithms (which is what pattern matching actually is, under the covers). Specifying recursive algorithms with context free grammars. Finally, parsing with context-free grammars. The connectors/sheaves thing is the cornerstone for parsing with context-free grammars. The "secret sauce" here is (I think) that this can be done with RAM/CPU consumption as good as hypergraph DB's. So the goal of writing this is to spell out in painful detail all these steps.

The source material for this is in https://github.com/linas/atomspace/blob/sheaf-ram-cpu/opencog/sheaf/docs/ which I'll merge when a bit farther along.

--linas

p.s. I am cc'ing the link-grammar mailing list because (1) all this sheaf stuff is inspired by link-grammar, (2) perhaps I can convince Amir to offer critical feedback. Amir, I suspect that if you read this, you might think, "oh that Linas, he's kooky and is playing fast-and-loose with the facts" but I think that spotting weaknesses and outright errors is best done early & is something you're good at.

ram-cpu.pdf

Abdulrahman Semrie

unread,

Sep 8, 2020, 4:31:58 PM9/8/20

to opencog

Great write up Linas!

I'm fuzzy on the formula for the size of vertex and edge tables on page 6. It'd be great if you added an explanation to make it more clear.

With regards to indexing, the benefit of using graph dbs for partial indices is clear. But I have one question with regards to the current Atomspace design. In your example, you represent the departments as "privileged vertices" and connect them to their respective employee vertices. In the current AtomSpace, there is a TypeIndex which is represented using a hash table (std::unordered_multimap to be exact). Why not represent the types using vertices and connect every other atom to the type vertice it belongs to? Like you suggested above, something like (MemberLink (Concept "Uniprot:12233") (Concept "ProteinNode")). This will lead to some type vertices being "Supernodes" in that a single vertex will be connected to many vertices, perhaps millions of vertices. This will result in a performance issue with naive graph db representations because the outgoing set of the type vertices will be very large. Titandb solves this by having the concept of unidirectional edge where only the destination vertex is aware of its connection to the supernode. But looking at the hypergraph tables in the document, this problem is already solved. So why not use this approach for the TypeIndex?

Re: using MemberLinks as a way of indexing by name

Where do you think all of that RAM usage is going? Where do you think indexes are kept? The MemberLink maintains indexes in the incoming/outgoing sets, those are just c++ std::set and std::vector, respectively. If you create some other index, you are just moving around where the RAM is being used. You're talking about shifting around the internal representation; you are not proposing anything that will actually decrease RAM usage.

Correct me if I'm wrong but won't using std::unorder_multimap<string, Handle> will have less RAM usage than creating new ConceptNode and MemberLinks for indexing?

Also how about adding an api in the atomspace so that we can use external index stores like ElasticSearch/Apache Solr? This is especially useful if we want to do full-text search on atom names. I can help with this integration if you think this idea is worthwhile.

Linas Vepstas

unread,

Sep 8, 2020, 6:24:09 PM9/8/20

to opencog

On Tue, Sep 8, 2020 at 3:32 PM Abdulrahman Semrie <hsam...@gmail.com> wrote:

Great write up Linas!

I'm fuzzy on the formula for the size of vertex and edge tables on page 6. It'd be great if you added an explanation to make it more clear.

Which part is unclear?

With regards to indexing, the benefit of using graph dbs for partial indices is clear. But I have one question with regards to the current Atomspace design. In your example, you represent the departments as "privileged vertices" and connect them to their respective employee vertices. In the current AtomSpace, there is a TypeIndex which is represented using a hash table (std::unordered_multimap to be exact). Why not represent the types using vertices and connect every other atom to the type vertice it belongs to? Like you suggested above, something like (MemberLink (Concept "Uniprot:12233") (Concept "ProteinNode")). This will lead to some type vertices being "Supernodes" in that a single vertex will be connected to many vertices, perhaps millions of vertices. This will result in a performance issue with naive graph db representations because the outgoing set of the type vertices will be very large. Titandb solves this by having the concept of unidirectional edge where only the destination vertex is aware of its connection to the supernode. But looking at the hypergraph tables in the document, this problem is already solved. So why not use this approach for the TypeIndex?

Minor correction: it would by (TypeNode 'Protein) -- the TypeNode performs a check to avoid mis-spellings, and slots in naturally into things that expect types.

But I don't understand what you are saying. What is the problem that is being solved? How would this be better than what there is now?

Re: using MemberLinks as a way of indexing by name

Where do you think all of that RAM usage is going? Where do you think indexes are kept? The MemberLink maintains indexes in the incoming/outgoing sets, those are just c++ std::set and std::vector, respectively. If you create some other index, you are just moving around where the RAM is being used. You're talking about shifting around the internal representation; you are not proposing anything that will actually decrease RAM usage.

Correct me if I'm wrong but won't using std::unorder_multimap<string, Handle> will have less RAM usage than creating new ConceptNode and MemberLinks for indexing?

What problem does this solve?

Also how about adding an api in the atomspace so that we can use external index stores like ElasticSearch/Apache Solr? This is especially useful if we want to do full-text search on atom names. I can help with this integration if you think this idea is worthwhile.

What problem does this solve? My knee-jerk reaction is that it is a terrible idea -- you were complaining about RAM usage, but now you want to introduce large, complex, RAM-intensive applications? Yuck!

Previous emails offered 3-4 different simple, small light-weight solutions that would be faster, easier, less RAM-intensive than Solr/Elastic. But the key problem is that you still haven't explained what the actual problem is -- so it is effectively impossible to discuss solutions, if the problem remains hidden.

I have the general impression that you are suffering from a case of "the grass is greener on the other side" -- you are looking at these corporate mega-solutions and are imagining how wonderful it might be to work for some large corporation, solving some kind of mega-problem where you have 300 people attending division-wide meetings to organize and plan.

I'm not sure what to say to that -- there might be more money, but it is definitely NOT more pleasant or fun. Thrilling, at first -- sure. Like any good drug. I could say something like "try it- get a job at mega-corp" but you will then disappear for a decade, before you realize what a terrible mistake you've made, and will be quite unable to find the exit door.(*) If you are young .. I dunno. What's 10 years, when you're young? You will get a hands-on education in corporate structure and business administration, which is worth-while if your end-goals are to run departments or manage large projects.

Just be aware that managing large projects requires exceptional political skills. Politics is a snake-pit -- things are always reaching up from unexpected locations and biting you in unexpected ways. Random people will stab you in the back, and you will be nicely set up in situations where you will be knocked down, forced to fail in the most ugly, public fashion possible. The competition for executive jobs is fierce and brutal. Winning is only 2/3rds luck and 1/3rd skill.

-- Linas

(*) Those Hollywood movies about unhappy Americans and unhappy housewives? They may be fictional, but they have a basis in reality.

--

Patrick: Are they laughing at us?

Sponge Bob: No, Patrick, they are laughing next to us.

Abdulrahman Semrie

unread,

Sep 15, 2020, 12:00:16 PM9/15/20

to opencog

> But I don't understand what you are saying. What is the problem that is being solved? How would this be better than what there is now?

In this particular case, I’m pointing out that you aren’t using the suggestion you gave me for indexing atoms by their type in the current Atomspace - i.e MemberLinks instead of a hash table. So I am wondering why is the TypIndex not implemented using a MemberLink (as you suggested above) and according to the example in the document you linked ?

Anyways, as it pertains to the annotation work, I'm using a simple hack to get the type using only the name. It works and the performances isn't bad because we only have about two dozen types. You can see the code here : https://github.com/Habush/atomspace-rpc/blob/7b01ef18f70473457a97159c0b120743428272cb/src/manager/AtomSpaceManager.cpp#L103

Reply all

Reply to author

Forward