--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/27892502-0dfb-4042-a805-30a1520f6250n%40googlegroups.com.
> TL;DR: you can already do that. It's already supported.
It’s partially supported. As you’ve described, caching the result of a pattern-matching query is already supported. However, since I can’t write a pattern-matching query that retrieves an atom from the atomspace by its id/name, there is no way to cache/index that lookup. If there were some ExistsLink that inherits from QueryLink, which you could use to retrieve an atom by its name if it exists, or to get back a false truth value otherwise, then what you’ve described could be done.
> A second is to create a UniProtNode and use that; queries are then simple because you just ask for all UniprotNodes.
We are already using this approach. We have added new, data-source-specific types to the atomspace, and we use those types in pattern-matching queries.
> A third (recommended) way is to write (MemberLink (Node "Uniprot: 1234") (Concept "the-set-of-all-uniprots"))
Can you please explain why this approach is recommended over the second one? Doesn't it add many links that could be avoided by having a specific type?
> ... unless you mean "can I ask if (Node "uniprot: 1234") exists, without accidentally creating it if it does not?"
More like "can I ask if any node with the name "uniprot:1234" exists? If so, can you return that node?"
> you can do this from the C++, scheme and python API's, but you cannot do this in Atomese.
If I know the type and the name, yes, I can do this from C++, Scheme and Python; I'm actually doing this in the C++ code for the RPC server. But in the case I'm describing, I only know the name and not the type. And to create a Handle to retrieve the atom, I need both the type and the name.
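To make the difference between the two lookups concrete, here is a minimal sketch in plain Python. It is a toy model, not the AtomSpace API: `ToyAtomTable`, `get_node` and `find_by_name` are hypothetical names made up for illustration, and atoms are modelled as simple `(type, name)` tuples.

```python
from collections import defaultdict


class ToyAtomTable:
    """Hypothetical stand-in for an atom table; NOT the real AtomSpace API."""

    def __init__(self):
        self._by_type_and_name = {}        # (type, name) -> atom
        self._by_name = defaultdict(set)   # name -> set of types using that name

    def add_node(self, atom_type, name):
        atom = (atom_type, name)
        self._by_type_and_name[atom] = atom
        self._by_name[name].add(atom_type)
        return atom

    def get_node(self, atom_type, name):
        """Lookup that needs both type and name; never creates the node."""
        return self._by_type_and_name.get((atom_type, name))

    def find_by_name(self, name):
        """Name-only lookup: every node with this name, whatever its type."""
        # .get() avoids creating an empty bucket for unknown names.
        return sorted((t, name) for t in self._by_name.get(name, ()))


table = ToyAtomTable()
table.add_node("UniprotNode", "uniprot:1234")
print(table.get_node("UniprotNode", "uniprot:1234"))  # type and name known
print(table.find_by_name("uniprot:1234"))             # only the name known
```

The second dictionary is the extra index being asked for: without it, a name-only query would have to try every type or scan every atom.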
> So, without knowing the type, but only knowing the string name? My knee-jerk reaction is you're doing something wrong, if you feel you need to do that. You've mis-designed some data representation, somehow.
I don’t understand how this can be a “mis-design” issue. It’s fairly common to index graph DBs by the names of the nodes (or by other properties of the vertices/edges) and to retrieve a node based on its name only, e.g. http://s3.thinkaurelius.com/docs/titan/1.0.0/indexes.html
I think there are three approaches to adding indexing to the atomspace, each with its own pros and cons.
The first potential solution, as you and Kasim have suggested, is to use MemberLinks to “index” the atoms. This has the benefit that it requires no additional work on the current atomspace. But it has the following disadvantages:
a. It could double the number of atoms in the atomspace, i.e. more RAM usage. That is infeasible for large atomspaces.
b. As you explained earlier, using MemberLinks means more graph search, and hence more CPU usage.
c. This solution requires the user of the atomspace to manage the index: the user has to take care of adding/deleting MemberLinks whenever a node is added/deleted.
d. We can’t index atoms by their values using this approach. For example, retrieving all atoms that have a value under the key key1 is not possible with this approach.
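Points (a) and (c) can be illustrated with a small toy model in plain Python. This is only a sketch under stated assumptions: `ToyAtomSpace`, `add_indexed`, `remove_indexed` and `members_of` are hypothetical helpers, atoms are plain tuples, and the linear scan in `members_of` stands in for whatever graph search the real atomspace would perform.

```python
class ToyAtomSpace:
    """Hypothetical container of atoms modelled as tuples; NOT the real AtomSpace."""

    def __init__(self):
        self.atoms = set()

    def add(self, atom):
        self.atoms.add(atom)
        return atom

    def remove(self, atom):
        self.atoms.discard(atom)


def add_indexed(space, node, index_set):
    """Add a node plus the MemberLink that 'indexes' it (the caller's job)."""
    space.add(node)
    space.add(index_set)
    return space.add(("MemberLink", node, index_set))


def remove_indexed(space, node, index_set):
    """Remove a node; the caller must also remember to drop its MemberLink."""
    space.remove(("MemberLink", node, index_set))
    space.remove(node)


def members_of(space, index_set):
    """Collect every node linked to index_set (toy linear scan)."""
    return sorted(a[1] for a in space.atoms
                  if a[0] == "MemberLink" and a[2] == index_set)


space = ToyAtomSpace()
uniprots = ("Concept", "the-set-of-all-uniprots")
for i in range(3):
    add_indexed(space, ("Node", f"Uniprot: {i}"), uniprots)
print(len(space.atoms))             # nodes + one set node + one link per node
print(members_of(space, uniprots))
```

With three indexed nodes the toy space holds seven atoms (three nodes, one set node, three links), which is the near-doubling described in (a), and nothing keeps the links in sync unless the caller goes through `add_indexed`/`remove_indexed`.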
The second solution is to add support for indices that are managed by the atomspace, similar to the TypeIndex, together with a way for users to define custom indices on atoms. This has the benefit that the user of the atomspace doesn’t have to manually handle insertion and removal of atoms in the indices. It also allows indexing atoms by their Values. The con of this solution, in addition to the extra work required, is that inserting and deleting atoms gets slower as we add more indices.
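A rough sketch of what such container-managed, user-defined indices might look like, again as a plain-Python toy rather than a proposal for the actual AtomSpace code. `IndexedAtomTable`, `define_index` and `lookup` are hypothetical names; each index is described by a key function that returns the keys an atom should be filed under.

```python
from collections import defaultdict


class IndexedAtomTable:
    """Hypothetical atom table that maintains user-defined indices itself."""

    def __init__(self):
        self.atoms = {}      # atom -> dict of values attached to it
        self._indices = {}   # index name -> (key function, buckets)

    def define_index(self, index_name, key_fn):
        """Register an index; key_fn(atom, values) returns the keys to file under."""
        buckets = defaultdict(set)
        for atom, values in self.atoms.items():   # back-fill existing atoms
            for key in key_fn(atom, values):
                buckets[key].add(atom)
        self._indices[index_name] = (key_fn, buckets)

    def add(self, atom, values=None):
        if atom in self.atoms:        # re-adding: drop stale index entries first
            self.remove(atom)
        values = dict(values or {})
        self.atoms[atom] = values
        # Every defined index is updated on every insert: this loop is the
        # write-time cost that grows with the number of indices.
        for key_fn, buckets in self._indices.values():
            for key in key_fn(atom, values):
                buckets[key].add(atom)

    def remove(self, atom):
        values = self.atoms.pop(atom)
        for key_fn, buckets in self._indices.values():
            for key in key_fn(atom, values):
                buckets[key].discard(atom)

    def lookup(self, index_name, key):
        _, buckets = self._indices[index_name]
        return sorted(buckets.get(key, ()))


table = IndexedAtomTable()
table.define_index("by_name", lambda atom, values: [atom[1]])
table.define_index("by_value_key", lambda atom, values: list(values))
table.add(("GeneNode", "BRCA1"), {"key1": 0.9})
print(table.lookup("by_name", "BRCA1"))
print(table.lookup("by_value_key", "key1"))
```

The user only calls `add` and `remove`; the table keeps both the name index and the value-key index consistent, at the price of one extra update per index on every write.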
The third solution is to use external search engines/DBs (such as ElasticSearch or Apache Solr) to store the indices of the atomspace. This moves index management from the atomspace to those DBs and improves search time without slowing down writes. But it requires some interface code to connect the atomspace with the external index store.
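The “interface code” could be as thin as the following sketch. Everything here is an assumption for illustration: `ExternalIndex`, `InMemoryIndex` and `IndexedSpace` are hypothetical names, and the in-memory class only stands in for a real ElasticSearch or Solr client, whose actual APIs are deliberately not used here.

```python
from abc import ABC, abstractmethod


class ExternalIndex(ABC):
    """Hypothetical interface an external index store would have to implement."""

    @abstractmethod
    def put(self, atom, name): ...

    @abstractmethod
    def delete(self, atom): ...

    @abstractmethod
    def search(self, term): ...


class InMemoryIndex(ExternalIndex):
    """Stand-in backend doing case-insensitive substring search on names."""

    def __init__(self):
        self._names = {}   # atom -> indexed name

    def put(self, atom, name):
        self._names[atom] = name

    def delete(self, atom):
        self._names.pop(atom, None)

    def search(self, term):
        term = term.lower()
        return sorted(a for a, n in self._names.items() if term in n.lower())


class IndexedSpace:
    """Toy atom container that forwards every write to the external index."""

    def __init__(self, index):
        self.atoms = set()
        self.index = index

    def add_node(self, atom_type, name):
        atom = (atom_type, name)
        self.atoms.add(atom)
        self.index.put(atom, name)
        return atom

    def remove_node(self, atom):
        self.atoms.discard(atom)
        self.index.delete(atom)


space = IndexedSpace(InMemoryIndex())
space.add_node("UniprotNode", "uniprot:1234")
space.add_node("GeneNode", "BRCA1")
print(space.index.search("uniprot"))
```

Swapping `InMemoryIndex` for a class that talks to a real search engine would leave `IndexedSpace` unchanged, which is the main point of putting an interface between the two.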
A combination of 2 & 3 is what I have mostly seen used in other databases. For example, in the variant annotation work I did a couple of months ago, I used MongoDB to store the genomic data, indexed by id, and Elasticsearch to store the location and gene indices of the variants.
Great write-up, Linas!
I'm fuzzy on the formula for the size of the vertex and edge tables on page 6. It'd be great if you added an explanation to make it clearer.
With regard to indexing, the benefit of using graph DBs for partial indices is clear. But I have one question about the current AtomSpace design. In your example, you represent the departments as "privileged vertices" and connect them to their respective employee vertices. In the current AtomSpace, there is a TypeIndex which is represented using a hash table (std::unordered_multimap, to be exact). Why not represent the types using vertices and connect every other atom to the type vertex it belongs to? Like you suggested above, something like (MemberLink (Concept "Uniprot:12233") (Concept "ProteinNode")). This will lead to some type vertices becoming "supernodes", in that a single vertex will be connected to many other vertices, perhaps millions. That causes a performance issue with naive graph DB representations, because the outgoing set of the type vertices will be very large. Titan solves this with the concept of a unidirectional edge, where only the destination vertex is aware of its connection to the supernode. But looking at the hypergraph tables in the document, this problem is already solved. So why not use this approach for the TypeIndex?
Re: using MemberLinks as a way of indexing by name
> Where do you think all of that RAM usage is going? Where do you think indexes are kept? The MemberLink maintains indexes in the incoming/outgoing sets, those are just c++ std::set and std::vector, respectively. If you create some other index, you are just moving around where the RAM is being used. You're talking about shifting around the internal representation; you are not proposing anything that will actually decrease RAM usage.
Correct me if I'm wrong, but won't using std::unordered_multimap<string, Handle> use less RAM than creating new ConceptNodes and MemberLinks for indexing?
Also, how about adding an API to the atomspace so that we can use external index stores like ElasticSearch/Apache Solr? This is especially useful if we want to do full-text search on atom names. I can help with this integration if you think the idea is worthwhile.