Announce: Distributed AtomSpace proof-of-concept


Linas Vepstas

Jul 19, 2020, 4:35:05 PM
to opencog

As a "distributed atomspace" is a commonly recurring request, I thought I should put together yet another distributed-atomspace variant. This one is super-simple: easy to use, with a tiny implementation (500 lines of code).

Easy-to-use: Take a look at the two examples, here:

These examples use exactly the same API as the 3-4 other distributed atomspace backends. Some comments about those.

* The Postgres SQL backend. It's part of the core AtomSpace code, provided by default. Large, complex, bullet-proof, production-ready. Kind-of slow: there's a lot of overhead.

* The IPFS backend. Sounds great, right? I tried a very, very naive implementation, and it's surprisingly terrible. It turns out IPFS is actually "centralized", not "decentralized": to get anything done, one must build an index, and that index must fit into just one file. Whoops. So clearly, my naive design is the wrong way to go. The code "works" (passes unit tests) but is disappointing. https://github.com/opencog/atomspace-ipfs

* The OpenDHT backend. Taking the lessons learned above, I ported the same naive implementation to OpenDHT. DHT stands for "Distributed Hash Table"; in this case it's Kademlia, the same one used in bittorrent, ethereum, gnunet, and many others. Much better, but it revealed a different flaw in my naive thinking. Two problems, harder to explain. Problem #1: an Atom sitting in OpenDHT is just taking up RAM, thus competing for RAM with any local AtomSpace. Problem #2: the hash used by DHTs completely randomizes Atoms. So even if they are close to one another, e.g. (List (Concept "a") (Concept "b")), these three atoms (the two concepts and the list) will end up on different servers on opposite sides of the planet. The DHT hashing algo has no clue about the locality-of-reference that we want for the atomspace. Again, this code "works" (passes unit tests) but is disappointing. https://github.com/opencog/atomspace-dht

* So what's the right design? Well, it seems that the best bet would be to use OpenDHT to store AtomSpace indexes, but do the actual serving of atoms by "seeders". And so this is why I wrote this super-simple cogserver-based distributed atomspace.  The hope is to use it as a "seeder" https://github.com/opencog/atomspace-cog/
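The locality failure in the DHT experiment is easy to demonstrate with a toy sketch. (The bucket assignment below is a simplification -- real DHTs like Kademlia place keys by XOR distance over the hash rather than by modulo, but the randomizing effect on related keys is the same.)

```python
import hashlib

def dht_bucket(key: str, num_servers: int) -> int:
    """Place a key the way a DHT does: hash it, then map the hash to a
    server.  The hash ignores any structure or relatedness in the key."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest, "big") % num_servers

# Three closely related atoms: two concepts, and the link joining them.
atoms = ['(Concept "a")', '(Concept "b")', '(List (Concept "a") (Concept "b"))']
for atom in atoms:
    print(atom, "->", dht_bucket(atom, 1000))
```

Run it, and the three related atoms almost certainly land in three unrelated buckets; nothing in the hash knows that the List refers to the two Concepts.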

Future plans: I'm hoping that someone interested can build a high-performance server/seeder, based on the prototype here. (Really -- 500 LOC is very simple, very easy to understand, and thus easy to improve upon.) This does NOT require any special skills: if you have basic coding skills, maybe some experience with network I/O, or are willing to explore, it should be possible to build a high-performance variation thereof. So, all those people saying "I'm just an ordinary coder, how can I help?" -- well, here's your chance.
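For anyone tempted by that invitation, here is a rough sketch of the overall shape of such a server: a network loop answering store/fetch requests against an in-RAM table. (The one-line text protocol below is purely illustrative -- the actual cogserver speaks s-expressions, and a real seeder would start from the atomspace-cog code.)

```python
import socketserver
import threading

STORE = {}              # in-RAM atom table, shared by all connections
LOCK = threading.Lock()

class SeederHandler(socketserver.StreamRequestHandler):
    """One request per line: 'STORE <key> <value>' or 'FETCH <key>'."""
    def handle(self):
        for raw in self.rfile:
            parts = raw.decode().strip().split(" ", 2)
            if parts[0] == "STORE" and len(parts) == 3:
                with LOCK:
                    STORE[parts[1]] = parts[2]
                self.wfile.write(b"OK\n")
            elif parts[0] == "FETCH" and len(parts) == 2:
                with LOCK:
                    value = STORE.get(parts[1], "")
                self.wfile.write((value + "\n").encode())
            else:
                self.wfile.write(b"ERR\n")

def start_seeder(host="127.0.0.1", port=0):
    """Start the seeder on a background thread; port 0 picks a free port."""
    server = socketserver.ThreadingTCPServer((host, port), SeederHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A client then just opens a TCP connection and sends "STORE Concept:a 0.8" or "FETCH Concept:a", one request per line.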

A more difficult, more conceptual task would be figuring out how to wire up a bunch of these servers using the OpenDHT/Kademlia infrastructure. I think this is possible, but it's more cerebral, and requires thinking-work.

--linas

--
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

Lansana Camara

Jul 19, 2020, 9:57:04 PM
to ope...@googlegroups.com
Can you provide a diagram of the architecture? I'm having a hard time visualizing high-performance servers being wired up by a database.

The DHT is just meant to be a storage mechanism, right? If so, why/how is it wiring up servers? Seems like something is missing in this picture.

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA37wx%2B3SAtBS45eS6vf_XrC8rgCCdNW2HziA7xkYasCw_A%40mail.gmail.com.

Linas Vepstas

Jul 20, 2020, 2:21:52 AM
to opencog
On Sun, Jul 19, 2020 at 8:57 PM Lansana Camara <lxc...@gmail.com> wrote:
> Can you provide a diagram of the architecture? I'm having a hard time visualizing high-performance servers being wired up by a database.
>
> The DHT is just meant to be a storage mechanism, right? If so, why/how is it wiring up servers? Seems like something is missing in this picture.

I assume you are talking about Kademlia, and not the cogserver. A good place to start would be to read about Kademlia, and how, for example, bittorrent works.

--linas

Amirouche Boubekki

Jul 20, 2020, 2:45:21 AM
to opencog
Hello all,

Sorry to be the person announcing not-so-good news, but this seems to
be distributed, though not in the sense of peer-to-peer distribution.
At the very least, to be distributed in the peer-to-peer sense, it
would require a private peer-to-peer network, and that is not what
OpenDHT provides.

Again, I bump the idea of rebasing the AtomSpace on top of FoundationDB,
which is distributed in the sense that people usually mean. I would be
very grateful if/when video recordings of the conference are released,
so that I can hopefully provide more insightful feedback on the new
AtomSpace.

Best wishes



--
Amirouche ~ https://hyper.dev

Amirouche Boubekki

Jul 20, 2020, 3:01:06 AM
to opencog
On Mon, Jul 20, 2020 at 03:57, Lansana Camara <lxc...@gmail.com> wrote:
>
> Can you provide a diagram of the architecture?

A DHT backend could be used to spread the on-disk load over many
peers, but only as _cold storage_, not as a primary source of truth.

> I'm having a hard time visualizing high-performance servers being wired up by a database.

I am not sure I understand that. Every system requires some kind of
database. What could be done is to have a Single Source of Truth,
where everything is persisted forever, plus secondary sources of truth
where expert systems (high-performance services) store the data in
their favorite schema. This is how Event Sourcing systems usually
work. The whole system is eventually consistent: a given expert system
will have to wait some time before it becomes aware of new knowledge.
The advantage is that you keep all the data around in the log (the
event source), and aggregate the useful knowledge for a given system
in a fine-tuned, highly performant database for specific uses. This is
not my favorite approach, but it is workable.
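A minimal sketch of that pattern (names are illustrative): an append-only log as the single source of truth, plus a projection that replays it into its own schema, and is therefore only eventually consistent:

```python
class EventLog:
    """The single source of truth: an append-only list of events."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

class Projection:
    """A secondary source of truth: a view in its own 'favorite schema'.
    It lags the log until it replays new events (eventual consistency)."""
    def __init__(self, log):
        self.log = log
        self.state = {}
        self.cursor = 0   # how far into the log this view has read

    def catch_up(self):
        while self.cursor < len(self.log.events):
            key, value = self.log.events[self.cursor]
            self.state[key] = value
            self.cursor += 1

log = EventLog()
log.append(("Concept:a", 0.8))

view = Projection(log)
# The view is stale until it catches up -- that is the "wait" described above.
view.catch_up()
```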

> The DHT is just meant to be a storage mechanism, right? If so, why/how is it wiring up servers? Seems like something is missing in this picture.

For more info on DHTs, look at
https://github.com/amirouche/qadom/blob/master/qadom/peer.py

In particular, the routing mechanism is described in:

https://github.com/amirouche/qadom/blob/b1bf5762ee62a958f5c9e9ea991d6f7c792fbece/qadom/peer.py#L305-L321

The nearest procedure is defined at:

https://github.com/amirouche/qadom/blob/b1bf5762ee62a958f5c9e9ea991d6f7c792fbece/qadom/peer.py#L56-L62

That procedure is a hack to avoid the more involved data structure
that is implemented in kademlia:

https://github.com/bmuller/kademlia

But it works in most situations, as long as there are fewer than 1M
peers in the network.

Here is the procedure that fetches a key from the network:

https://github.com/amirouche/qadom/blob/b1bf5762ee62a958f5c9e9ea991d6f7c792fbece/qadom/peer.py#L611-L658
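The routing idea behind those links is compact enough to sketch here: Kademlia measures the distance between two IDs as their bitwise XOR, and "nearest" is just a sort under that metric. (Toy 4-bit IDs below; qadom and kademlia use large cryptographic-hash IDs.)

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia's metric: the distance between two IDs is their XOR,
    read as an integer.  Shared high-order bits mean 'close'."""
    return a ^ b

def nearest(peers, key, k=3):
    """Return the k peer IDs closest to `key` -- the core of DHT routing."""
    return sorted(peers, key=lambda peer: xor_distance(peer, key))[:k]

peers = [0b0001, 0b0010, 0b0100, 0b1000, 0b1111]
print(nearest(peers, 0b1110, k=2))  # [15, 8]: 15^14 = 1, 8^14 = 6
```

Each lookup hop asks the nearest known peers for peers even nearer to the key, converging in O(log n) hops; qadom's sort-based nearest is the same idea, minus kademlia's k-bucket tree.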


Hope this helps!

Linas Vepstas

Jul 20, 2020, 12:35:13 PM
to opencog
On Mon, Jul 20, 2020 at 1:45 AM Amirouche Boubekki <amirouche...@gmail.com> wrote:
> Hello all,
>
> Sorry to be the person announcing not-so-good news, but this seems to
> be distributed, though not in the sense of peer-to-peer distribution.
> At the very least, to be distributed in the peer-to-peer sense, it
> would require a private peer-to-peer network, and that is not what
> OpenDHT provides.

There is no reason whatsoever that OpenDHT cannot be run in private mode. If you choose the defaults, you get the public network, but you can define any kind of network and discovery process that you want.


> Again, I bump the idea of rebasing the AtomSpace on top of FoundationDB,
> which is distributed in the sense that people usually mean.

Again, I will state that, technically, this is not hard. You do not have to rebase anything at all. You just have to write the backend shim. The demo shim here: https://github.com/opencog/atomspace-cog/ is useful precisely because it is only 500 lines of code, and provides a clear, simple, small example of the steps that have to be performed to have things run on your favorite DB.

It's just not that hard to do any of this stuff. It is fantastically simpler than writing AI algorithms. The point here is that databases are a "known technology"; there are endless blog posts written about this stuff. It's not like AGI research, where pretty much no one knows anything and none of it has been done before. This is a nearly ideal task for newcomers to get engaged in: it's concrete, it's doable, it's well-defined, there are unit tests, and you know when you're done.
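To make "just write the backend shim" concrete, here is the rough shape of what a shim supplies, sketched against a dictionary stand-in for the database. (The real interface is the AtomSpace's C++ BackingStore class; the method names below only paraphrase the kind of operations it requires.)

```python
class BackendShim:
    """Illustrative sketch of a storage-backend shim: translate a handful
    of store/fetch/remove operations onto your favorite database."""
    def __init__(self):
        self._db = {}  # stand-in for Postgres, FoundationDB, a DHT, ...

    def store_atom(self, atom, values):
        """Persist an atom and its attached values."""
        self._db[atom] = values

    def fetch_atom(self, atom):
        """Retrieve an atom's values, or None if it was never stored."""
        return self._db.get(atom)

    def remove_atom(self, atom):
        """Delete an atom from storage."""
        self._db.pop(atom, None)

    def barrier(self):
        """Flush pending writes so that later fetches see them.
        A dict is synchronous, so there is nothing to do here."""
        pass
```

A dict is trivially "done" the moment you call it; the work of a real shim is mapping these same operations onto the target database's API and consistency model.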

--linas
 

Linas Vepstas

Jul 20, 2020, 12:57:36 PM
to opencog
Hi Amirouche,

On Mon, Jul 20, 2020 at 2:01 AM Amirouche Boubekki <amirouche...@gmail.com> wrote:
> On Mon, Jul 20, 2020 at 03:57, Lansana Camara <lxc...@gmail.com> wrote:
> >
> > Can you provide a diagram of the architecture?
>
> A DHT backend could be used to spread the on-disk load over many
> peers, but only as _cold storage_, not as a primary source of truth.

Um, no, I am unclear on why you are making a distinction between hot and cold storage. The current three prototypes all provide hot storage, and have no provisions at all for cold storage.


> > I'm having a hard time visualizing high-performance servers being wired up by a database.
>
> I am not sure I understand that. Every system requires some kind of
> database.

The atomspace is a database. It is an in-RAM database.
 
> What could be done is to have a Single Source of Truth

There is no single source of truth. Fundamentally, there cannot be. Some people will say "eventually consistent", but I am not aware of any reason why that is desirable. Yes, I understand that certain kinds of datasets might need to be consistent, but this should NOT be a built-in property of the database!

This is a VERY IMPORTANT point that results in a huge amount of friction and confusion because it leads people into making utterly terrible design decisions that have no rational basis.  I have worked very long and very hard to completely eliminate all concepts of "eventual consistency" and "single source of truth" from the atomspace.  These are two of its important selling-points! Don't erase or subvert them!

> where
> everything is persisted forever

No, not forever. Deletion IS a very important quality!

> and secondary sources of truth where
> expert systems (high-performance services) store the data in their
> favorite schema.

You've lost me. The atomspace is a database. It has a certain built-in schema that is (supposed to be) general enough to layer other schemas on top of. The idea is that any "favorite schema" can be easily/trivially layered on top of the atomspace. This is "mostly true" today, but there remain some ugly bits that I would like to see fixed.

> This is usually the case of Event Sourcing systems.
> The whole system is eventually consistent.

Again, no. This is NOT a desirable property of the system! It's an anti-pattern.

> That means that a given
> expert system will have to wait some time before it becomes aware of
> new knowledge. The advantage is that you keep around all the data in
> the log, the event source,

That is what the demo at https://github.com/opencog/atomspace-cog/ is. It is the "event source".
 
> and aggregate the useful knowledge for a
> given system in a fine-tuned, highly performant database for specific
> uses.

That is what the atomspace is. It is a fine-tuned, high-performance database. It's got the highest performance I have been able to squeeze out of it, and if someone thinks they know how to make it faster ... well, that would be a very interesting direction to go in. But to get faster, it would need some truly new, radically different approach, and I cannot really imagine what that would be.

> This is not my favorite approach but it is workable.

:-/

--linas

Amirouche Boubekki

Jul 20, 2020, 1:47:33 PM
to opencog
Sorry! I do not want to upset you. I will re-read the paper and go
through atomspace-cog and the original atomspace.

Again, sorry for sidetracking the conversation.

Linas Vepstas

Jul 20, 2020, 2:58:03 PM
to opencog
Urk, I'm not upset. It's just ... well, there are actually many important and subtle design decisions in all of this, and, in a certain sense, the AtomSpace actually gets many of them wrong. However, getting them right is (extremely?) difficult, and I keep hoping for, and waiting for, a discussion that sheds light on the actual issues and actually focuses on the problematic design areas.

I should be the one to apologize; I get excited and like to shout and get very demonstrative when .. certain things excite me. This ... doesn't go over well, socially.

--linas