The Graph Traversal Machine: Close Encounters of the Fourth Kind


Amirouche Boubekki

Nov 2, 2018, 4:20:10 PM
to opencog

Linas Vepstas

Nov 2, 2018, 6:07:02 PM
to opencog
I skimmed it in 30 seconds. We should probably/almost certainly build a TinkerPop-style interface onto the AtomSpace. I think it's very, very doable, and I think many valuable lessons could be learned. It would make the AtomSpace better. It would be an excellent project for a strong developer to morph into a master developer.

At the moment, I have no clue at all what sort of interesting algos such an API allows. What could we do, that we can't already do? What would be the killer app that makes it important to have this API?  A skim of the paper doesn't say. 

Let me rephrase: the atomspace does not yet have a tinkerpop-like API, because nothing we've done so far needed this API.  Is this because we haven't been imaginative enough? Too busy doing other things? What really cool thing could we build with this?

Amirouche -- this means you. Care to take a stab at actually implementing this? Care to recruit and convince someone else to do this? Or at least, for starters, care to take a stab at answering the earlier questions? 'Cause it's always worth knowing why something should be done, before going too far down the road of doing it...

--linas
 

On Fri, Nov 2, 2018 at 3:20 PM Amirouche Boubekki <amirouche...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.
To post to this group, send email to ope...@googlegroups.com.
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/09200c55-4f0d-4197-8835-a21859a11442%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
cassette tapes - analog TV - film cameras - you

Linas Vepstas

Nov 3, 2018, 7:02:52 PM
to opencog
So:
Here's a quick, unstructured, randomized review of TinkerPop vs. the AtomSpace.

* There are many similarities. For example, both TinkerPop and the AtomSpace have a key-value store per vertex/edge. TinkerPop edges have valency 2 (one vertex at each end of the edge) and are untyped. AtomSpace edges have any valency and are typed. (An AtomSpace edge, aka link, can have two vertices in it ... or 1 or 3 or 0 or 23... Also, a link can contain links. The AtomSpace stores hypergraphs.)

* Tinkerpop4, when it's available, will be hostable by "any" suitable database platform.  The AtomSpace has already played in this area: an unsuccessful hosting on memcachedb, a successful hosting on postgres, an unsuccessful one on hypertable, an unsuccessful one on neo4j.  The failure reasons are highly variable: memcachedb was too slow. The hypertable developer fundamentally misunderstood the problem.  neo4j was too slow (had too large a communications overhead).

* Both the atomspace and tinkerpop4 benefit from underlying DB technology: Postgres is highly scalable, yay! Someday, Atomspace will have an Apache Ignite backend, which is also highly scalable. Yay!

* TinkerPop has a MUCH larger development community than the AtomSpace, which means that they did stuff long ago that is still in the planning stages for us. For example, "the property graph model", which the AtomSpace needs but doesn't have. (We have real customers for this: the AGI-BIO guys want this! No one is working on it!) (So, for example, key-value pairs are permission-based; AGI-BIO wants to overload values, based on the permissions that a given user has, so e.g. there is a read-only version of genomic data, and multiple read-write layers on top of it, that different researchers update. Someone needs to work on this!)

* The Gremlin traversal language is almost exactly like an AtomSpace pattern with a single clause. There is no concept of a multi-clause traversal in Gremlin.
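
The data-model difference described in the first bullet can be sketched in a few lines of Python. This is only a toy illustration, not the AtomSpace API: the class names `Node` and `Link`, the helper functions, and the type strings such as "Inheritance" are assumptions chosen for the example. It shows a typed link whose valency can be 0, 2, or anything else, and a link that contains another link.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Node:
    """A vertex: a typed, named atom (hypothetical toy class)."""
    type: str
    name: str


@dataclass(frozen=True)
class Link:
    """A typed hyperedge holding any number of atoms, including other links."""
    type: str
    outgoing: Tuple[Union[Node, "Link"], ...] = ()


def arity(link: Link) -> int:
    """Number of atoms directly contained in the link (its valency)."""
    return len(link.outgoing)


def depth(atom: Union[Node, Link]) -> int:
    """Nesting depth: 0 for a node, 1 + the deepest member for a link."""
    if isinstance(atom, Node):
        return 0
    return 1 + max((depth(member) for member in atom.outgoing), default=0)


cat = Node("Concept", "cat")
animal = Node("Concept", "animal")

# Valency 2: the only shape a plain binary graph edge can take.
binary = Link("Inheritance", (cat, animal))

# Valency 0 is also a legal link in a hypergraph.
empty = Link("Set")

# A link containing a link: something a binary-edge model cannot express directly.
nested = Link(
    "Evaluation",
    (Node("Predicate", "believes"), Link("List", (Node("Concept", "alice"), binary))),
)
```

In a binary-edge model, the `nested` structure would have to be reified into extra vertices; here it is just another link.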

After this, the differences between the two compound and diverge. 

* The Gremlin traversal language can be compiled to bytecode, and shipped off to be executed remotely. Could we do something similar? Yeah, I guess. But it's never been the goal of the AtomSpace to be a generic wrapper on top of existing OLAP/OLTP systems, so we've never given this much thought.

My biggest question/frustration:

How can we increase the user-base for the AtomSpace? It's kind of frustrating that the adoption rate for the AtomSpace remains low, even as graph databases become ever more popular.  It feels like we're getting left in the dust, and yet, whenever I look around, it feels like we're two steps ahead of everyone else. So I can't figure out if we're winning or losing. Increasing adoption would really really help...

-- Linas


 

Linas Vepstas

Nov 3, 2018, 9:55:43 PM
to opencog
I added https://github.com/opencog/atomspace/issues/1893 to describe more VirtualLinks that would make it easier to create a Gremlin-like API to the pattern matcher. Again: a Gremlin traversal is like a single-clause pattern match. So, in that sense, we can already run Gremlin traversals. It's just that, currently, writing them in the style that they write them is awkward. Adding some new pretty VirtualLinks would make it easier/less-awkward/prettier.
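
To make the "single-clause pattern match" idea concrete, here is a minimal sketch of a toy matcher in Python. It is neither the AtomSpace pattern matcher nor Gremlin: the fact tuples, the `?x` variable convention, and the function names `match_clause` and `solve` are all assumptions invented for illustration. A one-clause query behaves like a simple traversal step; a two-clause query shares a variable between clauses.

```python
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

Fact = Tuple[str, ...]      # (link_type, arg1, arg2, ...)
Clause = Tuple[str, ...]    # like a Fact, but terms starting with "?" are variables
Bindings = Dict[str, str]

# Hypothetical toy data set.
FACTS: List[Fact] = [
    ("Inheritance", "cat", "mammal"),
    ("Inheritance", "dog", "mammal"),
    ("Inheritance", "mammal", "animal"),
    ("Eats", "cat", "mouse"),
]


def is_var(term: str) -> bool:
    return term.startswith("?")


def match_clause(clause: Clause, fact: Fact, bindings: Bindings) -> Optional[Bindings]:
    """Try to unify one clause with one fact, extending the given bindings."""
    if len(clause) != len(fact):
        return None
    extended = dict(bindings)
    for term, value in zip(clause, fact):
        if is_var(term):
            # Bind the variable, or fail if it is already bound to something else.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def solve(clauses: Sequence[Clause], bindings: Optional[Bindings] = None) -> Iterator[Bindings]:
    """Yield every binding that satisfies all clauses (depth-first search)."""
    bindings = {} if bindings is None else bindings
    if not clauses:
        yield bindings
        return
    first, rest = clauses[0], clauses[1:]
    for fact in FACTS:
        extended = match_clause(first, fact, bindings)
        if extended is not None:
            yield from solve(rest, extended)


# Single clause: roughly "start at mammal, follow Inheritance edges backwards".
single = list(solve([("Inheritance", "?x", "mammal")]))

# Two clauses sharing ?x: "things that inherit from mammal AND eat something".
multi = list(solve([("Inheritance", "?x", "mammal"), ("Eats", "?x", "?prey")]))
```

Because `solve` is a generator, results come out one at a time rather than as a completed set.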

--linas

Roman Treutlein

Nov 5, 2018, 6:20:48 AM
to opencog
I think if we want to increase the user-base for the AtomSpace we need to market it more/differently. I just tried to find the AtomSpace by googling combinations of {graph, hypergraph, database, comparison, ...} and I can't. Googling "atomspace graph database" only has 4000 results whereas "neo4j graph database" has 466000 results. So it doesn't really matter how advanced the AtomSpace is if nobody can find it.

And I don't think adding more features like a TinkerPop-like API will help in that regard. So far we are a bunch of software engineers working on the AtomSpace so we can use it for OpenCog, and I don't know of anybody who uses the AtomSpace outside of OpenCog. So to get more users, we would need people who work on and try to push/sell the AtomSpace as an independent product with its own website and stuff. No idea who would/could do something like that. But this is my opinion on the topic.

Jeff Thompson

Nov 5, 2018, 9:53:45 AM
to ope...@googlegroups.com
If you keep showing the obviously fake chatbot vaudeville show of Sophia as the demonstration of AtomSpace technology, then what do you expect?

Ben Goertzel

Nov 5, 2018, 11:30:22 AM
to opencog
Hmm.... Sophia is no AGI at this point, but she's also no more fake
than Alexa or Siri or Google Assistant .... Creating chatbots is a
perfectly valid (and commercially valuable) use of a hypergraph
database and associated pattern matching tools... though not as
AGI-relevant as, say, Nil's current work on inference meta-learning or
Linas's recent work on unsupervised grammar induction...

-- Ben G



--
Ben Goertzel, PhD
http://goertzel.org

"The dewdrop world / Is the dewdrop world / And yet, and yet …" --
Kobayashi Issa

Linas Vepstas

Nov 5, 2018, 4:53:17 PM
to opencog
Roman, yes. Having its own website might actually be a very good start. -- linas



Linas Vepstas

Nov 5, 2018, 4:58:42 PM
to opencog
Arghhhh. That's absurd on so many different levels. I don't even know where to begin. Jeff, I hope that you're trying to be funny or provocative, and not serious...

--linas



Ben Goertzel

Nov 5, 2018, 7:46:48 PM
to opencog
IMO the main thing that would get OpenCog more adoption would be a
really whizzy, easy-to-use experimentation interface like TensorFlow
has...

Putting something like that together is, however, a lot of work and
draws on different skills and thought patterns than are currently
prevalent in the OpenCog community...

ben

Linas Vepstas

Nov 5, 2018, 8:21:19 PM
to opencog
But it's a chicken-and-egg problem: respectable open-source projects have respectable, even slick websites. The AtomSpace doesn't. To the outsider, that implies it's somehow not ready, that it's still some garage-tinkering project. If it doesn't get out of the garage, the user-base doesn't grow, and it's a feedback cycle; without a user base, there's not enough headcount.

--linas



Matt Chapman

Nov 5, 2018, 8:40:23 PM
to ope...@googlegroups.com
I don't think it's a chicken-egg problem. Developers don't choose open-source projects to contribute to because they are respectable or have fancy websites. They choose them because they scratch an itch the developer has. What itch does the Atomspace scratch? ...besides Ben's itch to have an in-memory data store tightly-coupled to the CogPrime Architecture? (or whatever it's called these days...)

If we're actually ready to promote Atomspace as a stand-alone graph db, I suggest marketing messages of the form, "If you like Neo4j / ArangoDB / JanusGraph, but it's too _____ for your use case, and/or your data records are  ______ than average, you should consider using Atomspace." I suspect the blanks are, "slow" and "smaller & more connected," but I haven't been paying much attention the last couple years, so I could be wrong.

All the Best,

Matt

--
Standard Disclaimer:
Please interpret brevity as me valuing your time, and not as any negative intention.


Linas Vepstas

Nov 5, 2018, 10:13:37 PM
to opencog
Matt,

Sigh. Joel tried to sell the atomspace as a "graph DB" about ten years ago, and everybody was like "graph DB? wtf is that? how can a graph be a DB? that doesn't even make sense." ... time marches on, and now they are a dime-a-dozen.  So clearly, we missed the boat on that one.

What's the AtomSpace today? Well, it's got features no one else does. The pattern matcher makes Gremlin/TinkerPop look like a child's toy. The new "matrix" slices and dices extremely large, extremely sparse graphs.  No one else has anything like it.

Could we compete with arangoDB, neo4j, janus graph? Don't be absurd. We could be a layer on top of those, providing services and features that those don't provide.  The atomspace is not like those, it offers a completely different class of features and functions.

Is the AtomSpace ready for the big leagues? Not really-ish. It's probably big, bloated and slow - no one is investing time in making it slim and fast. There's no easy-to-grok API; it's got more bells-n-whistles than you can shake a stick at, and this makes people's heads spin; they have trouble understanding what it is.  How hard is it to back-end onto other people's datasets? Very. Why? Because we've given this problem exactly zero thought. Brainstorming would yield more missing basic features that ordinary people want; meanwhile, it has too many advanced features that make people dizzy. How can we move forward?

But no one is going to work on the missing features, fix the obvious deficiencies, if no one is using it. And no one is using it because it's obscure. It's obscure in part because it's hard to understand -- we can use you as exhibit one -- you're a long-time regular, but from your email, it's clear you don't know what it is.  How can I begin to fix something as simple as that?  I'm figuring out that a good website, that explains in bullet points and maybe a white paper or two, what it actually is -- that would be a good start.

--linas



Amirouche Boubekki

Nov 6, 2018, 6:41:36 PM
to ope...@googlegroups.com
I will respond in a single mail to all the topics you raised in the other emails.

a) Thanks for the kind words. Even if I could implement TinkerPop4 in C++ and all the requirements listed in the README, I would not do it, because I am sure the future is not C++ (nor Java).

b) From my opinionated point of view, the AtomSpace is not the kind of database I want to work with to develop an intelligent system that should handle the mixed workload of both analytics and user-facing features. The requirement to load everything into RAM is a no-go for me. What do you do when there is no more RAM? I mean, you can have 12 TB or maybe 100 TB, but you cannot store Common Crawl.

c) Like I said previously, I think what you need is foundationdb

d) tinkerpop4 was surprising because they mention several times RDF which is the formalism I chose to work with.

e) I think it's not a good idea to mix ACLs with data; you end up with very confusing machinery, and ACLs are a very complex topic. No modern database embeds ACL machinery; they leave that to upper layers.

f) triple store for the win

g) Gremlin is what inspired my work. I will look into TinkerPop4 closely, even if I don't buy the compiler thing yet. Basically, I don't believe the Gremlin runtime/compiler can automagically guess what indices to use before knowing about the queries. Just take a simple TF-IDF or BM25 full-text search: every time you change the formula, you need to update the indices. It is not feasible to know up front all the indices you will need, which is why I recommend a flexible storage engine like FoundationDB, where you can build a triple store (and a graph or hypergraph) and go down to the lower-level key-value storage to speed up specific queries. Even if you index all the things, like I do and like RDF data stores do, you need at some point a clever trick to handle sorted range scans (everything between 42 and 1337, in order) or more complex queries like geospatial search or 2D + time. Also, AFAIK, there is no way to index segments (to answer queries like segment overlap with segment, e.g. "give all events that happened between 2017 and 2018") in a key-value store without relying on an out-of-key-value-store data structure, which means it breaks transaction semantics and you end up with some part of the data that is eventually consistent... Maybe PostgreSQL can help in this regard.

h) Gremlin gives false hope (at least from my readings) that it will handle all the graph algorithms you need. You cannot implement an optimal A* algorithm or shortest path in Gremlin alone, and even if you could, the Gremlin engine AFAIK cannot stream results. Gremlin is mostly for neighborhood search. For deep traversal that must be guided, you must issue several small (Gremlin) queries, i.e. there is no point in a Gremlin compiler.

i) Being able to stream results is very important because, in a real-world scenario, the computation of a solution might take an infinite amount of time. It must be possible for the user to ask for partial results and to input new data to narrow the search in progress. Maybe that's what https://en.wikipedia.org/wiki/Beam_search is. Interactive / conversational (re)search.

j) https://github.com/graphaware/neo4j-nlp that can be of interest.

k) Again, regarding Gremlin: do they plan to handle bigger-than-RAM results? I am not aware of such work. AFAIK the result must be able to stay in RAM.
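
Points f) and g) can be illustrated with a minimal sketch of a triple store layered on an ordered key-value store, showing how a sorted range scan falls out of the key layout. Everything here is an assumption made for the example: `OrderedKV` is an in-memory stand-in (a sorted Python list), not FoundationDB or any real engine, and the `"spo"`/`"pos"` index tags, class names, and sample data are hypothetical. It also assumes each predicate always carries objects of one comparable type; a real store would encode keys as bytes instead.

```python
import bisect
from typing import Iterator, List, Tuple

Key = Tuple  # e.g. ("pos", predicate, object, subject)


class OrderedKV:
    """In-memory stand-in for an ordered key-value store (keys only)."""

    def __init__(self) -> None:
        self._keys: List[Key] = []

    def put(self, key: Key) -> None:
        # Keep the key list sorted and free of duplicates.
        index = bisect.bisect_left(self._keys, key)
        if index == len(self._keys) or self._keys[index] != key:
            self._keys.insert(index, key)

    def scan_from(self, start: Key) -> Iterator[Key]:
        """Lazily yield every key >= start, in sorted order."""
        index = bisect.bisect_left(self._keys, start)
        while index < len(self._keys):
            yield self._keys[index]
            index += 1


class TripleStore:
    """Stores each triple under two key orderings so different queries are prefix scans."""

    def __init__(self) -> None:
        self.kv = OrderedKV()

    def add(self, subject, predicate, obj) -> None:
        self.kv.put(("spo", subject, predicate, obj))
        self.kv.put(("pos", predicate, obj, subject))

    def objects_between(self, predicate, low, high) -> Iterator[Tuple]:
        """Yield (subject, object) with low <= object <= high, ordered by object."""
        for tag, pred, obj, subject in self.kv.scan_from(("pos", predicate, low)):
            if tag != "pos" or pred != predicate or obj > high:
                break  # walked past the requested range
            yield subject, obj


store = TripleStore()
store.add("e1", "year", 2016)
store.add("e2", "year", 2017)
store.add("e3", "year", 2018)
store.add("e4", "year", 2019)
store.add("e1", "name", "launch")

events = list(store.objects_between("year", 2017, 2018))
```

This only covers point values. The segment-overlap queries mentioned in g) need a different index structure, which is exactly the difficulty the post raises.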

All the best!


Linas Vepstas

Nov 6, 2018, 9:38:52 PM
to opencog
Hi,

Thanks for the thoughtful reply. ... but it's also clear that we have very different concerns and goals.

All the bullet-points you make are entirely valid for non-AI systems architected for commercial use. They're kind of useless if the goal is to do AI.

The only thing I'm really interested in is the AI. Nothing in your list is useful or helpful for that. So, to pick an example: you mentioned neo4j-nlp. Cute, ... sure. Read through the "feature matrix" they post there. Zero useful features.  I mean, great, if you've got a specific commercial product in mind, but useless for AI. Arguably more-than-useless; it's a distraction, a time-waste.

And perhaps this is why opencog remains in the back-waters -- no one is trying to do anything like this -- they're pursuing other goals. Which - again, that's nice if that's what you want to do, but for me, I'm interested in the science of creating a thinking machine that can actually think.

There are approximately zero tools out there for this.  I went on a hunt for such a decade ago: https://linas.org/agi.html and found .. zero tools. Here I am, ten years later, haven't updated the list, because there are still ... zero useful tools. So opencog is my best attempt to cobble together something non-useless.  Progress is slow.

-- Linas



Amirouche Boubekki

May 13, 2019, 7:47:33 AM
to opencog


On Wednesday, November 7, 2018 at 12:41:36 AM UTC+1, Amirouche Boubekki wrote:
I will respond in a single mail to all the topics you raised in the other emails.

a) thanks for the kind words, even if I could implement tinkerpop4 in c++ and all the requirements listed in the README, I would not do it, because I am sure the future is not c++ (nor Java).


c) Like I said previously, I think what you need is foundationdb
 
I just want to temper my thinking a bit, because since then I have tried to actually load
some data into FoundationDB. I hit both the maximum key size and the maximum value size. There is
a cookbook recipe for handling that case: https://apple.github.io/foundationdb/blob-java.html
but in my case it is an extra indirection, and I don't need that petabyte scale yet.
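
The general idea behind such a blob recipe, splitting a large value into chunks stored under several keys, can be sketched as follows. This is not the FoundationDB API: a plain Python `dict` stands in for the key-value store, and `CHUNK_SIZE`, `write_blob`, and `read_blob` are hypothetical names and values chosen for illustration.

```python
from typing import Dict, Tuple

# Hypothetical per-value limit; a real store documents its own limits.
CHUNK_SIZE = 10_000

Store = Dict[Tuple[str, int], bytes]


def write_blob(store: Store, name: str, data: bytes, chunk_size: int = CHUNK_SIZE) -> None:
    """Split data into chunks and store each one under the key (name, offset)."""
    for offset in range(0, len(data), chunk_size):
        store[(name, offset)] = data[offset:offset + chunk_size]


def read_blob(store: Store, name: str) -> bytes:
    """Collect all chunks for name and join them in offset order."""
    chunks = sorted(
        (key[1], value) for key, value in store.items() if key[0] == name
    )
    return b"".join(value for _, value in chunks)


kv: Store = {}
payload = bytes(range(256)) * 100          # 25,600 bytes, larger than one chunk
write_blob(kv, "document-1", payload)
restored = read_blob(kv, "document-1")
```

Reading a blob back means gathering and reassembling several keys, which is the extra indirection mentioned above.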


To be fair, there is a new storage engine in the works that should remove some
of those limitations, but there is no ETA yet.

I will continue to experiment with WiredTiger, even if I also hit some limits there: when
I try to store many small entries (20,000 keys of 150 bytes each), the database engine
deadlocks during the transaction.

I tried the same workload on LevelDB using hoply; it loads 14G without crashing. But LevelDB
does not support transactions. I know, I know that transactions are not required all the time,
but I prefer to be ready for the case where they are required. Another solution is to use RocksDB,
which is a fork of LevelDB with support for transactions.


All the best,


Amirouche ~ amz3

Amirouche Boubekki

May 13, 2019, 8:00:16 AM
to opencog
I forgot to mention that I have made bindings for GNU Guile:


Good luck!

Linas Vepstas

May 14, 2019, 2:58:47 PM
to opencog
On Mon, May 13, 2019 at 6:47 AM Amirouche Boubekki <amirouche...@gmail.com> wrote:

c) Like I said previously, I think what you need is foundationdb

That website provokes a browser error, and cannot be displayed.
 
I will continue to experiment with WiredTiger. Even if I also hit some limits related to the fact that
I try to store many small bytes 20 000 keys of 150 bytes which leads the database engine to
deadlock during the transaction.

I've said this before, so you know it already .. but ...  for the atomspace, there are tens-of-millions of atoms in a "typical" case. Each atom is approx 50 to 200-ish bytes, typical.

Take a look at grakn.ai -- it's a high-level graph database for knowledge stores.

--linas

Amirouche Boubekki

May 24, 2019, 5:34:58 PM
to opencog


On Tuesday, May 14, 2019 at 8:58:47 PM UTC+2, linas wrote:


On Mon, May 13, 2019 at 6:47 AM Amirouche Boubekki <amirouch...@gmail.com> wrote:

c) Like I said previously, I think what you need is foundationdb

That website provokes a browser error, and cannot be displayed.

Here is a link to the documentation https://apple.github.io/foundationdb/
 
 
I will continue to experiment with WiredTiger. Even if I also hit some limits related to the fact that
I try to store many small bytes 20 000 keys of 150 bytes which leads the database engine to
deadlock during the transaction.

I've said this before, so you know it already .. but ...  for the atomspace, there are tens-of-millions of atoms in a "typical" case.

Thanks!
 
Each atom is approx 50 to 200-ish bytes, typical.

What is the typical hardware people run the atomspace on? SSD? RAM? CPU?
 

Take a look at grakn.ai -- it's a high-level graph database for knowledge stores.

Thanks!!! It has been a long time since I read that much!

Linas Vepstas

May 24, 2019, 7:00:45 PM
to opencog
On Fri, May 24, 2019 at 4:35 PM Amirouche Boubekki <amirouche...@gmail.com> wrote:
 
Each atom is approx 50 to 200-ish bytes, typical.

That is what it would cost to STORE an atom to disk, I guess. It depends on what indexes you'd need to keep.  The in-RAM atoms are about 1.5 KBytes each, because they keep indexes to speed up graph traversal, viz. all the edges needed to walk the graph quickly.  The atoms themselves are small, but there's lots of extra bytes used to cache stuff to speed traversal.
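
As a rough back-of-the-envelope check of these figures, the sketch below multiplies the per-atom sizes quoted in this thread (50 to 200 bytes on disk, about 1.5 KBytes in RAM) by atom counts of 10 and 50 million. Those counts are illustrative picks within the "tens-of-millions" range mentioned earlier, 1.5 KBytes is taken as 1500 bytes, and the function name is hypothetical. The results are estimates, not measurements.

```python
def footprint_gb(atom_count: int, bytes_per_atom: int) -> float:
    """Total size in decimal gigabytes for a given atom count and per-atom size."""
    return atom_count * bytes_per_atom / 1e9


DISK_BYTES_LOW, DISK_BYTES_HIGH = 50, 200   # per-atom on-disk range quoted in the thread
RAM_BYTES = 1500                            # "about 1.5 KBytes each", taken as 1500 bytes

for atom_count in (10_000_000, 50_000_000):  # illustrative atom counts
    disk_low = footprint_gb(atom_count, DISK_BYTES_LOW)
    disk_high = footprint_gb(atom_count, DISK_BYTES_HIGH)
    ram = footprint_gb(atom_count, RAM_BYTES)
    print(f"{atom_count:>11,} atoms: disk {disk_low:.1f}-{disk_high:.1f} GB, RAM {ram:.1f} GB")
```

Under these assumptions, 10 million atoms come to roughly 0.5 to 2 GB on disk versus about 15 GB in RAM, and 50 million atoms to roughly 2.5 to 10 GB on disk versus about 75 GB in RAM.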


What is the typical hardware people run the atomspace on? SSD? RAM? CPU?

The AtomSpace itself is an in-RAM database. You can save and restore individual atoms to disk (currently via Postgres; I'm waiting for other DB drivers to get written).  Since Postgres is distributed, you can run the AtomSpace distributed (see the examples dir; it's really, really easy).  Postgres runs faster on SSD.

On my machine, when crunching the language-learning code, which typically touches up to 5-50 million atoms, I found that processing was i/o limited. With SSD disks, postgres runs at approx 80% of the rated SSD performance, and at about 50% of the SATA link speed.  (raid-mirror setup) So basically it comes close to maxing out the I/O subsystem and the I/O buses.

For smaller datasets, the AtomSpace does run on a Raspberry Pi (just fine, I'm told).

It does not run on Android; the main blocker for this is the lack of Guile for Android.  As of last week, it runs on macOS. No clue if that means it also runs on iOS. Future Chromebooks will run Linux ... so... also Windows ...

--linas
