Need help implementing a TinkerPop-enabled graph DB

Dong Dai

Jul 14, 2016, 10:21:31 AM

to Gremlin-users

Hi all,

I am planning to implement a TinkerPop-enabled distributed graph database. It has several design points, and I am wondering whether TinkerPop can support them well.

1) this db may place edges from the same vertex on different servers (i.e., it can do both vertex-cut and edge-cut)

2) this db may move edges and vertices to a different server dynamically (i.e., the locations of vertices and edges are not fixed)

3) this db may run on multiple servers, deployed like Titan with embedded Cassandra.

As I want to run graph traversal using Gremlin on this db, do you think Gremlin can run efficiently on it? I mean it should recognize that the edges of some vertices are placed into multiple servers and send traversal to them accordingly.

Also, it should be able to run on the server side, like having multiple Gremlin engines running on multiple servers?

If it is possible, could you please let me know whether there is any template I should learn and follow?

Thanks very much! Any comment will be appreciated.

- Dong

Marko Rodriguez

Jul 14, 2016, 11:17:12 AM

to gremli...@googlegroups.com

Hello,

I am planning to implement a TinkerPop-enabled distributed graph database. It has several design points, and I am wondering whether TinkerPop can support them well.

1) this db may place edges from the same vertex on different servers (i.e., it can do both vertex-cut and edge-cut)

TinkerPop OLTP will get a reference to a vertex. When you call vertex.edges(), it is the role of the underlying graph database to fetch all the edges accordingly. Note that vertex.edges() returns an iterator, so you can stream from other partitions. Moreover, restricting the call to a label, e.g. vertex.edges(Direction.OUT, "knows"), will only iterate “knows” edges, so if you aren’t partitioning on “knows” then you have more information to be smart about where to fetch your cuts from.
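To make the streaming point concrete, here is a minimal sketch of an iterator that walks the partitions of a vertex-cut one at a time and skips any partition that cannot hold the requested label. All the types here (EdgeRef, EdgePartition, InMemoryPartition, PartitionedEdgeIterator) are hypothetical stand-ins invented for illustration, not TinkerPop classes; a real provider would return something like this from its own Vertex.edges() implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for an edge: just a label and a target vertex id.
final class EdgeRef {
    final String label;
    final long targetId;
    EdgeRef(String label, long targetId) { this.label = label; this.targetId = targetId; }
}

// Hypothetical view of one server's share of a single vertex's edges.
interface EdgePartition {
    boolean mayContain(String label);        // assumed cheap: metadata only, no I/O
    Iterator<EdgeRef> fetch(String label);   // assumed expensive: a remote call
}

// In-memory partition used only to exercise the sketch; counts fetch calls.
final class InMemoryPartition implements EdgePartition {
    private final List<EdgeRef> edges;
    int fetchCount = 0;
    InMemoryPartition(List<EdgeRef> edges) { this.edges = edges; }
    public boolean mayContain(String label) {
        for (EdgeRef e : edges) if (e.label.equals(label)) return true;
        return false;
    }
    public Iterator<EdgeRef> fetch(String label) {
        fetchCount++;
        List<EdgeRef> out = new ArrayList<EdgeRef>();
        for (EdgeRef e : edges) if (e.label.equals(label)) out.add(e);
        return out.iterator();
    }
}

// Lazily chains per-partition iterators so edges stream to the caller;
// a partition is contacted only when the previous one is exhausted.
final class PartitionedEdgeIterator implements Iterator<EdgeRef> {
    private final Iterator<EdgePartition> partitions;
    private final String label;
    private Iterator<EdgeRef> current = Collections.emptyIterator();

    PartitionedEdgeIterator(List<EdgePartition> partitions, String label) {
        this.partitions = partitions.iterator();
        this.label = label;
    }

    @Override
    public boolean hasNext() {
        while (!current.hasNext() && partitions.hasNext()) {
            EdgePartition p = partitions.next();
            if (p.mayContain(label)) {   // skip servers that cannot hold this label
                current = p.fetch(label);
            }
        }
        return current.hasNext();
    }

    @Override
    public EdgeRef next() {
        if (!hasNext()) throw new NoSuchElementException();
        return current.next();
    }

    public static void main(String[] args) {
        InMemoryPartition a = new InMemoryPartition(Arrays.asList(new EdgeRef("knows", 2L)));
        InMemoryPartition b = new InMemoryPartition(Arrays.asList(new EdgeRef("created", 3L)));
        Iterator<EdgeRef> it =
            new PartitionedEdgeIterator(Arrays.<EdgePartition>asList(a, b), "knows");
        while (it.hasNext()) System.out.println(it.next().targetId);
    }
}
```

The design choice worth noting is that hasNext() does the partition hopping, so a caller that stops early (for example after a limit) never triggers fetches on the remaining servers.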

TinkerPop OLAP currently only supports “star graphs.” That is, a vertex together with all its incident edges is the atomic unit of computing. In a few releases, we will be introducing PartitionedVertex to OLAP so that the “star graph” can be split and the underlying OLAP engine will take care of everything accordingly. This also means that certain traversals won’t compile against partitioned vertices, because local traversals (e.g. by(outE().count())) won’t get accurate counts there. We plan to provide an API so that graph providers can say which vertices (by label) and which edges (by label) are partitioned, in order to only reject local traversals on vertices that are known to be partitioned.
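The counting problem can be shown with a few lines of plain Java. This is only an illustration of why a per-partition ("local") count differs from the true count once a vertex is split; the class and method names are hypothetical and this is not the PartitionedVertex API, which is described above only as a plan.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model: element i is the number of out-edges of ONE vertex
// that happen to be stored on partition i.
final class PartitionedCount {

    // What a local traversal sees if it runs on a single partition only.
    static long localCount(List<Integer> edgesPerPartition, int partitionIndex) {
        return edgesPerPartition.get(partitionIndex);
    }

    // The accurate answer requires combining the partial counts of all partitions.
    static long globalCount(List<Integer> edgesPerPartition) {
        long total = 0;
        for (int n : edgesPerPartition) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> edgesPerPartition = Arrays.asList(3, 0, 2);
        System.out.println("local count on partition 0: " + localCount(edgesPerPartition, 0));
        System.out.println("global count: " + globalCount(edgesPerPartition));
    }
}
```

With an unsplit star graph the two numbers coincide, which is why local traversals are safe there and only need to be rejected for vertices known to be partitioned.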

2) this db may move edges and vertices to a different server dynamically (i.e., the locations of vertices and edges are not fixed)

That is fine. That has nothing to do with TinkerPop. That is all up to you.

3) this db may run on multiple servers, deployed like Titan with embedded Cassandra.

Again, that is fine. This has nothing to do with TinkerPop. However, note that GremlinServer has some neat features whereby it can learn where the data is so it can talk to the right machine. Moreover, (in the future) traversal routing will allow for the traversal to be sent around the cluster gathering data so the data isn’t pulled to a single machine. The traversals will walk the machines of the cluster in much the same way traversers walk the vertices of the graph.

As I want to run graph traversal using Gremlin on this db, do you think Gremlin can run efficiently on it? I mean it should recognize that the edges of some vertices are placed into multiple servers and send traversal to them accordingly.
Also, it should be able to run on the server side, like having multiple Gremlin engines running on multiple servers?

TinkerPop at the graph provider level is an API of Vertex, Edge, Graph, etc. At the Gremlin level, it’s a compiler that compiles Gremlin to talk to Vertex, Edge, Graph, etc. Gremlin has various traversal strategies that it applies at compile time to make things efficient, but ultimately, it’s about the provider knowing the best way to fetch data. As such, providers typically implement their own traversal strategies (compiler rules) that take a Traversal and mutate it to take advantage of the underlying graph database’s unique features, e.g. global indices, vertex-centric indices, partitioned vertices, push-down predicates, etc. The interplay between TinkerPop and the graph database is at the traversal strategy level.
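As a rough picture of what "a compiler rule that mutates a traversal" means, the sketch below rewrites a toy list of steps: a full scan followed by a filter on an indexed key is folded into a single index lookup. PlanStep and IndexPushDownRule are hypothetical names made up for this example and deliberately do not use TinkerPop's Step or TraversalStrategy interfaces; a real provider strategy would operate on the actual Traversal object in the same spirit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy step: kind is "scan", "has", "indexLookup", "out", ...
final class PlanStep {
    final String kind;
    final String key;    // null for steps without a predicate
    final String value;  // null for steps without a predicate
    PlanStep(String kind, String key, String value) {
        this.kind = kind;
        this.key = key;
        this.value = value;
    }
    @Override
    public String toString() {
        return key == null ? kind : kind + "(" + key + "=" + value + ")";
    }
}

// Toy "strategy": folds scan + has(indexedKey) into one indexLookup step.
final class IndexPushDownRule {
    private final Set<String> indexedKeys;

    IndexPushDownRule(Set<String> indexedKeys) {
        this.indexedKeys = indexedKeys;
    }

    List<PlanStep> apply(List<PlanStep> steps) {
        List<PlanStep> out = new ArrayList<PlanStep>();
        int i = 0;
        while (i < steps.size()) {
            PlanStep s = steps.get(i);
            boolean foldable = s.kind.equals("scan")
                && i + 1 < steps.size()
                && steps.get(i + 1).kind.equals("has")
                && indexedKeys.contains(steps.get(i + 1).key);
            if (foldable) {
                PlanStep has = steps.get(i + 1);
                // Push the predicate down into the fetch so the store can use its index.
                out.add(new PlanStep("indexLookup", has.key, has.value));
                i += 2;
            } else {
                out.add(s);   // leave every other step untouched
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        IndexPushDownRule rule = new IndexPushDownRule(new HashSet<String>(Arrays.asList("name")));
        List<PlanStep> plan = Arrays.asList(
            new PlanStep("scan", null, null),
            new PlanStep("has", "name", "marko"),
            new PlanStep("out", null, null));
        System.out.println(rule.apply(plan));
    }
}
```

The rule leaves the plan alone when the key is not indexed, which mirrors how a strategy should only rewrite what the underlying store can actually serve faster.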

If it is possible, could you please let me know whether there is any template I should learn and follow?

You should read the docs and study the reference implementation — TinkerGraph.

http://tinkerpop.apache.org/docs/current/reference/

https://github.com/apache/tinkerpop/tree/master/tinkergraph-gremlin

http://tinkerpop.apache.org/providers.html

Then from there, you could study Titan to see how it’s smart about distributed data and indices. Again, it’s all about Titan providing TraversalStrategy implementations.

Good luck,

Marko.
