My thoughts on the future of Apache TinkerPop.


Marko Rodriguez

Feb 26, 2019, 7:16:34 AM
to gremli...@googlegroups.com
Hello everyone,

My sabbatical gave me the opportunity to step away from the TinkerPop/Gremlin codebase and reflect on the “data space” as a whole and to think through the types of problems I want to tackle next. I have come to a bit of a crossroads and I would like to get people’s feedback on my thoughts below.

1. The Primary Benefits of TinkerPop3

TinkerPop3 is a massive body of code that is light years ahead of TinkerPop2 and TinkerPop1. I am thankful that DataStax gave me the opportunity to focus full-time on TinkerPop3 for three straight years. Moreover, I am beyond ecstatic that I was able to work day in and day out with Stephen and Kuppitz. Our collaborations have birthed some beautiful ideas. A technical write-up of the key features of TinkerPop3 is provided at https://arxiv.org/abs/1508.03843. Of particular import are the following developments:

a. The Gremlin virtual machine
- the idea that there is a language agnostic bytecode that any query/programming language can compile to.
- the idea that the virtual machine can interact with any graph database.
- the idea that the virtual machine can be executed by any data processor.
b. The Gremlin language
- a delightfully expressive, self-consistent fluent-style query language.
- the idea that Gremlin can be embedded in (hosted by) any programming language that supports function concatenation and nesting.
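
To make the "function concatenation and nesting" requirement concrete, here is a minimal Python sketch of a fluent builder that records steps as bytecode instead of executing them. The `Traversal` class and its tuple-based instruction format are hypothetical illustrations, not the TinkerPop bytecode specification; the trailing underscore in `is_` sidesteps Python's reserved words, much as gremlin-python does.

```python
class Traversal:
    """Hypothetical builder: records each step as an (operator, arguments) pair."""

    def __init__(self, bytecode=None):
        self.bytecode = list(bytecode or [])

    def __getattr__(self, name):
        # Only called for attributes that do not exist, i.e. step names.
        if name.startswith("_"):
            raise AttributeError(name)
        operator = name.rstrip("_")  # is_ -> is, in_ -> in

        def step(*args):
            # Nesting: an anonymous traversal argument is stored as its own bytecode.
            encoded = [a.bytecode if isinstance(a, Traversal) else a for a in args]
            # Concatenation: return a new traversal with one more instruction.
            return Traversal(self.bytecode + [(operator, encoded)])

        return step


g = Traversal()
query = g.V().out("knows").where(Traversal().values("age").is_(30))
print(query.bytecode)
```

Nothing in the builder knows about graphs or about Python-specific execution; any host language that can chain and nest calls could emit the same instruction list.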

2. The Primary Benefits of TinkerPop4

When writing the TinkerPop4 paper with Stephen at the end of last year (https://zenodo.org/record/1476234), two key proposals were made:

1. Remove the structure API: graph providers simply need to implement custom V(), out(), property(), etc. steps. 
- Thus, there will be no more graph.vertex(), vertex.outEdges(), edge.properties(), etc.
- The only way to interact with the graph is via Gremlin.
2. Easily support any data processor: the OLTP/OLAP distinction will blur as we make it easier for other data processors to integrate with TinkerPop.
- Example data processors include Akka, Kafka, Flink, Spark, Apex, JavaRX, Storm, etc.
- Gremlin is a data flow language and any data flow/stream processor should be able to naturally execute it. 

3. Data Agnosticism in TinkerPop4

In this section, I want to discuss a potential future that is a radical re-thinking of TinkerPop4 and, ultimately, what Apache TinkerPop could mean to the data community as a whole.

One late fall night I was circumnavigating Isla Espiritu Santo and it dawned on me that Gremlin has a natural algebraic representation. With further thought, I realized that this algebraic structure is a ring. With even further thought, I realized that this ring has nothing to do with graphs, but in fact, is data structure agnostic. The ring simply describes how data flows through functions. This swath of ideas led to the development of the stream ring theory: https://zenodo.org/record/2565243. The article’s algebra nicely describes Gremlin, but interestingly enough, the paper does not discuss “graphs.” Since writing it, I have considered this paper “the death of Gremlin” and “the birth of Gremlin.”

Since day 1, TinkerPop has been focused on providing the graph community a provider agnostic query language. However, I no longer see “graph” as the most important aspect of TinkerPop. For instance, there are very few steps in the Gremlin language that are graph specific. These include: V(), out(), in(), outE(), property(), inV(), etc. The other steps in Gremlin are data structure agnostic. For example: select(), where(), match(), as(), sack(), repeat(), project(), is(), math(), choose(), coalesce(), group(), etc.

Now, after the development of the stream ring algebra, I believe that Gremlin is poised to break out of its graph shell and become a universal query language and virtual machine that supports:

1. Any query language: any query language can compile to its bytecode.
2. Any data storage system: any data structure can flow through its steps.
3. Any data processor: any message-passing/stream-based system can integrate with it.

4. Gremlin Beyond Graph

There are numerous stream processing frameworks in existence today. Most of their APIs are similar to Gremlin in that they support the map/filter/flatmap-fluent style. 


From what I can gather, Gremlin is much more expressive (supporting variables, branching, looping, nesting, pattern matching, etc.). Moreover, Gremlin has an algebraically sound compiler and can be embedded into most any programming language. It is these aspects together that take Gremlin away from being just a “fluent API” to being a “Turing Complete query language.”

The focus of TinkerPop has been on graphs (databases) and I believe, to our detriment thus far, we have ignored the stream community and the awesome technologies they bring to the table. TinkerPop3 only supports Java iterators (OLTP) and Spark (OLAP). If we tap into these other stream processors:

1. The stream community gets a powerful, expressive fluent query language.
2. The graph community can seamlessly leverage more data processors.
3. The data community, in general, can Gremlin query any data — not just graphs.

Thus, I believe that Gremlin, in TinkerPop4, should be broken up into language subsets:

1. gremlin-core: select(), as(), match(), where(), is(), project(), group(), fold(), etc.
2. gremlin-graph: V(), outE(), in(), property(), etc.
3. gremlin-relational: R(), join(), etc.
4. gremlin-document: D(), etc.
…?

If you are working with graphs, then you “import” gremlin-core and gremlin-graph and off you go. If you are pulling data from a relational database and processing that data to then put it into a graph database, import gremlin-core, gremlin-relational, and gremlin-graph. Finally, consider Josh Shinavier’s recent work on the categories of graph (https://www.slideshare.net/joshsh/a-graph-is-a-graph-is-a-graph-equivalence-transformation-and-composition-of-graph-data-models-129403012). He is staged to generalize these ideas to the categories of data. What is a vertex? A map with literal property values and nested list/edge elements. What is a relational database row? A map. What is a document? A nested map with literal, map, and list elements. Data is data is data. There is little distinction between these data structures. In the end, it’s all just literals, lists, and maps.
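
As a rough illustration of "it's all just literals, lists, and maps," the Python sketch below models a vertex, a row, and a document as plain dictionaries and runs one structure-agnostic step over all three. The sample records and the `values` helper are made up for this example and do not come from any TinkerPop API.

```python
# A vertex: a map with literal properties and a nested list of edge maps.
vertex = {"id": 1, "label": "person", "name": "marko",
          "outE": [{"label": "knows", "inV": 2}]}

# A relational row: a flat map of column name to literal.
row = {"ssn": "123", "name": "marko", "country": "US"}

# A document: a nested map with literal, map, and list elements.
document = {"name": "marko", "address": {"country": "US"},
            "tags": ["graph", "stream"]}


def values(key):
    """Return a step that maps each incoming map to its value for `key`."""
    def step(stream):
        for item in stream:
            if key in item:
                yield item[key]
    return step


# The same step works on all three, because each is just a map.
names = list(values("name")([vertex, row, document]))
print(names)
```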

5. The Components of TinkerPop4

 I propose TinkerPop4 be a complete rewrite of TinkerPop3. The components of this new body of code would include:

1. Gremlin language and bytecode specifications.
- gremlin-core, gremlin-graph, gremlin-document, gremlin-files, …
2. Bytecode strategies for compiling and optimizing bytecode.
- gremlin-core has its strategies.
- gremlin-graph extends it with graph specific strategies.
- data system providers extend it with database/storage specific optimizations.
3. Gremlin traversal machine that is designed for any processor.
- kafka, spark, flink, storm, javarx, apex, …
- bytecode comes in, a stream topology is created, executed, and results are streamed back.
4. A binary serialization format that is data structure agnostic.
- gremlin-graph would have graph specific serialization extensions.
- gremlin-document would have document specific serialization extensions.
- etc. … or maybe it’s all just maps, lists, and literals that are called “vertices,” “edges,” “documents,” and “rows” … easy.
5. A simple I/O server for sending Gremlin queries to the virtual machine and streaming back results to the user.
6. A Gremlin REPL console for terminal control.

And nothing else. That’s it. Gremlin in, results back.
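
The "bytecode comes in, a stream topology is created, executed, and results are streamed back" loop from component 3 can be sketched in a few lines of Python. This is only a toy under stated assumptions: the `STEPS` table, the tuple-based bytecode, and the generator "topology" are hypothetical stand-ins for whatever a real traversal machine would define.

```python
from itertools import islice

# Hypothetical step library: each entry turns step arguments into a
# stream -> stream function.
STEPS = {
    "values": lambda key: lambda stream: (item[key] for item in stream if key in item),
    "has": lambda key, value: lambda stream: (item for item in stream if item.get(key) == value),
    "limit": lambda n: lambda stream: islice(stream, n),
}


def compile_bytecode(bytecode):
    """Turn a list of (operator, arguments) pairs into one executable pipeline."""
    functions = [STEPS[operator](*args) for operator, args in bytecode]

    def run(source):
        stream = iter(source)
        for function in functions:
            stream = function(stream)  # chain the steps into a lazy topology
        return stream

    return run


people = [{"name": "marko", "age": 29},
          {"name": "vadas", "age": 27},
          {"name": "josh", "age": 32}]

pipeline = compile_bytecode([("values", ["name"]), ("limit", [2])])
print(list(pipeline(people)))
```

Swapping the generator chaining in `run` for a different execution strategy would leave the bytecode untouched, which is the separation the component list is after.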

This proposal is identical to the recently written TinkerPop4 paper, save that now Gremlin is data structure agnostic.

6. Conclusion

There is no reason that a Gremlin query must always start g.V().

g.R("people").join(R("addresses")).by("ssn").
  select("country").
  groupCount().order(local).by(value).unfold().limit(1).
  addV("country").property("name",select("name")); // relational -> graph
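
For readers who want to see what the relational-to-graph query above might compute, here is a plain Python sketch over in-memory tables. The table contents, the `join` helper, and the dictionary-based "graph" are assumptions made for illustration. The sketch sorts ascending by count because that is how I read `order(local).by(value)`; reverse the sort if the most frequent country is wanted.

```python
from collections import Counter

# Made-up relational tables, each row a map.
people = [{"ssn": "1", "name": "alice"},
          {"ssn": "2", "name": "bob"},
          {"ssn": "3", "name": "carol"}]
addresses = [{"ssn": "1", "country": "US"},
             {"ssn": "2", "country": "US"},
             {"ssn": "3", "country": "MX"}]


def join(left, right, key):
    """Hash join: merge every pair of rows that agree on `key`."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    for l in left:
        for r in index.get(l[key], []):
            yield {**l, **r}


# select("country").groupCount()
counts = Counter(row["country"] for row in join(people, addresses, "ssn"))

# order(local).by(value).unfold().limit(1): first entry after an ascending sort
country, count = sorted(counts.items(), key=lambda entry: entry[1])[0]

# addV("country").property("name", ...): write the result as a vertex map
graph = {"vertices": []}
graph["vertices"].append({"label": "country", "name": country})
print(graph)
```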

The TinkerPop community is the only community I know capable of developing a system like this. We know how to develop distributed virtual machines. We know how to compile bytecode. We know how to design a Turing Complete data flow language. Thus, I propose:

Apache TinkerPop
A Graph Computing Framework

==becomes==>

Apache TinkerPop
A Distributed Computing Virtual Machine and Language

Thank you for reading,
Marko.

Jack Park

Feb 26, 2019, 10:39:25 AM
to gremlin-users
Marko,

Thank you for this!
I can already imagine gremlin-topicmap - a graph of a slightly different nature than V[] and E[]
Perhaps, a stretch, but gremlin-conceptualgraph as well.

Cheers,

Jack

--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/436E3C45-846E-4AF1-8966-C6AB49C6A976%40gmail.com.
For more options, visit https://groups.google.com/d/optout.

Fred Eisele

Feb 26, 2019, 11:40:22 AM
to Gremlin-users
I am using TinkerPop3 to represent categories as defined by category theory.
Every directed graph can be seen as a free category:
    * graph-node = cat-object 
    * graph-dir_edge = cat-arrow
The weak spot in TinkerPop is the difficulty/impossibility of treating edges/arrows as nodes/objects.
In order to form a functor it is necessary to have arrows connecting arrows.

# Represent Arrows as Nodes

What I am presently doing with TinkerPop3 is creating nodes for every arrow.
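
One way to picture this workaround is the Python sketch below, which turns every arrow into a node so that further arrows can connect arrows. It is a guess at the general encoding, not the actual TinkerPop3 schema in use here; the `reify` function and the `source`, `target`, and `maps_to` labels are hypothetical.

```python
def reify(edges):
    """Turn each (source, label, target) arrow into a node of its own."""
    nodes, reified = set(), []
    for index, (source, label, target) in enumerate(edges):
        arrow = f"{label}#{index}"          # node that stands for the arrow
        nodes.update((source, target, arrow))
        reified.append((arrow, "source", source))
        reified.append((arrow, "target", target))
    return nodes, reified


edges = [("a", "f", "b"), ("x", "g", "y")]
nodes, reified = reify(edges)

# Arrows are now nodes, so an arrow between arrows is an ordinary edge.
reified.append(("f#0", "maps_to", "g#1"))
print(sorted(nodes))
```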

# Property Graph Database as Categorical Database

A more interesting approach would be to rethink TinkerPop to be a categorical database API.
Your proposal is heading in this direction.

The strength of the TinkerPop approach is embodied in the 
algorithms which accompany the graph formalism.
Each formalism suggests/enables additional algorithmic techniques;
the stream-ring formalism is a rich source of useful algorithms.

Every node is a 0-path, every arrow is a 1-path.


Marko Rodriguez

Feb 27, 2019, 2:45:18 PM
to gremli...@googlegroups.com
Hi,

While doing some research, I noticed:

http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf (Google’s VLDB article before turning technology over to Apache)

Apache Beam is doing for stream processors what Apache TinkerPop is doing for graph databases. That is, provide an abstraction layer so users can develop to one API and then pick and choose their backend depending on the time/space-constraints of their application.

Note that Apache Beam is a bit behind when it comes to a high-level processing language. They have a Java-specific, fluent-style DSL called Euphoria.

If Gremlin rode atop Apache Beam, then Apache TinkerPop would support the following stream and batch processors:
Apache Apex
Apache Flink
Apache Gearpump
Apache Nemo
Apache Samza
Apache Spark
Google Cloud Dataflow

I believe I read somewhere, but I can’t find it now, that there is an Apache Beam “runner” (adaptor) for Apache Kafka in the making.

Apache Beam is great for TinkerPop3. We could automatically incorporate all these stream/batch systems into our “OLAP” offering. However, in this mailing thread's TinkerPop4 proposal, if we generalize Gremlin out of “graph,” then Apache TinkerPop is poised to become a useful language in stream/batch computing alongside Streaming SQL. Moreover, given our realization that there are only so many “graph steps” (V(), out(), properties(), etc.) that providers need to implement to access data in their respective database, other data structure steps will start to follow suit (e.g. document database-specific steps).

From there, Gremlin could pull data from any data source, execute on any processing system (local/stream/batch), and with the bytecode specification, enable other languages to compile to it.

To summarize:

1. Apache Beam makes Apache TinkerPop’s access to the stream space much easier.
2. Gremlin (as a language and virtual machine) is more advanced than what is offered by the streaming community.

Thanks for reading,
Marko.

Russell Jurney

Feb 27, 2019, 2:59:04 PM
to gremli...@googlegroups.com
This sounds really cool.

Would the core of Gremlin syntax change? How long would this take you and how long until it’s stable?

Is there anything possible to make Gremlin syntax more approachable? In terms of popularity that’s the barrier I see. It is incredibly powerful but is hard to learn compared to SQL. The awesome Gremlin book has helped enormously but things like lookbacks are intimidating. Not sure how much of this would occur during stream/relational processing.


Marko Rodriguez

Feb 27, 2019, 3:19:29 PM
to gremli...@googlegroups.com
Hello,

Would the core of Gremlin syntax change? 

No. The idea, in theory at this point, is that there would be language subsets.

gremlin-core: where(), is(), select(), match(), as(), repeat(), choose(), coalesce(), groupCount(), sum(), path(), coin(), etc.
gremlin-graph: V(), out(), in(), outE(), inV(), properties(), etc.
gremlin-??

Thus, what you currently think of as Gremlin would be gremlin-core + gremlin-graph. Here is a made up Gremlin Console session showing how this would work in principle:
 
gremlin> g = Gremlin.traversal('gremlin-core','gremlin-graph').
     source(TinkerFactory.createClassic()).
     processor(FlinkProcessor.class)
==>traversalsource[[gremlin-core,gremlin-graph],tinkergraph[vertices:6 edges:6],flink]

Now, imagine source(file), source(mongodb), source(kafkatopic), etc. Likewise, imagine processor(AkkaProcessor.class), processor(RxJavaProcessor.class), processor(SparkProcessor.class), etc.
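
The pluggable `source(...)`/`processor(...)` idea can be sketched in Python as follows. Everything here is hypothetical: `TraversalSource`, `IteratorProcessor`, and `BatchProcessor` are toy classes standing in for the made-up console API above, and each step is simply a function from one item to a list of items.

```python
def flat_map(step, stream):
    """Apply one step lazily to every item in the stream."""
    for item in stream:
        yield from step(item)


class IteratorProcessor:
    """Pull-based execution: items flow one at a time through chained generators."""
    def execute(self, steps, data):
        stream = iter(data)
        for step in steps:
            stream = flat_map(step, stream)
        return list(stream)


class BatchProcessor:
    """Batch execution: each step's full output is materialized before the next."""
    def execute(self, steps, data):
        batch = list(data)
        for step in steps:
            batch = [out for item in batch for out in step(item)]
        return batch


class TraversalSource:
    """Binds a data source and a processor; the steps stay the same for both."""
    def __init__(self):
        self._data, self._processor = [], IteratorProcessor()

    def source(self, data):
        self._data = data
        return self

    def processor(self, processor):
        self._processor = processor
        return self

    def run(self, *steps):
        return self._processor.execute(steps, self._data)


data = [{"name": "marko", "knows": ["vadas", "josh"]},
        {"name": "peter", "knows": []}]
steps = (lambda person: person["knows"], lambda name: [name.upper()])

lazy = TraversalSource().source(data).processor(IteratorProcessor()).run(*steps)
batch = TraversalSource().source(data).processor(BatchProcessor()).run(*steps)
print(lazy, batch)
```

Both processors return the same results for the same steps; only the execution strategy differs, which is the property the `processor(...)` call relies on.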

How long would this take you and how long until it’s stable?

How much money you got? 

If this seems like “the way forward,” then I bet I could rip out a solid alpha in 2 months’ time. Probably less.

Is there anything possible to make Gremlin syntax more approachable? In terms of popularity that’s the barrier I see. It is incredibly powerful but is hard to learn compared to SQL.

If you look at the fluent APIs provided by the stream processing frameworks, they are all pretty limp. Basically, just map, flatmap, filter, group, etc. style functions that take a Java8 function for closure. Gremlin is a more complicated language because it supports looping, conditional branching, mutation histories, local and global variables, etc. This complexity is what makes Gremlin Turing Complete and also what I believe will make it successful not only as a general-purpose database query language but also as a general-purpose stream/batch processing language. Moreover, Gremlin is not bound to any programming language and can be easily hosted in Groovy, Scala, Python, JavaScript, C#, etc. because of its bytecode abstraction layer and its only requirement being that the hosting language support function nesting and concatenation (a trivial requirement).
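
To show what looping and conditional branching add on top of plain map/filter chains, here is a small Python sketch of `repeat`- and `choose`-like combinators over streams. These are simplified stand-ins written for this example; real Gremlin steps carry traversers, paths, and modulators such as until() and emit() that are not modeled here.

```python
# Made-up adjacency map standing in for a graph.
graph = {"a": ["b"], "b": ["c"], "c": []}


def out(stream):
    """Map each vertex in the stream to its outgoing neighbors."""
    for vertex in stream:
        yield from graph[vertex]


def repeat(step, times):
    """Looping: apply the same stream step a fixed number of times."""
    def looped(stream):
        for _ in range(times):
            stream = step(stream)
        return stream
    return looped


def choose(predicate, if_true, if_false):
    """Branching: route each item through one of two sub-pipelines."""
    def branched(stream):
        for item in stream:
            branch = if_true if predicate(item) else if_false
            yield from branch([item])
    return branched


two_hops = list(repeat(out, 2)(["a"]))

label = choose(lambda v: graph[v],                      # has outgoing edges?
               lambda s: (f"{v}:inner" for v in s),
               lambda s: (f"{v}:leaf" for v in s))
labels = list(label(["a", "c"]))
print(two_hops, labels)
```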

Gremlin is part of a “functional movement” in computing these days and I believe the younger generation is getting more comfortable with the monadic-fluent style and understanding that data flows through functions. SQL is a completely different style of language. I don’t have a good answer to making it easier to learn than SQL. Perhaps you or others could offer thoughts?

The awesome Gremlin book has helped enormously but things like lookbacks are intimidating. Not sure how much of this would occur during stream/relational processing.

Kelvin knocked that one out of the park. If this proposal moves forward, we will definitely be looking to him to be the voice for the developer community.

Marko.




Marko Rodriguez

Feb 27, 2019, 4:05:43 PM
to gremli...@googlegroups.com
Hi,

Over lunch, I had another thought.

If you want Gremlin to be “the standard language” for graph, you generalize it.

“I’m a developer at CoolCo, Inc. and I’ve been working on a project that does stream processing. We use Gremlin to define our data flows as it’s expressive and has bindings to the multiple programming languages we use on the project. The project is done now, and I’ve been tasked to work on a new graph database project. Hmm, lots of different graph database languages out there. Oh, whoa! Gremlin is a graph query language too with V(), out(), etc. primitives. Awesome. I guess I will just keep using Gremlin.”

“I’m a developer at Hip, LLC. and I have to map a big relational database we have to a graph database. I decided to use Gremlin because gremlin-relational can pull data from the relational database and gremlin-graph can insert the data into the graph database. I did this with Spark being the underlying Gremlin VM execution engine. It is so cool that Gremlin can be used to do all my data manipulation and ingestion. I’m definitely going to keep using it for our real-time and batch/streaming queries.”

In essence, if Gremlin is available to stream people, batch people, graph people, etc., then there is even more of a reason to learn Gremlin and make it your go-to language.

There are very few pressures against Gremlin working in all these environments. It’s almost a natural next step.

Marko.

Joshua Shinavier

Feb 28, 2019, 9:11:11 AM
to gremli...@googlegroups.com
I think support for data streams is a very natural next step for Gremlin (although I do think TinkerPop should maintain its focus as "A Graph Computing Framework"). At Data Day Seattle in 2017, I was rather inspired by Tyler Akidau's talk on the Dataflow Model (for an intro, see here and here, or see the paper). This is how Apache Beam combines streaming with the relational model. Although there have been quite a few previous solutions of this kind in the academic sphere, this one has a robust modern implementation. I thought at the time what a great fit it would be for graph processing -- what you need to make this work are a) a well-defined property graph data model, and b) a well-defined, invertible mapping between that model and the relational model.

Josh


Stephen Mallette

Feb 28, 2019, 10:04:39 AM
to gremli...@googlegroups.com
This thread has the feel of the early days of TinkerPop to me - nostalgic. :)

I already expressed this thought to Marko privately, but I thought I'd point it out here too since Josh said something similar about staying focused on being "A Graph Computing Framework". We've spent about a decade trying to position TinkerPop with that "A Graph Computing Framework" tagline and while I understand that (1) there is a natural path/fit available to data streams and (2) Graph is NOT being shucked to the side to become a second class citizen, we should be careful in how this kind of change is messaged. I'm no expert in marketing/branding/etc stuff, but I sense that there should be some caution there.

Aside from that, I think it's an interesting future body of work that could expand our community in interesting ways. Perhaps it opens up a new body of committers to us the way that GLVs have attracted engineers from other programming languages. We wouldn't just have graph heads hanging about but also start to see perspectives from other areas of computing expertise. 



jbda...@gmail.com

Mar 1, 2019, 3:18:31 AM
to Gremlin-users
Hi,

As a GLV bad pupil, I can share my experience. Bad pupil because, to use the Gremlin query language with Python, I developed a kind of workaround to avoid the driver. I posted a how-to last year, "How to pythonize TinkerPop".
As a recent Gremlin user, but not a recent data user in IT solutions, I'd say there are similarities with past IT solutions and roadmaps.
JDBC drivers: Is the future of the Gremlin driver the same as that of JDBC drivers?
SQL: In a way, "A Distributed Computing Virtual Machine and Language" aims for the Gremlin language to become the new SQL.
(Jungle of) Database provider: What is the opensource roadmap of a language as the (best) (unique) cipher of heavy weight graph database providers ? 

TinkerPop is a top IT solution, but the Gremlin language can be so simple and yet so complicated that the creator has to be forever online to deliver the solution.
Next release focuses on TinkerPop's DNA: API structure, driver to connect to legacy graph database... In this way, why doesn't TK aim to increase market share among high-performance graph providers...

Michael Pollmeier

Mar 1, 2019, 4:15:03 PM
to Gremlin-users
I love the outside-our-box thinking that's happening here! You're proposing some radical changes in direction, so let me dish up some counter arguments:

1) Focus: I'm afraid that we will spread too wide and thin, rather than doing one thing really well. This community has a strong interest and knowledge in the graph model, and we should make sure we leverage that.

2) Resources: your proposal sounds like a lot of work. There are only a few select TinkerPop members who can dedicate significant time to do this. And that's probably a good thing, because if it was a big team, a lot of time would be spent on coordination. So this comes back to focus.

3) Inheriting cruft: Each of these territories you're listing comes with its own history and way of thinking, and as we tap into them we need to find a balance in how much of that cruft we can (or need to) take on board. That said, we already have experience with diversity (databases, prog languages, ...) and handled it quite well.

I sound like a grumpy old granddad who always knew that it's never gonna work, but it's better to think this through at the start rather than mid-way in.
If we focus on getting gremlin-core and gremlin-graph right, while having gremlin-[relational|document|keyvalue] in mind, we can leverage the power of this community, have a realistic goal and work towards a great vision for TinkerPop4.

Michael

Marko Rodriguez

Mar 1, 2019, 5:06:34 PM
to gremli...@googlegroups.com
Hi,

I agree with the three points you bring up. And, as you conclude, yes — I think it would primarily start as:

1. gremlin-core + gremlin-graph language.
2. streaming, batch, local execution engines.

That in a nutshell is Gremlin as a graph traversal language with support for the various execution engines. This is what TP4 was proposed to be and is the natural next step. However, by designing it right, the language can be left open to be tapped by document, relational, etc. database people down the road. Thus, I would agree that we don’t go all out on trying to get into these other spaces, but like we have done with Gremlin bytecode (w/ trying to get other languages to compile to Gremlin), demonstrate that it’s possible (in theory and practice).

Thanks Michael,
Marko.

Derek Williams

Mar 1, 2019, 5:36:36 PM
to gremli...@googlegroups.com
I share some of Michael's concerns, and I am just an end user of the stack. However, I really do like the direction that Marko is proposing. Would it be possible to create a very simple proof of concept, with working code, of the concepts outlined in Marko's Ring Theory paper? The advantage is that we might be able to get a handle on the concrete benefits. In my mind the proof of concept could do something as simple as use an in-memory database, potentially with async processing of subgraphs (if that exposes any interesting behavior). It seems to me that a developer who really understood the concepts in the paper could put together such a proof of concept.

Just my two cents; and for what it's worth I'm only 2/3 through my first read of the paper.

Derek




--
Derek Williams
Cell: 970.214.8928
