Hello everyone,
My sabbatical gave me the opportunity to step away from the TinkerPop/Gremlin codebase, reflect on the “data space” as a whole, and think through the types of problems I want to tackle next. I have come to a bit of a crossroads and I would like to get people’s feedback on my thoughts below.
1. The Primary Benefits of TinkerPop3
TinkerPop3 is a massive body of code that is light years ahead of TinkerPop2 and TinkerPop1. I am thankful that DataStax gave me the opportunity to focus full-time on TinkerPop3 for three straight years. Moreover, I am beyond ecstatic that I was able to work day in and day out with Stephen and Kuppitz. Our collaborations have birthed some beautiful ideas. A technical write-up of the key features of TinkerPop3 is provided at
https://arxiv.org/abs/1508.03843. Of particular import are the following developments:
a. The Gremlin virtual machine
- the idea that there is a language agnostic bytecode that any query/programming language can compile to.
- the idea that the virtual machine can interact with any graph database.
- the idea that the virtual machine can be executed by any data processor.
b. The Gremlin language
- a delightfully expressive, self-consistent fluent-style query language.
- the idea that Gremlin can be embedded in (hosted by) any programming language that supports function concatenation and nesting.
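To illustrate how a fluent host-language embedding can produce language-agnostic bytecode, here is a minimal Python sketch. The `Traversal` class and the tuple-based instruction format are assumptions made for illustration only; they are not TinkerPop's actual classes or wire format.

```python
class Traversal:
    """Hypothetical fluent builder: calls record instructions, nothing executes."""

    def __init__(self, bytecode=None):
        self.bytecode = list(bytecode or [])

    def __getattr__(self, step_name):
        # Private/dunder lookups should fail normally rather than become steps.
        if step_name.startswith("_"):
            raise AttributeError(step_name)

        def add_step(*args):
            # A nested traversal is stored as its own instruction list.
            encoded = [a.bytecode if isinstance(a, Traversal) else a for a in args]
            return Traversal(self.bytecode + [(step_name, *encoded)])

        return add_step


g = Traversal()
query = g.V().has("name", "marko").repeat(Traversal().out("knows")).times(2).values("name")
print(query.bytecode)
```

The builder only needs method chaining and nesting from the host language, which is why the same instruction list could in principle be emitted from any language with those two features.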
2. The Primary Benefits of TinkerPop4
1. Remove the structure API: graph providers simply need to implement custom V(), out(), property(), etc. steps.
- Thus, there will be no more graph.vertex(), vertex.outEdges(), edge.properties(), etc.
- The only way to interact with the graph is via Gremlin.
2. Easily support any data processor: the OLTP/OLAP distinction will blur as we make it easier for other data processors to integrate with TinkerPop.
- Example data processors include Akka, Kafka, Flink, Spark, Apex, RxJava, Storm, etc.
- Gremlin is a data flow language and any data flow/stream processor should be able to naturally execute it.
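As a rough sketch of point 1, a provider could expose nothing but step implementations, each one a function from an incoming stream to an outgoing stream. The names `ToyProvider`, `steps()`, and `run`, and the tuple-based bytecode, are hypothetical and chosen only for this example.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Step = Callable[..., Iterator[Any]]   # (incoming stream, *args) -> outgoing stream
Bytecode = List[Tuple[Any, ...]]      # e.g. [("V",), ("out", "knows")]


class ToyProvider:
    """Hypothetical provider: storage is private, only steps are exposed."""

    def __init__(self) -> None:
        # The provider may keep its data in any shape it likes.
        self._names = {1: "marko", 2: "stephen", 3: "kuppitz"}
        self._adjacency = {(1, "knows"): [2, 3], (2, "knows"): [3]}

    def steps(self) -> Dict[str, Step]:
        return {"V": self._v, "out": self._out, "values": self._values}

    def _v(self, stream: Iterable[Any]) -> Iterator[Any]:
        # Start step: ignores the incoming stream and emits every vertex id.
        yield from self._names

    def _out(self, stream: Iterable[Any], label: str) -> Iterator[Any]:
        for vertex_id in stream:
            yield from self._adjacency.get((vertex_id, label), [])

    def _values(self, stream: Iterable[Any], key: str) -> Iterator[Any]:
        # Toy limitation: "name" is the only property stored.
        for vertex_id in stream:
            if key == "name":
                yield self._names[vertex_id]


def run(bytecode: Bytecode, steps: Dict[str, Step]) -> Iterator[Any]:
    """Chain the provider's step functions in bytecode order."""
    stream: Iterator[Any] = iter(())
    for name, *args in bytecode:
        stream = steps[name](stream, *args)
    return stream


provider = ToyProvider()
print(list(run([("V",), ("out", "knows"), ("values", "name")], provider.steps())))
```

Nothing outside the provider ever touches a vertex or edge object; the caller only sees whatever flows out of the last step.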
3. Data Agnosticism in TinkerPop4
In this section, I want to discuss a potential future that is a radical re-thinking of TinkerPop4 and, ultimately, what Apache TinkerPop could mean to the data community as a whole.
One late fall night I was circumnavigating Isla Espiritu Santo and it dawned on me that Gremlin has a natural algebraic representation. With further thought, I realized that this algebraic structure is a ring. With even further thought, I realized that this ring has nothing to do with graphs, but in fact, is data structure agnostic. The ring simply describes how data flows through functions. This swath of ideas led to the development of the stream ring theory:
https://zenodo.org/record/2565243. The article’s algebra nicely describes Gremlin, but interestingly enough, the paper does not discuss “graphs.” Since writing it, I have considered this paper “the death of Gremlin” and “the birth of Gremlin.”
Since day 1, TinkerPop has been focused on providing the graph community a provider agnostic query language. However, I no longer see “graph” as the most important aspect of TinkerPop. For instance, there are very few steps in the Gremlin language that are graph specific. These include: V(), out(), in(), outE(), property(), inV(), etc. The other steps in Gremlin are data structure agnostic. For example: select(), where(), match(), as(), sack(), repeat(), project(), is(), math(), choose(), coalesce(), group(), etc.
Now, after the development of the stream ring algebra, I believe that Gremlin is poised to break out of its graph shell and become a universal query language and virtual machine that supports:
1. Any query language: any query language can compile to its bytecode.
2. Any data storage system: any data structure can flow through its steps.
3. Any data processor: any message-passing/stream-based system can integrate with it.
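Point 3 can be sketched by running one pipeline description on two very different execution models. The two "processors" below (`run_pull` and `run_push`) and the `("map", fn)` / `("filter", fn)` instruction shape are assumptions for illustration; real stream processors have their own APIs.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple

# Each instruction is ("map", fn) or ("filter", fn).
Pipeline = List[Tuple[str, Callable[[Any], Any]]]


def run_pull(pipeline: Pipeline, source: Iterable[Any]) -> Iterator[Any]:
    """Iterator-style processor: the consumer pulls results lazily."""
    stream: Iterator[Any] = iter(source)
    for kind, fn in pipeline:
        stream = map(fn, stream) if kind == "map" else filter(fn, stream)
    return stream


def run_push(pipeline: Pipeline, source: Iterable[Any], sink: Callable[[Any], None]) -> None:
    """Message-passing-style processor: each element is pushed step by step to a sink."""

    def emit(item: Any, index: int = 0) -> None:
        if index == len(pipeline):
            sink(item)
            return
        kind, fn = pipeline[index]
        if kind == "map":
            emit(fn(item), index + 1)
        elif fn(item):  # filter: forward only the elements that pass
            emit(item, index + 1)

    for item in source:
        emit(item)


pipeline: Pipeline = [("filter", lambda x: x % 2 == 1), ("map", lambda x: x * x)]

pulled = list(run_pull(pipeline, range(6)))
pushed: List[Any] = []
run_push(pipeline, range(6), pushed.append)
print(pulled, pushed)
```

Both executions consume the same description and yield the same results; only the scheduling of work differs.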
4. Gremlin Beyond Graph
There are numerous stream processing frameworks in existence today. Most of their APIs are similar to Gremlin in that they support the map/filter/flatmap-fluent style.
From what I can gather, Gremlin is much more expressive (supporting variables, branching, looping, nesting, pattern matching, etc.). Moreover, Gremlin has an algebraically sound compiler and can be embedded into most any programming language. It is these aspects together that take Gremlin away from being just a “fluent API” to being a “Turing Complete query language.”
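As a small illustration of how looping and branching can sit on top of the usual map/filter/flatmap vocabulary, here is a Python sketch. The combinators `flat_map`, `repeat`, and `choose` are written for this example and only loosely mirror the Gremlin steps of the same names; they are not Gremlin's semantics in full.

```python
from itertools import chain
from typing import Any, Callable, Iterable, Iterator

Step = Callable[[Iterable[Any]], Iterator[Any]]


def flat_map(fn: Callable[[Any], Iterable[Any]]) -> Step:
    def step(stream: Iterable[Any]) -> Iterator[Any]:
        return chain.from_iterable(fn(x) for x in stream)
    return step


def repeat(body: Step, times: int) -> Step:
    """Looping: feed the stream through `body` a fixed number of times."""
    def step(stream: Iterable[Any]) -> Iterator[Any]:
        for _ in range(times):
            stream = body(stream)
        return iter(stream)
    return step


def choose(predicate: Callable[[Any], bool], if_true: Step, if_false: Step) -> Step:
    """Branching: route each element through one of two nested pipelines."""
    def step(stream: Iterable[Any]) -> Iterator[Any]:
        for x in stream:
            branch = if_true if predicate(x) else if_false
            yield from branch([x])
    return step


keep_and_double = flat_map(lambda x: [x, x * 2])
looped = list(repeat(keep_and_double, times=2)([1]))

halve_or_grow = choose(lambda x: x % 2 == 0,
                       flat_map(lambda x: [x // 2]),
                       flat_map(lambda x: [3 * x + 1]))
branched = list(halve_or_grow([1, 2, 3]))
print(looped, branched)
```

Because each combinator takes steps and returns a step, loops and branches nest freely inside one another.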
The focus of TinkerPop has been on graphs (databases) and I believe, to our detriment thus far, we have ignored the stream community and the awesome technologies they bring to the table. TinkerPop3 only supports Java iterators (OLTP) and Spark (OLAP). If we tap into these other stream processors:
1. The stream community gets a powerful, expressive fluent query language.
2. The graph community can seamlessly leverage more data processors.
3. The data community, in general, can Gremlin query any data — not just graphs.
Thus, I believe that Gremlin, in TinkerPop4, should be broken up into language subsets:
1. gremlin-core: select(), as(), match(), where(), is(), project(), group(), fold(), etc.
2. gremlin-graph: V(), outE(), in(), property(), etc.
3. gremlin-relational: R(), join(), etc.
4. gremlin-document: D(), etc.
…?
If you are working with graphs, then you “import” gremlin-core and gremlin-graph and off you go. If you are pulling data from a relational database and processing that data to then put it into a graph database, import gremlin-core, gremlin-relational, and gremlin-graph. Finally, consider Josh Shinavier’s recent work on the categories of graph (
https://www.slideshare.net/joshsh/a-graph-is-a-graph-is-a-graph-equivalence-transformation-and-composition-of-graph-data-models-129403012). He is staged to generalize these ideas to the categories of data. What is a vertex? A map with literal property values and nested list/edge elements. What is a relational database row? A map. What is a document? A nested map with literal, map, and list elements. Data is data is data. There is little distinction between these data structures. In the end, it’s all just literals, lists, and maps.
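The "literals, lists, and maps" view can be sketched in a few lines of Python. The shapes chosen for the vertex, row, and document below, and the two step functions, are assumptions for illustration rather than a proposed data model.

```python
# A vertex, a relational row, and a document, all as plain maps (hypothetical shapes).
vertex = {"id": 1, "label": "person", "name": "marko",
          "outE": [{"label": "knows", "inV": 2}]}
row = {"ssn": "000-00-0001", "name": "marko", "country": "US"}
document = {"name": "marko", "address": {"country": "US", "tags": ["home"]}}


def values(stream, key):
    """Data-agnostic step: emit `key` from every map that has it."""
    for item in stream:
        if isinstance(item, dict) and key in item:
            yield item[key]


def has(stream, key, expected):
    """Data-agnostic filter: keep maps whose top-level `key` equals `expected`."""
    for item in stream:
        if isinstance(item, dict) and item.get(key) == expected:
            yield item


names = list(values([vertex, row, document], "name"))
nested_country = list(values(values([document], "address"), "country"))
us_items = list(has([vertex, row, document], "country", "US"))
print(names, nested_country, us_items)
```

Neither step knows whether its input came from a graph, a table, or a document store; walking an edge or descending into a nested document is the same operation on a nested map.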
5. The Components of TinkerPop4
I propose TinkerPop4 be a complete rewrite of TinkerPop3. The components of this new body of code would include:
1. Gremlin language and bytecode specifications.
- gremlin-core, gremlin-graph, gremlin-document, gremlin-files, …
2. Bytecode strategies for compiling and optimizing bytecode.
- gremlin-core has its strategies.
- gremlin-graph extends it with graph specific strategies.
- data system providers extend it with database/storage specific optimizations.
3. Gremlin traversal machine that is designed for any processor.
- kafka, spark, flink, storm, rxjava, apex, …
- bytecode comes in, a stream topology is created, executed, and results are streamed back.
4. A binary serialization format that is data structure agnostic.
- gremlin-graph would have graph specific serialization extensions.
- gremlin-document would have document specific serialization extensions.
- etc. … or maybe it’s all just maps, lists, and literals that are called “vertices,” “edges,” “documents,” and “rows” … easy.
5. A simple I/O server for sending Gremlin queries to the virtual machine and streaming back results to the user.
6. A Gremlin REPL console for terminal control.
And nothing else. That’s it. Gremlin in, results back.
This proposal is identical to the recently written TinkerPop4 paper, save that now Gremlin is data structure agnostic.
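The layering of strategies described in component 2 might look something like the following Python sketch, where a strategy is simply a function from bytecode to bytecode. The strategy names, the `V_indexed` instruction, and the tuple-based bytecode are all hypothetical and used only to show the layering.

```python
from typing import Any, Callable, List, Tuple

Instruction = Tuple[Any, ...]
Bytecode = List[Instruction]
Strategy = Callable[[Bytecode], Bytecode]


def drop_identity(bytecode: Bytecode) -> Bytecode:
    """Core strategy (hypothetical): identity() does nothing, so remove it."""
    return [ins for ins in bytecode if ins[0] != "identity"]


def fold_has_into_v(bytecode: Bytecode) -> Bytecode:
    """Graph strategy (hypothetical): rewrite V() followed by has(k, v) into a
    single instruction that a provider could answer from an index."""
    out: Bytecode = []
    for ins in bytecode:
        if ins[0] == "has" and out and out[-1] == ("V",):
            out[-1] = ("V_indexed", ins[1], ins[2])
        else:
            out.append(ins)
    return out


def compile_bytecode(bytecode: Bytecode, strategies: List[Strategy]) -> Bytecode:
    """Apply each strategy in order; later strategies see earlier rewrites."""
    for strategy in strategies:
        bytecode = strategy(bytecode)
    return bytecode


CORE_STRATEGIES: List[Strategy] = [drop_identity]
GRAPH_STRATEGIES: List[Strategy] = CORE_STRATEGIES + [fold_has_into_v]

raw = [("V",), ("identity",), ("has", "name", "marko"), ("out", "knows")]
core_only = compile_bytecode(raw, CORE_STRATEGIES)
with_graph = compile_bytecode(raw, GRAPH_STRATEGIES)
print(core_only)
print(with_graph)
```

A data system provider would append its own functions to the list in the same way, so each layer extends the one beneath it without modifying it.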
6. Conclusion
There is no reason that a Gremlin query must always start with g.V().
g.R("people").join(R("addresses")).by("ssn").
  select("country").
  groupCount().order(local).by(value).unfold().limit(1).
  addV("country").property("name",select("name")); // relational -> graph
The TinkerPop community is the only community I know capable of developing a system like this. We know how to develop distributed virtual machines. We know how to compile bytecode. We know how to design a Turing Complete data flow language. Thus, I propose:
Apache TinkerPop
A Graph Computing Framework
==becomes==>
Apache TinkerPop
A Distributed Computing Virtual Machine and Language
Thank you for reading,
Marko.