TL;DR:
There is a small conceptual gap in Gremlin today that can be filled by defining a "persistent" traversal engine. The Apache Storm platform complements Gremlin in many important ways to realize this goal.
Background:
The Gremlin "language", particularly the anonymous traversal, is a stream processing language. An anonymous traversal can be understood by considering what happens when a blob of data of a particular shape flows into one end of it: the side-effects that occur along the way, and what is emitted from the other end.
The anonymous traversal is the building block for other anonymous traversals; it is composable. A "graph traversal" is simply an anonymous traversal that has been "anchored" to an underlying graph in a specific way that provides the blobs of data that feed the anonymous traversal.
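To make the stream-processing framing concrete, here is a rough analogy in plain Java (not TinkerPop's actual API): an anonymous traversal behaves like a function from one stream to another, composition chains such functions together, and "anchoring" means supplying a concrete source.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TraversalAnalogy {
    // An "anonymous traversal": a reusable stream-to-stream transformation.
    static Function<Stream<String>, Stream<Integer>> lengths =
            s -> s.map(String::length);

    // Composition: a bigger anonymous traversal built from a smaller one.
    static Function<Stream<String>, Stream<Integer>> longLengths =
            lengths.andThen(s -> s.filter(n -> n > 3));

    public static void main(String[] args) {
        // "Anchoring": feed the composed traversal from a concrete source
        // (in Gremlin this role is played by the underlying graph).
        List<Integer> out = longLengths
                .apply(Stream.of("a", "marko", "vadas"))
                .collect(Collectors.toList());
        System.out.println(out); // [5, 5]
    }
}
```

In Gremlin itself, the source is the underlying graph and the functions are steps such as `out()` and `values()`.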
The underlying graph is also accessible within specific side-effects, allowing mutations to be made conveniently. This is an important part of the Gremlin concept, but one which is somewhat orthogonal to its capacity as a stream processing language.
Gremlin is, of course, designed for processing graph data structures. Its execution model allows for a degree of concurrency that mirrors the underlying graph structure. How much of this concurrency actually manifests is largely vendor-specific, although we also have two abstractions, OLTP and OLAP, that frame our concurrency expectations. Each of these abstractions also sets expectations about duration: OLTP is designed for short-running "sips" of data, while OLAP is for long-running "gulps". Both are typically finite in duration.
For those unfamiliar with Storm, a brief sidebar. In a nutshell, Storm has the concept of "spouts", which produce data that can be processed by "bolts". The connections between the various spouts and bolts form a computational topology (itself a directed graph). As a Storm developer, one can specify the degree of concurrency that each spout or bolt will have, and where (in a distributed computing environment) this computation can take place.
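For intuition only, here is a toy model of the spout/bolt idea in plain Java, with queues and threads standing in for Storm's actual primitives: one spout thread emits tuples, and a configurable number of identical bolt workers consume them concurrently, analogous to a per-bolt parallelism hint.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class ToyTopology {
    // Run a 1-spout / N-bolt pipeline and return the bolt outputs.
    static Set<Integer> run(int boltParallelism) {
        BlockingQueue<Integer> stream = new LinkedBlockingQueue<>();
        Set<Integer> results = ConcurrentHashMap.newKeySet();

        // "Spout": produces tuples 0..9, then one end-marker per bolt worker.
        Thread spout = new Thread(() -> {
            for (int i = 0; i < 10; i++) stream.add(i);
            for (int i = 0; i < boltParallelism; i++) stream.add(-1);
        });

        // "Bolts": N identical workers consume the stream concurrently,
        // mirroring Storm's parallelism hint for a single bolt.
        Thread[] bolts = new Thread[boltParallelism];
        for (int b = 0; b < boltParallelism; b++) {
            bolts[b] = new Thread(() -> {
                try {
                    int tuple;
                    while ((tuple = stream.take()) != -1) results.add(tuple * tuple);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        spout.start();
        for (Thread b : bolts) b.start();
        try {
            spout.join();
            for (Thread b : bolts) b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(3).size()); // 10 distinct squares
    }
}
```

Real Storm additionally handles distribution, grouping of tuples between components, and fault tolerance, none of which this sketch attempts.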
Overview:
A Gremlin-Storm combination would consist of spouts and bolts defined by specific traversals. The possibility of upcoming "prepared traversals" opens the possibilities further. The following pieces come to mind:
- A spout that consists of a graph traversal
- A spout that consists of a prepared graph traversal that executes when parameter sets are delivered out-of-band
- A bolt that consists of an anonymous traversal
- A bolt that consists of an anonymous traversal that produces parameter sets for repeatedly executing a prepared traversal
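As a sketch of the second and fourth pieces, a "prepared traversal" could be thought of as a template compiled once and then executed anew for each parameter set that arrives out-of-band. The names and the string-template mechanism below are purely hypothetical, since prepared traversals are themselves only a proposed feature.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PreparedTraversalSketch {
    // Hypothetical "prepare": turn a parameterized traversal template into a
    // function that is executed once per incoming parameter set.
    static Function<Map<String, String>, String> prepare(String template) {
        return params -> template.replace("$name", params.get("name"));
    }

    public static void main(String[] args) {
        Function<Map<String, String>, String> prepared =
                prepare("g.V().has('name', '$name').out()");

        // Parameter sets delivered "out-of-band" each trigger one execution.
        List<String> executed = List.of(
                        Map.of("name", "marko"),
                        Map.of("name", "vadas"))
                .stream().map(prepared).collect(Collectors.toList());
        executed.forEach(System.out::println);
    }
}
```

A spout of this kind would run each resulting traversal against the graph and emit the results downstream; here the "execution" is just string substitution, for illustration.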
This would allow the creation of "persistent traversals" via a third engine, complementing "standard" (OLTP) and "computer" (OLAP). Persistent traversals would execute continuously until explicitly terminated, producing a stream of data from one or more "leaf" traversals. An implementation could use Storm to achieve this, but other implementations are feasible.
Importantly, the graph traversals that would form the spouts are themselves executed by an engine that may be of any suitable type. For example, one could run a prepared traversal via "standard" to generate a "micro-batch" of data that feeds the persistent traversal downstream. One could even run an embedded persistent topology!
With Storm, one could imagine a TraversalStrategy that produces a Storm topology, exploiting the natural concurrency present in the Gremlin execution model. For example, an edge traversal step could manifest as a seam between two bolts, and each of the many edges emanating from a vertex could then be followed by a separate task in the downstream bolt. Branches could literally branch in the Storm topology. More complex strategies could make use of statistical models to optimize the topology. There is room for vendor innovation.
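A minimal sketch of such a strategy, assuming a naive one-bolt-per-step mapping (the names and the mapping itself are illustrative, not a proposed API): compile a linear chain of steps into spout-to-bolt wirings, where each step boundary is a seam at which parallelism could widen.

```java
import java.util.ArrayList;
import java.util.List;

public class TopologyStrategySketch {
    // Toy "strategy": compile a linear sequence of traversal steps into a
    // chain of bolt wirings fed by a graph-traversal spout. A real strategy
    // would also handle branches, barriers, and parallelism hints.
    static List<String> compile(List<String> steps) {
        List<String> wiring = new ArrayList<>();
        String upstream = "graph-spout";
        for (String step : steps) {
            String bolt = "bolt[" + step + "]";
            wiring.add(upstream + " -> " + bolt);
            upstream = bolt;
        }
        return wiring;
    }

    public static void main(String[] args) {
        // e.g. g.V().out('knows').values('name'): each step becomes a bolt,
        // and each step boundary becomes a seam between two bolts.
        compile(List.of("out('knows')", "values('name')"))
                .forEach(System.out::println);
    }
}
```

Branching steps would turn this chain into a tree or DAG, which is exactly the shape a Storm topology already has.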
In addition to the obvious use case of real-time, streaming analytics, the idea of standing up a (paradoxically named) "temporary persistent" traversal instead of a "computer" traversal for batch processing has some appeal. Specifically, in contrast to the current API supporting computer traversals, a persistent traversal that is based on a composable computational topology is more flexible and scales more readily to more complex traversals. It is easier to imagine a path to supporting the complete traversal API via the persistent engine than via the computer engine.
P.S.:
Storm was mentioned in another topic in this forum, but not in the same context: