Traversal ByteCode and Translators (The TINKERPOP-1278 Saga)

752 views
Skip to first unread message

Marko Rodriguez

unread,
Jul 2, 2016, 10:04:49 AM7/2/16
to gremli...@googlegroups.com, d...@tinkerpop.apache.org
Hello,

So TINKERPOP-1278 (aka “Gremlin-Python”) has introduced the notion of Traversal ByteCode. 


In essence, ByteCode is the construction history of a traversal and is of the form:

[string, object*]* // a list of (operator, arguments[])

The traversal 

g.V(1).outE(‘created’).inV().
  repeat(out(‘created’).in(‘created’)).times(5).
  valueMap(’name’,’age’) 

has a ByteCode representation as below:


[
[V, 1]
[outE, ‘created’]
[inV]
[repeat, [
[out, ‘created’]
[in, ‘created’]
]
[times, 5]
[valueMap, name, age]
]

Again, Gremlin is a simple language based on function concatenation and nesting. Thats all there is to it. Thus, it forms a tree and trees are easy to encode, distribute, serialize, decode, prune/optimize, search, etc. Moreover, every programming language supports function composition and nesting and thus, Gremlin is able to be hosted in any programming language. [http://tinkerpop.apache.org/gremlin.html]

The benefit of ByteCode as it applies to TINKERPOP-1278 is that a Translator is able to access the ByteCode of the traversal and then use that linear-nested structure (wide-tree) to generate a traversal representation in another language — e.g. Gremlin-Python, Gremlin-Ruby, Gremlin-JavaScript, etc. 

Here is the Gremlin-Python translator that will turn ByteCode into Gremlin-Python:
Here is the Gremlin-Groovy translator that will turn ByteCode into Gremlin-Groovy:

Pretty simple, eh? So, why would you want Gremlin-Java to translate to Gremlin-Groovy? Well, so you can code in Gremlin-Java and then have it execute on GremlinServer via the GremlinGroovy JSR223 ScriptEngine. However, one can imagine a Gremlin-Java->Gremlin-Java translator! What is that?! Well, it would use reflection (or some more efficient mechanism) to reconstruct the Gremlin-Java traversal from ByteCode generated from Gremlin-Java and thus, the entire cluster/sever infrastructure is simply migrating ByteCode around as opposed to worrying about language specific representation — e.g. it has nothing to do with the JVM! Also, assume a Python-based graph database exists that implements the GreminVM — Gremlin-Java can easily talk to it via ByteCode.

To ensure ByteCode generation is not costly, here are the runtimes for construction and compilation of a “fairly complex” traversal in both master/ and TINKERPOP-1278/

master/

gremlin> clock(5000){g.V().repeat(out()).times(2).as("a").union(both(),out()).dedup().as("b").select("a","b").identity() }
==>0.004184923
gremlin> clock(5000){g.V().repeat(out()).times(2).as("a").union(both(),out()).dedup().as("b").select("a","b").identity().applyStrategies() }
==>0.0264354126

TINKERPOP-1278/

gremlin> clock(5000){g.V().repeat(out()).times(2).as("a").union(both(),out()).dedup().as("b").select("a","b").identity() }
==>0.0048526357999999995
gremlin> clock(5000){g.V().repeat(out()).times(2).as("a").union(both(),out()).dedup().as("b").select("a","b").identity().applyStrategies() }
==>0.0245924514

Finally, there are various entailments that come from this ByteCode representation that have started to surface in my mind:

1. We can now optimize at the ByteCode level and not at the step level — ByteCodeStrategies could be a new TraversalStrategy class that comes before DecorationStrategies.
2. Gremlin-Java can support bindings so that its Gremlin-Groovy (e.g.) compiled form uses bindings. How? By simply rewriting the ByteCode prior to compilation and replacing values with variables!
3. GraphSON can now easily support Traversal serialization — ByteCode in JSON is natural. Gremlin-GraphSON anyone? (get it?)
[g, V : [1], outE : [created], inV : repeat : [out : [created], in : [created]], times : 5], valueMap : [name, age]]

This is what makes Gremlin so powerful — its syntax is crazy simple as its just functions and thus, it can naturally exist in any language — even XML! … ByteCode is what is going to free Gremlin from the JVM…as we are already now on the CPython VM with relative ease:

PythonGraphTraversal
JythonTranslator (Gremlin-Python to Gremlin-Jython)
GroovyTranslator (Gremlin-Python to Gremlin-Groovy)

*** NOTE: I have yet to introduce the ByteCode concepts to gremlin-python/ package so use your imagination when seeing how PythonGraphTraversal will construct ByteCode and the respective translators will translate it. Finally, realize that people can now compile Python ByteCode into Python-based steps and thus, Gremlin can live and execute on the Python VM against Python-based graph systems. Its really that easy (though, lots of work to implement the 30 some standard steps in Python). What this means though is that, in the future, Gremlin can just move between languages/VMs … between Python, JVM, Ruby, JavaScript, C, etc.-based graph processing systems. One language, tailored to its host, and agnostic to the underlying virtual machine.

Enjoy,
Marko.

Marko Rodriguez

unread,
Jul 2, 2016, 1:25:38 PM7/2/16
to gremli...@googlegroups.com, d...@tinkerpop.apache.org
HI,

For fun, I created a play "toString()” that emits a JSON representation of ByteCode.

g.V().as("a").out().repeat(both("created")).times(10).as("b").values("name”).range(0,2).select("a", "b").by("name”)

becomes —>

[["withComputer",[]],["V",[]],["as",["a"]],["out",[]],["repeat",[[["both",["created"]]]]],["times",[10]],["as",["b"]],["values",["name"]],["range",[0,2]],["select",["a","b"]],["by",["name"]]]

which in “pretty print” JSON is:

[
    [
        "withComputer",
        []
    ],
    [
        "V",
        []
    ],
    [
        "as",
        [
            "a"
        ]
    ],
    [
        "out",
        []
    ],
    [
        "repeat",
        [
            [
                [
                    "both",
                    [
                        "created"
                    ]
                ]
            ]
        ]
    ],
    [
        "times",
        [
            10
        ]
    ],
    [
        "as",
        [
            "b"
        ]
    ],
    [
        "values",
        [
            "name"
        ]
    ],
    [
        "dedup",
        []
    ],
    [
        "select",
        [
            "a",
            "b"
        ]
    ],
    [
        "by",
        [
            "name"
        ]
    ]
]

Which then via gets translated into Python via PythonTranslator as:

g.V().as('a').out().repeat(both('created')).times(10).as('b').name[0,2].select('a','b').by('name’)

Pretty simple…. once we get a standard ?JSON? serialization format for ByteCode, then we can simply have one Translator for each Gremlin variant — ByteCode -> Gremlin-XXX. What we need to do proper is typing — long, String, double, float, P, Order, Lambda, etc. Once that is square we should be good.

Take care,
Marko.

Stephen Mallette

unread,
Jul 4, 2016, 7:38:05 AM7/4/16
to Gremlin-users, d...@tinkerpop.apache.org
We're pretty close to getting GraphSON 2.0 merged for 3.2.1 on TINKERPOP-1274:


2.0 ensures full typing, makes it easier for non-jvm languages to generate/consume it and has a smaller byte size footprint. It would be interesting to see how the JSON for a traversal would look using that.




--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/C8DF5760-9CC0-41A4-9CE6-3BA025CF4BE6%40gmail.com.

For more options, visit https://groups.google.com/d/optout.

Benjamin Anderson

unread,
Jul 5, 2016, 10:39:03 AM7/5/16
to gremli...@googlegroups.com
Hi Marko, this is neat stuff. Do you have any designs on making the
"ByteCode" construct the source of truth for Gremlin semantics?

That is to say, generating an AST is nice but I would be hesitant to
build and support a non-JVM implementation of Gremlin unless the
semantics of the bytecode was specified and the JVM implementation was
simply the reference implementation. As you suggest it'd be
significant work to implement the standard steps in, e.g., Python;
it'd also be significant work to keep up with the evolution of the JVM
implementation.

Cheers,
--
b

Marko Rodriguez

unread,
Jul 6, 2016, 5:23:38 PM7/6/16
to gremli...@googlegroups.com
Hi,

Hi Marko, this is neat stuff. Do you have any designs on making the
"ByteCode" construct the source of truth for Gremlin semantics?

Yes and no. Bytecode will have two “standard serializations” (GraphSON and Gryo). From there, any gremlin-xxx will have a “new Bytecode(json)”-style constructor to create a Bytecode object in the respective language. From there, the respective language can construct a traversal object as it sees fit (either via reflection or via script string creation). For instance:

JavaTranslator uses reflection:
GroovyTranslator uses script string: 

Next, the only reason you would need to use GroovyTranslator is because of a lambda. If the Bytecode you submit to GremlinServer (or any RemoteConnection implementation) doesn’t have lambdas in it, it would be more efficient to use JavaTranslator and thus, bypass the overhead of String compilation and Groovy dispatch-based meta-classes.

That is to say, generating an AST is nice but I would be hesitant to
build and support a non-JVM implementation of Gremlin unless the
semantics of the bytecode was specified and the JVM implementation was
simply the reference implementation.

So the semantics of the byte code are specified by the semantics of the step(arguments). 

out(“created”)
V()
path()
etc.

 TinkerPop’s ProcessStandardSuite and ProcessComputerSuite verify the semantics of every step and thus, can be used to verify that the Bytecode generated in another language (e.g. gremlin-python) has valid semantics. Here is how I verify Gremlin-Python generated Bytecode.


Moreover, via some trickery, it would be possible to use TinkerPop’s same test infrastructure to verify the Gremlin-Python native traversals (via step-code implementations in Python) are valid.?? would have to think on that…. :/

The general thinking is this:

Graph -> data in the respective VM (e.g. JVM, CPython, V8, etc.).
Traversal -> program in the respective VM.
Steps -> machine code in the respective VM (think assembly)
Bytecode -> virtual machine code for consumption by any VM (think Java bytecode)

If a traversal’s evaluation does not require a translation (e.g. no RemoteStrategy), then the Bytecode does nothing. In fact, Gremlin-Java Traversals generate both “machine code” and “bytecode” during construction. Thus, there is no need for a runtime or JIT interpreter. This costs memory as you have two representations of the traversal, but it speeds up execution (—traversals are not large like programs in a programming language so their size is relatively small and thus worth the memory cost).

As you suggest it'd be
significant work to implement the standard steps in, e.g., Python;
it'd also be significant work to keep up with the evolution of the JVM
implementation.

Yes. It would take some dedication. However, the beauty is that if this were done then Gremlin-XXX and respective graph databases/processors written in XXX language can intercommunicate with Gremlin-YYY via Bytecode. That is, Gremlin-Java traversals can talk to a Gremlin-Python graph database. Gremlin-Python traversals can talk to a Gremlin-JavaScript graph database (or a Gremlin-Go graph database (thinking Google’s Cayley), etc.).

The cool thing about Gremlin is that its really simple to implement a respective VM in a programming language. There really aren’t that many moving parts. In fact, Gremlin-Python is currently stubbed so that it can, if someone was so inclined, be extended with step-code (“machine code”) and not just bytecode creation.

RemoteGraph only right now — Gremlin-Python can only talk to graph databases/processors outside of Python (i.e. JVM-based systems)
TraversalStrategies
TraversalStrategy
RemoteStrategy
Bytecode

…as you can see Gremlin-Python is mirroring Gremlin-Java…. Pretty neat eh?

Finally, check out this Python session using Gremlin-Python to talk to GremlinServer (JVM).

See line #28 and #31 … the Gremlin-Python API is identical to Gremlin-Java. 

If we keep this pattern going, then Gremlin will nicely extend to all languages… its pretty trivial to simply provide a Bytecode traversal in Gremlin-XXX. Not much code in Gremlin-Python:


…where GraphTraversal is source generated via Java reflection.
— creates —>
… so fat fingering all that stuff in is not necessary.

Easy peasy.

Any who, hope that provides a good rundown of this line of work and where we could take it.

Benjamin Anderson

unread,
Jul 14, 2016, 10:11:57 PM7/14/16
to gremli...@googlegroups.com
Thanks for the reply, Marko - it ended up lost in the depths of my inbox...

> TinkerPop’s ProcessStandardSuite and ProcessComputerSuite verify the semantics of every step and thus, can be used to verify that the Bytecode generated in another language (e.g. gremlin-python) has valid semantics.

This is what I'm concerned about/interested in. I'm reminded of the
issues that JRuby/Rubinius have long dealt with - where the source of
truth for the semantics of the language is/was MRI's test suite, which
any alternative implementations must conform to, rather than a clear
and agreed-upon specification. I'll admit that my familiarity with the
TP test suite is somewhat limited, but the discussion of types in the
GraphSON 2.0 thread gives me the impression that there might be
undefined behavior lurking. ;)

Cheers,
--
b
> https://groups.google.com/d/msgid/gremlin-users/5DDA0A95-7DC7-478A-9463-A33055F618F6%40gmail.com.

Dylan Millikin

unread,
Jul 19, 2016, 11:39:12 AM7/19/16
to gremli...@googlegroups.com
I'm really late to the party here but I've been keeping these threads religiously in order to get to them eventually. I'm still catching up on my inbox and the series of discussions around GraphSON 2.0 + GLVs + Bytecode so I might ask something that is answered elsewhere, but here I go.

The premise for this is really cool. I can see Bytecode going a really long way as it for one can completely remove string manipulation for GLVs but also allow Gremlin compatible graphs in other languages. 

Like Benjamin mentioned. I feel like a clear specification would be a requirement here. Something along the lines of BNF grammar for declarative languages. But in this case for gremlin, not really sure how that would work.... Sounds like a headache. It would definitely help in translating from gremlin->other query language. 

Also, How are lambdas currently being handled? I feel like there's another thread somewhere about this so I'll go have a look. 

Cheers.

Mark Henderson

unread,
Jul 19, 2016, 12:19:20 PM7/19/16
to Gremlin-users
This is interesting and should make interoperability a bit easier. Is it right to assume that binding parameters is unnecessary? And will sending byte code in stead of Gremlin-Groovy to the Gremlin Server be an option in the near future?

Stephen Mallette

unread,
Jul 19, 2016, 3:40:16 PM7/19/16
to Gremlin-users
This is interesting and should make interoperability a bit easier. Is it right to assume that binding parameters is unnecessary?

Bindings still have importance and they are effectively part of the Bytecode sent to Gremlin Server. Gremlin Server is going to play an important part in determining what to do with the Bytecode sent to it. It will analyze the it and then decide whether a ScriptEngine is required or if it can deserialize the Bytecode directly to a Java Traversal. If the latter, I don't think the bindings will matter too much. However, if it can't go directly to a Java Traversal then a ScriptEngine will be required (e.g. write some Gremlin in python with a lambda) and the bindings will end up being useful in the same way they always have.
 
And will sending byte code in stead of Gremlin-Groovy to the Gremlin Server be an option in the near future?

It's already working in TINKERPOP-1278 branch. the Java RemoteGraph has been altered to support Bytecode (note the server is still backward compatible to accept a java serialized Traversal object so nothing should be broken there). We still need to hook up gremlin-python to do this - it is still using the "groovy string" method. I wouldn't say the work is "complete" but at least we can communicate over Bytecode at this point.



Marko Rodriguez

unread,
Jul 19, 2016, 5:34:44 PM7/19/16
to gremli...@googlegroups.com
Hello,

Also, How are lambdas currently being handled? I feel like there's another thread somewhere about this so I'll go have a look. 

Currently, a lambda is serialized as:

{ @type:lambda, arguments:1, language:gremlin-python, value:”lambda x: len(x.get().value(‘name’))” }

So when a RemoteConnection receives Traversal bytecode, it will analyze it to determine if there are lambdas. If there are no lambda it will use Gremlin-Java as the language given that it doesn’t need to create a string representation of the Gremlin traversal, it just uses Java reflection:


However, if it sees a lambda, then it knows which “language” (ScriptEngine) to use. It will then generate a String representation of the traversal in the respective language.


HOWEVER, I’m working to make it so that the ScriptEngine is used JUST for the lambda and thus, Gremlin-Java is used for everything else. What does that mean?

g.V().map(new ScriptEngineLambda(“gremlin-python”, “lambda x : len(x.get().value(‘name’))”).sum()

Gremlin-Java is everything and the lambda is just a wrapper to the ScriptEngine.

HTH,
Marko.

Dylan Millikin

unread,
Jul 20, 2016, 8:02:01 AM7/20/16
to gremli...@googlegroups.com
Thanks for the info Marko. I guess that's going to be a bit of a sore spot. Though the need for lambdas has decreased significantly in TP3

--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages