I spent some time this morning seeing whether I could execute a Spark-based traversal remotely in Gremlin Server - and it worked! I didn't really have to make any code changes for it to happen, though I did make some minor adjustments to streamline a few things. All in all, there wasn't much to it.
Here's my step-by-step of the process. First of all, these instructions are based on the latest 3.1.0-SNAPSHOT (master branch). You need to be sure you have Hadoop 2.x running in pseudo-distributed mode - in other words, be sure that you can execute a Spark traversal locally from Gremlin Console (if that works, it should work for Gremlin Server).
To get started, you'll need to open two terminals - one for Gremlin Server and the other for Gremlin Console. I had to set the CLASSPATH in both:
export CLASSPATH=/hadoop-2.7.1/etc/hadoop
and in the appropriate terminal set HADOOP_GREMLIN_LIBS (the first line below is for the console terminal, the second for the server terminal):
export HADOOP_GREMLIN_LIBS=/apache-gremlin-console-3.1.0-SNAPSHOT/spark-gremlin/lib
export HADOOP_GREMLIN_LIBS=/apache-gremlin-server-3.1.0-SNAPSHOT/ext/spark-gremlin/lib
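If you end up doing this setup more than once, the exports above could be wrapped in a small shell function. This is only a sketch: gremlin_env is a hypothetical helper name of my choosing, and the paths are simply the ones from my machine shown above, so adjust them to wherever your Hadoop and TinkerPop distributions live.

```shell
# Sketch: set the environment for one of the two terminals.
# gremlin_env is a hypothetical helper; the paths are the ones used above.
gremlin_env() {
  role="$1"
  case "$role" in
    console) libs=/apache-gremlin-console-3.1.0-SNAPSHOT/spark-gremlin/lib ;;
    server)  libs=/apache-gremlin-server-3.1.0-SNAPSHOT/ext/spark-gremlin/lib ;;
    *) echo "usage: gremlin_env console|server" >&2; return 1 ;;
  esac
  # Both terminals need the Hadoop configuration directory on the CLASSPATH
  export CLASSPATH=/hadoop-2.7.1/etc/hadoop
  # Only the lib directory differs between console and server
  export HADOOP_GREMLIN_LIBS="$libs"
}
```

Define the function in each terminal (or source it from a file), then run "gremlin_env console" in one and "gremlin_env server" in the other.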
I then started up bin/gremlin.sh and installed the Spark plugin:
gremlin> :install org.apache.tinkerpop spark-gremlin 3.1.0-SNAPSHOT
I restarted the console as instructed and activated my plugins:
gremlin> :plugin use tinkerpop.hadoop
==>tinkerpop.hadoop activated
gremlin> :plugin use tinkerpop.spark
==>tinkerpop.spark activated
I then copied my graph data to HDFS:
gremlin> hdfs.copyFromLocal('data/tinkerpop-modern.kryo','tinkerpop-modern.kryo')
==>null
==>rw-r--r-- smallette supergroup 781 tinkerpop-modern.kryo
Then we switch gears to the terminal that will run Gremlin Server and "install" Spark:
bin/gremlin-server.sh -i org.apache.tinkerpop spark-gremlin 3.1.0-SNAPSHOT
which will copy down the appropriate dependencies the same way the Gremlin Console :install command does. Then we start Gremlin Server with:
bin/gremlin-server.sh conf/gremlin-server-spark.yaml
This new config file is now packaged with the Gremlin Server distribution when you build it. It's pretty well documented and should point you to how everything fits together.
Now in the Gremlin Server terminal you should see the standard startup logging, which should include some lines like this:
[INFO] GraphManager - Graph [graph] was successfully configured via [conf/hadoop-gryo.properties].
...
[INFO] GremlinExecutor - Initialized gremlin-groovy ScriptEngine with scripts/spark.groovy
...
[INFO] ServerGremlinExecutor - A GraphTraversalSource is now bound to [g] with graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], sparkgraphcomputer]
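If you'd rather not scan the startup output by eye, a check along these lines could look for those three lines. It's a sketch under one assumption: that you captured the server output to a file yourself. check_spark_startup is a hypothetical helper name, and the strings are taken straight from the log lines above.

```shell
# Sketch: check a captured Gremlin Server log for the startup lines shown above.
# check_spark_startup and its log file argument are hypothetical.
check_spark_startup() {
  log="$1"
  status=0
  for expected in \
    'Graph [graph] was successfully configured via [conf/hadoop-gryo.properties]' \
    'Initialized gremlin-groovy ScriptEngine with scripts/spark.groovy' \
    'A GraphTraversalSource is now bound to [g]'
  do
    # -F matches the text literally, so the square brackets are not treated as a regex
    if grep -qF -- "$expected" "$log"; then
      echo "ok:      $expected"
    else
      echo "missing: $expected"
      status=1
    fi
  done
  # Non-zero if any expected line was not found
  return "$status"
}
```

For example, if the server output was redirected to a file named server.log (a name picked only for illustration), "check_spark_startup server.log" prints one ok/missing line per expected message and returns non-zero if any is absent.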
If you see that much, you should be good to go. Head back to the Gremlin Console terminal and do:
gremlin> :remote connect tinkerpop.server conf/remote.yaml
gremlin> :> g.V().count()
==>6
gremlin> :> g.V().out().out().values('name')
==>lop
==>ripple
It was good to confirm that this works as expected. This information should be especially useful to those not on the JVM who need a way to execute OLAP-based traversals.
Stephen