make SparkGraphComputer work on a YARN cluster [TinkerPop 3.1.0-incubating]


Ruslan Mavlyutov

Feb 5, 2016, 6:54:52 PM
to Gremlin-users
Hi there,

I am trying to make SparkGraphComputer work on a remote Yarn cluster.

Properties file:
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.inputLocation=tinkerpop-modern.kryo
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.outputLocation=output
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
####################################
# Spark Configuration              #
####################################
spark.master=yarn-client
spark.executor.memory=1g
spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
####################################
# SparkGraphComputer Configuration #
####################################
gremlin.spark.graphInputRDD=org.apache.tinkerpop.gremlin.spark.structure.io.InputRDDFormat
gremlin.spark.graphOutputRDD=org.apache.tinkerpop.gremlin.spark.structure.io.OutputRDDFormat
gremlin.spark.persistContext=true

(for some reason it does not recognize spark.master=yarn)

HADOOP_CONF_DIR and HADOOP_HOME are set in the environment.
HADOOP_GREMLIN_LIBS is pointing to spark 1.6.0 shared libraries.
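When the console behaves as if these variables were never set, it can help to confirm from inside a JVM that they are actually visible to it. A minimal sketch (plain Java, not part of TinkerPop; the variable names are the ones mentioned above):

```java
// Minimal sketch: print the environment variables the console setup relies
// on, to confirm the JVM that launches Gremlin actually sees them.
public class EnvCheck {

    // Formats "NAME = value"; System.getenv returns null when unset.
    static String describe(String name) {
        return name + " = " + System.getenv(name);
    }

    public static void main(String[] args) {
        for (String v : new String[]{"HADOOP_HOME", "HADOOP_CONF_DIR", "HADOOP_GREMLIN_LIBS"}) {
            System.out.println(describe(v));
        }
    }
}
```

If any line prints `null`, the variable was exported in a different shell than the one that launched the console.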

Meanwhile, I get an exception while setting up the Spark environment:


gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> g = graph.traversal(computer(SparkGraphComputer))
==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
java.lang.NoSuchMethodError: org.apache.spark.launcher.CommandBuilderUtils.addPermGenSizeOpt(Ljava/util/List;)V
Display stack trace? [yN] y
java.lang.IllegalStateException: java.lang.NoSuchMethodError: org.apache.spark.launcher.CommandBuilderUtils.addPermGenSizeOpt(Ljava/util/List;)V
        at org.apache.tinkerpop.gremlin.process.computer.traversal.step.map.ComputerResultStep.processNextStart(ComputerResultStep.java:82)
        at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.hasNext(AbstractStep.java:140)
        at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.hasNext(DefaultTraversal.java:144)
        at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:215)
        at org.apache.tinkerpop.gremlin.console.Console$_closure3.doCall(Console.groovy:205)
...


The method CommandBuilderUtils.addPermGenSizeOpt does actually exist.
It seems there is some mismatch between library versions.
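A NoSuchMethodError at runtime usually means the JVM loaded a different version of the class than the one the caller was compiled against. A small reflection sketch (plain Java, not TinkerPop code) can confirm which methods the loaded class really has:

```java
// Sketch: use reflection to check whether the class the JVM actually loads
// declares the method named in a NoSuchMethodError.
import java.lang.reflect.Method;

public class ClasspathCheck {

    // Returns true if the named class declares a method with the given name.
    static boolean hasMethod(String className, String methodName) {
        try {
            for (Method m : Class.forName(className).getDeclaredMethods()) {
                if (m.getName().equals(methodName)) {
                    return true;
                }
            }
            return false;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // On the cluster you would check the Spark class from the stack trace:
        //   hasMethod("org.apache.spark.launcher.CommandBuilderUtils", "addPermGenSizeOpt")
        // Demonstrated here with JDK classes so the sketch runs anywhere:
        System.out.println(hasMethod("java.util.ArrayList", "size"));   // true
        System.out.println(hasMethod("java.util.ArrayList", "bogus"));  // false
    }
}
```

Running this inside the Gremlin console's JVM (rather than a separate one) checks the same classpath the error occurred on.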

My question: is there a clear explanation of how to run SparkGraphComputer on a remote YARN cluster (which version of Spark to use,
where to put it, which environment variables to set, etc.)?

Thank you!


Jason Plurad

Feb 5, 2016, 7:24:06 PM
to Gremlin-users
Hi Ruslan,

Welcome to the TinkerPop list!

I have some notes on running SparkGraphComputer with YARN here: https://github.com/pluradj/ambari-vagrant/tree/tp3/ubuntu14.4/tp3

TinkerPop 3.1.0 uses Spark 1.5.1 (spark-gremlin/pom.xml) so you should make sure you keep your dependencies aligned. I have run across serialization errors when trying to run SparkGraphComputer against different versions of Spark. Your stack trace is a new error for me, but I hadn't tried against Spark 1.6 previously.
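One way to check whether the intended Spark version is actually the one being picked up is to print the code source a class was loaded from; a sketch (plain Java, illustrative only):

```java
// Illustrative sketch: report which jar (code source) a class was loaded
// from, to spot a stray Spark version shadowing the intended one.
public class JarOrigin {

    // Returns the code-source location of a class, or "<bootstrap>" for
    // classes loaded by the JVM's bootstrap loader (which has no source).
    static String originOf(Class<?> c) {
        java.security.CodeSource src = c.getProtectionDomain().getCodeSource();
        return src == null ? "<bootstrap>" : src.getLocation().toString();
    }

    public static void main(String[] args) {
        // On a cluster node you would pass the class from the stack trace:
        //   originOf(org.apache.spark.launcher.CommandBuilderUtils.class)
        // Demonstrated here with classes that exist everywhere:
        System.out.println(originOf(JarOrigin.class)); // this program's classpath entry
        System.out.println(originOf(String.class));    // typically "<bootstrap>"
    }
}
```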


-- Jason

HadoopMarc

Feb 6, 2016, 7:53:10 AM
to Gremlin-users
Hi Ruslan,

Please look at the work I did on
https://github.com/vtslab/incubator-tinkerpop

in particular the pom.xml files, the startup environment

https://github.com/vtslab/incubator-tinkerpop/blob/3.1.0-hdp-2.3.2.0-2950/gremlin-console/bin/gremlinhdp.sh

and the properties files (with the jdk8 directives):

https://github.com/vtslab/incubator-tinkerpop/tree/3.1.0-hdp-2.3.2.0-2950/hadoop-gremlin/conf

Cheers,    Marc


Kedar Mhaswade

Jun 16, 2017, 12:38:27 PM
to Gremlin-users
Hi Marc,

Were you able to run Gremlin queries on a YARN cluster after you made these changes?
For some reason I am running into a slew of errors, the latest of which is [1].

Any idea?

Regards,
Kedar

[1]

[Stage 0:>                                                      (0 + 0) / 20276]org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in stage 0.0 failed 4 times, most recent failure: Lost task 18.3 in stage 0.0 (TID 45, hadoopworker632): java.lang.IllegalStateException: unread block data
        at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2449)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1385)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
