TinkerPop 3.1 with Spark and YARN: easiest way to configure access to the Spark YARN jar?


Jen

Dec 21, 2015, 6:27:25 AM12/21/15
to Gremlin-users
Hi,

We are in the process of getting up and running with TinkerPop 3.1 with Spark and YARN on a Cloudera Hadoop cluster. We can connect to our cluster by setting the usual environment variables, e.g. SPARK_HOME, JAVA_HOME, HADOOP_GREMLIN_LIBS, HADOOP_HOME, HADOOP_YARN_HOME, and the CLASSPATH. However, the only way we can get Gremlin to recognize the YARN-enabled Spark jar is by copying this jar ('spark-assembly-1.5.1-hadoop2.6.0.jar') into the 'ext/spark-gremlin/plugin' directory. This jar is in our $SPARK_HOME/lib directory and is on the CLASSPATH. With this jar copied, we can successfully load and query the example graph on our cluster with Spark-Gremlin and YARN. Without it, we get an error when running e.g. g.V().count() (see below).

Is there a recommended configuration to get access to this jar, which would be a cleaner solution than copying it?
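In case it helps, this is a sketch of the jar-copy workaround described above. The SPARK_HOME and GREMLIN_HOME paths are examples only; substitute your own install locations.

```shell
# Workaround sketch: copy the YARN-enabled Spark assembly into the
# console's spark-gremlin plugin directory.
# Both paths below are examples -- adjust for your installation.
SPARK_HOME=${SPARK_HOME:-/usr/lib/spark}
GREMLIN_HOME=${GREMLIN_HOME:-/opt/apache-gremlin-console-3.1.0-incubating}
cp "$SPARK_HOME/lib/spark-assembly-1.5.1-hadoop2.6.0.jar" \
   "$GREMLIN_HOME/ext/spark-gremlin/plugin/"
```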

Jen

gremlin> g.V().count()
java.lang.ExceptionInInitializerError
java.lang.IllegalStateException: java.lang.ExceptionInInitializerError
        at org.apache.tinkerpop.gremlin.process.computer.traversal.step.map.ComputerResultStep.processNextStart(ComputerResultStep.java:82)
        at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.hasNext(AbstractStep.java:140)
        at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.hasNext(DefaultTraversal.java:144)
        at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:215)
        at org.apache.tinkerpop.gremlin.console.Console$_closure3.doCall(Console.groovy:205)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:324)
        at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:292)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1016)
        at org.codehaus.groovy.tools.shell.Groovysh.setLastResult(Groovysh.groovy:441)
        at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:215)
        at org.codehaus.groovy.tools.shell.Groovysh.execute(Groovysh.groovy:185)
        at org.codehaus.groovy.tools.shell.Shell.leftShift(Shell.groovy:119)
        at org.codehaus.groovy.tools.shell.ShellRunner.work(ShellRunner.groovy:94)
        at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$work(InteractiveShellRunner.groovy)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:324)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1207)
        at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:130)
        at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:150)
        at org.codehaus.groovy.tools.shell.InteractiveShellRunner.work(InteractiveShellRunner.groovy:123)
        at org.codehaus.groovy.tools.shell.ShellRunner.run(ShellRunner.groovy:58)
        at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$run(InteractiveShellRunner.groovy)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:324)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1207)
        at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:130)
        at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:150)
        at org.codehaus.groovy.tools.shell.InteractiveShellRunner.run(InteractiveShellRunner.groovy:82)
        at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:215)
        at org.apache.tinkerpop.gremlin.console.Console.<init>(Console.groovy:144)
        at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:215)
        at org.apache.tinkerpop.gremlin.console.Console.main(Console.groovy:303)
Caused by: java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1887)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.step.map.ComputerResultStep.processNextStart(ComputerResultStep.java:80)
        ... 44 more
Caused by: java.lang.ExceptionInInitializerError
        at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:2042)
        at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:97)
        at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:173)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:345)
        at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
        at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:450)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
        at org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
        at org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer.lambda$submit$21(SparkGraphComputer.java:137)
        at org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer$$Lambda$50/1760378672.get(Unknown Source)
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1582)
        at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1574)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1689)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: org.apache.spark.SparkException: Unable to load YARN support
        at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:392)
        at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:387)
        at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
        ... 17 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
        at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:388)
        ... 19 more


Jason Plurad

Dec 21, 2015, 9:31:40 AM12/21/15
to Gremlin-users
Hi Jen,

I've started kicking the tires on Spark and YARN, and I've had the same experience as you -- copying the spark-assembly*.jar in seemed to be the only way I could get the job to run.

There was a previous thread similar to this, and Marc has submitted a pull request to do some better cleanup around the YARN support.
https://github.com/apache/incubator-tinkerpop/pull/170

Here's the hadoop-gryo.properties I've been using. How does it compare with yours?

spark.master=yarn-client
# name of the application, seen in the YARN cluster UI (http://c6702.ambari.apache.org:8088/cluster)
spark.app.name=tvp-tinkerpop-modern-kryo
# not sure how this is used, but setting this gets rid of this warning:
#   Using default name DAGScheduler for source because spark.app.id is not set
spark.app.id=tvp-tinkerpop-modern-kryo

# Cache the Spark jar in HDFS so that it doesn't need to be distributed each time an application runs (optional)
spark.yarn.jar=hdfs://c6701.ambari.apache.org:8020/user/ambari-qa/share/lib/spark/spark-assembly-1.5.1-hadoop2.6.0.jar

# the Spark YARN ApplicationMaster needs this to resolve the classpath it sends to the executors
spark.yarn.appMasterEnv.JAVA_HOME=/usr/jdk64/jdk1.8.0_40
spark.yarn.appMasterEnv.HADOOP_CONF_DIR=/etc/hadoop/conf
spark.yarn.appMasterEnv.SPARK_CONF_DIR=/etc/spark/conf
spark.yarn.am.extraJavaOptions=-Dhdp.version=2.3.2.0-2950 -Djava.library.path=/usr/hdp/2.3.2.0-2950/hadoop/lib/native

# the Spark executors (on the worker nodes) need this to resolve the classpath for running Spark tasks
spark.executorEnv.JAVA_HOME=/usr/jdk64/jdk1.8.0_40
spark.executorEnv.HADOOP_CONF_DIR=/etc/hadoop/conf
spark.executorEnv.SPARK_CONF_DIR=/etc/spark/conf
spark.executor.extraJavaOptions=-Dhdp.version=2.3.2.0-2950 -Djava.library.path=/usr/hdp/2.3.2.0-2950/hadoop/lib/native
spark.executor.memory=512m

# integrate with the YARN Spark History Server
spark.yarn.services=org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.yarn.historyServer.address=http://c6703.ambari.apache.org:18080
spark.history.provider=org.apache.spark.deploy.yarn.history.YarnHistoryProvider
spark.history.ui.port=18080
spark.history.kerberos.keytab=none
spark.history.kerberos.principal=none
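For the spark.yarn.jar caching, I staged the assembly in HDFS with something like the following. The HDFS directory matches the example above; adjust the namenode host, user directory, and jar version for your cluster.

```shell
# Stage the Spark assembly in HDFS so YARN doesn't distribute it on every run.
# The target directory matches the spark.yarn.jar example above -- adjust
# the user directory and jar version for your cluster.
hdfs dfs -mkdir -p /user/ambari-qa/share/lib/spark
hdfs dfs -put -f "$SPARK_HOME/lib/spark-assembly-1.5.1-hadoop2.6.0.jar" \
    /user/ambari-qa/share/lib/spark/
```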



-- Jason

Jen

Dec 22, 2015, 7:24:42 AM12/22/15
to Gremlin-users
My config is pretty basic (hadoop-graphson.properties). The relevant jars (Spark, Hadoop, YARN) are added via the CLASSPATH.

# the graph class
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
# i/o formats for graphs
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
# i/o locations
gremlin.hadoop.inputLocation=tinkerpop-modern.json
gremlin.hadoop.outputLocation=output
# if the job jars are not on the classpath of every hadoop node, then they must be provided to the distributed cache at runtime
gremlin.hadoop.jarsInDistributedCache=true

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=yarn-client
spark.yarn.appMasterEnv.JAVA_HOME=/usr/java/latest/jre
spark.executorEnv.JAVA_HOME=/usr/java/latest/jre

MeteorMarc

Dec 26, 2015, 5:40:24 AM12/26/15
to Gremlin-users
Hi Jen and Jason,

Good to see more interest in the Gremlin on Spark-YARN issue! As I noted in my pull request, there is more to this than just the properties files. My GitHub repo also contains a gremlinhdp.sh which adds configs to the classpath (so that you do not need to copy the spark-assembly). It also contains changes to the pom file to build with the Hortonworks jars (which I assume are needed to work on a secure cluster, and which also allow using spark-1.4.1).

https://github.com/vtslab/incubator-tinkerpop/blob/3.1.0-hdp-2.3.2.0-2950/gremlin-console/bin/gremlinhdp.sh
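The core idea, in a simplified sketch (this is not the actual gremlinhdp.sh, and SPARK_HOME is an example path): put the YARN-enabled assembly on the console's classpath instead of copying it into the plugin directory.

```shell
# Simplified sketch of the classpath approach (not the actual gremlinhdp.sh):
# prepend the YARN-enabled Spark assembly to the classpath the Gremlin
# Console picks up, rather than copying the jar into ext/spark-gremlin/plugin.
# SPARK_HOME is an example path -- adjust for your distribution.
SPARK_HOME=${SPARK_HOME:-/usr/hdp/current/spark-client}
export CLASSPATH="$SPARK_HOME/lib/spark-assembly-1.5.1-hadoop2.6.0.jar${CLASSPATH:+:$CLASSPATH}"
echo "$CLASSPATH"
```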

Cheers,

Marc


Jason Plurad

Dec 28, 2015, 9:05:18 AM12/28/15
to Gremlin-users
Hi Marc,

Have you been able to run with yarn-cluster instead of yarn-client as the spark.master?

-- Jason

MeteorMarc

Dec 29, 2015, 4:57:43 PM12/29/15
to Gremlin-users
Hi Jason,

I have not tried the yarn-cluster option, as I don't have permissions for that. The yarn-client and yarn-cluster modes are almost the same; the main difference is where the Spark application master runs.

Regards,    Marc
