Post Job to Spark via YARN from VM on a virtual network

zachk...@gmail.com

Jun 28, 2016, 4:43:45 PM
to Gremlin-users
I have a functioning Spark cluster that can accept jobs via YARN (i.e., I can start a job by running spark-submit and specifying the master as yarn-client).
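For reference, a submission of that kind looks roughly like this (the class and jar here are just the stock SparkPi demo, not my actual job):

# Illustrative only: stock Spark example submitted in yarn-client mode
spark-submit \
  --master yarn-client \
  --num-executors 4 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/lib/spark-examples*.jar 100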

Now, I have a virtual machine set up on a virtual network with this Spark cluster. On the VM, I am working with Titan DB, whose configuration allows us to set spark.master. If I set it to local[*], everything runs well. However, if I set spark.master to yarn-client, I get the following error:

java.lang.IllegalStateException: java.lang.ExceptionInInitializerError
  at org.apache.tinkerpop.gremlin.process.computer.traversal.step.map.ComputerResultStep.processNextStart(ComputerResultStep.java:82)
  at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.hasNext(AbstractStep.java:140)
  at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.hasNext(DefaultTraversal.java:117)
  at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:215)
  at org.apache.tinkerpop.gremlin.console.Console$_closure3.doCall(Console.groovy:205)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
  at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:324)
  at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:292)
  at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1016)
  at org.codehaus.groovy.tools.shell.Groovysh.setLastResult(Groovysh.groovy:441)
  at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:215)
  at org.codehaus.groovy.tools.shell.Groovysh.execute(Groovysh.groovy:185)
  at org.codehaus.groovy.tools.shell.Shell.leftShift(Shell.groovy:119)
  at org.codehaus.groovy.tools.shell.ShellRunner.work(ShellRunner.groovy:94)
  at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$work(InteractiveShellRunner.groovy)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
  at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:324)
  at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1207)
  at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:130)
  at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:150)
  at org.codehaus.groovy.tools.shell.InteractiveShellRunner.work(InteractiveShellRunner.groovy:123)
  at org.codehaus.groovy.tools.shell.ShellRunner.run(ShellRunner.groovy:58)
  at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$run(InteractiveShellRunner.groovy)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
  at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:324)
  at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1207)
  at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:130)
  at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:150)
  at org.codehaus.groovy.tools.shell.InteractiveShellRunner.run(InteractiveShellRunner.groovy:82)
  at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:215)
  at org.apache.tinkerpop.gremlin.console.Console.<init>(Console.groovy:144)
  at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:215)
  at org.apache.tinkerpop.gremlin.console.Console.main(Console.groovy:303)
Caused by: java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError
  at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
  at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
  at org.apache.tinkerpop.gremlin.process.computer.traversal.step.map.ComputerResultStep.processNextStart(ComputerResultStep.java:80)
  ... 44 more
Caused by: java.lang.ExceptionInInitializerError
  at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1873)
  at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:105)
  at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:180)
  at org.apache.spark.SparkEnv$.create(SparkEnv.scala:308)
  at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:240)
  at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
  at org.apache.tinkerpop.gremlin.hadoop.process.computer.spark.SparkGraphComputer.lambda$submit$31(SparkGraphComputer.java:111)
  at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
  at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1582)
  at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
  at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
  at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
  at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: org.apache.spark.SparkException: Unable to load YARN support
  at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:211)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:206)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
  ... 14 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:264)
  at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:207)
  ... 16 more

Clearly, there is some configuration to do (I believe on the VM end). I suspect we will have to install YARN on the VM and have that YARN client communicate with the one on the Spark cluster, but I am not sure. (For clarification, the VM doesn't have Spark or Hadoop installed.) Can you point me in the right direction or tell me how to configure the VM and/or Spark so I can successfully submit this job? I can provide any more information that might be helpful.
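For what it's worth, the deepest "Caused by" above is a ClassNotFoundException for org.apache.spark.deploy.yarn.YarnSparkHadoopUtil, so it looks like whatever Spark jars Titan puts on the classpath were built without YARN support. A quick sanity check, assuming a Spark 1.x layout with an assembly jar (adjust the path to wherever the Spark jars actually live):

# Zero hits means the assembly has no YARN classes compiled in
unzip -l $SPARK_HOME/lib/spark-assembly*.jar | grep YarnSparkHadoopUtil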

HadoopMarc

Jun 29, 2016, 7:11:10 AM
to Gremlin-users
Hi Zachk...

You are right that you have to install the Hadoop + Spark + YARN clients matching the versions on your cluster. After that, see the post below for some guidelines for your run script (or Gremlin REPL wrapper):

https://groups.google.com/forum/#!searchin/gremlin-users/HadoopMarc/gremlin-users/lBdhRpRTMys/h7dwMbZkAgAJ
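Once the clients are installed, it is worth confirming their versions line up with the cluster; the standard version commands are enough for that:

# All three should report versions matching the cluster
hadoop version
yarn version
spark-submit --version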

Regards,    Marc

On Tuesday, June 28, 2016 at 10:43:45 PM UTC+2, zachk...@gmail.com wrote:

zachk...@gmail.com

Jun 29, 2016, 2:18:52 PM
to Gremlin-users
Thanks so much for the reply, Marc. I decided to change course a bit: I am now trying to install Gremlin on the Spark cluster directly, since that already has Hadoop, Spark, and YARN installed, but I am running into issues.

Here's what I have done:

1. Installed Gremlin 3.2.0-incubating from https://www.apache.org/dyn/closer.lua/incubator/tinkerpop/3.2.0-incubating/apache-gremlin-console-3.2.0-incubating-bin.zip
2. Created a script called gremlinhdp.sh (taken from the link you posted).
The only changes I made were to edit HDP_VERSION and to add a line setting JAVA_HOME (so it references Java 1.8 and not Java 1.7, which is also installed).


#!/bin/bash

HDP_VERSION=2.4.2.0-258

if [[ `ls -l /usr/hdp/current` != *"$HDP_VERSION"* ]]
then
  echo "HDP_VERSION config in this script does not match active HDP stack"
  exit 1
fi

export HADOOP_HOME=/usr/hdp/current/hadoop-client
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export YARN_HOME=/usr/hdp/current/hadoop-yarn-client
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export SPARK_HOME=/usr/hdp/current/spark-client
export SPARK_CONF_DIR=$SPARK_HOME/conf

source "$HADOOP_CONF_DIR"/hadoop-env.sh
source "$YARN_CONF_DIR"/yarn-env.sh

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export JAVA_OPTIONS="$JAVA_OPTIONS -Djava.library.path=/usr/hdp/current/hadoop-client/lib/native -Dtinkerpop.ext=ext -Dlog4j.configuration=conf/log4j-console.properties -Dhdp.version=$HDP_VERSION"

# for gremlin to use spark plugin
GREMLINHOME=$HOME/lib/apache-gremlin-console-3.2.0-incubating
export HADOOP_GREMLIN_LIBS=$GREMLINHOME/ext/spark-gremlin/plugin:$GREMLINHOME/ext/hadoop-gremlin/plugin:$GREMLINHOME/ext/gremlin-groovy/plugin:$GREMLINHOME/lib

# for gremlin to connect to cluster hdfs
export HADOOP_HOME=/usr/hdp/current/hadoop-client/client
export CLASSPATH=$HADOOP_HOME/*:$HADOOP_HOME/lib/*:$HADOOP_HOME/etc/hadoop

# for gremlin to connect to cluster yarn with spark
export YARN_HOME=/usr/hdp/current/hadoop-yarn-client
export CLASSPATH=$GREMLINHOME/lib/*:$YARN_HOME/*:$YARN_CONF_DIR:$SPARK_HOME/lib/*:$SPARK_CONF_DIR:$CLASSPATH

cd $GREMLINHOME
exec $GREMLINHOME/bin/gremlin.sh $*

3. Took the hadoop-script.properties file from your link and replaced mine with it. The only change I made was to copy everything from spark-defaults.conf and paste it at the end (I am deploying this cluster through Microsoft Azure, so I didn't actually add anything to spark-defaults.conf myself; it was pre-filled when deployed).

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat
gremlin.hadoop.inputLocation=dummyin
#gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
gremlin.spark.graphOutputRDD=org.apache.tinkerpop.gremlin.spark.structure.io.OutputFormatRDD
gremlin.hadoop.outputLocation=dummyout
gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
gremlin.spark.persistStorageLevel=DISK_ONLY
gremlin.spark.persistContext=true
gremlin.hadoop.jarsInDistributedCache=true

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=yarn-client
spark.app.id=gremlin
spark.ui.port=4051
spark.yarn.appMasterEnv.CLASSPATH=$CLASSPATH:/usr/hdp/current/hadoop-mapreduce-client/*:/usr/hdp/current/hadoop-mapreduce-client/lib/*
spark.executor.extraJavaOptions=-Dhdp.version=2.4.2.0-258
spark.executor.instances=11


spark.executor.memory=8g
spark.executor.userClassPathFirst=true
spark.storage.memoryFraction=0.4
spark.shuffle.memoryFraction=0.4
spark.yarn.executor.memoryOverhead=4096
#spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer


# Below copy additional spark configs from spark-defaults.conf
# spark-defaults.conf is read from the spark-submit java classes, not by gremlin
spark.driver.extraJavaOptions -Dhdp.version= -Detwlogger.component=sparkdriver -DlogFilter.filename=SparkLogFilters.xml -DpatternGroup.filename=SparkPatternGroups.xml -Dlog4jspark.root.logger=INFO,console,DRFA,ETW,Anonymizer -Dlog4jspark.log.dir=/var/log/sparkapp -Dlog4jspark.log.file=sparkdriver_${user.name}.log -Dlog4j.configuration=file:/usr/hdp/current/spark-client/conf/log4j.properties -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
spark.eventLog.dir wasb:///hdp/spark-events
spark.eventLog.enabled true
spark.executor.cores 2
spark.executor.extraJavaOptions -Dhdp.version= -Detwlogger.component=sparkexecutor -DlogFilter.filename=SparkLogFilters.xml -DpatternGroup.filename=SparkPatternGroups.xml -Dlog4jspark.root.logger=INFO,console,DRFA,ETW,Anonymizer -Dlog4jspark.log.dir=/var/log/sparkapp -Dlog4jspark.log.file=sparkexecutor_${user.name}.log -Dlog4j.configuration=file:/usr/hdp/current/spark-client/conf/log4j.properties -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
spark.executor.instances 2
spark.executor.memory 3072m
spark.history.fs.logDirectory wasb:///hdp/spark-events
spark.history.kerberos.keytab none
spark.history.kerberos.principal none
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.master yarn-client
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address hn0-sparku.5kqe1203omgenjv5o0uu4d5nfg.dx.internal.cloudapp.net:18080
spark.yarn.jar local:///usr/hdp/current/spark-client/lib/spark-assembly.jar
spark.yarn.max.executor.failures 3
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.services
spark.yarn.submit.file.replication 3



When I run bash gremlinhdp.sh:
gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-script.properties')
GraphFactory could not find [org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph] - Ensure that the jar is in the classpath


What's wrong? Is there more I need to install, or am I not setting the classpath correctly?
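In case it helps with diagnosing, the first thing I plan to check is whether the plugin directories referenced by HADOOP_GREMLIN_LIBS in my script actually contain jars (empty output would presumably mean hadoop-gremlin was never installed into this console):

# The plugin dirs referenced by HADOOP_GREMLIN_LIBS in gremlinhdp.sh
ls $GREMLINHOME/ext/hadoop-gremlin/plugin
ls $GREMLINHOME/ext/spark-gremlin/plugin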

zachk...@gmail.com

Jun 29, 2016, 2:36:31 PM
to Gremlin-users
Another, possibly related, issue: if I try to install hadoop-gremlin, I get an error as well.

$ bash gremlinhdp.sh
...
gremlin> :install org.apache.tinkerpop hadoop-gremlin 3.2.0-incubating
==>Error grabbing Grapes -- [unresolved dependency: com.github.jeremyh#jBCrypt;jbcrypt-0.4: not found]

This is odd, since this file exists: /home/sparkssh/lib/apache-gremlin-console-3.2.0-incubating/lib/jBCrypt-jbcrypt-0.4.jar.

zachk...@gmail.com

Jun 29, 2016, 4:08:54 PM
to Gremlin-users
I was able to get these issues answered (for posterity: https://stackoverflow.com/questions/38108198/unable-to-install-hadoop-and-spark-through-gremlin-shell).

I am now having an issue loading the hadoop-gremlin plugin.

gremlin> :plugin use tinkerpop.hadoop
No FileSystem for scheme: wasb

Any help would be appreciated.

Jason Plurad

Jun 29, 2016, 4:55:09 PM
to Gremlin-users
What are your HADOOP environment variables? How about your hadoop core-site.xml? I have no idea where "wasb" is coming from.

Keep in mind that Titan 1.0 is built using TinkerPop 3.0.1. As far as I know, it is not compatible with TinkerPop 3.2.0. Are you building the titan11 branch from source?
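If you do go the source route, the build is the usual Maven sequence; the repo URL and flags below are from memory, so double-check them against the Titan README:

# Sketch: build the titan11 branch locally, skipping tests
git clone https://github.com/thinkaurelius/titan.git
cd titan
git checkout titan11
mvn clean install -DskipTests=true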

-- Jason

zachk...@gmail.com

Jun 29, 2016, 4:59:41 PM
to Gremlin-users
Thanks for the reply, Jason. I didn't realize there was a versioning issue. Though I don't suspect it is the direct cause of this problem, I'm going to switch over to 3.0.1 to avoid trouble later on.

By the way, wasb is Windows Azure Storage Blob. It turns out some jars are required on the classpath to avoid the exception.
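For posterity, the jars in question are the Hadoop Azure filesystem driver (which registers the wasb:// scheme) plus its Azure storage dependency. On my HDP nodes they sit roughly here; treat the exact paths and names as illustrative, since they vary by HDP version:

# hadoop-azure provides org.apache.hadoop.fs.azure.NativeAzureFileSystem (the wasb scheme);
# the lib/* classpath wildcard picks up the azure-storage dependency alongside everything else there
export CLASSPATH=$CLASSPATH:/usr/hdp/current/hadoop-client/hadoop-azure.jar:/usr/hdp/current/hadoop-client/lib/*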

HadoopMarc

Jun 30, 2016, 8:04:54 AM
to Gremlin-users
The missing Grapes issue has to do with the Grape configuration; search for "grapes" on this forum or in the Gremlin reference documentation.
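If memory serves, the usual fix is a custom ~/.groovy/grapeConfig.xml that gives Grape extra repositories to resolve from. The variant below is adapted from the TinkerPop reference documentation; verify it against the docs for your version before relying on it:

mkdir -p ~/.groovy
cat > ~/.groovy/grapeConfig.xml <<'EOF'
<ivysettings>
  <settings defaultResolver="downloadGrapes"/>
  <resolvers>
    <chain name="downloadGrapes">
      <!-- jars already fetched into the local grapes cache -->
      <filesystem name="cachedGrapes">
        <ivy pattern="${user.home}/.groovy/grapes/[organisation]/[module]/ivy-[revision].xml"/>
        <artifact pattern="${user.home}/.groovy/grapes/[organisation]/[module]/[type]s/[artifact]-[revision].[ext]"/>
      </filesystem>
      <!-- local maven repo, then public repositories -->
      <ibiblio name="localm2" root="file:${user.home}/.m2/repository/" checkmodified="true" changingPattern=".*" changingMatcher="regexp" m2compatible="true"/>
      <ibiblio name="jcenter" root="http://jcenter.bintray.com/" m2compatible="true"/>
      <ibiblio name="ibiblio" m2compatible="true"/>
    </chain>
  </resolvers>
</ivysettings>
EOF

After writing the file, restart the Gremlin console and retry the :install.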

You can also use an older titan11 fork which was posted on the Titan forum a few months ago (from David Graben, if I remember correctly). It comes with hadoop-gremlin and spark-gremlin pre-installed to make things easy. I still use it.

Marc

On Wednesday, June 29, 2016 at 10:59:41 PM UTC+2, zachk...@gmail.com wrote: