OLAP queries using SparkGraphComputer

Sai Supraj R

May 11, 2021, 10:09:21 AM5/11/21
to Gremlin-users
Hi, 

I am trying to run the SparkGraphComputer for OLAP queries. My EMR cluster and Gremlin Server are on different machines. Has anyone configured the SparkGraphComputer from Java?
I tried to configure it, but I am running into this error:

Exception in thread "main" java.lang.IllegalStateException: org.apache.spark.SparkException: Application application_1618505307369_0742 failed 2 times due to AM Container for appattempt_1618505307369_0742_000002 exited with  exitCode: -1000
Failing this attempt.Diagnostics: [2021-05-11 12:44:20.059]File file:/home/hadoop/.sparkStaging/application_1618505307369_0742/__spark_libs__147566482392876626.zip does not exist
java.io.FileNotFoundException: File file:/home/hadoop/.sparkStaging/application_1618505307369_0742/__spark_libs__147566482392876626.zip does not exist

Thanks
Sai

HadoopMarc

May 12, 2021, 1:55:20 AM5/12/21
to Gremlin-users
Hi Sai,

Can you explain more about what you did? What do the properties for opening the HadoopGraph look like? What do the environment variables for the gremlin-server.sh script look like? Did you use https://tinkerpop.apache.org/docs/current/recipes/#olap-spark-yarn for inspiration?

Best wishes,    Marc

On Tuesday, May 11, 2021 at 16:09:21 UTC+2, suprajr...@gmail.com wrote:

Sai Supraj R

May 12, 2021, 8:37:48 AM5/12/21
to gremli...@googlegroups.com
Hi,

This is the spark-yarn.properties file:

FYI, this file is taken from the AWS EMR spark-defaults.conf.
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=false
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer

gremlin.hadoop.inputLocation=tinkerpop-modern.json
gremlin.hadoop.outputLocation=output

storage.backend=cql
storage.hostname=*****
storage.cql.keyspace=*****
storage.batch-loading=true
spark.master=yarn-client
spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/docker/usr/lib/hadoop-lzo/lib/*:/docker/usr/lib/hadoop/hadoop-aws.jar:/docker/usr/share/aws/aws-java-sdk/*:/docker/usr/share/aws/emr/emrfs/conf:/docker/usr/share/aws/emr/emrfs/lib/*:/docker/usr/share/aws/emr/emrfs/auxlib/*:/docker/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/docker/usr/share/aws/emr/security/conf:/docker/usr/share/aws/emr/security/lib/*:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/docker/usr/lib/hadoop-lzo/lib/*:/docker/usr/lib/hadoop/hadoop-aws.jar:/docker/usr/share/aws/aws-java-sdk/*:/docker/usr/share/aws/emr/emrfs/conf:/docker/usr/share/aws/emr/emrfs/lib/*:/docker/usr/share/aws/emr/emrfs/auxlib/*:/docker/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/docker/usr/share/aws/emr/security/conf:/docker/usr/share/aws/emr/security/lib/*:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs:///var/log/spark/apps
spark.history.fs.logDirectory=hdfs:///var/log/spark/apps
spark.sql.warehouse.dir=hdfs:///user/spark/warehouse
spark.sql.hive.metastore.sharedPrefixes=com.amazonaws.services.dynamodbv2
spark.yarn.historyServer.address=ip-***.ec2.internal:18080
spark.history.ui.port=18080
spark.shuffle.service.enabled=true
spark.yarn.dist.files=/etc/spark/conf/hive-site.xml
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro,/usr/lib:/docker/usr/lib:ro,/usr/share:/docker/usr/share:ro,/mnt/s3:/mnt/s3:rw,/mnt1/s3:/mnt1/s3:rw
spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro,/usr/lib:/docker/usr/lib:ro,/usr/share:/docker/usr/share:ro,/mnt/s3:/mnt/s3:rw,/mnt1/s3:/mnt1/s3:rw
spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:+ExitOnOutOfMemoryError
spark.dynamicAllocation.enabled=true
spark.blacklist.decommissioning.enabled=true
spark.blacklist.decommissioning.timeout=1h
spark.resourceManager.cleanupExpiredHost=true
spark.stage.attempt.ignoreOnDecommissionFetchFailure=true
spark.decommissioning.timeout.threshold=20
spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:+ExitOnOutOfMemoryError
spark.hadoop.yarn.timeline-service.enabled=false
spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS=$(hostname -f)
spark.files.fetchFailure.unRegisterOutputOnHost=true
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem=2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem=true
spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds=50000
spark.sql.parquet.output.committer.class=com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter
spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
spark.sql.emr.internal.extensions=com.amazonaws.emr.spark.EmrSparkSessionExtensions
spark.executor.memory=9486M
spark.executor.cores=4
spark.yarn.executor.memoryOverheadFactor=0.1875
spark.driver.memory=2048M

I am connecting to the JanusGraph server from Java:
import org.apache.tinkerpop.gremlin.process.computer.ComputerResult;
import org.apache.tinkerpop.gremlin.process.computer.GraphComputer;
import org.apache.tinkerpop.gremlin.process.computer.traversal.TraversalVertexProgram;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;

String query = "g.V().count()";
ComputerResult result = graph.compute(SparkGraphComputer.class)
        .result(GraphComputer.ResultGraph.NEW)
        .persist(GraphComputer.Persist.EDGES)
        .program(TraversalVertexProgram.build()
                .traversal(graph.traversal().withComputer(SparkGraphComputer.class),
                        "gremlin-groovy", query)
                .create(graph))
        .submit()
        .get();
System.out.println("result of above query >>>>>>>>> " + result);
Spark and JanusGraph are on two different machines.
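
For completeness, a minimal sketch of how the HadoopGraph above is opened from this properties file (assuming it is saved as spark-yarn.properties; the path is an assumption):

import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

// gremlin.graph=...HadoopGraph in the properties file selects the graph implementation.
Graph graph = GraphFactory.open("spark-yarn.properties");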

Thanks
Sai

HadoopMarc

May 12, 2021, 11:43:28 AM5/12/21
to Gremlin-users
Hi Sai,

What do you mean by "JanusGraph server" and different machines? Your code looks like you have an embedded JanusGraph instance in your Java client. Your first post says you use Gremlin Server, but the code you list does not connect to Gremlin Server.
Some other terminology: Spark is not "on a machine". Spark is an application that runs as a job on the EMR YARN cluster. Gremlin-java in turn uses Spark as a distributed computation framework.

Also not clear to me: do you use spark-submit for your app? That may not be handy, because then you have no control over the Spark version, and you need compatible JanusGraph and Spark versions. You can do without spark-submit.
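
For illustration, a minimal sketch (the properties file name is an assumption, not from your post) of submitting an OLAP traversal directly from a plain Java process, with Spark acting only as a library on the classpath:

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

// The driver JVM is the Spark "client"; with spark.master=yarn-client in the
// properties file, it negotiates with the YARN ResourceManager itself, so no
// spark-submit is involved.
Graph graph = GraphFactory.open("spark-yarn.properties");  // path is an assumption
GraphTraversalSource g = graph.traversal().withComputer(SparkGraphComputer.class);
long count = g.V().count().next();  // runs as a Spark job on YARN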

Steps you can take:
  • first try to run the standard SparkPi example on EMR, that is, with spark.master=yarn-client (see https://spark.apache.org/docs/2.4.0/)
  • be sure to pass all your dependencies, including TinkerPop, JanusGraph, Spark and transitive deps, in the spark properties, as explained in the TinkerPop recipes linked above (see the sketch below this list). Conveniently, JanusGraph has all deps in its lib folder.
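For example, recipe-style dependency-shipping properties look roughly like this (a sketch; the archive path and classpath entries are assumptions for an EMR setup, not tested values):

# Hypothetical zip of all jars in janusgraph's lib folder, shipped to YARN;
# Spark localizes the spark.yarn.archive contents under __spark_libs__.
spark.yarn.archive=/tmp/spark-gremlin.zip
# Make the shipped jars visible to the application master and the executors.
spark.yarn.appMasterEnv.CLASSPATH=./__spark_libs__/*:/etc/hadoop/conf
spark.executor.extraClassPath=./__spark_libs__/*:/etc/hadoop/conf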
Best wishes,    Marc



On Wednesday, May 12, 2021 at 14:37:48 UTC+2, suprajr...@gmail.com wrote:

Venkat Dasari

May 13, 2021, 11:42:45 AM5/13/21
to Gremlin-users
Marc,
 Doesn't this work with Spark 2.4.4? Is there a way I can specify a version in the Gremlin config that would allow it to run with a different Spark version?
This is what I see when I run with some basic configs from the JanusGraph docs.

at org.apache.tinkerpop.gremlin.console.GremlinGroovysh.execute(GremlinGroovysh.groovy:83)
at org.codehaus.groovy.tools.shell.Shell.leftShift(Shell.groovy:120)
at org.codehaus.groovy.tools.shell.Shell$leftShift$1.call(Unknown Source)
at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$work(InteractiveShellRunner.groovy)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:101)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1217)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:144)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.groovy.runtime.callsite.PlainObjectMetaMethodSite.doInvoke(PlainObjectMetaMethodSite.java:43)
at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite$PogoCachedMethodSiteNoUnwrapNoCoerce.invoke(PogoMetaMethodSite.java:190)
at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite.callCurrent(PogoMetaMethodSite.java:58)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:160)
at org.codehaus.groovy.tools.shell.ShellRunner.run(ShellRunner.groovy:57)
at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$run(InteractiveShellRunner.groovy)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:101)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1217)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:144)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:164)
at org.codehaus.groovy.tools.shell.InteractiveShellRunner.run(InteractiveShellRunner.groovy:97)
at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:234)
at org.apache.tinkerpop.gremlin.console.Console.<init>(Console.groovy:168)
at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:234)
at org.apache.tinkerpop.gremlin.console.Console.main(Console.groovy:502)
Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.tinkerpop.gremlin.process.computer.traversal.step.map.VertexProgramStep.processNextStart(VertexProgramStep.java:68)
... 75 more
Caused by: java.lang.NullPointerException
at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
at org.apache.tinkerpop.gremlin.spark.structure.Spark.create(Spark.java:52)
at org.apache.tinkerpop.gremlin.spark.structure.Spark.create(Spark.java:60)
at org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer.lambda$submitWithExecutor$1(SparkGraphComputer.java:313)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

HadoopMarc

May 14, 2021, 2:54:47 AM5/14/21
to Gremlin-users
The stack trace you included seems to show two things:
  1. your code runs from the Gremlin Console. Did you just run bin/gremlin.sh from the JanusGraph distribution (which version?) and open the HadoopGraph with the properties file you showed earlier?
  2. the Gremlin Console cannot create a SparkContext. There are many possible reasons for this, but it is not related to the Spark version.
So, please specify exactly what you did.

As to your question: for YARN it is easy to deal with various Spark versions; for TinkerPop/JanusGraph it is not, because of all the library conflicts that must be solved to generate a binary distribution with all dependencies included.

Best wishes,     Marc
On Thursday, May 13, 2021 at 17:42:45 UTC+2, dasar...@gmail.com wrote: