Hello,
I'm trying to bulk-load a large CSV file into Titan/HBase via Spark, using ScriptInputFormat. Part-way through the job I get the following error:
ERROR org.apache.spark.scheduler.TaskSetManager - Task 0 in stage 5.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 6, hd03): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException
To eliminate as many variables as possible, I decided to try loading the very simple toy graph given in the Script I/O Format section of the TinkerPop documentation. Obviously a bulk loader is overkill for that, but I wanted to keep things simple. Sure enough, I got the same exception. When I inspected the resulting graph, I could see that some (maybe all?) vertices were present, but no edges were loaded.
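For reference, the parse script I'm pointing gremlin.hadoop.scriptInputFormat.script at is essentially the one from that section of the docs. I'm reproducing it from memory here, so treat it as a sketch of what I'm running rather than my exact file:

```groovy
// script.groovy -- parse script for ScriptInputFormat, adapted from the
// TinkerPop Script I/O Format docs. Each input line looks roughly like:
//   1:person:marko:29 knows:2:0.5,created:3:0.4
def parse(line, factory) {
    def parts = line.split(/ /)
    def (id, label, name, x) = parts[0].split(/:/).toList()
    def v1 = factory.vertex(id, label)
    if (name != null) v1.property('name', name)
    if (x != null) {
        // the fourth field is age for people, language for projects
        if (label.equals('project')) v1.property('lang', x)
        else v1.property('age', Integer.valueOf(x))
    }
    // the second token, if present, holds the outgoing edges
    if (parts.length == 2) {
        parts[1].split(/,/).grep { !it.isEmpty() }.each {
            def (eLabel, refId, weight) = it.split(/:/).toList()
            def v2 = factory.vertex(refId)
            def edge = factory.edge(v1, v2, eLabel)
            edge.property('weight', Double.valueOf(weight))
        }
    }
    return v1
}
```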
I'm at a loss as to how to proceed. What could be causing this error? Any clue as to how to troubleshoot?
Thanks,
Jerrell
Component versions
HDP 2.3.4.0-3485
Spark 1.5.2
HBase 1.1.2
Titan 1.0
TinkerPop 3.1.1-incubating
Here are the gremlin commands I used to load the data, along with the relevant property files:
gremlin> graph = GraphFactory.open('/opt/titan/config/hadoop-load-files.properties')
gremlin> blvp = BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).writeGraph('/opt/titan/config/titan-files-bulkload.properties').create(graph)
gremlin> graph.compute(SparkGraphComputer).program(blvp).submit().get()
# /opt/titan/config/hadoop-load-files.properties
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.inputLocation=hdfs://kbprod/user/jschivers/titan/titansparktest
gremlin.hadoop.outputLocation=dummyoutput
gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.scriptInputFormat.script=/user/jschivers/titan/titansparktest.groovy
gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
gremlin.spark.persistStorageLevel=DISK_ONLY
gremlin.spark.persistContext=true
spark.master=yarn-client
spark.executor.memory=4g
spark.executor.instances=10
spark.yarn.am.extraJavaOptions=-Dhdp.version=2.3.4.0-3485
spark.driver.extraJavaOptions=-Dhdp.version=2.3.4.0-3485
# /opt/titan/config/titan-files-bulkload.properties
gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
storage.backend=hbase
storage.hostname=hm01,hm02,hm03
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5
index.search.backend=elasticsearch
index.search.hostname=hm02
storage.hbase.table=titansparktest
titan.hadoop.output.format=com.thinkaurelius.titan.hadoop.formats.hbase.TitanHBaseOutputFormat
titan.hadoop.output.conf.storage.backend=hbase
titan.hadoop.output.conf.storage.hostname=hm01,hm02,hm03
titan.hadoop.output.conf.storage.port=2181
titan.hadoop.output.conf.storage.hbase.table=titansparktest
titan.hadoop.output.conf.storage.batch-loading=true
titan.hadoop.output.conf.storage.hbase.region-count=10
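And for completeness, this is roughly how I inspected the graph afterwards from the Gremlin console (the edge count is what came back as zero):

```groovy
// open the target Titan graph directly and check what was loaded
gremlin> graph = com.thinkaurelius.titan.core.TitanFactory.open('/opt/titan/config/titan-files-bulkload.properties')
gremlin> g = graph.traversal()
gremlin> g.V().count()   // some/all vertices are present
gremlin> g.E().count()   // returns 0 -- no edges made it in
```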