"FastNoSuchElementException" when bulk loading into Titan/HBase using SparkGraphComputer

Jerrell Schivers

Apr 3, 2016, 10:09:48 PM
to Aurelius
Hello,

I'm trying to bulk-load a large CSV file into Titan/HBase via Spark, using ScriptInputFormat.  Part-way through the job I get the following error:

ERROR org.apache.spark.scheduler.TaskSetManager  - Task 0 in stage 5.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 6, hd03): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException

To eliminate as many variables as possible, I decided to try loading the very simple toy graph given in the Script I/O Format section of the TinkerPop reference documentation.  Obviously this is overkill for a bulk load, but I wanted to keep things simple.  Sure enough, I got the same exception.  When I inspected the graph I could see that some (maybe all?) vertices were present, but no edges were loaded.

I'm at a loss as to how to proceed.  What could be causing this error?  Any clue as to how to troubleshoot?

Thanks,
Jerrell

Component versions

HDP 2.3.4.0-3485
Spark 1.5.2
HBase 1.1.2
Titan 1.0
TinkerPop 3.1.1-incubating
I'm using the "titan1withtp3.1" fork of Titan from https://github.com/graben1437/titan1withtp3.1.

Here are the Gremlin commands I used to load the data, along with the relevant property files:

gremlin> graph = GraphFactory.open('/opt/titan/config/hadoop-load-files.properties')
gremlin> blvp = BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).writeGraph('/opt/titan/config/titan-files-bulkload.properties').create(graph)
gremlin> graph.compute(SparkGraphComputer).program(blvp).submit().get()

# /opt/titan/config/hadoop-load-files.properties

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.inputLocation=hdfs://kbprod/user/jschivers/titan/titansparktest
gremlin.hadoop.outputLocation=dummyoutput
gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.scriptInputFormat.script=/user/jschivers/titan/titansparktest.groovy
gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
gremlin.spark.persistStorageLevel=DISK_ONLY
gremlin.spark.persistContext=true
spark.master=yarn-client
spark.executor.memory=4g
spark.executor.instances=10
spark.yarn.am.extraJavaOptions=-Dhdp.version=2.3.4.0-3485
spark.driver.extraJavaOptions=-Dhdp.version=2.3.4.0-3485


# /opt/titan/config/titan-files-bulkload.properties

gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
storage.backend=hbase
storage.hostname=hm01,hm02,hm03
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
index.search.backend=elasticsearch
index.search.hostname=hm02
storage.hbase.table=titansparktest
titan.hadoop.output.format=com.thinkaurelius.titan.hadoop.formats.hbase.TitanHBaseOutputFormat
titan.hadoop.output.conf.storage.backend=hbase
titan.hadoop.output.conf.storage.hostname=hm01,hm02,hm03
titan.hadoop.output.conf.storage.port=2181
titan.hadoop.output.conf.storage.hbase.table=titansparktest
titan.hadoop.output.conf.storage.batch-loading=true
titan.hadoop.output.conf.storage.hbase.region-count=10


Daniel Kuppitz

Apr 4, 2016, 12:50:25 AM
to aureliu...@googlegroups.com
Hi Jerrell,

Most of the time when that happens, it's because edges weren't defined in both directions. Each vertex entry in your input file has to be aware of both its in- and out-edges; otherwise message-passing algorithms will fail and you'll see FastNoSuchElementExceptions.
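
For example, with an adjacency-list input like the one from the TinkerPop docs, it's not enough for vertex 1's line to claim "knows:2:0.5"; vertex 2's line also has to carry that edge on the incoming side. A sketch of what that could look like (the separate out- and in-edge columns and the "-" placeholder are just one way to encode it, not the documented format):

1:person:marko:29 knows:2:0.5,knows:4:1.0,created:3:0.4 -
2:person:vadas:27 - knows:1:0.5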

Cheers,
Daniel



Jerrell Schivers

Apr 4, 2016, 2:32:27 PM
to Aurelius
Hi Daniel,

Thanks for the feedback.

I remember that last year, when I was using Faunus, there was a "titan.hadoop.input.edge-copy-direction" property which I believe did this (please correct me if I'm wrong).  Is there something equivalent I can use?  In other words, let's say I wanted to load this toy graph from TinkerPop's documentation:

1:person:marko:29 knows:2:0.5,knows:4:1.0,created:3:0.4
2:person:vadas:27
3:project:lop:java
4:person:josh:32 created:3:0.4,created:5:1.0
5:project:ripple:java
6:person:peter:35 created:3:0.2

Here vertex 1 has three outgoing edges, to vertices 2, 4, and 3.  How do I make those target vertices aware of the equivalent incoming edges?  I'm using the parse() function defined here: http://tinkerpop.apache.org/docs/3.1.2-SNAPSHOT/reference/#script-io-format
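
For concreteness, here's my own guess at the kind of change that would be needed: add a third column for in-edges (with "-" meaning none) and extend the documented parse() so it registers those edges too.  The column layout and the placeholder are my invention, not something from the docs:

def parse(line, factory) {
    def parts = line.split(/ /)
    def (id, label, name, x) = parts[0].split(/:/).toList()
    def v1 = factory.vertex(id, label)
    if (name != null) v1.property('name', name)
    if (x != null) {
        // the second value depends on the vertex label
        if (label.equals('project')) v1.property('lang', x)
        else v1.property('age', Integer.valueOf(x))
    }
    // column 2: outgoing edges (v1 -> v2), as in the documented parse()
    if (parts.length > 1 && !parts[1].equals('-')) {
        parts[1].split(/,/).each {
            def (eLabel, refId, weight) = it.split(/:/).toList()
            def v2 = factory.vertex(refId)
            factory.edge(v1, v2, eLabel).property('weight', Double.valueOf(weight))
        }
    }
    // column 3: incoming edges (v2 -> v1), mirroring the out-edge that
    // appears on the other vertex's line
    if (parts.length > 2 && !parts[2].equals('-')) {
        parts[2].split(/,/).each {
            def (eLabel, refId, weight) = it.split(/:/).toList()
            def v2 = factory.vertex(refId)
            factory.edge(v2, v1, eLabel).property('weight', Double.valueOf(weight))
        }
    }
    return v1
}

Does that look like the right direction?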

Thanks,
Jerrell

Daniel Kuppitz

Apr 4, 2016, 5:01:44 PM
to aureliu...@googlegroups.com

Jerrell Schivers

Apr 4, 2016, 7:43:15 PM
to Aurelius
Hi Daniel,

Thanks for that example.  Although the lack of EdgeCopy certainly complicates my situation, at least now I know what's going on and can figure out a way forward.

--Jerrell

Jerrell Schivers

Apr 5, 2016, 10:56:15 AM
to Aurelius
Hello Daniel,

One quick follow-up question.  Is there a way to turn the FastNoSuchElementException into a non-fatal error?  My import will result in a graph with over a billion edges, and it would be problematic for the entire bulk import to fail because of a few missing edges.  I can live with the data being imperfect.

It would also help to see the specific data that triggered the error, so I could correct the problem at the source.  But so far I haven't figured out how to do that.

Thanks,
Jerrell

Daniel Kuppitz

Apr 5, 2016, 12:55:23 PM
to aureliu...@googlegroups.com
The edge-writing stage assumes that all vertices are present. I think you'll have to write your own BulkLoader implementation if you want to handle missing vertices. You can start by copying the existing IncrementalBulkLoader implementation and then tweak the methods used in this part:
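
Something along these lines, as an untested sketch against the TinkerPop 3.1 API (the LenientBulkLoader name is made up, and BulkLoaderVertexProgram would still need to tolerate the null return, e.g. by skipping that edge):

import org.apache.tinkerpop.gremlin.process.computer.bulkloading.IncrementalBulkLoader
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource
import org.apache.tinkerpop.gremlin.structure.Graph
import org.apache.tinkerpop.gremlin.structure.Vertex

// Untested sketch: wrap the vertex lookup so a missing vertex is logged
// and skipped instead of aborting the whole job with a
// FastNoSuchElementException (which extends NoSuchElementException).
class LenientBulkLoader extends IncrementalBulkLoader {
    @Override
    Vertex getVertex(Vertex vertex, Graph graph, GraphTraversalSource g) {
        try {
            return super.getVertex(vertex, graph, g)
        } catch (NoSuchElementException e) {
            // this is also where you get visibility into which data
            // caused the failure: the offending id is right here
            System.err.println('skipping missing vertex: ' + vertex.id())
            return null
        }
    }
}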


Cheers,
Daniel


Jerrell Schivers

Apr 5, 2016, 5:13:47 PM
to Aurelius
Thanks again, Daniel.  I think I can work with this.

--Jerrell