Heartbeat error when using BulkUploader


ste...@indisputable.io

Jul 24, 2018, 12:47:12 AM
to JanusGraph users
I'm consistently having issues using the BulkLoaderVertexProgram to load a very large (800GB) graph into JanusGraph. I have it split up into 110 files (time-based edge separation), so the files are each about 8GB, and I am iterating over the files one by one to upload them. I haven't been able to get past the first file. I have 64 cores, with Spark executor and worker memory of 6g each (416GB on the machine). I've tried a number of different configurations to no avail. What is really killing productivity is how long it takes for the system to fail (about 5 hours), which makes it hard to iterate on debugging. The latest error I'm having is:



[Stage 5:==03:13:44 WARN  org.apache.spark.rpc.netty.NettyRpcEnv  - Ignored message: HeartbeatResponse(false)
03:14:31 WARN  org.apache.spark.rpc.netty.NettyRpcEndpointRef  - Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@3f7a6c8,BlockManagerId(driver, localhost, 40607))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval


This is an urgent issue for me and any help is greatly appreciated.

Debasish Kanhar

Jul 24, 2018, 2:13:14 AM
to JanusGraph users
Hi,

Can you share the full stack trace or, better, the full logs? That might give us more clarity on why you are facing this error. It can happen for any number of reasons; I remember getting heartbeat timeouts even when my connection to the Cassandra backend was failing.

Also, sharing the configuration you are specifying would help.
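
In the meantime, since the timeout message itself names spark.executor.heartbeatInterval, one low-risk thing you could try is raising the Spark timeouts in your read-graph properties. Something along these lines (illustrative values, not tuned for your setup; keep the heartbeat interval well below the network timeout):

spark.executor.heartbeatInterval=60s
spark.network.timeout=600s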

ste...@indisputable.io

Jul 24, 2018, 2:24:31 AM
to JanusGraph users
Thanks for the response, Debasish. Here is my configuration for the read graph:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true

gremlin.hadoop.inputLocation=data/sample-bulk-import-data
gremlin.hadoop.scriptInputFormat.script=scripts/bulk-import.groovy
storage.batch-loading=true
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer

gremlin.hadoop.outputLocation=/path/to/persist/location
gremlin.spark.graphStorageLevel=DISK_ONLY
gremlin.spark.persistStorageLevel=DISK_ONLY

#
# SparkGraphComputer Configuration
#
spark.master=local[*]
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.memory=6g
spark.driver.memory=6g
spark.local.dir=/janusgraph/external/spark
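
One caveat I'm unsure about: with spark.master=local[*], everything runs in a single driver JVM (at least as I understand Spark's local mode), so spark.executor.memory may not be doing anything and the heap would really be set by spark.driver.memory alone. If that's right, the line I'd bump is something like the following (the 32g figure is just a guess for this machine, untested):

spark.driver.memory=32g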

Here is the configuration for the write graph:

gremlin.graph=org.janusgraph.core.JanusGraphFactory

storage.backend=cassandrathrift
storage.batch-loading=true
storage.cassandra.frame-size-mb=1000
schema.default=none

ids.block-size=25000

storage.hostname=<three IPs for cassandra ring>
storage.cassandra.keyspace=test_graph
storage.read-time=200000
storage.write-time=20000

# I've commented the next two out, but they were used to build the keyspace
#storage.cassandra.replication-strategy-options=asia-southeast1_asia_cassandra,2
#storage.cassandra.replication-strategy-class=org.apache.cassandra.locator.NetworkTopologyStrategy
storage.cassandra.write-consistency-level=ONE
storage.cassandra.read-consistency-level=ONE
#storage.cassandra.atomic-batch-mutate=false

index.edge.backend=lucene
index.edge.directory=/janusgraph/data/edgeindex

# Whether to enable JanusGraph's database-level cache, which is shared
# across all transactions. Enabling this option speeds up traversals by
# holding hot graph elements in memory, but also increases the likelihood
# of reading stale data.  Disabling it forces each transaction to
# independently fetch graph elements from storage before reading/writing
# them.
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
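
Side note: ids.block-size is still at 25000 here. My understanding from the JanusGraph bulk-loading docs is that a much larger block size is recommended for big imports (roughly the number of vertices added per instance per hour), so something like the following is on my list to try, though I haven't yet:

ids.block-size=1000000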

Here is the script to run the bulk upload:

import groovy.io.FileType


// Source CSVs, plus a "done" folder that processed files are moved into
folder = new File('/janusgraph/external/import/adjacency-list')
done_folder = new File('/janusgraph/external/import/done')

folder.eachFileRecurse(FileType.FILES) { file ->
    if (file.name.endsWith(".csv")) {
        println(file.absolutePath)

        // Open the Hadoop read graph and point it at the current file
        graph = GraphFactory.open("conf/coral/read-graph.properties")
        graph.configuration().setInputLocation(file.absolutePath)
        graph.configuration().setProperty("gremlin.hadoop.scriptInputFormat.script", "/janusgraph/scripts/bulk-import.groovy")

        // Run the BulkLoaderVertexProgram against the JanusGraph write graph
        blvp = BulkLoaderVertexProgram.build()
                .intermediateBatchSize(10000)
                .writeGraph('conf/coral/write-graph.properties')
                .create(graph)
        graph.compute(SparkGraphComputer).program(blvp).submit().get()
        graph.close()

        // Move the file out of the import folder once it has loaded
        file.renameTo(new File(done_folder, file.getName()))
    }
}
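
For completeness, I kick the loop above off with the Gremlin console in script mode, along the lines of the following (the script name is just what I happen to call it locally):

bin/gremlin.sh -e /janusgraph/scripts/load-all.groovy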

ste...@indisputable.io

Jul 24, 2018, 2:28:56 AM
to JanusGraph users
I am using Cassandra. Could it be something on the Cassandra side that is failing? In a previous configuration I noticed a GC error that appeared to come from Cassandra.



ste...@indisputable.io

Jul 24, 2018, 2:37:07 PM
to JanusGraph users

It happened again, so I've included a screenshot of the error:




ste...@indisputable.io

Jul 24, 2018, 3:07:25 PM
to JanusGraph users
Also, the longest line in the adjacency list is 17,179,092 characters long, which equates to about 128,000 edges for that particular vertex.
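
For reference, this is roughly the one-off Groovy scan I use to check line lengths and per-line edge counts; it assumes the edges on a line are comma-separated, which mine are:

folder = new File('/janusgraph/external/import/adjacency-list')
folder.eachFileRecurse(groovy.io.FileType.FILES) { f ->
    if (f.name.endsWith('.csv')) {
        int maxLen = 0
        int maxEdges = 0
        f.eachLine { line ->
            maxLen = Math.max(maxLen, line.length())
            maxEdges = Math.max(maxEdges, line.count(','))
        }
        println "${f.name}: longest line ${maxLen} chars, ~${maxEdges} edges"
    }
}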

jcbms

Jul 27, 2018, 1:25:53 AM
to JanusGraph users
If you use the BulkLoaderVertexProgram to load data, it takes a lot of memory. It seems Spark tries to cache all of the in-vertices and out-vertices in memory, so if you have a node with a very high degree, that node alone can use up your memory. That is why I gave up on the BulkLoaderVertexProgram.
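
What I do instead is load with plain JanusGraph transactions and commit in batches. A rough sketch only (the file path, delimiter and property key are placeholders you would adapt to your adjacency-list format):

graph = JanusGraphFactory.open('conf/coral/write-graph.properties')

int count = 0
new File('/janusgraph/external/import/adjacency-list/part-000.csv').eachLine { line ->
    def cols = line.split(',')               // placeholder: first column is the vertex id
    def v = graph.addVertex('vid', cols[0])  // placeholder property key
    // add edges here once both endpoints exist, e.g. via a lookup on 'vid'
    if (++count % 10000 == 0) {
        graph.tx().commit()                  // keep transactions small so memory stays bounded
    }
}
graph.tx().commit()
graph.close()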
