Load data into JanusGraph through TinkerPop spark-gremlin


Zhu qi

Jan 9, 2020, 10:41:37 PM
to JanusGraph users
We have a billion vertices and would like to import them through spark-gremlin. Can anyone share some examples, for instance by constructing an InputRDD or by some other method?

marc.de...@gmail.com

Jan 11, 2020, 8:47:03 AM
to JanusGraph users
Hi Zhu,

Indeed, compared to systems like Elasticsearch or JDBC-connected SQL databases, JanusGraph/TinkerPop lags behind in terms of Spark connectivity. The old JanusGraph docs used to have a section on Spark-based data ingestion using the BulkLoaderVertexProgram, but its use was often problematic, and the section was removed after TinkerPop deprecated BulkLoaderVertexProgram. You might have some luck with it nevertheless; it is still available in the TinkerPop APIs.

I think it should not be too difficult to write something yourself, if you keep the following in mind:
  • Prevent the same vertex from being ingested more than once, which can easily happen in a distributed system (most easily achieved by inserting all vertices first, before any edges).
  • Make each Spark task the size of one batch transaction and use rdd.mapPartitions, so that a task counts as completed exactly once, and only if its transaction was committed successfully.
  • Follow the suggestions on batch loading in the JanusGraph reference docs, in particular the one about the id block size.
  • Have each Spark executor make a single backend connection using the singleton design pattern (so do not connect and disconnect for each Spark task).
  • Study the way JanusGraph passes properties to the Spark executors with the janusgraphmr.ioformat.conf.storage..... mechanism (see the conf/hadoop-graph examples in the JanusGraph distribution).
This assumes you use a JVM-based language. The story would be different when using Gremlin Server with one of the supported language variants: from what I remember reading, that route is more difficult to scale and tune, although the cloud-based TinkerPop-compatible services managed to get this right.
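
To make the points about one transaction per task and one connection per executor more concrete, here is a minimal sketch in plain Java. It is not taken from the JanusGraph or TinkerPop code base. The names GraphConnection, ConnectionHolder, commitBatch and loadPartition are hypothetical, and the graph backend is replaced by an interface so that the pattern can be read on its own. In a real job, the factory would presumably open the graph once (for instance with JanusGraphFactory.open and the batch-loading settings), commitBatch would add the vertices inside one transaction and commit it, and loadPartition would be the body of the function passed to rdd.mapPartitions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

/** Sketch only: per-executor singleton connection plus one transaction per partition. */
final class BatchLoaderSketch {

    /** Hypothetical stand-in for the graph backend; not a JanusGraph type. */
    interface GraphConnection {
        /** Assumed to write all keys in one transaction and throw if the commit fails. */
        void commitBatch(List<String> vertexKeys);
    }

    /** One connection per JVM, created lazily on first use and then shared by all tasks. */
    static final class ConnectionHolder {
        private static volatile GraphConnection instance;

        private ConnectionHolder() {}

        static GraphConnection get(Supplier<GraphConnection> factory) {
            GraphConnection local = instance;
            if (local == null) {
                synchronized (ConnectionHolder.class) {
                    local = instance;
                    if (local == null) {
                        local = factory.get();   // runs at most once per JVM
                        instance = local;
                    }
                }
            }
            return local;
        }
    }

    /**
     * Intended as the body of a mapPartitions function: the whole partition is
     * written as a single batch. An exception from commitBatch propagates, so the
     * task fails and can be retried as a whole instead of leaving a partial batch.
     */
    static int loadPartition(Iterator<String> partition, Supplier<GraphConnection> factory) {
        GraphConnection connection = ConnectionHolder.get(factory);
        List<String> batch = new ArrayList<>();
        partition.forEachRemaining(batch::add);
        if (!batch.isEmpty()) {
            connection.commitBatch(batch);
        }
        return batch.size();
    }
}
```

With this shape, the batch size is controlled by how the RDD is partitioned rather than inside the loader, so repartitioning the input until each partition holds roughly one batch worth of vertices is the tuning knob. Loading all vertices in a first pass and edges in a second pass keeps the same structure; only the contents of commitBatch would change.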

HTH,    Marc

On Friday, January 10, 2020 at 04:41:37 UTC+1, Zhu qi wrote: