Hi Zhu,
Indeed, compared to systems like Elasticsearch or JDBC-connected SQL databases, JanusGraph/TinkerPop lags behind in terms of Spark connectivity. The old JanusGraph docs used to have a section on Spark-based data ingestion using the BulkLoaderVertexProgram, but its use was often problematic and the section was removed after TinkerPop deprecated the BulkLoaderVertexProgram. You might still have some luck with it, though; it is still available in the TinkerPop APIs.
I think it should not be too difficult to write something yourself, if you keep the following in mind:
- prevent ingesting the same vertex more than once, which can easily happen because of the distributed nature of the system (realised most easily by inserting all vertices first and adding edges in a second pass; see the sketch after this list)
- make Spark tasks the size of one batch transaction and use rdd.mapPartitions, so that the work of a task is applied exactly once: the task only succeeds if its transaction was successfully committed (also in the sketch below)
- use the suggestions on batch loading from the JanusGraph reference docs, in particular the ids.block-size setting (see the example properties below)
- have each Spark executor make a single backend connection using the singleton design pattern (so do not connect and disconnect for each Spark task; this is what the singleton object in the sketch below does)
- study the way JanusGraph passes properties to the Spark executors with the janusgraphmr.ioformat.conf.storage..... mechanism (see the conf/hadoop-graph examples in the JanusGraph distribution and the last snippet below)
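
To make this more concrete, here is a minimal Scala sketch of the first, second and fourth points combined. It is only an illustration of the one-transaction-per-partition and singleton-connection ideas, not a finished loader: the property key "myId", the file name janusgraph-batch.properties, the toy input and the partition count are all assumptions you would replace with your own.

import org.apache.spark.sql.SparkSession
import org.janusgraph.core.{JanusGraph, JanusGraphFactory}

// One JanusGraph connection per executor JVM (singleton), reused by all tasks
// that run on that executor.
object GraphConnection {
  private var graph: JanusGraph = _
  def get(propertiesFile: String): JanusGraph = synchronized {
    if (graph == null || graph.isClosed) graph = JanusGraphFactory.open(propertiesFile)
    graph
  }
}

object VertexLoader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("janusgraph-bulk-load").getOrCreate()
    val sc = spark.sparkContext

    // Toy input: (id, name) pairs; in practice read from HDFS/Hive/etc. and
    // repartition so that one partition is about the size you want per transaction.
    val vertices = sc.parallelize((1L to 100000L).map(i => (i, s"name-$i")), numSlices = 50)

    // Pass 1: vertices only; edges go in a second pass, after all vertices exist.
    // The property key "myId" and an index on it are assumed to be defined in the
    // schema beforehand.
    val written = vertices.mapPartitions { iter =>
      val graph = GraphConnection.get("janusgraph-batch.properties")
      val tx = graph.newTransaction()      // one transaction per Spark partition
      var n = 0L
      try {
        iter.foreach { case (id, name) =>
          tx.addVertex("myId", java.lang.Long.valueOf(id), "name", name)
          n += 1
        }
        tx.commit()                        // the task only succeeds if this commit succeeds
      } catch {
        case e: Exception =>
          if (tx.isOpen) tx.rollback()
          throw e                          // let Spark retry the whole partition
      }
      Iterator.single(n)
    }.reduce(_ + _)

    println(s"vertices written: $written")
    spark.stop()
  }
}

The second pass for edges would look much the same, except that it looks up both endpoints by "myId" in the transaction before calling addEdge.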
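For the batch loading point, the properties file opened above would contain something along these lines; the values are only examples, so check the batch loading section of the reference docs for what fits your backend and data volume:

# janusgraph-batch.properties (name chosen here, not prescribed)
storage.backend=cql
storage.hostname=127.0.0.1
# relax consistency checks and locking during the load
storage.batch-loading=true
# reserve large id blocks so executors do not have to go back to the id pool all the time
ids.block-size=1000000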
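For the last point: as far as I know, everything prefixed with janusgraphmr.ioformat.conf. is stripped of that prefix and handed to the executors as ordinary JanusGraph storage configuration. As an impression, the read-cql.properties example in conf/hadoop-graph looks roughly like this (exact class names and keys differ per JanusGraph version and storage backend, so take the file from your own distribution):

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
janusgraphmr.ioformat.conf.storage.backend=cql
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.port=9042
janusgraphmr.ioformat.conf.storage.cql.keyspace=janusgraph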
This assumes you use a JVM-based language. The story would be different when using Gremlin Server with one of the supported Gremlin language variants (from what I remember reading, that route would be more difficult to scale and tune, although the cloud-based TinkerPop-compatible services managed to get this right).
HTH, Marc
On Friday, 10 January 2020 at 04:41:37 UTC+1, Zhu qi wrote: