Graph graph = GraphFactory.open("conf/hadoop-graph/hadoop-load.properties");
JanusGraph resultGraph = JanusGraphFactory.open("conf/janusgraph-hbase.properties");
Vertex a1 = graph.addVertex(T.label, "person", "name", "Jack");
Vertex a2 = graph.addVertex(T.label, "person", "name", "Jill");
Vertex a3 = graph.addVertex(T.label, "person", "name", "Jack");
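// hypothetical: the desired (not yet existing) vertex program that merges vertices sharing the same merge key, here "name"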
MergeVertexProgram mvp = MergeVertexProgram.builder().create(resultGraph);
ComputerResult result = graph.compute(SparkGraphComputer.class).program(mvp).submit().get();
Assert.assertEquals(2, result.graph().traversal().V().toList().size()); // one Jack, one Jill
I have seen that this could be done similarly to the BulkLoaderVertexProgram, where a graph traversal queries the graph at every vertex for a matching vertexIdProperty: g.V().has(vertex.label(), getVertexIdProperty(), vertex.id().toString()). In my testing this has been very slow, and since my data set will require multiple merges over hundreds of millions of vertices, I was hoping there was another way to approach the problem. I had hoped I could use the MapReduce integration for something like Map: emit<mergeId, Vertex>, Reduce: emit<MergedVertex>, but if I understand correctly you cannot get a ResultGraph out of a MapReduce, only a reference to a memory object.
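For reference, the merge semantics I am after can be expressed as an OLTP traversal that groups vertices by the merge key and collapses each group. This is only a minimal sketch of the logic (mergeBy is a hypothetical helper, not a TinkerPop or JanusGraph API), and it is exactly the kind of approach that is too slow at my scale:

import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.structure.Direction;
import org.apache.tinkerpop.gremlin.structure.Edge;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class OltpMergeSketch {

    // Merges vertices of the given label that share the same value of mergeKey,
    // keeping the first vertex of each group and re-attaching the edges of the
    // others to it. Edge and vertex properties of the duplicates are not copied.
    public static void mergeBy(GraphTraversalSource g, String label, String mergeKey) {
        Map<Object, Object> groups = g.V().hasLabel(label)
                .group().by(mergeKey).by(__.fold())
                .next();

        for (Object group : groups.values()) {
            @SuppressWarnings("unchecked")
            List<Vertex> duplicates = (List<Vertex>) group;
            if (duplicates.size() < 2) continue;

            Vertex keep = duplicates.get(0);
            for (Vertex dup : duplicates.subList(1, duplicates.size())) {
                // Re-point outgoing edges of the duplicate at the surviving vertex.
                Iterator<Edge> outEdges = dup.edges(Direction.OUT);
                while (outEdges.hasNext()) {
                    Edge e = outEdges.next();
                    keep.addEdge(e.label(), e.inVertex());
                }
                // Re-point incoming edges likewise.
                Iterator<Edge> inEdges = dup.edges(Direction.IN);
                while (inEdges.hasNext()) {
                    Edge e = inEdges.next();
                    e.outVertex().addEdge(e.label(), keep);
                }
                // Removing the vertex also drops its remaining incident edges.
                dup.remove();
            }
        }
        // Commit on transactional graphs such as JanusGraph.
        g.getGraph().tx().commit();
    }
}

For the toy data above, mergeBy(resultGraph.traversal(), "person", "name") would leave one Jack and one Jill, but grouping hundreds of millions of vertices this way is not practical.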
Thank you, and any guidance would be very much appreciated.
Given these issues with reusing the existing graph, is an upfront merge before the bulk load an option?
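For illustration only, a minimal sketch of such an upfront merge, assuming the input arrives as simple records keyed by name before any vertices are created (PersonRecord and mergeUpfront are hypothetical names, not part of any bulk-loader API):

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class UpfrontMergeSketch {

    // Hypothetical input record; the real rows would come from whatever files
    // feed the bulk load.
    record PersonRecord(String name, Map<String, Object> properties) {}

    // Collapse records that share the same merge key (here: name) into one
    // record each, before any vertices are created. Conflict resolution is
    // simplified to "first record wins".
    static Collection<PersonRecord> mergeUpfront(List<PersonRecord> records) {
        return records.stream()
                .collect(Collectors.toMap(
                        PersonRecord::name,          // merge key
                        r -> r,                      // keep the record
                        (first, second) -> first))   // naive conflict resolution
                .values();
    }
}

The real merge logic would of course have to decide how to combine properties from duplicate records rather than discarding them.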
HTH, Marc