Merge similar vertexes with OLAP

Rod Paulk

Mar 6, 2018, 4:21:37 PM

to JanusGraph users

Hello

I am new to JanusGraph, so please forgive me if this is obvious. I have a graph loaded from different sources about the same subject matter. As a result, I have multiple vertices that represent the same entity but have no edges between them, and I would like a VertexProgram to merge them.

A simple case of what I am trying to achieve is below, where I merge vertices on matching names:

Graph graph = GraphFactory.open("conf/hadoop-graph/hadoop-load.properties");

JanusGraph resultGraph = JanusGraphFactory.open("conf/janusgraph-hbase.properties");

Vertex a1 = graph.addVertex(T.label, "person", "name", "Jack");

Vertex a2 = graph.addVertex(T.label, "person", "name", "Jill");

Vertex a3 = graph.addVertex(T.label, "person", "name", "Jack");

MergeVertexProgram mvp = MergeVertexProgram.builder().create(resultGraph);

ComputerResult result = graph.compute(SparkComputer.class).program(mvp).submit().get();

Assert.assertEquals(2, result.graph().traversal().V().toList().size()); // One Jack, one Jill

I have seen that this could be done in a way similar to the BulkLoaderVertexProgram, where a graph traversal queries the graph at every vertex for a matching vertexIdProperty: g.V().has(vertex.label(), getVertexIdProperty(), vertex.id().toString()). In my testing this has been really slow, and since my data set will require multiple merges on hundreds of millions of nodes, I was hoping there was another way to approach the problem. I had hoped to use the MapReduce integration for something like Map emit<mergeId, Vertex>, Reduce emit<MergedVertex>, but if I understand correctly, you cannot get a result graph from the MapReduce job, just a reference to a memory object.
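A minimal sketch of the Map emit<mergeId, Vertex> / Reduce emit<MergedVertex> idea, written in plain Java without Hadoop or TinkerPop so that it runs on its own. VertexRecord, mergeKey and the "first value wins" conflict rule are assumptions made for illustration; they are not JanusGraph or TinkerPop APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergeByKeySketch {

    // Hypothetical stand-in for a vertex read from the graph.
    static final class VertexRecord {
        final long id;
        final String label;
        final Map<String, String> properties;

        VertexRecord(long id, String label, Map<String, String> properties) {
            this.id = id;
            this.label = label;
            this.properties = properties;
        }
    }

    // "Map" phase: derive the merge key from the match criteria (label + name here).
    static String mergeKey(VertexRecord v) {
        return v.label + "|" + v.properties.get("name");
    }

    // "Reduce" phase: fold all vertices that share a key into one merged vertex.
    // Assumption: the first id seen is kept, and on a property conflict the
    // first value seen wins.
    static List<VertexRecord> merge(List<VertexRecord> vertices) {
        Map<String, VertexRecord> merged = new LinkedHashMap<>();
        for (VertexRecord v : vertices) {
            merged.merge(mergeKey(v), v, (first, next) -> {
                Map<String, String> props = new LinkedHashMap<>(first.properties);
                next.properties.forEach(props::putIfAbsent);
                return new VertexRecord(first.id, first.label, props);
            });
        }
        return new ArrayList<>(merged.values());
    }
}
```

With the Jack, Jill, Jack input from the example above, merge returns two records. In a distributed job the grouping by key would be done by the shuffle rather than by an in-memory map.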

Thank you, and any guidance would be very much appreciated.

HadoopMarc

Mar 8, 2018, 2:24:28 PM

to JanusGraph users

Hi Rod,

To me the best track is not obvious at all. Problems you face:

1. How are you going to find merge candidates? I guess you need some groupBy (having a vertex program send NxN messages will not scale).
2. HadoopGraph does not support mutations (but as you noticed, vertex programs can write to a result graph).
3. The current HBaseInputFormat is known to not perform well.

Given these issues in reusing the existing graph, is an upfront merge before the bulkload an option?

HTH, Marc

On Tuesday, March 6, 2018 at 22:21:37 UTC+1, Rod Paulk wrote:

Rod Paulk

Mar 9, 2018, 9:48:02 AM

to JanusGraph users

Hi Marc,

This is a live graph and we preprocess per source, per load, to make adjacency-list-style inserts. This leaves us with disconnected subgraphs both within the sources and between the sources, which is the problem. Scalability is a primary concern and we are fine with having a workflow of incrementally disambiguated graphs to get us to a world view, but pushing the full property match and vertex merge task to preprocessing is undesirable. Without a graph framework, what I had done in the past was to read and write directly to the HFiles with a MapReduce workflow, which allowed me to encode the match criteria in the emitted map key, then merge vertices in the reducer, and do this in a lambda framework style. So that is where my head is now and why I was confused about how the MapReduce integration works.

You also mention that the HBaseInputFormat is inefficient because it is built on Scan. Is there a more effective approach if we swap out HBase for Cassandra? Or use something similar to Mizo (https://github.com/imri/mizo)?

Thanks,

Rod

HadoopMarc

Mar 10, 2018, 4:46:26 AM

to JanusGraph users

Hi Rod,

I have not tried Cassandra myself on a large scale yet, but I expect better OLAP performance because Cassandra has default input splits of 64 MB, while the current HBase 1.2 has one input split per region. A region is typically 1 GB to 20 GB, which is very difficult to handle for reasonably sized Spark executors (this will change in HBase 1.3/2.0). Mizo is not part of JanusGraph, so maintaining a Mizo-based app will be risky.

If you cannot do the merging upfront, maybe the following approach is feasible (just try with a small dataset):

1. Run an OLAP traversal (i.e. with the TraversalVertexProgram) with group().by() to get the required vertex groups that need merging.
2. Run a Spark job to transform the vertex groups into an incremental input graph.
3. Run the incremental BulkLoaderVertexProgram with the output of 2.
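The transformation in the second step can be sketched roughly as follows, in plain Java rather than Spark so that it runs standalone. It assumes the group().by() result is available as a map from match key to vertex ids and that edges are plain (from, to) id pairs. The class name, the choice of the smallest id as the surviving vertex, and the dropping of edges that become duplicates are all illustrative assumptions, not part of JanusGraph or TinkerPop.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MergeGroupsSketch {

    // For every group, pick the smallest id as the canonical vertex and map
    // each member id to it. Groups are assumed to be non-empty.
    static Map<Long, Long> canonicalIds(Map<String, List<Long>> groups) {
        Map<Long, Long> remap = new HashMap<>();
        for (List<Long> ids : groups.values()) {
            long canonical = Collections.min(ids);
            for (long id : ids) {
                remap.put(id, canonical);
            }
        }
        return remap;
    }

    // Rewrite both endpoints of every edge to the canonical ids. The set drops
    // edges that become identical after the rewrite (an assumption; a real
    // graph may want to keep parallel edges).
    static Set<List<Long>> rewriteEdges(List<long[]> edges, Map<Long, Long> remap) {
        Set<List<Long>> out = new LinkedHashSet<>();
        for (long[] e : edges) {
            long from = remap.getOrDefault(e[0], e[0]);
            long to = remap.getOrDefault(e[1], e[1]);
            out.add(List.of(from, to));
        }
        return out;
    }
}
```

The remapped vertices and edges would then be written in whatever input format the incremental bulk load reads. On a cluster the same two functions would correspond to a broadcast of the id map followed by a map over the edge data set.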

HTH, Marc

On Friday, March 9, 2018 at 15:48:02 UTC+1, Rod Paulk wrote:
