Is it OK to use a graph DB on a large, dense graph?


Victor Genin Data Scientist

Mar 16, 2018, 12:07:41 PM
to Gremlin-users

We want to present our data as a graph and are considering one of the graph DBs. During our vendor investigation, one of the experts suggested that using a graph DB on a dense graph won't be efficient and that we'd be better off with a columnar DB like Cassandra:

"I gave your use case some thought, and given that your graph is very dense (number of relationships ≈ number of nodes squared) and that you seem to only need a few-hop traversals from a particular node along different relationships, I'd actually recommend you also try out a columnar database.

Graph databases tend to work well when you have sparse graphs (number of relationships << number of nodes^2) and deep traversals, from 4-5 hops to hundreds of hops. If I understood your use case correctly, a columnar database should generally outperform a graph DB there."

Our use case will probably end up with nodes connected to tens of millions of other nodes, with about 30% overlap between the neighbor sets of different nodes, so in a way it's probably a dense graph. Overall there will probably be a few billion nodes.
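To put those numbers in perspective, here is a quick back-of-envelope calculation in Python. The node count, average degree, and bytes-per-edge figures are assumptions picked to roughly match the description above (a few billion nodes, tens of millions of neighbors each), not measurements:

```python
# Back-of-envelope: edge count and raw storage for the use case above.
# Assumed numbers: ~1 billion nodes, each connected to ~10 million others,
# 16 bytes per directed edge entry (two 8-byte ids).

nodes = 1_000_000_000
avg_degree = 10_000_000
bytes_per_edge = 16

edges = nodes * avg_degree      # total directed edges
density = avg_degree / nodes    # fraction of the n^2 possible edges present

print(f"edges: {edges:.2e}")                                  # 1.00e+16
print(f"density (m / n^2): {density:.2%}")                    # 1.00%
print(f"raw edge storage: {edges * bytes_per_edge / 1e15:.0f} PB")  # 160 PB
```

Interestingly, even at 1% of n^2 this is nowhere near a mathematically "dense" graph, yet the absolute per-vertex degree is what causes the supernode trouble discussed below.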


Looking through the Neo4j source code, I found a reference to an isDense flag on nodes that switches the processing logic; I'm not sure exactly what it does. I also wonder whether it was added as an edge-case patch and won't work well if most of the nodes in the graph are dense.


Does anyone have experience with graph DBs on dense graphs, and should they be considered in such cases? Over the years I've seen a few related issues on both the Titan and Neo4j threads, which the development teams seemed to take into consideration. But, again, I'm not sure whether those were edge-case patches or real, scalable solutions.


As one definition puts it, a "supernode" is a vertex with a disproportionately high number of incident edges, and "supernodes are rare in natural graphs"...


In our case it's not so rare.


All opinions are appreciated!

Ted Wilmes

Mar 16, 2018, 2:35:06 PM
to Gremlin-users
Hi Victor,
Do you know what your read patterns will be like? For example, will you be routinely gathering all of the vertices some number of hops
out from a starting vertex, or making much more selective queries that retrieve a very small number of vertices/edges?

Generally speaking, from the storage and retrieval standpoint, you'll definitely run into the supernode problem at the density you're talking about.
Two popular TinkerPop property graph DBs, DSE Graph and JanusGraph, use Cassandra for storage. In both cases there will be at least two factors limiting
your max per-vertex edge counts: first, how much you can store in a single Cassandra partition without running into operational issues, and second,
how much you can materialize in memory, which depends on the type of analysis you're doing (hence my high- vs. low-selectivity question above).
The read paths differ between JanusGraph and DSE Graph, but ultimately, even if you can read the edges off disk, attempting to materialize tens of
millions of adjacent vertices in memory is going to cause problems. Alternatively, you can use TinkerPop OLAP to pull the data up into Spark for
analysis, but again, per-vertex edge counts like that will be very challenging, and even if the jobs do run, they will be very, very resource
intensive on the Spark side. You can also break up vertices manually in your model to limit max edge counts, but I'd only suggest that if you can
apply it to a small portion of your model, not the whole graph.
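To illustrate the manual vertex splitting mentioned above, here is one possible sketch in Python. The shard count, function name, and id format are all hypothetical, not a real TinkerPop API; the point is only that edges of one logical vertex get hashed across N physical "shard" vertices so that no single vertex exceeds roughly degree / N edges:

```python
import hashlib

# Sketch: split a supernode into NUM_SHARDS physical vertices that share
# a logical id. Each edge is assigned to one shard by hashing the
# neighbor id, so writes are deterministic and reads that need *all*
# edges fan out over every shard. All names here are illustrative.

NUM_SHARDS = 64

def shard_vertex_id(logical_id: str, neighbor_id: str) -> str:
    """Pick which shard of `logical_id` an edge to `neighbor_id` lives on."""
    h = int(hashlib.md5(neighbor_id.encode()).hexdigest(), 16)
    return f"{logical_id}#shard{h % NUM_SHARDS}"

# An edge (user42 -> item_9) is stored on exactly one deterministic shard:
print(shard_vertex_id("user42", "item_9"))
# A full-neighborhood read must scan user42#shard0 .. user42#shard63.
```

The trade-off is exactly the one noted above: point reads stay cheap, but every query that touches the whole neighborhood now pays a fan-out cost, which is why this only helps when applied to a small, known set of supernodes.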

On the Cassandra recommendation you mentioned: the thing to keep in mind is that you'll still need some sort of partitioning scheme.
You won't be able to insert tens of millions of items into a single Cassandra partition. That's not to say you can't come up with a data model
that works by introducing bucketing and the like, but you'll need to put some time into it, and it will have to be designed around your query patterns.
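To make the bucketing idea concrete, here is one possible sketch in Python. The table shape, bucket count, and function name are assumptions for illustration, not a recommended schema; the key point is that the bucket becomes part of the Cassandra partition key, so no single partition holds tens of millions of rows:

```python
import hashlib

# Sketch: an adjacency table keyed by (node_id, bucket) instead of just
# node_id, so each partition holds roughly degree / NUM_BUCKETS rows.
# Hypothetical CQL shape this code targets:
#
#   CREATE TABLE adjacency (
#       node_id  text,
#       bucket   int,
#       neighbor text,
#       PRIMARY KEY ((node_id, bucket), neighbor)
#   );

NUM_BUCKETS = 1024

def bucket_for(neighbor: str) -> int:
    """Deterministic bucket so writers and readers agree on the partition."""
    return int(hashlib.sha1(neighbor.encode()).hexdigest(), 16) % NUM_BUCKETS

# A single edge write targets exactly one partition:
print(("user42", bucket_for("item_9"), "item_9"))
# Reading all neighbors of user42 issues NUM_BUCKETS partition reads,
# which can run in parallel but must match the bucketing at write time.
```

This is the same trade-off as vertex splitting in a graph model, which is why the advice to enumerate your queries first applies equally to the plain-Cassandra route.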

My advice would be to enumerate the queries that you plan on running and then use those as a guide. Others can chime in and correct me if I'm wrong, but I personally
don't know of a current property graph database that can gracefully handle the supernode edge counts you're talking about.

--Ted