Hi Florian,
Good questions. I had to confirm my memory, but here's the storage layer behavior I think you'd see for a
few different scenarios. Let's say we have a vertex 'A' with 0.5 million adjacent vertices, all of them reached by outbound edges from 'A'.
g.V(a).outE().count(): the storage adapter will retrieve all 0.5 million adjacent edges at once,
using the default query limit, which is Integer.MAX_VALUE (2,147,483,647).
g.V(a).outE().limit(123): the limit will be set to 246 here, or 2x the limit you
set (see line 437 of BasicVertexCentricQueryBuilder if you're curious why it's 2x)
g.V(a).out().limit(123): like above, the limit will be set to 246 because even though
it's an out() rather than an outE(), the edges already carry the inbound vertex ids, so there is no need
to retrieve the vertices.
g.V(a).out().valueMap().limit(123): limit is set to MAX_VALUE and all the edges
are retrieved even though we only need 123 of them. This occurs because we do
not have a strategy (yet) that will pull that limit leftwards. For the time being, you
could move that limit before the valueMap yourself and get the desired behavior.
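For example (a sketch reusing the same vertex 'A'), reordering it like this should get you back to the 2x limit behavior from the earlier cases, since the limit applies to the vertex-centric query before the valueMap runs:

g.V(a).out().limit(123).valueMap()

That way only ~246 edges come back from storage instead of the full 0.5 million.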
g.V(a).out().range(100000, 110000): limit is set to 2x the upper bound (220,000).
Since there is no vertex-centric index on the edge, the edges aren't sorted in
any way that would let us page through them, so retrieval always has to start at 0 every time.
So, as you can see from these, the storage backend is part of it, but we still
have scenarios where a large quantity of edges must be returned, which will at
some point cause issues on the Janus side, whether due to heap pressure or
the extra processing required to filter after retrieval.
With regard to splitting up the adjacency lists, you can create partitioned vertex
labels [1]. I have not used this feature, so I'm not sure how well it works in practice,
but again, it may only help you at the storage layer if you have queries that you
expect to retrieve a large set of edges for; the bottleneck will just move
to the Janus JVM. If you can use vertex-centric indices (VCIs) and constrain your
queries, you'll have less of an issue. The benefit of a VCI is that it's one of the
few spots where we can push extra predicates down to the storage layer, thereby
pruning unwanted data quickly and greatly reducing the post-processing in the JVM.
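In case it's useful, here's roughly what both of those look like through the management API. Treat it as a sketch: the 'user' label, 'knows' edge label, and 'time' property key are just placeholders for your own schema, and depending on your Janus version the sort order enum may be Order.decr rather than Order.desc.

mgmt = graph.openManagement()
// Partitioned vertex label: adjacency data for these vertices gets spread across partitions at the storage layer
mgmt.makeVertexLabel('user').partition().make()
// Vertex-centric index on 'knows' edges, sorted by the 'time' property
time = mgmt.makePropertyKey('time').dataType(Integer.class).make()
knows = mgmt.makeEdgeLabel('knows').make()
mgmt.buildEdgeIndex(knows, 'knowsByTime', Direction.BOTH, Order.desc, time)
mgmt.commit()

With that index in place, something like g.V(a).outE('knows').has('time', gt(2020)).limit(10) should be able to push the label and the 'time' predicate down to the storage layer, so only the matching slice of the adjacency list is read.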
For the upsert case, Janus should be able to make that check without a full scan as long
as you formulate the query in Gremlin such that the AdjacentVertexFilterOptimizerStrategy
will optimize it. You could also use the low-level Janus query API, but Gremlin is preferable
in my opinion. Here's an example to try out; my test vertex has 510,000 adjacent edges.
The timing from my laptop is included so you can see it's pretty snappy, and this is with the global cache turned off.
gremlin> clockWithResult{g.tx().rollback(); adjVertex = g.V(356568).next(); g.V(356568).V(934072).outE("knows").where(inV().is(eq(adjVertex))).next()}
==>5.61414471
==>e[51qv-k0qg-1lh-7n4o][934072-knows->356568]
You can probably roll that first vertex lookup into the Gremlin query if you want. Also, again, if you don't
specify the edge label, it'll still pull everything back. If you run an explain on that query, you'll see the strategy in action.
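For example, something like this (reusing the ids and adjVertex from the example above) should show AdjacentVertexFilterOptimizerStrategy in the list of applied strategies along with the traversal it rewrites:

g.V(934072).outE("knows").where(inV().is(eq(adjVertex))).explain()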
Well, that turned out longer than expected but I hope there are a few things in there to try out.
--Ted