Supernode problem in different storage backends


Florian Hockmann

Jan 10, 2018, 7:02:52 AM
to JanusGraph users
Hi,

Supernodes are probably one of the biggest problems for graph databases, so I wanted to ask what experiences you have with supernodes on the different storage backends of JanusGraph.

My main interest currently is ScyllaDB vs. Cassandra, but experiences with other storage backends would of course also be interesting to hear about. If my understanding is correct, supernodes result in large partitions in Cassandra and ScyllaDB. Jon Haddad wrote about the problems with large partitions in Cassandra in the context of time-series data modeling: http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html

Scylla claims that they have good support for supernodes: http://www.scylladb.com/2016/08/25/large-partitions/

So my question is: do you have any experience with supernodes in ScyllaDB, and does it handle supernodes (i.e., large partitions) better than Cassandra?

The usual recommendation for dealing with supernodes or large partitions is to use bucketing (explained in both blog posts I linked above; the basic idea is to split a supernode up into multiple nodes, for example one per day). The really interesting question from my point of view is: does any storage backend (maybe ScyllaDB?) handle those large partitions well enough that such bucketing isn't necessary anymore?
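Just to make explicit what I mean by bucketing (the labels, property names, and ids here are only illustrative, this is just a sketch): instead of attaching every edge directly to the supernode, you attach it to a per-day bucket vertex that hangs off the supernode, roughly like this:

// create a bucket vertex for one day and hang it off the supernode
g.V().has('domain', 'name', 'google.com').as('d').
  addV('domainBucket').property('day', '2018-01-10').
  addE('bucketOf').to('d').iterate()

// edges that would have hit the supernode now go to that day's bucket instead
g.V(visitorId).as('v').V(bucketId).addE('visited').from('v').iterate()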

To start with our own experiences with supernodes and Cassandra as the storage backend: we currently have some supernodes since we don't use anything like bucketing, and this leads to problems like vertices getting too big to work with effectively. When we try to use Spark on our data, it also fails to load our graph, apparently due to overly large partitions.
Therefore we are currently evaluating whether to change our schema to use bucketing or whether another storage backend would be a better choice than Cassandra.

Best regards and thanks
Florian

Ted Wilmes

Jan 10, 2018, 10:20:50 AM
to JanusGraph users
Hi Florian,
I've used Scylla with Janus but haven't pushed the limit on partition size. I think Scylla's large partition support will help with the case where you run OLTP queries against supernodes that are still fairly constrained on their edge retrieval by a vertex-centric index. In that case, with Scylla's improved large partition support, the usual operational headaches should be removed and you'll be left with a maintainable system and low-latency reads. What I don't think it'll fix is the case where you're running queries that are largely unconstrained against those large partitions. If Janus needs to materialize the full adjacency list of a large partition (or a large percentage of it) into memory, we'll probably still have issues even if the storage backend can handle it. Are your transactional queries pretty constrained when they hit supernodes, or do you need to do scan-type operations? On a related note, I think there have been large-partition improvements in Cassandra 3, but I'd have to chase down the JIRAs to see what they were.

--Ted

Florian Hockmann

Jan 11, 2018, 6:22:18 AM
to JanusGraph users
Thanks for your input, Ted. Most of our queries either use a limit() to only get a small number of neighbors back, or at least a timeLimit() to get a best-effort result. That shouldn't be a big problem for JanusGraph as long as the storage backend is able to handle those partitions efficiently, should it?
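To give an idea of the shape (the edge labels and ids here are just placeholders), our constrained reads look roughly like:

g.V(domainId).out('linksTo').limit(20).valueMap()
g.V(domainId).out('linksTo').timeLimit(200).values('name')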

A bigger problem for us might be the way we insert edges: we actually do upserts that check, before adding an edge, whether an edge with the same label already exists between the two vertices. That probably requires JanusGraph to iterate over the whole adjacency list of the first vertex. Unfortunately, I don't really see a way to perform such an upsert without scanning the whole adjacency list. Vertex-centric indices are probably too expensive for this use case, as we would need to create one index (or even two, to also support the other direction) for most of our edge labels.
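To make that concrete, the upsert we currently do is essentially the following (labels and ids are only illustrative):

// check whether an edge with this label already exists between v1 and v2
v2 = g.V(v2Id).next()
exists = g.V(v1Id).outE('linksTo').where(inV().is(v2)).hasNext()
if (!exists) {
    g.V(v1Id).next().addEdge('linksTo', v2)
    g.tx().commit()
}

For a supernode, the outE() check is the part I expect to scan the whole adjacency list.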

Might it be an option for JanusGraph to handle the supernode problem by splitting a vertex into more than one partition before it's getting too big?

Ted Wilmes

Jan 13, 2018, 11:59:08 AM
to JanusGraph users
Hi Florian,
Good questions. I had to confirm my memory, but here's the storage-layer behavior I think you'd see for a few different scenarios. Let's say we have a vertex 'A' with 0.5 million adjacent vertices, all reached via outgoing edges from 'A'.

g.V(a).outE().count(): the storage adapter will retrieve all 0.5 million adjacent edges at once, using the default limit for queries, which is Integer.MAX_VALUE (2,147,483,647)

g.V(a).outE().limit(123): the limit will be set to 246 here, or 2x the limit you 
set (see line 437 of BasicVertexCentricQueryBuilder if you're curious why it's 2x)

g.V(a).out().limit(123): like above, limit will be set to 246 because even though 
it's an out vs. an outE, the edges have the inbound vertex ids so there is no need 
to retrieve the vertices

g.V(a).out().valueMap().limit(123): limit is set to MAX_VALUE and all the edges 
are retrieved even though we only need 123 of them. This occurs because we do 
not have a strategy (yet) that will pull that limit leftwards. For the time being, you 
could move that limit before the valueMap yourself and get the desired behavior.
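For example, the manually reordered version would be:

g.V(a).out().limit(123).valueMap()

which gets you the 2x-limit behavior from the out() case above instead of the full retrieval.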

g.V(a).out().range(100000, 110000): limit is set to 2x the upper bound (220,000). Since there is no vertex-centric index on the edge, the edges aren't sorted in any way that would let us page through them, so retrieval always has to start at 0 every time.

So, as you can see from these, the storage backend is part of it, but we still have scenarios where a large quantity of edges must be returned, which will at some point cause issues on the Janus side, whether due to heap pressure or the extra processing required to filter after retrieval.

With regards to splitting the adjacency lists up, you can create partitioned vertex labels [1]. I have not used this feature so I'm not sure how well it works in practice, but again, it may only help you at the storage layer if you have queries that are expected to retrieve a large set of edges; the bottleneck will just move to the Janus JVM. If you can use vertex-centric indices (VCIs) and constrain your queries, you'll have less of an issue. The benefit of a VCI is that it's one of the few spots where we can push down extra predicates to the storage layer, thereby pruning unwanted data quickly and greatly reducing the post-processing in the JVM.
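If you want to experiment with partitioned vertex labels, the schema definition is just a management call, roughly like this (the label name is only an example):

mgmt = graph.openManagement()
mgmt.makeVertexLabel('domain').partition().make()
mgmt.commit()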

For the upsert case, Janus should be able to make that check without a full scan as long as you formulate the query in Gremlin such that the AdjacentVertexFilterOptimizerStrategy will optimize it. You could also use the low-level Janus query API, but Gremlin is preferable in my opinion. Here's an example to try out; my test vertex has 510,000 adjacent edges.

The timing from my laptop is included so you can see it's pretty snappy, and this is with the global cache turned off.

gremlin> clockWithResult{g.tx().rollback(); adjVertex = g.V(356568).next(); g.V(356568).V(934072).outE("knows").where(inV().is(eq(adjVertex))).next()}
==>5.61414471
==>e[51qv-k0qg-1lh-7n4o][934072-knows->356568]

You can probably roll that first vertex lookup into the Gremlin query if you want. Also, again, if you don't specify the edge label, it'll still pull everything back. If you run an explain on that query, you'll see the strategy in action.
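That is, something like:

gremlin> g.V(934072).outE("knows").where(inV().is(eq(adjVertex))).explain()

The explain output shows the traversal after each strategy is applied, so you can check whether AdjacentVertexFilterOptimizerStrategy changed the traversal.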

Well, that turned out longer than expected but I hope there are a few things in there to try out.

--Ted

Florian Hockmann

Jan 15, 2018, 7:34:55 AM
to JanusGraph users
Thanks very much, Ted! It's really good to know when the whole adjacency list needs to be scanned and how much the limit and range steps help. Especially the implications of the range step are new to me; I thought that step would enable efficient paging.

g.V(a).out().valueMap().limit(123): limit is set to MAX_VALUE and all the edges 
are retrieved even though we only need 123 of them. This occurs because we do 
not have a strategy (yet) that will pull that limit leftwards.

Interesting, I always thought that we have such a strategy in TinkerPop. Couldn't such a strategy simply move the limit to the left for all map steps?


With regards to splitting the adjacency lists up, you can create partitioned vertex 
labels [1]

The problem with this approach is that it can only be configured for all vertices of a certain label. I don't know if there are many domains where this is really that helpful, but for our domain it probably isn't. We store, for example, information about domains, and 99.9% of those only have a small number of edges, but a few domains are like google.com with more than 10 million edges. This is quite similar to the usual Twitter followers example that you also used in your blog post to explain supernodes: most Twitter users have less than 1,000 followers, but some famous users have millions. So partitioning them to lessen the supernode problem a bit would also partition all the other users, which means that most traversals would now involve multiple partitions.
Apart from that, partitioning with vertex cuts doesn't really help for small clusters, but you already mentioned that in your blog post as well.

Isn't it also a problem for JanusGraph that it stores the whole adjacency list of a vertex in a single row? Wouldn't it be an option to use more than one row for vertices with a large number of edges? We could maybe add a (configurable) limit on the number of edges stored in a single row and then create a new row for the vertex when its edge count reaches that limit. I'm not a Cassandra expert (and certainly not an expert on HBase or any of the other storage backends), but wouldn't it be possible to place those rows automatically in different partitions?


For the upsert case, Janus should be able to make that check without a full scan as long 
as you formulate the query in Gremlin such that the AdjacentVertexFilterOptimizerStrategy 
will optimize it

That's really good to hear. I just checked our upsert query and it triggers the AdjacentVertexFilterOptimizerStrategy. The times also look good for a supernode that I just tested (1.3 seconds for a vertex with ~1 million edges). So upserts really shouldn't be a problem with supernodes.

Ted Wilmes

Jan 16, 2018, 6:46:48 PM
to JanusGraph users
Hi Florian,

Interesting, I always thought that we have such a strategy in TinkerPop. Couldn't such a strategy simply move the limit to the left for all map steps?
 
Yes, I think this would be a good addition. I'll put an issue in.  

Isn't it also a problem for JanusGraph that it stores the whole adjacency list of a vertex in a single row? Wouldn't it be an option to use more than one row for vertices with a large number of edges? We could maybe add a (configurable) limit on the number of edges stored in a single row and then create a new row for the vertex when its edge count reaches that limit. I'm not a Cassandra expert (and certainly not an expert on HBase or any of the other storage backends), but wouldn't it be possible to place those rows automatically in different partitions?

The current partitioned vertex approach spreads the adjacency list across partitions, but as you are alluding to, it can be a detriment to the lower-degree vertices because it unnecessarily distributes their adjacency lists as well, leading to extra back and forth. Janus could do something along the lines of what you are suggesting. It would require statistics to be maintained so that a split point could be identified when an edge-count threshold is breached, and there would also have to be some means of marking a vertex as partitioned or not. This wouldn't be trivial, but it would definitely be possible.
Another adaptive approach would be to store the adjacency list twice: in a write-tuned format, as it is now, and in a read-tuned format. The read copy could be asynchronously updated off of the write updates, kind of like a simple materialized view. One example of this in the relational world would be a row-oriented vs. a column-oriented format, the former ideal for OLTP, the latter great for OLAP. Databases like MemSQL and Peloton are taking a hybrid approach that combines the two into a single hybrid transactional/analytical system. This would definitely not be a trivial update... but it would be pretty darn cool :)

That's really good to hear. I just checked our upsert query and it triggers the AdjacentVertexFilterOptimizerStrategy. The times also look good for a supernode that I just tested (1.3 seconds for a vertex with ~1 million edges). So upserts really shouldn't be a problem with supernodes.

Excellent, good to hear that was already in place!

--Ted 

Florian Hockmann

Feb 2, 2018, 9:20:54 AM
to JanusGraph users

Interesting, I always thought that we have such a strategy in TinkerPop. Couldn't such a strategy simply move the limit to the left for all map steps?
 
Yes, I think this would be a good addition. I'll put an issue in.

I just created TINKERPOP-1882 for this.


Another adaptive approach would be to store the adjacency list twice: in a write-tuned format, as it is now, and in a read-tuned format. The read copy could be asynchronously updated off of the write updates, kind of like a simple materialized view. One example of this in the relational world would be a row-oriented vs. a column-oriented format, the former ideal for OLTP, the latter great for OLAP.

Spark (or another OLAP system) would still get a complete, unpartitioned vertex with this approach, wouldn't it? We just did a quick evaluation to see when the supernode problem hits us with Spark, and we were only able to load a supernode with 3 million edges into Spark. With 4 million edges, the Spark executors died with an OutOfMemoryError or because they spent 98% of their time in garbage collection. During our tests the Spark executors had only 8 GB of heap space, but I assume this scales roughly linearly with bigger heaps.
So to really solve this supernode problem, we would probably also need to somehow support partitioned vertices in spark-gremlin.
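In case it helps to reproduce: loading the graph with SparkGraphComputer can be as simple as something like this (the properties file path is only a placeholder for our read configuration, and the spark-gremlin plugin needs to be activated):

graph = GraphFactory.open('conf/hadoop-graph/read-cassandra.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()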
...