Repeated full GC in cluster after super nodes get too big


Nigel Brown

Nov 22, 2016, 4:36:05 AM
to Aurelius
I have a cluster of 20 nodes and a spark app writing to titan 0.5.4/cassandra.

Everything runs well for a few hours, then the whole system slows down by at least an order of magnitude and eventually fails.

Some debugging shows that only one of the Cassandra nodes is struggling: its heap is being fragmented by large blocks and it goes into a spin of repeated full GCs. I have tried varying the sizes of the old and young generation heaps and various other parameters, and I also tried the G1 garbage collector. The overall heap size is 8GB. The problem lies with the data rather than the machine - I tried decommissioning the node and the behaviour moved to the node where the data was copied, while the original machine was fine after rejoining the cluster. The problem follows the data.
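
To give an idea, the settings varied were roughly along these lines (illustrative values only, set via cassandra-env.sh; the exact numbers changed between runs):

MAX_HEAP_SIZE="8G"                              # overall heap size
HEAP_NEWSIZE="2G"                               # young generation size (varied)
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"               # tried G1 instead of the default collector
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"   # tried different pause targets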

This has something to do with supernodes. There are some very busy vertices with tens of thousands of neighbours and even more edges between them, so I thought it would be a good idea to delete them. Unfortunately, I can't delete them or even count their neighbours:

g.v(41005056).both()[0..10].count() works but

g.v(41005056).both()[0..100].count() hangs


A heap dump on the machine says the heap is mostly memory buffers.


Also, if we don't increase the thrift payload size in the client (to hundreds of MB), we can't traverse any of these supernodes.
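
For context, the change needed is roughly the following; the server-side limit lives in cassandra.yaml, and the corresponding client-side frame-size option in the Titan configuration has to be raised to match (the value below is illustrative, not a recommendation):

# cassandra.yaml (server side): maximum thrift frame size, in MB
thrift_framed_transport_size_in_mb: 256
# the Titan client's cassandra frame-size option must be raised to a matching
# value; the exact property name depends on the Titan version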


Questions:

  1. What is it that gets big when super nodes are created?
  2. Is this behaviour the same in Titan 1?
  3. Has anyone else seen this?
  4. Does anyone have any suggestions for further things to try?


Stephen Mallette

Nov 23, 2016, 5:52:44 AM
to Aurelius
As the supernode grows, the row containing the vertex grows (i.e. with each added edge), and yes, when you have very large supernodes the thrift payload size must increase. I'm semi-surprised that you're talking about supernodes with just "tens of thousands of edges" being a problem - I've never experienced that. If you want to get a sense of the size of the supernodes, try counting edges instead of adjacent vertices:

g.V(1).bothE().count()

To delete the supernode, I routinely found that pruning edges (rather than just trying to drop the vertex) worked better. I think the reason had something to do with the transaction size involved in dropping the supernode itself: to drop a vertex with 1 million edges, Titan first has to remove that vertex and then go to all 1 million adjacent vertices and drop the edge from each of them. I usually pruned supernodes from the opposite side - in other words, get a list of all the vertices connected to the supernode and then drop the edge between each of them and the supernode. This typically required titan-hadoop for a graph of any reasonable size. I'm not sure of your graph schema, but if you had vertex-centric indices in place for your edges, I wonder whether pruning would work where the index gives you some level of high selectivity.
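
As a rough Gremlin-Groovy sketch of that "prune from the opposite side" idea (the supernode id is taken from your post, and neighbourIds stands for a pre-built list of adjacent vertex ids, e.g. exported with titan-hadoop - both are illustrative, not a ready-made recipe):

// assumes: g is the Titan graph, neighbourIds is a List of adjacent vertex ids
supernodeId = 41005056L
neighbourIds.eachWithIndex { nid, i ->
    // walk the neighbour's (much smaller) adjacency and drop only the
    // edges whose other endpoint is the supernode
    g.v(nid).getEdges(Direction.BOTH).each { e ->
        if (e.getVertex(Direction.IN).getId() == supernodeId ||
            e.getVertex(Direction.OUT).getId() == supernodeId) {
            e.remove()
        }
    }
    // commit in small batches so no single transaction gets huge
    if (i % 1000 == 0) { g.commit() }
}
g.commit()

The batching is only there to keep each transaction small, per the transaction-size issue described above.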

I'm pretty sure that nothing around this changes in Titan 1.0, though I think vertex partitioning was introduced there.
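
If you do move to Titan 1.0 and want to experiment with that, partitioned vertex labels are declared through the management API, roughly like this (the label name is illustrative, and the cluster also needs the related partitioning settings - check the Titan 1.0 docs):

// declare a vertex label whose adjacency is spread across partitions
// instead of living in a single row
mgmt = graph.openManagement()
mgmt.makeVertexLabel('hub').partition().make()
mgmt.commit()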

