I have a cluster of 20 nodes and a spark app writing to titan 0.5.4/cassandra.
Everything runs well for a few hours, then the whole system slows down by at least an order of magnitude and eventually fails.
Some debugging shows that only one of the Cassandra nodes is struggling. The heap on that node is being fragmented by large allocations and it goes into a spin of repeated full GCs. I have tried varying the sizes of the old and young generations and various other parameters, and I also tried the G1 garbage collector. The overall heap size is 8GB.

The problem is with the data: I tried decommissioning the node, and the behaviour moved to the node the data was copied to. After rejoining the cluster, the original machine was fine. The problem follows the data.
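For reference, the sort of settings I have been varying in cassandra-env.sh look roughly like this (example values only, not a recommendation; I tried several combinations):

```shell
MAX_HEAP_SIZE="8G"     # overall heap
HEAP_NEWSIZE="2G"      # young gen - tried several values here

# and, separately, switching to G1 instead of CMS:
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
```

None of these made a meaningful difference once the node got into the full-GC spiral.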
This seems to be something to do with super nodes. There are some very busy vertices with tens of thousands of neighbours and even more edges between them, so I thought deleting them might help. Unfortunately, I can't delete them, or even count their neighbours:
g.v(41005056).both()[0..10].count() works, but
g.v(41005056).both()[0..100].count() hangs.
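For context, the kind of batched clean-up I was hoping to do looks roughly like this (Gremlin/TinkerPop 2 syntax against Titan 0.5; the slice size and commit interval are guesses, and removing edges while iterating may need more care than this sketch shows):

```groovy
// delete a super node's edges a bounded slice at a time,
// committing in small transactions instead of one huge one
v = g.v(41005056)
n = 0
v.bothE()[0..999].each { e ->
    g.removeEdge(e)
    if (++n % 100 == 0) g.commit()   // keep transactions small
}
g.commit()
```

Even something like this hangs or blows the heap before it gets anywhere, which is the crux of my problem.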
A heap dump on the struggling machine shows that the heap is mostly memory buffers.
Also, if we don't increase the Thrift payload size in the client (to hundreds of MB), we can't traverse any of these super nodes at all.
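For the record, the knob I mean on the server side is the one below (cassandra.yaml), with a matching frame size configured on the Titan client; the exact client property name may differ by Titan version, so treat this as approximate:

```yaml
# cassandra.yaml - default is 15 MB; we had to push it into the hundreds
thrift_framed_transport_size_in_mb: 256
```

Needing a frame size that large just to touch a single vertex is what makes me think one row/vertex has grown enormous.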
Questions:
- What is it that gets big when super nodes are created?
- Is this behaviour the same in Titan 1?
- Has anyone else seen this?
- Does anyone have any suggestions for further things to try?