Driver memory leak with single instance of Netty HashedWheelTimer


Sylvain Juge

Jun 25, 2015, 4:38:59 AM
to java-dri...@lists.datastax.com

Hi,


We recently had an OOME on multiple nodes of our application (5 Tomcat nodes connecting to 9 C* nodes), and the heap dump shows that more than 50% of the heap is associated with a single Netty object instance.


Our C* cluster was working properly, but the webapp instance failures happened within the same time frame, +/- a few hours.


Current driver version: 2.1.6 (previously updated from 2.1.5).


Screenshot of Eclipse MAT: (attached image not reproduced here)



I can provide the heap dump for analysis if required (270 MB compressed).

Regards,
Sylvain.

Olivier Michallat

Jun 25, 2015, 10:05:04 AM
to java-dri...@lists.datastax.com
Hi,

Do you have any custom configuration for the driver? Are speculative executions enabled?

Yes, I will look at the heap dump if you can make it available somewhere.

--

Olivier Michallat

Driver & tools engineer, DataStax



Sylvain Juge

Jun 25, 2015, 11:20:45 AM
to java-dri...@lists.datastax.com
Heap dump is available here: https://drive.google.com/folderview?id=0BzjaTIvp9xXIfmhEMW5rYi11SmdaTy1tbC1SNjdxcVFEc0N0UjVSNDB4MlhiM3g1U1pPVnc&usp=sharing

Configuration is defined as follows:

Cluster.Builder builder = Cluster.builder();
builder.getConfiguration().getProtocolOptions().setCompression(ProtocolOptions.Compression.LZ4);
builder.withProtocolVersion(ProtocolVersion.V3);
builder.getSocketOptions().setReadTimeoutMillis(25000);
builder.getPoolingOptions().setPoolTimeoutMillis(25000);  
builder.getPoolingOptions().setMaxSimultaneousRequestsPerHostThreshold(HostDistance.LOCAL, 2048);
builder.withReconnectionPolicy(new ConstantReconnectionPolicy(25000));
builder.withRetryPolicy(new LoggingRetryPolicy(AlwaysIgnoreRetryPolicy.INSTANCE));
builder.withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy(), true));
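
For completeness, the builder is then finished roughly along these lines (the contact points and keyspace here are placeholders, not our real values):

Cluster cluster = builder
        .addContactPoints("cass-node-1", "cass-node-2")  // placeholder hostnames
        .build();
Session session = cluster.connect("my_keyspace");        // placeholder keyspace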

Regards,
Sylvain.

Olivier Michallat

Jun 28, 2015, 12:58:11 PM
to java-dri...@lists.datastax.com
The HashedWheelTimer's management thread is blocked. This seems related to a ThreadLocal used in your application code; I'll send you more details privately.

We use HashedWheelTimer for request timeouts. Each time we send a request, we schedule a HashedWheelTimeout to fail the request if we don't get an answer after SocketOptions.readTimeoutMillis.

When you schedule a timeout, Netty first stores it in a temporary queue. The management thread periodically polls that queue and transfers the timeouts to the timer's buckets.
Since the thread is blocked, that never happens and the timeouts stay in the queue forever, producing a leak. At the time of the heap dump, there are approximately 700K instances in the queue. Each instance has a reference back to the driver's internal representation of a request, and transitively to the Future, BoundStatement, etc., so these are also leaked.
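
To make that concrete, here is a minimal sketch (not the driver's actual code, just an illustration of the Netty API) of scheduling such a timeout with HashedWheelTimer; everything other than the Netty classes is a made-up name:

import java.util.concurrent.TimeUnit;

import io.netty.util.HashedWheelTimer;
import io.netty.util.Timeout;
import io.netty.util.TimerTask;

public class ReadTimeoutSketch {

    private static final HashedWheelTimer TIMER = new HashedWheelTimer();

    // Schedules a timeout that runs onTimeout if it is not cancelled within readTimeoutMillis.
    // newTimeout() first puts the HashedWheelTimeout in a pending queue; the timer's worker
    // ("management") thread later transfers it into a wheel bucket. If that thread is blocked,
    // the pending queue grows without bound, which is the leak described above.
    static Timeout scheduleReadTimeout(long readTimeoutMillis, final Runnable onTimeout) {
        return TIMER.newTimeout(new TimerTask() {
            @Override
            public void run(Timeout timeout) {
                onTimeout.run(); // e.g. fail the in-flight request with a timeout exception
            }
        }, readTimeoutMillis, TimeUnit.MILLISECONDS);
    }
}

When a response arrives in time, the returned Timeout would simply be cancelled (timeout.cancel()), so only unanswered requests keep a live timeout.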

I used this OQL query (in VisualVM) to find out whether the leaked timeouts were in a bucket or not (this returned almost all instances, so they weren't):

select count(heap.objects("io.netty.util.HashedWheelTimer$HashedWheelTimeout"), function(t) {return t.bucket == null;});
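
(These queries use VisualVM's JavaScript-flavoured OQL; with the heap dump loaded, they can be pasted into the OQL Console tab.)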

Then I used this to find out if there was something in the queue:

select map(
    heap.objects("io.netty.util.HashedWheelTimer"),
    function(t) {
        var limit = 10000; /* iterating over the 700K instances takes too long, so cap it */
        var node = t.timeouts.headRef, count = 1;
        while (node != null && count < limit) {
            node = node.next;
            count += 1;
        }
return "Found at least " + count + " pending timeouts";
    }
);

Finally I checked the thread dump (included in the heap dump).

--

Olivier Michallat

Driver & tools engineer, DataStax



Sylvain Juge

Jun 29, 2015, 7:50:07 AM
to java-dri...@lists.datastax.com
Thanks, Olivier, for such wonderful feedback!


For the record, the issue is very probably in our application code, which uses bytecode instrumentation to analyse Runnable instances on the fly to collect metrics (and thus instruments Netty's own classes).

The thread dump shows that there is a clear contention point, which makes the timeout queue grow very large.

Carlos Scheidecker

Jun 29, 2015, 2:40:24 PM
to java-dri...@lists.datastax.com, Olivier Michallat
Olivier & all,

We have been having the same problem for a few months now, and I have been looking at the threads here related to it. We run out of memory and the driver times out with 2.2.0-rc1.

It seemed that 2.1.5 was much more stable and that 2.1.6 got worse. If I recall correctly, something was changed in 2.1.5 because of this issue; I can go search the forum here for it.

It was not good with version 2.1.6, so we changed to 2.2.0-rc1 yesterday.

The idea of the driver is to be able to self-heal the connections, to have a pool, and to be safe.

Back in the day I used to have my own Thrift code and my own connection pool.

I wonder if I can paste the logs from the Tomcat instance with the errors, or our driver code.

So, let me know the best way to help fix this soon.

Olivier Michallat

unread,
Jun 30, 2015, 5:09:53 AM6/30/15
to java-dri...@lists.datastax.com
Hi,

> We are having the same problem

Well, maybe the same symptoms, but not the same problem. Sylvain's problem was in his application code.

Did you take a heap dump? Which objects use the most heap?

--

Olivier Michallat

Driver & tools engineer, DataStax


