Eureka thread failed to stop, memory leak errors

Jamie

Feb 4, 2014, 11:32:17 AM
to eureka_...@googlegroups.com
Hi, 

Two days ago I updated eureka to version 1.1.124 from a version that's over a year old. A day later, Eureka crashed on all 3 of the servers in our cluster with the following out-of-memory exception.

SEVERE: The exception contained within MappableContainerException could not be mapped to a response, re-throwing to the HTTP container
java.lang.OutOfMemoryError: GC overhead limit exceeded

I've only got 512 meg allocated to eureka, so that's probably part of it, even though the servers have never had memory issues before.
I'm planning on doubling the allocated memory to 1 gig, unless more is recommended.
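For what it's worth, we run Eureka under Tomcat, so the plan is just to raise the heap via the usual JVM flags (e.g. in CATALINA_OPTS), something along the lines of:

-Xms1g -Xmx1g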

When looking at the logs I noticed a lot of these messages.  

SEVERE: The web application [/eureka] appears to have started a thread named [DiscoveryClient_Heartbeat] but has failed to stop it. This is very likely to create a memory leak.
Feb 04, 2014 5:35:26 AM org.apache.catalina.loader.WebappClassLoader clearReferencesThreads
SEVERE: The web application [/eureka] appears to have started a thread named [DiscoveryClient_ServiceURLUpdater] but has failed to stop it. This is very likely to create a memory leak.
Feb 04, 2014 5:35:26 AM org.apache.catalina.loader.WebappClassLoader clearReferencesThreads
SEVERE: The web application [/eureka] appears to have started a thread named [DiscoveryClient_InstanceInfo-Replictor] but has failed to stop it. This is very likely to create a memory leak.
Feb 04, 2014 5:35:26 AM org.apache.catalina.loader.WebappClassLoader clearReferencesThreads
SEVERE: The web application [/eureka] appears to have started a thread named [StatsMonitor-0] but has failed to stop it. This is very likely to create a memory leak.
Feb 04, 2014 5:35:27 AM org.apache.catalina.loader.WebappClassLoader clearReferencesThreads
SEVERE: The web application [/eureka] appears to have started a thread named [Eureka-JerseyClient-Conn-Cleaner2] but has failed to stop it. This is very likely to create a memory leak.

Is this typical, or is this actually a memory leak? Is it a known issue with version 1.1.124? If so, what version would be recommended?

- Jamie

Nitesh Kant

Feb 5, 2014, 12:26:33 PM
to eureka_...@googlegroups.com
Hi Jamie,

The memory used by eureka is proportional to the registry size, so unless the number of registered instances increased, there isn't an obvious reason why the memory footprint would increase. Having said that, if you can profile your application to see what is taking up the memory, we can look into it further.
There are no known issues pertaining to memory usage in that eureka release.

jondo...@gmail.com

Feb 12, 2014, 8:10:55 PM
to eureka_...@googlegroups.com
Hi Nitesh,

I've been helping to monitor this and have a bit more data. I did a heap dump (using jvisualvm) on Monday, shortly after Eureka started up, and then again today after it had been running for a few days. I found a few things that looked interesting.
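(For reference, I took the dumps from the VisualVM UI; the command-line equivalent would be something like:

jmap -dump:live,format=b,file=eureka-heap.hprof <pid>

in case anyone wants to reproduce this.)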

I took screenshots of the instance counts for some classes that, from what I can tell, appear to be leaking, one on Monday and one on Wednesday (links below).

There seem to be a growing number of instances of com.netflix.servo.monitor.StatsMonitor$2 (and a few other subclasses of StatsMonitor) and com.netflix.servo.monitor.StatsTimer. On Monday there were 26 instances, but by Wednesday that had grown to 468 instances. The real memory impact seems to be that these StatsTimer instances hold references to very large long[] instances (there are over 1,000 instances of long[] with length > 800,000).

Here are some screenshots to help illustrate what I'm seeing:
Monday: https://www.dropbox.com/s/qjsx0yd50ld7ut7/monday_instances.png
Wednesday: https://www.dropbox.com/s/vvo9pf9igpuqjng/wednesday_instances.png
A single instance of the large long arrays, with references: https://www.dropbox.com/s/6biww32earuvwuv/long-arrays.png

Does this help? I have both heap dumps saved. The Wednesday dump is large (825MB) so I didn't make it available. Let me know if you think it would be helpful or if I can provide any additional data. This is causing us continued problems. Right now our workaround is to restart each Eureka instance every 3-4 days. Any help you can provide is appreciated.

Thanks,
Jon Dokulil

Nitesh Kant

Feb 13, 2014, 1:00:46 AM
to eureka_...@googlegroups.com
Thanks Jon,

That's very helpful. Since you have the heap dump, can you share which objects these StatsTimer instances are reachable from? That will help me narrow down the leak more easily.

jondo...@gmail.com

Feb 13, 2014, 8:53:17 AM
to eureka_...@googlegroups.com
Nitesh,

Here's what I dug up. I've got three screenshots: one of the first StatsTimer created, one of a later StatsTimer instance, and an expanded view of that later instance. Both timers are referenced from a unique StatsMonitor$2 instance and from the ScheduledThreadPoolExecutor.

StatsTimer #1: https://www.dropbox.com/s/zppunt6mf055cqu/StatsTimer_1.png
StatsTimer #459 (the later instance): https://www.dropbox.com/s/binaubbaxgbvced/StatsTimer_459.png
Expanded look into instance #459: https://www.dropbox.com/s/5csjp5e7cb665s5/StatsTimer_459_Expanded.png

I hope that helps. I'm happy to keep digging; let me know what else I can check into.

Jon

Nitesh Kant

Feb 17, 2014, 1:14:54 PM
to eureka_...@googlegroups.com
Looking at it more, I see that we only use StatsTimer here: 


and it does not look like these tasks will be created more than once.

It would help if you could upload the heap dump file somewhere I can download it from.

I appreciate your help in digging into this.


jondo...@gmail.com

Feb 18, 2014, 10:26:06 PM
to eureka_...@googlegroups.com
Nitesh,

Sorry for the delay. You can find the larger (Wednesday) heap dump here: https://s3.amazonaws.com/eureka-heap/heapdump-1392235290821-04.hprof

Please let me know when you've got it so I can delete it from our S3 account.

Jon

Nitesh Kant

Feb 20, 2014, 3:22:56 AM
to eureka_...@googlegroups.com
Thanks Jon, I have downloaded the file and will get back to you with the results.

brharr...@gmail.com

Feb 21, 2014, 12:56:28 AM
to eureka_...@googlegroups.com
Can you try setting a system property:

com.netflix.servo.stats.StatsConfig.sampleSize=100
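For example, as a JVM argument wherever the container is launched (for Tomcat that is typically CATALINA_OPTS):

-Dcom.netflix.servo.stats.StatsConfig.sampleSize=100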

Which version of servo-core are you getting? StatsTimer creates a buffer to keep track of data used for percentiles and other stats. The sample size is supposed to be configured based on the expected number of values being measured within the sampling interval. From the docs:

* The size for the buffer used to store samples is controlled using the sampleSize field, and the frequency
* at which stats are computed is controlled with the computeFrequencyMillis option. By default these are
* set to 100,000 entries in the buffer, and computation at 60,000 ms (1 minute) intervals.

This can be set explicitly on the stats config, or you can rely on the default, which depending on the servo-core version is either 100k or, in more recent versions, 1k. The property mentioned above can be used to change the default sample size so the timer is less expensive. Unless you really need all of the statistics, we don't recommend using StatsTimer; in most cases the min/max provided by BasicTimer will give you enough of an idea of the distribution, and the percentiles aren't really meaningful unless you have a high enough rate during a sampling interval anyway.
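To make that concrete, the two options look roughly like this (just a sketch against the servo-core API, with placeholder names like "someCall"; check the builder methods available in the version you're on):

import java.util.concurrent.TimeUnit;

import com.netflix.servo.DefaultMonitorRegistry;
import com.netflix.servo.monitor.BasicTimer;
import com.netflix.servo.monitor.MonitorConfig;
import com.netflix.servo.monitor.StatsTimer;
import com.netflix.servo.monitor.Stopwatch;
import com.netflix.servo.stats.StatsConfig;

public class TimerChoiceExample {
    public static void main(String[] args) {
        // Option 1: keep StatsTimer but give it an explicit, small sample buffer
        // instead of the much larger default.
        StatsConfig statsConfig = new StatsConfig.Builder()
                .withSampleSize(100)                // sized to the expected samples per interval
                .withComputeFrequencyMillis(60000)  // compute stats once a minute
                .build();
        StatsTimer statsTimer =
                new StatsTimer(MonitorConfig.builder("someCall").build(), statsConfig);
        DefaultMonitorRegistry.getInstance().register(statsTimer);

        // Option 2: use BasicTimer, which only tracks count/total/min/max and
        // does not keep a sample buffer at all.
        BasicTimer basicTimer =
                new BasicTimer(MonitorConfig.builder("someCall").build(), TimeUnit.MILLISECONDS);
        DefaultMonitorRegistry.getInstance().register(basicTimer);

        // Timing a call looks the same with either timer.
        Stopwatch stopwatch = basicTimer.start();
        try {
            // ... the work being timed ...
        } finally {
            stopwatch.stop();
        }
    }
}

The BasicTimer route avoids the large sample buffers entirely, which appears to be where the big long[] instances in your dump are coming from.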

As to the number of StatsTimer instances, that would depend on how many are getting registered by the application. Based on the description, it would seem like something in eureka keeps creating more of them, though it is curious that we haven't hit this problem internally. Nitesh, are you able to reproduce this in a test environment?

Brian 

jondo...@gmail.com

Feb 21, 2014, 2:31:16 PM
to eureka_...@googlegroups.com
Brian,

We'll try setting that system property. From what I understand that should slow down the memory leak... does that sound correct?

I see servo-core-0.4.44.jar in our Eureka deploy.

One thing we are doing that is probably different is running Eureka on Windows Server 2012. We are planning to move to Linux machines but have not yet done so.

Jon

Nitesh Kant

Feb 23, 2014, 5:55:44 PM
to eureka_...@googlegroups.com
Brian,

The code using StatsTimer was recently added to eureka. I will verify whether we see the same behavior in test, and maybe I can change it to use BasicTimer instead.

Nitesh

Feb 24, 2014, 1:20:25 PM
to eureka_...@googlegroups.com
Jon,

I have made the changes to replace StatsTimer with BasicTimer and will release eureka with these changes today. Let me know if this fixes the issue.

Nitesh Kant

Feb 24, 2014, 2:04:24 PM
to eureka_...@googlegroups.com
Eureka version 1.1.127 will have this fix. It should be available in Maven in around 12 hours.