Have you seen this article? This may be of some help:
SecureRandom has often turned up as the cause of very long pauses on systems:
Brad, can you confirm for us what’s telling you there are the long GC times? Just curious, since you say you can’t get to FR. Is it a report in FR after it awakes? Or its logs? Or some other native Java logging or UI?
Since you say there’s no CPU use during the time, I’d be inclined to wonder if there was some other explanation for things, but I realize if some tool says clearly there’s a long GC time, you’ll be inclined to trust it as saying “this is the problem”. Even so, sometimes that’s more the symptom than the true root cause, believe it or not.
Indeed, when I first saw your mention of a virtualized environment, my first thought was along the lines of one of the conclusions in your first note: it could be that something is amiss in the hypervisor’s allocation of resources. Whenever there’s something that can’t be otherwise explained, and virtualization is involved, I can’t help but point a flashlight of suspicion at that being the issue. Of course it can be difficult to prove, but it’s worth giving attention to the possibility, at least to rule it out. (Worst case, one might have to see about backing up the system and restoring it, or recovering a VM snapshot, onto a real system, equivalently configured, to prove whether it’s the VM or not.)
One other (easier) thing along those lines: how many CPUs were allocated to the VM? I know you’re saying you don’t see CPU as an issue, but another problem I’ve seen with VMs is that folks may allocate only a single CPU (perhaps a powerful one), when some GC algorithms (especially the default one CF specifies) are sensitive to that and work worse when there’s only one CPU. Of course, the G1 algorithm being the newest, you’d think it would be the best. But I’ve just not yet heard of or explored how it may fare in a single-CPU setup. Just something to consider.
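For anyone wanting to check the CPU count and cumulative GC times as the JVM itself sees them, here is a minimal sketch using only standard JDK calls. The class name JvmGcSnapshot is made up for illustration. Note that, run standalone, it reports on its own JVM; to see CF's numbers the same calls would have to run inside the CF JVM (or be read remotely over JMX), which is not shown here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
public class JvmGcSnapshot {

    // Number of CPUs the JVM believes it can use. On a VM or container this
    // reflects what the guest was given, not what the host really has.
    static int visibleCpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    // One line per garbage collector: its name, how many collections it has
    // run, and the approximate accumulated collection time in milliseconds.
    static List<String> collectorSummaries() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", totalMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println("CPUs visible to JVM: " + visibleCpus());
        collectorSummaries().forEach(System.out::println);
    }
}
```

If the first line prints 1, the single-CPU concern above is at least worth ruling out.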
HTH.
/charlie
--You received this message because you are subscribed to the Google Groups "FusionReactor" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fusionreacto...@googlegroups.com.
To post to this group, send email to fusion...@googlegroups.com.
Visit this group at http://groups.google.com/group/fusionreactor.
For more options, visit https://groups.google.com/d/optout.
Thanks, Brad. So is your next stop to talk to someone involved in the hypervisor (or research more yourself if you’re “him”), to see if perhaps the VM is under-allocated, or if there may be any logging of metrics at the hypervisor level to point to that possibility?
BTW, here are some tools for monitoring/managing VMs that may be helpful: http://www.cf411.com/vmmon
/charlie
From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of Brad Wood
Sent: Tuesday, December 16, 2014 12:46 AM
To: fusionreactor
Subject: Re: [fusionreactor] Re: Astronomical GC times
Excellent questions, Charlie. I hope you'll understand, I didn't address a lot of those in my initial post just to keep things short, since people's eyes tend to glaze over if you give too much detail right away :) Since you asked though, I'll categorically address them now:
Brad, can you confirm for us what’s telling you there are the long GC times?
Three things:
<snip>
Subject: Re: [fusionreactor] Re: Astronomical GC times
cat /proc/sys/kernel/random/entropy_avail
Our systems were at around 100 - 150.
securerandom.source=file:/dev/urandom
to
securerandom.source=file:/dev/./urandom
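To see what a given JVM is actually configured with, and whether seeding is slow on a box, something like the following rough sketch may help. It uses only standard JDK calls; the class name EntropyCheck and its method names are invented for illustration. It assumes a Linux system for the /proc read (it returns -1 elsewhere), and that generateSeed draws on the configured seed source, which is typical for the default Linux provider but may vary by JDK version.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.SecureRandom;
import java.security.Security;

// Hypothetical helper class, for illustration only.
public class EntropyCheck {

    // The seed source this JVM is set to use: the -Djava.security.egd system
    // property if present, otherwise securerandom.source from java.security.
    static String configuredSource() {
        String egd = System.getProperty("java.security.egd");
        return egd != null ? egd : Security.getProperty("securerandom.source");
    }

    // Kernel entropy estimate, i.e. the same number the cat command prints.
    // Returns -1 if the file is missing or unreadable (e.g. not on Linux).
    static int entropyAvail() {
        try {
            byte[] raw = Files.readAllBytes(
                    Paths.get("/proc/sys/kernel/random/entropy_avail"));
            return Integer.parseInt(new String(raw, StandardCharsets.US_ASCII).trim());
        } catch (IOException | NumberFormatException e) {
            return -1;
        }
    }

    // Milliseconds taken to obtain numBytes of seed material. A large value
    // here suggests the JVM is blocking while it waits for entropy.
    static long timeSeedMillis(int numBytes) {
        long start = System.nanoTime();
        new SecureRandom().generateSeed(numBytes);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("Seed source:   " + configuredSource());
        System.out.println("entropy_avail: " + entropyAvail());
        System.out.println("Seed time ms:  " + timeSeedMillis(16));
    }
}
```

Running it before and after the securerandom.source change gives a quick comparison of the seed time on the same machine.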
Thanks for sharing, Brad.
I’d wonder instead if it may be just that the first server/VM was over-allocated. I think this happens on virtualized servers more often than people may realize: the person in control of the hypervisor spreads the available host resources over several (if not dozens or hundreds of) VMs. The VMs “think” they have x GB of memory, for instance, but it’s really allocated dynamically, on demand, by the hypervisor. If the VMs together ask for more memory than the hypervisor really has, you experience odd slowness on the VM.
Sadly, that’s about the only way “from within” the VM that you can suspect it’s happening. There’s no way to “know”, as far as I know.
Anyone have more specific experience on this matter?
/charlie
From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of Brad Wood
Sent: Tuesday, January 06, 2015 10:54 PM
To: fusion...@googlegroups.com
Subject: Re: [fusionreactor] Re: Astronomical GC times
Just a followup to this thread for anyone finding it in the future. We were never able to find a "smoking gun" per se, as I would have preferred, but some testing with the hosting provider showed that moving our CF containers to another Virtuozzo host with far fewer other containers reduced the longest GC pauses to around 4-5 seconds. While not stellar, a 5 second pause time is at least doable. My conclusion is that budget VPS "containers" like the ones that run on Virtuozzo are not optimal for Java-based applications with large heaps when the host is also servicing a large number of other containers. In this case, most of the other containers were PHP sites, each of which probably used much smaller amounts of RAM.