On 21 Dec 2017, at 18:25, Abhishek Agarwal <abhis...@gmail.com> wrote:
Hello everyone,

Recently, one of our JVMs went into a weird state in which minor GCs started kicking in almost every minute, and, most surprisingly, a single minor GC was taking as much as 100 seconds. The heap size is around 12 GB and the young gen is about 30% of that. As soon as a GC starts, CPU usage understandably rises as well. I cannot figure out how a single minor GC could take this much time. The machine does not have swapping enabled, and I also ran sar and confirmed there was no swapping during the issue. We are using the parallel collector for the young gen and CMS for the old gen. How can I go about troubleshooting this further? Any pointers will be appreciated.
On Thu, Dec 21, 2017 at 3:36 PM, Erik Svensson <er...@phlogiston.se> wrote:

If it's on a virtual host, then I've seen the same behaviour when the underlying hardware started swapping. I've also seen it when the other guests on the same hardware had high CPU demands (i.e. my host was CPU-starved). But, as Martin said: GC logging, always.

Cheers,
Erik Svensson
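A minimal version of that diagnostic setup might look as follows (HotSpot JDK 8-era flags, matching the ParNew/CMS configuration described in this thread; the log path is illustrative):

    # HotSpot (JDK 8-era) flags for detailed pause diagnostics; log path is illustrative
    -Xloggc:/var/log/myapp/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime    # total stop-the-world time, not just GC work
    -XX:+PrintSafepointStatistics         # a long time-to-safepoint can masquerade as a long GC
    -XX:PrintSafepointStatisticsCount=1

    # Confirm there is no swap activity while a long pause is in progress
    vmstat 1        # the si/so columns should stay at 0
    sar -B 1 60     # paging statistics, one-second samples for a minute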
On Dec 21, 2017, at 9:40 PM, Josh Humphries <jh...@bluegosling.com> wrote:

Like Erik said, check that the host isn't swapping. An old-gen GC can end up scanning parts of the heap that the OS may have paged out if it encountered memory pressure.

Another issue I've seen cause this sort of thing is having GC logging turned on while there are I/O problems, because GC logging is not async. I've seen GC logging itself, to a drive that had a SCSI firmware issue, cause silly-long GC times. You can possibly corroborate this by checking the kernel logs for other signs of disk or driver issues.
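A quick way to check for that disk angle on a Linux host might be something like the following (the grep pattern is only illustrative, and logging to /dev/shm is just one way to take disk latency out of the picture while diagnosing):

    # Scan the kernel log for disk/driver trouble around the long pauses
    dmesg -T | egrep -i 'i/o error|scsi|timeout'

    # Logging to memory-backed storage takes disk latency out of the picture
    # (the log is lost on reboot, so this is a diagnostic step, not a fix)
    -Xloggc:/dev/shm/gc.log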
Hi,

If GC logging is giving you grief, you've got bigger problems, because GC logs only a very small amount of data. If this is the issue, there will be a hint of it in the GC logs themselves. At any rate, the problem is with GC, and turning off GC logging leaves you blind to any possible solutions.

Kind regards,
Kirk Pepperdine
On Dec 22, 2017, at 7:28 AM, Abhishek Agarwal <abhis...@gmail.com> wrote:

Thank you for the pointers. We have GC logs, but they didn't help much. A few corrections: the old gen size is ~20 GB and the young gen size is ~1.5 GB, and the skew between the young gen and old gen sizes has been fixed. By the way, the service is running on AWS. Here is a snippet of the GC logs from when the pause time increased to ~200 seconds:
59549.736: [GC (Allocation Failure) 59549.736: [ParNew: 1688278K->56721K(1887488K), 0.5518315 secs] 4474842K->3433110K(21480704K), 0.5526896 secs] [Times: user=1.92 sys=0.25, real=0.55 secs]
59581.163: [GC (Allocation Failure) 59581.164: [ParNew: 1734545K->43688K(1887488K), 0.4640033 secs] 5110934K->3941007K(21480704K), 0.4650333 secs] [Times: user=1.59 sys=0.26, real=0.46 secs]
59617.565: [GC (Allocation Failure) 59617.565: [ParNew: 1721512K->40500K(1887488K), 0.1154788 secs] 5618831K->3937818K(21480704K), 0.1166312 secs] [Times: user=0.44 sys=0.00, real=0.12 secs]
59645.910: [GC (Allocation Failure) 59645.911: [ParNew: 1718324K->46789K(1887488K), 0.1918905 secs] 5615642K->3944132K(21480704K), 0.1930935 secs] [Times: user=0.77 sys=0.00, real=0.20 secs]
59692.159: [GC (Allocation Failure) 59692.159: [ParNew: 1724613K->45533K(1887488K), 5.4467875 secs] 5621956K->3944000K(21480704K), 5.4480386 secs] [Times: user=21.86 sys=0.00, real=5.45 secs]
59716.348: [GC (Allocation Failure) 59716.348: [ParNew: 1723357K->49256K(1887488K), 8.2982698 secs] 5621824K->3948162K(21480704K), 8.2994285 secs] [Times: user=33.47 sys=0.00, real=8.30 secs]
59751.501: [GC (Allocation Failure) 59751.502: [ParNew: 1727080K->47156K(1887488K), 9.4546047 secs] 5625986K->3946152K(21480704K), 9.4558187 secs] [Times: user=37.88 sys=0.00, real=9.45 secs]
59779.293: [GC (Allocation Failure) 59779.293: [ParNew: 1724980K->31847K(1887488K), 98.3531267 secs] 5623976K->3949810K(21480704K), 98.3542271 secs] [Times: user=393.93 sys=0.00, real=98.35 secs]
59933.098: [GC (Allocation Failure) 59933.099: [ParNew: 1709671K->24851K(1887488K), 155.8798281 secs] 5627634K->3947185K(21480704K), 155.8810828 secs] [Times: user=629.60 sys=0.00, real=155.88 secs]
60138.148: [GC (Allocation Failure) 60138.148: [ParNew: 1702675K->22690K(1887488K), 162.4118957 secs] 5625009K->3945214K(21480704K), 162.4132521 secs] [Times: user=659.75 sys=0.00, real=162.42 secs]
60352.343: [GC (Allocation Failure) 60352.343: [ParNew: 1700514K->26504K(1887488K), 166.2566004 secs] 5623038K->3949247K(21480704K), 166.2577295 secs] [Times: user=672.08 sys=0.00, real=166.25 secs]
60576.273: [GC (Allocation Failure) 60576.274: [ParNew: 1704328K->49624K(1887488K), 168.2311204 secs] 5627071K->3972465K(21480704K), 168.2322492 secs] [Times: user=660.16 sys=0.00, real=168.23 secs]
60791.458: [GC (Allocation Failure) 60791.459: [ParNew: 1727448K->33138K(1887488K), 174.0129541 secs] 5650289K->3956199K(21480704K), 174.0139678 secs] [Times: user=680.44 sys=0.00, real=174.02 secs]
61022.223: [GC (Allocation Failure) 61022.223: [ParNew: 1710962K->29340K(1887488K), 176.6459303 secs] 5634023K->3952483K(21480704K), 176.6474561 secs] [Times: user=690.10 sys=0.00, real=176.64 secs]
61246.967: [GC (Allocation Failure) 61246.967: [ParNew: 1707164K->31914K(1887488K), 182.4714040 secs] 5630307K->3955486K(21480704K), 182.4724297 secs] [Times: user=721.68 sys=0.00, real=182.47 secs]
61483.620: [GC (Allocation Failure) 61483.621: [ParNew: 1709738K->27879K(1887488K), 187.2219554 secs] 5633310K->3951691K(21480704K), 187.2230896 secs] [Times: user=783.68 sys=0.00, real=187.23 secs]
61723.432: [GC (Allocation Failure) 61723.433: [ParNew: 1705703K->22446K(1887488K), 192.4606406 secs] 5629515K->3946498K(21480704K), 192.4618564 secs] [Times: user=721.75 sys=0.00, real=192.46 secs]
61965.560: [GC (Allocation Failure) 61965.561: [ParNew: 1700270K->30881K(1887488K), 197.1221819 secs] 5624322K->3955113K(21480704K), 197.1233286 secs] [Times: user=806.53 sys=0.00, real=197.13 secs]
62211.173: [GC (Allocation Failure) 62211.174: [ParNew: 1708705K->22526K(1887488K), 199.7864929 secs] 5632937K->3946917K(21480704K), 199.7875671 secs] [Times: user=811.19 sys=0.00, real=199.78 secs]
62464.936: [GC (Allocation Failure) 62464.937: [ParNew: 1700350K->47417K(1887488K), 203.1958095 secs] 5624741K->3972025K(21480704K), 203.1969591 secs] [Times: user=789.37 sys=0.00, real=203.20 secs]
62682.808: [GC (Allocation Failure) 62682.808: [ParNew: 1724605K->37951K(1887488K), 206.8942846 secs] 5649212K->3962723K(21480704K), 206.8948168 secs] [Times: user=821.45 sys=0.00, real=206.90 secs]
--
Regards,
Abhishek Agarwal
On Jan 2, 2018, at 9:08 AM, pboot...@gmail.com wrote:

Abhishek,

The fact that you are running on AWS is enormously significant.
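Following Erik's CPU-starvation point, one AWS-specific signal worth watching during a long pause is hypervisor steal time. A sketch, assuming the sysstat tools are installed (intervals and counts are arbitrary):

    # %steal is CPU time the hypervisor withheld from this guest
    mpstat 1 5
    sar -u 1 60     # watch the %steal column while a long ParNew pause occurs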