Memory leak in grizzly

Ryan de Laplante

Jun 24, 2021, 2:48:04 PM

to Payara Forum

For years we've had to occasionally kill the GlassFish/Payara java.exe process because it ran out of memory and became completely unresponsive, using 95% CPU. I originally thought it was related to JDBC connections going bad because the server.log ALWAYS starts with this:

Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.7.7.payara-p2): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: java.sql.SQLException: I/O Error: Read timed out
Error Code: 0

Which eventually leads to a bunch of these:

java.sql.SQLException: java.lang.reflect.UndeclaredThrowableException

Caused by: java.lang.reflect.InvocationTargetException

Caused by: java.sql.SQLException: Invalid state, the Connection object is closed.

We're using the jTDS JDBC driver v1.3.1 with MSSQL server over the internal network. We have the connection pools configured to verify connections before use and to close all connections on discovery of a failure. I'd expect it to recover automatically, but it doesn't.

The issue happened again today. This time I used jmap to get a heap dump and ran it through Visual VM to get some clues. What I found is that there are over 4 million (each) of org.glassfish.grizzly.http.util.CharChunk/BufferChunk/ByteChunk/DataChunk objects using almost 1.5 GB of memory. I've attached a screenshot. This leads me to believe there is a memory leak in grizzly, the HTTP(S) web server component in the Payara application server.

The website hosted in Payara is a high traffic website. When the DB becomes unusable, I'm sure many people are clicking over and over while more new users connect to the web server. That should not cause Payara to go to 95% CPU solid, cause the memory to balloon, or make the website become completely unresponsive. When this issue happens, we test the DB and find that it is up and running fine. When we restart Payara, the DB connections work without issue.

Any suggestions to help further diagnose and fix this issue would be greatly appreciated.

This server is running Payara 5.2020.5 and OpenJDK 11 on a Windows Server. We've been using GlassFish/Payara since 2008 and I think we've been having this problem occasionally with all versions.

Thanks,

Ryan

payara profiler.png

Fabio Luis - Vanguarda TI

Jun 24, 2021, 3:39:35 PM

to Ryan de Laplante, Payara Forum

Hi Ryan.

I had a similar problem in the past, and it cost me some months too.

In my case it was a client application that did not close its connections. Since it queried every 5 minutes, within one week we had 2016 locked connections, giving us the famous OutOfMemory.

I don't know why a timeout didn't occur and close the idle connections, but my guess is that it has to do with idle TCP connections in the operating system (Linux in my case), as the keep-alive timeout is set up.

We asked the client to fix his application and close the connections before making new ones, and it solved the problem.
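
For anyone hitting the same pattern, here is a minimal sketch of why unclosed connections pile up and how try-with-resources avoids it. `TrackedConnection` is a hypothetical stand-in for a real `java.sql.Connection` (which is also `AutoCloseable`); none of this is taken from the client application described above.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ConnectionLeakDemo {
    /** Hypothetical stand-in for java.sql.Connection; counts instances that are still open. */
    static final class TrackedConnection implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();
        TrackedConnection() { OPEN.incrementAndGet(); }
        @Override public void close() { OPEN.decrementAndGet(); }
    }

    /** Leaky style: opens one connection per poll and never closes it. */
    static int pollWithoutClosing(int polls) {
        TrackedConnection.OPEN.set(0);
        for (int i = 0; i < polls; i++) {
            new TrackedConnection(); // the query would run here; close() is never called
        }
        return TrackedConnection.OPEN.get();
    }

    /** Safe style: try-with-resources closes the connection even if the query throws. */
    static int pollWithTryWithResources(int polls) {
        TrackedConnection.OPEN.set(0);
        for (int i = 0; i < polls; i++) {
            try (TrackedConnection c = new TrackedConnection()) {
                // the query would run here
            }
        }
        return TrackedConnection.OPEN.get();
    }

    public static void main(String[] args) {
        int pollsPerWeek = (60 / 5) * 24 * 7; // one poll every 5 minutes for a week
        System.out.println("polls per week:            " + pollsPerWeek);
        System.out.println("open after leaky polling:  " + pollWithoutClosing(pollsPerWeek));
        System.out.println("open with try-with-resources: " + pollWithTryWithResources(pollsPerWeek));
    }
}
```

One poll every 5 minutes is 12 per hour, so 12 x 24 x 7 = 2016 connections left open after a week, which matches the number above.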

Kind Regards

Fabio Silva

--
You received this message because you are subscribed to the Google Groups "Payara Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to payara-forum...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/payara-forum/16b976a0-daa7-4a58-89b1-81b40836636cn%40googlegroups.com.

Ryan de Laplante

Jun 24, 2021, 3:42:41 PM

to Payara Forum

Another interesting clue... based on application session data in the heap dump, I think there were 46 users at the time of the crash. The heap dump shows 235,516 instances of org.glassfish.grizzly.http.Cookie which seems odd. I've attached another screenshot that shows only org.glassfish.grizzly objects.

payara profiler2.png

Ryan de Laplante

Jun 24, 2021, 3:49:51 PM

to Payara Forum

Thank you very much for the tip Fabio. I will look into that. I'm not so sure that is the issue here because the application uses injected JPA PersistenceContext in Spring beans, so I think the connection management happens at the container level. Also, we use JDBC connection pools in Payara, so I would think we would see clues early on if we exhausted all connections in the pool. But who knows, maybe there is some manual JDBC code lurking somewhere that might be opening connections and not closing them.

Will Hartung

Jun 24, 2021, 5:26:27 PM

to Ryan de Laplante, Payara Forum

On Thu, Jun 24, 2021 at 11:48 AM Ryan de Laplante <ry...@ryandelaplante.ca> wrote:

> For years we've had to occasionally kill the GlassFish/Payara java.exe process because it ran out of memory and became completely unresponsive, using 95% CPU. I originally thought it was related to JDBC connections going bad because the server.log ALWAYS starts with this:
>
> Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.7.7.payara-p2): org.eclipse.persistence.exceptions.DatabaseException
> Internal Exception: java.sql.SQLException: I/O Error: Read timed out
> Error Code: 0
>
> Which eventually leads to a bunch of these:
>
> java.sql.SQLException: java.lang.reflect.UndeclaredThrowableException
> Caused by: java.lang.reflect.InvocationTargetException
> Caused by: java.sql.SQLException: Invalid state, the Connection object is closed.
>
> We're using the jTDS JDBC driver v1.3.1 with MSSQL server over the internal network. We have the connection pools configured to verify connections before use and to close all connections on discovery of a failure. I'd expect it to recover automatically, but it doesn't.

On the surface, this all sounds like an issue with the DB server or the network. On the client side, a socket is a socket. It's pretty brute force stupid. The I/O Error is your server not responding. If your DB server (or the network) is sick, who knows what the automatic refreshing of connections will do. You don't have stack traces showing where you're getting the Invalid state errors, they could well be coming from the connection pool logic trying to reconnect.

> The issue happened again today. This time I used jmap to get a heap dump and ran it through Visual VM to get some clues. What I found is that there are over 4 million (each) of org.glassfish.grizzly.http.util.CharChunk/BufferChunk/ByteChunk/DataChunk objects using almost 1.5 GB of memory. I've attached a screenshot. This leads me to believe there is a memory leak in grizzly, the HTTP(S) web server component in the Payara application server.

I'm not saying there isn't a memory leak in Grizzly but...there isn't a memory leak in Grizzly. That code is pushing 15 years old now. If there's a resource leak (memory, connections, whatever), it's in your application. Unfortunately, you have to be savvy on your application to figure out memory leaks.

When analyzing heap dumps for leaks, don't look for system classes (like org.glassfish.*), look for your application classes. Odds are, somewhere, somehow, something from your application is "growing" and not being properly released, and THOSE are hanging on to something from the system. You could be holding on to GBs of String, Byte Arrays, etc. but those are all derivative of your classes. Look at those first, the rest is noise.

> The website hosted in Payara is a high traffic website. When the DB becomes unusable, I'm sure many people are clicking over and over while more new users connect to the web server. That should not cause Payara to go to 95% CPU solid, cause the memory to balloon, or make the website become completely unresponsive. When this issue happens, we test the DB and find that it is up and running fine. When we restart Payara, the DB connections work without issue.

When underlying parts of the infrastructure get sick, it's not untoward for Payara to behave badly. Most of the time, if the underlying, external cause can be cleaned up, Payara will recover just fine. But it may take some thrashing to do it. Other times, kicking the server is simply more expedient. But that doesn't necessarily mean there's a root cause in the app server itself, it's something external driven by your application and its traffic.

> Any suggestions to help further diagnose and fix this issue would be greatly appreciated.

If you do indeed suffer an Out of Memory exception, the SAFEST approach is to kick the server. OOM errors are unrecoverable, they can happen ANYWHERE (like during static class initialization) and thus lead to "impossible" things. Obviously use your judgement based on the stack traces and where it's happening, but if it's more than one or two (even if the server "recovered"), it's safer to kick the server. OOM servers may as well be bombarded with alpha radiation -- anything goes. (And I've seen that before too, "impossible" things happening in Java system classes).

You have to have intimate knowledge of your application to hunt down memory leaks. You have to know what looks right and what looks wrong amid the GBs of noise in the heap. The best thing is to take several heap dumps over time. Obviously, it helps to have some idea when this may happen. Heap dumps aren't cheap. But if you take one every 5 to 10 minutes, you can then compare and see which classes are growing and which are not, and you need to be able to explain what those are and why they're doing it. Those that you can't explain are obviously suspicious.

And as I said earlier, ignore any of the java or Glassfish/Payara classes out the gate. Always assume that it's your code that's wrong, 99% of the time, you're right. All of those other classes are derivative of your application doing its work.

In my time diagnosing these things, I've never had any luck with actual heap dumps. They're stupid expensive, the tools are terrible (they've always been terrible), and slow as molasses. Your tool of choice is the jmap histogram. These are live snapshots of the classes in your heap. These are much, much cheaper to run, and can be run in production (they do add load, of course). But they're human digestible.
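
A minimal sketch of that comparison, assuming two histograms were saved to text files with something like `jmap -histo <pid>` (the `:live` variant, `jmap -histo:live <pid>`, forces a full GC first, so it costs more). The class name `HistogramDiff` and the file arguments are made up for illustration; the only thing taken from jmap is the row layout of rank, instance count, byte count and class name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HistogramDiff {
    // Matches rows such as "   1:     123456     7890123  java.lang.String (java.base@11.0.11)"
    private static final Pattern ROW =
            Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    /** Maps class name to instance count; header, separator and Total lines are skipped. */
    static Map<String, Long> parseInstanceCounts(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line);
            if (m.find()) {
                counts.put(m.group(3), Long.parseLong(m.group(1)));
            }
        }
        return counts;
    }

    /** Classes whose instance count grew between the two snapshots, largest growth first. */
    static List<Map.Entry<String, Long>> growth(Map<String, Long> before, Map<String, Long> after) {
        List<Map.Entry<String, Long>> deltas = new ArrayList<>();
        for (Map.Entry<String, Long> e : after.entrySet()) {
            long delta = e.getValue() - before.getOrDefault(e.getKey(), 0L);
            if (delta > 0) {
                deltas.add(Map.entry(e.getKey(), delta));
            }
        }
        deltas.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return deltas;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java HistogramDiff.java <earlier-histo.txt> <later-histo.txt>");
            return;
        }
        Map<String, Long> before = parseInstanceCounts(Files.readAllLines(Paths.get(args[0])));
        Map<String, Long> after = parseInstanceCounts(Files.readAllLines(Paths.get(args[1])));
        // Print the 20 fastest-growing classes by instance count.
        growth(before, after).stream().limit(20)
                .forEach(e -> System.out.printf("%+12d  %s%n", e.getValue(), e.getKey()));
    }
}
```

Run against snapshots a few minutes apart, the top of the list is where to start asking "why is this growing?", keeping in mind the advice above to look at application classes first.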

During times of stress hunting stuff like that down, I've had jobs that captured one every 5 minutes. Let it run for a week. When something gets sick, you now have a history you can look back on during the post mortem. Even if the server went extra stupid and you just have to kick it, you have history. I don't run those all the time, just when something like this is going weird. On the other hand, I do run thread dumps solid, 24x7, 5 minutes apart, day in and day out. Those are cheap enough. Gives a good snapshot of what your server is doing at any one time. I'd rather have 2 months of unread thread dumps lying idle on the file system, than not have one the few minutes before my server kicked the bucket. Put them on a 60 day clean up timer.

"Hey, my memory is slammed. HEY, I have 700 com.example.MainServlet running in the thread dump! Maybe that's important!"

These are things you can do today that don't cost you anything.
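
As a rough illustration of the "700 MainServlet threads" check, here is a sketch that counts how many threads in a saved thread dump have a given class on their stack. It assumes the dump is the text produced by `jstack <pid>` or `jcmd <pid> Thread.print`, where each thread starts with a header line beginning with a double quote. `ThreadDumpCount` is a made-up name, and `com.example.MainServlet` is just the example from above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

class ThreadDumpCount {
    /** Counts threads whose stack frames mention the given text; each thread counts once. */
    static int countThreadsContaining(List<String> dumpLines, String needle) {
        int count = 0;
        boolean inThread = false;
        boolean matched = false;
        for (String line : dumpLines) {
            if (line.startsWith("\"")) {
                // A quoted thread name starts a new thread block; close out the previous one.
                if (inThread && matched) {
                    count++;
                }
                inThread = true;
                matched = false;
            } else if (inThread && line.contains(needle)) {
                matched = true;
            }
        }
        if (inThread && matched) {
            count++; // the last thread block has no following header to close it
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ThreadDumpCount.java <thread-dump.txt> <class-or-method>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println(countThreadsContaining(lines, args[1])
                + " threads have " + args[1] + " on their stack");
    }
}
```

Running this over each dump in the history gives a quick time series of how many request threads were stuck in the same place before the server fell over.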

Regards,

Will Hartung

Gregor Kovač

Jun 24, 2021, 6:49:42 PM

to Will Hartung, Ryan de Laplante, Payara Forum

Hi!

Will, nice write-up. It gave me some ideas on how to solve one of our long-standing problems with GlassFish.

Just a couple of questions:

- can you please provide the command-line options of jmap you used to find your memory leak?

- when I try to run jmap against my GlassFish like "jmap PID" I get "Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process". I can attach with my NetBeans debugger to Glassfish. What am I missing with jmap?

Best regards,

Gregor

On 24. 06. 21 at 23:26, Will Hartung wrote:


Ryan de Laplante (Personal)

Jun 24, 2021, 8:56:12 PM

to Payara Forum

Thank you William for the detailed reply.   I have a very deep, intimate knowledge of the application since I wrote it and have been maintaining and supporting it for the past 16+ years. I also have experience troubleshooting and fixing at least a few memory leaks in this application. jmap, heap dumps and Visual VM have been invaluable for diagnosing and fixing these leaks. And yes, they were all my application code's fault. I've been programming professionally for 23 years and know my own application code is always the first suspect when diagnosing these types of issues. No need to tell me that, thank you.

This application processes millions of transactions. Some customers experience this issue maybe 1 - 3 times per year, while others experience it every month or two. When it happens, it's quick to go down, not a slow buildup. I've always suspected a DB issue (such as full backups) or a flaky network issue.   Other applications hosted on other servers that communicate with the same DB do not have an issue, although they are .NET and PHP apps.

I know Grizzly is a mature product. My application has been running on Payara since it was called Sun Java System Application Server, before Grizzly even existed. Just because Grizzly is a mature product doesn't mean there isn't an obscure bug or memory leak lurking in there somewhere triggered by a rare, strange network issue. It's always a possibility and I was hoping someone might be willing and able to help me look into that possibility a bit deeper.   If it really is Payara's issue, finding and fixing it will benefit everyone. If it's my application code's fault, I'm happy to admit it and share the gained knowledge with everyone.

I've attached a partial log that begins where we started seeing issues today.   There are many warnings about tasks being delayed and a "huge system clock jump". That's a big clue. I'm pretty sure this customer is hosting on VMs. If the VM becomes very sluggish, that may be triggering these issues, whatever they are.

Thanks,
Ryan

db and clock issue.txt

Will Hartung

Jun 25, 2021, 12:12:57 AM

to Ryan de Laplante (Personal), Payara Forum

On Thu, Jun 24, 2021 at 5:56 PM Ryan de Laplante (Personal) <ry...@ryandelaplante.ca> wrote:

> I've attached a partial log that begins where we started seeing issues today. There are many warnings about tasks being delayed and a "huge system clock jump". That's a big clue. I'm pretty sure this customer is hosting on VMs. If the VM becomes very sluggish, that may be triggering these issues, whatever they are.

I can't speak to the Hazelcast chatter. Someone else may be able to help with that.

I summarized the first event in the log, with elided stack traces, below.

Was this a busy server? You must keep very clean logs.

This appears to all be the same "transaction", stemming from SearchCommand.

Apparently you log your SQL errors, but don't honor them (when I get these things, I just blow out the whole stack frame). You appear to press on.

This happened over the period of several minutes. The first failure at 8:31:58, and the last 8:35:09. The gap between the first two is quite long, almost 2 minutes. You get 2 timeouts, then the DB Connection (vs the socket) is closed. Inevitably, it tries to rollback against a dead DB Connection.

The socket timeouts say it's network or DB. It's obviously not too severe of a network issue, as the socket is still alive (the socket connection wasn't dropped), but that doesn't mean data is passing.

Did all of the DB calls fail during this transaction? Or just these?

The socket timeouts are interesting for a DB connection. Did you make them aggressive? I never set mine, honestly, in case we get a long query (which would also cause a timeout). But given a long enough query and a short enough timeout, you'll get the same result (a read timeout).

But you can see you had a timeout at 31:58, then the next one at 33:47 -- almost 2m later -- 109s. So, what may have been happening in between there? It's an odd timeout value (despite being decimal creatures, we tend to set timeouts at minute fractions: 1/4, 1/2, full minute, etc.). 109s isn't anything "regular", whether minute based or 100 second based. So, just an odd number.
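
For reference, the gaps can be computed straight from the bracketed timestamps in the log excerpt below rather than by eye. This is only a sketch: `LogGap` is a made-up helper, and the four timestamps are the ones shown in the summarized log entries.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

class LogGap {
    // Timestamp layout used in the server.log entries, e.g. 2021-06-24T08:31:58.952-0700
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSSZ");

    /** Elapsed time between two log timestamps. */
    static Duration between(String first, String second) {
        return Duration.between(OffsetDateTime.parse(first, LOG_TS),
                                OffsetDateTime.parse(second, LOG_TS));
    }

    public static void main(String[] args) {
        String[] stamps = {
            "2021-06-24T08:31:58.952-0700",
            "2021-06-24T08:33:47.842-0700",
            "2021-06-24T08:34:30.149-0700",
            "2021-06-24T08:34:31.556-0700",
        };
        // Print the gap between each consecutive pair of entries.
        for (int i = 1; i < stamps.length; i++) {
            Duration gap = between(stamps[i - 1], stamps[i]);
            System.out.printf("%s -> %s : %.3f s%n",
                    stamps[i - 1], stamps[i], gap.toMillis() / 1000.0);
        }
    }
}
```

The first gap comes out at 108.890 s, which agrees with the difference between the two `timeMillis` values in the log (1624548827842 - 1624548718952 = 108890 ms).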

I can't speak to what closed the connection. The driver may have done it after seeing the other errors, or the DB may have done it. In Postgres, if you make any kind of error, it blows the transaction. The connection is still good, but any other work in the transaction results in an error (another reason we just blow out SQL Exceptions -- get one, and the transaction is dead anyway). But, the issue with the DB Connection Pool not resetting connections does not apply here. You're still in the same transaction, the pool has not had a chance to release and reuse it yet.

Now, you mentioned VMs and such, which is a whole other kettle of fish. The Hazelcast alerts may speak to something hinky at the VM level, but the times mentioned in the log don't correlate to the times of the events, plus they're all of different threads (which doesn't necessarily indicate anything, but it's interesting). The VM can very well be "losing time", which messes everything up.

I've done things like that in the past as well, had a continually running job that just logged the time to a file, every 15s. And, yea, sometimes, we'd see gaps. Lines that were more than 15s apart. We think it had something to do with the storage and the kernel.
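
A minimal sketch of that kind of heartbeat job, for anyone who wants to try it. `ClockGapCheck` and its parameters are assumptions for illustration, not the original job; the defaults are short so it finishes quickly, whereas the job described above ticked every 15s.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

class ClockGapCheck {
    /**
     * Given heartbeat timestamps (epoch millis) that should be intervalMillis apart,
     * returns the index of every heartbeat that arrived more than toleranceMillis late.
     */
    static List<Integer> lateHeartbeats(long[] stamps, long intervalMillis, long toleranceMillis) {
        List<Integer> late = new ArrayList<>();
        for (int i = 1; i < stamps.length; i++) {
            if (stamps[i] - stamps[i - 1] > intervalMillis + toleranceMillis) {
                late.add(i);
            }
        }
        return late;
    }

    /** Prints one line per heartbeat and flags any that arrive late by wall-clock time. */
    static void run(int count, long intervalMillis, long toleranceMillis) throws InterruptedException {
        long previous = System.currentTimeMillis();
        for (int i = 0; i < count; i++) {
            Thread.sleep(intervalMillis);
            long now = System.currentTimeMillis();
            long elapsed = now - previous;
            String flag = elapsed > intervalMillis + toleranceMillis ? "  <-- GAP" : "";
            System.out.printf("%s elapsed=%dms%s%n", Instant.ofEpochMilli(now), elapsed, flag);
            previous = now;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Usage: java ClockGapCheck.java [count] [intervalMillis]
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 3;
        long interval = args.length > 1 ? Long.parseLong(args[1]) : 200;
        run(count, interval, interval / 5);
    }
}
```

Redirect the output to a file and the flagged lines show when the machine "lost time"; `lateHeartbeats` does the same check after the fact on timestamps that were already recorded.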

So. Nothing (obviously) memory related here. If you suspect memory, you would need to check GC logs, see if you can correlate crushing GC activity with these anomalies between the server and the DB.

If it were me, as I said, I'd be running regular thread dumps, and log the GC. Use them all in a post mortem of the event. Even if there's a sudden onset, a 5m thread dump can catch a lot of things.
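
On the GC side, JDK 11 can write a GC log with a startup flag along the lines of `-Xlog:gc*:file=gc.log`. As a complement, here is a sketch that samples GC time from inside a JVM using the standard `java.lang.management` beans. It only sees the JVM it runs in, so for Payara it would have to run inside the server (or be adapted to read the same beans over JMX); `GcPressureSample` is a made-up name, and for concurrent collectors the reported time is not necessarily pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcPressureSample {
    /** Accumulated collection time over all collectors in this JVM, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Fraction of a wall-clock window spent collecting; 0 for an empty window. */
    static double gcFraction(long gcMillisDelta, long wallMillisDelta) {
        return wallMillisDelta <= 0 ? 0.0 : (double) gcMillisDelta / wallMillisDelta;
    }

    public static void main(String[] args) throws InterruptedException {
        long windowMillis = args.length > 0 ? Long.parseLong(args[0]) : 1_000;
        long gcBefore = totalGcMillis();
        long wallBefore = System.nanoTime();
        Thread.sleep(windowMillis);
        long gcDelta = totalGcMillis() - gcBefore;
        long wallDelta = (System.nanoTime() - wallBefore) / 1_000_000;
        System.out.printf("GC time in last %d ms: %d ms (%.1f%%)%n",
                wallDelta, gcDelta, 100 * gcFraction(gcDelta, wallDelta));
    }
}
```

If that percentage climbs toward 100 around the time of the DB timeouts, memory pressure becomes a much stronger suspect; if it stays low, the network or the VM is the better place to look.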

Regards,

Will Hartung

[2021-06-24T08:31:58.952-0700] [Payara 5.2020.5] [WARNING] [] [org.eclipse.persistence.session./file:/D:/payara/payara5/glassfish/domains/domain1/applications/WebCheckInOut/WEB-INF/classes/_WebCicoPU] [tid: _ThreadID=4694 _ThreadName=http-thread-pool::http-listener-2(28)] [timeMillis: 1624548718952] [levelValue: 900] [[
Local Exception Stack:

Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.7.7.payara-p2): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: java.sql.SQLException: I/O Error: Read timed out

at com.ijws.webcico.dao.impl.GenericDaoJpaImpl.findByNamedQuery(GenericDaoJpaImpl.java:170)
at com.ijws.webcico.dao.impl.StatsDaoJpaImpl.findStatsForNewTransaction(StatsDaoJpaImpl.java:85)
at com.ijws.webcico.service.impl.TransactionHistoryServiceImpl.loadOrCreateStatsEntity(TransactionHistoryServiceImpl.java:359)
at com.ijws.webcico.service.impl.TransactionHistoryServiceImpl.updateStats(TransactionHistoryServiceImpl.java:291)
at com.ijws.webcico.service.impl.TransactionHistoryServiceImpl.recordTransaction(TransactionHistoryServiceImpl.java:213)
at com.ijws.webcico.service.impl.WebCheckInOutServiceImpl.recordError(WebCheckInOutServiceImpl.java:2786)
at com.ijws.webcico.service.impl.WebCheckInOutServiceImpl.searchReservations(WebCheckInOutServiceImpl.java:495)
at com.ijws.webcico.web.ui.screens.processingcommands.SearchCommand.execute(SearchCommand.java:163)

Caused by: java.sql.SQLException: I/O Error: Read timed out

Caused by: java.net.SocketTimeoutException: Read timed out
at java.base/java.net.SocketInputStream.socketRead0(Native Method)
at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at java.base/java.io.DataInputStream.readFully(DataInputStream.java:200)
at java.base/java.io.DataInputStream.readFully(DataInputStream.java:170)
at net.sourceforge.jtds.jdbc.SharedSocket.readPacket(SharedSocket.java:850)

[2021-06-24T08:33:47.842-0700] [Payara 5.2020.5] [SEVERE] [] [] [tid: _ThreadID=4694 _ThreadName=http-thread-pool::http-listener-2(28)] [timeMillis: 1624548827842] [levelValue: 1000] [[
javax.persistence.PessimisticLockException: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.7.7.payara-p2): org.eclipse.persistence.exceptions.DatabaseException

Internal Exception: java.sql.SQLException: I/O Error: Read timed out

at com.ijws.webcico.dao.impl.GenericDaoJpaImpl.findByNamedQuery(GenericDaoJpaImpl.java:170)
at com.ijws.webcico.dao.impl.StatsDaoJpaImpl.findStatsForNewTransaction(StatsDaoJpaImpl.java:85)
at com.ijws.webcico.service.impl.TransactionHistoryServiceImpl.loadOrCreateStatsEntity(TransactionHistoryServiceImpl.java:359)
at com.ijws.webcico.service.impl.TransactionHistoryServiceImpl.updateStats(TransactionHistoryServiceImpl.java:291)
at com.ijws.webcico.service.impl.TransactionHistoryServiceImpl.recordTransaction(TransactionHistoryServiceImpl.java:213)
at com.ijws.webcico.service.impl.WebCheckInOutServiceImpl.recordError(WebCheckInOutServiceImpl.java:2786)
at com.ijws.webcico.service.impl.WebCheckInOutServiceImpl.searchReservations(WebCheckInOutServiceImpl.java:495)
at com.ijws.webcico.web.ui.screens.processingcommands.SearchCommand.execute(SearchCommand.java:163)
Caused by: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.7.7.payara-p2): org.eclipse.persistence.exceptions.DatabaseException

Internal Exception: java.sql.SQLException: I/O Error: Read timed out

Caused by: java.sql.SQLException: I/O Error: Read timed out
Caused by: java.net.SocketTimeoutException: Read timed out
at java.base/java.net.SocketInputStream.socketRead0(Native Method)
at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at java.base/java.io.DataInputStream.readFully(DataInputStream.java:200)
at java.base/java.io.DataInputStream.readFully(DataInputStream.java:170)
at net.sourceforge.jtds.jdbc.SharedSocket.readPacket(SharedSocket.java:850)

[2021-06-24T08:34:30.149-0700] [Payara 5.2020.5] [WARNING] [] [org.eclipse.persistence.session./file:/D:/payara/payara5/glassfish/domains/domain1/applications/WebCheckInOut/WEB-INF/classes/_WebCicoPU] [tid: _ThreadID=4694 _ThreadName=http-thread-pool::http-listener-2(28)] [timeMillis: 1624548870149] [levelValue: 900] [[
Local Exception Stack:

Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.7.7.payara-p2): org.eclipse.persistence.exceptions.DatabaseException

Internal Exception: java.sql.SQLException: java.lang.reflect.UndeclaredThrowableException

at com.ijws.webcico.dao.impl.GenericDaoJpaImpl.findByNamedQuery(GenericDaoJpaImpl.java:170)
at com.ijws.webcico.dao.impl.GenericDaoJpaImpl.findObjectInstanceByNamedQuery(GenericDaoJpaImpl.java:347)
at com.ijws.webcico.dao.impl.TransactionDaoJpaImpl.findTotalAttemptsForResByClassification(TransactionDaoJpaImpl.java:205)
at com.ijws.webcico.service.impl.TransactionHistoryServiceImpl.calcTotalRecordsByOutcome(TransactionHistoryServiceImpl.java:430)
at com.ijws.webcico.service.impl.TransactionHistoryServiceImpl.updateStats(TransactionHistoryServiceImpl.java:304)
at com.ijws.webcico.service.impl.TransactionHistoryServiceImpl.recordTransaction(TransactionHistoryServiceImpl.java:213)
at com.ijws.webcico.service.impl.WebCheckInOutServiceImpl.recordError(WebCheckInOutServiceImpl.java:2786)
at com.ijws.webcico.service.impl.WebCheckInOutServiceImpl.searchReservations(WebCheckInOutServiceImpl.java:495)
at com.ijws.webcico.web.ui.screens.processingcommands.SearchCommand.execute(SearchCommand.java:163)
Caused by: java.sql.SQLException: java.lang.reflect.UndeclaredThrowableException
Caused by: java.lang.reflect.UndeclaredThrowableException
Caused by: java.lang.reflect.InvocationTargetException
Caused by: java.sql.SQLException: Invalid state, the Connection object is closed.
at net.sourceforge.jtds.jdbc.JtdsConnection.checkOpen(JtdsConnection.java:1744)
at net.sourceforge.jtds.jdbc.JtdsConnection.prepareStatement(JtdsConnection.java:2486)
at net.sourceforge.jtds.jdbcx.proxy.ConnectionProxy.prepareStatement(ConnectionProxy.java:466)

[2021-06-24T08:34:31.556-0700] [Payara 5.2020.5] [SEVERE] [] [] [tid: _ThreadID=4694 _ThreadName=http-thread-pool::http-listener-2(28)] [timeMillis: 1624548871556] [levelValue: 1000] [[
javax.persistence.PersistenceException: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.7.7.payara-p2): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: java.sql.SQLException: java.lang.reflect.UndeclaredThrowableException

at org.eclipse.persistence.internal.jpa.QueryImpl.getDetailedException(QueryImpl.java:391)