Deadlock during TLS connection creation

95 views
Skip to first unread message

Chris Bailey

unread,
Sep 27, 2023, 12:04:39 PM9/27/23
to Netty discussions

Hi,

              We’re leveraging netty (v4.1.92) via Vertx (v4.3.8) configured to use OpenJdkSSLEngine with a deletegatedTaskExecutor() when creating the ssl_handler for our webserver.  In general, we see SSL operations (e.g., certificate validation) occurring on a worker thread, but we have run into a deadlock due to our current implementation and wanted some feedback on options.

 

The crux of the issue is that during our certificate validation via our custom TrustManager, we need to communicate with another internal component (which is running on a typical event loop thread) to complete the validation processing, but other SSL operations block the event loop on which that internal service is running by grabbing a lock.  Its my understanding that not all blocking SSL operations are offloaded, so there are likely more cases like these floating around.

 

What we are specifically seeing is that while we’re waiting on a response from the internal component to validate the certificate, a simultaneous request to close a TLS connection comes in.  The closing of the session is running on the event loop and blocks by waiting on a lock, but unfortunately, that lock is held by the worker thread waiting on a response from the internal component.  IOW, the internal service cannot respond to the worker thread because it its event loop is blocked by the close connection processing.  At this point, our entire webserver is dead in the water until we restart the JVM.

 

Sample traces are listed below from this example:

  • Close connection processing is running on event loop thread 0 and is blocked trying to grab a lock that is held by a worker thread
  • Cert validation in the TrustManager is running on a worker thread and is awaiting a response from the blocked service running on event loop thread 0
  • Internal service to complete the cert validation is affined to event loop thread 0, but its blocked because of the close connection processing

 

We’ve looked at several options:

  • Detect the deadlock and fail the client authentication.  The issue here is bad user experience so we’d like to avoid it.
  • Segregate the our TLS handling in the webserver and the blocking SSL handling into two DIFFERENT worker threads so that that we never block an event loop.  This is a risker change for us, though.
  • Remove the need to communicate with the internal service running on an event loop thread.

 

Any thoughts on other/better ways to handle this?

 

Thanks in advance,

Chris

 

# close connection handler running on event loop thread 0 is blocked

Sep 15 10:13:43.470794 APM00203802612-A control-path[37578]: "vert.x-eventloop-thread-0" #25 prio=5 os_prio=0 cpu=595313.19ms elapsed=28293.78s tid=0x00007f5bb8021800 nid=0x22c waiting for monitor entry  [0x00007f5c0034d000]

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:    java.lang.Thread.State: BLOCKED (on object monitor)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at sun.security.ssl.SSLEngineImpl.closeOutbound(java...@11.0.19/SSLEngineImpl.java:749)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         - waiting to lock <0x00000000dca1d0f8> (a sun.security.ssl.SSLEngineImpl)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.handler.ssl.SslHandler.setHandshakeFailure(SslHandler.java:1874)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1078)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at java.lang.Thread.run(java...@11.0.19/Thread.java:829)

 

 

# Trust manager waiting on a response from our internal component which is running on event loop thread 0.  Note that the lock that the close connection handler is waiting on is held by this thread.

Sep 15 10:13:43.479257 APM00203802612-A control-path[37578]: "vert.x-internal-blocking-1" #42 prio=5 os_prio=0 cpu=23498.06ms elapsed=28290.74s tid=0x00007f5bb02b1000 nid=0x23b waiting on condition  [0x00007f5bc4e59000]Sep 15 10:13:43.479257 APM00203802612-A control-path[37578]:    java.lang.Thread.State: WAITING (parking)Sep 15 10:13:43.479257 APM00203802612-A control-path[37578]:         at jdk.internal.misc.Unsafe.park(java...@11.0.19/Native Method)Sep 15 10:13:43.479257 APM00203802612-A control-path[37578]:         - parking to wait for  <0x00000000dca24ec0> (a java.util.concurrent.CountDownLatch$Sync)Sep 15 10:13:43.479257 APM00203802612-A control-path[37578]:         at java.util.concurrent.locks.LockSupport.park(java...@11.0.19/LockSupport.java:194)Sep 15 10:13:43.479257 APM00203802612-A control-path[37578]:         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java...@11.0.19/AbstractQueuedSynchronizer.java:885)Sep 15 10:13:43.479257 APM00203802612-A control-path[37578]:         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java...@11.0.19/AbstractQueuedSynchronizer.java:1039)Sep 15 10:13:43.479257 APM00203802612-A control-path[37578]:         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java...@11.0.19/AbstractQueuedSynchronizer.java:1345)Sep 15 10:13:43.479257 APM00203802612-A control-path[37578]:         at java.util.concurrent.CountDownLatch.await(java...@11.0.19/CountDownLatch.java:232)Sep 15 10:13:43.479257 APM00203802612-A control-path[37578]:         at io.reactivex.rxjava3.internal.observers.BlockingMultiObserver.blockingGet(BlockingMultiObserver.java:86)Sep 15 10:13:43.479257 APM00203802612-A control-path[37578]:         at io.reactivex.rxjava3.core.Single.blockingGet(Single.java:3645)Sep 15 10:13:43.479257 APM00203802612-A control-path[37578]:         at com.emc.cyclone.contexts.security.credstore.CycTrustOptions$CycTrustManager.validateCertificateWithCmfa(CycTrustOptions.java:442)

 

James Sager

unread,
Oct 3, 2023, 12:21:01 PM10/3/23
to ne...@googlegroups.com, uncba...@gmail.com
Hello,

I too have my java server completely crashing when using SSL HTTPS when someone sends a malformed http request...  It's a Java problem, not a netty one...  I forget the exact problem, but you can harden your machine by using the lowest TLS like 1.3 or something... I've been able to reduce the number of malformed requests that shut down my java program, but I'm not sure I'm 100% done with em... You gotta play with your Linux root and all sorts of things... Malformed HTTPS requests should be handled by exception properly, but often are not and crash the system.  It's ugly and a known issue... If this is the issue you have.

,Jim

--
You received this message because you are subscribed to the Google Groups "Netty discussions" group.
To unsubscribe from this group and stop receiving emails from it, send an email to netty+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/netty/8f31febf-5aaa-47b5-8a37-b71d63edf6acn%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages