jnats publish hangs forever


Whitney Jackson

Jan 30, 2020, 4:24:24 PM
to nats
Hi,

I'm using jnats 2.6.6. I have a Java application with a bunch of publisher threads. A NATS connection is created at application start time and then shared across all the publisher threads.

One instance of my application stopped working after several days. I took a thread dump and saw all my publisher threads stuck like this:

"my-thread-0" #67 prio=5 os_prio=0 cpu=2522237.37ms elapsed=257081.32s tid=0x00007f1f20008000 nid=0x5794 waiting on condition  [0x00007f1fa86b3000]

   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
        - parking to wait for  <0x00000007449c4a10> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.6/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.6/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.6/AbstractQueuedSynchronizer.java:917)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.6/AbstractQueuedSynchronizer.java:1240)
        at java.util.concurrent.locks.ReentrantLock.lock(java.base@11.0.6/ReentrantLock.java:267)
        at io.nats.client.impl.MessageQueue.push(MessageQueue.java:104)
        at io.nats.client.impl.NatsConnectionWriter.queue(NatsConnectionWriter.java:186)
        at io.nats.client.impl.NatsConnection.queueOutgoing(NatsConnection.java:1298)
        at io.nats.client.impl.NatsConnection.publish(NatsConnection.java:780)
        at io.nats.client.impl.NatsConnection.publish(NatsConnection.java:747)
        at myorg.NatsUpdatePublisher.publish(NatsUpdatePublisher.java:21)
        ...

Furthermore, the dump did not contain the nats:1 and nats:2 threads I expected to see. For reference, this is what they look like in a healthy instance:

"nats:1" #28 prio=5 os_prio=0 cpu=1.62ms elapsed=19.30s tid=0x00007f93e9cc3000 nid=0x4295 runnable  [0x00007f935e6f6000]

   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(java.base@11.0.2/Native Method)
        at java.net.SocketInputStream.socketRead(java.base@11.0.2/SocketInputStream.java:115)
        at java.net.SocketInputStream.read(java.base@11.0.2/SocketInputStream.java:168)
        at java.net.SocketInputStream.read(java.base@11.0.2/SocketInputStream.java:140)
        at io.nats.client.impl.SocketDataPort.read(SocketDataPort.java:101)
        at io.nats.client.impl.NatsConnectionReader.run(NatsConnectionReader.java:111)
        at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.2/Executors.java:515)
        at java.util.concurrent.FutureTask.run(java.base@11.0.2/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.2/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.2/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.2/Thread.java:834)

"nats:2" #29 prio=5 os_prio=0 cpu=1.21ms elapsed=19.30s tid=0x00007f93e9caa800 nid=0x4296 waiting on condition  [0x00007f935e5f5000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.2/Native Method)
        - parking to wait for  <0x000000074f29c690> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.2/LockSupport.java:234)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.2/AbstractQueuedSynchronizer.java:2123)
        at java.util.concurrent.LinkedBlockingQueue.poll(java.base@11.0.2/LinkedBlockingQueue.java:458)
        at io.nats.client.impl.MessageQueue.poll(MessageQueue.java:148)
        at io.nats.client.impl.MessageQueue.accumulate(MessageQueue.java:205)
        at io.nats.client.impl.NatsConnectionWriter.run(NatsConnectionWriter.java:118)
        at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.2/Executors.java:515)
        at java.util.concurrent.FutureTask.run(java.base@11.0.2/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.2/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.2/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.2/Thread.java:834)

Did my connection break in some way I should be defending against? What might be happening here? 

Thanks,

Whitney

Ivan Kozlovic

Jan 30, 2020, 5:39:08 PM
to nats
Whitney,

You are right: nats:1 and nats:2 are the connection reader and writer threads, and if they are missing, the outbound queue does not get emptied; after a while, I am guessing, the publishers block on the full outbound queue.
Could you (if not already the case) set up some connection and error handlers to see if you are getting an error?
You basically need to set a `connectionListener` and an `errorListener` to see if there were any events. Normally, if the client is disconnected, the library should reconnect. But, either based on configuration or once the maximum number of reconnect attempts is reached, the connection will be marked as closed. You should receive such an event in the connectionListener callback.
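A minimal sketch of wiring up both listeners, assuming the jnats 2.x `Options.Builder` API (the server URL and the `System.out` logging are placeholders for your own configuration and logger):

```java
import io.nats.client.Connection;
import io.nats.client.Consumer;
import io.nats.client.ErrorListener;
import io.nats.client.Nats;
import io.nats.client.Options;

public class MonitoredConnection {
    public static Connection connect() throws Exception {
        Options options = new Options.Builder()
                .server("nats://localhost:4222") // placeholder; use your server URL
                // Log every lifecycle event: DISCONNECTED, RECONNECTED, CLOSED, ...
                .connectionListener((conn, type) ->
                        System.out.println("Connection event - " + type))
                .errorListener(new ErrorListener() {
                    @Override
                    public void errorOccurred(Connection conn, String error) {
                        System.out.println("Server error: " + error);
                    }
                    @Override
                    public void exceptionOccurred(Connection conn, Exception exp) {
                        System.out.println("Connection exception: " + exp);
                    }
                    @Override
                    public void slowConsumerDetected(Connection conn, Consumer consumer) {
                        System.out.println("Slow consumer detected");
                    }
                })
                .build();
        return Nats.connect(options);
    }
}
```

A CLOSED event in particular would mean the library has given up reconnecting, at which point the hung publishers would be expected.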

Ivan. 

Whitney Jackson

Jan 30, 2020, 6:30:44 PM
to nat...@googlegroups.com
Thanks for the suggestion. I'll set that up and see what events I'm able to log. It may be a few days until the problem reoccurs.


Whitney Jackson

Feb 5, 2020, 12:10:43 PM
to nats
Ok. My issue happened again with the logging changes in place:

[2020-02-04T12:16:08,230Z {pool-6-thread-1} WARN  myorg.myapp.UpdatePublisherModule.NatsErrorListener:] The connection handled an exception: Read channel closed.
[2020-02-04T12:16:08,231Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: connection disconnected
[2020-02-04T12:16:08,232Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: discovered servers
[2020-02-04T12:16:08,232Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: connection reconnected
[2020-02-04T12:16:08,232Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: subscriptions re-established
[2020-02-04T12:16:08,238Z {pool-6-thread-1} WARN  myorg.myapp.UpdatePublisherModule.NatsErrorListener:] The connection handled an exception: Read channel closed.
[2020-02-04T12:16:08,238Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: connection disconnected
[2020-02-04T12:16:08,239Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: discovered servers
[2020-02-04T12:16:08,240Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: connection reconnected
[2020-02-04T12:16:08,240Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: subscriptions re-established
[2020-02-04T12:16:08,243Z {pool-6-thread-1} WARN  myorg.myapp.UpdatePublisherModule.NatsErrorListener:] The connection handled an exception: Read channel closed.
...
[the same pattern repeats every few milliseconds]
...
[2020-02-04T12:16:08,450Z {pool-6-thread-1} WARN  myorg.myapp.UpdatePublisherModule.NatsErrorListener:] The connection handled an exception: Read channel closed.
[2020-02-04T12:16:08,450Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: connection disconnected
[2020-02-04T12:16:08,451Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: discovered servers
[2020-02-04T12:16:08,451Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: connection reconnected
[2020-02-04T12:16:08,451Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: subscriptions re-established
[2020-02-04T12:16:08,457Z {pool-6-thread-1} WARN  myorg.myapp.UpdatePublisherModule.NatsErrorListener:] The connection handled an exception: Read channel closed.
[2020-02-04T12:16:08,457Z {pool-6-thread-1} WARN  myorg.myapp.UpdatePublisherModule.NatsErrorListener:] The connection handled an exception: Broken pipe (Write failed)
[2020-02-04T12:16:08,457Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: connection disconnected
[2020-02-04T12:16:08,457Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: discovered servers
[2020-02-04T12:16:08,458Z {pool-6-thread-1} INFO  myorg.myapp.UpdatePublisherModule.NatsConnectionListener:] Connection event - nats: connection disconnected
[2020-02-04T12:16:08,458Z {pool-6-thread-1} WARN  myorg.myapp.UpdatePublisherModule.NatsErrorListener:] The connection handled an exception: null

Unfortunately I don't have a stack trace for that last null exception. I'll make further changes such that I get that next time.

I wonder why my connection terminates just after being opened over and over again? Could it be that a retry buffer contains a message that breaks the connection or something?

Whitney

Ivan Kozlovic

Feb 5, 2020, 12:16:35 PM
to nats
Whitney,


On Wednesday, February 5, 2020 at 10:10:43 AM UTC-7, Whitney Jackson wrote:

> Unfortunately I don't have a stack trace for that last null exception. I'll make further changes such that I get that next time.

Ok, thanks.
 
> I wonder why my connection terminates just after being opened over and over again? Could it be that a retry buffer contains a message that breaks the connection or something?

Could very well be the case. If so, you should see a mention of a bad protocol in the NATS Server log. For instance, if the server receives an invalid protocol token (say XYZ), it would log an error like this:

[20532] 2020/02/05 10:15:09.450191 [ERR] [::1]:59181 - cid:1 - Client parser ERROR, state=0, i=0: proto='"XYZ\r"...' 

Let us know if you see such an error.

Ivan.

Whitney Jackson

Feb 5, 2020, 12:47:13 PM
to nats
> Let us know if you see such an error.

Nope. I checked and didn't see any invalid protocol messages in the nats-server logs.

I should mention that I am running nats-server with these two options because my application occasionally publishes large messages:

max_payload: 67108864    # 64 MiB
max_pending: 134217728   # 128 MiB

If need be I can break up the larger messages and run with the default configs. It's less work in the application code with this setup though.

I've not yet made any changes to reconnectBufferSize, but perhaps I need to, given the max_payload and max_pending settings I'm using?


Ivan Kozlovic

Feb 5, 2020, 12:53:01 PM
to nats


On Wednesday, February 5, 2020 at 10:47:13 AM UTC-7, Whitney Jackson wrote:

> I've not yet made any changes to reconnectBufferSize but perhaps I need to given the max_payload and max_pending settings I'm using?


The reconnectBufferSize setting controls how large the buffer holding published messages can grow while the client is disconnected. If your application is not getting any errors on publish, then it should not be an issue. So yes, if you plan to send big messages and don't want your application to fail if it publishes while disconnected, you should make sure your reconnectBufferSize is big enough, but I don't think that is the issue you are experiencing at the moment.
The repeated disconnect/reconnect is a concern given that you say you don't see any protocol error on the server. If you run the server with `-D` (just debug, not `-DV`, which includes trace), we may get more information about these failed reconnect attempts.
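For illustration, sizing the reconnect buffer with the jnats `Options.Builder` might look like the sketch below; the URL and buffer size are placeholders, not recommendations, and should be tuned against your actual publish rate and message sizes:

```java
import io.nats.client.Options;

public class ReconnectBufferConfig {
    public static Options build() {
        return new Options.Builder()
                .server("nats://localhost:4222") // placeholder; use your server URL
                // With max_payload at 64 MiB, a small reconnect buffer could not
                // hold even one large message published during a reconnect.
                // 256 MiB here is an illustrative guess, not a recommendation.
                .reconnectBufferSize(256L * 1024 * 1024)
                .build();
    }
}
```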

Whitney Jackson

Feb 5, 2020, 2:08:12 PM
to nats
> If you run the server with `-D` (just debug, not `-DV`, which includes trace), we may get more information about these failed reconnect attempts.

Ok. I'll give that a try.