Wrong handling of SocketTimeoutException in Frame.java

181 views
Skip to first unread message

Ondřej Kunc

unread,
Jan 10, 2022, 12:36:05 PM1/10/22
to rabbitmq-users
Hi all,

I think I have discovered a bug in the Java amqp-client.
This happens when we block for more than requestedHeartbeat / 4 while reading message payload.

VERSIONS:
Java amqp-client-5.8.0 (is also present in the latest)

Context:
We receive the following exception multiple times a day:
ForgivingExceptionHandler - An unexpected connection driver error occured
java.net.SocketTimeoutException: Read timed out
    at java.base/java.net.SocketInputStream.socketRead0(Native Method)
    at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:478)
    at java.base/sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:472)
    at java.base/sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70)
    at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1374)
    at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:985)
    at java.base/java.io.BufferedInputStream.read1(BufferedInputStream.java:290)
    at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:351)
    at java.base/java.io.DataInputStream.readFully(DataInputStream.java:200)
    at java.base/java.io.DataInputStream.readFully(DataInputStream.java:170)
    at com.rabbitmq.client.impl.Frame.readFrom(Frame.java:113)
    at com.rabbitmq.client.impl.SocketFrameHandler.readFrame(SocketFrameHandler.java:184)
    at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:645)
    at java.base/java.lang.Thread.run(Thread.java:829)

I think the problem is on this line: https://github.com/rabbitmq/rabbitmq-java-client/blob/32dfd0fb338a3ff1db2c79dfeb42c9fd451284a0/src/main/java/com/rabbitmq/client/impl/Frame.java#L112

The problem is there is no check for the SocketTimeoutException (as on the line 91), so instead of returning null the SocketTimeoutException exception will be thrown.
As a result in AMQPConnection.java it will jump to the handleFailure method which will shutdown the client.
In case we would return null and not throw exception there would be a logic that would correctly check for the missed heartbeats and wait two heartbeats intervals before it will terminate.

Example scenario:
1) User sets requestedHeartbeat to 16s.
2) This will result in setting _socket.setSoTimeout(4000) in SocketFrameHandler.java
3) Reading payload gets blocked for more than 4s (for example connection issue)
4) The connection is terminated (even if we haven't missed any heartbeat yet)

Possible Solution
I think possible solution is to catch SocketTimeoutException in Frame.java in readFrom method whenever we access DataInputStream variable and return null in that case.
But I am not sure if this is the right approach, becasuse we may lose the data we have already read in the Frame.readFom method.

Can anyone please confirm this issue?

Regards,
Ondrej

Arnaud Cogoluègnes

unread,
Jan 11, 2022, 3:24:22 AM1/11/22
to rabbitmq-users
Thanks for the detailed analysis. I don't see a good way to recover from this exception: the socket input stream is an unknown state, likely inconsistent.

I have several suggestions:
 * Increase the heartbeat value. You may find a sweet spot where the value is low enough and this exception does not show up anymore.
 * Investigate the network (even if most of the time we have to deal with the network we get).
 * Check if some messages are unreasonably large and tend to trigger this exception. There could be something to do at the application level (e.g. compression).
 * Let the automatic connection recovery kick in and recover the connection. I'd say it should kick in this case, but I may be wrong, so you can check out the ConnectionFactory#setConnectionRecoveryTriggeringCondition(Predicate<ShutdownSignalException>) method and set up an appropriate condition.

Ondřej Kunc

unread,
Jan 18, 2022, 7:29:02 AM1/18/22
to rabbitmq-users
Thanks for the suggestions! 
We have already raised the heartbeat value. Unfortunately, we have no control over the incoming messages and for some other reasons we have disabled the automatic recovery.

About recovering from the exception: I don't think DataInputStream is in unknown state - same exception can be thrown on the line 90 https://github.com/rabbitmq/rabbitmq-java-client/blob/32dfd0fb338a3ff1db2c79dfeb42c9fd451284a0/src/main/java/com/rabbitmq/client/impl/Frame.java#L90 and we can keep reading from the same DataInputStream after this exception was handled and Frame.readFrom method is called again. 

I think the only problem is when the exception is thrown on the lines 109 - 114. Because if we would use the same try-catch as earlier, we would also need to cache all the variables we have already read from DataInputStream so we can use them in the next call of Frame.readFrom.

Regards,
Ondrej

Arnaud Cogoluègnes

unread,
Jan 18, 2022, 10:56:28 AM1/18/22
to rabbitmq-users
I'm more worried about the line that reads the whole content in a byte array [1]. If we get the exception there, I don't know if we can recover in a reliable way, we may have read only one part of the content, it's hard to say if we can resume the reading in a reliable, hence my remark on the consistent state.

Reply all
Reply to author
Forward
0 new messages