Handling blocked connections in Java client library

tfasz

Mar 4, 2015, 8:16:51 PM
to rabbitm...@googlegroups.com
I'm hoping someone might be able to suggest an approach for dealing with a RabbitMQ server that is blocking connections from the Java client library due to a memory or disk alarm.

You can easily reproduce this issue by setting your server to have 0 memory available (I tested with server version 3.2.1).

[root@localhost ~]# echo "[{rabbit, [{vm_memory_high_watermark, 0.0}]}]." > /etc/rabbitmq/rabbitmq.config 
[root@localhost ~]# service rabbitmq-server restart

Then when you try to publish a message to the server, the client hangs indefinitely in the following code:
 
// Open your connection the usual way
 
Channel channel = null;
try {
    channel = connection.createChannel();
    channel.exchangeDeclare(EXCHANGE_NAME, "direct");
    channel.queueDeclare(QUEUE_NAME, false, false, false, null);
    channel.queueBind(QUEUE_NAME, EXCHANGE_NAME, "*");
    channel.basicPublish(EXCHANGE_NAME, "TEST", null, "Hello World!".getBytes());
    System.out.println("Sent message.");
} catch (IOException ex) {
    System.out.println("IOException on send. " + ex.toString());
} finally {
    if (channel != null && channel.isOpen()) {
        try {
            System.out.println("Closing channel.");
            channel.close();
            System.out.println("Channel closed.");
        } catch (IOException ex) {
            System.out.println("IOException on channel close. " + ex.getMessage());
        }
    }
}

I have tested with both our current version (3.2.1) and the most recent client library (3.4.4) with the same results.

I would expect the channel.close() method to have a timeout so the client thread does not hang forever. Or maybe the client should throw an exception on the channel.basicPublish() or channel.close() call if the connection is blocked. 

Any suggestions for how we can avoid this issue? Once you get into this state there is no way to recover from it without restarting the RabbitMQ server or the JVM running the client.

Appreciate any insight.
Todd

Michael Klishin

Mar 5, 2015, 4:19:36 AM
to rabbitm...@googlegroups.com, tfasz
  On 5 March 2015 at 04:16:54, tfasz (tod...@gmail.com) wrote:
> I would expect the channel.close() method to have a timeout
> so the client thread does not hang forever. Or maybe the client
> should throw an exception on the channel.basicPublish() or
> channel.close() call if the connection is blocked.

Not everybody agrees with that. Having a timeout is OK, but then you
have another problem: the server does not read anything coming in until
the alarm clears, so it won't know the channel was closed.

So there is no obvious solution, I'm afraid.

> Any suggestions for how we can avoid this issue? Once you get into
> this state there is no way to recover from it without restarting
> the RabbitMQ server or the JVM running the client.

You should be able to use Channel#abort (like close but ignores all exceptions) and interrupt the thread that
is blocked waiting for channel.close-ok to arrive.
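
A rough, untested sketch of that idea (assuming you keep a reference to the thread that got stuck; the class and names below are just placeholders):

import java.io.IOException;

import com.rabbitmq.client.Channel;

public class ChannelWatchdog {
    // Call this from a separate monitoring thread once a publish or
    // channel close has been stuck for too long.
    public static void forceAbort(Channel channel, Thread stuckThread) {
        try {
            // abort() behaves like close() but ignores exceptions, so it is
            // safe to call on a channel in an unknown state.
            channel.abort();
        } catch (IOException ignored) {
            // nothing useful to do here
        } finally {
            // Interrupt the thread that is blocked waiting for
            // channel.close-ok (or blocked in a socket write).
            stuckThread.interrupt();
        }
    }
}
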
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

tfasz

Mar 5, 2015, 6:16:32 PM
to rabbitm...@googlegroups.com, tod...@gmail.com
Michael - Thanks very much for the prompt reply.

First let me say that RabbitMQ has been a great piece of infrastructure for us. We have been using it for 5+ years and it has been rock solid with very few issues - this is the first big issue we have ever run into with it. 

Couple of questions:
- Is there a way to know the connection or channel is blocked before publishing? I tried adding a BlockedListener on the connection, but it only seems to notice the blocked connection after an attempt to publish, so I always end up with at least one blocked thread.
- I tried to implement your suggestion of using channel.abort() and interrupting the blocked thread, but it appeared to exhibit the same behavior. Do you have an example I could look at to see if I am doing something wrong?

I really think you should reconsider the current design that can block a calling thread forever. This is a bad failure scenario waiting to happen. In our case our RabbitMQ server got into a bad state, filled memory, and cascaded to our 12 app servers, which all eventually hung trying to publish messages to a queue. This caused a 40-minute outage for our SaaS app, mostly because it took a while to identify the cause since RabbitMQ so rarely has issues. Our application handles RabbitMQ crashes and network partitions fine (with some lost messages), but this was a failure scenario we missed in testing.

Couple of alternatives:
- Allow someone to configure an optional timeout on channel.close(). If a timeout occurs we could then take additional action such as trigger a connection.close(). (Rough sketch of what I mean after this list.)
- Allow the client to be configured so it will auto-disconnect the whole connection if in a blocked state for longer than some duration. Raise exceptions on channels in a blocked state.
- Have the client library internally buffer messages in a queue and publish on a different thread (this is how the spymemcached client library works). Obviously we can implement this too but it seems like a client library should implement this if it is always required.
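
To make the first alternative concrete, this is roughly the kind of wrapper we would otherwise have to write on our side (untested sketch; the executor and timeout are placeholders):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.rabbitmq.client.Channel;

public class TimedClose {
    private static final ExecutorService CLOSER = Executors.newCachedThreadPool();

    // Returns true if the channel closed within the timeout, false otherwise.
    // Note the worker thread running close() may stay blocked; this only
    // frees the calling thread so it can abort the connection, alert, etc.
    public static boolean closeWithTimeout(final Channel channel, long timeoutMs) {
        Future<?> result = CLOSER.submit(new Callable<Void>() {
            public Void call() throws Exception {
                channel.close();
                return null;
            }
        });
        try {
            result.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // interrupts the worker; may or may not unblock it
            return false;
        } catch (Exception e) {
            return false;
        }
    }
}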

If you continue to believe blocking a calling thread forever in your client library is OK, your documentation should clearly state this failure scenario so developers know to test for it. We use a lot of networked services in our stack (MySQL, Memcached, Redis, ElasticSearch, MongoDB, etc.) and I do not believe any of them has a scenario that can cause a thread to be blocked forever calling into their library - they all allow timeouts to be configured.

Appreciate the assistance and consideration.
Todd


Michael Klishin

Mar 5, 2015, 6:30:06 PM
to rabbitm...@googlegroups.com, tfasz
On 6 March 2015 at 02:16:34, tfasz (tod...@gmail.com) wrote:
> Couple of questions:
> - Is there a way to know the connection or channel is blocked before
> publish? I tried adding a BlockedListener on the connection,
> but it only seems to notice the blocked connection after an attempt
> to publish so I always end up with at least one blocked thread.

Connections are only blocked when they attempt to publish (when RabbitMQ
sees a basic.publish frame, a content header frame, or a content body frame
on them). We do not want to block consumers, who drain queues and thus
relieve RAM and disk space pressure.

This is why you only see the connection blocked listener fire when you attempt
to publish: the connection is not blocked prior to that.
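
For reference, registering the listener looks roughly like this (untested sketch; the flag is a placeholder for whatever your application checks before publishing):

import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

import com.rabbitmq.client.BlockedListener;
import com.rabbitmq.client.Connection;

public class BlockedFlag {
    // Flag the rest of the application can check before trying to publish.
    public static final AtomicBoolean publishingAllowed = new AtomicBoolean(true);

    public static void register(Connection connection) {
        connection.addBlockedListener(new BlockedListener() {
            public void handleBlocked(String reason) throws IOException {
                // Fired when the broker blocks this connection -- as noted
                // above, only once the connection attempts to publish.
                publishingAllowed.set(false);
            }

            public void handleUnblocked() throws IOException {
                publishingAllowed.set(true);
            }
        });
    }
}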
 
> - I tried to implement your suggestion of using channel.abort()
> and interrupting the blocked thread but it looked to exhibit
> the same behavior. Do you have an example I could look at to see
> if I am doing something wrong?

So if you try to interrupt the thread that's blocked in a socket write, an InterruptedException
is not thrown? Things may be better with NIO, to which we plan to migrate after 3.5.0, but
I cannot guarantee that without trying.

> I really think you should reconsider the current design that
> can block a calling thread forever. This is a bad failure scenario
> waiting to happen.

This is not intentional design. We *do not* block caller threads in
the library. Socket#write does, because the local TCP buffer fills up when RabbitMQ
stops reading from the socket (which is how blocking is implemented).

As I've mentioned before, there is no good
solution, since RabbitMQ stops reading from the socket for blocked connections.
It therefore won't notice that the channel was closed if we just "abandon" a Channel instance
and find a way to unblock, which leads to inconsistent state on both the server and the client.

> Couple of alternatives:
> - Allow someone to configure an optional timeout on channel.close().
> If a timeout occurs we could then take additional action such
> as trigger a connection.close().

RabbitMQ does not read from sockets that belong to blocked connections.

A connection.close frame therefore won't be seen.

> - Allow the client to be configured so it will auto-disconnect
> the whole connection if in a blocked state for longer than some
> duration. Raise exceptions on channels in a blocked state.

See above. You can only close the TCP connection (which may be sufficient).

> - Have the client library internally buffer messages in a queue
> and publish on a different thread (this is how the spymemcached
> client library works).

Again, what would that solve if the peer does not read from the socket?
TCP buffers would fill up quickly and any write(2) call would block.

> Obviously we can implement this too but
> it seems like a client library should implement this if it is always
> required.

I'm afraid it won't be easy to implement this, for you or for us, although
switching to NIO may provide a solution. I hope now it's a bit clearer what the hard parts are.

tfasz

Mar 6, 2015, 12:10:06 PM
to rabbitm...@googlegroups.com, tod...@gmail.com

Let me explain our use case a bit more. We have a SAAS web application that commits information to a database and then sends MQ messages to other servers to perform async processing. Right now if RabbitMQ runs out of memory the threads processing web requests all hang while trying to send MQ messages resulting in a full outage of our application. The RabbitMQ server is a single point of failure. What I am hoping to accomplish is to have that failure scenario result in the application running in a degraded state where we can still respond to requests and read/write to the database but not send MQ messages until the RabbitMQ server issue is resolved. We have designed the app to assume messages can be lost and already have a process in place for recovering from it.

The simplest change I can see is to implement the BlockedListener. When we become blocked we will set a flag and stop trying to send messages (and reverse on unblocked). This will still leave one or more request threads in a blocked state, but should keep the whole app from going offline. A more involved change would be to have the request threads submit the messages to a Java queue and have a dedicated thread on each server pull from this queue and call into the RabbitMQ client library. This would keep the threads serving HTTP requests from being blocked. We could alert and stop adding messages to the Java queue if it exceeded a certain size. 
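
Roughly what I have in mind for the more involved change (untested sketch; the class, queue size, and field names are just placeholders):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import com.rabbitmq.client.Channel;

// Request threads enqueue outgoing messages into a bounded in-JVM queue and
// a single dedicated thread drains it and calls into the RabbitMQ client.
// If the broker blocks the connection, only this thread hangs; request
// threads just see a full queue (offer() returns false) and can drop the
// message and alert instead of blocking.
public class AsyncPublisher implements Runnable {
    private final BlockingQueue<byte[]> pending = new LinkedBlockingQueue<byte[]>(10000);
    private final Channel channel;
    private final String exchange;
    private final String routingKey;

    public AsyncPublisher(Channel channel, String exchange, String routingKey) {
        this.channel = channel;
        this.exchange = exchange;
        this.routingKey = routingKey;
    }

    // Called from request threads; never blocks.
    public boolean submit(byte[] body) {
        return pending.offer(body);
    }

    // Runs on the single dedicated publisher thread.
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                byte[] body = pending.take();
                channel.basicPublish(exchange, routingKey, null, body);
            }
        } catch (Exception e) {
            // Connection/channel problems end up here; log and let a
            // supervisor restart the publisher with a fresh connection.
        }
    }
}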

If this issue had been clearly documented in the Javadoc we would have implemented it via a Java queue from the beginning. You really need to explain this failure scenario in the docs. Java developers will typically not assume that calls into another library have known scenarios that cause them to hang indefinitely without a timeout.

I'm not trying to say it is an easy problem to solve - just that it needs to be explained better.

Thanks again for taking the time to explain the issue.

Todd

Michael Klishin

Mar 6, 2015, 12:43:50 PM
to rabbitm...@googlegroups.com, tfasz
 On 6 March 2015 at 20:10:09, tfasz (tod...@gmail.com) wrote:
> If this issue had been clearly documented in the Java Docs we
> would have implemented it via a Java queue from the beginning.
> You really need to explain this failure scenario in the docs.
> Java developers will typically not assume calls to another library
> have known scenarios that cause them to hang indefinitely without
> a timeout.
>
> I'm not trying to say it is an easy problem to solve - just that it
> needs to be explained better.

Todd,

We will, shortly after the 3.5.0 release next week:
https://github.com/rabbitmq/rabbitmq-java-client/issues/30

I'm also hopeful that
NIO will give us some options to really solve this problem.

Thank you!