Bunny 2.7.0 timeout errors in a multithreaded Rails app


Arindam Mukherjee

Jun 21, 2017, 1:12:31 PM
to Ruby RabbitMQ libraries
I see several errors from Bunny 2.7.0 in a multi-threaded Rails app in which background threads connect to RabbitMQ to exchange messages. 

1. The threads can be from a thread pool (typically from concurrent-ruby).
2. Usually a single connection is shared and each task (running on a thread-pool thread) makes its own channel to talk to RabbitMQ.
3. Channels are closed in an exception-safe way using with_channel (see the sketch after this list).
4. Most of the queues are ephemeral, RPC-over-AMQP-style queues. All exchanges are direct exchanges. There are also standard queues bound to these exchanges.
5. The number of queues never exceeds 20, the number of concurrent threads connected to RabbitMQ never exceeds 20, and the number of messages on a queue rarely exceeds 10.
6. The Bunny sessions use a :heartbeat => :server option.
7. The RabbitMQ server runs with hipe_compile set to true. SSL is not used.
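Roughly, the setup looks like the sketch below (names, exchange/queue parameters and the message flow are illustrative, not the actual application code):

require "bunny"
require "concurrent"

conn = Bunny.new(hostname: "rmq", heartbeat: :server)   # single shared connection
conn.start

pool = Concurrent::FixedThreadPool.new(10)              # concurrent-ruby thread pool

20.times do |i|
  pool.post do
    # Each task opens its own short-lived channel; with_channel closes it
    # even if the block raises.
    conn.with_channel do |ch|
      x       = ch.direct("app.rpc")
      reply_q = ch.queue("", exclusive: true, auto_delete: true)   # ephemeral reply queue
      x.publish("request #{i}", routing_key: "worker", reply_to: reply_q.name)
      # ... wait for a reply on reply_q, handle timeouts, etc. ...
    end
  end
end

pool.shutdown
pool.wait_for_termination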

The following are the error stacks:

1. In Queue#subscribe:

App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:39:in `block in poll'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `synchronize'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `poll'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:1761:in `wait_on_continuations'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:920:in `block in basic_consume_with'
App 6423 stdout: /usr/share/ruby/timeout.rb:91:in `block in timeout'
App 6423 stdout: /usr/share/ruby/timeout.rb:101:in `call'
App 6423 stdout: /usr/share/ruby/timeout.rb:101:in `timeout'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:919:in `basic_consume_with'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/queue.rb:192:in `subscribe'


2. In Channel#close:

App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:39:in `block in poll'
App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `synchronize'
App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `poll'
App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:1318:in `wait_on_continuations'
App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:534:in `block in close_channel'
App 6423 stderr:   70122745671040       /usr/share/ruby/monitor.rb:211:in `mon_synchronize'
App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:530:in `close_channel'
App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:236:in `close'


3. In Session#create_channel:

App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:39:in `block in poll'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `synchronize'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `poll'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:1318:in `wait_on_continuations'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:521:in `block in open_channel'
App 6423 stdout: /usr/share/ruby/monitor.rb:211:in `mon_synchronize'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:514:in `open_channel'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:223:in `open'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:355:in `block in create_channel'
App 6423 stdout: /usr/share/ruby/monitor.rb:211:in `mon_synchronize'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:350:in `create_channel'


4. In Channel#queue:

App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:39:in `block in poll'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `synchronize'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `poll'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:1761:in `wait_on_continuations'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:1004:in `queue_declare'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/queue.rb:370:in `declare!'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/queue.rb:51:in `initialize'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:413:in `new'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:413:in `queue'


Other errors seen:

A.
App 6423 stdout: W, [2017-06-21T07:55:21.500595 #6423]  WARN -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: Recovering from connection.close (CHANNEL_ERROR - second 'channel.open' seen)
App 6423 stdout: W, [2017-06-21T07:55:21.501245 #6423]  WARN -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: Will recover from a network failure (no retry limit)...

B.
App 6423 stdout: W, [2017-06-21T07:55:50.883476 #6423]  WARN -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: Retrying connection on next host in line: rmq:5672


C.
App 6423 stdout: E, [2017-06-21T07:55:55.115188 #6423] ERROR -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: Got an exception when receiving data: Connection reset by peer (Errno::ECONNRESET)
App 6423 stdout: W, [2017-06-21T07:55:55.115378 #6423]  WARN -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: An empty frame was received while opening the connection. In RabbitMQ <= 3.1 this could mean an authentication issue.
App 6423 stdout: W, [2017-06-21T07:55:55.115495 #6423]  WARN -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: TCP connection failed, reconnecting in 5.0 seconds
App 6423 stdout: W, [2017-06-21T07:55:55.115535 #6423]  WARN -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: Will recover from a network failure (no retry limit)...


Arindam

Michael Klishin

Jun 21, 2017, 1:15:24 PM
to ruby...@googlegroups.com
Thanks for the details but what is your question specifically?

Continuation queue timeouts, in particular when they involve channel.open operations,
can mean all kinds of things, but you do have a TCP connection failure in the log,
which looks like the primary candidate for the root cause.





--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Jun 21, 2017, 1:16:46 PM
to ruby...@googlegroups.com
Also check the server logs. If you are using Bunny connections in a concurrent way that
results in incorrect frame interleaving on the wire, it will be easy to spot there.

Some general recommendations can be found in http://rubybunny.info/articles/concurrency.html.

Arindam Mukherjee

Jun 21, 2017, 2:47:20 PM
to Ruby RabbitMQ libraries


> Thanks for the details but what is your question specifically?


How to get rid of these errors, especially given the low volume of messages and the low number of queues we are dealing with.
 
> Continuation queue timeouts, in particular when they involve channel.open operations,
> can mean all kinds of things, but you do have a TCP connection failure in the log,
> which looks like the primary candidate for the root cause.


Recovering from connection.close (CHANNEL_ERROR - second 'channel.open' seen)

What does the "second 'channel.open' seen" indicate? I am not calling open twice on any channel. Also, the Rails app (Phusion Passenger) and RabbitMQ are running in different containers on the same host, so the chances of a network outage are low: there is no physical network between the applications, just a docker0 bridge.
 
Here is what I see in the RabbitMQ server logs.

During the three different timeout episodes, I see a period of 2-3 minutes during which there were no log entries on the server:

=INFO REPORT==== 21-Jun-2017::06:37:33 ===
closing AMQP connection <0.31984.26> (172.17.0.9:44224 -> 172.17.0.7:5672)

<<< nothing here. This client experienced timeouts during this time. >>>

=INFO REPORT==== 21-Jun-2017::06:39:44 ===
accepting AMQP connection <0.254.27> (172.17.0.9:44315 -> 172.17.0.7:5672)

=INFO REPORT==== 21-Jun-2017::06:39:44 ===
closing AMQP connection <0.254.27> (172.17.0.9:44315 -> 172.17.0.7:5672)

The corresponding error in the server log:

=ERROR REPORT==== 21-Jun-2017::07:55:21 ===
Error on AMQP connection <0.30990.16> (172.17.0.11:47924 -> 172.17.0.7:5672, vhost: '/', user: 'test', state: running), channel 3:
operation channel.open caused a connection exception channel_error: "second 'channel.open' seen"

=INFO REPORT==== 21-Jun-2017::07:55:21 ===
closing AMQP connection <0.30990.16> (172.17.0.11:47924 -> 172.17.0.7:5672)

=INFO REPORT==== 21-Jun-2017::07:55:51 ===
accepting AMQP connection <0.5177.28> (172.17.0.11:53889 -> 172.17.0.7:5672)

=ERROR REPORT==== 21-Jun-2017::07:55:55 ===
closing AMQP connection <0.5177.28> (172.17.0.11:53889 -> 172.17.0.7:5672):
{handshake_error,tuning,2,
                 {amqp_error,unexpected_frame,
                             "type 1, first 16 octets = <<0,20,0,40,0,200,7,71,111,111,100,98,121,101,0,0>>",
                             none}}
  

Arindam


Michael Klishin

Jun 21, 2017, 2:56:19 PM
to ruby...@googlegroups.com
"Second channel.open seen" means that a channel.open frame was seen by a connection with a channel id that is already allocated (open).

This is a critical (unrecoverable) protocol exception, so RabbitMQ closes the connection, but the client does not expect that and assumes it has to recover. In addition, no responses for earlier operations arrive (the connection is closed), resulting in timeouts.

This, and the note about the apps and RabbitMQ being on the same host, suggests a channel sharing issue:
for example, multiple threads closing and opening channels on a shared connection without synchronization.

Arindam Mukherjee

Jun 21, 2017, 3:25:07 PM
to Ruby RabbitMQ libraries


On Thursday, 22 June 2017 00:26:19 UTC+5:30, Michael Klishin wrote:
"Second channel.open seen" means that a channel.open frame was seen by a connection with a channel id that is already allocated (open).

This is a critical (unrecoverable) exception in the protocol, so RabbitMQ closes the connection but the client does not expect that and assumes it has to recover. Also, no responses for earlier operations arriv (the connection is closed), resulting in timeouts.

This and the note about apps and RabbitMQ on the same host suggest it is a channel sharing issue, 
for example, multiple threads close and open channels on a shared connection without synchronization.

1. Regarding "multiple threads close and open channels on a shared connection without synchronization":
 

Connections in Bunny will synchronize writes both for messages and at the socket level. This means that publishing on a shared connection from multiple threads is safe, but only if every publishing thread uses a separate channel.
Without any synchronization on your end, publishing from multiple threads on a shared channel may result in frames being sent to RabbitMQ out of order.

But there are no shared channels across threads. Each thread opens and closes its own channels. What kind of additional synchronization is needed?

> the note about apps and RabbitMQ on the same host suggest that it is a channel sharing issue

No channels are shared across threads. Also, how do the two containers (Rails, RabbitMQ) being on the same host influence this?

Michael Klishin

Jun 21, 2017, 3:40:14 PM
to ruby...@googlegroups.com
The issue here is not with publishing. If it were with publishing, you'd get a different error
from the parser, something like:

> UNEXPECTED_FRAME - expected content header for class 60, got non content header frame instead

This is because a published message is at least 2, typically 3 frames on the wire:

[basic.publish frame] [content header (message metadata)] [message payload]*

Here's a blog post from a team that managed to run into this with Puma:

Your issue is NOT the same. Something in your code opens and closes channels concurrently, so at some point
RabbitMQ believes that channel N is open while the client believes that it was closed and thus ID = N can
be reused, which confuses the server.

I cannot be more specific as I don't have your code or access
to your environment (or the time to get to the bottom of every question, to be quite honest) but
there is enough evidence that this is a concurrency hazard.

Two containers being on the same host is irrelevant to concurrency within a single process.
However, it was important to mention because the connection recovery logged by Bunny is not
due to a TCP connection failure but rather because RabbitMQ closed the connection due to an unrecoverable protocol exception.



Arindam Mukherjee

Jun 23, 2017, 4:15:11 PM
to Ruby RabbitMQ libraries
Thanks for the insights and sorry for the frantic flailing. I have been struggling with this issue for a month or so.

> Something in your code opens and closes channels concurrently, so at some point
> RabbitMQ believes that channel N is open while the client believes that it was closed and thus ID = N can
> be reused, which confuses the server.

When you say that something is opening and closing channels concurrently, do you mean that a channel is closed by a thread, and before this is registered by the server, Bunny reuses that channel's ID when a different thread tries to open a channel around the same time?

What kind of issues in client code calling Bunny could possibly lead to this?

Would you suggest any way to fix this, even if it's just a hack? Say, making sure that no channel is opened within 10 seconds of one being closed?

Is it possible that keeping a single Bunny session open for over 20 hours, and opening more and more channels, is a problem?

Michael Klishin

Jun 23, 2017, 5:57:15 PM
to ruby...@googlegroups.com
I cannot recommend a fix or hack since I don't have your code, traffic dumps
or anything else that provides *the details* that are badly needed when reasoning about
a concurrency hazard.

Connections are meant to be long lived and 20 hours is nothing special.
Channels can be long lived but also can be short lived (opening a channel per operation
is quite inefficient but beyond that, short lived channels are fine).

If you keep accumulating channels, two things will happen:

 * They will consume more and more resources (primarily RAM) in both your app and RabbitMQ node
 * Eventually you will hit the limit known as channel_max, the maximum allowed number of channels
   (65535 by default IIRC)

When I talk about opening and closing channels, I mean two operations:

 * Bunny::Session#create_channel
 * Bunny::Channel#close

A channel can also be closed due to a channel-level exception, but those result in Ruby
exceptions very quickly, if not immediately, so they are hard not to notice.
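
For illustration only (the queue name and options here are made up), the same lifecycle written out explicitly, plus the with_channel shorthand that pairs the two operations in an exception-safe way:

require "bunny"

conn = Bunny.new(hostname: "rmq")
conn.start

# Explicit pairing of the two operations:
ch = conn.create_channel        # Bunny::Session#create_channel
begin
  ch.queue("jobs", durable: true).publish("hello")
ensure
  ch.close                      # Bunny::Channel#close
end

# Exception-safe shorthand for the same pattern:
conn.with_channel do |ch|
  ch.queue("jobs", durable: true).publish("hello")
end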

Every Bunny connection has a map (hash) of channels:

access to which is protected with a mutex in most places (possibly not all;
covering every possible modification is not the goal or the guarantee):



Connections identify channels by an ID, which is just an integer.
The pool of available IDs is managed using a channel allocator based on a bit set:



On the RabbitMQ end, each connection has a map of channels largely similar to what
Bunny has. Channel management in the server is more involved but concurrent access
is effectively excluded because all operations on a given channel are processed sequentially
by the server.
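
To illustrate the general idea of such an ID allocator (this is NOT Bunny's actual implementation, just a simplified sketch): it hands out the lowest free integer ID and keeps it marked as used until it is explicitly released.

class BitSetChannelIdAllocator
  def initialize(max_id)
    @max_id = max_id
    @bits   = 0          # bit i set => channel ID i is in use
    @lock   = Mutex.new
  end

  # Returns the lowest free ID, or nil if all IDs are taken.
  def allocate
    @lock.synchronize do
      id = (1..@max_id).find { |i| @bits[i].zero? }
      @bits |= (1 << id) if id
      id
    end
  end

  # Should only be called once the peer has confirmed channel.close-ok,
  # otherwise the two sides can disagree about whether the ID is free.
  def release(id)
    @lock.synchronize { @bits &= ~(1 << id) }
  end
end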

Opening or closing a channel is a network roundtrip (channel.open => channel.open-ok,
channel.close => channel.close-ok). Bunny intentionally waits for the reply before
returning control to your app.

In your case, at some point Bunny believes that an ID is available for reuse when RabbitMQ does not
agree. So Bunny tries to open a channel that the server believes is still in use, a fatal connection
exception is triggered, and RabbitMQ closes the TCP connection.

Taking a traffic dump with tcpdump and inspecting it with Wireshark will greatly help with identifying
the sequence of frames on the wire, and hopefully validate my hypothesis:

Of course, the easiest way to completely dodge any shared-state concurrency hazard is to not
have shared state if you can: using N connections in an app instead of 1 can be reasonable
and make your code more straightforward. I don't know if it's realistic in your case,
but 5, 10, even 30 connections per app are not alarmingly large numbers (as long as your server has enough
file descriptors available).
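
A minimal sketch of that approach, giving each worker thread its own dedicated connection (the worker count, names and the in-process work queue are all made up for the example):

require "bunny"

WORKERS = 5
jobs = Queue.new
100.times { |i| jobs << "job #{i}" }
WORKERS.times { jobs << :done }        # poison pills so workers exit

threads = Array.new(WORKERS) do
  Thread.new do
    conn = Bunny.new(hostname: "rmq")  # one dedicated connection per thread, nothing shared
    conn.start
    while (job = jobs.pop) != :done
      conn.with_channel do |ch|
        ch.default_exchange.publish(job, routing_key: "jobs")
      end
    end
    conn.close
  end
end

threads.each(&:join)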




Arindam Mukherjee

Jun 27, 2017, 1:26:47 PM
to Ruby RabbitMQ libraries
I noticed two things.

a. Even after setting the heartbeat to 120 on the server, and using :heartbeat => :server when creating Bunny sessions, we would occasionally time out while performing an operation on the connection. This happens irrespective of whether we have a single open connection or several (we have tried with 4).

b. The issues with channel ids, where Bunny and the server don't agree on a channel having been closed, start happening after this timeout. I was able to work around it by manually assigning channel ids in an increasing sequence and keeping track of used vs. free channel ids.
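
A rough sketch of the kind of bookkeeping I mean (simplified; the names are made up, and it relies only on Bunny::Session#create_channel accepting an explicit channel id):

require "bunny"
require "set"

class ManualChannelIds
  def initialize
    @lock    = Mutex.new
    @next_id = 0
    @in_use  = Set.new
  end

  def acquire
    @lock.synchronize do
      @next_id += 1
      @in_use << @next_id
      @next_id
    end
  end

  def release(id)
    @lock.synchronize { @in_use.delete(id) }
  end
end

conn = Bunny.new(hostname: "rmq")
conn.start
ids = ManualChannelIds.new

id = ids.acquire
ch = conn.create_channel(id)   # explicit, app-chosen channel id
begin
  ch.queue("jobs").publish("hello")
ensure
  ch.close
  ids.release(id)
end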

However, #a keeps showing up in every test run randomly. Typically during the periods when such a timeout is observed, the RabbitMQ server logs show no activity. Are there any connection attributes, etc. in Bunny that I can log when such timeouts happen to get a better idea of what's going on?


Regards,
Arindam

Arindam Mukherjee

Jun 27, 2017, 3:09:58 PM
to Ruby RabbitMQ libraries
For #a in my previous message, I realize that the default continuation_timeout of 4000 milliseconds might be too low. What factors should I take into account when computing a reasonable value for the continuation timeout?

Regards,
Arindam

Michael Klishin

Jun 27, 2017, 3:13:19 PM
to ruby...@googlegroups.com
It should be the same as heartbeat timeout (and 120 is way too high).
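
For example (the interval itself is illustrative; the point is that the two values describe the same amount of time, with continuation_timeout expressed in milliseconds):

require "bunny"

conn = Bunny.new(
  hostname:             "rmq",
  heartbeat:            30,       # seconds
  continuation_timeout: 30_000    # milliseconds, i.e. the same 30 seconds
)
conn.start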