Bunny 2.7.0 timeout errors in a multithreaded Rails app


Arindam Mukherjee

Jun 21, 2017, 1:12:31 PM
to Ruby RabbitMQ libraries
I see several errors from Bunny 2.7.0 in a multi-threaded Rails app in which background threads connect to RabbitMQ to exchange messages. 

1. The threads can be from a thread pool (typically from concurrent-ruby).
2. Usually a single connection is shared and each task (running on a thread-pool thread) makes its own channel to talk to RabbitMQ.
3. Channels are closed in an exception-safe way using with_channel (see the sketch after this list).
4. Most of the queues are ephemeral, RPC-over-AMQP-style queues. All exchanges are direct exchanges. There are also standard queues bound to these exchanges.
5. The number of queues never exceeds 20, the number of concurrent threads connected to RabbitMQ never exceeds 20, and the number of messages on a queue rarely exceeds 10.
6. The Bunny sessions use a :heartbeat => :server option.
7. The RabbitMQ server runs with hipe_compile set to true. SSL is not used.
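Roughly, the setup looks like the sketch below (names, exchange/queue parameters and the message flow are illustrative, not the actual application code):

require "bunny"
require "concurrent"

conn = Bunny.new(hostname: "rmq", heartbeat: :server)   # single shared connection
conn.start

pool = Concurrent::FixedThreadPool.new(10)              # concurrent-ruby thread pool

20.times do |i|
  pool.post do
    # Each task opens its own short-lived channel; with_channel closes it
    # even if the block raises.
    conn.with_channel do |ch|
      x       = ch.direct("app.rpc")
      reply_q = ch.queue("", exclusive: true, auto_delete: true)   # ephemeral reply queue
      x.publish("request #{i}", routing_key: "worker", reply_to: reply_q.name)
      # ... wait for a reply on reply_q, handle timeouts, etc. ...
    end
  end
end

pool.shutdown
pool.wait_for_termination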

The following are the error stacks:

1. In Queue#subscribe:

App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:39:in `block in poll'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `synchronize'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `poll'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:1761:in `wait_on_continuations'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:920:in `block in basic_consume_with'
App 6423 stdout: /usr/share/ruby/timeout.rb:91:in `block in timeout'
App 6423 stdout: /usr/share/ruby/timeout.rb:101:in `call'
App 6423 stdout: /usr/share/ruby/timeout.rb:101:in `timeout'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:919:in `basic_consume_with'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/queue.rb:192:in `subscribe'


2. In Channel#close:

App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:39:in `block in poll'
App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `synchronize'
App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `poll'
App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:1318:in `wait_on_continuations'
App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:534:in `block in close_channel'
App 6423 stderr:   70122745671040       /usr/share/ruby/monitor.rb:211:in `mon_synchronize'
App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:530:in `close_channel'
App 6423 stderr:   70122745671040       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:236:in `close'


3. In Session#create_channel:

App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:39:in `block in poll'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `synchronize'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `poll'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:1318:in `wait_on_continuations'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:521:in `block in open_channel'
App 6423 stdout: /usr/share/ruby/monitor.rb:211:in `mon_synchronize'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:514:in `open_channel'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:223:in `open'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:355:in `block in create_channel'
App 6423 stdout: /usr/share/ruby/monitor.rb:211:in `mon_synchronize'
App 6423 stdout: /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/session.rb:350:in `create_channel'


4. In Channel#queue:

App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:39:in `block in poll'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `synchronize'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/concurrent/continuation_queue.rb:30:in `poll'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:1761:in `wait_on_continuations'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:1004:in `queue_declare'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/queue.rb:370:in `declare!'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/queue.rb:51:in `initialize'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:413:in `new'
App 6423 stderr:   70124403077420       /usr/local/share/gems/gems/bunny-2.7.0/lib/bunny/channel.rb:413:in `queue'


Other errors seen:

A.
App 6423 stdout: W, [2017-06-21T07:55:21.500595 #6423]  WARN -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: Recovering from connection.close (CHANNEL_ERROR - second 'channel.open' seen)
App 6423 stdout: W, [2017-06-21T07:55:21.501245 #6423]  WARN -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: Will recover from a network failure (no retry limit)...

B.
App 6423 stdout: W, [2017-06-21T07:55:50.883476 #6423]  WARN -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: Retrying connection on next host in line: rmq:5672


C.
App 6423 stdout: E, [2017-06-21T07:55:55.115188 #6423] ERROR -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: Got an exception when receiving data: Connection reset by peer (Errno::ECONNRESET)
App 6423 stdout: W, [2017-06-21T07:55:55.115378 #6423]  WARN -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: An empty frame was received while opening the connection. In RabbitMQ <= 3.1 this could mean an authentication issue.
App 6423 stdout: W, [2017-06-21T07:55:55.115495 #6423]  WARN -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: TCP connection failed, reconnecting in 5.0 seconds
App 6423 stdout: W, [2017-06-21T07:55:55.115535 #6423]  WARN -- #<Bunny::Session:0x64a5410 test@rmq:5672, vhost=/, addresses=[rmq:5672]>: Will recover from a network failure (no retry limit)...


Arindam

Michael Klishin

Jun 21, 2017, 1:15:24 PM
to ruby...@googlegroups.com
Thanks for the details but what is your question specifically?

Continuation queue timeouts, in particular when they involve channel.open operations,
can mean all kinds of things, but you do have a TCP connection failure in the log,
which looks like the primary candidate for the root cause.





--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Jun 21, 2017, 1:16:46 PM
to ruby...@googlegroups.com
Also check the server logs. If you are using Bunny connections in a concurrent way that
results in incorrect frame interleaving on the wire, it will be easy to spot there.

Some general recommendations can be found in http://rubybunny.info/articles/concurrency.html.

Arindam Mukherjee

Jun 21, 2017, 2:47:20 PM
to Ruby RabbitMQ libraries


> Thanks for the details but what is your question specifically?


How to get rid of these errors, especially given the low volume of messages and the low number of queues we are dealing with.
 
> Continuation queue timeouts, in particular when they involve channel.open operations,
> can mean all kinds of things, but you do have a TCP connection failure in the log,
> which looks like the primary candidate for the root cause.


Recovering from connection.close (CHANNEL_ERROR - second 'channel.open' seen)

What does the "second 'channel.open' seen" indicate? I am not calling open twice on any channel. Also, the Rails app (Phusion Passenger) and RabbitMQ are running in different containers on the same host, so the chances of a network outage are low: there is no physical network between the applications, just a docker0 bridge.
 
Here is what I see in the RabbitMQ server logs.

During the three different timeout episodes, I see a period of 2-3 minutes during which there were no log entries on the server:

=INFO REPORT==== 21-Jun-2017::06:37:33 ===
closing AMQP connection <0.31984.26> (172.17.0.9:44224 -> 172.17.0.7:5672)

<<< nothing here. This client experienced timeouts during this time. >>>

=INFO REPORT==== 21-Jun-2017::06:39:44 ===
accepting AMQP connection <0.254.27> (172.17.0.9:44315 -> 172.17.0.7:5672)

=INFO REPORT==== 21-Jun-2017::06:39:44 ===
closing AMQP connection <0.254.27> (172.17.0.9:44315 -> 172.17.0.7:5672)

The corresponding error in the server log:

=ERROR REPORT==== 21-Jun-2017::07:55:21 ===
Error on AMQP connection <0.30990.16> (172.17.0.11:47924 -> 172.17.0.7:5672, vhost: '/', user: 'test', state: running), channel 3:
operation channel.open caused a connection exception channel_error: "second 'channel.open' seen"

=INFO REPORT==== 21-Jun-2017::07:55:21 ===
closing AMQP connection <0.30990.16> (172.17.0.11:47924 -> 172.17.0.7:5672)

=INFO REPORT==== 21-Jun-2017::07:55:51 ===
accepting AMQP connection <0.5177.28> (172.17.0.11:53889 -> 172.17.0.7:5672)

=ERROR REPORT==== 21-Jun-2017::07:55:55 ===
closing AMQP connection <0.5177.28> (172.17.0.11:53889 -> 172.17.0.7:5672):
{handshake_error,tuning,2,
                 {amqp_error,unexpected_frame,
                             "type 1, first 16 octets = <<0,20,0,40,0,200,7,71,111,111,100,98,121,101,0,0>>",
                             none}}
  

Arindam


Michael Klishin

Jun 21, 2017, 2:56:19 PM
to ruby...@googlegroups.com
"Second channel.open seen" means that a channel.open frame was seen by a connection with a channel id that is already allocated (open).

This is a critical (unrecoverable) protocol exception, so RabbitMQ closes the connection, but the client does not expect that and assumes it has to recover. In addition, no responses for earlier operations arrive (the connection is closed), resulting in timeouts.

This, and the note about the apps and RabbitMQ being on the same host, suggests a channel sharing issue:
for example, multiple threads closing and opening channels on a shared connection without synchronization.

Arindam Mukherjee

Jun 21, 2017, 3:25:07 PM
to Ruby RabbitMQ libraries


On Thursday, 22 June 2017 00:26:19 UTC+5:30, Michael Klishin wrote:
"Second channel.open seen" means that a channel.open frame was seen by a connection with a channel id that is already allocated (open).

This is a critical (unrecoverable) exception in the protocol, so RabbitMQ closes the connection but the client does not expect that and assumes it has to recover. Also, no responses for earlier operations arriv (the connection is closed), resulting in timeouts.

This and the note about apps and RabbitMQ on the same host suggest it is a channel sharing issue, 
for example, multiple threads close and open channels on a shared connection without synchronization.

1. Regarding "multiple threads close and open channels on a shared connection without synchronization":
 

Connections in Bunny will synchronize writes both for messages and at the socket level. This means that publishing on a shared connection from multiple threads is safe, but only if every publishing thread uses a separate channel.
Without any synchronization on your end, publishing from multiple threads on a shared channel may result in frames being sent to RabbitMQ out of order.

But there are no shared channels across threads. Each thread opens and closes its own channels. What kind of additional synchronization is needed?

> the note about apps and RabbitMQ on the same host suggest that it is a channel sharing issue

No channels are shared across threads. Also, how do the two containers (Rails, RabbitMQ) being on the same host influence this?

Michael Klishin

Jun 21, 2017, 3:40:14 PM
to ruby...@googlegroups.com
The issue here is not with publishing. If it were with publishing, you'd get a different error
from the parser, something like:

> UNEXPECTED_FRAME - expected content header for class 60, got non content header frame instead

This is because a published message is at least 2, typically 3 frames on the wire:

[basic.publish frame] [content header (message metadata)] [message payload]*

Here's a blog post from a team that managed to run into this with Puma:

Your issue is NOT the same. Something in your code opens and closes channels concurrently, so at some point
RabbitMQ believes that channel N is open while the client believes that it was closed and thus ID = N can
be reused, which confuses the server.

I cannot be more specific as I don't have your code or access
to your environment (or the time to get to the bottom of every question, to be quite honest) but
there is enough evidence that this is a concurrency hazard.

Two containers being on the same host is irrelevant to concurrency within a single process.
However, it was important to mention because the connection recovery logged by Bunny is not
due to a TCP connection failure but rather because RabbitMQ closed the connection due to an unrecoverable protocol exception.



Arindam Mukherjee

Jun 23, 2017, 4:15:11 PM
to Ruby RabbitMQ libraries
Thanks for the insights and sorry for the frantic flailing. I have been struggling with this issue for a month or so.

> Something in your code opens and closes channels concurrently, so at some point
> RabbitMQ believes that channel N is open while the client believes that it was closed and thus ID = N can
> be reused, which confuses the server.

When you say that something is opening and closing channels concurrently, do you mean that a channel is closed by a thread, and before this is registered by the server, Bunny reuses that channel's ID when a different thread tries to open a channel around the same time?

What kind of issues in client code calling Bunny could possibly lead to this?

Would you suggest any way to fix this, even if it's just a hack? Say, making sure that no channel is opened within 10 seconds of one being closed?

Is it possible that keeping a single Bunny session open for over 20 hours, and opening more and more channels, is a problem?

Michael Klishin

Jun 23, 2017, 5:57:15 PM
to ruby...@googlegroups.com
I cannot recommend a fix or hack since I don't have your code, traffic dumps
or anything else that provides *the details* that are badly needed when reasoning about
a concurrency hazard.

Connections are meant to be long lived and 20 hours is nothing special.
Channels can be long lived but also can be short lived (opening a channel per operation
is quite inefficient but beyond that, short lived channels are fine).

If you keep accumulating channels, two things will happen:

 * They will consume more and more resources (primarily RAM) in both your app and RabbitMQ node
 * Eventually you will hit the limit known as channel_max, the maximum allowed number of channels
   (65535 by default IIRC)

When I talk about opening and closing channels, I mean two operations:

 * Bunny::Session#create_channel
 * Bunny::Channel#close

A channel can also be closed due to a channel-level exception, but those result in Ruby
exceptions very quickly, if not immediately, so they are hard not to notice.
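
For illustration only (the queue name and options here are made up), the same lifecycle written out explicitly, plus the with_channel shorthand that pairs the two operations in an exception-safe way:

require "bunny"

conn = Bunny.new(hostname: "rmq")
conn.start

# Explicit pairing of the two operations:
ch = conn.create_channel        # Bunny::Session#create_channel
begin
  ch.queue("jobs", durable: true).publish("hello")
ensure
  ch.close                      # Bunny::Channel#close
end

# Exception-safe shorthand for the same pattern:
conn.with_channel do |ch|
  ch.queue("jobs", durable: true).publish("hello")
end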

Every Bunny connection has a map (hash) of channels:

access to which is protected with a mutex in most places (possibly not all;
covering every possible modification is not the goal or the guarantee):



Connections identify channels by an ID, which is just an integer.
The pool of available IDs is managed using a channel allocator based on a bit set:



On the RabbitMQ end, each connection has a map of channels largely similar to what
Bunny has. Channel management in the server is more involved but concurrent access
is effectively excluded because all operations on a given channel are processed sequentially
by the server.
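
To illustrate the general idea of such an ID allocator (this is NOT Bunny's actual implementation, just a simplified sketch): it hands out the lowest free integer ID and keeps it marked as used until it is explicitly released.

class BitSetChannelIdAllocator
  def initialize(max_id)
    @max_id = max_id
    @bits   = 0          # bit i set => channel ID i is in use
    @lock   = Mutex.new
  end

  # Returns the lowest free ID, or nil if all IDs are taken.
  def allocate
    @lock.synchronize do
      id = (1..@max_id).find { |i| @bits[i].zero? }
      @bits |= (1 << id) if id
      id
    end
  end

  # Should only be called once the peer has confirmed channel.close-ok,
  # otherwise the two sides can disagree about whether the ID is free.
  def release(id)
    @lock.synchronize { @bits &= ~(1 << id) }
  end
end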

Opening or closing a channel is a network roundtrip (channel.open => channel.open-ok,
channel.close => channel.close-ok). Bunny intentionally waits for the reply before
returning control to your app.

In your case, at some point Bunny believes that an ID is available for reuse when RabbitMQ does not
agree. So Bunny tries to open a channel that the server believes is still in use, a fatal connection
exception is triggered, and RabbitMQ closes the TCP connection.

Taking a traffic dump with tcpdump and inspecting it with Wireshark will greatly help with identifying
the sequence of frames on the wire, and hopefully validate my hypothesis:

Of course, the easiest way to completely dodge any shared-state concurrency hazard is to not
have shared state if you can: using N connections in an app instead of 1 can be reasonable
and make your code more straightforward. I don't know if it's realistic in your case,
but 5, 10, even 30 connections per app are not alarmingly large numbers (as long as your server has enough
file descriptors available).
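
A minimal sketch of that approach, giving each worker thread its own dedicated connection (the worker count, names and the in-process work queue are all made up for the example):

require "bunny"

WORKERS = 5
jobs = Queue.new
100.times { |i| jobs << "job #{i}" }
WORKERS.times { jobs << :done }        # poison pills so workers exit

threads = Array.new(WORKERS) do
  Thread.new do
    conn = Bunny.new(hostname: "rmq")  # one dedicated connection per thread, nothing shared
    conn.start
    while (job = jobs.pop) != :done
      conn.with_channel do |ch|
        ch.default_exchange.publish(job, routing_key: "jobs")
      end
    end
    conn.close
  end
end

threads.each(&:join)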




Arindam Mukherjee

Jun 27, 2017, 1:26:47 PM
to Ruby RabbitMQ libraries
I noticed two things.

a. Even after setting the heartbeat to 120 on the server, and using :heartbeat => :server when creating Bunny sessions, we would occasionally time out while performing an operation on the connection. This happens irrespective of whether we have a single open connection or several (we have tried with 4).

b. The issues with channel ids, where Bunny and the server don't agree on a channel having been closed, start happening after this timeout. I was able to work around it by manually assigning channel ids in an increasing sequence and keeping track of used vs. free channel ids.
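
A rough sketch of the kind of bookkeeping I mean (simplified; the names are made up, and it relies only on Bunny::Session#create_channel accepting an explicit channel id):

require "bunny"
require "set"

class ManualChannelIds
  def initialize
    @lock    = Mutex.new
    @next_id = 0
    @in_use  = Set.new
  end

  def acquire
    @lock.synchronize do
      @next_id += 1
      @in_use << @next_id
      @next_id
    end
  end

  def release(id)
    @lock.synchronize { @in_use.delete(id) }
  end
end

conn = Bunny.new(hostname: "rmq")
conn.start
ids = ManualChannelIds.new

id = ids.acquire
ch = conn.create_channel(id)   # explicit, app-chosen channel id
begin
  ch.queue("jobs").publish("hello")
ensure
  ch.close
  ids.release(id)
end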

However, #a keeps showing up in every test run randomly. Typically during the periods when such a timeout is observed, the RabbitMQ server logs show no activity. Are there any connection attributes, etc. in Bunny that I can log when such timeouts happen to get a better idea of what's going on?


Regards,
Arindam

Arindam Mukherjee

Jun 27, 2017, 3:09:58 PM
to Ruby RabbitMQ libraries
For #a in my previous message, I realize that the default continuation_timeout of 4000 milliseconds might be too low. What factors should I take into account when computing a reasonable value for the continuation timeout?

Regards,
Arindam

Michael Klishin

Jun 27, 2017, 3:13:19 PM
to ruby...@googlegroups.com
It should be the same as heartbeat timeout (and 120 is way too high).
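
For example (the interval itself is illustrative; the point is that the two values describe the same amount of time, with continuation_timeout expressed in milliseconds):

require "bunny"

conn = Bunny.new(
  hostname:             "rmq",
  heartbeat:            30,       # seconds
  continuation_timeout: 30_000    # milliseconds, i.e. the same 30 seconds
)
conn.start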