Irregularly, Bunny loses the connection and can't reconnect

onli onli

Sep 3, 2018, 8:26:06 PM
to Ruby RabbitMQ libraries
Hi
I was sent here from https://github.com/ruby-amqp/bunny/issues/464#issuecomment-418202238. I wouldn't know what to look for if I went deep into Wireshark to debug this (at least not without additional guidance). But I want to be a good citizen and report my issue with as much code as I can. Maybe it helps someone, and in the best case I'm doing something trivially simple wrong and can get help here, which would be great.

Scenario: I have RabbitMQ on a server. On the same server, a script runs a thread pool, fetches data and sends it via Bunny to localhost's RabbitMQ. RabbitMQ delivers the data to a second server. On that second server runs Sinatra (with Puma), and in Sinatra's configure block it starts a Bunny subscribe call to fetch the incoming data. And herein lies the problem: this works, but sometimes it stops working. Messages pile up on the RabbitMQ side, several hundred thousand this time, which is a bit scary. When that happens RabbitMQ still works (it receives data and can accept new subscriptions), but Bunny is sometimes unable to restore the connection for an indefinitely long time (right now it recovered three times in a row after reporting an error).

This is what my sender looks like:

require 'bunny'           # AMQP client
require 'msgpack'         # for to_msgpack / MessagePack.unpack
require 'thread/pool'     # assuming the 'thread' gem, which provides Thread.pool

conn = Bunny.new("amqp://user:pw@localhost:5672")
conn.start

ch = conn.create_channel
@q = ch.queue('offers', exclusive: false)
@x = ch.default_exchange

# Publish data to the 'offers' queue via the default exchange
def notifyMaster(data)
  @x.publish(data.to_msgpack, routing_key: @q.name)
end

pool = Thread.pool(16)

pool.process {
  data = gather_data()
  notifyMaster(data)
}

pool.process {
  data = gather_other_data()
  notifyMaster(data)
}

This is my receiver, under Puma/Sinatra:

configure do
    conn = Bunny.new("amqp://user:pw@serverip:port")
    conn.start
    ch = conn.create_channel
    q  = ch.queue("offers")
    # Non-blocking subscription: messages are handled on Bunny's consumer thread
    q.subscribe do |delivery_info, metadata, payload|
        data = MessagePack.unpack(payload)
        doStuffWith(data)
    end

    # pool is a thread pool set up elsewhere in the app
    pool.process {
        while true
           sleep(3600)
           activateData()
        end
    }
end

These are the error messages bunny gives when it fails to recover:

E, [2018-09-03T22:38:49.832975 #30937] ERROR -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>: Got an exception when receiving data: stream closed (IOError)
W, [2018-09-03T22:38:49.833253 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>: Exception in the reader loop: IOError: stream closed
W, [2018-09-03T22:38:49.833345 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>: Backtrace:
W, [2018-09-03T22:38:49.833402 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     <internal:prelude>:76:in `__read_nonblock'
W, [2018-09-03T22:38:49.833454 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     <internal:prelude>:76:in `read_nonblock'
W, [2018-09-03T22:38:49.833506 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/cruby/socket.rb:55:in `block in read_fully'
W, [2018-09-03T22:38:49.833560 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/cruby/socket.rb:54:in `loop'
W, [2018-09-03T22:38:49.833612 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/cruby/socket.rb:54:in `read_fully'
W, [2018-09-03T22:38:49.833664 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/transport.rb:218:in `read_fully'
W, [2018-09-03T22:38:49.833716 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:    /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/transport.rb:240:in `read_next_frame'
W, [2018-09-03T22:38:49.833768 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/reader_loop.rb:78:in `block in run_once'
W, [2018-09-03T22:38:49.952678 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:    /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/reader_loop.rb:77:in `loop'
W, [2018-09-03T22:38:49.952785 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/reader_loop.rb:77:in `run_once'
W, [2018-09-03T22:38:49.952840 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/reader_loop.rb:35:in `block in run_loop'
W, [2018-09-03T22:38:49.952893 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/reader_loop.rb:32:in `loop'
W, [2018-09-03T22:38:49.952946 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/pc-kombo/www.pc-kombo.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/reader_loop.rb:32:in `run_loop'
W, [2018-09-03T22:38:49.953017 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>: Will recover from a network failure (no retry limit)...
W, [2018-09-03T22:38:59.953600 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>: Retrying connection on next host in line: ip:5672
W, [2018-09-03T22:39:58.483644 #30937]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>: Will recover from a network failure (no retry limit)...
E, [2018-09-03T22:39:58.484322 #30937] ERROR -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>: Got an exception when receiving data: stream closed (IOError)

Somehow the stream got closed. For comparison, this is what it looks like when the recovery works out:

E, [2018-09-03T23:15:05.867769 #31730] ERROR -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>: Got an exception when receiving data: Connection reset by peer (Errno::ECONNRESET)
W, [2018-09-03T23:15:06.062731 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>: Exception in the reader loop: Errno::ECONNRESET: Connection reset by peer
W, [2018-09-03T23:15:06.182571 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>: Backtrace:
W, [2018-09-03T23:15:06.262606 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     <internal:prelude>:76:in `__read_nonblock'
W, [2018-09-03T23:15:06.382588 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     <internal:prelude>:76:in `read_nonblock'
W, [2018-09-03T23:15:06.383066 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/cruby/socket.rb:55:in `block in read_fully'
W, [2018-09-03T23:15:06.383352 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/cruby/socket.rb:54:in `loop'
W, [2018-09-03T23:15:06.383609 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/cruby/socket.rb:54:in `read_fully'
W, [2018-09-03T23:15:06.383883 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/transport.rb:218:in `read_fully'
W, [2018-09-03T23:15:06.384131 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:    /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/transport.rb:243:in `read_next_frame'
W, [2018-09-03T23:15:06.384434 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:    /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/reader_loop.rb:78:in `block in run_once'
W, [2018-09-03T23:15:06.384680 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/reader_loop.rb:77:in `loop'
W, [2018-09-03T23:15:06.384920 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/reader_loop.rb:77:in `run_once'
W, [2018-09-03T23:15:06.385157 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:    /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/reader_loop.rb:35:in `block in run_loop'
W, [2018-09-03T23:15:06.385394 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:    /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/reader_loop.rb:32:in `loop'
W, [2018-09-03T23:15:06.385630 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>:     /home/user/www.user.de/vendor/ruby/2.3.0/gems/bunny-2.11.0/lib/bunny/reader_loop.rb:32:in `run_loop'
W, [2018-09-03T23:15:06.385886 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>: Will recover from a network failure (no retry limit)...
W, [2018-09-03T23:15:16.386608 #31730]  WARN -- #<Bunny::Session:0x3440f18 user@ip:5672, vhost=/, addresses=[ip:5672]>: Retrying connection on next host in line: ip:5672

So, to me, it seems that sometimes the recovery process can run into an exception, thinks the stream is closed (or maybe the stream really does get closed for a moment, I don't know), and then dies in a fire instead of waiting and trying again later. That resembles very much what I thought was described in the GitHub issue linked above. All I have to do to make it work again for a while is to restart the receiver, by restarting the Sinatra process. Thus I think this is a bug.

Should I try to catch Errno::ECONNRESET around the bunny subscribe and restart the connection/subscription in that rescue block manually?

Thanks for reading

Michael Klishin

Sep 3, 2018, 8:36:16 PM
to ruby...@googlegroups.com
Errno::ECONNRESET is already handled [1] together with all other Errno::* exception classes,
since they inherit from SystemCallError [2].
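For example, a quick check in irb shows that relationship (just an illustration, not from the Bunny sources):

Errno::ECONNRESET.superclass                          # => SystemCallError
Errno::ECONNRESET.ancestors.include?(SystemCallError) # => true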

You will not achieve anything by wrapping Bunny::Queue#subscribe calls with a rescue/catch block:

 * Queues are not concerned with connection recovery; connections are
 * I/O exceptions are thrown and handled in a separate I/O thread unless

Can this be reproduced with a single node? How are the two nodes interconnected? How do you simulate
the I/O exception? How many connections, channels and consumers (on the queue page) do you see in the management
UI in each scenario/outcome?

Consuming in a Puma-based app is a pretty unusual case, and Puma can reap its threads as it sees fit, which Bunny
cannot possibly recover from. But at year 6 or so I'm pretty sure most obvious connection recovery issues in Bunny
have been addressed (a recovery attempts limit is the most recent one: I suspect not enough people use the feature, so it
took a couple of years!), so the root cause should be reasonably easy to find provided there is a runnable way to reproduce it.

As for traffic dumps, sharing them is a good idea.


--
Bunny: http://rubybunny.info
March Hare: http://rubymarchhare.info
IRC: #rabbitmq on irc.freenode.net



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Sep 3, 2018, 8:38:09 PM
to ruby...@googlegroups.com
Bunny can easily be made to log in debug mode, and server logs can provide additional clues, too
(although Wireshark captures 99% of the information both logs can cover, it can be more obscure to those not familiar with it).
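For what it's worth, verbose client-side logging is a constructor option (a minimal sketch; :log_level and :logger are standard Bunny.new options as far as I recall, adjust the URI to your setup):

require 'bunny'
require 'logger'

# Verbose client-side logging of connection and recovery activity
conn = Bunny.new("amqp://user:pw@serverip:5672",
                 log_level: :debug,
                 logger:    Logger.new($stdout))
conn.start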

onli onli

Sep 4, 2018, 3:20:25 AM
to Ruby RabbitMQ libraries
Hi Michael, thanks for the answer.

> Can this be reproduced with a single node?

I have no way to reproduce this at all; it happens irregularly. That also makes debugging hard - I'm not stubbornly ignoring you there, I just wouldn't even know when to debug, since I can't predict when the issue will arise.

> How are the two nodes interconnected?

They are both in the same data center and have public IPs. There is nothing special connecting them, and nothing special that I know of in between them.

> How do you simulate the I/O exception?

I don't. I don't know how.

> How many connections, channels and consumers (on the queue page) do you see in the management UI in each scenario/outcome?

When it works, I see 2 connections, 2 channels and one queue. That's the expected state, right? Now it just got stuck again (stream IO error), so it's dead again. I'll see whether something changed... Yes, the connection and the channel from the second server, the one with Sinatra, are gone. The queue is still there and running, data is still incoming. When I restart the server now... the connection and the channel reappear immediately. So there is nothing at the network level blocking the connection right now.

> Consuming in a Puma-based app is a pretty unusual case and Puma can reap its threads as it sees fit which Bunny cannot possibly recover

But isn't the stream closed error more a sign of a timing issue at the protocol level - why would it think the stream is closed? Ah, that's coming from a local socket read, so it is the socket that is closed, in https://github.com/ruby-amqp/bunny/blob/7769938b92007af697b590dd799b375c668583d6/lib/bunny/transport.rb#L220. Strange. Also, why the ECONNRESET in the first place? That would come from the server side...

Given the setup I have, how should I run Bunny otherwise? This all has to be in-process with Puma, since the goal is to save the data into a SQLite database, so having a separate process is not a real option.

Server logs: there is indeed something there, if you were thinking of the RabbitMQ server. At the time the reconnection just failed, RabbitMQ says

=ERROR REPORT==== 4-Sep-2018::07:01:31 ===
closing AMQP connection <0.6727.15> (sinatraserver-ip:59114 -> rabbitmqserver-ip:5672):
{writer,send_failed,{error,timeout}}

I will now check how to run Bunny in debug mode.

Michael Klishin

Sep 4, 2018, 7:28:43 AM
to ruby...@googlegroups.com
OK, so long story short, it happens naturally due to a network disruption and cannot be reproduced.
Fair enough.

The error message simply says that a socket write from the server end timed out.

You have two options IMO:

 * Use the single-threaded mode with automatic recovery disabled. Recovery will then be entirely up to you, but in many cases it's not that difficult to do, in particular for consumers
 * Limit the number of recovery attempts to, say, 10, and forcefully recreate the connection in your code when it's closed and not recovering (there is no predicate for knowing when a connection has reached its recovery attempt limit, but one can be added if necessary); see the sketch below
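Something along these lines for the second option (untested, and the recovery_attempts / open? usage is from memory, so double-check against the docs; the "offers" consumer setup is copied from the code earlier in this thread):

require 'bunny'
require 'msgpack'

def build_consumer
  conn = Bunny.new("amqp://user:pw@serverip:5672",
                   recovery_attempts: 10)   # give up after 10 recovery attempts
  conn.start
  ch = conn.create_channel
  ch.queue("offers").subscribe do |_delivery_info, _metadata, payload|
    doStuffWith(MessagePack.unpack(payload))
  end
  conn
end

conn = build_consumer

# Watchdog: since there is no built-in predicate for "gave up recovering",
# poll the connection and rebuild it from scratch once it stays closed.
Thread.new do
  loop do
    sleep 30
    conn = build_consumer unless conn.open?
  end
end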

With both options I'd use Bunny master, since there are improvements for both (in particular the functional recovery attempt limit for option 2).

I can try running long tests with Toxiproxy in the hope of reproducing this, but ultimately socket write timeouts should result in missed heartbeat failures and recovery attempts
that use sensible TCP socket timeouts, and therefore won't hang forever under any circumstances. At least we are not aware of any such cases.


onli onli

Sep 12, 2018, 5:59:06 PM
to Ruby RabbitMQ libraries
A short update: on the server that lost the connection there was a performance issue that boiled down to a big table, missing an index, being read very often. Since that was fixed, the Bunny connection has stayed stable or could always be repaired. That might be by chance, but for now I assume the periodically extremely high load of the Ruby process triggered a race condition in Bunny's thread handling. If the issue repeats itself, I will try the route of disabling the automatic reconnect and just reinitialising the whole connection each time it goes wrong; that's how I handle regular REST API connection problems anyway. Thanks for laying out those options.

I'd then post my code, if it works. But I hope the bug is just gone now.
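For reference, the kind of wrapper I have in mind would look roughly like this (untested sketch; automatically_recover: false and the blocking subscribe are how I understand the Bunny docs on manual recovery, and the rescued exception list is a guess):

require 'bunny'
require 'msgpack'

Thread.new do
  begin
    # With automatic recovery disabled, a network failure should surface as an
    # exception in the thread that started the connection, so this thread can
    # simply rebuild everything from scratch, like with a REST client.
    conn = Bunny.new("amqp://user:pw@serverip:5672",
                     automatically_recover: false)
    conn.start
    ch = conn.create_channel
    ch.queue("offers").subscribe(block: true) do |_delivery_info, _metadata, payload|
      doStuffWith(MessagePack.unpack(payload))
    end
  rescue Bunny::NetworkFailure, Bunny::TCPConnectionFailed, IOError, SystemCallError => e
    warn "Bunny subscription died (#{e.class}), reconnecting in 10s"
    sleep 10
    retry
  end
end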

Michael Klishin

Sep 12, 2018, 6:54:44 PM
to ruby...@googlegroups.com
Irregularities in thread scheduling can be a very hard-to-reproduce complicating factor.

I can't think of a specific scenario where it would break Bunny's recovery (which is not rocket science),
but it's hard to argue with evidence, even anecdotal evidence ;)

Thank you for reporting back to the list!

