I get the feeling that the documentation guide
improvements I am working on are less important than having some basic
solutions and/or guidelines in place for app recovery.
So I sat down and put together a list of things I will be working on next.
I can't guarantee that they will
work great for everyone, so if something doesn't go well, it will be
taken out. Recovery is a hard enough problem
on its own without having to support/fight broken features forever.
Let's take a look at what I think is a good list of features to have:
:recovery event
-----------------------------------------------------------------
Right now there are a number of events on connections that you
can register callbacks (handlers) for:
* successful connection
* authentication failure
* TCP connection loss
and so on. Channels, exchanges and queues have only a limited ability
to respond to the TCP connection loss event, for example: they are
automatically reset, but that's about it.
What is badly needed is a new event:
* recovery
that is fired when network connection is back up *and* AMQP connection
is reopened.
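The shape of the idea: the new event sits alongside the existing ones in the same callback registry. Here is a toy sketch in plain Ruby (this is not the amqp gem's API, just a model of the proposed event):

```ruby
# Minimal event emitter modeling connection lifecycle events,
# including the proposed :recovery event (fired only once the network
# is back up *and* the AMQP connection has been reopened).

class ToyConnection
  EVENTS = [:connected, :authentication_failure, :tcp_connection_loss, :recovery]

  def initialize
    @handlers = Hash.new { |h, k| h[k] = [] }
  end

  # Register a callback for one of the known events
  def on(event, &block)
    raise ArgumentError, "unknown event #{event}" unless EVENTS.include?(event)
    @handlers[event] << block
  end

  # Fire all callbacks registered for an event
  def fire(event)
    @handlers[event].each(&:call)
  end
end

log = []
conn = ToyConnection.new
conn.on(:tcp_connection_loss) { log << "lost" }
conn.on(:recovery)            { log << "recovered" }

conn.fire(:tcp_connection_loss)  # network goes down
conn.fire(:recovery)             # network is back AND AMQP connection reopened
puts log.inspect                 # => ["lost", "recovered"]
```

The point is only that :recovery is one more event in an existing registry, so entities can subscribe to it like they do to any other.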
Methods to re-open/re-declare
-----------------------------------------------------------------
Next piece of the puzzle is a bunch of methods that all serve a similar
purpose:
AMQP::Channel#reopen
AMQP::Exchange#redeclare
AMQP::Queue#redeclare
Those methods will simply use existing object state/attributes to
redeclare themselves (presumably once the connection comes back up).
Which leads to the Crown Jewel of recovery features...
Automagical recovery done right
-----------------------------------------------------------------
By combining the recovery event, the re-declaration methods and the existing
"failure shutdown propagation" (when the connection resets its channels,
they in turn reset exchanges and queues, and so on; the RabbitMQ Java client
calls this the "shutdown protocol"), we can implement reasonably
good "automatic recovery".
Automagical recovery will only apply to entities that you
specify as auto-recovering (using the :auto_recovery => true option).
It will set up a handler for the :recovery event that calls/schedules
#reopen/#redeclare, so it will be easy to implement your own
"auto recovery" with the same basic tools.
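Wired together, the opt-in behavior could look roughly like this toy model (plain Ruby, not the gem's internals; ToyConnection and ToyQueue are made-up names for illustration):

```ruby
# Toy model of opt-in auto-recovery: a queue created with
# :auto_recovery => true registers a :recovery handler that calls
# #redeclare, which simply reuses the object's stored name and options.

class ToyConnection
  def initialize
    @handlers = Hash.new { |h, k| h[k] = [] }
  end

  def on(event, &block)
    @handlers[event] << block
  end

  def fire(event)
    @handlers[event].each(&:call)
  end
end

class ToyQueue
  attr_reader :declarations

  def initialize(connection, name, opts = {})
    @name = name
    @opts = opts
    @declarations = []
    declare
    # Opt-in: only auto-recovering entities subscribe to :recovery
    connection.on(:recovery) { redeclare } if opts[:auto_recovery]
  end

  def declare
    # The real client would send queue.declare here; we just record it
    @declarations << [@name, @opts]
  end

  # Re-declaration reuses existing object state verbatim
  alias_method :redeclare, :declare
end

conn = ToyConnection.new
q = ToyQueue.new(conn, "tasks", :auto_recovery => true)
conn.fire(:recovery)        # network + AMQP connection are back
puts q.declarations.size    # => 2 (initial declare + one redeclare)
```

A hand-rolled recovery strategy would use the same pieces: subscribe to :recovery yourself and call #reopen/#redeclare in whatever order your app needs.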
Recovery is hard and I am sure automagical recovery mode will still suck
a great deal, but everyone wants to see it with their own eyes ;)
There are a number of cases where automagical recovery will probably need
some kind of default behavior. For example, server-named queues pretty
much must be re-declared upon recovery (otherwise we cannot know whether
their names are unique).
I also want to make some events a bit more fine-grained. Right now we have
* X has happened
but for some events, it makes sense to split them in two:
* before X
* after X
I haven't made my mind up about what those events are, but :recovery
is one of them.
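As a toy illustration of the before/after split (hypothetical event names, not the current API):

```ruby
# Splitting one event into two phases: handlers that must run before
# the work (e.g. pausing publishers) and handlers that run after it
# (e.g. redeclaring queues). Names are illustrative only.

class Emitter
  def initialize
    @handlers = Hash.new { |h, k| h[k] = [] }
  end

  def on(event, &block)
    @handlers[event] << block
  end

  # Run the :before_recovery phase, the recovery work itself,
  # then the :after_recovery phase, in that order
  def run_recovery(&work)
    @handlers[:before_recovery].each(&:call)
    work.call
    @handlers[:after_recovery].each(&:call)
  end
end

log = []
e = Emitter.new
e.on(:before_recovery) { log << "before" }
e.on(:after_recovery)  { log << "after" }
e.run_recovery { log << "recovering" }
puts log.inspect  # => ["before", "recovering", "after"]
```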
In the best traditions of Hammock-Driven Development [1] I will sleep on
this idea for at least one day. But other than that, I think this "opt-in" behavior,
plus the ability to implement your own recovery strategy using the same
basic tools as the automagical behavior, is worth having and won't cause
people grief (if you don't like these features, don't use them; they are not shoved
down developers' throats).
Let me know if you have better ideas. Or just think this particular idea
sucks.
1. http://blip.tv/clojure/hammock-driven-development-4475586
MK
http://github.com/michaelklishin
http://twitter.com/michaelklishin
--
Documentation guides: http://bit.ly/amqp-gem-docs
Code examples: http://bit.ly/amq-gem-examples
API reference: http://bit.ly/mDm1JE
Drop by #rabbitmq on irc.freenode.net
Bug tracker: https://github.com/ruby-amqp/amqp/issues
Post to the group: ruby...@googlegroups.com | unsubscribe: ruby-amqp+...@googlegroups.com
Group page: http://groups.google.com/group/ruby-amqp?hl=en
Eek. So, specifically, what do you want with respect to heartbeat handling? Because this is all part of reconnection/recovery, at least in my eyes. Let's make heartbeat handling more useful while we are at it.
The only solution I see is to add some kind of buffering and "connectivity state" to exchanges. The RabbitMQ Java client
has "connectivity state" for queues and exchanges, AFAIR; I need to read the source to learn more.
The problem is that this will affect message throughput, and I can't say I have a solution for that in mind right now.
One more problem is that if your network goes down for hours (not completely unheard of, really), the number of accumulated
messages may simply exceed the process memory allowance, so the process will be killed by the OOM killer or a monitoring tool like Monit or Nagios.
So maybe we should provide a way for apps to plug into network connection loss handling and stop publishing. I don't believe
the amqp gem can really provide a good enough solution here, at least not with the default AMQP::Exchange implementation.
Maybe having a module that overrides Exchange#publish to support buffering is a good idea.
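A rough sketch of what such a module could look like (illustrative only: BufferedPublishing and the connectivity-state hooks are invented names, and FakeExchange stands in for a real exchange class):

```ruby
# Hypothetical buffering mixin: while the connection is down, #publish
# queues messages instead of sending them; on recovery the buffer is
# flushed in order. Not the amqp gem API, just a sketch of the idea.

module BufferedPublishing
  def initialize(*args)
    super
    @buffer = []
    @connected = true
  end

  # Hooks a real client would call from its connection loss/recovery events
  def connection_lost!
    @connected = false
  end

  def connection_recovered!
    @connected = true
    flush_buffer
  end

  def publish(payload, opts = {})
    if @connected
      super
    else
      # Beware: an unbounded buffer can exhaust process memory during a
      # long outage; a real implementation needs a cap or spill-to-disk.
      @buffer << [payload, opts]
    end
  end

  private

  # Drain the buffer through the (now working) publish path
  def flush_buffer
    @buffer.shift(@buffer.size).each { |payload, opts| publish(payload, opts) }
  end
end

# Stand-in for a real exchange that records what was actually sent
class FakeExchange
  attr_reader :published

  def initialize
    @published = []
  end

  def publish(payload, opts = {})
    @published << payload
  end
end

class BufferedExchange < FakeExchange
  include BufferedPublishing
end

ex = BufferedExchange.new
ex.publish("a")
ex.connection_lost!
ex.publish("b")          # buffered, not sent
ex.publish("c")          # buffered, not sent
ex.connection_recovered! # flushes the buffer
puts ex.published.inspect  # => ["a", "b", "c"]
```

Because the module sits in front of #publish via the ancestor chain, the buffering is opt-in per class and the default publish path stays untouched (and at full throughput).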
> I think in my case the only way around this is to keep a copy of each
> message I send out until I get an ack for each of those messages. On
> reconnect if some of my messages never were ack`ed then I'd resend
> them, preferably before any newer messages to keep the proper order.
> On reconnect, I need a way to send out a couple of messages(the "lost"
> messages) immediately before any other pending messages on the queue.
> Of course if the gem also had an option to do this automagically, then
> it would be even more awesome. ;o)
>
Like I said, maybe with another opt-in option. I am still working on the consumer side of the problem.
> What are your plans regarding re/delivery of lost/pending messages
> after a recovery? As you mentioned previously, currently everything is
> just reset and all messages are lost. I realize most apps care more
> about speed than messages getting lost, but given our requirements
> ideally we would not want to lose a single message under any
> circumstances except for catastrophic failure.
To support all this we need to put several more pieces of the puzzle together:
* Everything described in the original post
* Broker detection + automatic loading of RabbitMQ extensions if the broker is RabbitMQ (easy to do)
* Some kind of Exchange extension mechanism that overrides #publish to add buffering and keep track of publisher confirms
Then I can see how Exchange instances can automagically keep track of what was delivered.
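The confirms-tracking piece could be sketched like this (plain Ruby model, not the amqp gem API): keep a copy of each message until the broker confirms it, so that after recovery the unconfirmed ones can be republished first, in their original order.

```ruby
# Hypothetical tracker for publisher confirms: every published message
# is remembered under a sequence number until the broker acks it.
# Whatever is still unconfirmed after a reconnect must be republished
# before any new messages, oldest first.

class ConfirmTracker
  def initialize
    @next_seq = 1
    @unconfirmed = {}   # seq number => payload (insertion-ordered)
  end

  # Record a message just before handing it to the broker;
  # returns the sequence number assigned to it
  def track(payload)
    seq = @next_seq
    @next_seq += 1
    @unconfirmed[seq] = payload
    seq
  end

  # Handle a broker ack; with multiple = true (as in basic.ack),
  # everything up to and including seq is confirmed at once
  def confirm(seq, multiple = false)
    if multiple
      @unconfirmed.delete_if { |s, _| s <= seq }
    else
      @unconfirmed.delete(seq)
    end
  end

  # Messages to republish after recovery, in original publish order
  def pending
    @unconfirmed.values
  end
end

tracker = ConfirmTracker.new
tracker.track("m1")
tracker.track("m2")
tracker.track("m3")
tracker.confirm(1)            # broker confirmed the first message only
puts tracker.pending.inspect  # => ["m2", "m3"]
```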
Removing that error callback allows the connection loss to be properly detected and recovered from. Either CONNECTION_FORCED should be treated as a connection loss error, or the on_error handler should detect this exception and not bail out.
The other issue I hit is that it does not appear to recover the callbacks defined on queue subscriptions. After the broker comes back up the bindings are there but the code in the subscribe block never gets called. Is this supposed to be handled by the automatic recovery or is an additional step required to re-establish the callback?
Consumers should be restored. Maybe there is an issue; to test deliveries, there needs to be a separate example.
I was able to reproduce the issue. Looking into what is going on.
If this is the desired behavior, it's time for some redesign on my end, and perhaps for a note in the documentation that relying on auto-deleting exclusive queues with well-known names is not a good idea if network fault tolerance is part of the app? Or am I just weird for doing this to begin with? {8'>
Paul Dlug wrote:
> Any update on the issues recovering auto-deleting queues and callbacks?

Paul,

Last time I checked, it worked. We are in the process of migrating travis-ci.org to the amqp gem;
this will be an excellent ground to try automatic recovery, and one of the largest open source
examples of amqp Ruby apps in general.

If you are still having issues, let me know and I will take a look. It is a tedious thing to test,
and I am not sure how to automate testing of it, so I rely on other people to test it with me.
Here's some code to demonstrate what I'm trying:
If I run the consumer and producer, they communicate. I then kill -9 the rabbitmq broker and restart it. As far as I can tell the bindings are restored, but the callbacks no longer exist, and I'm not sure of the best way to re-register them. I'm using :immediate and :mandatory when publishing, so I can confirm the consumer isn't restored by restarting the producer; it then starts getting messages returned with "NO_CONSUMERS".
Consumers should be re-subscribed, too. This was the primary motivation behind extracting AMQP::Consumer and switching Queue#subscribe to it. I will take a look.
I think that after a bunch of fixes from last night and today, amqp gem master no longer has these issues. The examples I tried include both explicitly named and server-named queues with 1 or 2 consumers. Publisher recovery is a harder problem, but publishers are typically structured more simply than consumer apps. So we will see how soon I can find a good enough solution for that part of the puzzle.
In the meantime, please try what's in the master (please note: it needs an unreleased amq-client version from git, too).
Any recommendations on dealing with a graceful consumer shutdown? In that case I get a "CONNECTION_FORCED" error in the on_error handler for the connection. I'd like to treat this the same as a connection loss and just go into a reconnect loop. I'm not sure if there are other connection-related exceptions the broker can throw that should be treated the same.
Can you explain what exactly causes CONNECTION_FORCED to be raised? The AMQP reference does not clarify when it may be raised.
It's raised if you try a graceful shutdown of the broker: "q()." in terminal when running in foreground, "rabbitmqctl stop", etc.
https://github.com/ruby-amqp/amqp/commit/9145984622caf23974d02528d51348cbc42788f4
Give it a try. I really want to release RC14 in the next few days.
I tested this and it works perfectly. I do think it is a bit messy for the consumer of the amqp gem to have to know to catch an error with a reply code of 320.
It would be optimal for the amqp gem to just treat this as a connection loss (since it is one).
I can't get the publisher to recover properly. When I kill -9 the rabbitmq
process and start it back up, the channel becomes useless:
channel.default_exchange.publish does nothing (and neither does
channel.queue('test').publish), and I can't register a new queue. Here is the
code:
connection.after_recovery do
  puts 'recovered'
  # Uncommenting the following two lines would make it work. This leads me
  # to believe it is a problem with the channel and not the connection.
  # @channel.close if @channel
  # @channel = AMQP::Channel.new(connection, :auto_recovery => true)
end
Automatic recovery callbacks are fired when the AMQP connection is open again. But you also need to re-open the channel (as you already figured out). Automatic recovery does that for you, and only then are queues, bindings & consumers recovered.