In reviewing the mailing list archives, I see various threads which state that ensuring "exactly once" delivery requires deduplication by the consumer. For example the following:
"Exactly-once requires coordination between consumers, or idempotency,
even when there is just a single queue. The consumer, broker or network
may die during the transmission of the ack for a message, thus causing
retransmission of the message (which the consumer has already seen and
processed) at a later point." http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2009-July/004237.html
In the case of competing consumers which pull messages from the same queue, this will require some sort of shared state between consumers to de-duplicate messages (assuming the consumers are not idempotent).
Our application uses RabbitMQ to distribute tasks across multiple workers residing on different servers, which adds to the cost of sharing state between the workers.
Another message in the email archive mentions that "You can guarantee exactly-once delivery if you use transactions, durable queues and exchanges, and persistent messages, but only as long as any failing node eventually recovers."
As I understand it, the transaction only affects the publishing of the message into RabbitMQ and prevents the message from being queued until the transaction is committed. If that is correct, I don't understand how the transaction will prevent a duplicate message in the retransmission scenarios mentioned above. Can anybody clarify?
On a more practical level:
What's the recommended way to deal with the potential of duplicate messages?
What do people generally do?
Is this a rare enough edge case that most people just ignore it?
Thanks,
Mike
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq...@lists.rabbitmq.com
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
On Tue, Aug 03, 2010 at 04:43:56AM -0400, Mike Petrusis wrote:
> In reviewing the mailing list archives, I see various threads which state that ensuring "exactly once" delivery requires deduplication by the consumer. For example the following:
>
> "Exactly-once requires coordination between consumers, or idempotency,
> even when there is just a single queue. The consumer, broker or network
> may die during the transmission of the ack for a message, thus causing
> retransmission of the message (which the consumer has already seen and
> processed) at a later point." http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2009-July/004237.html
>
> In the case of competing consumers which pull messages from the same queue, this will require some sort of shared state between consumers to de-duplicate messages (assuming the consumers are not idempotent).
>
> Our application is using RabbitMQ to distribute tasks across multiple workers residing on different servers, this adds to the cost of sharing state between the workers.
>
> Another message in the email archive mentions that "You can guarantee exactly-once delivery if you use transactions, durable queues and exchanges, and persistent messages, but only as long as any failing node eventually recovers."
All the above is sort of wrong: you can never *guarantee* exactly once.
(There's a long-standing argument about whether receiving duplicate
messages but relying on idempotency counts as achieving exactly once. I
don't feel it does, and why should become clearer further on...)
The problem is publishers. If the server on which RabbitMQ is running
crashes after committing a transaction containing publishes, it's
possible the commit-ok message gets lost. The publishers then still
think they need to republish, so they wait until the broker comes back
up and republish. This can happen an unbounded number of times: the
publishers connect, start a transaction, publish messages, commit the
transaction, the commit-ok gets lost, and the publishers repeat the
whole process.
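That republish loop can be made concrete with a toy sketch (pure Python, no real broker; `FlakyBroker`, `publish_tx` and the message-id scheme are all illustrative assumptions, not RabbitMQ API):

```python
import uuid

class CommitOkLost(Exception):
    """Raised when the broker commits but the commit-ok reply is lost."""

class FlakyBroker:
    """Toy broker: it commits the publish, but drops the first commit-ok."""
    def __init__(self, drop_first_n_acks=1):
        self.queue = []            # messages actually stored by the broker
        self._drops = drop_first_n_acks

    def publish_tx(self, msg):
        self.queue.append(msg)     # the publish *did* take effect...
        if self._drops > 0:
            self._drops -= 1
            raise CommitOkLost()   # ...but the commit-ok never arrives

def publish_reliably(broker, body):
    """At-least-once publisher: retries until a commit-ok is seen.

    Every retry carries the same message id, so a downstream consumer
    at least has something to deduplicate on."""
    msg = {"id": str(uuid.uuid4()), "body": body}
    while True:
        try:
            broker.publish_tx(dict(msg))
            return msg["id"]       # commit-ok received, we can stop
        except CommitOkLost:
            continue               # must assume failure: republish

broker = FlakyBroker()
mid = publish_reliably(broker, "debit account 42")
# The broker now holds two copies of the same logical message,
# distinguishable only by their shared id.
assert len(broker.queue) == 2
assert broker.queue[0]["id"] == broker.queue[1]["id"] == mid
```

The point of the sketch: the duplicate is created by the *publisher* doing the only safe thing it can.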
As a result, on the consumers, you need to detect duplicates. This is
the real barrier to making all operations idempotent: you never know how
many copies of a message there will be, so you never know when it's safe
to remove entries from your dedup cache. Stores like Redis can expire
entries after a set time, which at least stops the database from eating
all the RAM in the universe, but there's still the possibility that
after an entry has been deleted, another duplicate comes along which you
now won't detect as a duplicate.
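That expiry window can be demonstrated with a small in-process cache standing in for a Redis-style expiring store (the class and clock injection are illustrative, not any particular library's API):

```python
import time

class TTLDedupCache:
    """Dedup cache whose entries expire after `ttl` seconds.

    Memory stays bounded, but a duplicate arriving after expiry is no
    longer detected -- exactly the window described above."""
    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._seen = {}   # msg_id -> expiry time

    def first_time(self, msg_id):
        now = self.clock()
        # Evict expired entries so the cache stays bounded.
        self._seen = {k: t for k, t in self._seen.items() if t > now}
        if msg_id in self._seen:
            return False              # duplicate caught within the window
        self._seen[msg_id] = now + self.ttl
        return True

# Simulated clock so the expiry window is deterministic.
t = [0.0]
cache = TTLDedupCache(ttl=60, clock=lambda: t[0])
assert cache.first_time("m1") is True    # first delivery
assert cache.first_time("m1") is False   # duplicate caught
t[0] = 61.0                              # ... 61 seconds later ...
assert cache.first_time("m1") is True    # late duplicate slips through
```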
This isn't just a problem with RabbitMQ: in any messaging system, if
any message can be lost, you cannot achieve exactly-once semantics. The
best you can hope for is a probability, with a large number of 9s, that
you will be able to detect all the duplicates. But that's the best you
can achieve.
Scaling horizontally is thus more tricky because, as you say, you may
now have multiple consumers which each receive one copy of a message.
Thus the dedup database would have to be distributed. With high message
rates, this might well become prohibitive because of the amount of
network traffic due to transactions between the consumers.
> What's the recommended way to deal with the potential of duplicate messages?
Currently, there is no "recommended" way. If you have a single consumer,
it's quite easy: something like Tokyo Cabinet should be more than
sufficiently performant. For multiple consumers, you're currently going
to have to look at some sort of distributed database.
> Is this a rare enough edge case that most people just ignore it?
No idea. But one way of making your life easier is for the producer to
send slightly different messages on every republish (they would
obviously still need the same msg id). That way, if you receive a msg
with "republish count" == 0, you know it's the first copy, so you can
insert asynchronously into your shared database and then act on the
message. You only need to query the database when you receive a msg
with "republish count" > 0 - thus you can tune your database for
inserts and hopefully save some work: the insert-only path becomes the
common case, and lookups will be exceedingly rare.
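A sketch of that optimisation (a plain set stands in for the shared distributed database; the field names are illustrative):

```python
class RepublishCountConsumer:
    """Consumer using the 'republish count' fast path described above.

    `store` is any shared set-like object (in real life a distributed
    database); here a plain set stands in for it."""
    def __init__(self, store):
        self.store = store
        self.acted_on = []

    def handle(self, msg):
        if msg["republish_count"] == 0:
            # Common case: first copy. Insert (asynchronously, in real
            # life) and act, with no lookup at all.
            self.store.add(msg["id"])
            self.acted_on.append(msg["id"])
        else:
            # Rare case: a republish. Only now do we pay for a lookup.
            if msg["id"] not in self.store:
                # The hairy case from the thread: count > 0 but no
                # entry -- the first copy's insert may still be in
                # flight, so this naive fallback can race.
                self.store.add(msg["id"])
                self.acted_on.append(msg["id"])

store = set()
c = RepublishCountConsumer(store)
c.handle({"id": "m1", "republish_count": 0})
c.handle({"id": "m1", "republish_count": 1})  # duplicate: lookup, skip
assert c.acted_on == ["m1"]
```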
The question then is: if you've received a msg with republish count > 0
but there are no entries in the database, what do you do? It shouldn't
have overtaken the first publish (though if consumers disconnected
without acking, or requeued messages, it could have), but you need some
sort of synchronisation between all the consumers to ensure none is in
the process of adding to the database - it all gets a bit hairy at this
point.
Thus if your message rate is low, you're much safer doing the insert and
select on every message. If that's too expensive, you're going to have
to think very hard indeed about how to avoid races between different
consumers thinking they're both/all responsible for acting on the same
message.
This stuff isn't easy.
Matthew
John Apps wrote:
> The JMS standard offers guaranteed once

What exactly do they mean by that? In particular, how do they deal with
duplicates? Do they report failure, or silently let a dup through in certain
situations? If you could point me to the part of the spec that sets out the JMS
resolution of these issues, that'd be really useful.
Tony
The other piece of this is time-to-live (TTL). Given a finite-length dedup
cache and message TTL, you can detect and report failure. (And if the ack
travels upstream to the publisher, you can report failures at the send end,
too.) Without the TTL, you have silent dups on rare occasions.
Tony
As an API spec, it's quite easy for JMS to mandate something apparently
impossible, without hinting at how it might actually be implemented.
Most of the spec says that the PERSISTENT delivery mode gives
"once-and-only-once" delivery. But section 4.4.13 (of JMS 1.1) admits
that there are a number of caveats to this. So it's really
"once-and-only-once-except-in-some-corner-cases".
I think the wrinkle that might prevent us saying that RabbitMQ gives the
same guarantees is on the publishing side. The caveats in JMS all seem
to apply only to the consuming side. But what happens with an AMQP
producer if the connection gets dropped before a tx.commit-ok gets back
to the client? In that case the client has to re-publish, leading to a
potential dup. This can be avoided by a de-dup filter on published
messages in the broker. I don't know if JMS brokers really go to such
lengths.
David
--
David Wragg
Staff Engineer, RabbitMQ
SpringSource, a division of VMware
For consumers, JMS has client ack mode; the application acknowledges
messages, and the server must not resend a message that has been
acknowledged.
A failure in the connection may result in the server resending a message
which the application thinks it has acknowledged. The spec suggests
"Since such clients cannot know for certain if a particular message has
been acknowledged, they must be prepared for redelivery of the last
consumed message.". I.e., the client application has to have an
idempotency barrier.
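The "idempotency barrier" the spec implies can be sketched like this (the class and its fields are illustrative; `processed` stands in for durable application state):

```python
class JmsStyleConsumer:
    """Consumer with the idempotency barrier the JMS spec implies.

    The server resends messages whose acks were lost; the application
    must be prepared to see the last consumed message again."""
    def __init__(self):
        self.processed = set()   # durable record of handled msg ids
        self.results = []

    def on_message(self, msg_id, body, redelivered):
        # The redelivered flag narrows when we need the dedup check.
        if redelivered and msg_id in self.processed:
            return               # already handled before the ack was lost
        self.processed.add(msg_id)
        self.results.append(body)

c = JmsStyleConsumer()
c.on_message("m1", "debit", redelivered=False)
# Connection drops before the ack is recorded; the server resends.
c.on_message("m1", "debit", redelivered=True)
assert c.results == ["debit"]    # the barrier absorbed the redelivery
```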
For producers, duplicate publishing is simply prohibited. As for
failure modes -- "A message that is redelivered due to session recovery
is not considered a duplicate message."
So JMS cannot magically do "exactly once" any more than anything else.
--Michael
So that seems to suggest that every message is universally unique?
If this is correct, whose responsibility is it to add GUIDs (or some
such) to every message? Does the client library do that automatically?
Matthew
JMS requires the provider to set the redelivered flag (and optionally the delivery-count field) if it thinks the message has been given to the application before. The application may or may not have seen it, but this flag can be used to trigger the application's duplicate check. The use of unique message ids helps on this end.
Tony Menges
VMware, Inc.
Ahh interesting. It would thus seem that JMS requires slightly more of
the producer when publishing messages (more logic is required in the
client library there) and AMQP possibly requires more at the consumer
side.
Matthew's idea of having producers add a "republish count" to messages is a good suggestion for optimizing the de-duplication of messages, but it only helps for messages resent by a producer.
Can messages get duplicated while they are propagating inside the broker? If duplicates are produced in the broker they will have the same "republish count" and this method won't work.
Well, a message that is sent to an exchange which then results in the
message going to several queues will obviously be duplicated. But
presumably in that case, your consumers consuming from the different
queues would be doing different tasks with the messages, hence the need
for the different queues in the first place.
That aside, no, within a queue, Rabbit does not arbitrarily duplicate
messages.
David
The way we do it in HornetQ is we have a well-defined header key
"_HQ_DUP_ID". The client can set this to some unique value of its
choice before sending (e.g. a GUID). When the server receives the
message, if the _HQ_DUP_ID header is set, it looks up the value in its
cache, and if it has seen it before it ignores the message. The cache
can optionally be persisted.
On the client side, the producer can resend the message/transaction if
it does not receive a confirmation-ok, so it effectively makes
sends/commits idempotent.
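The broker-side half of that scheme can be sketched as follows. The `_HQ_DUP_ID` header name is from the description above; the `Broker` class and its bounded, oldest-first-evicting cache are illustrative assumptions (and, as discussed later in the thread, any bound on the cache is arbitrary):

```python
from collections import OrderedDict

class DedupingBroker:
    """Broker that ignores publishes whose dup-id it has seen before."""
    def __init__(self, cache_size=1000):
        self.cache_size = cache_size
        self._dup_ids = OrderedDict()   # insertion-ordered for eviction
        self.queue = []

    def receive(self, headers, body):
        dup_id = headers.get("_HQ_DUP_ID")
        if dup_id is not None:
            if dup_id in self._dup_ids:
                return False            # duplicate: ignore silently
            self._dup_ids[dup_id] = True
            if len(self._dup_ids) > self.cache_size:
                self._dup_ids.popitem(last=False)  # evict the oldest id
        self.queue.append(body)
        return True

b = DedupingBroker()
assert b.receive({"_HQ_DUP_ID": "g1"}, "task") is True
assert b.receive({"_HQ_DUP_ID": "g1"}, "task") is False  # resend ignored
assert b.queue == ["task"]
```

This is what makes the client's blind resend-on-missing-confirmation safe: the resend either lands (first copy was lost) or is dropped (first copy arrived).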
--
Sent from my BBC Micro Model B
Tim Fox
JBoss
HornetQ - putting the buzz in messaging http://hornetq.org
http://hornetq.blogspot.com/
http://twitter.com/hornetq
irc://irc.freenode.net:6667#hornetq
f...@redhat.com
On Fri, Aug 06, 2010 at 10:43:56PM +0100, Tim Fox wrote:
> The way we do it in HornetQ is we have a well defined header key
> "_HQ_DUP_ID". The client can set this with some unique value of it's
> choice before sending (e.g. a GUID). When the server receives the
> message if the _HQ_DUP_ID header is set, it looks up the value in
> it's cache, and if it's seen it before it ignores it. The cache can
> optionally be persisted.

How do you prevent the cache from growing without bound?
Matthew
John Apps wrote:
> That's really like the piece of string question, no? Of course it can
> fill up, as can the DB where things are persisted for those cases where
> messages cannot be delivered.
> Having an unique ID in every message is not something new and not
> restricted to messaging, of course. It is simply a very good idea!
I believe Matthew was simply trying to point out that many of the
supposed guarantees of messaging systems are a lot softer than most
people think. In reality a "guarantee" is little more than an increase
in the probability that the right thing will happen. Coming clean about
that is going to be important for cloud computing to succeed - improving
the probabilities does come at a price, and for systems at massive
scales the cost/benefit calculations look quite different.
So, for example, using publisher-supplied message ids for de-duping
simply does not scale. Think what a genuine cloud messaging system would
have to do to handle the case where a producer injects the same message
first in a node in Australia and then in New York.
> What has not been touched on in this little discussion so far is the
> question of transactions
Similar considerations apply here. XA in the cloud? Hmmm.
Regards,
Matthias.
What is the problem you're thinking of? Would a setup like the following cope?
- publishers choose a message ID
- publishers choose a TTL
- receivers dedup based on message ID
- receiver's dedup buffer is expired by (some factor of) TTL
- each delivery contains an address to which the ACK should be routed
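A toy end-to-end version of that setup (every name here is illustrative; an `ack_box` list stands in for the routed ack address, and factor 2 is an arbitrary choice):

```python
import itertools

def make_message(pub_name, counter, body, ttl, ack_box):
    """Publisher side: it chooses the message id, the TTL, and the
    address (here just a list) to which acks should be routed."""
    return {"id": (pub_name, next(counter)), "ttl": ttl,
            "reply_to": ack_box, "body": body}

class Receiver:
    """Receiver dedups on message id; dedup entries are expired at
    some factor of the publisher-chosen TTL."""
    def __init__(self, clock, factor=2):
        self.clock, self.factor = clock, factor
        self._seen = {}        # msg id -> expiry time
        self.delivered = []

    def on_message(self, msg):
        now = self.clock()
        self._seen = {k: t for k, t in self._seen.items() if t > now}
        if msg["id"] not in self._seen:
            # Buffer entry outlives the message by factor * ttl, so a
            # retransmission within the TTL is always caught.
            self._seen[msg["id"]] = now + self.factor * msg["ttl"]
            self.delivered.append(msg["body"])
        msg["reply_to"].append(msg["id"])   # ack routed back either way

t = [0.0]
acks, ids = [], itertools.count()
rx = Receiver(clock=lambda: t[0])
m = make_message("pub1", ids, "task", ttl=30, ack_box=acks)
rx.on_message(m)
rx.on_message(m)                 # retransmission within the window
assert rx.delivered == ["task"]  # deduped
assert acks == [("pub1", 0), ("pub1", 0)]  # both copies acked
```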
Tony
Tony Garnock-Jones wrote:
> Would a setup like the following cope?
>
> - publishers choose a message ID
> - publishers choose a TTL
> - receivers dedup based on message ID
> - receiver's dedup buffer is expired by (some factor of) TTL
> - each delivery contains an address to which the ACK should be routed
That's end-to-end dedup you are thinking of. Nothing wrong with that,
and it doesn't require the broker to do/know anything. The context of
the discussion here was a "broker dedups publishes" feature.
Matthias.
AFAIK the normal approach with this system is to bound it arbitrarily.
Well couldn't the broker take responsibility for the delivery itself by acking?
And use the same protocol (including perhaps a fresh message ID) to relay a
message out to a receiver on the outbound leg?
Perhaps it's a distraction: I guess I was really wondering what the New York vs
Australia part had to do with it.
Tony
Tony Garnock-Jones wrote:
> Matthias Radestock wrote:
>> That's end-to-end dedup you are thinking of. Nothing wrong with that,
>> and it doesn't require the broker to do/know anything. The context of
>> the discussion here was a "broker dedups publishes" feature.
>
> Well couldn't the broker take responsibility for the delivery itself by acking?
> And use the same protocol (including perhaps a fresh message ID) to relay a
> message out to a receiver on the outbound leg?
>
> Perhaps it's a distraction: I guess I was really wondering what the New York vs
> Australia part had to do with it.
1) publisher connects to cloud; ends up connecting to Australia node
2) publisher sends message
3) connection drops, publisher didn't get ack so must resend ...
4) publisher connects to cloud; ends up connecting to NY node
5) publisher re-sends message
At that point the cloud messaging service has two copies of the same
message in different locations. For the duplication to be detected, some
information needs to flow between the two locations. Which is expensive.
Regards,
Matthias.
Aha! I see. Thank you. I had the *publishers* being in different locations in
my mind. Which is of course a different problem.
In the context in which you originally posted the example ("using
publisher-supplied message ids for de-duping simply does not scale") I suspect
that the mention of publisher-supplied message IDs is not relevant: it's the
maintenance of any kind of dedup buffer at all in more than one place at the
same time that's not scalable.
Isn't the fundamental state-synchronisation-between-server-nodes problem
independent of the choice of message ID?
Tony
Well, in my mind, if you "guarantee" something, then you can offer a
proof in some branch of maths that the property you are guaranteeing can
never be violated except in the circumstances you qualify the statement
with.
That is why, as far as I'm concerned, no messaging system, regardless of
whether it's implemented in a computer or not, can ever guarantee
exactly once semantics: I can prove (or rather, other people have
proven) that provided any individual message can be lost, you can either
achieve "at least once" semantics or "at most once" semantics.
Quantum mechanics may offer a way to avoid messages ever being lost, but
I suspect that's probably not going to be sorted out in time for
RabbitMQ 3.0.
> In reality a "guarantee" is little more than an
> increase in the probability that the right thing will happen.
Indeed. And importantly, under different circumstances, that probability
can vary wildly. Once you start combining different systems, and the
error rates compound, the probability that the whole thing works end to
end can end up looking very shaky. I suspect the insurance industry will
do well here.
> > What has not been touched on in this little discussion so far is
> > the question of transactions. .. Updating DBs from queues and
> > posting the results of those updates to queues should be atomic; and
> > if I want my debit/credit to happen once rather than many times or
> > not at all, then a combination of transactions and guaranteed
> > delivery becomes very attractive both to the designer and the
> > developers. Yes, ACID comes to mind here...and it is indeed what I
> > am referring to.
Distributed transactions rely on 2-phase or 3-phase commit or some
variant thereof (e.g. there's an improved 3-phase protocol built on
Paxos). Whilst the protocols themselves are sound, my understanding is
(and please correct me if I'm wrong) that they all rely on the property
that, once everyone has agreed there is no problem committing the
transaction, no participant will renege on that promise. But of course,
no participant can actually guarantee that. Thus distributed
transactions can never guarantee ACID. All you can get is an increasing
probability that the outcome will be atomic, consistent and isolated;
you can never guarantee it.
Matthew
AFAICT, the normal way this plays out is that people seek to provide
>=1 semantics for publisher-broker, and <=1 semantics for
broker-consumer. Which for most people is 'good enough'.
> Quantum mechanics may offer a way to avoid messages ever being lost, but
> I suspect that's probably not going to be sorted out in time for
> RabbitMQ 3.0.
This would be renamed QbitMQ. Delivery would be determined by opening
a box. No bunnies would be harmed in this experiment.
>> Quantum mechanics may offer a way to avoid messages ever being lost, but
>> I suspect that's probably not going to be sorted out in time for
>> RabbitMQ 3.0.
>
> This would be renamed QbitMQ. Delivery would be determined by opening
> a box. No bunnies would be harmed in this experiment.
Even better:
If a message is missing, try to guess what the sender intended to
send. Deliver that to the receiver. If it turns out later on that the
guess was incorrect, cancel the transaction. An additional advantage is
that you can get negative latencies this way.
Martin
On Mon, Aug 9, 2010 at 9:41 AM, John Apps <john...@gmail.com> wrote:
>
> It is good to see humour in discussions of this nature; it would be even
> better if those implementing the applications were to share the same humour!
> I suspect the world of open source is at times a different one to that which
> I seem to work in. Oh well, back to the drawing board.
John we take this stuff as seriously as anyone. That's why we do it
professionally.
We've often found customers whose requirements include "please break
the laws of physics, and cure cancer". I'm sure you know what I mean.
If you could advise us on how to best help in such cases, we're all
ears.
In the meantime we are stuck in a world where "guaranteed" has no fixed
connotation.
alexis
In HornetQ we also provide an interface above and beyond JMS, which
allows the user to receive an *asynchronous* ack that the message they
sent (or tx commit) has been received ok on the server so they can clear
it from their local resend cache. Since this is async it's not limited
by network latency as in the blocking JMS case. The downside is many
messages can be in-flight at any time per producer so the caches need to
be larger.
To do all of this without being limited by an arbitrary cache size
would need some kind of "ack of ack" (we don't implement this yet):
1) the client sends a message to the server, 2) the server sends an ack
back to the client to say "received-ok", 3) the client sends a further
ack to the server acknowledging the received-ok. At point 2) the client
can clear its resend cache. At point 3) the server can clear its cache.
I believe AMQP 1.0 specifies something similar to this too (?)
So.. this could scale. You'd have a further buffer per producer on the
server side. If you're using TCP on the server, every connection will
have its own buffer anyway. The extra buffer per producer should be of
the same order of size as the TCP buffer, since it's effectively
defined by a window, similar to the TCP window size.
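The three steps can be sketched as a synchronous toy (the class names and the direct method calls are illustrative; in reality each arrow is a network message that can itself be lost, leaving an entry stuck in a cache):

```python
class Client:
    """Client side of the ack-of-ack scheme: holds a resend cache
    until the server confirms receipt."""
    def __init__(self):
        self.resend_cache = {}

    def send(self, server, msg_id, body):
        self.resend_cache[msg_id] = body    # step 1: keep until acked
        server.receive(self, msg_id, body)

    def on_ack(self, server, msg_id):
        self.resend_cache.pop(msg_id, None) # step 2: clear client cache
        server.on_ack_of_ack(msg_id)        # step 3: tell the server

class Server:
    """Server side: holds a dedup cache until the client acks the ack."""
    def __init__(self):
        self.dedup_cache = set()
        self.queue = []

    def receive(self, client, msg_id, body):
        if msg_id not in self.dedup_cache:  # drop duplicate resends
            self.dedup_cache.add(msg_id)
            self.queue.append(body)
        client.on_ack(self, msg_id)         # step 2: "received-ok"

    def on_ack_of_ack(self, msg_id):
        self.dedup_cache.discard(msg_id)    # step 3: server cache bounded

cl, sv = Client(), Server()
cl.send(sv, "m1", "task")
assert cl.resend_cache == {} and sv.dedup_cache == set()
assert sv.queue == ["task"]
```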
Like others have said, 100% once-and-only-once delivery doesn't happen.
To get very near to 100% you can implement stuff like the above, and
also make sure your storage is highly redundant and mirrored in
geographically distributed sites in case the building blows up. Then
you've got bugs in your own app, device drivers or the operating system
that screw your once-and-only-once (for example, last week I hit a bug
in the Linux kernel TCP implementation which can cause packets to be
lost at high load) that you might well hit before having to take
quantum effects into account :)
Tim
That's an occasionally-useful optimization, but it really only works
well with message UUIDs (i.e. never repeating a message ID). Even then,
in the general case you're likely to want to bound the time you wait
for the message transmission to complete anyway (consider what happens
when an ack or ack-ack goes missing), so it seems simpler (not to
mention much less chatty) to me to avoid ack-acks and go with a
Delta-t style timeout-bounded buffer. At that point you're free to
choose any kind of message ID space, including compressible
fixed-length reusable spaces like those in TCP/SCTP etc.
> Like others have said, 100% once and only once delivery doesn't happen.
> To get very near at 100% you can implement stuff like the above, and
> also make sure your storage is highly redundant, also mirrored in
> geographically distributed sites in case the building blows up. Then
> you've got bugs in your own app, device drivers or the operating system
> that screw your once and only once (for example last week I hit a bug in
> the Linux kernel TCP impl which can cause packets to be lost at high
> load) that you might hit well before having to take quantum effects into
> account :)
Amen :-)
Tony
> We've often found customers whose requirements include "please break
> the laws of physics, and cure cancer". I'm sure you know what I mean.
> If you could advise us on how to best help in such cases, we're all
> ears.
>
> In the meantime we are stuck in world where "guaranteed" has no fixed
> connotation.
Yes. People tend to have unrealistic expectations.
It reminds me of requirements for "security". Yes, an application
can be made more resistant to attacks; however, "security" is not a
purely technical issue. It's a technical and _operational_ issue: to
achieve "security" you have to use good software _and_ introduce a sane
security policy in your organisation.
Same with guaranteed delivery. While applications can do all kinds of
tricks to improve reliability, at some point you'll have to load the
data onto a USB key and ride across the country to deliver it.
So the goal of designing "guaranteed delivery" is twofold, IMO. First,
it's doing all the software tweaking necessary to get as many nines of
reliability as possible. Second, it's promoting sane organisational
patterns (patterns that would prevent message loss or duplication).
Let me give a simple example:
1. Messages are confirmed by simple ack. There's no replay
functionality. This guarantees no duplicates.
2. Sender has a timeout for getting an ack. When the ack does not
arrive, the message is moved to dead letter queue. (Note that messages
in the DLQ are "dubious", i.e. they may have been delivered or not, we
don't know).
3. There's a person in the organisation responsible for the dead letter
queue. He goes through the queue once a day and tries to find out what
the actual state of the dubious messages is (using personal
conversation, phone, checking the production system etc.). Delivered
messages can simply be deleted from the DLQ. Lost messages can be
resubmitted in different ways (doing a new transaction, reading the
details of the transaction over the phone etc.)
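The sender side of steps 1-3 can be sketched as follows (class and field names are illustrative; the "person" is whoever later drains the `dlq` list):

```python
class DlqSender:
    """Sender in the no-replay scheme above: a message that isn't
    acked within the timeout is moved to a dead letter queue as
    'dubious' -- it may or may not have been delivered."""
    def __init__(self, clock, ack_timeout):
        self.clock = clock
        self.ack_timeout = ack_timeout
        self.pending = {}   # msg_id -> (ack deadline, body)
        self.dlq = []       # dubious messages for a human to resolve

    def send(self, msg_id, body):
        self.pending[msg_id] = (self.clock() + self.ack_timeout, body)

    def on_ack(self, msg_id):
        self.pending.pop(msg_id, None)   # delivered for sure

    def sweep(self):
        now = self.clock()
        for msg_id, (deadline, body) in list(self.pending.items()):
            if now >= deadline:
                # No automatic replay, so no duplicates -- but we no
                # longer know whether this message arrived.
                self.dlq.append((msg_id, body))
                del self.pending[msg_id]

t = [0.0]
s = DlqSender(clock=lambda: t[0], ack_timeout=30)
s.send("m1", "credit 100"); s.send("m2", "debit 100")
s.on_ack("m1")              # m1 definitely delivered
t[0] = 31.0
s.sweep()
assert s.dlq == [("m2", "debit 100")]   # m2 is now dubious
assert s.pending == {}
```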
My 2c.
Martin