Rabbitmq persistence losing messages


Alexey Makarov

Oct 10, 2014, 8:31:37 AM
to rabbitm...@googlegroups.com
Hi all, I've set up a cluster with 3 nodes (1 RAM, 2 disc). In front of it I have an HAProxy LB. I create a vhost with some name; when I send messages and then turn off one node, some messages get lost; then I turn off a second node, and more are lost. Why? I'm using mirrored queues. And if I turn off every node and then start them again, there are no messages at all. The node settings are standard; I only changed vm_memory_high_watermark to use more RAM. Thanks

Michael Klishin

Oct 10, 2014, 8:36:12 AM
to rabbitm...@googlegroups.com, Alexey Makarov
One has to ask a few questions before claiming that RabbitMQ is "losing messages":

 * Are the queues durable?
 * Do you publish messages as persistent?
 * Are your queues actually mirrored? What are your policy pattern and queue names?
 * Do you see node down/promotion messages in the logs?
 * How exactly do you verify that "messages are lost"? Did you try the same test with a single node that is shut down and started again? 
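As an aside, the first three points can be sketched with the Python pika client (a sketch only, not the thread's actual JMeter setup; the host and queue name are placeholders):

```python
# Sketch of a publisher set up to survive node failure, using the
# Python "pika" client. Host and queue name are placeholders.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
ch = conn.channel()

# 1. Durable queue: the queue definition survives a broker restart.
ch.queue_declare(queue="jmeterQueue", durable=True)

# 2. Publisher confirms: the broker acknowledges the publish only after
#    it has taken responsibility for the message.
ch.confirm_delivery()

# 3. Persistent message: delivery_mode=2 writes it to the message store,
#    so it survives a restart of a node hosting a durable queue.
ch.basic_publish(
    exchange="",
    routing_key="jmeterQueue",
    body=b"hello",
    properties=pika.BasicProperties(delivery_mode=2),
)
conn.close()
```

If any one of the three is missing, messages can legitimately disappear on node shutdown without RabbitMQ "losing" anything.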
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Alexey Makarov

Oct 10, 2014, 8:54:24 AM
to rabbitm...@googlegroups.com
1. Yes, the queues are durable.
2. Yes, the messages are persistent.
3. I'm testing with JMeter, so the queue names are like "jmeterQueue", "test", "queue", and so on. Here is the policy (in the management console I see the "ha-all" policy next to the queue name):
set_policy -p vhost ha-all '.*' '{"ha-mode": "all", "ha-sync-mode": "automatic"}'
Maybe something is wrong with the policy pattern? I actually need a pattern that matches all queue names.

4. How do I test loss? I just turn off nodes with the init script.

5. I checked ../../mnesia/rabbit@hostname/msg_store_persistent/; there were many files, and after the node went down and came back up, they disappeared from that directory.

Michael Klishin

Oct 10, 2014, 9:03:42 AM
to rabbitm...@googlegroups.com, Alexey Makarov
On 10 October 2014 at 16:54:30, Alexey Makarov (zip...@gmail.com) wrote:
> 4. How do I test loss? I just turn off nodes with the init script.

This is not how you test, this is how you shut down a node.

Please provide a test script or similar, or at least describe your methodology.

Do you use publisher confirms, for example? If you don't then your client will consider
a message "delivered" not only before it reaches all mirrors but also before it reaches
the node you are connected to.

More details, please. This is a technical mailing list.

> 5. I checked ../../mnesia/rabbit@hostname/msg_store_persistent/;
> there were many files, and after the node went down and came back up,
> they disappeared from that directory.

OK, so this was worth mentioning from the start. One reason may be that the queues aren't
actually durable. 

What about node down and related messages in the log?

Alexey Makarov

Oct 10, 2014, 9:21:26 AM
to rabbitm...@googlegroups.com
I just need to build a RabbitMQ HA cluster to prevent message loss. Other guys are writing the client in Java, with publisher confirms enabled. So I only installed RabbitMQ on three separate machines, clustered the nodes with "join_cluster", and created the policy (which I wrote before) on each node. I am new to RabbitMQ, so I don't know how to test HA. I just pushed persistent messages to a queue with JMeter, then turned nodes off (with the init script or with kill), and then checked the messages in the web console and in the mnesia dir. That's all. In the web console, in the queue's parameters, I see a "D" flag, which means durable: true. Sorry for my English, I'm Russian.

Michael Klishin

Oct 10, 2014, 9:34:05 AM
to rabbitm...@googlegroups.com, Alexey Makarov


On 10 October 2014 at 17:21:31, Alexey Makarov (zip...@gmail.com) wrote:
> Other guys are writing the client in Java, with publisher confirms
> enabled. So I only installed RabbitMQ on three separate machines,
> clustered the nodes with "join_cluster", and created the policy
> (which I wrote before) on each node.

> I just pushed persistent messages to a queue with JMeter, then turned
> nodes off (with the init script or with kill), and then checked the
> messages in the web console and in the mnesia dir. That's all. In the
> web console, in the queue's parameters, I see a "D" flag, which means
> durable: true.

Alexey,

We understand that you don't want to get into the details of how RabbitMQ works
but to suggest anything we need to form a hypothesis, try to [dis]prove it,
form another one, and so on. Guessing is not particularly effective when trying to
understand why a distributed system behaves the way it does. 

So far we have eliminated a couple of possible causes:

 * Messages are published as transient [per your own words]
 * Queues are not durable [there is a D flag in the mgmt UI]

and in addition we know that

 * The message store on the node that was shut down is fully purged after being restarted.

What isn't clear (or is entirely unknown) yet:

 * Whether the queues have the policy applied (it seems like they should, but I'd prefer to see a management UI screenshot and `rabbitmqctl list_queues` output instead of guessing)
 * Do you see messages enqueued before node shutdown?
 * Does your test code use publisher confirms? Not the other guys' code, the code that you use.
 * Do you see any log messages related to nodes being detected as down, promoted, and so on?
   Note that this is the 3rd time I'm asking about the log files.
 * Do you see the queues with 0 messages in them after a node restart, or do you not even see the queues?

Alexey Makarov

Oct 10, 2014, 9:58:25 AM
to rabbitm...@googlegroups.com
Why transient, if the mnesia files are in the persistent folder, not the transient one?

It's strange, but list_queues shows nothing.
Only -> Listing queues ...done.
Here is a screenshot from the web UI:
http://s35-temporary-files.radikal.ru/24bc8913cc384dafb339294320fa2066/-929206895.png

Yes, I published messages and saw them before I shut down the nodes.
I'm not using code, I'm using the JMeter tool to send requests to the queues.
I only see
=INFO REPORT==== 10-Oct-2014::16:55:53 ===
stopped TCP Listener on 0.0.0.0:5672
when I turn off a node.
Yes, I see my queues in the web console, with 0 messages, after restarting the node.

Michael Klishin

Oct 10, 2014, 10:08:33 AM
to rabbitm...@googlegroups.com, Alexey Makarov


On 10 October 2014 at 17:58:31, Alexey Makarov (zip...@gmail.com) wrote:
> Why transient, if the mnesia files are in the persistent folder,
> not the transient one?

I was referring to messages being published as transient. Why? Because we routinely see
similar questions posted and the issues typically come down to

 * Messages published as transient (non-persistent)
 * Queues not being durable
 * Policy(-ies) not being applied

> It's strange, but list_queues shows nothing.
> Only -> Listing queues ...done.

What node (01 or 03) do you run this against?
According to this screenshot you have over 1M messages and a sync in process.
Is this before or after node shutdown?
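(One thing worth checking here, as an aside: `rabbitmqctl list_queues` only lists queues in the default vhost `/` unless `-p` is given, which would explain an empty listing when the queues live in a custom vhost. `vhost` below is a placeholder:)

```shell
# list_queues defaults to the "/" vhost; pass -p for a custom one.
# The extra columns show the properties under discussion in this thread.
rabbitmqctl -p vhost list_queues name durable policy messages
```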

> I'm not using code, I'm using the JMeter tool to send requests to the queues.

Oh, come on. My point is that your publishing code does something, and we need to understand
what exactly it does. If we don't, how can we possibly form any hypothesis and recommend
anything?

Please investigate if JMeter indeed publishes messages as persistent (delivery_mode = 2).
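(One way to check, as an aside: peek at a queued message and look at its properties. This sketch assumes the management plugin's `rabbitmqadmin` tool is available and uses placeholder vhost/queue names; check `rabbitmqadmin --help` for the exact flags in your version:)

```shell
# Fetch one message without consuming it; the printed properties
# should include delivery_mode, which is 2 for persistent messages.
rabbitmqadmin -V vhost get queue=jmeterQueue requeue=true
```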

> I only see =INFO REPORT==== 10-Oct-2014::16:55:53 ===
> stopped TCP Listener on 0.0.0.0:5672
> when I turn off a node.

Yes, on the node that you are shutting down, you won't see anything beyond that. This is
quite obvious.

What's in the logs of _other_ nodes?

According to the screenshot you have 3 nodes. We can label them N1, N2, and N3, for example.

What node do you run JMeter against (and later shut down)? What node do you connect to to access
mgmt UI? Can you post a node list page screenshot before and after shutdown? 

Michael Klishin

Oct 10, 2014, 10:16:06 AM
to rabbitm...@googlegroups.com, Alexey Makarov


On 10 October 2014 at 18:08:31, Michael Klishin (mic...@rabbitmq.com) wrote:
> I was referring to messages being published as transient. Why?
> Because we routinely see similar questions posted and the issues
> typically come down to
>
>  * Messages published as transient (non-persistent)
>  * Queues not being durable
>  * Policy(-ies) not being applied

Someone suggests you may be running into the scenario described in
http://next.rabbitmq.com/ha.html#cluster-shutdown

(note that these are 3.4.0 docs, which is an unreleased version).

Before 3.4.0 is out, you need to make sure that mirrors are synchronised
before shutting down master in this experiment.

With over 1M messages, sync can take some time but after that, a published message
that is routed to mirrored queue(s) will be delivered to the mirror(s) before RabbitMQ
confirms the publish.

In other words, normally mirrors will be in sync. 

Michael Klishin

Oct 10, 2014, 10:19:28 AM
to rabbitm...@googlegroups.com, Alexey Makarov
On 10 October 2014 at 18:16:03, Michael Klishin (mic...@rabbitmq.com) wrote:
> Before 3.4.0 is out, you need to make sure that mirrors are synchronised
> before shutting down master in this experiment.
>
> With over 1M messages, sync can take some time but after that, a published
> message that is routed to mirrored queue(s) will be delivered to the mirror(s)
> before RabbitMQ confirms the publish.
>
> In other words, normally mirrors will be in sync.

Correction: mirrors only get out of sync if you shut them down (or they fail),
so if we understand the steps in your test correctly, you simply need to give
the mirrors some time to sync before running the test again.

Otherwise, unsynchronised mirrors won't be promoted.
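The sync state can be inspected, and a sync triggered by hand instead of waiting, with rabbitmqctl (a sketch; the vhost and queue names are placeholders):

```shell
# Show each queue's mirrors and which of them are synchronised.
rabbitmqctl -p vhost list_queues name slave_pids synchronised_slave_pids

# Explicitly synchronise one mirrored queue.
rabbitmqctl -p vhost sync_queue jmeterQueue
```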

Alexey Makarov

Oct 10, 2014, 10:24:10 AM
to rabbitm...@googlegroups.com
The screenshot is from before I took the node down.
JMeter has a "Persistent" option in the UI, which can be turned on.
In the logs of the other nodes when I turn off a node I see
=INFO REPORT==== 10-Oct-2014::17:14:38 ===
node 'rabbit@mw-rabbitmq-02' down: connection_closed


> What node do you run JMeter against (and later shut down)? What node do you connect to to access
> mgmt UI? Can you post a node list page screenshot before and after shutdown?
I'm running JMeter against the N1 node. (I shut down N2, some messages get lost; I bring the node up, then shut down N3, and more messages get lost.)
I can access the UI from all three machines. If I turn off N2, I access the UI from N1; if I turn off N1 and N3, I can see the UI from N2.

Michael Klishin

Oct 10, 2014, 12:00:23 PM
to rabbitm...@googlegroups.com, Alexey Makarov
On 10 October 2014 at 18:24:15, Alexey Makarov (zip...@gmail.com) wrote:
> Someone suggests you may be running into the scenario described in
> http://next.rabbitmq.com/ha.html#cluster-shutdown
>
> (note that these are 3.4.0 docs, which is an unreleased version).
>
> Before 3.4.0 is out, you need to make sure that mirrors are synchronised
> before shutting down master in this experiment.
>
> With over 1M messages, sync can take some time but after that, a published
> message that is routed to mirrored queue(s) will be delivered to the mirror(s)
> before RabbitMQ confirms the publish.

Right, so I performed the following steps:

1. Set up a 2 node cluster, rabbit and rabbit2.
# deliberately not using automatic sync in step 2
2. Apply a policy: set_policy ha-all "^ha\." '{"ha-mode":"all"}'
3. Connect to rabbit, publish 100K messages that route to a queue named "ha.q1"
4. Manually perform sync to rabbit2
5. Shut down rabbit
6. See rabbit2 elected master for ha.q1
7. See ha.q1 still have 100K messages
8. Bring rabbit back
9. See rabbit become a mirror of ha.q1, sync it
10. Shut down rabbit2
11. Check that ha.q1 still has 100K messages
12. Bring rabbit2 back
13. See rabbit2 become a mirror of ha.q1, unsynchronised

Since rabbit2 is not unsynchronised, it will not be elected master should rabbit shut down.

Let's try it:

14. Shut down rabbit
15. See ha.q1 master move to rabbit2 and have 0 messages
16. Bring back rabbit
17. Publish 100K messages that are routed to ha.q1
18. ha.q1 now has 100K messages, master is on rabbit2, mirror on rabbit
19. Ensure rabbit is synchronised
20. Shut down rabbit2
21. ha.q1 master is now on rabbit, with 100K messages enqueued
22. Bring rabbit2 back
23. ha.q1 master is on rabbit, with 1 mirror, still with 100K messages
 
I think this is enough evidence to suggest that you are indeed running into what's described in
http://next.rabbitmq.com/ha.html#cluster-shutdown and need to give slaves
some time to sync before shutting down master.

With 1.1M messages it takes several minutes on my 1-2 year old Core i7 machine with an SSD.

Michael Klishin

Oct 12, 2014, 4:14:01 AM
to rabbitm...@googlegroups.com, Alexey Makarov
On 10 October 2014 at 20:00:21, Michael Klishin (mic...@rabbitmq.com) wrote:
> Since rabbit2 is not unsynchronised

This should read: "not synchronised".

Alexey Makarov

Oct 13, 2014, 1:34:46 AM
to rabbitm...@googlegroups.com
Thanks for the answers and the test. I made some changes, and after restarting a node, no messages are lost. Thanks again.

Alexey Makarov

Oct 13, 2014, 10:06:44 AM
to rabbitm...@googlegroups.com
Hmm, is something strange going on, or is this normal? I'm using HAProxy in front of the cluster, and I see strange messages in the RabbitMQ logs. I've read on the internet that this is HAProxy checking the availability of RabbitMQ? Can we ignore these log messages, or is something going wrong? The RabbitMQ heartbeat is at its default.

=INFO REPORT==== 13-Oct-2014::16:58:48 ===
accepting AMQP connection <0.9030.0> (ip haproxy -> ip node:5672)

=WARNING REPORT==== 13-Oct-2014::16:58:48 ===
closing AMQP connection <0.9030.0> (ip haproxy -> ip node:5672):
connection_closed_abruptly

Here is the config of the LB:
global
    log 127.0.0.1  local0
    log 127.0.0.1  local1 notice
    stats socket /var/run/haproxy.stat mode 600 level admin
    stats timeout 30s
    maxconn 10000
    user haproxy
    group haproxy
    daemon

defaults
    log     global
    mode    http
    option  tcplog
    option  dontlognull
    option  redispatch
    retries 3
    timeout connect 5s
    timeout client  3m
    timeout server  3m

frontend localnodes
    bind *:5671
    mode tcp
    option tcplog
    log global
    default_backend rabbit

backend rabbit
    mode tcp
    option tcplog
    balance roundrobin
    server rabbitmq-01 ip:5672 check
    server rabbitmq-02 ip:5672 check
    server rabbitmq-03 ip:5672 check

listen stats *:8085
    mode http
    balance roundrobin
    option httpchk HEAD / HTTP/1.0
    option httpclose
    option forwardfor
    option httpchk OPTIONS /health_check.html

Simon MacMullen

Oct 13, 2014, 10:39:28 AM
to Alexey Makarov, rabbitm...@googlegroups.com
On 13/10/14 15:06, Alexey Makarov wrote:
> Hmm something strange or it's normal? I see in logs
> Using haproxy in front of cluster, and in rabbit logs strange messages.
> I've read on the internet that this is haproxy checking availability of
> rabbitmq?

That's correct.

HAProxy is opening a TCP connection to port 5672 to check the server is
up. But from RabbitMQ's perspective that's not a valid AMQP connection
because it closes immediately, so we log a warning.

> Can we ignore this log messages, or something going wrong?

Yes. You can silence it with:

[{rabbit, [{log_levels, [{connection, error}]}]}].

Cheers, Simon

Alexey Makarov

Oct 13, 2014, 10:42:59 AM
to rabbitm...@googlegroups.com
Thanks a lot

Alexey Makarov

Oct 14, 2014, 2:31:32 AM
to rabbitm...@googlegroups.com
I now see the same log messages when a client connects to the node directly, without HAProxy. Why is that, and what does it mean? Thanks
=INFO REPORT==== 14-Oct-2014::09:29:16 ===
accepting AMQP connection <0.22968.0> (client ip:42323 -> node ip:5672)

=WARNING REPORT==== 14-Oct-2014::09:29:16 ===
closing AMQP connection <0.22968.0> (client ip:42323 -> node ip:5672):
connection_closed_abruptly


Michael Klishin

Oct 14, 2014, 4:21:59 AM
to rabbitm...@googlegroups.com, Alexey Makarov
 On 14 October 2014 at 10:31:44, Alexey Makarov (zip...@gmail.com) wrote:
> The same log messages i see now, when client connects to the node,
> without haproxy. Why so, and what does it mean?

Alexey,

Would you mind starting new threads for new questions instead of hijacking
the existing (and completely unrelated) ones?

Thanks.