understanding RabbitMQ crash in production


Roie Yossef

Mar 9, 2018, 6:25:49 AM
to rabbitmq-users
Hi, we are running a 3-node cluster in production. Yesterday we had a crash that we would like to investigate.
Let's give the nodes names: rabbit1, rabbit2 and rabbit3.
All of them run on the same machine type in AWS, in the same subnet. The first node crashed yesterday at 7:35AM; the only system log I could find is a docker log:
Mar  8 07:35:33 ip-xxx-xxx-xxx-xxx dockerd[953]: time="2018-03-08T07:35:33.304139804Z" level=error msg="containerd: get exit status" error="containerd: process has not exited" id=8424f9b77fe4fb5fea3afe736bb7b468ac4ddb15ef1f223288a34cc5088bd474 pid=e86f626f6f896818a46515af4e57ed2685b134a6f29bed106d2888065b631232 systemPid=16355

At the same time rabbit2 and rabbit3 reported 'rabbit on node rabbit@rabbit1 down'.
Our monitors show that the rabbit1 machine was under load (around 90% CPU usage) before the crash.

The second node, rabbit2, crashed at 9:05AM. Before the crash we saw a CPU spike on the third node (rabbit3), and then rabbit2 crashed.
At the time of the crash the third rabbit (rabbit3) generated a crash report, but I don't really understand what it means.
Please see the attached text file.

I have no clue how to continue the investigation; I would like to get your advice here.
rabbitmq_crash_report.txt

Luke Bakken

Mar 9, 2018, 1:29:45 PM
to rabbitmq-users
Hi Roie,

What version of RabbitMQ and Erlang are you using? Do you monitor any RabbitMQ metrics? (http://www.rabbitmq.com/monitoring.html) It would be interesting to correlate message rate and other RabbitMQ metrics with the 90% CPU load event.

The log file you provided doesn't offer much insight. Is that the entire file? Is there anything different in the crash.log file, if it is present?

Thanks,
Luke

Roie Yossef

Mar 10, 2018, 6:32:41 AM
to rabbitmq-users
Hi Luke,
Thanks for your answer.
We are running RabbitMQ version 3.6.5 and Erlang version 19. I don't see any crash.log files on the machines.
I do see some more logs on the other cluster members.
rabbit1 reported this when I started it:
=CRASH REPORT==== 8-Mar-2018::16:15:36 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.172.0>
    registered_name: []
    exception exit: {{timeout_waiting_for_tables,
                         [rabbit_user,rabbit_user_permission,rabbit_vhost,
                          rabbit_durable_route,rabbit_durable_exchange,
                          rabbit_runtime_parameters,rabbit_durable_queue]},
                     {rabbit,start,[normal,[]]}}
      in function  application_master:init/4 (application_master.erl, line 134)
    ancestors: [<0.171.0>]
    messages: [{'EXIT',<0.173.0>,normal}]
    links: [<0.171.0>,<0.31.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 98
  neighbours:

It doesn't seem helpful to me.


We do collect rabbit metrics (using this Prometheus exporter: https://github.com/deadtrickster/prometheus_rabbitmq_exporter); please see the attached screenshots.
Let me know if you need more information.

Thanks.

Screen Shot 2018-03-10 at 13.31.59.png

Michael Klishin

Mar 10, 2018, 2:37:13 PM
to rabbitm...@googlegroups.com
The exception means that the node (or rather its internal database) failed to sync
critically important tables from a peer.

Prior to 3.6.7 there is only one attempt at doing so, starting with 3.6.7 there are 10.
This usually happens when nodes are restarted in arbitrary order (which 3.6.7 handles as long
as all nodes come online within 5 minutes by default: 10 attempts * 30 seconds waiting interval).
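For reference, the retry behavior described above became tunable in 3.6.7. A minimal sketch of the classic Erlang-term config file, assuming the 3.6.7+ key names `mnesia_table_loading_retry_limit` and `mnesia_table_loading_retry_timeout` (verify against the docs for your exact version):

```erlang
%% rabbitmq.config — classic Erlang-term format used by 3.6.x.
%% 10 attempts x 30000 ms = the 5-minute default window mentioned above.
[
  {rabbit, [
    {mnesia_table_loading_retry_limit,   10},
    {mnesia_table_loading_retry_timeout, 30000}  %% milliseconds per attempt
  ]}
].
```

Raising the limit or timeout buys more time for peers to come online during an arbitrary-order restart.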

This is not the root cause you are looking for or even an indication of an issue.

Anyhow, please upgrade to at least 3.6.15 ASAP:

 * http://www.rabbitmq.com/changelog.html
 * http://www.rabbitmq.com/upgrade.html

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Roie Yossef

Mar 11, 2018, 5:03:05 AM
to rabbitmq-users
Hi Michael,
We will upgrade the cluster to a new version anyway, but we really want to know the root cause of this failure.
All 3 of our nodes failed within an hour and caused downtime in production; we can't just ignore this.
Please let me know what information is required to understand the root cause.

Thanks

On Saturday, March 10, 2018 at 21:37:13 UTC+2, Michael Klishin wrote:

Michael Klishin

Mar 11, 2018, 3:55:15 PM
to rabbitm...@googlegroups.com
Full server logs.


Luke Bakken

Mar 12, 2018, 10:34:42 AM
to rabbitmq-users
Hi Roie,

As you can see from the graph you provided, message queues began to back up in your system. Without more information, we can't determine why this happened. Did the consuming applications stop or slow down? If you are running in a cloud environment, did that slow down? Or, was it something else entirely?

As Michael said, full logs from all nodes could help. The following script will collect logs as well as other important system information:


In addition, we have a document with suggestions for what sort of additional information to provide -


Thanks,
Luke

Roie Yossef

Mar 12, 2018, 11:19:39 AM
to rabbitmq-users

Hi,
We had another failure today. It seems like only one node crashed, but this time the whole instance went down,
and there was no load at all. Right after the failure only one node was healthy (based on the UI).
We have replaced the problematic instance, just in case it was causing the problems.
I'm uploading the logs you required (from two machines; I don't have the logs of the third one).

Regarding the machines, we use t2.medium instances with 2 CPU cores and 8GB of memory.
Our metrics show that machine usage is very low most of the time.


Let me know if you find anything or if you need more information.
Thanks
On Monday, March 12, 2018 at 16:34:42 UTC+2, Luke Bakken wrote:
rabbitmq-env-ip-100-65-5-154-20180312-145324.tgz
rabbitmq-env-ip-100-65-5-165-20180312-145530.tgz

Michael Klishin

Mar 12, 2018, 1:59:44 PM
to rabbitm...@googlegroups.com
The logs directory is empty — do you use a non-standard location for logs, or rotate logs frequently?
We cannot suggest much without logs. I do not see anything
that would represent "a crash" in any of the files provided.

According to `rabbitmqctl status` the node was using well below 1 GB of RAM and less than 100 socket
descriptors.

You are, however, running a version with plenty of known issues and a long deprecated stats DB
implementation plus an Erlang version that has known bugs that badly affect RabbitMQ.

See http://www.rabbitmq.com/changelog.html, http://www.rabbitmq.com/which-erlang.html, http://www.rabbitmq.com/upgrade.html.


Roie Yossef

Mar 13, 2018, 6:36:11 AM
to rabbitmq-users
See the attached logs from the machines.
Regarding the upgrade, I've noticed that it requires downtime. We cannot afford another downtime; is there any way to avoid it?

Many thanks

On Monday, March 12, 2018 at 19:59:44 UTC+2, Michael Klishin wrote:
rabbit1.log
rabbit3.log

Michael Klishin

Mar 13, 2018, 8:52:01 AM
to rabbitm...@googlegroups.com
The Upgrade guide lists all possible scenarios, including blue/green deployments.

The logs contain a few exceptions from direct Erlang client connections (most likely Shovels, since it's a plugin that's enabled according to the log)
getting messages they did not expect, e.g. in rabbit3.log around 8-Mar-2018::09:05:06.

In rabbit1.log, there is no evidence of node termination but then on boot one of the nodes
failed to contact its last known peer:

BOOT FAILED
===========

Timeout contacting cluster nodes: ['rabbit@ip-100-65-5-154',
                                   'rabbit@ip-100-65-5-19'].



    exception exit: {{timeout_waiting_for_tables,
                         [rabbit_user,rabbit_user_permission,rabbit_vhost,
                          rabbit_durable_route,rabbit_durable_exchange,
                          rabbit_runtime_parameters,rabbit_durable_queue]},
                     {rabbit,start,[normal,[]]}}
      in function  application_master:init/4 (application_master.erl, line 134)

In 3.6.5, when all nodes are restarted, it has to happen in a particular order. The log has a message that suggests as much:

> This cluster node was shut down while other nodes were still running.
> To avoid losing data, you should start the other nodes first, then
> start this one. To force this node to start, first invoke
> "rabbitmqctl force_boot". If you do so, any changes made on other
> cluster nodes after this one was shut down may be lost.

Starting with 3.6.7, a specific restart order is not required
outside of feature version upgrades.

There are also messages about inability to contact epmd, an auxiliary daemon used to resolve inter-node
communication ports:

> attempted to contact: ['rabbit@ip-100-65-5-154','rabbit@ip-100-65-5-19']
>
> rabbit@ip-100-65-5-154:
> * unable to connect to epmd (port 4369) on ip-100-65-5-154: address (cannot connect to host/port)
>
> rabbit@ip-100-65-5-19:
> * unable to connect to epmd (port 4369) on ip-100-65-5-19: address (cannot connect to host/port)

So either the peer node was rebooted/down/unreachable, or something blocked inter-node communication.
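As a quick sanity check for the epmd errors quoted above, a plain TCP connect to port 4369 from the affected machine tells you whether the peer's epmd is reachable at all. A small sketch (Python for illustration; the hostnames are the ones from the log, and you would run this from the node that failed to boot):

```python
import socket

def epmd_reachable(host, port=4369, timeout=3.0):
    """Attempt a plain TCP connection to epmd's well-known port.

    Returns True if the connect succeeds, False on refusal, timeout,
    or DNS failure (all raised as OSError subclasses).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Hostnames taken from the log excerpt above
    for host in ("ip-100-65-5-154", "ip-100-65-5-19"):
        state = "reachable" if epmd_reachable(host) else "unreachable"
        print(f"{host}: epmd {state}")
```

If epmd is reachable but clustering still fails, the next thing to check is the Erlang distribution port range (security groups often allow 4369 but block the dynamic distribution ports).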

Lastly there's a lot of

> client unexpectedly closed TCP connection

which was already mentioned above.

My best guess is that

 * The entire cluster was restarted or rebooted, intentionally or by accident
 * Since this version requires rolling restarts to be performed in a certain order, some nodes refused to start, with expected errors in the log
 * Log exceptions were incorrectly identified as the root cause
 * Assuming the operator(s) weren't aware of the node restart order requirement, random measures such as "let's restart things N times until they re-cluster" were attempted until they succeeded

The word "crash" is used by the runtime to describe unhandled exceptions and is not an indication of a node going down.

I don't see any evidence of nodes going down or any exceptions that would report any major failures.

There's also some noise in the logs about connections being forced to shut down after they could not shut down in time
(e.g. because their client never closed the connection cleanly, or due to https://github.com/rabbitmq/rabbitmq-server/issues/544, which is confusing but harmless).

We highly recommend upgrading to Erlang 19.3.6.5 or later and at least RabbitMQ 3.6.15, e.g. using the Blue/Green deployment
strategy:

http://www.rabbitmq.com/blue-green-upgrade.html
https://content.pivotal.io/blog/blue-green-application-deployments-with-rabbitmq
https://www.youtube.com/watch?v=S2oO-t-E38c



Roie Yossef

Mar 13, 2018, 9:49:00 AM
to rabbitmq-users
Hi,
Thanks for your answer.

Well, the crash was a few hours before the logs you pointed to.
My colleague tried to start the nodes in random order, but that was after the whole cluster shut down.
As I understand it, we don't have any clue what happened. The only thing I know is that no one had connected to these machines for a long time.

Regarding the client shutdowns, we have made some checks in that area as well.
We run our cluster in AWS, and the ELB is probably triggering connection resets after some idle time.

Is there anything else we can check? We will upgrade the cluster next week. Thanks for the reference to the blue/green deployment.

On Tuesday, March 13, 2018 at 14:52:01 UTC+2, Michael Klishin wrote:

Michael Klishin

Mar 13, 2018, 10:27:12 AM
to rabbitm...@googlegroups.com
Does that mean you have absolutely no logs from before and around the time of "the crash"? (What does "crash" mean here, by the way? Was there any evidence of nodes actually not running?)

In that case I can't recommend much else.


Roie Yossef

Mar 13, 2018, 11:00:58 AM
to rabbitmq-users
The logs I provided include the time of the crash; the entries you mentioned were just printed afterwards.
It all started around 07:41AM and ended with the attempts to restart the nodes in random order.

We do have evidence that the nodes were actually not running; it is confirmed by docker (no running containers).

On Tuesday, March 13, 2018 at 16:27:12 UTC+2, Michael Klishin wrote:

Michael Klishin

Mar 13, 2018, 11:51:09 AM
to rabbitm...@googlegroups.com
RabbitMQ can stop in some scenarios but it cannot possibly shut down the container it is running in. Something else did.

Michael Klishin

Mar 13, 2018, 12:29:26 PM
to rabbitm...@googlegroups.com
A workaround for proxies/load balancers that close client connections is covered in
http://www.rabbitmq.com/heartbeats.html#tcp-proxies. If clients can recover,
it's not a big deal, but the workaround avoids log noise on both ends.
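On the server side, the mechanism behind that workaround is the negotiated heartbeat interval. A sketch of the classic Erlang-term config, assuming the `heartbeat` key of the `rabbit` application (the 20 is an illustrative value, not a recommendation for this particular setup):

```erlang
%% rabbitmq.config — classic Erlang-term format used by 3.6.x.
[
  {rabbit, [
    %% Server-suggested heartbeat interval in seconds. Pick a value
    %% comfortably below the load balancer's idle timeout so the
    %% connection carries heartbeat frames before the LB resets it.
    {heartbeat, 20}
  ]}
].
```

Clients can also request a heartbeat at connection time; the lower of the two values wins during negotiation.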


Roie Yossef

Mar 13, 2018, 12:31:02 PM
to rabbitmq-users
Well, docker stopped the containers due to high load.
We will upgrade and let you know if it happens again.

Thank you very much Michael & Luke.

On Tuesday, March 13, 2018 at 17:51:09 UTC+2, Michael Klishin wrote: