The rabbitmq pod continually restarts due to rabbit_disk_monitor terminating


杜昱萱

Feb 15, 2019, 5:12:04 AM
to rabbitmq-users
There are 2 RabbitMQ pods in Kubernetes. They continually restart after being deployed.

$ kubectl get pods --watch
NAME              READY   STATUS    RESTARTS   AGE
crmq-rabbitmq-0   1/1     Running   0          1h
crmq-rabbitmq-1   1/1     Running   0          1h
crmq-rabbitmq-0   0/1     Running   0          1h
crmq-rabbitmq-0   0/1     Running   1          1h
crmq-rabbitmq-0   1/1     Running   1          1h

From the log file, I can see that rabbit_disk_monitor is terminating.

2019-02-15 08:29:20.020 [error] <0.211.0> ** Generic server rabbit_disk_monitor terminating
** Last message in was update
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@crmq-rabbitmq-0",50000000,8259809280,100,10000,#Ref<0.739307764.2369781761.195649>,false,true,10,120000}
** Reason for termination ==
** {eagain,[{erlang,open_port,[{spawn,"/bin/sh -s unix:cmd"},[binary,stderr_to_stdout,stream,in,hide,out]],[]},{os,cmd,1,[{file,"os.erl"},{line,239}]},{rabbit_disk_monitor,get_disk_free,2,[{file,"src/rabbit_disk_monitor.erl"},{line,220}]},{rabbit_disk_monitor,internal_update,1,[{file,"src/rabbit_disk_monitor.erl"},{line,197}]},{rabbit_disk_monitor,handle_info,2,[{file,"src/rabbit_disk_monitor.erl"},{line,169}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,686}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
2019-02-15 08:29:20.020 [error] <0.211.0> CRASH REPORT Process rabbit_disk_monitor with 0 neighbours crashed with reason: {eagain,[{erlang,open_port,[{spawn,"/bin/sh -s unix:cmd"},[binary,stderr_to_stdout,stream,in,hide,out]],[]},{os,cmd,1,[{file,"os.erl"},{line,239}]},{rabbit_disk_monitor,get_disk_free,2,[{file,"src/rabbit_disk_monitor.erl"},{line,220}]},{rabbit_disk_monitor,internal_update,1,[{file,"src/rabbit_disk_monitor.erl"},{line,197}]},{rabbit_disk_monitor,handle_info,2,[{file,"src/rabbit_disk_monitor.erl"},{line,169}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]},{gen_server,handle_msg,...},...]}
2019-02-15 08:29:20.021 [error] <0.210.0> Supervisor rabbit_disk_monitor_sup had child rabbit_disk_monitor started with rabbit_disk_monitor:start_link(50000000) at <0.211.0> exit with reason {eagain,[{erlang,open_port,[{spawn,"/bin/sh -s unix:cmd"},[binary,stderr_to_stdout,stream,in,hide,out]],[]},{os,cmd,1,[{file,"os.erl"},{line,239}]},{rabbit_disk_monitor,get_disk_free,2,[{file,"src/rabbit_disk_monitor.erl"},{line,220}]},{rabbit_disk_monitor,internal_update,1,[{file,"src/rabbit_disk_monitor.erl"},{line,197}]},{rabbit_disk_monitor,handle_info,2,[{file,"src/rabbit_disk_monitor.erl"},{line,169}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]},{gen_server,handle_msg,...},...]} in context child_terminated
2019-02-15 08:29:20.022 [info] <0.6260.0> FORMAT ERROR: "Free disk space monitor encountered an error (e.g. failed to parse output from OS tools): ~p, retries left: ~s~n" [{{'EXIT',{eagain,[{erlang,open_port,[{spawn,"/bin/sh -s unix:cmd"},[binary,stderr_to_stdout,stream,in,hide,out]],[]},{os,cmd,1,[{file,"os.erl"},{line,239}]},{rabbit_disk_monitor,get_disk_free,2,[{file,"src/rabbit_disk_monitor.erl"},{line,220}]},{rabbit_disk_monitor,enable,1,[{file,"src/rabbit_disk_monitor.erl"},{line,273}]},{rabbit_disk_monitor,init,1,[{file,"src/rabbit_disk_monitor.erl"},{line,132}]},{gen_server,init_it,2,[{file,"gen_server.erl"},{line,365}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,333}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}},16335470592},10]

Michael Klishin

Feb 15, 2019, 11:06:11 AM
to rabbitm...@googlegroups.com
The node tries to start a subprocess that uses df (on Linux) and that fails
with EAGAIN.

I don't think I've seen that before. The disk monitor is an optional feature (without it, disk-based alarms won't work, but that's it),
so I'm not sure why that would lead to pod restarts. How are your checks configured?
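
One way to check is to dump the probe definitions from the StatefulSet (a sketch; crmq-rabbitmq as the StatefulSet name is an assumption based on the pod names):

$ kubectl get statefulset crmq-rabbitmq -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'
$ kubectl get statefulset crmq-rabbitmq -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'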



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

杜昱萱

Feb 18, 2019, 2:14:07 AM
to rabbitmq-users
Thank you!
Is there any configuration related to the disk monitor process?
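
For reference, the only disk monitor setting I can find is the free disk space limit; the 50000000 in the crash log looks like the 50MB default. A minimal sketch, assuming the new-style rabbitmq.conf format:

# rabbitmq.conf
disk_free_limit.absolute = 50MB
# or, relative to total RAM:
# disk_free_limit.relative = 1.0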

The pod restarts because running the rabbitmqctl command fails:

Warning Unhealthy 6m (x6 over 7m) kubelet, ip-10-10-13-166.us-east-2.compute.internal Liveness probe failed: runuser: failed to execute /usr/lib/rabbitmq/bin/rabbitmqctl: Resource temporarily unavailable
Warning Unhealthy 6m kubelet, ip-10-10-13-166.us-east-2.compute.internal Readiness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:262: starting container process caused "process_linux.go:81: executing setns process caused \"exit status 15\""
Warning Unhealthy 6m kubelet, ip-10-10-13-166.us-east-2.compute.internal Readiness probe failed: Failed to create aux thread

This is the crash error log:
2019-02-15 08:29:20.020 [error] <0.211.0> ** Generic server rabbit_disk_monitor terminating 
** Last message in was update
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@crmq-rabbitmq-0",50000000,8259809280,100,10000,#Ref<0.739307764.2369781761.195649>,false,true,10,120000}
** Reason for termination == 
** {eagain,[{erlang,open_port,[{spawn,"/bin/sh -s unix:cmd"},[binary,stderr_to_stdout,stream,in,hide,out]],[]},{os,cmd,1,[{file,"os.erl"},{line,239}]},{rabbit_disk_monitor,get_disk_free,2,[{file,"src/rabbit_disk_monitor.erl"},{line,220}]},{rabbit_disk_monitor,internal_update,1,[{file,"src/rabbit_disk_monitor.erl"},{line,197}]},{rabbit_disk_monitor,handle_info,2,[{file,"src/rabbit_disk_monitor.erl"},{line,169}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,686}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
2019-02-15 08:29:20.020 [error] <0.211.0> CRASH REPORT Process rabbit_disk_monitor with 0 neighbours crashed with reason: {eagain,[{erlang,open_port,[{spawn,"/bin/sh -s unix:cmd"},[binary,stderr_to_stdout,stream,in,hide,out]],[]},{os,cmd,1,[{file,"os.erl"},{line,239}]},{rabbit_disk_monitor,get_disk_free,2,[{file,"src/rabbit_disk_monitor.erl"},{line,220}]},{rabbit_disk_monitor,internal_update,1,[{file,"src/rabbit_disk_monitor.erl"},{line,197}]},{rabbit_disk_monitor,handle_info,2,[{file,"src/rabbit_disk_monitor.erl"},{line,169}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]},{gen_server,handle_msg,...},...]}
2019-02-15 08:29:20.021 [error] <0.210.0> Supervisor rabbit_disk_monitor_sup had child rabbit_disk_monitor started with rabbit_disk_monitor:start_link(50000000) at <0.211.0> exit with reason {eagain,[{erlang,open_port,[{spawn,"/bin/sh -s unix:cmd"},[binary,stderr_to_stdout,stream,in,hide,out]],[]},{os,cmd,1,[{file,"os.erl"},{line,239}]},{rabbit_disk_monitor,get_disk_free,2,[{file,"src/rabbit_disk_monitor.erl"},{line,220}]},{rabbit_disk_monitor,internal_update,1,[{file,"src/rabbit_disk_monitor.erl"},{line,197}]},{rabbit_disk_monitor,handle_info,2,[{file,"src/rabbit_disk_monitor.erl"},{line,169}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]},{gen_server,handle_msg,...},...]} in context child_terminated

On Saturday, February 16, 2019 at 12:06:11 AM UTC+8, Michael Klishin wrote:

杜昱萱

Feb 18, 2019, 2:42:39 AM
to rabbitmq-users
Why is the EAGAIN error reported?
Maybe it is caused by some unavailable resource.
How can I identify which resource blocked the subprocess?

On Monday, February 18, 2019 at 3:14:07 PM UTC+8, 杜昱萱 wrote:

Michael Klishin

Feb 19, 2019, 6:50:51 AM
to rabbitm...@googlegroups.com
According to [1], EAGAIN means the OS reported that its process limit has been reached.
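
To see which limit is actually in effect for the pod, a couple of checks might help (a sketch; the cgroup paths assume cgroup v1 with the pids controller mounted inside the container):

# processes currently visible in the pod's PID namespace
$ kubectl exec crmq-rabbitmq-0 -- sh -c 'ls /proc | grep -c "^[0-9]"'
# pids cgroup limit and current usage, if the controller is mounted
$ kubectl exec crmq-rabbitmq-0 -- cat /sys/fs/cgroup/pids/pids.max
$ kubectl exec crmq-rabbitmq-0 -- cat /sys/fs/cgroup/pids/pids.current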

杜昱萱

Feb 20, 2019, 2:20:43 AM
to rabbitmq-users
Thanks! The process limit has been increased,
but describing the pod still reports errors:

05:58:05.278 [error] GenServer :inet_gethost_native_sup terminating

    • (stop) :eagain
      :erlang.open_port({:spawn, 'inet_gethost 4 '}, [{:packet, 4}, :eof, :binary])
      (kernel) inet_gethost_native.erl:184: :inet_gethost_native.server_init/2
      Last message: {:EXIT, #PID<0.87.0>, {:eagain, [{:erlang, :open_port, [{:spawn, 'inet_gethost 4 '}, [{:packet, 4}, :eof, :binary]], []}, {:inet_gethost_native, :server_init, 2, file: 'inet_gethost_native.erl', line: 184}]}}
      Error:
      {:eagain, [{:erlang, :open_port, [{:spawn, '/bin/sh -s unix:cmd'}, [:binary, :stderr_to_stdout, :stream, :in, :hide, :out]], []}, {:os, :cmd, 1, file: 'os.erl', line: 239}, {:vm_memory_monitor, :get_linux_pagesize, 0, file: 'src/vm_memory_monitor.erl', line: 448}, {:vm_memory_monitor, :init_state_by_os, 1, file: 'src/vm_memory_monitor.erl', line: 277}, {:vm_memory_monitor, :get_rss_memory, 0, file: 'src/vm_memory_monitor.erl', line: 250}, {:rabbit_vm, :memory, 0, file: 'src/rabbit_vm.erl', line: 65}, {:rabbit, :status, 0, file: 'src/rabbit.erl', line: 704}, {:rpc, :"-handle_call_call/6-fun-0", 5, file: 'rpc.erl', line: 197}]}
      Warning Unhealthy 34m kubelet, ip-10-15-13-86.us-east-2.compute.internal Liveness probe failed: Failed to create dirty io scheduler thread 6


On Tuesday, February 19, 2019 at 7:50:51 PM UTC+8, Michael Klishin wrote:

Michael Klishin

Feb 20, 2019, 2:28:09 AM
to rabbitm...@googlegroups.com
Now you have EAGAIN when the node tries to determine its hostname.

> Failed to create dirty io scheduler thread 6

also doesn't look good. Something in your Kubernetes environment prevented the runtime from starting a thread.
Consider revisiting any OS limits you may have in place.
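
One way to see the limits the Erlang VM is actually running under (a sketch, assuming rabbitmqctl still responds at that point):

# OS pid of the beam process, as seen from inside the container
$ kubectl exec crmq-rabbitmq-0 -- rabbitmqctl eval 'os:getpid().'
# inspect its effective limits, substituting the pid returned above for <pid>
$ kubectl exec crmq-rabbitmq-0 -- sh -c 'cat /proc/<pid>/limits'
# kernel-wide thread/pid ceilings, for comparison
$ kubectl exec crmq-rabbitmq-0 -- sh -c 'cat /proc/sys/kernel/threads-max /proc/sys/kernel/pid_max'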

杜昱萱

Feb 20, 2019, 4:12:22 AM
to rabbitmq-users
The OS limits are as follows:

$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 62210
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 62210
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Can I run some test command in the pod to check what prevents a thread from starting?

On Wednesday, February 20, 2019 at 3:28:09 PM UTC+8, Michael Klishin wrote:

Michael Klishin

Feb 20, 2019, 4:33:27 AM
to rabbitm...@googlegroups.com
The OS limit reported by `ulimit -a` can be misleading in environments where containers and container orchestrators can have their own
(explicit or implicit, documented or not) limits.

FWIW this is the second time I've seen this reported in several years. The other case is [1], and it ended up being an IaaS provider-specific
problem that was addressed by moving elsewhere.

Note that you can control how many I/O threads RabbitMQ starts via RABBITMQ_IO_THREAD_POOL_SIZE (128 by default IIRC, 64 on Windows),
but in the message above we see that it's thread #6 that fails to start, suggesting that the limit is far lower than that.
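
If shrinking that pool is worth a try, a sketch of one way to set it, assuming the image reads /etc/rabbitmq/rabbitmq-env.conf (it can equally be set as a plain environment variable on the container):

# /etc/rabbitmq/rabbitmq-env.conf
RABBITMQ_IO_THREAD_POOL_SIZE=30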

Xiaoxi Zhan

Feb 22, 2019, 3:09:24 AM
to rabbitmq-users

Actually, it seems that many other threads fail to start, not just thread 6. Is this because of the failure to get the hostname?


Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Aborted (core dumped)
Warning Unhealthy 34m kubelet, ip-10-15-13-86.us-east-2.compute.internal Readiness probe failed: Failed to create aux thread

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Aborted (core dumped)
Warning Unhealthy 33m kubelet, ip-10-15-13-86.us-east-2.compute.internal Liveness probe failed: Failed to create dirty io scheduler thread 5

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Aborted (core dumped)
Warning Unhealthy 33m kubelet, ip-10-15-13-86.us-east-2.compute.internal Readiness probe failed: Failed to create dirty io scheduler thread 9

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Aborted (core dumped)
Warning Unhealthy 32m kubelet, ip-10-15-13-86.us-east-2.compute.internal Liveness probe failed: Failed to create dirty io scheduler thread 4

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Aborted (core dumped)
Warning Unhealthy 32m kubelet, ip-10-15-13-86.us-east-2.compute.internal Readiness probe failed: Failed to create dirty io scheduler thread 8

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Aborted (core dumped)
Warning Unhealthy 31m kubelet, ip-10-15-13-86.us-east-2.compute.internal Liveness probe failed: Failed to create dirty io scheduler thread 3

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Aborted (core dumped)
Warning Unhealthy 31m kubelet, ip-10-15-13-86.us-east-2.compute.internal Readiness probe failed: Failed to create dirty io scheduler thread 7

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Aborted (core dumped)



On Wednesday, February 20, 2019 at 5:33:27 PM UTC+8, Michael Klishin wrote:

seabir...@gmail.com

Feb 22, 2019, 4:52:00 AM
to rabbitmq-users

Hi,


From checking inside the RabbitMQ container, the limit is 4096.

bash-4.2# cat /etc/security/limits.d/20-nproc.conf

# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.

*          soft    nproc     4096
root       soft    nproc     unlimited


When more processes are created than RabbitMQ's limit (4096… actually 3990...), the EAGAIN error is reported.
Why is EAGAIN reported when 4096 hasn't been reached?

Can we set this to the ulimit value, and how do we configure it?
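
One possible way to raise it, assuming the 4096 really comes from that limits.d file and that runuser applies PAM limits inside this container (the file name and the rabbitmq user name below are only placeholders):

bash-4.2# cat /etc/security/limits.d/21-rabbitmq-nproc.conf
# raise the per-user process limit for the user RabbitMQ runs as
rabbitmq    soft    nproc     65535
rabbitmq    hard    nproc     65535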

On Wednesday, February 20, 2019 at 5:33:27 PM UTC+8, Michael Klishin wrote:
> The OS limit reported by `ulimit -a` can be misleading in environments where containers and container orchestrators can have their own

Michael Klishin

Feb 22, 2019, 1:29:09 PM
to rabbitm...@googlegroups.com
If it cannot start thread #6 then chances are it will fail to start more threads.

Michael Klishin

Feb 22, 2019, 1:32:04 PM
to rabbitm...@googlegroups.com
What a virtualised or containerised OS reports for its limits isn't necessarily what its host has configured,
which is why I mentioned that there can be undocumented or implicit limits.

This is an Erlang runtime question more than a RabbitMQ one, so you may get more informative answers
on erlang-questions.
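
For the Erlang side, a few runtime introspection calls may help frame the question there (a sketch; they assume the node stays up long enough to answer):

# number of dirty I/O schedulers the VM was started with (default is 10)
$ rabbitmqctl eval 'erlang:system_info(dirty_io_schedulers).'
# async thread pool size, which is what RABBITMQ_IO_THREAD_POOL_SIZE controls
$ rabbitmqctl eval 'erlang:system_info(thread_pool_size).'
# Erlang process limit (not the OS nproc limit discussed above)
$ rabbitmqctl eval 'erlang:system_info(process_limit).'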