Hi, everyone.
The OpenStack cluster has three control nodes (node-30, node-31, node-32) plus some compute nodes. Each control node runs one RabbitMQ Docker container. Occasionally, one of the three RabbitMQ nodes consumes more and more memory until the system hits OOM or I restart the RabbitMQ container manually.
As the picture shows, node-30 kept growing until it triggered OOM and was restarted in the middle of the night (2022-04-08 03:16).
On all RabbitMQ nodes, the (lager, error_logger_hwm) setting is {ok,50}.
Here are some log snippets:
/var/log/kolla/rabbitmq/rab...@node-30.log
2022-04-08 02:49:45.007 [warning] <0.32.0> lager_error_logger_h dropped 9 messages in the last second that exceeded the limit of 1000 messages/sec
2022-04-08 03:17:44.003 [warning] <0.32.0> lager_error_logger_h dropped 154 messages in the last second that exceeded the limit of 1000 messages/sec
/var/log/kolla/rabbitmq/log/crash.log(node-30)
2022-04-17 13:23:40 =SUPERVISOR REPORT====
2022-04-17 11:20:37.826
Offender: [{nb_children,1},{name,channel_sup},{mfargs,{rabbit_channel_sup,start_link,[]}},{restart_type,temporary},{shutdown,infinity},{child_type,supervisor}]
2022-04-07 23:52:29.201
Offender: [{nb_children,1},{name,channel_sup},{mfargs,{rabbit_channel_sup,start_link,[]}},{restart_type,temporary},{shutdown,infinity},{child_type,supervisor}]
Reason: shutdown
Context: shutdown_error
Supervisor: {<0.7366.866>,rabbit_channel_sup_sup}
2022-04-07 23:52:22 =SUPERVISOR REPORT====
/var/log/kolla/rabbitmq/rab...@node-31.log
2022-04-08 02:47:17.000 [warning] <0.32.0> lager_error_logger_h dropped 17 messages in the last second that exceeded the limit of 1000 messages/sec
/var/log/kolla/rabbitmq/rab...@node-32.log
2022-04-08 03:17:33.000 [warning] <0.32.0> lager_error_logger_h dropped 2925 messages in the last second that exceeded the limit of 1000 messages/sec
2022-04-08 03:17:38.000 [warning] <0.32.0> lager_error_logger_h dropped 206 messages in the last second that exceeded the limit of 1000 messages/sec
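To compare how much logging each node was dropping, the "dropped N messages" warnings above can be totalled per node. The following is only a rough sketch: the function name `summarize_drops` and the `SAMPLE` dictionary are made up for illustration, the regular expression assumes the exact line format shown in the snippets, and the sample data contains only the lines quoted in this post, not the full logs.

```python
import re
from collections import defaultdict

# Matches the lager warning lines quoted above (format assumed from the snippets).
DROP_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[warning\] \S+ "
    r"lager_error_logger_h dropped (?P<dropped>\d+) messages in the last second "
    r"that exceeded the limit of (?P<limit>\d+) messages/sec"
)


def summarize_drops(lines_by_node):
    """Return {node: total dropped messages} for nodes with at least one match."""
    totals = defaultdict(int)
    for node, lines in lines_by_node.items():
        for line in lines:
            match = DROP_RE.match(line.strip())
            if match:
                totals[node] += int(match.group("dropped"))
    return dict(totals)


# Hypothetical container holding only the warning lines quoted in this post.
SAMPLE = {
    "node-30": [
        "2022-04-08 02:49:45.007 [warning] <0.32.0> lager_error_logger_h dropped 9 messages in the last second that exceeded the limit of 1000 messages/sec",
        "2022-04-08 03:17:44.003 [warning] <0.32.0> lager_error_logger_h dropped 154 messages in the last second that exceeded the limit of 1000 messages/sec",
    ],
    "node-31": [
        "2022-04-08 02:47:17.000 [warning] <0.32.0> lager_error_logger_h dropped 17 messages in the last second that exceeded the limit of 1000 messages/sec",
    ],
    "node-32": [
        "2022-04-08 03:17:33.000 [warning] <0.32.0> lager_error_logger_h dropped 2925 messages in the last second that exceeded the limit of 1000 messages/sec",
        "2022-04-08 03:17:38.000 [warning] <0.32.0> lager_error_logger_h dropped 206 messages in the last second that exceeded the limit of 1000 messages/sec",
    ],
}

if __name__ == "__main__":
    for node, total in sorted(summarize_drops(SAMPLE).items()):
        print(node, total)
```

In practice the lines would be read from each node's log file instead of being pasted into a dictionary.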
I have attached the full logs.
Questions:
- 1. What is the root cause of this problem, and how can I fix it?
- 2. Could I fix it by raising error_logger_hwm to a larger value such as 4000? What harm could that cause? I ask because the lager README.md says about error_logger_hwm: "It is probably best to keep this number small."
Here is a related Stack Overflow question, but nobody there has been able to help me: