Hi all,
We've been seeing periodic crashes (~3-4x/day) for a while and hope someone might have insight on what could be causing this.
Our setup:
- RabbitMQ version: 3.8.14, but we have seen this since at least 3.7.x.
- Queue is used by PHP and Go services. Messages are small and rates are usually < 20/s
The problem:
Several times a day we see a CPU spike that does not correlate with any increase in load, after which RabbitMQ crashes with the following in the error log. Note that we collected the data for this particular incident while running RabbitMQ 3.8.9, but the crashes have continued on 3.8.14 with the same error message and behavior.
-----------
2021-03-19 02:37:03.493 [error] <0.362.618> Supervisor {<0.362.618>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error
2021-03-19 02:40:31.532 [error] <0.18589.464> ** Generic server aten_detector terminating
** Last message in was poll
** When Server state == {state,#Ref<0.2392762665.3115581441.7108>,5000,0.99,#{},#{}}
** Reason for termination ==
** {{timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}},[{gen_server,call,2,[{file,"gen_server.erl"},{line,238}]},{aten_detector,handle_info,2,[{file,"src/aten_detector.erl"},{line,103}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,680}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,756}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
2021-03-19 02:40:46.786 [error] <0.18589.464> CRASH REPORT Process aten_detector with 0 neighbours exited with reason: {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in gen_server:call/2 line 238
2021-03-19 02:41:00.864 [error] <0.235.0> Supervisor aten_sup had child aten_detector started with aten_detector:start_link() at <0.18589.464> exit with reason {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in context child_terminated
-----------
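In case it helps anyone line these entries up against the CPU graphs, here is a minimal Python sketch for pulling the timestamps out of the log and computing the gaps between entries. It is only an illustration: it assumes every entry starts with the `YYYY-MM-DD HH:MM:SS.mmm [level] <pid>` prefix seen above, the function names are made up for this example, and the sample lines are abbreviated copies of the excerpt.

```python
import re
from datetime import datetime

# Assumption: each log entry starts with "YYYY-MM-DD HH:MM:SS.mmm [level] <pid>",
# as in the excerpt above. Continuation lines ("** ...") do not match and are skipped.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[(\w+)\] (<[\d.]+>)"
)

def parse_entries(lines):
    """Return (timestamp, level, pid) for each line that starts a log entry."""
    entries = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            entries.append((ts, m.group(2), m.group(3)))
    return entries

def gaps_seconds(entries):
    """Seconds elapsed between consecutive log entries."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(entries, entries[1:])]

# Abbreviated copies of the entries above (message bodies truncated for the example).
sample = [
    "2021-03-19 02:37:03.493 [error] <0.362.618> Supervisor ...",
    "2021-03-19 02:40:31.532 [error] <0.18589.464> ** Generic server aten_detector terminating",
    "** Last message in was poll",
    "2021-03-19 02:40:46.786 [error] <0.18589.464> CRASH REPORT ...",
    "2021-03-19 02:41:00.864 [error] <0.235.0> Supervisor aten_sup ...",
]

entries = parse_entries(sample)
gaps = gaps_seconds(entries)
for (ts, level, pid), gap in zip(entries[1:], gaps):
    print(f"{ts} {pid} +{gap:.3f}s since previous entry")
```

On the excerpt above this gives gaps of roughly 208 s, 15 s and 14 s between consecutive entries.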
Here's a chart where you can see the regular spikes in CPU corresponding to crashes:
Zooming in on the last crash:
Same crash from the management panel:
As you can see, there's no increase in messages before the crash. We thought the connection churn caused by PHP might be an issue, so we later added amqproxy, which reduced new connections to < 2/s, but the crashes have continued.
Attached are our config files and logs. Let me know if there's anything else that would be helpful!
Thanks,
Antonio