Periodic crashes with reason {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}}

Antonio Lupher

Mar 21, 2021, 1:09:44 PM
to rabbitm...@googlegroups.com
Hi all,

We've been seeing periodic crashes (~3-4x/day) for a while and hope someone might have insight on what could be causing this.

Our setup:
- RabbitMQ version: 3.8.14, but we have seen this since at least 3.7.x. 
- Using official docker image rabbitmq:3.8.14-management-alpine, running on k8s on Google Cloud (GKE).
- Queue is used by PHP and Go services. Messages are small and rates are usually < 20/s

The problem:
Several times a day, we see a CPU spike that is not correlated with any increase in load, and then RabbitMQ crashes with the following in the error log. Note that we collected the data for this particular incident on RabbitMQ 3.8.9, but the crashes have continued on 3.8.14 with the same error message and behavior.

-----------
2021-03-19 02:37:03.493 [error] <0.362.618> Supervisor {<0.362.618>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error
2021-03-19 02:40:31.532 [error] <0.18589.464> ** Generic server aten_detector terminating 
** Last message in was poll
** When Server state == {state,#Ref<0.2392762665.3115581441.7108>,5000,0.99,#{},#{}}
** Reason for termination ==
** {{timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}},[{gen_server,call,2,[{file,"gen_server.erl"},{line,238}]},{aten_detector,handle_info,2,[{file,"src/aten_detector.erl"},{line,103}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,680}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,756}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
2021-03-19 02:40:46.786 [error] <0.18589.464> CRASH REPORT Process aten_detector with 0 neighbours exited with reason: {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in gen_server:call/2 line 238
2021-03-19 02:41:00.864 [error] <0.235.0> Supervisor aten_sup had child aten_detector started with aten_detector:start_link() at <0.18589.464> exit with reason {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in context child_terminated
-----------
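
To line crashes like this up with the CPU chart, the crash timestamps can be pulled out of the log. Here is a minimal Python sketch, assuming the log line format shown above; the function name `crash_times` and the regular expression are illustrative only and are not part of RabbitMQ:

```python
import re
from datetime import datetime

# Matches lines such as:
# 2021-03-19 02:40:46.786 [error] <0.18589.464> CRASH REPORT Process aten_detector ...
CRASH_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[error\] \S+ "
    r"CRASH REPORT Process (\S+)"
)


def crash_times(lines, process="aten_detector"):
    """Return the timestamps of CRASH REPORT lines for the given process name."""
    times = []
    for line in lines:
        match = CRASH_RE.match(line)
        if match and match.group(2) == process:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return times
```

Running it over the unpacked log, for example `crash_times(open("rabbitmq.log"))`, should give a list of timestamps to compare against the spikes in the chart.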

Here's a chart where you can see the regular spikes in CPU corresponding to crashes:

image.png


Zooming in on the last crash:
image.png


Same crash from the management panel:
image.png


image.png

image.png


As you can see, there's no increase in messages before the crash. We thought the churn caused by PHP connections might be an issue, so we later added amqproxy, reducing connections to < 2/s, but the crashes have continued.

Attached are our config files and logs. Let me know if there's anything else that would be helpful!

Thanks,
Antonio

enabled_plugins
rabbitmq.conf
rabbitmq.log.gz

jo...@cloudamqp.com

Mar 22, 2021, 1:34:05 PM
to rabbitmq-users
Hi,

It is not 100% clear to me what happens in your scenario. Does RabbitMQ fully crash? If so, do you have a crash dump to share or analyze? See the queue_fun.awk script on https://ferd.github.io/recon/ for a quick way to do that.

I've reported a similar issue, see [1]. We've seen it a few more times since I reported it.
I'd start with adding Prometheus scraping; that might be the easiest way to narrow down what it is. If Prometheus times out, we'd have to use more customized methods to narrow it down.
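
As a rough illustration of checking whether the metrics endpoint itself stalls, here is a small Python sketch. It assumes the rabbitmq_prometheus plugin is enabled and reachable on its default port 15692; the URL, the 5 second timeout, and the helper names `parse_metrics` and `timed_scrape` are assumptions made for this example, and the parser only handles sample lines without a trailing timestamp.

```python
import time
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text-format samples into {name_with_labels: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # not a plain "<name> <value>" sample; ignore it
    return metrics


def timed_scrape(url="http://localhost:15692/metrics", timeout=5.0):
    """Fetch the endpoint once; return (elapsed_seconds, parsed_metrics).

    Raises an exception (e.g. URLError or a socket timeout) if the node
    does not answer within `timeout` seconds.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return time.monotonic() - start, parse_metrics(body)
```

Calling `timed_scrape()` every few seconds and logging the elapsed time would show whether scrapes slow down or time out in the minutes before a crash.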

/Johan
