Trying to understand a CRASH REPORT

2,310 views
Skip to first unread message

Jens Troeger

unread,
Nov 4, 2019, 1:47:28 AM11/4/19
to rabbitmq-users
Hello,

Over the past two or so weeks, I’ve seen the following issue happening at random times

2019-11-04 05:20:55.553 [error] <0.19332.7> ** Generic server aten_detector terminating 
** Last message in was poll
** When Server state == {state,#Ref<0.598122687.2509766657.60349>,1000,0.99,#{},#{}}
** Reason for termination == 
** {{timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}},[{gen_server,call,2,[{file,"gen_server.erl"},{line,215}]},{aten_detector,handle_info,2,[{file,"src/aten_detector.erl"},{line,97}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2019-11-04 05:21:04.582 [error] <0.19332.7> CRASH REPORT Process aten_detector with 0 neighbours exited with reason: {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in gen_server:call/2 line 215
2019-11-04 05:21:10.346 [error] <0.260.0> Supervisor aten_sup had child aten_detector started with aten_detector:start_link() at <0.19332.7> exit with reason {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in context child_terminated

It then repeats a moment later again

2019-11-04 05:21:20.387 [error] <0.27383.10> ** Generic server aten_detector terminating
...

About 30min later I notice that something is off because of server notifications; when I reboot the server then RabbitMQ restarts and continues to run fine:

2019-11-04 05:56:12.733 [info] <0.307.0> 
 Starting RabbitMQ 3.8.0 on Erlang 21.3.8.7
 Copyright (C) 2007-2019 Pivotal Software, Inc.
 Licensed under the MPL.  See https://www.rabbitmq.com/
2019-11-04 05:56:12.739 [info] <0.307.0> 
 node           : rabbit@ip-10-0-1-98
 home dir       : /var/lib/rabbitmq
 config file(s) : (none)
 cookie hash    : 8fZhdkfp5GCirbVTKV5mZw==
 log(s)         : /var/log/rabbitmq/rab...@ip-10-0-1-98.log
                : /var/log/rabbitmq/rabbit@ip-10-0-1-98_upgrade.log
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@ip-10-0-1-98

What does the timeout mean exactly (or the error message) and where can I find some more context on how to address this issue? Might this be caused by another problem?

Much thanks!
Jens

Wesley Peng

unread,
Nov 4, 2019, 3:18:57 AM11/4/19
to rabbitm...@googlegroups.com
Hi

on 2019/11/4 14:47, Jens Troeger wrote:
> What does the timeout mean exactly (or the error message) and where can
> I find some more context on how to address this issue? Might this be
> caused by another problem?


Before the community can help you, we must know this information:

Version of Erlang and RabbitMQ
Operating system and version
RabbitMQ configuration files
RabbitMQ log files, or log file entries
Exact error output
Exact commands you are running, or code you are running


regards

Jens Troeger

unread,
Nov 4, 2019, 4:31:14 AM11/4/19
to rabbitmq-users
Thank you, Wesley.

Here’s the status:

OS PID: 1160
OS: Linux
Uptime (seconds): 12061
RabbitMQ version: 3.8.0
Node name: rabbit@ip-10-0-1-98
Erlang configuration: Erlang/OTP 21 [erts-10.3.5.5] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:64] [hipe]
Erlang processes: 309 used, 1048576 limit
Scheduler run queue: 1
Cluster heartbeat timeout (net_ticktime): 60

This is a Ubuntu 16.04 (xenial) Linux, and I followed the installation instruction here. Since I’m still learning and exploring RabbitMQ I haven’t changed anything beyond the default installation and config, except:

~ > # Add a user and permissions.
~ > sudo rabbitmqctl add_user test Test
~ > sudo rabbitmqctl set_permissions -p / test ".*" ".*" ".*"
~ > # Check that the message broker is running.
~ > sudo rabbitmqctl status

The messages are pushed from Celery running on the same host, pushed to amqp://test:Test@localhost:5672// 

Exact error output from /var/log/rabbitmq/rab...@ip-10-0-1-98.log is above in my initial post. However, I think that at the time of the error no messages were pushed.

There is also a slew of

2019-11-04 09:10:31.637 [error] <0.12807.0> closing AMQP connection <0.12807.0> (10.0.1.155:50110 -> 10.0.1.98:5672):
missed heartbeats from client, timeout: 60s

which apparently is a Celery client issue, see this post; looks like the suggested Celery workaround actually working for me, too.

Cheers,
Jens

Wesley Peng

unread,
Nov 4, 2019, 4:39:11 AM11/4/19
to rabbitm...@googlegroups.com
Hi

on 2019/11/4 17:31, Jens Troeger wrote:
>
> There is also a slew of
>
> |
> 2019-11-04 09:10:31.637 [error] <0.12807.0> closing AMQP connection
> <0.12807.0> (10.0.1.155:50110 -> 10.0.1.98:5672):
> missed heartbeats from client, timeout: 60s
> |
>
> which apparently is a Celery client issue, see this post
> <https://groups.google.com/d/msg/celery-users/5qYYu_RbKiI/xO6CFjYjCAAJ>;
> looks like the suggested Celery workaround actually working for me, too.

Yes you are smart enough.
This is may the problem of client's timeout setting.
For python library there is also pika client:
https://github.com/pika/pika

regards.

Jens Troeger

unread,
Nov 4, 2019, 10:50:38 PM11/4/19
to rabbitmq-users

Thanks Wesley!

> which apparently is a Celery client issue, see this post
> <https://groups.google.com/d/msg/celery-users/5qYYu_RbKiI/xO6CFjYjCAAJ>;
> looks like the suggested Celery workaround actually working for me, too.

Yes you are smart enough.
This is may the problem of client's timeout setting.

I’ll see how using that workaround works out over the next few days.

Still, though, could you please shed some light onto the error message?
 
For python library there is also pika client:
https://github.com/pika/pika

This is for the Celery folks to decide, I think. Frankly though, having had nothing but misery and working around their bugs I’m going to ditch Celery for Dramatiq soon.

Cheers,
Jens

Karl Nilsson

unread,
Nov 5, 2019, 10:48:27 AM11/5/19
to rabbitmq-users
Hi,

You say you are getting server notifications. What are you referring to here? Memory alarms? If you enable rabbitmq-top do you notice any processes that have a large Erlang mailbox backlog.

Is the cluster under pressure at this time? Pushing a lot of messages?

Aten is a failure detector mostly used by Quorum Queues (a new feature in 3.8).


Cheers
Karl


On Monday, 4 November 2019 06:47:28 UTC, Jens Troeger wrote:
Hello,

Over the past two or so weeks, I’ve seen the following issue happening at random times

2019-11-04 05:20:55.553 [error] <0.19332.7> ** Generic server aten_detector terminating 
** Last message in was poll
** When Server state == {state,#Ref<0.598122687.2509766657.60349>,1000,0.99,#{},#{}}
** Reason for termination == 
** {{timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}},[{gen_server,call,2,[{file,"gen_server.erl"},{line,215}]},{aten_detector,handle_info,2,[{file,"src/aten_detector.erl"},{line,97}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2019-11-04 05:21:04.582 [error] <0.19332.7> CRASH REPORT Process aten_detector with 0 neighbours exited with reason: {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in gen_server:call/2 line 215
2019-11-04 05:21:10.346 [error] <0.260.0> Supervisor aten_sup had child aten_detector started with aten_detector:start_link() at <0.19332.7> exit with reason {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in context child_terminated

It then repeats a moment later again

2019-11-04 05:21:20.387 [error] <0.27383.10> ** Generic server aten_detector terminating
...

About 30min later I notice that something is off because of server notifications; when I reboot the server then RabbitMQ restarts and continues to run fine:

2019-11-04 05:56:12.733 [info] <0.307.0> 
 Starting RabbitMQ 3.8.0 on Erlang 21.3.8.7
 Copyright (C) 2007-2019 Pivotal Software, Inc.
 Licensed under the MPL.  See https://www.rabbitmq.com/
2019-11-04 05:56:12.739 [info] <0.307.0> 
 node           : rabbit@ip-10-0-1-98
 home dir       : /var/lib/rabbitmq
 config file(s) : (none)
 cookie hash    : 8fZhdkfp5GCirbVTKV5mZw==
 log(s)         : /var/log/rabbitmq/rabbit@ip-10-0-1-98.log
                : /var/log/rabbitmq/rabbit@ip-10-0-1-98_upgrade.log
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@ip-10-0-1-98

What does the timeout mean exactly (or the error message) and where can I find some more context on how to address this issue? Might this be caused by another problem?

Much thanks!
Jens
Reply all
Reply to author
Forward
0 new messages