Stop master node cause slave node nodedown

301 views
Skip to first unread message

xue

unread,
May 29, 2018, 5:21:30 AM5/29/18
to rabbitmq-users
Hello, I create a rabbitmq cluster with mirrored queue enabled. There are two nodes in cluster, one is master and another is slave. I used 'rabbitmqctl stop' cmd to stop master node, and information of queues will synchronize to the slave node. If there are few queues in cluster, everything is ok, but if there are plenty of  queues, slave node will report nodedown error

##########################################################################
run cmd fail,cmd=timeout -s SIGKILL 20 /usr/local/lib/rabbitmq/sbin/rabbitmqctl status,ret=2,out=Status of node rabbit@rabbitmqNode0 ...
,err=Error: unable to connect to node rabbit@rabbitmqNode0: nodedown

DIAGNOSTICS
===========

attempted to contact: [rabbit@rabbitmqNode0]

rabbit@rabbitmqNode0:
  * connected to epmd (port 4369) on rabbitmqNode0
  * epmd reports node 'rabbit' running on port 50000
  * TCP connection succeeded but Erlang distribution failed
  * suggestion: hostname mismatch?
  * suggestion: is the cookie set correctly?
  * suggestion: is the Erlang distribution using TLS?

current node details:
- node name: 'rabbitmq-cli-18614@BJ_AZ2_POD03_Cascaded_09'
- home dir: /usr/local/lib/rabbitmq
- cookie hash: P7Imk5s8lUeR70yzdvPyDA==
##########################################################################

And this error will recover few minutes later.

I haven't changed hostname,cookie or TSL, and rabbitmq can recover automatically. So is it a normal behavior that salve node report nodedown error when there are lots of queues?

Rabbitmq version is 3.5.6

Michael Klishin

unread,
May 29, 2018, 10:15:17 AM5/29/18
to rabbitm...@googlegroups.com
There are no master and slave nodes in RabbitMQ. All nodes are equal. Queues have masters and mirrors [1].

There is no evidence that the node is down. The message includes some diagnostics info, including

> TCP connection succeeded but Erlang distribution failed

which is one of the most commonly discussed topics on this list that has a dedicated doc section [2].

2. How CLI tools Authenticate Nodes (and Nodes to Each Other): the Erlang Cookie on http://www.rabbitmq.com/cli.html

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

xue

unread,
Jun 3, 2018, 11:42:06 PM6/3/18
to rabbitmq-users
Thanks for your replay. Maybe I used the wrong method to migrate master queues. I should modify nodes policy first, and then stop master queues. When rabbitmq-server with master queues stopped, new master need to take  some time to synchronise queues. And 'rabbitmqctl' cmd can't use at that time? 

在 2018年5月29日星期二 UTC+8下午10:15:17,Michael Klishin写道:
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

xue

unread,
Jun 4, 2018, 2:54:06 AM6/4/18
to rabbitmq-users
I also tried "erl -sname bao@rabbitmqNode0 -setcookie rabbitmq_server_cookie -remsh rabbit@rabbitmqNode0" this cmd to connect rabbitmq-server, and it reported the same error.


在 2018年5月29日星期二 UTC+8下午10:15:17,Michael Klishin写道:
There are no master and slave nodes in RabbitMQ. All nodes are equal. Queues have masters and mirrors [1].
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Michael Klishin

unread,
Jun 4, 2018, 5:24:23 AM6/4/18
to rabbitm...@googlegroups.com
The use of rabbitmqctl is orthogonal to queue operations but if a rabbitmqctl operation requires a running queue master
and a new one hasn't been elected yet, it will predictably fail.

To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

xue

unread,
Jun 4, 2018, 9:29:23 AM6/4/18
to rabbitmq-users
There are only two nodes in cluster, one stopped, another should be promoted to master immediately. I tried to start a erlang process to connect rabbitmq's beam process, and it reported below error

######################################################################

# erl -sname aatest@rabbitmqNode1 -setcookie rabbitmq_server_cookie -remsh rabbit@rabbitmqNode1
Erlang/OTP 18 [erts-7.3.1] [source] [64-bit] [smp:24:24] [async-threads:10] [hipe] [kernel-poll:false]

*** ERROR: Shell process terminated! (^G to start new job) ***

BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
d
=node:'aatest@rabbitmqNode1'

######################################################################

But the other node is fine,
######################################################################

# erl -sname test@rabbitmqNode0 -setcookie rabbitmq_server_cookie -remsh rabbit@rabbitmqNode0
Erlang/OTP 18 [erts-7.3.1] [source] [64-bit] [smp:24:24] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.3.1  (abort with ^G)
(rabbit@rabbitmqNode0)1> 
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
d
=node:'test@rabbitmqNode0'
=visible_node:4701
Name: 'rabbit@rabbitmqNode0'
Controller: #Port<0.361>
Creation: 2
Remote monitoring: <0.15.0> <4701.389.0>
Remotely monitored by: <0.33.0> <4701.515.0>
Remote link: <0.33.0> <4701.513.0>

######################################################################

and two nodes  ping pang success

######################################################################

# erl -sname test@rabbitmqNode0 -setcookie rabbitmq_server_cookie -remsh rabbit@rabbitmqNode0
Erlang/OTP 18 [erts-7.3.1] [source] [64-bit] [smp:24:24] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.3.1  (abort with ^G)
(rabbit@rabbitmqNode0)1> net_adm:ping(rabit@rabbitmqNode1).
pang

######################################################################

So what happened to the first node, is beam process blocked?


在 2018年6月4日星期一 UTC+8下午5:24:23,Michael Klishin写道:

Michael Klishin

unread,
Jun 4, 2018, 10:59:19 AM6/4/18
to rabbitm...@googlegroups.com
First of all, there is no such thing as a master node in RabbitMQ [1][3].

Second, you are running a version of OTP that has known bugs that severely affect
RabbitMQ: the node cannot accept any connections and the existing ones also go stale after a period of time [2]. Please
upgrade to whatever is the most recent version supported by the RabbitMQ release in question [2][4].


To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

xue

unread,
Jun 11, 2018, 9:05:14 AM6/11/18
to rabbitmq-users
Hello, I upgraded rabbitmq to 3.6.15 with erlang 20.3.4. But there's still the same problem. I created lots of queues, stopped rabbitmq-server with master queues, waited 30 seconds, and I found  I can't connect to another rabbitmq-server


Then I started first rabbitmq-server and traced "net_adm:ping" cmd. And erlang:send function returned noconnect.


Do you have any ideas about what's wrong with rabbitmq?
Now, I want to debug c code of erlang. Would you please tell me is there any way to get the traceback for the c code? Thank you.



在 2018年6月4日星期一 UTC+8下午10:59:19,Michael Klishin写道:

Michael Klishin

unread,
Jun 11, 2018, 1:08:21 PM6/11/18
to rabbitm...@googlegroups.com
I'd inspect server logs and resource usage of the target node before debugging the runtime.
A swapped out node process won't be able to respond in time even to a net_adm:ping/1.

To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

xue

unread,
Jun 13, 2018, 3:00:42 AM6/13/18
to rabbitmq-users

Hello, I used 'rabbitmqctl stop' cmd to stop rabbit@rabbitmqNode0 at about 20:18, and started again. rabbit@rabbitmqNode1 reported nodedown error at about 20:19



This is resource usage of rabbit@rabbitmqNode1, beam.smp process took up a lot of CPU resources


I created a new group because the log file is little big.

This is the log of rabbit@rabbitmqNode0

This is the log of rabbit@rabbitmqNode1

Do you need any other information?


在 2018年6月12日星期二 UTC+8上午1:08:21,Michael Klishin写道:

Michael Klishin

unread,
Jun 13, 2018, 11:45:28 AM6/13/18
to rabbitm...@googlegroups.com
It's the CLI tool that reports it. See [1] first.

The node in question consumed 17 GB of memory and uses close to 7 full CPU cores. It can actually be
pretty busy. See [2].


To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

xue

unread,
Jun 13, 2018, 10:36:37 PM6/13/18
to rabbitmq-users
Hello, is it normal that CLI tool can't use when rabbitmq is too busy? And is there a better way to migrate master queues to another node?

I couldn't connect to rabbit@rabbitmqNode1 when it reported nodedown error. It reported shell process terminated error. 




I connected to rabbit@rabbitmqNode0 after it started. Then I tried to ping rabbit@rabbitmqNode1, but it returned pang.

Are these still problems of CLI tool?

在 2018年6月13日星期三 UTC+8下午11:45:28,Michael Klishin写道:

Michael Klishin

unread,
Jun 14, 2018, 6:23:34 AM6/14/18
to rabbitm...@googlegroups.com
CLI tool requests have to be scheduled just like any other activity so that might be the case.

I don't see any info on what Erlang version is used but with RabbitMQ 3.5.x it is 18.x at most.
Those versions have a couple of serious bugs that prevent nodes from accepting *any* TCP connections
(including those from CLI tools, peers and remote shells) and no activity on others.

See [1] and please upgrade [2][3].


To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

xue

unread,
Jun 14, 2018, 7:36:58 AM6/14/18
to rabbitmq-users
I've upgraded rabbitmq and erlang. First line of rab...@rabbitmqNode1.log.3 in node1.zip shows rabbitmq is 3.6.15 and erlang is 20.3.4, so these bugs should bu fixed.




在 2018年6月14日星期四 UTC+8下午6:23:34,Michael Klishin写道:

Michael Klishin

unread,
Jun 14, 2018, 7:44:46 AM6/14/18
to rabbitm...@googlegroups.com
net_adm:ping/1 will return a 'pang` for any failure, from network connectivity
to authentication. It really doesn't do much so unless the node fails to accept connections
or is swapped out by the OS, even a very busy node should respond.

See [1][2][3][4].


To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

xue

unread,
Jun 21, 2018, 2:06:08 AM6/21/18
to rabbitmq-users
Hello, I stopped rabbit@rabbitmqNode0 and found rabbit@rabbitmqNode1 spent lots of time on gen_server:loop

This is etop result of rabbit@rabbitmqNode1

========================================================================================
 rabbit@rabbitmqNode1                                                      07:54:36
 Load:  cpu        11               Memory:  total    25143328    binary    1262630
        procs  113765                        processes22763889    code        24937
        runq        0                        atom         1025    ets        217156
                         
Pid            Name or Initial Func    Time    Reds  Memory    MsgQ Current Function
----------------------------------------------------------------------------------------
<6504.32.0>    error_logger             '-'48186886 6081016    8947 io:execute_request/2
<6504.437.0>   queue_metrics_metric     '-'2377948856180776       0 gen_server:loop/7   
<6504.49.0>    net_kernel               '-'1433154715822296    7505 erlang:setelement/3 
<6504.438.0>   queue_coarse_metrics     '-'1117942313090000       0 dict:find/2         
<6504.429.0>   connection_coarse_me     '-' 3355111 9571296       0 gen_server:loop/7   
<6504.53.0>    file_server_2            '-' 2796721 1407824     397 prim_file:drv_get_re
<6504.73.0>    erlang:apply/2           '-' 2742544   12056       1 file_io_server:serve
<6504.386.0>   ranch_conns_sup:init     '-' 2213105  811920      24 gen:do_call/4       
<6504.427.0>   connection_created_m     '-' 1797863 7637544       0 gen_server:loop/7   
<6504.17182.3> etop_txt:init/1          '-' 137506782526800       0 etop:update/1       
========================================================================================
                         
                         
========================================================================================
 rabbit@rabbitmqNode1                                                      07:54:48
 Load:  cpu        26               Memory:  total    30539285    binary    4132696
        procs  150069                        processes25255648    code        24937
        runq        0                        atom         1025    ets        249877
                         
Pid            Name or Initial Func    Time    Reds  Memory    MsgQ Current Function
----------------------------------------------------------------------------------------
<6504.32.0>    error_logger             '-'25077570   55456       0 gen_event:fetch_msg/
<6504.49.0>    net_kernel               '-'1291425614205200   14474 erlang:setelement/3 
<6504.437.0>   queue_metrics_metric     '-'1236168832774576       0 gen_server:loop/7   
<6504.438.0>   queue_coarse_metrics     '-'1025782222445736       0 gen_server:loop/7   
<6504.429.0>   connection_coarse_me     '-' 874507411170704       0 gen_server:loop/7   
<6504.339.0>   delegate_6               '-' 8155470********       0 gen:do_call/4       
<6504.427.0>   connection_created_m     '-' 3907706 7637544       0 gen_server:loop/7   
<6504.435.0>   channel_process_metr     '-' 3070661 4720656       0 gen_server:loop/7   
<6504.17182.3> etop_txt:init/1          '-' 259352175649648       0 etop:update/1       
<6504.431.0>   channel_metrics_metr     '-' 258594011170704       0 gen_server:loop/7   
========================================================================================

This is eprof of rabbit@rabbitmqNode1

****** Process <6504.437.0>    -- 100.00 % of profiled time *** 
FUNCTION                                                          CALLS        %      TIME  [uS / CALLS]
--------                                                          -----  -------      ----  [----------]
time_compat:os_system_time/1                                          3     0.00         2  [      0.67]
exometer_slide:timestamp/0                                            3     0.00         3  [      1.00]
rabbit_core_metrics:delete/2                                          2     0.00         3  [      1.50]
gen_server:handle_msg/6                                               3     0.00         3  [      1.00]
os:system_time/1                                                      3     0.00         3  [      1.00]
gen_server:handle_common_reply/8                                      4     0.00         4  [      1.00]
ets:foldl/3                                                           3     0.00         5  [      1.67]
ets:'-foldl/3-after$^0/0-0-'/1                                        4     0.00         5  [      1.25]
rabbit_mgmt_metrics_collector:handle_info/2                           3     0.00         5  [      1.67]
rabbit_mgmt_metrics_collector:handle_deleted_queues/3                 4     0.00         5  [      1.25]
gen_server:decode_msg/9                                               3     0.00         5  [      1.67]
gen_server:try_dispatch/3                                             3     0.00         5  [      1.67]
exometer_slide:new/3                                                  6     0.00         6  [      1.00]
gen_server:try_dispatch/4                                             3     0.00         6  [      2.00]
rabbit_mgmt_metrics_collector:aggregate_metrics/2                     3     0.00         7  [      2.33]
dict:fold_dict/3                                                     16     0.00         8  [      0.50]
rabbit_mgmt_metrics_collector:exec_table_ops/3                       12     0.00         9  [      0.75]
dict:fold/3                                                          16     0.00         9  [      0.56]
ets:delete/2                                                          2     0.00         9  [      4.50]
dict:mk_seg/1                                                        15     0.00        10  [      0.67]
rabbit_mgmt_metrics_collector:'-aggregate_metrics/2-fun-1-'/4        12     0.00        12  [      1.00]
ets:first/1                                                           3     0.00        13  [      4.33]
dict:new/0                                                           15     0.00        14  [      0.93]
exometer_slide:add_to_total/2                                        16     0.00        16  [      1.00]
proplists:get_value/3                                                36     0.00        17  [      0.47]
rabbit_mgmt_metrics_collector:ceil/1                                  6     0.00        18  [      3.00]
lists:duplicate/2                                                    28     0.00        21  [      0.75]
erlang:send_after/3                                                   4     0.00        21  [      5.25]
ets:safe_fixtable/2                                                   7     0.00        27  [      3.86]
erlang:list_to_tuple/1                                               28     0.00        28  [      1.00]
dict:expand_segs/2                                                   88     0.00        65  [      0.74]
lists:duplicate/3                                                  1692     0.00       320  [      0.19]
dict:fold_segs/4                                                   1560     0.00       576  [      0.37]
dict:fold_seg/4                                                   26248     0.05      6392  [      0.24]
io_lib_format:build/1                                             33361     0.06      8804  [      0.26]
rabbit_mgmt_metrics_collector:queue_exists/1                      33360     0.06      8912  [      0.27]
rabbit_mgmt_format:queue_state/1                                  33361     0.07      9658  [      0.29]
rabbit_misc:dirty_read/1                                          33360     0.07      9885  [      0.30]
rabbit_mgmt_format:print/2                                        33361     0.07      9896  [      0.30]
calendar:last_day_of_the_month1/2                                 33360     0.07      9957  [      0.30]
calendar:date_to_gregorian_days/3                                 33360     0.07     10225  [      0.31]
calendar:is_leap_year1/1                                          33361     0.07     10240  [      0.31]
rabbit_mgmt_metrics_collector:update_op/4                         33361     0.08     10513  [      0.32]
ets:do_foldl/4                                                    33366     0.08     11492  [      0.34]
lists:concat/1                                                    33361     0.08     11693  [      0.35]
rabbit_mgmt_metrics_collector:difference/2                        33359     0.09     11841  [      0.35]
rabbit_mgmt_metrics_collector:insert_with_index/3                 66750     0.09     13029  [      0.20]
erlang:tuple_to_list/1                                            33389     0.10     13532  [      0.41]
calendar:df/2                                                     33360     0.10     14332  [      0.43]
dict:maybe_expand_segs/1                                          26509     0.10     14528  [      0.55]
calendar:gregorian_days_to_date/1                                 33360     0.11     15071  [      0.45]
rabbit_mgmt_format:'-format_queue_stats/1-lc$^1/1-0-'/1           33361     0.11     15370  [      0.46]
io_lib_format:pcount/1                                            33361     0.11     15414  [      0.46]
calendar:year_day_to_date/2                                       33360     0.11     15824  [      0.47]
erlang:make_fun/3                                                 33360     0.12     16320  [      0.49]
calendar:is_leap_year/1                                           33360     0.12     16343  [      0.49]
rabbit_mgmt_metrics_collector:'-init/1-fun-0-'/1                  33360     0.12     16346  [      0.49]
io_lib_format:scan/2                                              33361     0.12     16430  [      0.49]
rabbit_mgmt_metrics_collector:'-aggregate_metrics/2-fun-0-'/2     33362     0.12     16549  [      0.50]
rabbit_mgmt_format:'-format_queue_stats/1-lc$^0/1-1-'/1           33361     0.12     16563  [      0.50]
rabbit_mgmt_metrics_collector:insert_old_aggr_stats/3             33361     0.12     16622  [      0.50]
rabbit_mgmt_metrics_collector:vhost/1                             33362     0.12     16805  [      0.50]
calendar:day_to_year/1                                            33360     0.12     16854  [      0.51]
rabbit_mgmt_format:properties/1                                   33361     0.13     17655  [      0.53]
calendar:dm/1                                                     33360     0.13     17773  [      0.53]
calendar:date_to_gregorian_days/1                                 33360     0.13     18123  [      0.54]
calendar:time_to_seconds/1                                        33360     0.13     18128  [      0.54]
calendar:last_day_of_the_month/2                                  33360     0.13     18167  [      0.54]
rabbit_mgmt_metrics_collector:get_difference/3                    33362     0.13     18296  [      0.55]
lists:flatten/1                                                   33361     0.13     18751  [      0.56]
calendar:datetime_to_gregorian_seconds/1                          33360     0.14     18948  [      0.57]
rabbit_mgmt_format:now_to_str/1                                   33360     0.14     18974  [      0.57]
rabbit_mgmt_metrics_collector:insert_op/4                         33361     0.14     19033  [      0.57]
io_lib:format/2                                                   33361     0.14     19064  [      0.57]
io_lib_format:fwrite/2                                            33361     0.14     19764  [      0.59]
rabbit_mgmt_format:format/2                                       33360     0.15     20581  [      0.62]
rabbit_amqqueue:lookup/1                                          33360     0.15     20778  [      0.62]
rabbit_mgmt_metrics_collector:aggregate_entry/4                   33362     0.15     20866  [      0.63]
calendar:seconds_to_time/1                                        33361     0.15     20959  [      0.63]
rabbit_mgmt_metrics_collector:insert_index/2                      66750     0.17     23486  [      0.35]
calendar:dty/3                                                    66720     0.19     26090  [      0.39]
calendar:year_day_to_date2/2                                      33361     0.19     26096  [      0.78]
rabbit_mgmt_metrics_collector:insert_entry/7                      66750     0.19     26894  [      0.40]
calendar:gregorian_seconds_to_datetime/1                          33360     0.20     28091  [      0.84]
io_lib_format:adjust/3                                           133444     0.20     28158  [      0.21]
unicode_util:gc_extend/2                                         133444     0.20     28492  [      0.21]
rabbit_misc:pget/2                                                66724     0.21     29914  [      0.45]
unicode_util:gc_extend/3                                         133444     0.23     32387  [      0.24]
rabbit_mgmt_metrics_collector:pget/3                              66724     0.24     33511  [      0.50]
dict:put_bucket_s/3                                               53018     0.26     35778  [      0.67]
erlang:list_to_binary/1                                           33773     0.27     37836  [      1.12]
exometer_slide:add_element/3                                      66750     0.28     38460  [      0.58]
dict:fold_bucket/3                                               124833     0.29     39956  [      0.32]
io_lib_format:term/5                                             200166     0.33     45903  [      0.23]
io_lib:write1/3                                                  200166     0.33     46027  [      0.23]
rabbit_mgmt_metrics_collector:'-exec_table_ops/3-fun-0-'/5       100117     0.35     48301  [      0.48]
io_lib_format:field_width/3                                      200166     0.38     52656  [      0.26]
io_lib_format:chars/2                                            133444     0.39     54817  [      0.41]
rabbit_mgmt_metrics_collector:insert_entry_ops/6                  66722     0.42     57832  [      0.87]
calendar:dy/1                                                    100080     0.42     58225  [      0.58]
erlang:integer_to_list/1                                         200166     0.43     60139  [      0.30]
unicode_util:gc_1/1                                              266888     0.44     61069  [      0.23]
string:length/1                                                  133444     0.46     63402  [      0.48]
rabbit_mgmt_metrics_collector:'-insert_entry_op/4-fun-0-'/2      133436     0.48     66525  [      0.50]
rabbit_mgmt_metrics_collector:sum_entry/2                        133436     0.48     66707  [      0.50]
ets:next/2                                                        33363     0.55     76302  [      2.29]
dict:store/3                                                     300251     0.55     76344  [      0.25]
rabbit_mgmt_format:'-tuple/1-lc$^0/1-0-'/1                       200166     0.56     77824  [      0.39]
io_lib_format:collect_cseq/2                                     200166     0.56     78035  [      0.39]
epp:default_encoding/0                                           200166     0.57     78974  [      0.39]
io_lib_format:decr_pc/2                                          200166     0.57     79368  [      0.40]
io_lib_format:strings/2                                          200166     0.59     81669  [      0.41]
io_lib_format:pcount/2                                           400332     0.59     82317  [      0.21]
io_lib_format:precision/2                                        200166     0.60     82995  [      0.41]
string:length_1/2                                                366971     0.60     83154  [      0.23]
io_lib_format:field_width/2                                      200166     0.60     83631  [      0.42]
dict:'-update/4-fun-0-'/4                                        200168     0.65     90147  [      0.45]
io_lib:write/2                                                   200166     0.67     93341  [      0.47]
io_lib_format:pad_char/2                                         200166     0.69     95359  [      0.48]
dict:get_bucket_s/2                                              293400     0.69     96059  [      0.33]
rabbit_mgmt_metrics_collector:'-insert_entry_ops/6-fun-0-'/6     200168     0.70     97000  [      0.48]
lists:foldl/3                                                    333615     0.70     97277  [      0.29]
io_lib_format:encoding/2                                         200166     0.71     98893  [      0.49]
dict:maybe_expand/2                                              500419     0.71     98957  [      0.20]
dict:update/4                                                    200168     0.74    102862  [      0.51]
dict:get_bucket/2                                                266891     0.76    105395  [      0.39]
unicode_util:gc/1                                                266888     0.80    110724  [      0.41]
erlang:min/2                                                     266888     0.81    112147  [      0.42]
rabbit_mgmt_metrics_collector:insert_entry_op/4                  200168     0.81    112770  [      0.56]
io_lib_format:control/9                                          200166     0.84    117060  [      0.58]
lists:keyfind/3                                                  467056     0.85    118454  [      0.25]
dict:maybe_expand_aux/2                                          500419     0.91    127323  [      0.25]
dict:'-store/3-fun-0-'/3                                         300251     0.93    128772  [      0.43]
io_lib_format:field_value/3                                      533776     0.94    130950  [      0.25]
dict:rehash/4                                                    215984     0.94    131082  [      0.61]
io_lib_format:field_value/2                                      333610     1.01    140583  [      0.42]
io_lib_format:collect/2                                          400332     1.09    151885  [      0.38]
unicode_util:cp/1                                                400332     1.10    153202  [      0.38]
dict:find/2                                                      266891     1.10    153219  [      0.57]
dict:find_val/2                                                  487907     1.13    157159  [      0.32]
lists:thing_to_list/1                                            900747     1.18    164486  [      0.18]
io_lib:get_option/3                                              400332     1.20    166617  [      0.42]
io_lib_format:build/3                                            400332     1.20    167067  [      0.42]
erlang:'++'/2                                                    900775     1.36    189825  [      0.21]
io_lib_format:collect_cc/2                                       200166     1.51    210574  [      1.05]
rabbit_mgmt_format:'-properties/1-lc$^0/1-0-'/1                  533776     1.72    240006  [      0.45]
rabbit_mgmt_format:tuple/1                                       667220     1.79    248864  [      0.37]
dict:on_bucket/3                                                 500419     2.09    290977  [      0.58]
ets:insert/2                                                     100117     2.14    298041  [      2.98]
lists:'-concat/1-fun-0-'/1                                       900747     2.38    331444  [      0.37]
lists:do_flatten/2                                              1100913     2.61    363529  [      0.33]
ets:lookup/2                                                     133472     2.63    365759  [      2.74]
lists:flatmap/2                                                  934108     2.64    367356  [      0.39]
erlang:phash/2                                                   956785     2.82    392391  [      0.41]
rabbit_mgmt_format:format_queue_stats/1                          900746     2.84    394569  [      0.44]
dict:update_bkt/4                                                549588     2.92    406411  [      0.74]
rabbit_mgmt_format:'-format/2-lc$^2/1-0-'/2                      934107     2.98    415014  [      0.44]
dict:get_slot/2                                                  767310     3.04    423197  [      0.55]
dict:store_bkt_val/3                                             777411     3.28    455942  [      0.59]
erlang:setelement/3                                             2148045     6.98    971658  [      0.45]
gen_server:loop/7                                                     4    11.02   1533359  [ 383339.75]
-------------------------------------------------------------  --------  -------  --------  [----------]
Total:                                                         29150172  100.00%  13917631  [      0.48]
ok


Do you know why gen_server:loop took such long time, and what happened to rabbit@rabbitmqNode1 when rabbit@rabbitmqNode0 stopped. 

Thank you.


在 2018年6月14日星期四 UTC+8下午7:44:46,Michael Klishin写道:

Michael Klishin

unread,
Jun 21, 2018, 8:53:33 AM6/21/18
to rabbitm...@googlegroups.com
Many profilers use periodic sampling of stack traces and if a function
is used to loop waiting for requests (which is exactly what gen_server loop does),
it will show up in the trace a lot even though it actually won't be consuming any CPU resources
(or the process might even be suspended).

See rabbitmq-top's Reductions metric instead [1].


To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

xue

unread,
Jun 26, 2018, 9:16:48 AM6/26/18
to rabbitmq-users
Hello, I found that if I execute 'rabbitmqctl stop' command on rabbit@rabbitmqNode0,  rabbit@rabbitmqNode1 would report nodedown error. But if I execute'rabbitmqctl stop_app' command instead, rabbit@rabbitmqNode1 works fine.

The only difference between stop and stop_app is that 'rabbitmqctl stop_app' only stops rabbitmq-server, but 'rabbitmqctl stop' stops not only rabbitmq-server, but also erlang vms.

So would erlang vms cause this problem?

Thank you

在 2018年6月21日星期四 UTC+8下午8:53:33,Michael Klishin写道:

Daniil Fedotov

unread,
Jun 26, 2018, 12:26:11 PM6/26/18
to rabbitmq-users
Hi.

Let me clarify some things:

Are you trying to move masters across by stopping the node?
According to the logs, you have around 22 thousand queues, is that right?
By nodedown you mean that you cannot access the node1 via rabbitqmctl?
Do you see the nodedown error after node0 is restarted or before?

When a node detects another node stopped, it will notify all the mirrored queues slave processes to maybe promote to master. For 22K queues this will take quite a time and will use some CPU resources to wake each queue process up from hibernation, update a database to record a new master and do some promote logic. This could slow down other parts of the system, because of internal process scheduling and make it seem like node is not accessable.
I can see handshake timeouts in the logs, which might also happen because connection processes do not get enought scheduled time to perform connection logic, because CPU is busy processing queues.

Having that many mirrored queues would require much more CPU cores to schedule in case of node disconnect (e.g. actual network or hardware errors). The scheduler will have to give all of them some time to promote or handle mirror removal.

If you want to migrate queues, you can define a policy to configure different nodes as described here https://groups.google.com/forum/#!msg/rabbitmq-users/bJNcrDVhWiU/6oMO0DjNQ4oJ
You better do that for limited number of queues at a time to not overload your system again. You can do that with queue name patterns in policies.
This should allow you to do that maintenance easier, but you still can encounter a scheduling overload during actual network error or crash.

xue

unread,
Jun 27, 2018, 3:37:42 AM6/27/18
to rabbitmq-users
Hello,

Are you trying to move masters across by stopping the node?
Yes

According to the logs, you have around 22 thousand queues, is that right?
Yes

By nodedown you mean that you cannot access the node1 via rabbitqmctl?
Yes, I executed 'rabbitmqctl' commands on node1, and it reported nodedown error. I also tried this comand, but it failed too.




Do you see the nodedown error after node0 is restarted or before?
node0 works fine, it only reports some timeout errors, such as 'handshake_timeout' or 'missed heartbeats from client'. I think it is because there are too many queues, and rabbitmq can't handle these messages in time.

I'll try to migrate queues by define polices.

Thank you.


在 2018年6月27日星期三 UTC+8上午12:26:11,Daniil Fedotov写道:

xue

unread,
Jun 28, 2018, 9:22:00 AM6/28/18
to rabbitmq-users
Hello, when node0 stopped, all the mirrored queues on node1 would wake up from hiberation. Would you please show me the relative code, and can I change some parameters to make mirrored queues always active, so that mirrored queues don't need CPU resources to wake up.

Thank you

在 2018年6月27日星期三 UTC+8上午12:26:11,Daniil Fedotov写道:
Hi.

Daniil Fedotov

unread,
Jun 28, 2018, 9:36:52 AM6/28/18
to rabbitmq-users
If all 22K queues are always active, they will be granted CPU resources and will consume all the CPU anyway. That is the reason why they are hibernated - to conserve CPU while they are unactive.
What you can do is to check if all the queues need to be mirrored and maybe reduce a number of queues per CPU core. You can reduce number of queues, or make some queues non-mirrored if data there is not critical. You can also add more CPU cores or create another cluster and try to spread queues across multiple clusters, if your topology allows that.

xue

unread,
Jun 29, 2018, 5:16:35 AM6/29/18
to rabbitmq-users
Hello, I also found some confusing things. I created 40K queues and enabled mirrored queues, then I stopped node0, some mirrored queues toke more than ten minutes to see the deaths of mirrors.

This is the first log of "saw deaths of mirrors"



This is the last log


I also found that ,in this process, most schedulers are in sleeping status. and rabbitmq didn't use lot of cpu resources.




Until all the mirrored queues  promoted to master, can nodedown error of node1 disappered.

So  why some queues took such long time to see deaths of mirrors, and can I speed up this process?

Thank you.


在 2018年6月28日星期四 UTC+8下午9:36:52,Daniil Fedotov写道:

Daniil Fedotov

unread,
Jun 29, 2018, 12:13:32 PM6/29/18
to rabbitmq-users
As I mentioned before, you have too many processes to schedule CPU time to. Each queue is a process, which should be given some CPU resources to execute, the "down" message is put into the queue process message box (memory allocated for the process) and then sits there until the process is given some CPU ticks to process it. The message is written only when the process handles the message. The more queues you have, the more time it will take for them to process the message. You only have 7 cores, for 44K processes it will be ~5714 processes per core. Each of the processes will have to update the mnesia database record to mark itself as master and also communicate to coordination process. Also switching between processes is not free and takes additional time and CPU resources.

To improve performance in this case you should reduce a number of processes per core either by reducing a number of queues or increasing a number of cores. As I mentioned before you can also try to use multiple clusters or disable mirroring for some queues.
Reply all
Reply to author
Forward
0 new messages