rabbitmq-server Too many processes

1,050 views
Skip to first unread message

Fei Jiang

unread,
Jul 13, 2018, 3:39:18 AM7/13/18
to rabbitmq-users
 am working on rabbitmq performance. I am using https://github.com/openstack/oslo.messaging/blob/master/tools/simulator.py  measure the performance.
So, I am running the tool for following use-cases and test setup

Memory :  100GB
CPU:  40 Vcpu
RabbitMQ version : 3.5.7
socket and fd  : 65535

RabbitMQ use 2 Node(Node0 Node1),  enabled ha-mode=all, also create 25K queues and 25K TCP connection(25K channel) on rabbitmq-broker.

25K queues was create at Node0, and mirred all queues on Node1, when restart Node0, the Node1 accepting  25k TCP connection, and the close the connection because of Missed heartbeats from client, timeout: 60s. 

there is questions:
1、the node0 down ,node1 recv the  node down message, but node1 have no message ' Slave <rab...@rabbitmqNode1.1.3227.0> saw deaths of mirrors <rab...@rabbitmqNode0.1.11859.0> and promoting slave to master
     the normal case , if node0 down , the node1 have these messages in log, but  this time have no these messages  in mq log, only tcp conntion accepting and closing in 15 min.

Mirrored queue 'metering_fanout_a3694ca9e8e74dd78ee93ac6a19b523a' in vhost '/': Slave <rab...@rabbitmqNode1.1.3227.0> saw deaths of mirrors <rab...@rabbitmqNode0.1.11859.0>

Mirrored queue 'metering_fanout_a3694ca9e8e74dd78ee93ac6a19b523a' in vhost '/': Promoting slave <rab...@rabbitmqNode1.1.3227.0> to master

Mirrored queue 'metering_fanout_a3694ca9e8e74dd78ee93ac6a19b523a' in vhost '/': Synchronising: 0 messages to synchronise

Mirrored queue 'metering_fanout_a3694ca9e8e74dd78ee93ac6a19b523a' in vhost '/': Synchronising: all slaves already synced

2、why rabbitmq have so many heartBeat miss and close the client connection, at last,the connection  reach the system limit

3、after the node0 restart, node0 want to join  node1 cluster for autocluster, node0 have  {badrpc, nodedown}  messge ,auto-cluster failed。

-------------------------------the rabbitmq log --------------------------
the Node 0 log:

=INFO REPORT==== 28-Apr-2018::01:41:18 ===
Stopping RabbitMQ

=INFO REPORT==== 28-Apr-2018::01:41:20 ===
stopped SSL Listener on 0.0.0.0:5672

=ERROR REPORT==== 28-Apr-2018::01:41:20 ===
Error on AMQP connection <0.29780.179> (192.168.4.154:60286 -> 192.168.8.220:5672, vhost: '/', user: 'rabbit', state: running), channel 0:
{amqp_error,connection_forced,
            "broker forced connection closure with reason 'shutdown'",none}
......

=ERROR REPORT==== 28-Apr-2018::01:41:29 ===
Error on AMQP connection <0.26951.304> (192.168.5.43:38425 -> 192.168.8.220:5672, vhost: '/', user: 'rabbit', state: running), channel 0:
{amqp_error,connection_forced,
            "broker forced connection closure with reason 'shutdown'",none}

=INFO REPORT==== 28-Apr-2018::01:42:39 ===
Starting RabbitMQ 3.5.6 on Erlang 18.3.3
Copyright (C) 2007-2015 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 28-Apr-2018::01:42:39 ===
node           : rabbit@rabbitmqNode1
home dir       : /usr/local/lib/rabbitmq
config file(s) : /usr/local/lib/rabbitmq/sbin/../etc/rabbitmq/rabbitmq.config
cookie hash    : P7Imk5s8lUeR70yzdvPyDA==
log            : /var/log/rabbitmq/rab...@rabbitmqNode1.log
sasl log       : /var/log/rabbitmq/rab...@rabbitmqNode1-sasl.log
database dir   : /opt/data/rabbitmq/rabbit@rabbitmqNode1

=INFO REPORT==== 28-Apr-2018::01:42:40 ===
Memory limit set to 102400MB of 450749MB total.

=INFO REPORT==== 28-Apr-2018::01:42:40 ===
Disk free limit set to 1024MB

=INFO REPORT==== 28-Apr-2018::01:42:40 ===
Limiting to approx 65436 file handles (58890 sockets)

=INFO REPORT==== 28-Apr-2018::01:42:40 ===
FHC read buffering:  ON
FHC write buffering: ON

=WARNING REPORT==== 28-Apr-2018::01:42:47 ===
Could not auto-cluster with rabbit@rabbitmqNode0: {badrpc,nodedown}

=WARNING REPORT==== 28-Apr-2018::01:42:47 ===
Could not find any node for auto-clustering from: [rabbit@rabbitmqNode1,
                                                   rabbit@rabbitmqNode0]



the Node1 log:

=INFO REPORT==== 28-Apr-2018::01:41:18 ===
Statistics database started.

=INFO REPORT==== 28-Apr-2018::01:41:38 ===
rabbit on node rabbit@rabbitmqNode1 down                   
=INFO REPORT==== 28-Apr-2018::01:41:55 ===
accepting AMQP connection <0.5408.17> (192.168.4.227:46108 -> 192.168.8.220:5672)

=INFO REPORT==== 28-Apr-2018::01:41:55 ===
accepting AMQP connection <0.5428.17> (192.168.4.213:54830 -> 192.168.8.220:5672)

=INFO REPORT==== 28-Apr-2018::01:41:55 ===
accepting AMQP connection <0.5568.17> (192.168.0.3:36128 -> 192.168.8.220:5672)

=INFO REPORT==== 28-Apr-2018::01:41:55 ===
accepting AMQP connection <0.5617.17> (192.168.2.108:33425 -> 192.168.8.220:5672)

=INFO REPORT==== 28-Apr-2018::01:41:55 ===
accepting AMQP connection <0.5632.17> (192.168.2.108:33426 -> 192.168.8.220:5672)

accepting new connectionl;  then ,closing connection  Missed heartbeats from client, timeout: 60s
..................
=ERROR REPORT==== 28-Apr-2018::01:52:03 ===
closing AMQP connection <0.14836.39> (192.168.5.60:47624 -> 192.168.8.220:5672):
Missed heartbeats from client, timeout: 60s

=ERROR REPORT==== 28-Apr-2018::01:52:03 ===
closing AMQP connection <0.7681.35> (192.168.4.53:50494 -> 192.168.8.220:5672):
Missed heartbeats from client, timeout: 60s

=ERROR REPORT==== 28-Apr-2018::01:52:03 ===
closing AMQP connection <0.20562.35> (192.168.4.56:57362 -> 192.168.8.220:5672):
Missed heartbeats from client, timeout: 60s

=ERROR REPORT==== 28-Apr-2018::01:52:03 ===
closing AMQP connection <0.32075.34> (192.168.2.174:44434 -> 192.168.8.220:5672):
Missed heartbeats from client, timeout: 60s

=INFO REPORT==== 28-Apr-2018::01:52:03 ===
accepting AMQP connection <0.28698.69> (192.168.1.234:44314 -> 192.168.8.220:5672)

=INFO REPORT==== 28-Apr-2018::01:52:03 ===
accepting AMQP connection <0.28687.69> (192.168.4.48:46033 -> 192.168.8.220:5672)

=ERROR REPORT==== 28-Apr-2018::01:52:03 ===
closing AMQP connection <0.29807.34> (192.168.4.51:60386 -> 192.168.8.220:5672):
Missed heartbeats from client, timeout: 60s

=ERROR REPORT==== 28-Apr-2018::01:52:03 ===
closing AMQP connection <0.6187.34> (192.168.7.109:45366 -> 192.168.8.220:5672):
Missed heartbeats from client, timeout: 60s

=ERROR REPORT==== 28-Apr-2018::01:57:14 ===
closing AMQP connection <0.31142.112> (172.28.0.197:46795 -> 172.28.8.220:5672):
Missed heartbeats from client, timeout: 60s

=ERROR REPORT==== 28-Apr-2018::01:57:14 ===
Too many processes


=ERROR REPORT==== 28-Apr-2018::01:57:14 ===
Too many processes


=ERROR REPORT==== 28-Apr-2018::01:57:14 ===
** Generic server <0.14434.6> terminating 
** Last message in was {inet_async,#Port<0.69843>,14461,{ok,#Port<0.485076>}}
** When Server state == {state,
                            {rabbit_networking,start_ssl_client,
                                [[{versions,['tlsv1.1','tlsv1.2']},
                                  {certfile,
                                      "/etc/rabbitmq/rabbitmq-server.crt"},
                                  {keyfile,
                                      "/etc/rabbitmq/rabbitmq-server.key"},
                                  {ciphers,
                                      [{dhe_rsa,aes_256_cbc,sha256},
                                       {dhe_dss,aes_256_cbc,sha256},
                                       {dhe_rsa,aes_128_cbc,sha256},
                                       {dhe_dss,aes_128_cbc,sha256},
                                       {dhe_dss,aes_256_cbc,sha},
                                       {dhe_dss,aes_128_cbc,sha},
                                       {ecdhe_ecdsa,aes_256_gcm,null,sha384},
                                       {ecdhe_rsa,aes_256_gcm,null,sha384},
                                       {ecdh_ecdsa,aes_256_gcm,null,sha384},
                                       {ecdh_rsa,aes_256_gcm,null,sha384},
                                       {dhe_rsa,aes_256_gcm,null,sha384},
                                       {dhe_dss,aes_256_gcm,null,sha384},
                                       {ecdhe_ecdsa,aes_128_gcm,null,sha256},
                                       {ecdhe_rsa,aes_128_gcm,null,sha256},
                                       {ecdh_ecdsa,aes_128_gcm,null,sha256},
                                       {ecdh_rsa,aes_128_gcm,null,sha256},
                                       {dhe_rsa,aes_128_gcm,null,sha256},
                                       {dhe_dss,aes_128_gcm,null,sha256}]}]]},
                            #Port<0.69843>,14461}
** Reason for termination == 
** {{badmatch,
     {error,
      {'EXIT',
       {{badmatch,
         {error,
          {{'EXIT',
            {system_limit,
             [{erlang,spawn_opt,
               [proc_lib,init_p,
                [<0.31152.144>,
                 [rabbit_tcp_client_sup,rabbit_sup,<0.88.0>],
                 gen,init_it,
                 [gen_server,<0.31152.144>,<0.31152.144>,supervisor2,
                  {self,rabbit_connection_helper_sup,[]},
                  []]],
                [link]],
               []},
              {proc_lib,start_link,5,[{file,"proc_lib.erl"},{line,330}]},
              {supervisor2,do_start_child,2,[]},
              {supervisor2,handle_start_child,2,[]},
              {supervisor2,handle_call,3,[]},
              {gen_server,try_handle_call,4,
               [{file,"gen_server.erl"},{line,629}]},
              {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},
              {proc_lib,init_p_do_apply,3,
               [{file,"proc_lib.erl"},{line,240}]}]}},
           {child,undefined,helper_sup,
            {rabbit_connection_helper_sup,start_link,[]},
            intrinsic,infinity,supervisor,
            [rabbit_connection_helper_sup]}}}},
        [{rabbit_connection_sup,start_link,0,[]},
         {supervisor2,do_start_child_i,3,[]},
         {supervisor2,handle_call,3,[]},
         {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},
         {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},
         {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}}}},
    [{rabbit_networking,start_client,2,[]},
     {tcp_acceptor,handle_info,2,[]},
     {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]},
     {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,681}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

=ERROR REPORT==== 28-Apr-2018::01:57:14 ===
Too many processes


sasl-log:


=SUPERVISOR REPORT==== 28-Apr-2018::01:57:14 ===
     Supervisor: {<0.31553.124>,rabbit_channel_sup}
     Context:    start_error
     Reason:     {'EXIT',
                     {system_limit,
                         [{erlang,spawn_opt,
                              [proc_lib,init_p,
                               [<0.31553.124>,
                                [<0.10384.160>,<0.14218.175>,<0.7228.105>,
                                 rabbit_tcp_client_sup,rabbit_sup,<0.88.0>],
                                gen,init_it,
                                [gen_server2,<0.31553.124>,<0.31553.124>,
                                 rabbit_limiter,
                                 [{<<"172.28.0.51:33150 -> 172.28.8.220:5672">>,
                                   1}],
                                 []]],
                               [link]],
                              []},
                          {proc_lib,start_link,5,
                              [{file,"proc_lib.erl"},{line,330}]},
                          {supervisor2,do_start_child,2,[]},
                          {supervisor2,start_children,3,[]},
                          {supervisor2,init_children,2,[]},
                          {gen_server,init_it,6,
                              [{file,"gen_server.erl"},{line,328}]},
                          {proc_lib,init_p_do_apply,3,
                              [{file,"proc_lib.erl"},{line,240}]}]}}
     Offender:   [{pid,undefined},
                  {name,limiter},
                  {mfargs,
                      {rabbit_limiter,start_link,
                          [{<<"172.28.0.51:33150 -> 172.28.8.220:5672">>,1}]}},
                  {restart_type,transient},
                  {shutdown,4294967295},
                  {child_type,worker}]


=CRASH REPORT==== 28-Apr-2018::01:57:14 ===
  crasher:
    initial call: rabbit_reader:init/2
    pid: <0.29614.92>
    registered_name: []
    exception error: a system limit has been reached
      in function  erlang:spawn_opt/1
         called as erlang:spawn_opt({erlang,apply,
                                     [#Fun<rabbit_net.1.86161000>,[]],
                                     [monitor]})
      in call from spawn_monitor/1 
      in call from rabbit_net:fast_close/1 
      in call from rabbit_reader:'-start_connection/5-after$^0/0-0-'/1 (src/rabbit_reader.erl, line 286)
      in call from rabbit_reader:start_connection/5 (src/rabbit_reader.erl, line 269)
    ancestors: [<0.30344.91>,rabbit_tcp_client_sup,rabbit_sup,<0.88.0>]
    messages: []
    links: [<0.30344.91>]
    dictionary: [{{channel,1},
                   {<0.32356.92>,{method,rabbit_framing_amqp_0_9_1}}},
                  {process_name,
                      {rabbit_reader,
                          <<"172.28.1.76:37986 -> 172.28.8.220:5672">>}},
                  {{ch_pid,<0.32356.92>},{1,#Ref<0.0.262188.104540>}}]
    trap_exit: true
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 8756
  neighbours:

=SUPERVISOR REPORT==== 28-Apr-2018::01:57:14 ===
     Supervisor: {<0.30344.91>,rabbit_connection_sup}
     Context:    child_terminated
     Reason:     system_limit
     Offender:   [{pid,<0.29614.92>},
                  {name,reader},
                  {mfargs,{rabbit_reader,start_link,[<0.29579.92>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]





Luke Bakken

unread,
Jul 13, 2018, 10:43:45 AM7/13/18
to rabbitmq-users
RabbitMQ 3.5.7 is very, very, very old. Do not use it in any way.

Please use RabbitMQ 3.7.7 and Erlang 21.

If you still experience errors, you must tell us exactly what commands you are running. Just describing what you are doing is not sufficient for anyone to try to reproduce what you're testing.

Thanks,
Luke

Michael Klishin

unread,
Jul 13, 2018, 12:17:04 PM7/13/18
to rabbitm...@googlegroups.com
> a system limit has been reached

in a module responsible for handling connections suggests it is a kernel/ulimit limit such as [1].



--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

unread,
Jul 13, 2018, 12:17:29 PM7/13/18
to rabbitm...@googlegroups.com
…in which case [1] is also relevant here.


To post to this group, send email to rabbitm...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
--
MK

Staff Software Engineer, Pivotal/RabbitMQ
Message has been deleted

Fei Jiang

unread,
Jul 13, 2018, 10:08:42 PM7/13/18
to rabbitmq-users
step detailed:
1、start Node0 and Node1 2 cluster, aslo set_policy ha_mode=all
2、 use oslo.messaging (producer and consumer) connected to Node0,we create 25K queues and 25k channel(connection) in cluster
3、rabbitmqctl stop Node0
4、oslo.messaging (producer and consumer) connected to Node1
5、Node1 accepting 25K tcp connection , and error in Node1,heartBeat missed close tcp
6、rabbitmqctl stop Node1,oslo.messaging (producer and consumer)  also connected to Node1
7、rabbitmq-server --detached, restart Node1, work well, also 25K TCP connected Node1,there is no error in Node1


rabbimq.config
 {disk_free_limit,1024000000}, --1G disk
    {vm_memory_high_watermark,0.257420640158}, --- 100G mem
    {vm_memory_high_watermark_paging_ratio, 0.9},


{tcp_listen_options, [binary, {packet, raw}, {reuseaddr, true}, {backlog, 4096}, {nodelay, true}, {exit_on_close, false}, {keepalive, true}]},

ulimit info:
core file size (blocks, -c) 0 data seg size (kbytes, -d) 31457280 scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 127234 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 65536 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) unlimited cpu time (seconds, -t) unlimited max user processes (-u) 16384 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited


cat /etc/security/limits.conf |grep nofile
#        - nofile - max number of open file descriptors
* soft nofile 65536
* hard nofile 65536


在 2018年7月13日星期五 UTC+8下午10:43:45,Luke Bakken写道:
Message has been deleted

Fei Jiang

unread,
Jul 14, 2018, 2:41:25 AM7/14/18
to rabbitmq-users
thanks michael, 
step detailed:
1、start Node0 and Node1 2 cluster, aslo set_policy ha_mode=all
2、 use oslo.messaging (producer and consumer) connected to Node0,we create 25K queues and 25k channel(connection) in cluster
3、rabbitmqctl stop Node0
4、oslo.messaging (producer and consumer) connected to Node1
5、Node1 accepting 25K tcp connection , and error in Node1,heartBeat missed close tcp
6、rabbitmqctl stop Node1,oslo.messaging (producer and consumer)  also connected to Node1
7、rabbitmq-server --detached, restart Node1, work well, also 25K TCP connected Node1,there is no error in Node1

my problem is not system resource-limit. I suspect that it was due to too much TCP for a short time, which triggered a heartbeat problem that caused the resource to leak

if i restart the Node1, it will work well,there is only little connection will result ‘missed heartbeat and closing connection'; 

rabbitmq process info:

/usr/lib64/erlang/erts-7.3.1/bin/beam.smp -W w -A 128 -P 1048576 -K true -- -root /usr/lib64/erlang -progname erl -- -home /usr/local/lib/rabbitmq -- -pa /usr/local/lib/rabbitmq/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@rabbitmqNode0 -boot start_sasl -config /usr/local/lib/rabbitmq/sbin/../etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<15000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<15000:64/native>>}] 
在 2018年7月14日星期六 UTC+8上午12:17:29,Michael Klishin写道:
…in which case [1] is also relevant here.

On Fri, Jul 13, 2018 at 7:16 PM, Michael Klishin <mkli...@pivotal.io> wrote:
> a system limit has been reached

in a module responsible for handling connections suggests it is a kernel/ulimit limit such as [1].


On Fri, Jul 13, 2018 at 5:43 PM, Luke Bakken <lba...@pivotal.io> wrote:
RabbitMQ 3.5.7 is very, very, very old. Do not use it in any way.

Please use RabbitMQ 3.7.7 and Erlang 21.

If you still experience errors, you must tell us exactly what commands you are running. Just describing what you are doing is not sufficient for anyone to try to reproduce what you're testing.

Thanks,
Luke

On Friday, July 13, 2018 at 12:39:18 AM UTC-7, Fei Jiang wrote:
 am working on rabbitmq performance. I am using https://github.com/openstack/oslo.messaging/blob/master/tools/simulator.py  measure the performance.
So, I am running the tool for following use-cases and test setup

Memory :  100GB
CPU:  40 Vcpu
RabbitMQ version : 3.5.7
socket and fd  : 65535

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.

To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
MK

Staff Software Engineer, Pivotal/RabbitMQ
Reply all
Reply to author
Forward
0 new messages