RabbitMQ 3.8.7 HA is broken after upgrade from Erlang 22.0.2 to 23.2.7


jasvinde...@gmail.com

Mar 8, 2021, 12:09:04 PM
to rabbitmq-users
Team,

I am wondering if someone knows what is going on, or whether I should file a bug. We are using a RabbitMQ HA setup, and after upgrading Erlang OTP from 22.0.2 to 23.2.7 we have problems with the HA cluster. A sudden restart of one of the members produces "emulator discarding messages" errors, and that member never recovers. If we roll back to 22.0.2 we do not have any of these problems.

2021-03-08 16:53:39.052 [error] emulator Discarding message {'$gen_call',{<0.1341.0>,#Ref<0.3121386697.1169686531.88317>},stat} from <0.1341.0> to <0.4171.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] <0.1338.0> Discarding message {'$gen_call',{<0.1338.0>,#Ref<0.3121386697.1169686532.73676>},stat} from <0.1338.0> to <0.4181.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] emulator Discarding message {'$gen_call',{<0.1338.0>,#Ref<0.3121386697.1169686532.73676>},stat} from <0.1338.0> to <0.4181.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] <0.1343.0> Discarding message {'$gen_call',{<0.1343.0>,#Ref<0.3121386697.1169686531.88323>},stat} from <0.1343.0> to <0.4157.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] <0.1332.0> Discarding message {'$gen_call',{<0.1332.0>,#Ref<0.3121386697.1169686529.99253>},stat} from <0.1332.0> to <0.4160.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] emulator Discarding message {'$gen_call',{<0.1343.0>,#Ref<0.3121386697.1169686531.88323>},stat} from <0.1343.0> to <0.4157.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] <0.1337.0> Discarding message {'$gen_call',{<0.1337.0>,#Ref<0.3121386697.1169686531.88324>},stat} from <0.1337.0> to <0.4181.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] emulator Discarding message {'$gen_call',{<0.1332.0>,#Ref<0.3121386697.1169686529.99253>},stat} from <0.1332.0> to <0.4160.0> in an old incarnation (1615219349) of this node (1615222369)

RabbitMQ version 3.8.7
Erlang OTP 23.2.7 --- non-working version
Erlang OTP 22.0.2 --- works fine

with the following policies for all the components:

neutron ha_ttl_neutron  ^(?!(amq\.|reply_)).*   all     {"ha-mode":"all","ha-sync-mode":"automatic","message-ttl":70000}        0
neutron ro-ha-policy    ^ro\.*  queues  {"ha-mode":"all","ha-sync-mode":"automatic","message-ttl":86400000}     1
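
For reference, a policy like the first one can be declared with rabbitmqctl set_policy; a sketch using the values from the listing above (the exact original invocation is an assumption):

rabbitmqctl set_policy -p neutron --priority 0 --apply-to all \
    ha_ttl_neutron '^(?!(amq\.|reply_)).*' \
    '{"ha-mode":"all","ha-sync-mode":"automatic","message-ttl":70000}'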

rabbitmq@clcp-openstack-rabbitmq-rabbitmq-0:/$ rabbitmqctl cluster_status
RABBITMQ_ERLANG_COOKIE env variable support is deprecated and will be REMOVED in a future version. Use the $HOME/.erlang.cookie file or the --erlang-cookie switch instead.
Cluster status of node rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local ...
Basics

Cluster name: rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local

Disk Nodes

rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local
rab...@clcp-openstack-rabbitmq-rabbitmq-1.openstack-rabbitmq.openstack.svc.cluster.local

Running Nodes

rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local
rab...@clcp-openstack-rabbitmq-rabbitmq-1.openstack-rabbitmq.openstack.svc.cluster.local

Versions

rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local: RabbitMQ 3.8.7 on Erlang 23.2.7
rab...@clcp-openstack-rabbitmq-rabbitmq-1.openstack-rabbitmq.openstack.svc.cluster.local: RabbitMQ 3.8.7 on Erlang 23.2.7

Alarms

(none)

Network Partitions

(none)

Listeners

Node: rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rab...@clcp-openstack-rabbitmq-rabbitmq-1.openstack-rabbitmq.openstack.svc.cluster.local, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rab...@clcp-openstack-rabbitmq-rabbitmq-1.openstack-rabbitmq.openstack.svc.cluster.local, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rab...@clcp-openstack-rabbitmq-rabbitmq-1.openstack-rabbitmq.openstack.svc.cluster.local, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0

Feature flags

Flag: drop_unroutable_metric, state: disabled
Flag: empty_basic_get_metric, state: disabled
Flag: implicit_default_bindings, state: enabled
Flag: quorum_queue, state: enabled
Flag: virtual_host_metadata, state: enabled
rabbitmq@clcp-openstack-rabbitmq-rabbitmq-0:/$ rabbitmqctl list_policies -p neutron
RABBITMQ_ERLANG_COOKIE env variable support is deprecated and will be REMOVED in a future version. Use the $HOME/.erlang.cookie file or the --erlang-cookie switch instead.
Listing policies for vhost "neutron" ...
vhost   name    pattern apply-to        definition      priority
neutron ha_ttl_neutron  ^(?!(amq\.|reply_)).*   all     {"ha-mode":"all","ha-sync-mode":"automatic","message-ttl":70000}        0
neutron ro-ha-policy    ^ro\.*  queues  {"ha-mode":"all","ha-sync-mode":"automatic","message-ttl":86400000}     1
rabbitmq@clcp-openstack-rabbitmq-rabbitmq-0:/$ 


Can any of you please suggest whether I should file a bug against Erlang or RabbitMQ for this?


Regards,
Jasvinder


jo...@cloudamqp.com

Mar 8, 2021, 12:34:31 PM
to rabbitmq-users
Hi,
I fail to see what is not working.
"[error] emulator Discarding message" is a benign message (search the archives for it) and does not indicate that your node or cluster is "broken". You say that the member never recovers? What is the full log for that member?

jasvinder singh kwatra

Mar 8, 2021, 12:36:27 PM
to rabbitm...@googlegroups.com
There are no logs at all except these, which fill up the log file like crazy for hours.

Regards,
Jasvinder

jasvinder singh kwatra

Mar 8, 2021, 3:16:26 PM
to rabbitm...@googlegroups.com
Hello Team,


Please find the restart log attached; let me know if anything else is needed. Thank you for looking into this.

Regards,
Jasvinder
clcp-openstack-rabbitmq-rabbitmq-1-after-first-restart.log

jo...@cloudamqp.com

Mar 8, 2021, 5:26:26 PM
to rabbitmq-users
The log you sent was from a node that started fine, so it did indeed recover.
2021-03-08 20:10:05.672 [info] <0.7487.0> Server startup complete; 5 plugins started.

You can even see that it is accepting connections:
2021-03-08 20:10:27.668 [info] <0.13271.0> accepting AMQP connection <0.13271.0> (10.97.197.110:45440 -> 10.97.189.182:5672)
2021-03-08 20:10:27.669 [info] <0.13271.0> Connection <0.13271.0> (10.97.197.110:45440 -> 10.97.189.182:5672) has a client-provided name: health-probe.py:729714:3c06f80a-6b97-4360-8c55-1de0a0df2bc1
2021-03-08 20:10:27.669 [info] <0.13271.0> connection <0.13271.0> (10.97.197.110:45440 -> 10.97.189.182:5672 - health-probe.py:729714:3c06f80a-6b97-4360-8c55-1de0a0df2bc1): user 'ranger-agent-rabbitmq-user' authenticated and granted access to vhost 'ranger-agent'

jasvinder singh kwatra

Mar 8, 2021, 5:45:04 PM
to rabbitm...@googlegroups.com
Only this agent recovered; if you grep for the discarded messages, the count is 10k. All the OpenStack services are unable to connect to the node. Though the cluster appears recovered on paper, it is not functional. Yes, I saw that only the ranger application recovered, but this node is supposed to handle 15 more services. The rate of discard messages per second is huge, as I can see from tailing the logs. So I need to know what is triggering this behavioral problem on OTP 23.x.
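
A rough count can be taken with something like this (the log path is an assumption; adjust for your deployment):

grep -c 'in an old incarnation' /var/log/rabbitmq/rabbit@<node>.log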

Regards,
Jas

jo...@cloudamqp.com

Mar 10, 2021, 11:31:41 AM
to rabbitmq-users
I think the first thing to try is to update RabbitMQ to 3.8.14.

jasvinder singh kwatra

Mar 10, 2021, 1:14:17 PM
to rabbitm...@googlegroups.com
I tried 3.8.11 with the same results. I will give 3.8.14 a shot.

Regards,
Jasvinder 


jasvinde...@gmail.com

Mar 17, 2021, 12:22:43 PM
to rabbitmq-users
Tried the new version as well, with the same result. I got the chance to talk to an Erlang dev; see the reply below.

What I have found so far is that restarting RabbitMQ together with epmd triggers the assignment of a new "creation" value in Erlang, which in turn triggers this. I am still able to reproduce it on Erlang 22, but much less frequently; on 23.x it is very prominent, because the creation value is now 32 bits rather than the 2 bits of 22.x (where values could be reused across node restarts). I would like to open a bug for this against RabbitMQ; do you guys agree?
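
The current creation of a running node can be checked with a one-liner (a sketch, assuming rabbitmqctl can reach the node):

rabbitmqctl eval 'erlang:system_info(creation).'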


The reply from the Erlang dev is below.

Hi,

The process identifiers used when sending these messages identify a node with the same nodename as the receiving node, but it is an old instance of that node which is identified.

A node is identified by its name and an integer value called "creation", which is assigned when the Erlang distribution is started.

Both the nodename and the creation are stored in every process identifier. If the nodename matches but the creation does not when sending a message using a process identifier, the receiving node prints messages like those below and drops the message (since the receiving process does not exist on the node). In your case, the creation of the old instance is 1615219349 and that of the new instance is 1615222369.

Either the receiving node has been restarted with the same name, or the Erlang distribution on the receiving node has been restarted under the same name. In both cases a new creation will be assigned to the node and it will reject messages directed to the old instance of the node.

In OTP 23 we began using 32-bit creation values. In OTP 22 these values were only 2 bits; that is, in OTP 22 creation values were reused *very* quickly. This is probably the reason why you don't see this issue as often with OTP 22.
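
As a minimal sketch of this in an erl shell (assuming epmd is already running, a non-distributed erl, and a hypothetical free node name 'demo'):

%% Start distribution, note the creation, then restart distribution.
{ok, _} = net_kernel:start([demo, shortnames]),
C1 = erlang:system_info(creation),
net_kernel:stop(),
{ok, _} = net_kernel:start([demo, shortnames]),
C2 = erlang:system_info(creation).
%% On OTP 23+, C1 =/= C2: pids captured before the restart reference the
%% old incarnation, and messages sent to them are discarded as above.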

 

 

I think you have to turn to the RabbitMQ team for support regarding this, and/or the person(s) who configured your system.

Regarding Erlang support in general: if you work at Ericsson you can open a support issue in our internal Jira on the intranet; otherwise you can either turn to the community via the erlang-questions mailing list <https://erlang.org/mailman/listinfo/erlang-questions>, or, if you want to pay for support, turn to e.g. Erlang Solutions <https://www.erlang-solutions.com/>. Bug reports for Erlang/OTP can be filed at <https://github.com/erlang/otp/issues>.

 

Regards,

Rickard


> Hello Rickard,
>
> I hope you are doing great. I am working on one of the AT&T accounts
> from the Ericsson side, and I need your help in figuring out what
> exactly these errors are that are causing the problem.
> I have raised the question on the forum as well:
> https://groups.google.com/g/rabbitmq-users/c/Uyo-YjfEuB4/m/IBjTe2KTAwAJ
>
> We did some work, and we are still able to reproduce the problem with
> Erlang 22.0.2 as well, but the frequency is much lower. These messages
> are also seen during a fresh cluster boot. Can you please advise
> whether this is a bug, or something we need to do to start and stop
> the app cleanly?
> I really appreciate any help in understanding this.
>
> Regards,
> Jasvinder
