RabbitMQ 3.8.7 HA is broken after upgrade from Erlang 22.0.2 to 23.2.7


jasvinde...@gmail.com

Mar 8, 2021, 12:09:04 PM
to rabbitmq-users
Team,

I am wondering if someone knows what is going on, or whether I should file a bug. We are using a RabbitMQ HA setup, and after upgrading Erlang OTP from 22.0.2 to 23.2.7 we have problems with the HA cluster. A sudden restart of one of the members produces "emulator discarding messages" errors, and that member never recovers. If we roll back to 22.0.2 we do not have any of these problems.

2021-03-08 16:53:39.052 [error] emulator Discarding message {'$gen_call',{<0.1341.0>,#Ref<0.3121386697.1169686531.88317>},stat} from <0.1341.0> to <0.4171.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] <0.1338.0> Discarding message {'$gen_call',{<0.1338.0>,#Ref<0.3121386697.1169686532.73676>},stat} from <0.1338.0> to <0.4181.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] emulator Discarding message {'$gen_call',{<0.1338.0>,#Ref<0.3121386697.1169686532.73676>},stat} from <0.1338.0> to <0.4181.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] <0.1343.0> Discarding message {'$gen_call',{<0.1343.0>,#Ref<0.3121386697.1169686531.88323>},stat} from <0.1343.0> to <0.4157.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] <0.1332.0> Discarding message {'$gen_call',{<0.1332.0>,#Ref<0.3121386697.1169686529.99253>},stat} from <0.1332.0> to <0.4160.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] emulator Discarding message {'$gen_call',{<0.1343.0>,#Ref<0.3121386697.1169686531.88323>},stat} from <0.1343.0> to <0.4157.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] <0.1337.0> Discarding message {'$gen_call',{<0.1337.0>,#Ref<0.3121386697.1169686531.88324>},stat} from <0.1337.0> to <0.4181.0> in an old incarnation (1615219349) of this node (1615222369)

2021-03-08 16:53:39.054 [error] emulator Discarding message {'$gen_call',{<0.1332.0>,#Ref<0.3121386697.1169686529.99253>},stat} from <0.1332.0> to <0.4160.0> in an old incarnation (1615219349) of this node (1615222369)

RabbitMQ version 3.8.7
Erlang OTP 23.2.7 --- non-working version
Erlang OTP 22.0.2 --- works fine

with the following policies for all the components:

neutron ha_ttl_neutron  ^(?!(amq\.|reply_)).*   all     {"ha-mode":"all","ha-sync-mode":"automatic","message-ttl":70000}        0
neutron ro-ha-policy    ^ro\.*  queues  {"ha-mode":"all","ha-sync-mode":"automatic","message-ttl":86400000}     1
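
For reference, a policy like the first one can be declared with rabbitmqctl set_policy; a sketch using the values from the listing above (the exact original invocation is an assumption):

rabbitmqctl set_policy -p neutron --priority 0 --apply-to all \
    ha_ttl_neutron '^(?!(amq\.|reply_)).*' \
    '{"ha-mode":"all","ha-sync-mode":"automatic","message-ttl":70000}'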

rabbitmq@clcp-openstack-rabbitmq-rabbitmq-0:/$ rabbitmqctl cluster_status
RABBITMQ_ERLANG_COOKIE env variable support is deprecated and will be REMOVED in a future version. Use the $HOME/.erlang.cookie file or the --erlang-cookie switch instead.
Cluster status of node rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local ...
Basics

Cluster name: rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local

Disk Nodes

rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local
rab...@clcp-openstack-rabbitmq-rabbitmq-1.openstack-rabbitmq.openstack.svc.cluster.local

Running Nodes

rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local
rab...@clcp-openstack-rabbitmq-rabbitmq-1.openstack-rabbitmq.openstack.svc.cluster.local

Versions

rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local: RabbitMQ 3.8.7 on Erlang 23.2.7
rab...@clcp-openstack-rabbitmq-rabbitmq-1.openstack-rabbitmq.openstack.svc.cluster.local: RabbitMQ 3.8.7 on Erlang 23.2.7

Alarms

(none)

Network Partitions

(none)

Listeners

Node: rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rab...@clcp-openstack-rabbitmq-rabbitmq-0.openstack-rabbitmq.openstack.svc.cluster.local, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rab...@clcp-openstack-rabbitmq-rabbitmq-1.openstack-rabbitmq.openstack.svc.cluster.local, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rab...@clcp-openstack-rabbitmq-rabbitmq-1.openstack-rabbitmq.openstack.svc.cluster.local, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rab...@clcp-openstack-rabbitmq-rabbitmq-1.openstack-rabbitmq.openstack.svc.cluster.local, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0

Feature flags

Flag: drop_unroutable_metric, state: disabled
Flag: empty_basic_get_metric, state: disabled
Flag: implicit_default_bindings, state: enabled
Flag: quorum_queue, state: enabled
Flag: virtual_host_metadata, state: enabled
rabbitmq@clcp-openstack-rabbitmq-rabbitmq-0:/$ rabbitmqctl list_policies -p neutron
RABBITMQ_ERLANG_COOKIE env variable support is deprecated and will be REMOVED in a future version. Use the $HOME/.erlang.cookie file or the --erlang-cookie switch instead.
Listing policies for vhost "neutron" ...
vhost   name    pattern apply-to        definition      priority
neutron ha_ttl_neutron  ^(?!(amq\.|reply_)).*   all     {"ha-mode":"all","ha-sync-mode":"automatic","message-ttl":70000}        0
neutron ro-ha-policy    ^ro\.*  queues  {"ha-mode":"all","ha-sync-mode":"automatic","message-ttl":86400000}     1
rabbitmq@clcp-openstack-rabbitmq-rabbitmq-0:/$ 


Can any of you please suggest whether I should file a bug against Erlang or RabbitMQ for this?


Regards,
Jasvinder


jo...@cloudamqp.com

Mar 8, 2021, 12:34:31 PM
to rabbitmq-users
Hi,
I fail to see what is not working.
"[error] emulator Discarding message" is a benign message (search the archives for it) and does not indicate that your node or cluster is "broken". You say that the member never recovers? What is the full log for that member?

jasvinder singh kwatra

Mar 8, 2021, 12:36:27 PM
to rabbitm...@googlegroups.com
There are no logs at all except these, which fill up the log file like crazy for hours.

Regards,
Jasvinder

jasvinder singh kwatra

Mar 8, 2021, 3:16:26 PM
to rabbitm...@googlegroups.com
Hello Team,


Please find the restart log attached; let me know if anything else is needed. Thank you for looking into this.

Regards,
Jasvinder
clcp-openstack-rabbitmq-rabbitmq-1-after-first-restart.log

jo...@cloudamqp.com

Mar 8, 2021, 5:26:26 PM
to rabbitmq-users
The log you sent was from a node that started fine, so it did indeed recover.
2021-03-08 20:10:05.672 [info] <0.7487.0> Server startup complete; 5 plugins started.

You can even see that it is accepting connections:
2021-03-08 20:10:27.668 [info] <0.13271.0> accepting AMQP connection <0.13271.0> (10.97.197.110:45440 -> 10.97.189.182:5672)
2021-03-08 20:10:27.669 [info] <0.13271.0> Connection <0.13271.0> (10.97.197.110:45440 -> 10.97.189.182:5672) has a client-provided name: health-probe.py:729714:3c06f80a-6b97-4360-8c55-1de0a0df2bc1
2021-03-08 20:10:27.669 [info] <0.13271.0> connection <0.13271.0> (10.97.197.110:45440 -> 10.97.189.182:5672 - health-probe.py:729714:3c06f80a-6b97-4360-8c55-1de0a0df2bc1): user 'ranger-agent-rabbitmq-user' authenticated and granted access to vhost 'ranger-agent'

jasvinder singh kwatra

Mar 8, 2021, 5:45:04 PM
to rabbitm...@googlegroups.com
Only this agent recovered; if you grep for the discarded messages, the count is 10k. All the OpenStack services are unable to connect to the node. Though the cluster appears recovered on paper, it is not functional. Yes, I saw that only the ranger application recovered, but this node is supposed to handle 15 more services. The rate of discard messages per second is huge, as I can see from tailing the logs. So I need to know what is triggering this behavioral problem on OTP 23.x.
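
A rough count can be taken with something like this (the log path is an assumption; adjust for your deployment):

grep -c 'in an old incarnation' /var/log/rabbitmq/rabbit@<node>.log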

Regards,
Jas

jo...@cloudamqp.com

Mar 10, 2021, 11:31:41 AM
to rabbitmq-users
I think the first thing to try is to update RabbitMQ to 3.8.14.

jasvinder singh kwatra

Mar 10, 2021, 1:14:17 PM
to rabbitm...@googlegroups.com
I tried 3.8.11 with the same results. I will give 3.8.14 a shot.

Regards,
Jasvinder 


jasvinde...@gmail.com

Mar 17, 2021, 12:22:43 PM
to rabbitmq-users
Tried the new version as well, with the same result. I got the chance to talk to an Erlang dev; see the reply below.

What I have found so far is that restarting RabbitMQ together with epmd triggers the assignment of a new "creation" value in Erlang, which in turn triggers this. I am still able to reproduce it on Erlang 22, but much less frequently; on 23.x it is very prominent, because the creation value is now 32 bits rather than the 2 bits of 22.x (where values could be reused across node restarts). I would like to open a bug for this against RabbitMQ; do you guys agree?
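
The current creation of a running node can be checked with a one-liner (a sketch, assuming rabbitmqctl can reach the node):

rabbitmqctl eval 'erlang:system_info(creation).'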


The reply from the Erlang dev is below.

Hi,

The process identifiers used when sending these messages identify a node with the same nodename as the receiving node, but it is an old instance of that node which is identified.

A node is identified by its name and an integer value called "creation", which is assigned when the Erlang distribution is started.

Both the nodename and the creation are stored in every process identifier. If the nodename matches but the creation does not when sending a message using a process identifier, the receiving node prints messages like those below and drops the message (since the receiving process does not exist on the node). In your case, the creation of the old instance is 1615219349 and that of the new instance is 1615222369.

Either the receiving node has been restarted with the same name, or the Erlang distribution on the receiving node has been restarted under the same name. In both cases a new creation will be assigned to the node and it will reject messages directed to the old instance of the node.

In OTP 23 we began using 32-bit creation values. In OTP 22 these values were only 2 bits; that is, in OTP 22 creation values were reused *very* quickly. This is probably the reason why you don't see this issue as often with OTP 22.
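
As a minimal sketch of this in an erl shell (assuming epmd is already running, a non-distributed erl, and a hypothetical free node name 'demo'):

%% Start distribution, note the creation, then restart distribution.
{ok, _} = net_kernel:start([demo, shortnames]),
C1 = erlang:system_info(creation),
net_kernel:stop(),
{ok, _} = net_kernel:start([demo, shortnames]),
C2 = erlang:system_info(creation).
%% On OTP 23+, C1 =/= C2: pids captured before the restart reference the
%% old incarnation, and messages sent to them are discarded as above.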

 

 

I think you have to turn to the RabbitMQ team for support regarding this, and/or the person(s) who configured your system.

Regarding Erlang support in general: if you work at Ericsson you can open a support issue in our internal Jira on the intranet; otherwise you can either turn to the community via the erlang-questions mailing list <https://erlang.org/mailman/listinfo/erlang-questions>, or, if you want to pay for support, turn to e.g. Erlang Solutions <https://www.erlang-solutions.com/>. Bug reports for Erlang/OTP can be filed at <https://github.com/erlang/otp/issues>.

 

Regards,

Rickard


> Hello Rickard,
>
> I hope you are doing great. I am working on one of the AT&T accounts
> from the Ericsson side, and I need your help in figuring out what
> exactly these errors are that are causing the problem.
> I have raised the question on the forum as well:
> https://groups.google.com/g/rabbitmq-users/c/Uyo-YjfEuB4/m/IBjTe2KTAwAJ
>
> We did some work, and we are still able to reproduce the problem with
> Erlang 22.0.2 as well, but the frequency is much lower. These messages
> are also seen during a fresh cluster boot. Can you please advise
> whether this is a bug, or something we need to do to start and stop
> the app cleanly?
> I really appreciate any help in understanding this.
>
> Regards,
> Jasvinder
