Problems with x-consistent-hash exchange after upgrade to 3.8.7


JK

Nov 20, 2020, 7:33:19 PM
to rabbitmq-users
Hello everyone,

I upgraded a cluster recently from 3.7.24 to 3.8.7. We have this cluster set up with the x-consistent-hash exchange plugin and are seeing warnings that seem to be generated from that plugin:
[root@heka1-2a ~]# tail /var/log/rabbitmq/rab...@heka1-2a.log
2020-11-21 00:07:24.005 [warning] <0.13187.3> Bucket 215 not found
2020-11-21 00:07:24.005 [warning] <0.13187.3> Bucket 215 not found
2020-11-21 00:07:24.005 [warning] <0.13187.3> Bucket 215 not found
2020-11-21 00:07:24.005 [warning] <0.13187.3> Bucket 215 not found
2020-11-21 00:07:24.005 [warning] <0.13187.3> Bucket 215 not found
2020-11-21 00:07:24.005 [warning] <0.13187.3> Bucket 215 not found
2020-11-21 00:07:24.006 [warning] <0.13187.3> Bucket 215 not found
2020-11-21 00:07:24.006 [warning] <0.13187.3> Bucket 215 not found
2020-11-21 00:07:24.006 [warning] <0.13187.3> Bucket 215 not found
2020-11-21 00:07:24.006 [warning] <0.13187.3> Bucket 215 not found

And I think all those messages are the cause of this warning I'm seeing:
2020-11-21 00:27:52.001 [warning]  lager_file_backend dropped 968 messages in the last second that exceeded the limit of 50 messages/sec
I assume that means log messages dropped, not messages from queues?

I did the upgrade on Nov 10th, and those have been flooding the logs since then. At first, messages were still making it through the broker to the consumers, but on Monday delivery suddenly stopped. We see publishes, but nothing reaches the consumers, and the messages aren't being retained either. The attached image shows this behavior. I don't see any errors in the logs, though, just those warnings about buckets not being found and lager discarding messages. Any thoughts on how I should proceed here?
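[For context on what these warnings mean: the plugin keeps a ring that maps bucket numbers to bound queues, and a message whose hash lands on a bucket with no owner cannot be routed. Below is a minimal Python sketch of that lookup, illustrative only; the real plugin is written in Erlang and uses a different hash function.]

```python
import hashlib

def build_ring(bindings):
    """Build a bucket -> queue map. Each binding's key is its weight:
    the number of consecutive buckets the queue claims on the ring."""
    ring, next_bucket = {}, 0
    for queue, weight in bindings:
        for b in range(next_bucket, next_bucket + weight):
            ring[b] = queue
        next_bucket += weight
    return ring, next_bucket

def route(ring, total_buckets, routing_key):
    """Hash the routing key onto a bucket and look up the owning queue.
    If the ring state is stale (total_buckets disagrees with the map),
    the lookup can land on a bucket that no queue owns, the analogue
    of the plugin's 'Bucket N not found' warning."""
    digest = int(hashlib.md5(routing_key.encode()).hexdigest(), 16)
    bucket = digest % total_buckets
    queue = ring.get(bucket)
    if queue is None:
        print(f"Bucket {bucket} not found")  # the message is unroutable
    return queue
```

[With a consistent ring, every bucket has an owner. The flood of warnings above suggests the stored ring state and the bucket count have fallen out of sync, so some hashes land on buckets no queue claims and those messages go nowhere.]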

Thank you,
-J 
[Attachment: publishes but no messages.png]

met...@gmail.com

Nov 22, 2020, 9:50:29 AM
to rabbitmq-users
Saw this too. You can recreate the exchange to fix it, but the problem will come back eventually. In the end we just switched to the x-modulus-hash exchange instead.

Michael Klishin

Nov 23, 2020, 8:22:41 AM
to rabbitmq-users
We need some information to be able to reproduce. Can you share a runnable example that demonstrates what your applications do? Or all definitions except for,
say, users?

Michael Klishin

Nov 23, 2020, 8:24:38 AM
to rabbitmq-users
Note that there was an important core change in 3.8.5 that required the plugin to be updated.
That change was confirmed as effective by several users, and we have not seen this reported since then [1].

JK

Nov 23, 2020, 3:30:53 PM
to rabbitmq-users
Hi Michael,

I've included our sanitized definitions for this broker. The exchange most impacted is heka.logs. I ran this and found it odd:

[root@heka1-2a ~]# rabbitmq-diagnostics consistent_hash_exchange_ring_state heka.logs
Inspecting consistent hashing ring state for exchange heka.logs in virtual host '/'...
Error:
Exchange does not exist or is of a different type

This is a two node cluster and the other node shows the same error output.

As for runnable examples, we are using Heka to ingest logs on the application hosts and publish them to RabbitMQ. On our Elasticsearch hosts, Heka consumes those messages and puts them into Elasticsearch. I think it was set up on 3.5.x or 3.6.x, so this is a relatively old setup that I inherited. I know what the plugin does at a high level, but I'm still learning how it works under the hood.

This is the Heka config on the various application hosts. The url uses a round-robin CNAME that resolves to either node:
[RabbitmqLogsOutput]
type = 'AMQPOutput'
encoder = 'ProtobufEncoder'
exchange = 'heka.logs'
exchange_auto_delete = false
exchange_durability = true
exchange_type = 'x-consistent-hash'
message_matcher = '(Logger != "hekad" && Type != "sandbox.output" && Type != "stats" && Type != "heka.sandbox.stats") || Fields[http_user_agent] == NIL) && (Fields[programname] !~ /puppet-agent|xinetd/ || Fields[programname] == NIL)'
persistent = true
url = 'amqps://[REDACTED]@heka.las:5671/heka'
use_buffering = true 

This is the Heka config on the Elasticsearch hosts. There are actually two config files, one for each RabbitMQ node; they're identical except that the other node uses logs-es0.las-b for the queue, and the url in both uses specific node names instead of the CNAME. There are three of these ES hosts, so the queue name referenced in each one is specific to the host, but they all follow the same scheme, logs-es[012].las-[ab]:
[RabbitmqLogsInput-Heka1-2a]
type = 'AMQPInput'
can_exit = true
decoder = 'ProtobufGeoIp2MultiDecoder'
exchange = 'heka.logs'
exchange_auto_delete = false
exchange_durability = true
exchange_type = 'x-consistent-hash'
prefetch_count = 100
queue = 'logs-es0.las-a'
queue_auto_delete = true
queue_durability = false
queue_ttl = 1200000
routing_key = '224'
splitter = 'HekaFramingSplitter'
url = 'amqps://[REDACTED]@heka1-2a.las:5671/heka'

The Elasticsearch hosts also use this DLX config, which uses the CNAME and, like the config above, a queue name specific to each host:
[RabbitmqDlxLogsInput]
type = 'AMQPInput'
can_exit = true
decoder = 'ProtobufGeoIp2MultiDecoder'
exchange = 'heka.dlx.logs'
exchange_auto_delete = false
exchange_durability = true
exchange_type = 'x-consistent-hash'
prefetch_count = 100
queue = 'dlx-logs-es0.las'
queue_auto_delete = true
queue_durability = false
queue_ttl = 1200000
routing_key = '224'
splitter = 'HekaFramingSplitter'
url = 'amqps://[REDACTED]@heka.las:5671/heka'
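[One detail of these configs worth calling out: with x-consistent-hash, the binding key (routing_key = '224' here) is not a routing pattern but a weight, i.e. the number of buckets the queue claims on the ring, so three consumer queues each bound with 224 split the key space roughly evenly. A quick simulation in plain Python, with md5 standing in for the plugin's hash and queue names borrowed from the configs above:]

```python
import hashlib
import random
import string

WEIGHT = 224  # each consumer binds with key '224', claiming 224 buckets

def bucket_of(key, total_buckets):
    # md5 stands in for the plugin's hash; only the spread matters here
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % total_buckets

queues = ["logs-es0.las-a", "logs-es1.las-a", "logs-es2.las-a"]
total = WEIGHT * len(queues)  # 672 buckets on the ring
owner = {b: queues[b // WEIGHT] for b in range(total)}

# Hash 30,000 random routing keys and count where they land.
rng = random.Random(42)
counts = {q: 0 for q in queues}
for _ in range(30_000):
    key = "".join(rng.choices(string.ascii_lowercase, k=12))
    counts[owner[bucket_of(key, total)]] += 1

for q, n in counts.items():
    print(q, n)  # each queue sees roughly a third of the keys
```

[Each queue ends up with close to a third of the keys, which is why identical binding keys give an even split regardless of the number's actual value.]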

Thank you,
-J
[Attachment: las-heka-definitions.json]

JK

Dec 1, 2020, 8:17:47 PM
to rabbitmq-users
Recreating the exchange seemed to resolve this. Since the issue Michael linked to involved that error coming up on restarts, I made sure to restart a node to see if the problem returned; it did not, and I'm seeing normal levels of traffic through this cluster. I assume this was an edge case of that issue that wasn't covered because it was an upgrade from 3.7.24. I don't expect you all to try to verify this using hekad, but that's what we're using if you still want to reproduce the issue. The OS is CentOS 7 running their latest kernel.

Thanks,
-J

JK

Dec 3, 2020, 2:06:18 PM
to rabbitmq-users
I spoke too soon. We did a blue/green cutover to a fresh 3.8.7 install for production, and the problem reappeared after a reboot of one of the nodes, fairly quickly at that:
[root@heka1-2a ~]# grep Bucket /var/log/rabbitmq/rab...@heka1-2a.log | head
2020-12-03 03:35:27.000 [warning] <0.11477.1> Bucket 1530 not found
2020-12-03 03:35:27.000 [warning] <0.11404.1> Bucket 1747 not found
2020-12-03 03:35:27.001 [warning] <0.11477.1> Bucket 1530 not found
2020-12-03 03:35:27.001 [warning] <0.11404.1> Bucket 1747 not found
2020-12-03 03:35:27.002 [warning] <0.11021.1> Bucket 1356 not found
2020-12-03 03:35:27.002 [warning] <0.10754.3> Bucket 2142 not found
2020-12-03 03:35:27.003 [warning] <0.19490.0> Bucket 2142 not found
2020-12-03 03:35:27.003 [warning] <0.19490.0> Bucket 2142 not found
2020-12-03 03:35:27.003 [warning] <0.18431.0> Bucket 2142 not found
2020-12-03 03:35:27.004 [warning] <0.19490.0> Bucket 2142 not found

The restart of the node was at 00:32 UTC, so roughly 3 hours after the node came back we started seeing this. It ate all the messages going into it, as I reported before for our DR cluster. I tried recreating the exchange, hoping it would resolve this like it did in our DR cluster, but that did not work, and we ended up downgrading back to 3.7.24 since that was a known-good version with this plugin.

The two clusters have nearly identical configuration, with the exception of the Erlang cookie and location-specific host names. Since this cluster was a fresh install, and not an upgrade from a previous version like our DR cluster, it looks like the bug in this plugin was not completely resolved. Let me know what other information I can provide.

Thank you,
-J

JK

Dec 8, 2020, 4:59:17 PM
to rabbitmq-users
Hi everyone,

This did come back on the first cluster I reported this for, on 12/04. It had been working since 11/23, when I recreated the exchanges, which is far longer than the other two times I reported (days, hours, and weeks respectively), so the time until failure seems inconsistent. I'll be downgrading that cluster as well. It looks like the fix Michael mentioned was in 3.8.4 and addressed this bug according to the release notes, but reading that bug report, I don't believe we're running into the same issue. Similar symptom with the same plugin, but probably a different cause? I'll see if I can cook up some Python scripts that replicate this problem in our test environment without Heka, using just the same exchange plugin. If I have "success" replicating the failure, I'll post the scripts and more information.
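[A sketch of what such a repro publisher might look like, assuming the pika client library and a local broker with default credentials; the exchange name and binding weight mirror the Heka setup earlier in the thread. The key generator is split out, since the routing keys only matter for their hashes:]

```python
import random
import string

def random_routing_keys(n, length=16, seed=None):
    """Routing keys only matter for their hash, so random strings will do."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(n)]

def publish_repro(host="localhost", n=10_000):
    """Declare the exchange, bind a throwaway queue with weight 224,
    and publish n messages. Requires a running broker; call manually."""
    import pika  # pip install pika; imported here so the helper above works offline

    conn = pika.BlockingConnection(pika.ConnectionParameters(host))
    ch = conn.channel()
    ch.exchange_declare(exchange="heka.logs",
                        exchange_type="x-consistent-hash", durable=True)
    q = ch.queue_declare(queue="", exclusive=True).method.queue
    ch.queue_bind(queue=q, exchange="heka.logs",
                  routing_key="224")  # weight, not a pattern
    for key in random_routing_keys(n):
        ch.basic_publish(exchange="heka.logs", routing_key=key, body=b"x")
    conn.close()
```

[Watching the queue's message counts while restarting a node mid-publish would be the next step toward reproducing the "Bucket not found" flood.]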

Until then, if there's any other info I can provide then please let me know.

Thank you,
-J

JK

Feb 2, 2021, 7:11:07 PM
to rabbitmq-users
Hello again, 

I haven't had a chance to create those test scripts to see if I can replicate this with Python producers/consumers. I'm circling back on this now and will get started on them. I'll test with 3.8.11, since that was released recently, and also 3.8.7, since that is where we encountered the issue; I'd still like to replicate this with the same setup but without Heka involved, to confirm whether RabbitMQ or Heka is causing this. In the meantime, if anything can be determined from the information I've already provided, or if more info is needed, please let me know.

Thank you,
-J

yuki

Mar 3, 2021, 9:41:42 AM
to rabbitmq-users

I have a similar problem, but it's not exactly the same.

RabbitMQ version: 3.8.8, exchange: x-consistent-hash

I use Spring Boot's @RabbitListener to automatically create the queue (autoDelete: true) and the binding (binding key: 1).
3 node cluster.
Policy: ha-mode: all, pattern: ^

Problem: when I publish a message, it is published but not routed, even though I have a queue bound to this exchange.
Sometimes no routing key can be routed; sometimes only some routing keys can't be routed.

I can't always reproduce this problem.
Sometimes rabbitmqctl stop_app or restarting the Spring Boot application will reproduce it.

yuki

Mar 3, 2021, 9:00:10 PM
to rabbitmq-users
[Attachment: IMG_20210304_095525.jpg]

yuki

Mar 3, 2021, 9:01:57 PM
to rabbitmq-users
[Attachment: IMG_20210304_095427.jpg]


yuki

Mar 3, 2021, 10:04:45 PM
to rabbitmq-users
For more info:
When I publish messages, the logs on all 3 nodes always print lines like:
[error] <0.27208.1523> Supervisor {<0.27208.1523>,amqp_channel_sup_sup} had child channel_sup started with amqp_channel_sup:start_link(direct, <0.27218.1523>, <<"<rab...@xxxx1.3.27218.1523>">>) at undefined exit with reason shutdown in context shutdown_error.
and the queue's node prints:
Bucket xx not found.

When I stop publishing messages, the error logging stops.

M K

Mar 4, 2021, 5:01:31 AM
to rabbitmq-users
We cannot suggest much without any server logs or a way to reproduce
with 3.8.14.

The plugin now provides a command that lets you inspect the state of the consistent hashing ring,
and inspection of bindings has been available for many years. See `rabbitmq-diagnostics help`
(the plugin must be enabled on the node where you run it).
