Scaling down rabbitmq pods in a rabbitmq cluster (I am using the operator) in GKE


Anjitha M

Apr 7, 2021, 10:58:29 AM4/7/21
to rabbitmq-users

Hi all,

I am using rabbitmq operator 0.8.0 and rabbitmq 3.8.3 in GKE, with 3 RabbitMQ replica pods. The cr.yaml I am using is below.

crYaml: |-
  apiVersion: rabbitmq.com/v1beta1
  kind: RabbitmqCluster
  metadata:
    name: rabbitmqcluster
  spec:
    replicas: 3
    image: gcr.io/<path>/rabbitmq:3.8.3
    service:
      type: LoadBalancer
      annotations:
        cloud.google.com/load-balancer-type: "Internal"
        networking.gke.io/internal-load-balancer-allow-global-access: "true"
        cloud.google.com/neg: '{"ingress": true}'
    persistence:
      storageClassName: standard
      storage: 10Gi
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 1000m
        memory: 2Gi
    rabbitmq:
      additionalPlugins:
        - rabbitmq_sharding
        - rabbitmq_stomp
        - rabbitmq_shovel
        - rabbitmq_federation
        - rabbitmq_federation_management
      additionalConfig: |
        cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
        log.console.level = debug


I have been observing this weird behaviour with my rabbitmq cluster pods when I am trying to scale the rabbitmq statefulset to 0.

Here's what I am doing:

  1. Scale the operator pod down to 0:  kubectl -n <namespace> scale --replicas=0 deployment.apps/rabbitmq-cluster-operator
  2. Scale the rabbitmq statefulset down to 0:  kubectl -n <namespace> scale --replicas=0 statefulset.apps/rabbitmqcluster-rabbitmq-server


What I noticed is that although I get the message statefulset.apps/rabbitmqcluster-rabbitmq-server scaled, the statefulset's desired replica count still shows 3 (as it was initially). The scale command appears to take effect for a moment, but then the statefulset comes back up to its initial count.

Another point to bring up in relation to this: one of my pods repeatedly goes into terminating state and restarts - this happens almost every minute, which means my statefulset almost always has only 2/3 pods in ready state. I wonder whether that could be the reason the scale command doesn't work (see attached screenshots).

Screenshot (153).png

Screenshot (151).png

But I tried to rule this out by making sure all my pods were up and running before applying the scale command - I still saw the same behaviour as before.

Is this an issue related to RabbitMQ specifically? Could I solve it by upgrading to a newer version? I know the version I am using is old, but this is a production issue we're facing and the team in question is currently on this version.

 Any advice/suggestion would be much appreciated. 


Thanks,

Anjitha M.

Michal Kuratczyk

Apr 7, 2021, 4:25:00 PM4/7/21
to rabbitm...@googlegroups.com
Hi,

First of all, please upgrade as soon as possible. The version you are using was never meant to be used in production in the first place.

Second, scaling down is not a trivial operation and is not supported: https://github.com/rabbitmq/cluster-operator/issues/223. It may be feasible in a specific case but do not assume that you can just change the number of replicas to 0 and expect it to work. Current version of the operator ignores attempts to decrease the `replicas` value at the RabbitmqCluster level (it will leave the statefulset as-is). You may be able to achieve scale-down by pausing reconciliation (see below) and performing some manual tasks but I can't tell you what these tasks are exactly. If you just don't need that cluster for some time, you could perhaps delete the statefulset altogether so that no pods are running, but that sounds suspicious given it's a production cluster.
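If you go the delete-the-statefulset route, the sequence could look roughly like this. This is a sketch only (the namespace is a placeholder; the deployment and statefulset names below match what the operator typically generates, but verify yours first). The commands are built as strings here for clarity; run them only once you are sure this is what you want - note that with default cascading deletion the pods are removed, while the PVCs (and the data on them) are kept.

```shell
# Sketch with assumed names -- substitute your own before running anything.
NAMESPACE="my-namespace"                       # assumption: your namespace
STS="rabbitmqcluster-rabbitmq-server"          # statefulset created by the operator

# 1. Stop the operator first, so it cannot recreate the statefulset.
SCALE_OPERATOR_CMD="kubectl -n ${NAMESPACE} scale --replicas=0 deployment.apps/rabbitmq-cluster-operator"

# 2. Delete the statefulset; its pods are removed, PVCs remain.
DELETE_STS_CMD="kubectl -n ${NAMESPACE} delete statefulset ${STS}"

# Printed here instead of executed -- review before running.
echo "${SCALE_OPERATOR_CMD}"
echo "${DELETE_STS_CMD}"
```

Scaling the operator back up later lets it reconcile and recreate the statefulset from the RabbitmqCluster definition.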

Regarding changes to the statefulset - they are immediately overwritten by the Operator and that's the expected behaviour. It is in line with the idea of reconciliation - since your RabbitmqCluster definition says you should have 3 replicas, and the statefulset is a child/managed object, when it doesn't reflect the desired state of the RabbitmqCluster, it gets corrected. This is the same idea as when you delete a pod, the statefulset recreates it because the number of running pods doesn't match the `replicas` value of the statefulset.

Newer versions of the operator allow you to pause reconciliation temporarily: https://www.rabbitmq.com/kubernetes/operator/using-operator.html#pause. This lets you perform operations on the statefulset and other managed resources without them being immediately overwritten by the Operator.
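Per the linked documentation, pausing is done with a label on the RabbitmqCluster resource (shown here against the cluster name from the cr.yaml earlier in the thread; double-check the exact label name against the docs for your operator version):

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmqcluster
  labels:
    # While this label is "true", the operator skips reconciliation,
    # so manual edits to the statefulset are not reverted.
    rabbitmq.com/pauseReconciliation: "true"
```

Removing the label resumes reconciliation, at which point the operator brings the managed resources back in line with the RabbitmqCluster spec.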

As for the pod getting terminated, if it only happens when you try to scale the cluster to zero, then I guess what happens is:
1. you change the statefulset
2. the statefulset starts deleting pods to match the new, lower `replicas` value
3. Operator corrects the statefulset to match RabbitmqCluster `replicas` value
4. the pod is recreated to match the restored `replicas` value

If it happens all the time, not only when you try to scale down, then you need to properly investigate logs, Kubernetes events, etc.

Best,

--
Michał
RabbitMQ team