Regression: airflow-redis gets evicted after upgrade from 2.8.6 to 2.13.5

Andrii Abramov

Jul 14, 2025, 3:59:41 AM
to cloud-composer-discuss
We recently upgraded Cloud Composer from image composer-2.8.6-airflow-2.6.3 to composer-2.13.5-airflow-2.10.5.
Right after upgrading, we started to observe regular evictions of the airflow-redis pod, which cause airflow-scheduler to restart because it can't connect to airflow-redis:

> future: <Future finished exception=OperationalError('Error 111 connecting to airflow-redis-service.composer-system.svc.cluster.local:6379. Connection refused.')>

Previously (v2.8.6), airflow-redis had a PDB that prevented the cluster autoscaler from evicting airflow-redis:

{
  "kind": "PodDisruptionBudget",
  "apiVersion": "policy/v1",
  "metadata": {
    "name": "airflow-redis-pdb",
    "namespace": "composer-system",
    "uid": "16ebc83a-68cf-49ff-8de4-54b30ba98698",
    "resourceVersion": "1748750859365743017",
    "generation": 1,
    "creationTimestamp": "2025-02-17T18:17:55Z"
  },
  "spec": {
    "selector": {
      "matchLabels": {
        "run": "airflow-redis"
      }
    },
    "maxUnavailable": 0
  }
}

In the new version, this PDB is missing:

$ kubectl -n composer-system get pdb
NAME                            MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
airflow-database-init-job-pdb   N/A             0                 0                     146d
airflow-sqlproxy-pdb            N/A             1                 0                     146d
composer-agent-pdb              N/A             0                 0                     146d
composer-snapshots-pdb          N/A             0                 0                     146d
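
The same check can be scripted against the output of `kubectl -n composer-system get pdb -o json`. The following is a minimal sketch, not an official tool: the helper name `pdbs_selecting` is made up for illustration, the pod label `run: airflow-redis` is assumed from the old PDB's selector, and the `composer-agent` selector in the sample data is hypothetical. It only handles `matchLabels` selectors and ignores `matchExpressions`.

```python
def pdbs_selecting(pdb_list, pod_labels):
    """Return the names of PDBs whose matchLabels selector matches pod_labels.

    pdb_list is the parsed JSON from `kubectl get pdb -o json`
    (a dict with an "items" list). Only matchLabels is considered.
    """
    names = []
    for item in pdb_list.get("items", []):
        selector = item.get("spec", {}).get("selector") or {}
        match = selector.get("matchLabels")
        # A PDB covers the pod when every selector label is present on the pod.
        if match and all(pod_labels.get(k) == v for k, v in match.items()):
            names.append(item["metadata"]["name"])
    return names


# Assumed pod labels, taken from the selector of the old airflow-redis-pdb.
redis_labels = {"run": "airflow-redis"}

# Trimmed sample resembling the 2.8.6 state.
old_pdbs = {"items": [{
    "metadata": {"name": "airflow-redis-pdb"},
    "spec": {"selector": {"matchLabels": {"run": "airflow-redis"}},
             "maxUnavailable": 0},
}]}

# Trimmed sample resembling the 2.13.5 state (selector value is hypothetical).
new_pdbs = {"items": [{
    "metadata": {"name": "composer-agent-pdb"},
    "spec": {"selector": {"matchLabels": {"run": "composer-agent"}},
             "maxUnavailable": 0},
}]}

print(pdbs_selecting(old_pdbs, redis_labels))  # ['airflow-redis-pdb']
print(pdbs_selecting(new_pdbs, redis_labels))  # []
```

An empty result for the Redis labels means no PDB in the namespace protects the pod from voluntary evictions such as autoscaler scale-downs.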

Here's a cluster autoscaler log entry:
{
  "insertId": "1ae29c21-f9af-4fb8-b3f4-0f6e0eb51515@a1",
  "jsonPayload": {
    "decision": {
      "eventId": "533694d2-4ea2-4ff1-8016-9069e2a89764",
      "scaleDown": {
        "nodesToBeRemoved": [
          {
            "node": {
              "cpuRatio": 66,
              "mig": {
                "nodepool": "nap-xxx",
                "zone": "xxx",
                "name": "xxx"
              },
              "memRatio": 62,
              "name": "xxx"
            },
            "evictedPodsTotalCount": 12,
            "evictedPods": [
              {
                ...
              },
              {
                "namespace": "composer-system",
                "name": "airflow-redis-0",
                "controller": {
                  "name": "airflow-redis",
                  "apiVersion": "apps/v1",
                  "kind": "StatefulSet"
                }
              },
              {
                ...
              }
            ]
          }
        ]
      },
      "decideTime": "1752135241"
    }
  },
  "resource": {
    "type": "k8s_cluster",
    "labels": {
      "project_id": "xxx",
      "location": "xxx",
      "cluster_name": "xxx"
    }
  },
  "timestamp": "2025-07-10T08:14:01.666488867Z",
  "logName": "projects/xxx/logs/container.googleapis.com%2Fcluster-autoscaler-visibility",
  "receiveTimestamp": "2025-07-10T08:14:02.170197809Z"
}
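
To see how often this happens, entries like the one above can be filtered for scale-downs that evicted the Redis pod. This is a rough sketch that assumes the log entries have been exported as a list of JSON objects with the same shape as the entry above; the function name `evictions_of` is made up for illustration, and the sample entry is a trimmed copy of the one in this post.

```python
def evictions_of(entries, namespace, pod_prefix):
    """Return (timestamp, node name, pod name) for each matching evicted pod.

    entries is a list of parsed cluster-autoscaler-visibility log entries.
    Entries without a scaleDown decision are skipped.
    """
    hits = []
    for entry in entries:
        decision = entry.get("jsonPayload", {}).get("decision", {})
        scale_down = decision.get("scaleDown", {})
        for removed in scale_down.get("nodesToBeRemoved", []):
            node_name = removed.get("node", {}).get("name")
            for pod in removed.get("evictedPods", []):
                if (pod.get("namespace") == namespace
                        and pod.get("name", "").startswith(pod_prefix)):
                    hits.append((entry.get("timestamp"), node_name, pod["name"]))
    return hits


# Trimmed copy of the log entry above (elided pods left out).
sample_entries = [{
    "jsonPayload": {"decision": {"scaleDown": {"nodesToBeRemoved": [{
        "node": {"name": "xxx"},
        "evictedPodsTotalCount": 12,
        "evictedPods": [{
            "namespace": "composer-system",
            "name": "airflow-redis-0",
            "controller": {"name": "airflow-redis", "kind": "StatefulSet"},
        }],
    }]}}},
    "timestamp": "2025-07-10T08:14:01.666488867Z",
}]

for timestamp, node, pod in evictions_of(sample_entries, "composer-system", "airflow-redis"):
    print(timestamp, node, pod)
```

Grouping the returned timestamps by day would show when the evictions began relative to the upgrade.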

Is it a regression?

Matthijs Verkaaik

Jul 14, 2025, 4:09:44 AM
to cloud-composer-discuss
+1

Kien Truong Duc

Jul 15, 2025, 4:55:53 AM
to cloud-composer-discuss
Hi, 

I'm having the same problem after upgrading from composer-2.13.6-airflow-2.9.3 to composer-2.13.6-airflow-2.9.3.
`airflow-redis` is getting evicted a few times per day, causing brief scheduler downtime.


Kien Truong Duc

Jul 15, 2025, 5:06:22 AM
to cloud-composer-discuss
Judging from our logs, frequent Redis evictions started on July 1, even before our upgrade.
This is probably related to the hidden infrastructure change on July 1, which the release notes describe as:

> This release includes internal infrastructure improvements to Cloud Composer. There are no user-visible changes.