Kubernetes pods restarting in Composer environment

sbob...@kabaminc.com

Oct 4, 2018, 1:05:47 PM
to cloud-composer-discuss
I've noticed that pods in one of my Composer environments are frequently restarting.

kubectl get pods
NAME                                 READY     STATUS    RESTARTS   AGE
airflow-redis-0                      1/1       Running   16         3d
airflow-scheduler-85ffc686b9-g97s8   2/2       Running   167        17h
airflow-sqlproxy-9485d6c74-rng7z     1/1       Running   16         3d
airflow-worker-9bf7fbddb-26tlw       2/2       Running   10         1d
airflow-worker-9bf7fbddb-6nhrm       2/2       Running   1          2d
airflow-worker-9bf7fbddb-tftrf       2/2       Running   22         2d
composer-fluentd-daemon-n648t        1/1       Running   42         28d
composer-fluentd-daemon-q79j5        1/1       Running   228        3d
composer-fluentd-daemon-xmxwr        1/1       Running   10         28d
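
To make the worst offenders easier to spot, the listing can be ranked by restart count. The following is only a rough sketch: the function name `parse_pod_restarts` is made up for illustration, and it assumes the plain whitespace-separated column layout of `kubectl get pods` shown in this post.

```python
from typing import List, Tuple

# Listing copied from the `kubectl get pods` output in this post.
SAMPLE = """\
NAME                                 READY     STATUS    RESTARTS   AGE
airflow-redis-0                      1/1       Running   16         3d
airflow-scheduler-85ffc686b9-g97s8   2/2       Running   167        17h
airflow-sqlproxy-9485d6c74-rng7z     1/1       Running   16         3d
airflow-worker-9bf7fbddb-26tlw       2/2       Running   10         1d
airflow-worker-9bf7fbddb-6nhrm       2/2       Running   1          2d
airflow-worker-9bf7fbddb-tftrf       2/2       Running   22         2d
composer-fluentd-daemon-n648t        1/1       Running   42         28d
composer-fluentd-daemon-q79j5        1/1       Running   228        3d
composer-fluentd-daemon-xmxwr        1/1       Running   10         28d
"""


def parse_pod_restarts(listing: str) -> List[Tuple[str, int]]:
    """Return (pod name, restart count) pairs, highest restart count first.

    Hypothetical helper: assumes whitespace-separated columns with a header
    row that contains NAME and RESTARTS.
    """
    lines = [ln for ln in listing.splitlines() if ln.strip()]
    header = lines[0].split()
    name_col = header.index("NAME")
    restarts_col = header.index("RESTARTS")
    pods = []
    for line in lines[1:]:
        fields = line.split()
        pods.append((fields[name_col], int(fields[restarts_col])))
    # Sort by restart count, descending.
    return sorted(pods, key=lambda pod: pod[1], reverse=True)


if __name__ == "__main__":
    for name, restarts in parse_pod_restarts(SAMPLE)[:3]:
        print(f"{restarts:>5}  {name}")
```

On the listing above this puts composer-fluentd-daemon-q79j5 (228 restarts) first and the airflow-scheduler pod (167 restarts) second.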

One of the worst offenders is the airflow-scheduler pod, which restarts when its liveness probe fails with the following error:

kubectl describe pods airflow-scheduler

... Output Truncated ...

Events:
  Type     Reason     Age                  From                                                          Message
  ----     ------     ----                 ----                                                          -------
  Normal   Killing    19m (x166 over 17h)  kubelet, gke-us-central1-analytic-default-pool-8df91bc2-hgf6  Killing container with id docker://airflow-scheduler:Container failed liveness probe.. Container will be killed and recreated.
  Warning  Unhealthy  8m (x336 over 17h)   kubelet, gke-us-central1-analytic-default-pool-8df91bc2-hgf6  Liveness probe failed: Traceback (most recent call last):
  File "/var/local/scheduler_checker.py", line 87, in <module>
    main()
  File "/var/local/scheduler_checker.py", line 76, in main
    check_freshness(vars(parser.parse_args()))
  File "/var/local/scheduler_checker.py", line 59, in check_freshness
    dt = iso8601.parse_date(res[0]['timestamp'])
IndexError: list index out of range
  Warning  BackOff  4m (x1976 over 17h)  kubelet, gke-us-central1-analytic-default-pool-8df91bc2-hgf6  Back-off restarting failed container

While a new replica is being created, we are not able to schedule anything in the environment.

Has anyone else come across this issue?

Thanks,

Yudi Xue

Oct 4, 2018, 2:47:40 PM
to Stas Bobylev, cloud-composer-discuss
Adding some context: this has been pretty consistent. The airflow-scheduler pod restarts many times and then gets stuck in the restart back-off state whenever we run a backfill that inserts about 200 tasks (from today back to the beginning of the year).

--
You received this message because you are subscribed to the Google Groups "cloud-composer-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-composer-di...@googlegroups.com.
To post to this group, send email to cloud-compo...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cloud-composer-discuss/9d183cb3-4073-419b-801f-6aada7effcfa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Wilson Lian

Oct 4, 2018, 2:55:25 PM
to Yudi Xue, Stas Bobylev, cloud-composer-discuss
What Composer image version is your environment running?

Stas Bobylev

Oct 4, 2018, 2:58:11 PM
to Wilson Lian, Yudi Xue, cloud-composer-discuss
It is composer-1.1.0-airflow-1.9.0

Stas

Wilson Lian

Oct 4, 2018, 3:04:59 PM
to Stas Bobylev, Yudi Xue, cloud-composer-discuss
Looks like you've got Stackdriver logs ingestion turned off? The scheduler liveness checker in that version relied on being able to successfully retrieve Stackdriver logs emitted by the scheduler. composer-1.1.1-airflow-1.9.0 fixes the IndexError: list index out of range error. Please try using a newer version. This script will help with the migration: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/composer/tools/copy_environment.py
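
For illustration only, here is a minimal sketch of how an empty log query can turn into the IndexError from the traceback, and one way a freshness check could guard against it. This is not the actual /var/local/scheduler_checker.py and not necessarily how composer-1.1.1 fixed it; the function name `check_freshness_sketch`, the entry layout (a list of dicts with a 'timestamp' key), and the three-way result are assumptions. It also swaps in the standard library's `datetime.fromisoformat` for the `iso8601.parse_date` call seen in the traceback so that it runs without extra packages.

```python
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional, Sequence


def check_freshness_sketch(
    entries: Sequence[Mapping[str, str]],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> Optional[bool]:
    """Return True if the newest entry is recent, False if it is stale,
    and None if there are no entries to judge by.

    Hypothetical stand-in for the check in the traceback; `entries` is
    assumed to be ordered newest first.
    """
    if not entries:
        # With log ingestion turned off the query returns nothing, and
        # indexing entries[0] would raise "IndexError: list index out of range".
        return None
    now = now or datetime.now(timezone.utc)
    newest = datetime.fromisoformat(entries[0]["timestamp"])
    return now - newest <= max_age


if __name__ == "__main__":
    now = datetime(2018, 10, 4, 20, 0, tzinfo=timezone.utc)
    recent = [{"timestamp": "2018-10-04T19:59:00+00:00"}]
    print(check_freshness_sketch(recent, timedelta(minutes=5), now))
    print(check_freshness_sketch([], timedelta(minutes=5), now))
```

Whether "no entries" should count as healthy or unhealthy is a policy decision; the sketch just reports it separately instead of crashing the probe.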

best,
Wilson

Yudi Xue

Oct 4, 2018, 3:42:10 PM
to wwl...@google.com, Stas Bobylev, cloud-composer-discuss
OK we'll try composer-1.2.0-airflow-1.9.0

Glad to see Python 3 support is in beta.

It would also be very helpful to put the "release notes" link on the top-level documentation page. (I think someone suggested this before.)

It was difficult for me to realize that the release notes are under the "Resources" section.

(I also added this to the feedback section of that page.)

Thanks!

Yudi Xue

Oct 4, 2018, 3:43:15 PM
to wwl...@google.com, Stas Bobylev, cloud-composer-discuss
Also confirming that we have Stackdriver logs ingestion turned off for the GCP project for other reasons; we didn't know this would cause a side effect.