Liveness Probe Failed for Worker / Scheduler - Sharing my experiences with upgrading 2018-05-10.RC0 to 2018-07-03.RC0 and tinkering with the deployment.yaml

Tobias Kaymak

Jul 5, 2018, 3:00:46 AM
to cloud-composer-discuss
Hi,

I want to share the experience I gathered while trying to upgrade and fix a Cloud Composer deployment in place. My team faced a situation where the workers were killed and constantly restarted while under heavy load
(this usually happens when you enable a new DAG that needs to load data for past days).

I had experience with the Docker images of Airflow, and the Airflow logs showed no errors, so I went to the Stackdriver logs and saw that Kubernetes killed the worker pods because their health checks failed. I migrated the node pool of the cluster to bigger instances, from n1-standard-1 to n1-standard-2 nodes (you have to be careful to grant the same IAM roles so that everything keeps working) [0], but that did not solve the problem.
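
In case it helps, the node pool migration roughly looks like the steps below. The cluster, zone, and pool names are placeholders, and the --scopes value has to match whatever the original pool was created with:

    # Create a new pool with bigger machines and the same scopes as the old one
    gcloud container node-pools create larger-pool \
      --cluster=my-composer-cluster --zone=europe-west1-b \
      --machine-type=n1-standard-2 --num-nodes=3 \
      --scopes=https://www.googleapis.com/auth/cloud-platform

    # List the nodes of the old pool, then cordon and drain them one by one
    kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool
    kubectl cordon <node-name>
    kubectl drain <node-name> --ignore-daemonsets --delete-local-data

    # Once everything has been rescheduled onto the new pool, remove the old one
    gcloud container node-pools delete default-pool \
      --cluster=my-composer-cluster --zone=europe-west1-b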

Then I thought it might be beneficial to upgrade the image used by the worker and scheduler pods, as there might be bugfixes, especially since the health check seems to be a custom Python script developed by the Cloud Composer team (?). So I looked at [1] and rolled out the newest image to the worker and scheduler pods (2018-07-03.RC0). This didn't fix the problem either.
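
Rolling out the new image is just a matter of pointing the deployments at the new tag, roughly as below. The container names, the scheduler deployment name, and the image path are assumptions here, so check them with kubectl describe first:

    # Check the current image and container name before changing anything
    kubectl describe deployment airflow-worker | grep -i image

    # Point the worker and scheduler at the new tag (names are placeholders)
    kubectl set image deployment/airflow-worker \
      airflow-worker=<registry>/<airflow-image>:2018-07-03.RC0
    kubectl set image deployment/airflow-scheduler \
      airflow-scheduler=<registry>/<airflow-image>:2018-07-03.RC0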

The thing that finally fixed it, and made it quite stable, was disabling the health check on the workers by editing their deployment via kubectl (which is of course not a long-term solution; see the sketch after the list below). Now the whole thing runs quite smoothly, but I was wondering if there have been adjustments other than
1. The image
2. The deployment.yaml

that I could have missed and that would only be rolled out if I recreated the environment. (Read as: has this been fixed somewhere?) The scheduler is still being killed from time to time, but this does not affect the execution of the DAGs.
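
For completeness, disabling the probe can be done interactively with kubectl edit (delete the livenessProbe block from the worker container) or with a JSON patch roughly like the one below. This assumes the worker container is the first one in the pod spec, so verify against your own deployment first:

    # Interactive: remove the livenessProbe section from the worker container
    kubectl edit deployment airflow-worker

    # Or non-interactively; assumes the worker container is at index 0
    kubectl patch deployment airflow-worker --type=json \
      -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'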

Below I attached the log lines that could be of interest:

Liveness probe failed: Traceback (most recent call last):
  File "/var/local/worker_checker.py", line 23, in <module>
    main()
  File "/var/local/worker_checker.py", line 17, in main
    checker_lib.host_name, task_counts))
Exception: Worker airflow-worker-75d56659cc-vfzv5 seems to be dead. Task counts details: {'scheduled': 9L, 'recently_done': 0L, 'running': 0L, 'queued': 0L}

Liveness probe failed: Task count details: {'scheduled': 2L, 'recently_done': 0L, 'running': 0L, 'queued': 0L}
Traceback (most recent call last):
  File "/var/local/scheduler_checker.py", line 83, in <module>
    main()
  File "/var/local/scheduler_checker.py", line 79, in main
    raise Exception('Scheduler seems to be dead.')
Exception: Scheduler seems to be dead.

I will continue to investigate this and I am happy to collaborate :)
Thank you for this great product

Tobi

Tobias Kaymak

Jul 6, 2018, 2:42:53 AM
to cloud-composer-discuss
So, after diffing the deployment.yaml of a new Cloud Composer environment against the old (0.5) one, I can confirm that the livenessProbe configuration did not change:

         livenessProbe:
           exec:
             command:
             - python
             - /var/local/worker_checker.py
           failureThreshold: 3
           initialDelaySeconds: 120
           periodSeconds: 120
           successThreshold: 1
           timeoutSeconds: 60

Which means that even after upgrading the image and using the newest checker scripts, the worker and scheduler pods still get killed. Could this be because of network/system latency? When I ran apt-get inside the container to install a debugging tool, it took quite a while before the system started fetching the package sources. It felt clogged, but I did not take measurements with perftools at that time.
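
In case someone wants to reproduce the comparison: it boils down to dumping the live deployment spec of each environment and diffing the probe sections, roughly like this (worker-old.yaml being a dump saved from the old environment beforehand):

    # Dump the live spec of the new environment's worker deployment
    kubectl get deployment airflow-worker -o yaml > worker-new.yaml

    # Compare just the liveness probe sections of old and new
    diff <(grep -A 8 'livenessProbe:' worker-old.yaml) \
         <(grep -A 8 'livenessProbe:' worker-new.yaml)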

Feng Lu

Jul 10, 2018, 5:47:02 PM
to tobias...@gmail.com, cloud-composer-discuss
Hi Tobias, 

Thanks a lot for sharing your debugging experience with the group. 

As I replied in another email thread, the problem is that some Airflow tasks get stuck in the "SCHEDULED" or "QUEUED" state when a worker gets restarted.
This creates a state mismatch: the CeleryExecutor thinks the task is happily being executed, but the ground truth is that these tasks are dead.
The liveness probe detects such a mismatch and tries to restart the scheduler/worker to mitigate it. This helps in certain cases, but not always.

The short-term fix is to manually delete "QUEUED" or "SCHEDULED" tasks if they have been around long enough. 
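One way to do that (just a sketch; the DAG id and dates are placeholders, and this assumes you run the Airflow CLI from inside one of the worker pods) is to clear the affected task instances so they get rescheduled from scratch:

    # Open a shell in one of the worker pods
    kubectl get pods | grep airflow-worker
    kubectl exec -it <airflow-worker-pod> -- /bin/bash

    # Inside the pod: reset the stuck task instances for the affected date range
    airflow clear -s 2018-07-01 -e 2018-07-05 my_dag_id
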
We're working on some fixes now which should be available in the next release. 
Thanks.

Feng 


Alexander Nordin

Dec 10, 2019, 9:23:37 AM
to cloud-composer-discuss
Hi Tobias and Feng,

I'm running into what I believe to be the same issue. Has this issue been resolved?

I'm trying to run a large backfill, around 10k tasks across 300 DAG runs, on Cloud Composer.

Here are some of the logs from one of the airflow-worker pods. 

[Attached screenshot of the worker pod logs: Screen Shot 2019-12-10 at 9.21.07 AM.png]


Additionally, the backfill job is backed up with about 3000 queued tasks and 5000 scheduled tasks, according to the Airflow Webserver UI. 



Tobias Kaymak

Dec 12, 2019, 8:04:45 AM
to cloud-composer-discuss
Hi Alexander,

I think the issue you are experiencing is that you are simply trying to do too much at once; the problem discussed in this thread has been fixed by the releases of the past year.
I would recommend disabling the DAG so that the scheduler does not do its own thing, and then backfilling only a certain number of days at a time in a bash loop so you don't overload the system (see the sketch below). Also make sure that your workers have enough memory (you can monitor it via the Kubernetes Engine interface).
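
Something along these lines (the DAG id and dates are placeholders, and the airflow commands are run wherever you normally run the CLI, e.g. inside a worker pod):

    #!/bin/bash
    # Pause the DAG so the scheduler does not pile up more runs on its own
    airflow pause my_dag_id

    # Backfill one day per iteration to keep the load on the workers manageable
    # (end date is exclusive here; date arithmetic requires GNU date)
    start_date="2019-11-01"
    end_date="2019-12-01"
    current="$start_date"
    while [ "$current" != "$end_date" ]; do
      airflow backfill -s "$current" -e "$current" my_dag_id
      current=$(date -d "$current + 1 day" +%Y-%m-%d)
    done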


Best,
Tobias 