Hi,
I want to share what I learned while trying to upgrade and fix a Cloud Composer deployment in place. My team faced a situation where the workers were killed and constantly restarted under heavy load (this usually happens when you enable a new DAG that needs to load data for past days).
I had experience with the Docker images of Airflow, and the logs showed no errors, so I went to the Stackdriver logs and saw that K8s killed the worker pods because the health checks failed. I migrated the cluster's node pool to bigger instances, from n1-standard-1 to n1-standard-2 nodes (you have to be careful to grant the same IAM roles so that everything keeps working) [0], but that did not solve the problem.
Then I thought it might help to upgrade the image on the nodes for the worker and the scheduler, as there might be bug fixes, especially as the health check seems to be a custom Python script developed by the Cloud Composer team (?). So I looked at [0] and rolled out the newest image to the worker and scheduler pods (2018-07-03.RC0). This didn't fix the problem either.
The thing that fixed it, and made it quite stable, was disabling the health check on the workers (which is of course not a long-term solution) by editing their deployment via kubectl. Now the whole thing runs quite smoothly, but I was wondering if there have been adjustments other than
1. The image
2. The deployment.yaml
that I could have missed and that would be rolled out if I recreated the environment. (Read as: Has this been fixed somewhere?) The scheduler is still being killed from time to time, but this does not affect the execution of the DAGs.
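For anyone who wants to reproduce the workaround, here is a minimal sketch of how removing the liveness probe could be scripted. I edited the deployment by hand, so this uses `kubectl patch` with a JSON Patch instead, which is a swapped-in technique and not exactly what I did. The deployment name `airflow-worker`, the namespace `my-composer-namespace` and the container index 0 are assumptions; check them against your own cluster first. The script only builds and prints the command, it does not run it.

```python
import json
import shlex


def build_remove_liveness_patch(container_index=0):
    """Return a JSON Patch (RFC 6902) that removes one container's livenessProbe."""
    path = "/spec/template/spec/containers/{}/livenessProbe".format(container_index)
    return [{"op": "remove", "path": path}]


def build_kubectl_command(deployment, namespace, container_index=0):
    """Return the kubectl argument list for the patch; nothing is executed here."""
    patch = build_remove_liveness_patch(container_index)
    return [
        "kubectl", "patch", "deployment", deployment,
        "--namespace", namespace,
        "--type", "json",
        "--patch", json.dumps(patch),
    ]


if __name__ == "__main__":
    # Hypothetical names: replace them with the values from your environment.
    cmd = build_kubectl_command("airflow-worker", "my-composer-namespace")
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keep in mind that a "remove" operation fails if the path does not exist, so the patch errors out when the probe has already been removed.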
Below are the log lines that could be of interest:
Liveness probe failed: Traceback (most recent call last):
  File "/var/local/worker_checker.py", line 23, in <module>
    main()
  File "/var/local/worker_checker.py", line 17, in main
    checker_lib.host_name, task_counts))
Exception: Worker airflow-worker-75d56659cc-vfzv5 seems to be dead. Task counts details:{'scheduled': 9L, 'recently_done': 0L, 'running': 0L, 'queued': 0L}

Liveness probe failed: Task count details: {'scheduled': 2L, 'recently_done': 0L, 'running': 0L, 'queued': 0L}
Traceback (most recent call last):
  File "/var/local/scheduler_checker.py", line 83, in <module>
    main()
  File "/var/local/scheduler_checker.py", line 79, in main
    raise Exception('Scheduler seems to be dead.')
Exception: Scheduler seems to be dead.
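In case it helps with going through many of these events, here is a small sketch that pulls the task counts out of a probe message. The function names are made up for illustration. The `L` suffixes are Python 2 long literals, so they are stripped before parsing. `looks_stalled` only encodes the pattern visible in the two messages above (tasks scheduled, nothing queued, running or recently done); it is not the logic of the Composer checker scripts, which I have not seen.

```python
import ast
import re

# Matches both spellings seen in the logs: "Task counts details:" and "Task count details:".
_COUNTS_RE = re.compile(r"Task counts? details:\s*(\{[^}]*\})")
# Python 2 long literals such as 9L, which Python 3 cannot parse.
_LONG_SUFFIX_RE = re.compile(r"(\d+)L\b")


def parse_task_counts(message):
    """Return the task counts from a probe message as a dict, or None if absent."""
    match = _COUNTS_RE.search(message)
    if match is None:
        return None
    literal = _LONG_SUFFIX_RE.sub(r"\1", match.group(1))
    return ast.literal_eval(literal)


def looks_stalled(counts):
    """True if tasks are scheduled but none are queued, running or recently done."""
    return (
        counts["scheduled"] > 0
        and counts["queued"] == 0
        and counts["running"] == 0
        and counts["recently_done"] == 0
    )


if __name__ == "__main__":
    line = (
        "Exception: Worker airflow-worker-75d56659cc-vfzv5 seems to be dead. "
        "Task counts details:{'scheduled': 9L, 'recently_done': 0L, "
        "'running': 0L, 'queued': 0L}"
    )
    counts = parse_task_counts(line)
    print(counts, looks_stalled(counts))
```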
I will continue to investigate this and I am happy to collaborate :)
Thank you for this great product
Tobi