how to control zombie killing?


Justin Kiggins

Jan 8, 2016, 1:52:47 AM
to Airflow
Airflow is running great for our lab. Just one small problem.

Airflow keeps killing tasks as zombies even though they aren't zombies.

Most of my tasks are BashOperators which can easily run anywhere from 2-36 hours.

But Airflow keeps telling me it has killed them as zombies. It doesn't seem to *actually* kill them, though: the processes are still running in the background and writing to the log file.

- Can I turn off zombie-killing?
- What is the task's heartbeat?
- Can I change the heartbeat interval (e.g. extend it to 3+ hours?)?
- Any other tips?

Thanks!
Justin

Maxime Beauchemin

Jan 8, 2016, 11:46:42 AM
to Airflow
Oh interesting, this really should not be happening. A few pointers to help figure things out:
* Airflow workers run airflow jobs as subprocesses; the parent process emits a heartbeat to the metadata database on a regular basis (defined in airflow.cfg under [scheduler] -> job_heartbeat_sec). A healthy setting might be 30-60 seconds. That heartbeat process is also used to receive external kill signals emitted by clearing (UI or CLI)
* The scheduler (soon to be renamed "supervisor" as it doesn't only schedule anymore) kills zombies. It happens in this method:
* Reading the code, it looks like the job needs to skip 3 heartbeats + 120 seconds to be identified and killed as a zombie.
* You can track your heartbeats in the job table
* If you use a LocalExecutor, it's effectively a multiprocess pool attached to the scheduler process. If you kill the scheduler, you may be creating zombies
* There's a distinct case I call "undead", which we don't handle at the moment (soon, though!). Undead are running processes that Airflow no longer knows about. For instance, if you "mark success" on a running task, the task isn't killed, but Airflow no longer considers it running
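For reference, the heartbeat interval Max mentions lives in airflow.cfg; a minimal fragment (the value shown is illustrative, pick what suits your tasks) might look like:

```ini
[scheduler]
# Seconds between heartbeats written by the parent job process to the
# metadata database; 30-60 is the healthy range suggested above.
job_heartbeat_sec = 60
```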
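The 3-heartbeats-plus-120-seconds rule above can be sketched as follows; the function and parameter names here are illustrative, not Airflow's actual internals:

```python
from datetime import datetime, timedelta

def is_zombie(latest_heartbeat, job_heartbeat_sec=60, grace_seconds=120):
    """Treat a job as a zombie when its last heartbeat is older than
    three heartbeat intervals plus a fixed grace period (a sketch of
    the rule described above, not Airflow's actual code)."""
    limit = timedelta(seconds=3 * job_heartbeat_sec + grace_seconds)
    return datetime.utcnow() - latest_heartbeat > limit

# With job_heartbeat_sec=60 the cutoff is 300s: a heartbeat from 4
# minutes ago is fine, one from 10 minutes ago marks a zombie.
print(is_zombie(datetime.utcnow() - timedelta(minutes=4)))   # False
print(is_zombie(datetime.utcnow() - timedelta(minutes=10)))  # True
```

One consequence worth noting: raising job_heartbeat_sec also raises the zombie-detection cutoff, so long-running tasks with an occasionally slow DB benefit from a larger interval.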

What type of executor and database backend are you using? Are you sure your processes are actually still running? Is a system policy or cgroups killing them? Are the workers losing connectivity to the DB?

Max

Justin Kiggins

Jan 8, 2016, 6:16:36 PM
to Airflow
LocalExecutor w/ a Postgres database (on a separate machine).
The processes are definitely still running. In fact, they continue writing to their task logfile.

I think I might have found some candidates for the problem, however... after getting airflow working in an environment under my own user, I redeployed it under root. However, "ps aux | grep airflow" reveals a bunch of scheduler processes (the workers?) running under the "old" environment. So perhaps I have an extra scheduler running and I'm getting something weird happening in the database?

I've been using supervisor to start and stop the scheduler and webserver and noticed that stopping them through supervisor doesn't always kill the gunicorn processes or scheduler's workers. Perhaps this is causing my problem?
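If that's the case, one common supervisord fix (an assumption about your setup, not something confirmed in this thread) is to have it signal the whole process group so gunicorn workers and the scheduler's children die with the parent:

```ini
[program:airflow-webserver]
command=airflow webserver
; Without these, supervisord only signals the parent process and
; gunicorn workers can survive a stop.
stopasgroup=true
killasgroup=true
```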

Another candidate: here is ps aux for one of my "killed as zombie" tasks, supposedly killed ~21 hours ago:

root     31110  0.0  0.0 1015216 39224 ?       S    Jan05   3:06 /usr/local/anaconda/envs/airflow/bin/python /usr/local/anaconda/envs/airflow/bin/airflow run Pen03_Rgt_AP2350_ML1400__Site04_Z2023__B957_cat_P03_S04_Epc07-14 phy_spikesort 2016-01-01T16:27:23.698899 --local --pool phy -sd DAGS_FOLDER/jk.py
root     31138  0.0  0.0 1013704 2296 ?        S    Jan05   0:15 /usr/local/anaconda/envs/airflow/bin/python /usr/local/anaconda/envs/airflow/bin/airflow run Pen03_Rgt_AP2350_ML1400__Site04_Z2023__B957_cat_P03_S04_Epc07-14 phy_spikesort 2016-01-01T16:27:23.698899 --job_id 688 --pool phy --raw -sd DAGS_FOLDER/jk.py

There seem to be 2 "airflow run" processes running, each with slightly different arguments ("--local" vs "--job_id 688 ... --raw"). Is this normal or suspicious?

I'm going to try shutting down, killing anything with "airflow", and restarting.
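That clean restart might look roughly like this (the supervisor program names are assumed; adjust to your deployment):

```shell
# Stop the supervised services first.
supervisorctl stop airflow-webserver airflow-scheduler

# Check for leftover processes, then kill anything matching "airflow".
pgrep -af airflow
pkill -f airflow

# Bring things back up.
supervisorctl start airflow-scheduler airflow-webserver
```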

Justin

Yuri Titov Bendana

Jan 8, 2016, 8:29:44 PM
to Airflow
I have a similar issue using systemd to stop the webserver which I've reported here:


yuri