Best practices for stopping a DAG with a failed task - Killing Halted Tasks

4,154 views

Skip to first unread message

rmur...@merkleinc.com

unread,

Aug 17, 2018, 11:14:16 AM8/17/18

to cloud-composer-discuss

GCP Cloud Composer Gurus,

Anyone have a "best practices" or guidance for operations types of folks for them to "kill" a DAG execution? I'm basically getting ready to hand off a build using composer to operations type of technicians and I'm trying to figure out the best way to approach it.

Use case: we have one DAG with 10 tasks, which is triggered via a cloud storage event (w/ cloud functions and pub/sub, etc...). We drop 10 files on cloud storage and the dag kicks off 10 times (good), but it failing at task #5 on all of the DAG executions, and you want your ops folks to just cancel the executions. Do you suggest we just have ops people leave everything in the FAILED state?

When working this through by myself, I was just marking tasks and the DAGS as success, but then I noticed some of the failed processes hanging and still running on the composer compute engines. I know they were "still running" because I can see XComs getting set and when I SSH into the compute engines I see the processes running.

I found the following link where someone created a DAG to run to kill off halted tasks, is that a good way to move forward?

https://github.com/teamclairvoyant/airflow-maintenance-dags/tree/master/kill-halted-tasks

I guess my fundamental questions are as follows:

1. do you recommend when the above situation happens, that ops types of folks leave the DAG runs in failed state? Or is there some better way to manage this?

2. if you have zombie tasks running, should I try to implement that "kill halted tasks" DAG for my ops folks to run every so often?

Thanks and best...Rich Murnane

Wilson Lian

unread,

Aug 20, 2018, 8:19:17 PM8/20/18

to rmur...@merkleinc.com, cloud-composer-discuss

Hi Rich,

Leaving DAG runs in the FAILED state seems like a reasonable course of action when a task that you consider critical fails.

As you noticed, changing the state of a DAG run/task doesn't affect in-flight task instances, as it only influences whether the scheduler will try to schedule task instances. Unfortunately, Airflow doesn't have a good way of linking task instance state to the low-level process that executes the task instance. I'd advise caution with a kill-halted-tasks style DAG because there are multiple Airflow worker nodes, and you'd need to ensure that the DAG runs on every node to fully clean up all halted tasks. It might be better to do this at a layer above Airflow. For example:

Determine the GKE cluster for the environment
Get the GKE cluster's credentials
List the worker pods: kubectl get pods | grep airflow-worker
SSH into the worker: kubectl exec -it <AIRFLOW_WORKER_POD_NAME> bash
Execute task killing logic

--
You received this message because you are subscribed to the Google Groups "cloud-composer-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-composer-discuss+unsub...@googlegroups.com.
To post to this group, send email to cloud-composer-discuss@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cloud-composer-discuss/d23d914a-38bb-400e-bf81-7be3cfdac8c8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply all

Reply to author

Forward

0 new messages