GCP Cloud Composer Gurus,
Does anyone have best practices or guidance on how operations staff should "kill" a DAG execution? I'm getting ready to hand off a build that uses Composer to operations technicians, and I'm trying to figure out the best way to approach it.
Use case: we have one DAG with 10 tasks, which is triggered by a Cloud Storage event (via Cloud Functions, Pub/Sub, etc.). We drop 10 files on Cloud Storage and the DAG kicks off 10 times (good), but it fails at task #5 in all of the DAG runs, and we want our ops folks to just cancel the executions. Do you suggest we just have the ops people leave everything in the FAILED state?
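To make "cancel the executions" concrete, here is a minimal sketch of what I imagine an ops helper could look like. It assumes an Airflow 2.2+ environment where the stable REST API is enabled and lets you PATCH a DAG run's state; I have not confirmed that this fits our Composer version. The function name `build_fail_run_request`, the base URL, and the token are all hypothetical placeholders, and the sketch only builds the request without sending it.

```python
import json
import urllib.request
from urllib.parse import quote


def build_fail_run_request(base_url, dag_id, dag_run_id, token):
    """Build (but do not send) a PATCH request that marks one DAG run as failed.

    Assumptions: base_url is the Airflow web server URL, token is a bearer
    token that the web server accepts, and the environment exposes the
    Airflow 2.2+ stable REST API.
    """
    # Run IDs usually contain ':' and '+', so percent-encode every path segment.
    url = "{}/api/v1/dags/{}/dagRuns/{}".format(
        base_url.rstrip("/"),
        quote(dag_id, safe=""),
        quote(dag_run_id, safe=""),
    )
    body = json.dumps({"state": "failed"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
    )
```

An ops script would loop over the 10 run IDs and pass each request to `urllib.request.urlopen`. How the token is obtained depends on how the environment is secured, so I have left that out.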
When working through this by myself, I was just marking the tasks and the DAGs as success, but then I noticed some of the failed processes hanging and still running on the Composer Compute Engine instances. I know they were "still running" because I could see XComs getting set, and when I SSH'd into the instances I saw the processes running.
I found the following link where someone created a DAG that kills off halted tasks. Is that a good way to move forward?
I guess my fundamental questions are as follows:
1. When the above situation happens, do you recommend that ops folks leave the DAG runs in the failed state, or is there a better way to manage this?
2. If we have zombie tasks running, should I implement that "kill halted tasks" DAG for my ops folks to run every so often?
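For question 2, the manual check I have been doing over SSH could be scripted roughly as below. This is only a sketch under my own assumptions: it expects text in the shape produced by `ps -eo pid,args` on a worker, and it treats any process whose command line mentions both "airflow" and the DAG id as a candidate. The function name `find_task_pids` is hypothetical, the substring match is crude (a DAG id that is a prefix of another would also match), and it only lists PIDs without killing anything.

```python
def find_task_pids(ps_output, dag_id):
    """Return PIDs of processes whose command line mentions airflow and dag_id.

    ps_output is text shaped like the output of `ps -eo pid,args`:
    one process per line, PID first, then the full command line.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # Skip the header row and any empty or malformed lines.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, command = int(parts[0]), parts[1]
        # Assumption: task processes carry both "airflow" and the DAG id
        # somewhere in their command line.
        if "airflow" in command and dag_id in command:
            pids.append(pid)
    return pids
```

The idea would be for ops to compare the returned PIDs against the runs that Airflow itself still reports as running before deciding whether anything needs to be killed.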
Thanks and best...Rich Murnane