Hi CDAP-Dev,
I'd like to give a heads-up about a bug fix I'm proposing for CDAP. Currently, Data Fusion can incorrectly mark a pipeline as FAILED if, while deprovisioning the ephemeral Dataproc cluster, it attempts to cancel a Dataproc job that has already completed successfully.
The Issue:
The CDAP RemoteExecutionTwillController sends a CancelJob request to Dataproc. If the job is already in the DONE state, Dataproc returns an error. This error is then caught in AbstractDataprocProvisioner, which treats it as a pipeline failure even though the pipeline itself ran successfully. As a result, pipelines that actually succeeded are reported as FAILED.
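To make the failure mode concrete, here is a minimal, self-contained Java sketch of one way a cancel request can tolerate a job that has already finished. It is an illustration only and not the code from the proposed change. Every name in it (CancelSketch, JobClient, JobState, CancelRejectedException, cancelIfRunning) is hypothetical and is not a CDAP or Dataproc class, and the state list is a simplified stand-in for the real Dataproc job states.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch; none of these types are CDAP or Dataproc classes. */
final class CancelSketch {

  /** Simplified job states (assumption); the last three are treated as terminal. */
  enum JobState { PENDING, RUNNING, DONE, ERROR, CANCELLED }

  static final Set<JobState> TERMINAL =
      EnumSet.of(JobState.DONE, JobState.ERROR, JobState.CANCELLED);

  /** Stand-in for the error a service returns when a job cannot be cancelled. */
  static final class CancelRejectedException extends RuntimeException {
    CancelRejectedException(String message) {
      super(message);
    }
  }

  /** Stand-in for a job client; a real client would call the remote service. */
  interface JobClient {
    JobState getState(String jobId);

    void cancel(String jobId);
  }

  /**
   * Sends a cancel request only if the job is not already in a terminal state.
   *
   * @return true if a cancel request was sent and accepted, false if the job
   *     had already finished and nothing needed to be cancelled
   */
  static boolean cancelIfRunning(JobClient client, String jobId) {
    // Skip the cancel entirely when the job has already finished.
    if (TERMINAL.contains(client.getState(jobId))) {
      return false;
    }
    try {
      client.cancel(jobId);
      return true;
    } catch (CancelRejectedException e) {
      // The job may finish between the state check and the cancel call.
      // If it is terminal now, the rejection is not a failure.
      if (TERMINAL.contains(client.getState(jobId))) {
        return false;
      }
      // Otherwise the rejection is unexpected, so surface it to the caller.
      throw e;
    }
  }
}
```

With this shape, a caller can treat a false return as "nothing to cancel" and leave the pipeline status to be decided by the job's own final state instead of by the cancel error.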
The Fix:
I've implemented changes to:
Unit tests have been added to cover these changes.
Internal tracking for this issue is in Buganizer: b/460875216
A Pull Request on GitHub will follow shortly.
Thanks,
C.J. Collier
Dataproc Subject Matter Expert