Hi, please review the exception request below.
There is one PR (#115331), which is important for k8s Jobs and also independently fixes a Kubelet issue (#116410). The review is in progress, with one outstanding issue that is understood.
Enhancement name: Retriable and non-retriable Pod failures for Jobs
Enhancement status (alpha/beta/stable): Beta
SIG: sig-apps (sig-node participating)
k/enhancements repo issue #: 3329
PR #’s: 115331
Additional time needed (in days): 3
Reason this enhancement is critical for this milestone: To fix the bug in Beta (see here). Also, to open the possibility of fixing #115844 in the next release.
Risks from adding code late: (to k8s stability, testing, etc.) The PR includes unit tests and node e2e tests that provide nearly 100% coverage of the new code. The node e2e tests for this PR were run in a loop more than 100 times to detect flakes.
Risks from cutting enhancement: (partial implementation, critical customer usecase, etc.):
A delayed bugfix for Beta, which would delay adoption of the pod failure policy feature for use cases that aim to avoid unnecessary costs when running batch workloads. Also, a delayed bugfix for the Kubelet issue #116410. Finally, a delayed solution for issue #115844 (see the comment) for TensorFlow users.