Hello Vorrarit,
At this point I need to explain a little about the concepts of the system.
If a Job runs into a FAILURE which you have defined to be a RESTARTABLE state, this job has not yet completed successfully.
Only when it reaches a FINAL state (or is CANCELLED) does the system consider the Job done.
This means that the system expects that you'd like to do something with the failed Jobs.
(You want to get work done, which is usually not achieved by failing).
Since the system can't do anything itself, it marks Jobs that are waiting for a manual action in red.
The operator actions that are valid in this situation are:
1. RESTART the Job
   You've removed the cause of the FAILURE and you want to run the program again to get work done.
2. Set the Exit State of the Job (to some FINAL state)
   You can't remove the cause of the FAILURE and you want to flag the Job as done.
3. CANCEL the Job
   You can't remove the cause of the FAILURE and want to get rid of the Job.
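The three operator actions can be pictured with a small Python toy model. This is only a sketch to illustrate the concept, not the system's API: the `Job` class, the function names and the two state sets are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed toy profile: SUCCESS is FINAL, FAILURE is RESTARTABLE.
FINAL_STATES = {"SUCCESS"}
RESTARTABLE_STATES = {"FAILURE"}


@dataclass
class Job:
    name: str
    exit_state: Optional[str] = None  # None means: no finished run yet
    cancelled: bool = False

    def is_done(self) -> bool:
        # A Job is done only when it is FINAL or CANCELLED.
        return self.cancelled or self.exit_state in FINAL_STATES

    def waits_for_operator(self) -> bool:
        # A RESTARTABLE state needs a manual decision (shown in red).
        return not self.cancelled and self.exit_state in RESTARTABLE_STATES


def restart(job: Job) -> None:
    """Action 1: the cause is removed, run the program again."""
    job.exit_state = None


def set_exit_state(job: Job, state: str) -> None:
    """Action 2: flag the Job as done with some FINAL state."""
    if state not in FINAL_STATES:
        raise ValueError(f"{state} is not a FINAL state")
    job.exit_state = state


def cancel(job: Job) -> None:
    """Action 3: get rid of the Job."""
    job.cancelled = True


job = Job("nightly_load", exit_state="FAILURE")
print(job.is_done(), job.waits_for_operator())  # False True
cancel(job)
print(job.is_done(), job.waits_for_operator())  # True False
```

As long as none of the three functions is called, the failed Job stays in the "waits for operator" condition and is never done.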
It is important to see the difference between cancelling and setting the Exit State.
If you cancel a job, all dependent jobs will become UNREACHABLE. Another operator action will be required to decide what to do with the successors of the CANCELLED Job.
If the successors have defined some Exit State to be their UNREACHABLE state, that definition has no effect in this case.
If you set the Exit State of a Job instead, successors that become UNREACHABLE will acquire their UNREACHABLE Exit State (e.g. SKIPPED).
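To make the difference concrete, here is a small Python toy model. It is a sketch under assumptions, not a description of the system's internals: the `Successor` class, its fields, and the rule that a successor requires SUCCESS from its predecessor are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Successor:
    name: str
    required_state: str = "SUCCESS"     # assumed dependency condition
    unreachable_state: str = "SKIPPED"  # Exit State defined as the UNREACHABLE state
    exit_state: Optional[str] = None
    waits_for_operator: bool = False


def cancel_predecessor(successors: List[Successor]) -> None:
    # Cancelling: successors become UNREACHABLE and wait for an operator;
    # their UNREACHABLE Exit State is NOT applied.
    for s in successors:
        s.waits_for_operator = True


def set_predecessor_exit_state(state: str, successors: List[Successor]) -> None:
    # Setting an Exit State: successors whose dependency can no longer be
    # fulfilled acquire their UNREACHABLE Exit State.
    for s in successors:
        if state != s.required_state:
            s.exit_state = s.unreachable_state


a = [Successor("report")]
cancel_predecessor(a)
print(a[0].exit_state, a[0].waits_for_operator)  # None True

b = [Successor("report")]
# "DONE_MANUALLY" is a hypothetical FINAL state that does not satisfy the dependency.
set_predecessor_exit_state("DONE_MANUALLY", b)
print(b[0].exit_state, b[0].waits_for_operator)  # SKIPPED False
```

In the first case you are left with one more Job to decide about; in the second case the successor is finished without further operator action.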
You'll have to invest some work to clean up the current situation.
The Jobs waiting for an operator decision clutter your system to the point where newer Jobs are no longer visible.
And since those Jobs are waiting, they can't be removed from the system. This occupies loads of memory and slows down your queries.
I assume you are no longer interested in those ancient Jobs. And since they all seem to be standalone Jobs, it will be safe to CANCEL them.
Just mark all Jobs you want to get rid of with the Checkbox at the start of the row and hit the Cancel Button (Tombstone with RIP).
Now you want to prevent this from happening again.
You can make a copy of the STANDARD Exit State Profile and give it another name (e.g. STANDARD_FF, meaning something like STANDARD with FAILURE is FINAL).
To do so, edit the name of the Exit State Profile you want to copy and hit the Clone Button.
Then mark the FAILURE Exit State as a FINAL state in the copy.
Assign the new Exit State Profile to all Job Definitions that are allowed to fail.
Future Jobs will now have a FINAL Exit State after failing. You can no longer restart such Jobs, but they also won't clutter your system.
Alternatively, you can create an Exit State Definition called PURGED (or any other name with a comparable meaning).
Then add this Exit State to the STANDARD Exit State Profile. I think the best place will be below SUCCESS.
Define it to be a FINAL state.
Now if Jobs fail in the future, you can choose to rerun them, or to set their Exit State to PURGED, which then documents that you've decided not to try to run the Job again.
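Both options can be summarised in a small Python sketch. The dictionaries are just an illustrative model of Exit State Profiles, not the system's configuration format; that STANDARD contains only SUCCESS (FINAL) and FAILURE (RESTARTABLE) is an assumption, and `STANDARD_WITH_PURGED` is a made-up name used here only to keep the two variants apart.

```python
# Assumed content of the STANDARD profile (illustration only).
STANDARD = {"SUCCESS": "FINAL", "FAILURE": "RESTARTABLE"}

# Option 1: clone STANDARD and make FAILURE a FINAL state.
STANDARD_FF = dict(STANDARD)
STANDARD_FF["FAILURE"] = "FINAL"

# Option 2: keep FAILURE RESTARTABLE and add a FINAL state PURGED below SUCCESS.
STANDARD_WITH_PURGED = {
    "SUCCESS": "FINAL",
    "PURGED": "FINAL",
    "FAILURE": "RESTARTABLE",
}


def waits_for_operator(profile: dict, exit_state: str) -> bool:
    """A Job in a RESTARTABLE state stays in the system until someone acts."""
    return profile[exit_state] == "RESTARTABLE"


print(waits_for_operator(STANDARD, "FAILURE"))              # True
print(waits_for_operator(STANDARD_FF, "FAILURE"))           # False
print(waits_for_operator(STANDARD_WITH_PURGED, "FAILURE"))  # True
print(waits_for_operator(STANDARD_WITH_PURGED, "PURGED"))   # False
```

With the first option a failed Job is finished immediately; with the second it still waits for you, but you have a FINAL state at hand that records your decision.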
I hope the above makes sense to you. If not, please ask, and I'll try to explain it better.
Best regards,
Ronald