Hi Anthony,
Well, your ability to solve the problem yourself more or less proves there's some logic behind the various config files :-)
Jobs in a restartable state will remain in memory forever. And thousands of restartable jobs will indeed fill the memory.
I was pleased to read your understanding of a restartable state: it does indeed mean a manual intervention is required.
This means that it can be perfectly valid to have final error states.
schedulix can be regarded as a high-level distributed programming language. This means that if you have standard error situations, the treatment of such errors can possibly be automated.
Implementing an automated treatment of standard errors will significantly reduce the number of "red" jobs that require a manual intervention. This again will reduce the backlog problem.
There are a few methods to treat errors automatically:
1. If your standard reaction is to cancel the failed job, you can just as well use a profile where FAILURE is defined to be a final state.
The job will be "green" after running into a FAILURE and can be cleaned out of memory later.
2. If your standard reaction is to simply rerun the job, you could define a rerun trigger which retries the job after some period.
This is a very good method to cope with "soft" errors, i.e. errors that occur because of timing issues.
3. If you have a standard repair procedure, that could be integrated into your job flow.
This means, you'll build a loop:
    do
        run JOBX
        if ERROR; then
            run REPAIR
        fi
    while ERROR
Example E0170 shows how to build a simple loop within schedulix.
And maybe the rerun of JOBX isn't even required, which means that the situation will be fine if the REPAIR job has run. So you'd only have to implement the "IF".
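The control flow of method 3 can be sketched in a few lines of Python. This is not schedulix syntax and says nothing about how example E0170 is built; it only illustrates the loop. The names run_with_repair, jobx, repair and the max_attempts cap are assumptions added for the illustration (the cap is there so that a failing repair cannot loop forever).

```python
from typing import Callable


def run_with_repair(run_job: Callable[[], bool],
                    run_repair: Callable[[], None],
                    max_attempts: int = 3) -> bool:
    """Run the job; after every failed run, run the repair step and retry.

    Returns True as soon as a run succeeds, False if all attempts fail.
    max_attempts is an assumption of this sketch, not a schedulix setting.
    """
    for _ in range(max_attempts):
        if run_job():
            return True
        run_repair()
    return False


# Hypothetical stand-ins for JOBX and REPAIR: the job fails until repaired.
state = {"repaired": False, "runs": 0}


def jobx() -> bool:
    state["runs"] += 1
    return state["repaired"]


def repair() -> None:
    state["repaired"] = True


succeeded = run_with_repair(jobx, repair)
print(succeeded, state["runs"])  # prints "True 2"
```

If the rerun of JOBX isn't needed, the loop collapses to a single run of the job followed by a conditional run of the repair step.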
Depending on the exact situation there may be more strategies, but I don't know enough about your system at the moment to suggest them.
Apart from reducing the number of restartable jobs, it might make sense to have a look at the history configuration.
This defines which jobs can be cleaned out. The parameters can be found in the $BICSUITECONFIG/server.conf file.
In my system I don't care too much about jobs that ran successfully. I mainly use it to test my newest developments.
This is why the parameters are set to relatively low values. Still, the relevant parameters are:
#
# History: In Core Job History in minutes. Finished masters are kept in memory
# at least this long
#
History=60
#
# HistoryLimit: since the number of jobs to load can be specified, the previously
# defined History can be exceeded.
# The HistoryLimit defines a hard limit. Final jobs elder than this
# limit won't be loaded, disregarding any counts
#
HistoryLimit=120
#
#
# MinHistoryCount: Minimum number of masters loaded (if present), disregarding the
# History. 0 means no minimum
#
MinHistoryCount=1
#
# MaxHistoryCount: Maximum number of masters loaded, even if History is larger
# 0 means ignore
#
MaxHistoryCount=3
In your system where you have a lot of "heartbeat jobs", it'll probably make sense to keep only a few of them.
It won't make sense to keep the runs of the last 10 days in memory (History=14400; Min and Max are zero).
Although this doesn't reduce the problem caused by the restartable jobs, it allows a far more efficient use of memory, which in itself relaxes the main problem somewhat.
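To get a feeling for how the four parameters interact, here is a toy model in Python. It is only my reading of the comments in server.conf put into code, not the server's actual algorithm; in particular, how the server groups masters when counting is not modelled. The function name kept_masters and its arguments are made up for the illustration.

```python
from typing import List

MINUTES_PER_DAY = 24 * 60


def kept_masters(ages_min: List[int], history: int, history_limit: int,
                 min_count: int, max_count: int) -> List[int]:
    """Return the ages (in minutes) of finished masters this toy model keeps.

    Assumption: simplified reading of the server.conf comments only.
    """
    # Newest first; anything older than the hard HistoryLimit is dropped.
    candidates = sorted(a for a in ages_min if a <= history_limit)
    # Keep everything that finished within the History window...
    kept = [a for a in candidates if a <= history]
    # ...topped up to MinHistoryCount if fewer masters qualify.
    if len(kept) < min_count:
        kept = candidates[:min_count]
    # MaxHistoryCount caps the result; 0 means no cap.
    if max_count > 0:
        kept = kept[:max_count]
    return kept


# The values shown above: History=60, HistoryLimit=120, Min=1, Max=3.
print(kept_masters([10, 30, 50, 90, 150], 60, 120, 1, 3))  # [10, 30, 50]
print(kept_masters([90, 150], 60, 120, 1, 3))              # [90]
print(10 * MINUTES_PER_DAY)                                # 14400
```

The last line just confirms that History=14400 corresponds to 10 days. With frequent heartbeat jobs and no MaxHistoryCount, every one of those runs stays within the window.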
I hope these thoughts help you decide on a strategy to stabilise your system (apart from throwing memory at it).
Regards,
Ronald