Hi,
it's all a little confusing. I'll have to think about this.
Poor system performance alone is usually not enough to produce BROKEN_FINISHED states.
Obviously, if the performance is bad enough all kinds of things can happen, but it is still unlikely to be the sole reason for BROKEN_FINISHED states.
This is also shown by your statement: "We have expanded the cpu and the memory of the server since it was a little tight but the broken_finished are still reproduced."
(Memory shortage can potentially cause BROKEN_FINISHED states on Linux systems, because IIRC the Linux OOM killer starts terminating processes more or less at random if the system runs low on memory.)
To answer your last question first:
The problem with a BROKEN_FINISHED job is that nobody knows what happened to it.
It is not known whether it ran to completion successfully. And in fact you write that the BROKEN_FINISHED processes are still running (on some occasions).
Therefore it would be dangerous to set its (outgoing) dependencies to fulfilled.
It is even possible that the job that ended in a BROKEN_FINISHED state was actually a kind of switch point, where one of the five successful terminations decides about the further continuation of the entire flow.
Setting all five dependencies to fulfilled would end in disaster.
In my opinion a BROKEN_FINISHED state indicates an error which needs to be repaired. Eliminating the symptoms doesn't provide a cure.
It is possible to create an asynchronous trigger that runs some script if the job reaches the BROKEN_FINISHED state, which then either ignores the dependencies of the successors, or sets the job state to FINISHED.
But I'd feel bad about such a "solution".
BROKEN_FINISHED is a state in which the job server can't find the process any more, but the task file contains no indication that the process has terminated.
(BROKEN_RUNNING means that the jobexecutor can't be found but the user process is still running.)
The hard link thing is a bit tricky, especially if there's not much time between the job being started and the job reaching BROKEN_FINISHED.
And if the problem only occurs (more or less) rarely, it would be a lot of work to create a link for the task file of every job that runs without issues.
Still, it is interesting to know the contents of the task file just before it gets deleted.
I'm not a huge specialist on Unix file systems, but basically a file consists of a disk area (an inode plus data blocks) and one or more directory entries that point to that disk area.
Each of those directory entries increments the link count (of the disk area) by one. In addition, while a process has the file open, the kernel holds a reference to it.
If a "file" is deleted (`rm myfile`), the directory entry is removed and the link count is decremented.
The disk area is only freed once the link count reaches zero and no process still has the file open.
But as long as the link count remains positive, nothing happens to that disk area.
In order to create a hard link, you use the `ln` command without the "-s" option (which stands for "symbolic").
A hard link cannot span file systems, because the reference within the directory specifies an inode (number), and inode numbers are only unique within a single file system.
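You can see this behaviour for yourself. A small demonstration (in a throwaway directory, with made-up file names):

```shell
# Show how a hard link keeps the disk area alive after the
# original directory entry has been removed.
cd "$(mktemp -d)"                      # work in a scratch directory
echo "task file contents" > taskfile   # create a file; link count is 1
ln taskfile backup                     # add a hard link; link count is now 2
stat -c %h taskfile                    # show the link count (GNU stat on Linux)
rm taskfile                            # remove one directory entry
stat -c %h backup                      # link count is back to 1
cat backup                             # the data is still readable
```

This is exactly the effect the wrapper below exploits: deleting the original task file no longer releases the disk area, because the hard link in the backup directory keeps the link count positive.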
One method to create a hard linked "copy" of each task file is to use a wrapper for the jobexecutor.
A jobexecutor is called with a bunch of parameters, one of which is the name of the task file.
It'll give you some help if you ask for it:
```
schedulix@oncilla:~/schedulix/lib$ jobexecutor
Wrong number or type of arguments
Usage:
jobexecutor [--version|-v] [--help|-h] [<boottime_how> <taskfileName> [boottime]]
Exactly one of the optional argument sets must be specified
i.e. either the version request, or this help request or some specification on what to do
```
The second parameter will be the name of the task file.
Now if we make a wrapper like e.g.
```
#!/bin/bash
if [ $# -lt 2 ]; then
    # a single argument means this is just a --version or --help request
    exec $BICSUITEHOME/bin/jobexecutor "$@"
fi
# $2 is the task file name; the optional boottime may or may not be present
TASKBASE=$(dirname "$2")
TFNAME=$(basename "$2")
mkdir -p "$TASKBASE/backuptf"          # just in case it doesn't exist yet
ln "$2" "$TASKBASE/backuptf/$TFNAME"   # hard link keeps the data alive
exec $BICSUITEHOME/bin/jobexecutor "$@"
```
You store this script somewhere and make it executable.
Please test it before breaking your production!! (I didn't test it; I just wrote down the basic idea)
Now you change the configuration of one of the job servers that cause problems and enter the name of the script instead of the jobexecutor.
(Jobserver and Resources, Config Tab, Entry JOBEXECUTOR).
The wrapper script should create a hard link for each task file. And as soon as you find a job in BROKEN_FINISHED, you can have a look at the task file.
If we're lucky, the process is still running and you can have a look at the starttimes file (which also resides in the task file directory) and we can start comparing pids and times.
The method used to determine the start time of a process (which, in a Unix/Linux environment, is not a piece of standard information) isn't very accurate.
This is why we compare the start time found with the start time we've measured, allowing an uncertainty of several seconds (basically something like `abs(T1 - T2) < jitter`).
One of my plans for the next release (2.11) is to make that "jitter" configurable, where a value of zero indicates that the user wants us to disregard the start times altogether.
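In shell terms, the idea is roughly this. It's a sketch, not schedulix's actual implementation, and it assumes a Linux /proc file system (field 22 of `/proc/<pid>/stat` is the process start time in clock ticks since boot):

```shell
# Sketch of the start-time comparison on Linux (hypothetical values).
PID=$$                                         # use the current shell as the example
JITTER=5                                       # allowed uncertainty in seconds
# Field 22 of /proc/<pid>/stat is the start time in ticks since boot.
# (Fragile if the command name in field 2 contains spaces.)
TICKS=$(awk '{print $22}' /proc/$PID/stat)
HZ=$(getconf CLK_TCK)                          # clock ticks per second, usually 100
BOOT=$(awk '/^btime/ {print $2}' /proc/stat)   # boot time in epoch seconds
T1=$(( BOOT + TICKS / HZ ))                    # computed process start time (epoch)
T2=$(date +%s)                                 # stand-in for the recorded start time
DIFF=$(( T1 >= T2 ? T1 - T2 : T2 - T1 ))       # abs(T1 - T2)
if [ $DIFF -lt $JITTER ]; then
    echo "same process"
else
    echo "different process"
fi
```

The coarse tick resolution and the boot-time rounding are where the inaccuracy comes from, hence the jitter.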
In your case (2.8) I can build a special job server that uses a huge jitter value, which should eliminate all BROKEN_FINISHED states, except for the real ones (computer rebooted and alike).
But I'd like to investigate the matter a little further.
And last but not least, I don't have a problem with your English at all! No need to apologize.
I hope you don't have issues with my English. I'm not a native speaker either.
Best regards,
Ronald