Broken_finished - Schedulix 2.8


Vanessa Romero

Jun 11, 2021, 8:10:58 AM
to schedulix
Hi, 

we are running schedulix version 2.8. Sometimes we have jobs with the status BROKEN_FINISHED. How can we find out what the cause is, and is there any way to monitor this status? The logs disappear after the job finishes.

Thanks.

Ronald Jeninga

Jun 11, 2021, 9:18:36 AM
to schedulix
Hi,

the status BROKEN_FINISHED means that the connection between job server and job got lost, such that the scheduling system wasn't able to retrieve the exit code of the (now) finished process.
This usually doesn't happen, but if it does, there are a few possible causes:

1. The jobexecutor process died prematurely
2. There's a problem with the identification of the started processes

If a jobexecutor process died, it should be investigated what caused its death.
Usually jobexecutor processes are pretty rock solid. They basically do a fork() - exec() - wait() and write some information into some task file.
But I don't expect this to be the cause of your problem.
It is unlikely but possible that some system administrator has killed the jobexecutor processes, even if they ignore all signals except for the unmaskable ones (KILL, STOP, CONT).
A very effective way of producing this situation is to reboot the server. (It should be obvious we can't protect ourselves against a reboot though).

The second possibility is pretty "popular" on Windows systems, just after a time change to or from DST.
The Windows utility we use to determine the start time of a process (WMIC) reports invalid times just after a time change.
This leads to a situation where the reported start time of a process is off by one hour.
The job server then thinks this must be a different process than the one it started, and therefore concludes that the original process must have died unexpectedly and unnoticed.
We are working on this issue and I expect to have it repaired within the next week.
To please you I'll consider a back port from 2.10 to 2.9 and 2.8, but I can't promise this. (It's all about the height of the piles of paper on my desk).

Another cause for this problem can be an unreliable clock that runs slow.
We've seen this happen in virtual machines that don't sync their system time.
And I've seen it happen on an ancient HP-UX box.
The best measure here is to set up some NTP client to keep the clock sync'ed.
In case of HP-UX we ignore the start times of the processes we've started (and hope for the best).
But I'm not sure if this is part of 2.8. (But if not, there's no HP-UX support in 2.8 yet).
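A quick way to check whether a machine's clock is actually being disciplined is a sketch like the following (the tool names `timedatectl` and `chronyc` are assumptions about a reasonably modern Linux install; older or non-systemd machines may need ntpstat or ntpq instead):

```shell
#!/bin/sh
# Rough check: is the system clock NTP-synchronized?
# timedatectl (systemd) and chronyc (chrony) are assumptions about the host.
if command -v timedatectl >/dev/null 2>&1; then
    sync_state=$(timedatectl show -p NTPSynchronized --value 2>/dev/null)
elif command -v chronyc >/dev/null 2>&1; then
    chronyc tracking >/dev/null 2>&1 && sync_state=yes
fi
[ -n "$sync_state" ] || sync_state=unknown
echo "NTP synchronized: $sync_state"
```

If this reports "no" or "unknown", setting up an NTP client is the first thing to fix.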

What operating system are you using?
Does it happen on a regular basis (i.e. reliably)?
Are some systems more often affected than other systems?
Do the jobs reach the BROKEN_FINISHED state nearly immediately after submit, or does it take some time (several minutes or more)?
The taskfile should be cleaned up automatically, but the log files (those that catch stdout and stderr) aren't.

If the jobs live some time before reaching BROKEN_FINISHED, you could create a hard link to make the taskfile accessible under another name.
As soon as the original taskfile is removed, the area on disk will still have a positive reference count (because of the hard link) and won't be deleted.
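A tiny demonstration of this (the file names are made up): after the `rm`, the data is still reachable through the second name:

```shell
#!/bin/sh
# Demonstrate that data survives removal of the original name
# as long as a hard link to it exists (throwaway file names).
set -e
dir=$(mktemp -d)
echo "taskfile contents" > "$dir/taskfile"
ln "$dir/taskfile" "$dir/taskfile.backup"   # second directory entry, same disk area
rm "$dir/taskfile"                          # original name is gone ...
contents=$(cat "$dir/taskfile.backup")      # ... but the data is still readable
echo "$contents"
rm -rf "$dir"
```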

Hope this helps somewhat.
But no worries, we'll find the underlying cause of your problem. We only have to do some digging.

Best regards,

Ronald 

Vanessa Romero

Jun 14, 2021, 6:34:58 AM
to schedulix
Hi Ronald,

Thanks for your quick response. In our case, I'll tell you about the last two times that jobs appeared with this status. Both jobs had been running for several hours, and both reached BROKEN_FINISHED at the same time. We have verified that no one restarted the server at that time; in fact, other jobs were still running.

The client where these jobs run is RHEL 7.4. Months ago we had the same problem on another client, and it was fixed by improving disk performance. This time the server is already on an SSD disk; the only difference between the other client and this one is that the other has RDM disks and this one has a virtual disk. We have expanded the CPU and memory of the server, since it was a little tight, but the BROKEN_FINISHED states still occur.

Both the job server and the client have the same ntp server configured and are synchronized to the second. 

We were reviewing the hard-link solution to monitor this status and act accordingly, but if we create a hard link to a file that later disappears, wouldn't the hard link stop working? How would this solution be implemented?

Another question we would like to ask: is it possible to configure a job so that if it ends in the BROKEN_FINISHED state, its dependencies are executed anyway? These jobs usually finish correctly.

Sorry for my English...

Thanks a lot.

Ronald Jeninga

Jun 14, 2021, 8:08:22 AM
to schedulix
Hi,

it's all a little confusing. I'll have to think about this.

Bad system performance alone is usually not enough to produce BROKEN_FINISHED states.
Obviously, if the performance is bad enough, all kinds of things can happen, but it's still not very likely to be the sole reason for BROKEN_FINISHED states.
This is also shown by your statement: "We have expanded the cpu and the memory of the server since it was a little tight but the broken_finished are still reproduced. ".
(Memory shortage can potentially cause BROKEN_FINISHED states on Linux systems, because Linux starts killing more or less arbitrarily chosen processes (the OOM killer) if it runs low on memory.)

To answer your last question first:
The problem with a BROKEN_FINISHED job is that nobody knows what happened to it.
It is not known whether it ran to completion successfully. And in fact you write that the BROKEN_FINISHED processes are (on some occasions) still running.
Therefore it would be dangerous to set its (outgoing) dependencies to fulfilled.
It is even possible that the job that ended in a BROKEN_FINISHED state was actually a kind of switch point, where each of, say, five successful exit states decides about a different continuation of the entire flow.
Setting all five dependencies to fulfilled would end in disaster.
In my opinion a BROKEN_FINISHED state indicates an error which needs to be repaired. Eliminating the symptoms doesn't provide a cure.
It is possible to create an asynchronous trigger that runs some script if the job reaches the BROKEN_FINISHED state, which then either ignores the dependencies of the successors, or sets the job state to FINISHED.
But I'd feel bad about such a "solution".

BROKEN_FINISHED is a state in which the job server can't find the process any more, but there's no indication that the process has terminated in the task file.
(BROKEN_RUNNING means that the jobexecutor can't be found but the user process is still running).

The hard link thing is a bit tricky. Especially if there's not much time between the job being started and the job reaching BROKEN_FINISHED.
And if it only occurs (more or less) rarely, it would be a lot of work to make a link for each task file for jobs that run without issues.
Still, it is interesting to know the contents of the task file just before it gets deleted.

I'm not a huge specialist on Unix file systems, but basically a file consists of a disk area and one or more directory entries that point to that disk area.
For each of those directory entries, the link count (of the disk area) is incremented by one. If a process opens the file, the kernel additionally holds a reference to it.
If a "file" is deleted (`rm myfile`), the directory entry is removed and the link count is decremented. If the link count reaches zero and no process holds the file open, the disk area is freed.
If a process closes a file, that reference is dropped too, and once the link count is zero and the last reference is gone, the disk area is released.
But if the link count remains positive, nothing happens to that disk area. 

To create a hard link, you use the `ln` command without the "-s" option (which stands for "symbolic").
A hard link cannot span file systems because the reference within the directory specifies an inode (number). And inodes are only unique within a single file system.
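The effect is easy to observe with `stat` (the `-c` format option assumes GNU coreutils): both names report the same inode number, and the link count becomes 2:

```shell
#!/bin/sh
# Create a file and a hard link to it, then compare inode number (%i)
# and link count (%h) of both names. stat -c assumes GNU coreutils.
set -e
dir=$(mktemp -d)
echo data > "$dir/a"
ln "$dir/a" "$dir/b"        # no -s: a hard link, not a symbolic one
info_a=$(stat -c '%i %h' "$dir/a")
info_b=$(stat -c '%i %h' "$dir/b")
echo "a: $info_a"
echo "b: $info_b"           # same inode, link count 2
rm -rf "$dir"
```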

One method to create a hard linked "copy" of each task file is to use a wrapper for the jobexecutor.
A jobexecutor is called with a bunch of parameters, one of which is the name of the task file.
It'll give you some help if you ask for it:

```
schedulix@oncilla:~/schedulix/lib$ jobexecutor
Wrong number or type of arguments
Usage:
jobexecutor [--version|-v] [--help|-h] [<boottime_how> <taskfileName> [boottime]]

Exactly one of the optional argument sets must be specified
i.e. either the version request, or this help request or some specification on what to do
```

The second parameter will be the name of the task file.
Now if we make a wrapper like e.g.

```
#!/bin/bash

if [ $# -lt 3 ]; then
    # just a --version or --help request; pass it straight through
    exec "$BICSUITEHOME/bin/jobexecutor" "$@"
fi

TASKBASE=$(dirname "$2")
TFNAME=$(basename "$2")
mkdir -p "$TASKBASE/backuptf"   # just in case it doesn't exist yet
ln "$2" "$TASKBASE/backuptf/$TFNAME"
exec "$BICSUITEHOME/bin/jobexecutor" "$@"
```

You store this script somewhere and make it executable.
Please test it before breaking your production!! (I didn't test it; I just wrote down the basic idea)

Now you change the configuration of one of the job servers that cause problems, and instead of the jobexecutor you enter the name of the script.
(Jobserver and Resources, Config Tab, Entry JOBEXECUTOR).
The wrapper script should create a hard link for each task file. And as soon as you find a job in BROKEN_FINISHED, you can have a look at the task file.
If we're lucky, the process is still running and you can have a look at the starttimes file (which also resides in the task file directory) and we can start comparing pids and times.

The method used to determine the start time of a process (which, in a Unix/Linux environment, is not a piece of standard information) isn't very accurate.
This is why we compare the start time found with the start time we've measured, allowing an uncertainty of several seconds. (Basically something like `abs(T1 - T2) < jitter`.)
One of my plans for the next release (2.11) is to make that "jitter" configurable, where a value of zero indicates that the user wants us to disregard the start times altogether.
In your case (2.8) I can build a special job server that uses a huge jitter value, which should eliminate all BROKEN_FINISHED states, except for the real ones (computer rebooted and alike).
But I'd like to investigate the matter a little further.

And last but not least, I don't have a problem with your English at all! No need to apologize.
I hope you don't have issues with my English. I'm not a native speaker either.

Best regards,

Ronald 

Vanessa Romero

Jun 18, 2021, 9:00:30 AM
to schedulix
Hi Ronald! 
sorry for the delay in answering; we were doing tests.
The solution you gave us to make an ln of the taskfile helps us monitor the jobs that fail. We thought that when the taskfile disappeared, the hard-linked file would no longer show the information, but it does show it correctly. On this monitoring of failed jobs we have some doubts; let's see if you can help us.

 - The taskfile always ends with the status finished, and what we have to look at is the returncode to identify in which state it has finished:
0 = Success
1 = Failure
 What code would correspond to BROKEN_FINISHED?

 - The other issue is that a taskfile is generated with the name GLOBAL.'JOBSERVER'.'USER'-TASKID. For example, in our case the last one we executed is GLOBAL.'DW1PWCDES'.'CDI'-2708668. Is there any way that, instead of the final ID, the name of the job appears? That way we wouldn't see one ln per launched task, but would have a single file per job where the states of all executions are collected.

 thanks a lot!

Ronald Jeninga

Jun 18, 2021, 10:11:29 AM
to schedulix
Hi,

since the task files usually perform their task unnoticed, it doesn't make much sense to use the name instead of the jobid to name them.
We'd need the jobid anyway, because the job's name isn't guaranteed to be unique.
The ln utility does nothing more than create an extra directory entry that points to the original file. Afterwards the file has two names, but is still a single file.
There's no way to aggregate multiple files into one.

And it seems you haven't quite grasped the meaning of BROKEN_FINISHED yet.
There is no exit code that maps to BROKEN_FINISHED. BROKEN_FINISHED just means that the jobserver can't find the process (and the jobexecutor can't be found either).
If the task file shows an exit code, that proves everything ran fine.

It still looks like somehow there's a problem with the start times of the processes (which is used to identify them, together with the PID (which is not unique over time)).
The obvious next step is to test that hypothesis, which can easily be done by using another value for the jitter variable.

I guess you can't easily modify the sources and recompile.
Hence I've uploaded a BICsuite.jar to our webserver.

For completeness:

```
[ronald@ocelot lib]$ md5sum BICsuite.jar
de54d64847f2c79c74a4c732325093df  BICsuite.jar
```

and

```
[ronald@ocelot lib]$ jar xvf BICsuite.jar META-INF/MANIFEST.MF && cat META-INF/MANIFEST.MF && rm -rf META-INF
 inflated: META-INF/MANIFEST.MF
Manifest-Version: 1.1
Program-Version: 2.8
Level: OPEN
Company: independIT Integrative Technologies GmbH
Created-By: 1.7.0_261 (Oracle Corporation)
Copyright: Copyright (c) 2002-2017 independIT Integrative Technologies
  GmbH
Build: 546c4bf1c6aede9482e93e5499e4e8d367fcca1a
Build-Date: 18.06.2021 16:00
```

The jitter now accepts a difference between measured and calculated start time of up to 3600 seconds.
If you don't find any BROKEN_FINISHED jobs afterwards, the hypothesis is confirmed.
If the problem remains, we'll have to do some more experiments in order to find the cause and a solution.

Enjoy your weekend!

Ronald

Vanessa Romero

Jun 22, 2021, 9:11:52 AM
to schedulix
Hi Ronald!

thanks a lot for your quick and detailed answers. I get a little lost with the "jitter" parameter. Is that parameter the maximum deviation the job server can tolerate while still matching the job? What is it set to by default, in seconds?

Best regards.
Vanessa

Ronald Jeninga

Jun 22, 2021, 11:51:16 AM
to schedulix
Hi Vanessa,

OK, let me explain.
When we started the development of schedulix, pids had 16 bits and ranged from 1 to 65535.
Pids are reused.
Hence in order to reliably identify a process, we decided to compare both the pid and the start time of the process.
It's highly unlikely, if not impossible, that a pid is used twice within the same second.

Now the reasoning is fine, but the problem is that Unix/Linux systems don't have a portable system call that gives us the start time of a process.
To obtain the start times, we use a ps command. That doesn't give us the start time, but rather the time the process has been alive.
We obtain the start time by calculating: `start time = now - run time`

The remaining problem is that computers aren't high quality time measuring devices.
This forces us to compare the start times with a certain tolerance, which we call "jitter".
If `abs(ps-start-time - measured-start-time) < jitter`, then they are regarded as equal.
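As an illustrative sketch of that comparison (plain shell, not the actual schedulix code, which does this in Java; `ps -o etimes=` is assumed to be available, as it is with procps-ng and the BSDs):

```shell
#!/bin/sh
# Sketch: derive a process start time from its elapsed time (ps etimes)
# and compare it against a recorded start time with a tolerance ("jitter").
# The variable names are illustrative, not schedulix internals.
pid=$$                                          # use this shell as the test subject
jitter=2
recorded_start=$(date +%s)                      # the moment we "observed" the start
elapsed=$(ps -o etimes= -p "$pid" | tr -d ' ')
calculated_start=$(( $(date +%s) - elapsed ))   # start time = now - run time
diff=$(( recorded_start - calculated_start ))
[ "$diff" -lt 0 ] && diff=$(( -diff ))          # abs(T1 - T2)
if [ "$diff" -lt "$jitter" ]; then
    result="same process"
else
    result="different process (times differ by ${diff}s)"
fi
echo "$result"
```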

The unmodified 2.8 code uses a jitter of 2 seconds which is usually no problem at all.
But I've seen e.g. ancient HP-UX boxes that were very inaccurate.
The longer a process ran, the greater the difference between the two times grew.

I've modified that one constant in the jar file I've built for you.
The jitter is set to 3600, which should reduce the number of BROKEN_FINISHED states significantly.
If your system behaves like the ancient HP-UX, an hour might not be sufficient for processes that run very long, but we'll find that out soon enough.

The plan for 2.11 is to make the jitter configurable.
Even though I prefer to run the system with the jitter set to some small value, some systems will force me to make it large (e.g. 86400s) or to switch the check off entirely (that value is going to be 0, since a jitter of zero is nonsense anyway).

I hope I've been able to throw some light on the matter :-)

Best regards,

Ronald

Ronald Jeninga

Jul 1, 2021, 4:49:44 AM
to schedulix
Hi Vanessa,

did you run into any BROKEN_FINISHED issues after applying the patch?
If not, I'd say that my diagnosis (jitter too small) is correct.

In 2.11 (current development) I'll make the jitter configurable.
I didn't decide yet to backport this to 2.10, but if there are good reasons to do so, I will.
(Feel free to argue with me about this and to explain why backporting would be the decision of the century).

Best regards,

Ronald

Vanessa Romero

Aug 24, 2021, 4:45:07 AM
to schedulix
Hi Ronald,

about 3 weeks ago the .jar you provided us was applied, and so far there is no trace of BROKEN_FINISHED. It would be great if this parameter were configurable in later versions.

Thanks for everything!

Ronald Jeninga

Aug 26, 2021, 1:40:54 AM
to schedulix
Hi Vanessa,

thank you for the feedback, I appreciate it!
To me, your results prove that the anticipated added parameter is a good idea.
I'll add it to the current development tree and it will be available in 2.11.

Best regards,

Ronald
 

Ronald Jeninga

Oct 7, 2021, 6:02:09 AM
to schedulix
Hi Vanessa,

I've changed my mind somewhat, in the sense that the STARTTIME_JITTER configuration parameter will be available for both 2.10 and 2.9 as well.
Hence if you upgrade to 2.9 or 2.10 (which shouldn't be a problem, but if you run into problems, please let me know), you can configure your jobservers with either a very tolerant jitter value (e.g. 3600) or switch the start time check off altogether.
The GUI is not aware of this extra configuration parameter, which means that you'd have to set the parameter using sdmsh or the shell window in the web interface.

Currently I'm busy creating the rpms for both RHEL7 and RHEL8 (both Intel). They will be available at latest tomorrow.

If you need an example command that sets the configuration parameter, please tell me.

Best regards,

Ronald