Monitoring the system for job and system failures so automated alerts can be configured

Shane Farrell

Aug 8, 2018, 8:29:52 AM
to schedulix
Hi,

I need some help with setting up a monitoring system around schedulix. 

Very important batch jobs are triggered via schedulix and we need a way to automate failure alerts. Currently we have to manually check the "Running Master Jobs" view to find failed jobs. Instead, we would prefer to simply send an email when a job fails. I'm just not sure of the best practices around this. We would need to catch job-level failures as well as system/server-level failures (e.g. "Couldn't open logfile (2 / No such file or directory)").

Any help will be much appreciated!

Thanks,
Shane

Ronald Jeninga

Aug 8, 2018, 9:05:07 AM
to schedulix
Hi Shane,

notifications are usually implemented using triggers.
If some job runs into some failure state you can define a trigger to submit a job that can send e-mails, call the police or whatever.
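
As a rough illustration of what such a triggered notification job might run, here is a minimal Python sketch. It is not part of schedulix; the function names, the addresses and the SMTP host are hypothetical, and it assumes the triggered job is handed the failing job's name and state as arguments.

```python
import smtplib
from email.message import EmailMessage


def build_alert(job_name: str, exit_state: str,
                sender: str = "schedulix@example.com",
                recipient: str = "ops@example.com") -> EmailMessage:
    """Compose (but do not send) a failure notification.

    The addresses are placeholders; replace them with real ones.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"schedulix alert: {job_name} reached state {exit_state}"
    msg.set_content(
        f"Job {job_name} finished with exit state {exit_state}.\n"
        "Please check the job's log files."
    )
    return msg


def send_alert(msg: EmailMessage, host: str = "localhost") -> None:
    """Hand the message to an SMTP server (assumed to listen on `host`)."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)
```

A wrapper script would call `send_alert(build_alert(name, state))` with whatever values the trigger passes in.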

In an exit state profile you can define some restartable exit state as the Error state.
The consequence will be that a job that can't open a log file or can't start for some other reason will acquire that Error state and this can be used to trigger something.

The E100_TRIGGER example shows the principle.

Does this help?

Best regards,

Ronald

Shane Farrell

Aug 9, 2018, 8:10:08 AM
to schedulix
Hi Ronald,

Thank you for the direction. It does help get me started. I will try to use E100_TRIGGER example to help me create my alerts. If I run into any problems I will update.

Thanks again,
Shane

Shane Farrell

Aug 14, 2018, 10:50:58 AM
to schedulix
Hi Ronald,

I have set up an alert that is triggered by our important jobs if they go into certain states. But there is a problem with "Couldn't open logfile" and other such errors that prevent a job from even starting: there is no exit state associated with them, i.e. the job's exit state is "NONE". See screenshot:

OriginalProdError.PNG


How do we associate this ERROR state to a job exit state that can be listed in the Triggered By states list?

Thanks,
Shane

Ronald Jeninga

Aug 14, 2018, 1:29:49 PM
to schedulix
Hi Shane,

if you look at the exit state profile in use, you need a state marked as "broken". (From our point of view a definition is broken if the corresponding process can't be started.)
If there isn't such a state, the job state ERROR won't induce an exit state (usually FAILURE or something similar).
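
The rule can be pictured with a small Python model. This is only a sketch of the behaviour described above, not schedulix code; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProfileState:
    """One entry of an exit state profile (simplified)."""
    name: str
    kind: str             # "final" or "restartable"
    broken: bool = False  # marks the state used when a job cannot start


def exit_state_on_job_error(profile: List[ProfileState]) -> Optional[str]:
    """Return the exit state a job gets when it runs into job state ERROR.

    If no state in the profile is marked as broken, no exit state is
    induced and None is returned (shown as "NONE" in the GUI).
    """
    for state in profile:
        if state.broken:
            return state.name
    return None
```

With a profile lacking a broken state the function returns `None`, which corresponds to the exit state "NONE" seen in the screenshot; adding a broken state makes the error visible to triggers.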

See attached screen shots.
One is a job that can't open its log file in /bin (not really a surprise), and the other one the definition of the used exit state profile (where I made a mistake in naming the screen shot; ESP_STANDARD would have been perfect).

HTH

Regards,

Ronald


broken_state.png
ESD_STANDARD.png

Shane Farrell

Aug 16, 2018, 6:03:15 AM
to schedulix
Hi Ronald,

I have now got an exit state I named 'ERROR' that is marked as "broken". Thanks for that tip.

Another issue I now face is that I can only use "FINAL" states as "trigger on" states for a "triggered by" job.

E.g. I have an alert job EMAIL_ALERT that is triggered by JOB_X, which has Exit State Profile Y. Only the states that are "FINAL" in Exit State Profile Y show as an option in "trigger on". See the attached screenshots of the Exit State Profile and the Triggered On selection screen.

We need states that are "RESTARTABLE" to show as an option in Triggered On. How do I do that?

ExitStateProfile.PNG

TriggerDetailsTriggerOnStateOptions.PNG

Ronald Jeninga

Aug 16, 2018, 6:27:56 AM
to schedulix
Hi Shane,

first of all, the order of the exit states in an exit state profile plays an important role.
Operators are like company bosses. They seem not very interested if everything runs flawlessly, but they want to be alerted immediately if something goes wrong.
In fact, you're trying to automate that process.

Anyway, if a job runs into a restartable state like FAILURE or ERROR, you want to see that in the GUI.
But most jobs are children of some batch and are not visible in the master list.
The master batch is visible. And if somewhere deep down below an error occurs, you want that master batch to display that.
But the master batch only has a single state it can display. This means that somehow you must tell the master batch which state is most important.
And this is where the preference in the exit state profiles comes into play.
The topmost state in the list is regarded as the most important. The last entry will only be displayed if there's nothing else to tell.

This is why a profile normally has a few restartable states on top, followed by the final states.
Of course, it is possible that some error states are final, just because the job is broken beyond repair; a rerun wouldn't fix the problem.
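
A simplified Python model of this preference rule follows. It is a sketch only: the function name is hypothetical, and it ignores translations and everything else the real server takes into account.

```python
from typing import Iterable, List, Optional


def displayed_state(profile_order: List[str],
                    child_states: Iterable[str]) -> Optional[str]:
    """Pick the state a master batch would display.

    `profile_order` lists the profile's states from most to least
    important; the first one that occurs among the children wins.
    """
    present = set(child_states)
    for state in profile_order:
        if state in present:
            return state
    return None
```

With the order `["ERROR", "FAILURE", "SUCCESS"]`, a single failed child among many successful ones is enough to make the master show FAILURE.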

Restartable states are displayed in red because "restartable" means that an operator intervention is required. And yes a rerun is one of the options.

Which exit state you can trigger on depends on the trigger type. Obviously the trigger types AFTER_FINAL and BEFORE_FINAL require a final state.
(AFTER_FINAL triggers after a job reached a FINAL state which can only happen if its exit state is a final state; BEFORE_FINAL triggers if a job would reach a final state, thus the exit state must be a final state as well).
IMMEDIATE_LOCAL and IMMEDIATE_MERGE can trigger on any state, just as the FINISH_CHILD.
The FINISH_CHILD trigger in particular is a convenient tool for notifications. As soon as a child somewhere below gets the exit state you're interested in, the trigger will fire.
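
The relation between trigger type and selectable states can be summarised in a short Python sketch. It only covers the trigger types named above, and the function and variable names are hypothetical rather than anything schedulix provides.

```python
from typing import Dict, List

# Trigger types that can only fire on final exit states.
FINAL_ONLY = {"AFTER_FINAL", "BEFORE_FINAL"}
# Trigger types that can fire on any exit state of the profile.
ANY_STATE = {"IMMEDIATE_LOCAL", "IMMEDIATE_MERGE", "FINISH_CHILD"}


def selectable_states(trigger_type: str, profile: Dict[str, str]) -> List[str]:
    """List the states offered under "trigger on" for a trigger type.

    `profile` maps each state name to "final" or "restartable".
    """
    if trigger_type in FINAL_ONLY:
        return [name for name, kind in profile.items() if kind == "final"]
    if trigger_type in ANY_STATE:
        return list(profile)
    raise ValueError(f"trigger type not covered by this sketch: {trigger_type}")
```

This matches the observation that only FINAL states were offered: the trigger in question was presumably of a type that requires a final state.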

HTH

Regards,

Ronald

wim.ve...@billinghouse.nl

Sep 26, 2018, 8:08:20 AM
to schedulix
Hi,

I have a similar issue, we want a mail when something fails.

In order to help the configurators of schedulix, the triggered jobs use a standardized set of exit codes, which I translate into readable exit states. This works very nicely, because when a job fails, the type of error is immediately clear.
But when adding triggers this becomes a nuisance, because I have to put the complete list of failures in the trigger.

Noticing there was an exit state mapping in the run tab, I thought that might help, so that I would not have to put the entire list in the trigger. But then I cannot save the trigger, as it complains.
What would be the best approach to simplify the trigger definition?
Would a condition be of use here?

Where do exit state translations come in handy?

I created the following profiles/exit translations:

begin multicommand

create or alter exit state definition 'SUCCESS';
create or alter exit state definition 'UNEXPECTED_ERROR';
create or alter exit state definition 'MISSING_PARAMETER';
create or alter exit state definition 'INVALID_PARAMETER';
create or alter exit state definition 'PARSING_ERROR';
create or alter exit state definition 'UNKNOWN_PARAMETER';
create or alter exit state definition 'CALL_FAILED';
create or alter exit state definition 'ABORTED';
create or alter exit state definition 'NO_RESPONSE';
create or alter exit state definition 'BILLING_FAILED';

create or alter exit state mapping 'EXIT-MAPPING'
with map = (
   'UNEXPECTED_ERROR',
0, 'SUCCESS',
1, 'MISSING_PARAMETER',
2, 'INVALID_PARAMETER',
3, 'PARSING_ERROR',
4, 'UNKNOWN_PARAMETER',
5, 'CALL_FAILED',
6, 'ABORTED',
7, 'NO_RESPONSE',
8, 'BILLING_FAILED',
9, 'UNEXPECTED_ERROR'
);

create or alter exit state profile 'EXIT-PROFILE'
with
default mapping = 'EXIT-MAPPING',
states = (
'UNEXPECTED_ERROR' final batch default,
'MISSING_PARAMETER' final,
'INVALID_PARAMETER' final,
'PARSING_ERROR' final,
'UNKNOWN_PARAMETER' final,
'CALL_FAILED' final,
'ABORTED' final,
'NO_RESPONSE' final,
'BILLING_FAILED' final,
'SUCCESS' final dependency default
);

create or alter exit state translation 'EXIT-TRANSLATION'
with translation = (
'UNEXPECTED_ERROR' to 'FAILURE',
'MISSING_PARAMETER' to 'FAILURE',
'INVALID_PARAMETER' to 'FAILURE',
'PARSING_ERROR' to 'FAILURE',
'UNKNOWN_PARAMETER' to 'FAILURE',
'CALL_FAILED' to 'FAILURE',
'ABORTED' to 'FAILURE',
'NO_RESPONSE' to 'FAILURE',
'BILLING_FAILED' to 'FAILURE',
'SUCCESS' to 'SUCCESS'
);

end multicommand;

Ronald Jeninga

Sep 26, 2018, 8:35:56 AM
to schedulix
Hi Wim,

as always, the best solution depends on the exact specification.
But in many cases a FINISH CHILD trigger is what you want for failure notifications.
This trigger can be attached to the top level batch or job, which means that creating it is hardly any work.

At top level (e.g. in the list of running masters) you're not interested in the details, you only want to know that something went wrong.
So it is a good idea to translate the detailed exit states to a generic one (FAILURE/SUCCESS). And after having done this, you can tell the FINISH CHILD trigger to trigger on FAILURE.
The standard parameters in the trigger context will provide the information of the job that effectively caused the trigger to fire. (I think it's the TRIGGERREASONID, but it's a good idea to test this).

Hence the exit state translation both allows you to display status information in a compact way and simplifies the definition of the notification trigger.

I don't see much use for a condition here, but then again that might change with the exact specification.

For the rest, your definitions look quite good. I probably would want to define SUCCESS as a Batch Default though. In some cases you need empty batches and it would be annoying if the system starts reporting failures because of that.
(If you have complex dependency conditions, like "(A or B) and (C or D)", where A, B, C, D are jobs finishing with SUCCESS, you'll need a couple of empty batches to build this).

HTH

Regards,

Ronald

wim.ve...@billinghouse.nl

Sep 27, 2018, 5:19:31 AM
to schedulix
The thing I am most struggling with here is how to convert the exit profiles into exit states. Ideally I would somehow call the exit-translation to transform the exit-profile value into a more top level exit state.
There must be an easy way to do this, but I haven't found it so far.

When I use the child finished in combination with exit state FAILURE, it doesn't trigger. The job monitor shows me the reason for failure was CALL_FAILED. 

Wim.


wim.ve...@billinghouse.nl

Sep 27, 2018, 7:38:02 AM
to schedulix
Hi, 

I tried to make an example, so I can test stuff.

I think I do everything correctly, but obviously not, because I get an error when trying to create the example.

The error I get is: ERROR:02112201828, Error in Statement 19 (AlterJobDefDependents) : Profile doesn't contain translated child state INVALID_PARAMETER of SYSTEM.TEST.FAIL_JOB Translation EXIT-TRANSLATION

What I intend to do is to translate the exit state of the FAIL_JOB into a more generic exit state using the exit translation on the child.
What am I doing wrong here?

This is the sdmsh script:

begin multicommand

/* Create exit states, mapping, profile and translation */

create or alter exit state definition 'SUCCESS';
create or alter exit state definition 'UNEXPECTED_ERROR';
create or alter exit state definition 'MISSING_PARAMETER';
create or alter exit state definition 'INVALID_PARAMETER';
create or alter exit state definition 'PARSING_ERROR';
create or alter exit state definition 'UNKNOWN_PARAMETER';
create or alter exit state definition 'CALL_FAILED';
create or alter exit state definition 'ABORTED';
create or alter exit state definition 'NO_RESPONSE';
create or alter exit state definition 'BILLING_FAILED';

/* stmt 11 */
create or alter exit state mapping 'EXIT-MAPPING'
with map = (
   'UNEXPECTED_ERROR',
0, 'SUCCESS',
1, 'MISSING_PARAMETER',
2, 'INVALID_PARAMETER',
3, 'PARSING_ERROR',
4, 'UNKNOWN_PARAMETER',
5, 'CALL_FAILED',
6, 'ABORTED',
7, 'NO_RESPONSE',
8, 'BILLING_FAILED',
9, 'UNEXPECTED_ERROR'
);

/* stmt 12 */
create or alter exit state profile 'EXIT-PROFILE'
with
default mapping = 'EXIT-MAPPING',
states = (
'UNEXPECTED_ERROR' final batch default,
'MISSING_PARAMETER' final,
'INVALID_PARAMETER' final,
'PARSING_ERROR' final,
'UNKNOWN_PARAMETER' final,
'CALL_FAILED' final,
'ABORTED' final,
'NO_RESPONSE' final,
'BILLING_FAILED' final,
'SUCCESS' final dependency default
);

/* stmt 13 */
create or alter exit state translation 'EXIT-TRANSLATION'
with translation = (
'UNEXPECTED_ERROR' to 'FAILURE',
'MISSING_PARAMETER' to 'FAILURE',
'INVALID_PARAMETER' to 'FAILURE',
'PARSING_ERROR' to 'FAILURE',
'UNKNOWN_PARAMETER' to 'FAILURE',
'CALL_FAILED' to 'FAILURE',
'ABORTED' to 'FAILURE',
'NO_RESPONSE' to 'FAILURE',
'BILLING_FAILED' to 'FAILURE',
'SUCCESS' to 'SUCCESS'
);

/* Create test folder */
/* stmt 14 */
create or alter folder SYSTEM.TEST;

/* Create child that will fail when executed */
/* stmt 15 */
create or alter job definition SYSTEM.TEST.FAIL_JOB
with
   profile = 'EXIT-PROFILE',
   environment='SCHEDULIX_LOCALHOST_1@SCHEDULIX-EU-WEST-1A-1',
   type = job,
   errlog = 'logs/${JOBID}.log' NOTRUNC,
   logfile = 'logs/${JOBID}.log' NOTRUNC,
   master,
   run program = './runjob.sh DELAY -duration 2 -units ms -exitcode "$EXITCODE"',
   group = public, 
   parameters = ( 'EXITCODE' );

/* Create sendmail job that will be triggered */
/* stmt 16 */
create or alter job definition SYSTEM.TEST.SENDMAIL
with
   profile = 'NOVA-EXIT-PROFILE',
   environment='SCHEDULIX_LOCALHOST_1@SCHEDULIX-EU-WEST-1A-1',
   type = job,
   errlog = 'logs/${JOBID}.log' NOTRUNC,
   logfile = 'logs/${JOBID}.log' NOTRUNC,
   master,
   run program = './sendmail.sh "$TRIGGERBASE" "$TRIGGERNEWSTATE"',
   group = public;

   

/* Create top level job */
/* stmt 17 */
create or alter job definition SYSTEM.TEST.TEST_SEND_MAIL
with 
   profile = 'EXIT-PROFILE',
   environment='SCHEDULIX_LOCALHOST_1@SCHEDULIX-EU-WEST-1A-1',
   type = job,
   errlog = 'logs/${JOBID}.log' NOTRUNC,
   logfile = 'logs/${JOBID}.log' NOTRUNC,
   master,
   run program = '0',
   parameters = ( 'EXITCODE' ),
   group = public;

/* Define parameter to be correct */
/* stmt 18 */
alter job definition SYSTEM.TEST.TEST_SEND_MAIL
alter 
parameter = (
'EXITCODE' PARAMETER default = 'NO_RESPONSE'
); 

/* Add the child */
/* stmt 19 */
alter job definition SYSTEM.TEST.TEST_SEND_MAIL
add or alter children = (
SYSTEM.TEST.FAIL_JOB
static
enable
CHILDSUSPEND
translation='EXIT-TRANSLATION'
ignore dependency = none
);

/* Add the trigger */
/* stmt 20 */
create or alter trigger 'ON_FAILURE' on job definition SYSTEM.TEST.TEST_SEND_MAIL
with 
active,
condition = none,
nomaster,
nowarn,
limit state = none,
nosuspend,
state = (
'FAILURE'
),
submitcount = 1,
submit SYSTEM.TEST.SENDMAIL,
type=FINISH CHILD;
end multicommand;

Ronald Jeninga

Sep 27, 2018, 7:47:43 AM
to schedulix
Hi Wim,

the exit state translation to use is defined within the parent-child relationship.

Each job has some exit state profile. This exit state profile can have a default mapping, but it is very well possible that this special job has a special opinion about exit codes and their meaning.
This is why you can override the profile's default exit state mapping at job level.

Now a job can have multiple parents. And depending on the parent you might need one or another translation of exit states.
Hence the place to define this is within the parent-child relationship.

If you define your exit state translation (EST) at the correct place, the finish child trigger should work.

You already identified correctly that the logical names for exit states (SUCCESS, FAILURE, OUT_OF_INPUT, PARSE_ERROR, ROOKWORST) give a lot more information than the numerical values.
An exit state profile is a kind of contract: it's the set of possible exit states a job can have and information on how to treat them.
An exit state mapping is a function that maps values between -2^31 and 2^31 - 1 to some exit state.
An exit state translation is a function that maps exit states to exit states.
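
To make the difference between the two functions concrete, here is a Python sketch built from Wim's EXIT-MAPPING and EXIT-TRANSLATION. It assumes that each number in a mapping marks the start of a range and that the leading state covers everything below the first number; for this particular mapping that reading and a plain per-code lookup give the same results. The function names are hypothetical.

```python
import bisect

# EXIT-MAPPING: exit code -> exit state.
# BOUNDS[i] is assumed to be the first exit code of the range for STATES[i + 1].
BOUNDS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
STATES = [
    "UNEXPECTED_ERROR",   # everything below 0
    "SUCCESS", "MISSING_PARAMETER", "INVALID_PARAMETER", "PARSING_ERROR",
    "UNKNOWN_PARAMETER", "CALL_FAILED", "ABORTED", "NO_RESPONSE",
    "BILLING_FAILED",
    "UNEXPECTED_ERROR",   # 9 and above
]

# EXIT-TRANSLATION: exit state -> exit state.
TRANSLATION = {state: "FAILURE" for state in STATES}
TRANSLATION["SUCCESS"] = "SUCCESS"


def map_exit_code(code: int) -> str:
    """Exit state mapping: numeric exit code to exit state."""
    if not -2**31 <= code <= 2**31 - 1:
        raise ValueError("exit code outside the 32 bit signed range")
    return STATES[bisect.bisect_right(BOUNDS, code)]


def translate(state: str) -> str:
    """Exit state translation: child exit state to the state the parent sees."""
    return TRANSLATION[state]
```

A child exiting with code 5 therefore ends in CALL_FAILED, while its parent (through the translation) sees FAILURE.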

I hope this clears the confusion about the terminology somewhat.

Regards,

Ronald

Ronald Jeninga

Sep 27, 2018, 7:56:19 AM
to schedulix
Hi Wim,

small question: what is the definition of the NOVA-EXIT-PROFILE exit state profile?

Regards,

Ronald

wim.ve...@billinghouse.nl

Sep 27, 2018, 8:21:04 AM
to schedulix
That is a small error in the script. It should have been EXIT-PROFILE.

What puzzles me is that according to your reply I basically have everything correct.

The error message is ERROR:02112201828, Error in Statement 19 (AlterJobDefDependents) : Profile doesn't contain translated child state INVALID_PARAMETER of SYSTEM.TEST.FAIL_JOB Translation EXIT-TRANSLATION

Stmt 19 is the adding of the child FAIL_JOB to TEST_SEND_MAIL job.
The FAIL_JOB has the EXIT-PROFILE states (this one includes the INVALID_PARAMETER child state).
The given EXIT-TRANSLATION maps 'INVALID_PARAMETER' to 'FAILURE'.
Basically the EXIT-TRANSLATION only results in FAILURE or SUCCESS. All incoming states from EXIT-PROFILE are mapped to one of these.

So I do not understand what is wrong.



Ronald Jeninga

Sep 27, 2018, 8:25:03 AM
to schedulix
Hi Wim,

yes, it looks good. And that's why I'm trying to find out what's wrong.

I'll tell you if I find something. Sometimes it's just a detail that's causing the hassle.

Regards,

Ronald

Ronald Jeninga

Sep 27, 2018, 8:41:21 AM
to schedulix
Hi Wim,

once you know what's going wrong, it's all obvious.

The parent of SYSTEM.TEST.FAIL_JOB is SYSTEM.TEST.TEST_SEND_MAIL.
You try to translate some special exit states to the exit states FAILURE and SUCCESS, but FAILURE is not a member of the exit state profile used by the parent (EXIT-PROFILE).

So yes, there is room for improvement regarding the error message, but the fact that the system refuses to create the objects is perfectly OK.

I added a profile called STANDARD_FF (that FF stands for FAILURE FINAL):

create or alter exit state profile 'STANDARD_FF'
with
		default mapping = 'UNIX',
		states = (
			'FAILURE' final,
			'SUCCESS' final batch default dependency default,
			'SKIPPED' final unreachable
		);

and changed the profile in the definition of the parent:

create or alter job definition SYSTEM.TEST.TEST_SEND_MAIL
with
   profile = 'STANDARD_FF',
   environment='SERVER@LOCALHOST',
   type = job,
   errlog = 'logs/${JOBID}.log' NOTRUNC,
   logfile = 'logs/${JOBID}.log' NOTRUNC,
   master,
   run program = '0',
   parameters = ( 'EXITCODE' ),
   group = public;

And with these changes I was able to create the test case.
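
The consistency rule behind the error can be sketched in Python. This is a model of the check as described in this thread, not the server's actual implementation, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List


def missing_translated_states(translation: Dict[str, str],
                              parent_profile: Iterable[str]) -> List[str]:
    """Return the translation targets the parent's profile does not contain.

    An empty list means the child can be attached with this translation.
    """
    return sorted(set(translation.values()) - set(parent_profile))


# Targets of EXIT-TRANSLATION: every error state becomes FAILURE.
exit_translation = {
    "UNEXPECTED_ERROR": "FAILURE", "MISSING_PARAMETER": "FAILURE",
    "INVALID_PARAMETER": "FAILURE", "PARSING_ERROR": "FAILURE",
    "UNKNOWN_PARAMETER": "FAILURE", "CALL_FAILED": "FAILURE",
    "ABORTED": "FAILURE", "NO_RESPONSE": "FAILURE",
    "BILLING_FAILED": "FAILURE", "SUCCESS": "SUCCESS",
}

# States of the two candidate parent profiles.
exit_profile = set(exit_translation)            # EXIT-PROFILE has no FAILURE
standard_ff = {"FAILURE", "SUCCESS", "SKIPPED"}

print(missing_translated_states(exit_translation, exit_profile))
print(missing_translated_states(exit_translation, standard_ff))
```

With EXIT-PROFILE on the parent the check reports FAILURE as missing; with STANDARD_FF nothing is missing, which is why the definitions could then be created.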

HTH

Regards,

Ronald



wim.ve...@billinghouse.nl

Sep 27, 2018, 8:42:44 AM
to schedulix
Hi, 

I found the issue.

The problem seems to be that the exit translation from the child to the parent requires the parent to have a profile that matches the result of the exit translation.
So I had to add a new mapping (SIMPLE-EXIT-MAPPING) and a new profile (SIMPLE-PROFILE) which must be used by the parent.

Now when I run the job, I see the expected end states at the different levels and the mail is being sent.

The corrected part of the script is:
/* stmt 14 */
create or alter exit state mapping 'SIMPLE-EXIT-MAPPING'
with map = (
   'FAILURE',
0, 'SUCCESS',
1, 'FAILURE'
);

/* stmt 15 */
create or alter exit state profile 'SIMPLE-PROFILE'
with 
default mapping = 'SIMPLE-EXIT-MAPPING',
states = (
'FAILURE' final batch default,
'SUCCESS' final dependency default
);
/* Create test folder */
/* stmt 16 */
create or alter folder SYSTEM.TEST;

/* Create child that will fail when executed */
/* stmt 17 */
create or alter job definition SYSTEM.TEST.FAIL_JOB
with
   profile = 'EXIT-PROFILE',
   environment='SCHEDULIX_LOCALHOST_1@SCHEDULIX-EU-WEST-1A-1',
   type = job,
   errlog = 'logs/${JOBID}.log' NOTRUNC,
   logfile = 'logs/${JOBID}.log' NOTRUNC,
   master,
   run program = './runjob.sh DELAY -duration 2 -units ms -exitcode "$EXITCODE"',
   group = public, 
   parameters = ( 'EXITCODE' );

/* Create sendmail job that will be triggered */
/* stmt 18 */
create or alter job definition SYSTEM.TEST.SENDMAIL
with
   profile = 'EXIT-PROFILE',
   environment='SCHEDULIX_LOCALHOST_1@SCHEDULIX-EU-WEST-1A-1',
   type = job,
   errlog = 'logs/${JOBID}.log' NOTRUNC,
   logfile = 'logs/${JOBID}.log' NOTRUNC,
   master,
   run program = './sendmail.sh "$TRIGGERBASE" "$TRIGGERNEWSTATE"',
   group = public;

   

/* Create top level job */
/* stmt 19 */
create or alter job definition SYSTEM.TEST.TEST_SEND_MAIL
with 
   profile = 'SIMPLE-PROFILE',
   environment='SCHEDULIX_LOCALHOST_1@SCHEDULIX-EU-WEST-1A-1',
   type = job,
   errlog = 'logs/${JOBID}.log' NOTRUNC,
   logfile = 'logs/${JOBID}.log' NOTRUNC,
   master,
   run program = '0',
   parameters = ( 'EXITCODE' ),
   group = public;

/* Define parameter to be correct */
/* stmt 20 */
alter job definition SYSTEM.TEST.TEST_SEND_MAIL
alter 
parameter = (
'EXITCODE' PARAMETER default = 'NO_RESPONSE'
); 

/* Add the child */
/* stmt 21 */
alter job definition SYSTEM.TEST.TEST_SEND_MAIL
add or alter children = (
SYSTEM.TEST.FAIL_JOB
static
enable
CHILDSUSPEND
translation='EXIT-TRANSLATION'
ignore dependency = none
);

/* Add the trigger */
/* stmt 22 */

Ronald Jeninga

Sep 27, 2018, 9:10:13 AM
to schedulix
Hi Wim,

so we both have found the cause, and you'll have to admit there is some logic behind it.

Glad it works now.

Best regards,

Ronald