Assigning jobs to inactive server


Artur Mironchik

Aug 12, 2022, 3:56:56 AM
to schedulix
Hello,

We are using 4 jobservers, and the fourth is needed only during peak hours, so to save money we decided to turn it on and off automatically according to a schedule.

When the GCP schedule shuts down the jobserver (or when this is done manually via a script), the status of this jobserver in schedulix (under Properties) changes to Unconnected, but it remains Enabled and Registered.

Schedulix still assigns jobs to this jobserver even though it is down.
These jobs do not execute and end up with an error.

What can be done here? Is there a way to tell schedulix not to assign jobs to an inactive server? Or should schedulix automatically reassign jobs to another server in this case?

Dieter Stubler

Aug 12, 2022, 5:07:07 AM
to schedulix

Hello Artur,

This problem can be solved by deregistering the jobserver after shutdown with the sdmsh command:

deregister <jobserver name>

Example of shutting down and deregistering a jobserver using sdmsh or the web GUI shell:

shutdown GLOBAL.EXAMPLES.LOCALHOST.SERVER;
deregister GLOBAL.EXAMPLES.LOCALHOST.SERVER;

When a jobserver is started and connects to schedulix, it becomes registered automatically, telling the system that it is ready to process jobs.

A jobserver that has been shut down, or that is merely disconnected for a while (an offline jobserver), can still have jobs assigned to it.
Think about jobs that can be processed by only one jobserver. If such jobs could no longer be assigned to the disconnected jobserver, they would all go into ERROR because of a resource shortage.
This is not what we want when a jobserver is shut down only temporarily for maintenance, such as changing environment variables.
Deregistering a jobserver tells schedulix not to assign jobs to it any longer, which may put jobs into ERROR because of a resource shortage if it is the only jobserver able to run them.
Note that schedulix also puts all jobs still active on the deregistered jobserver into the BROKEN_FINISHED state, so that they can be rerun on another available jobserver after restart.

Does this answer your question?

Regards
Dieter

Artur Mironchik

Aug 12, 2022, 6:34:24 AM
to schedulix

Hello, Dieter

Can this deregistering be automated?

Dieter Stubler

Aug 12, 2022, 7:14:48 AM
to schedulix
Hello Artur,

I do not understand what you mean by 'automated'.
Since merely stopping or shutting down a jobserver MUST NOT deregister it automatically, you have to do it explicitly using the web GUI or the command API (sdmsh).
The script shutting down the jobserver can use a command like:

echo "shutdown <jobserver name>; deregister <jobserver name>;" | sdmsh ...

to shut down the jobserver and deregister it.
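
If the same shutdown script has to serve more than one jobserver, the two statements can be generated from the jobserver name. This is only a sketch: `stop_statements` is a hypothetical helper name, and the sdmsh connection options are left out just as in the command above.

```shell
# Hypothetical helper: print the shutdown and deregister statements
# for the jobserver whose name is passed as the first argument.
stop_statements() {
    printf 'shutdown %s; deregister %s;\n' "$1" "$1"
}

# Usage in the shutdown script (add your sdmsh connection options):
#   stop_statements GLOBAL.EXAMPLES.LOCALHOST.SERVER | sdmsh
```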

Regards
Dieter

Artur Mironchik

Aug 12, 2022, 7:40:17 AM
to schedulix
Got it, thank you

Artur Mironchik

Aug 16, 2022, 8:31:06 AM
to schedulix
Hello,

Another question: what would be the sdmsh command to check that no jobs are assigned to this particular jobserver before shutting it down and deregistering it?

Dieter Stubler

Aug 16, 2022, 9:52:11 AM
to schedulix
Hi Artur,

The following statement lists all active jobs for a jobserver:

list job with jobserver in (<jobserver name>) and job status in (STARTING, STARTED, RUNNING, TO_KILL, KILLED);

Jobs with state FINISHED (typically FAILED jobs) will not be listed because a rerun might be able to run them on another jobserver.
A typical shutdown for your server would look like this example sequence:

suspend GLOBAL.'EXAMPLES'.'LOCALHOST'.'SERVER';

Now the jobserver will not start any new jobs but can still handle the completion of jobs that were already active when it was suspended.

Repeat the statement:

list job with jobserver in (GLOBAL.EXAMPLES.LOCALHOST.SERVER) and job status in (STARTING, STARTED, RUNNING, TO_KILL, KILLED);

until it returns 0 rows (sleeping a few seconds between executions helps to avoid unnecessary load ;-)). Then:

shutdown GLOBAL.EXAMPLES.LOCALHOST.SERVER;
deregister GLOBAL.EXAMPLES.LOCALHOST.SERVER;


When the jobserver is started again, it will register automatically and start to process jobs.
A resume is not required because deregister clears the suspend on the jobserver.

Hope this helps.

Regards
Dieter

Artur Mironchik

Aug 17, 2022, 4:43:53 AM
to schedulix
Hello, Dieter

Is there a chain of commands that, after suspending, automatically waits for the list job command to return 0 rows and then does shutdown + deregister?
Something like:
while( list job ... > 0 )
   {sleep}
shutdown
deregister

Also, is this check even needed? The documentation says about the deregister command: "...Finally, a complete reschedule is executed so that jobs are redistributed among the jobservers"
Does this mean that all jobs (STARTING, STARTED, RUNNING, ...) will be assigned to another jobserver with free resources and handled there?

Dieter Stubler

Aug 17, 2022, 5:02:20 AM
to schedulix
Hi Artur,

On your first question:

No, there is no command in schedulix that waits for something to happen before returning to the client that issued it.
You have to script that as described in my previous answer.
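
A minimal sketch of such a script, under some assumptions: sdmsh reads statements from standard input, connection settings come from a .sdmsrc file, and the list statement's feedback reads '0 Object(s) found' once nothing is active (wording as quoted later in this thread). `drain_and_deregister`, `SDMSH`, `POLL_SECONDS` and `MAX_POLLS` are names invented for this example.

```shell
# Assumed defaults; override them in the environment if needed.
SDMSH=${SDMSH:-sdmsh}              # add connection options here if there is no .sdmsrc
POLL_SECONDS=${POLL_SECONDS:-10}   # sleep between two list statements
MAX_POLLS=${MAX_POLLS:-360}        # give up after this many polls

# Suspend the jobserver, wait until no job is active on it any more,
# then shut it down and deregister it.
drain_and_deregister() {
    server=$1
    echo "suspend $server;" | $SDMSH || return 1
    tries=0
    # Poll until the feedback of the list statement reports zero objects.
    until echo "list job with jobserver in ($server) and job status in (STARTING, STARTED, RUNNING, TO_KILL, KILLED);" \
            | $SDMSH | grep -Eq '(^|[^0-9])0 Object\(s\) found'
    do
        tries=$((tries + 1))
        # Giving up leaves the jobserver suspended but still registered.
        [ "$tries" -ge "$MAX_POLLS" ] && return 1
        sleep "$POLL_SECONDS"
    done
    echo "shutdown $server; deregister $server;" | $SDMSH
}

# Usage:
#   drain_and_deregister GLOBAL.EXAMPLES.LOCALHOST.SERVER
```

The poll limit is there only so that a connection problem does not make the script loop forever; what to do when it is reached is a site decision.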

Second question:
As already stated, deregister will put all jobs active on the deregistered jobserver into the BROKEN_FINISHED state.
After a rerun, the jobs will use other available jobservers if possible.
Note that this will not work if those jobs have allocated a resource on the deregistered jobserver in mode KEEP or KEEP_FINAL, because then reruns will be bound to the same jobserver.
You can use the auto restart feature to restart those jobs automatically if the exit state profile has a state selected as the broken state.

Regards
Dieter

Artur Mironchik

Aug 17, 2022, 8:43:46 AM
to schedulix
Got it, 

First question:

You wrote:
"Repeat the statement :
list job ...
until it returns 0 rows ..."

but the list job ... statement returns a long string with various information about the jobs (including the number of jobs).
Is there a way to configure this command to return just the plain number of active jobs without additional text, or should I extract this number from the output?

Second question:
If all jobs are configured with IMMEDIATE restart, then deregister will put all jobs active on the deregistered jobserver into the BROKEN_FINISHED state, and all jobs that were active on this jobserver will be immediately redistributed and restarted. Is that right?

Dieter Stubler

Aug 17, 2022, 9:20:13 AM
to schedulix
Hi Artur,

You wrote:
but the list job ... statement returns a long string with various information about the jobs (including the number of jobs).
Is there a way to configure this command to return just the plain number of active jobs without additional text, or should I extract this number from the output?

No, there is no command that returns just the number of jobs found. You have to count the rows returned or look at the FEEDBACK of the list statement ('0 Object(s) found').
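
Pulling the number out of the feedback can be done with a small filter. This is a sketch that assumes the feedback line has the form 'N Object(s) found' as quoted above; `feedback_count` is a hypothetical name, and the pattern may need adjusting to what your sdmsh version actually prints.

```shell
# Read sdmsh output on stdin and print N from the first line of the
# form "N Object(s) found". Prints nothing if no such line is present.
feedback_count() {
    sed -n 's/^[^0-9]*\([0-9][0-9]*\) Object(s) found.*$/\1/p' | head -n 1
}

# Usage (connection options omitted):
#   echo "list job with jobserver in (...) and job status in (...);" | sdmsh | feedback_count
```

An empty result is best treated as "unknown" rather than as zero, since it can also mean that sdmsh failed to connect.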

You wrote:
If all jobs are configured with IMMEDIATE restart, then deregister will put all jobs active on the deregistered jobserver into the BROKEN_FINISHED state, and all jobs that were active on this jobserver will be immediately redistributed and restarted. Is that right?

Correct

Regards
Dieter 

Artur Mironchik

Aug 17, 2022, 9:49:52 AM
to schedulix
Hello, Dieter

Is there an option to set all existing jobs to IMMEDIATE restart in one click, or does it have to be done manually for every single job?

Dieter Stubler

Aug 17, 2022, 11:23:58 AM
to schedulix
Hi Artur,

There is no single-click option to do that.

But you can do the following (Linux):

echo "SELECT ID FROM SCI_C_SCHEDULING_ENTITY WHERE TYPE = 'JOB' WITH ID FOLDER QUOTED;" | 
sdmsh | 
grep '^SYSTEM' | 
awk '{ print "create trigger sys_restart on job definition " $1 " with rerun, immediate local, submitcount = 0;"; }' > all_jobs_set_rerun.sdms

This will output a script that changes all jobs to rerun immediately.
Review/edit the script all_jobs_set_rerun.sdms and run it with:

sdmsh  < all_jobs_set_rerun.sdms

This will do the job without a lot of clicking.
(If you didn't set up your .sdmsrc file, you will have to provide the login credential parameters --host --port --user --password to sdmsh.)

This method is very useful in many cases where a lot of changes/operations have to be made without a lot of manual work.
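
To check the generated statements before touching the server, the awk stage can be tried on its own. The sketch below wraps that stage in a function; `rerun_statements` is a hypothetical name, and the sample path in the usage line is made up for illustration rather than taken from real sdmsh output.

```shell
# Read one job definition path per line on stdin and print one
# "create trigger" statement per non-empty line (same awk program as above,
# plus an NF guard so blank lines produce no statement).
rerun_statements() {
    awk 'NF { print "create trigger sys_restart on job definition " $1 " with rerun, immediate local, submitcount = 0;" }'
}

# Dry run with a made-up path:
#   echo "SYSTEM.'EXAMPLES'.'MY_JOB'" | rerun_statements
```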

Hope this helps.

Regards
Dieter