Jobserver monitoring

Nov 22, 2022, 3:25:14 AM
to schedulix
Hello, I am trying to send a notification when the jobserver goes down.
At the moment I have a script scheduled via cron that checks for the schedulix jobserver process on the server where the client is installed; if the process is not there, I send a notification. I find this way of monitoring "not beautiful", and if the process is running but the jobserver is unable to connect, I get no notification. Do you know of a nicer way to monitor the jobserver status?

Ronald Jeninga

Nov 22, 2022, 10:46:50 AM
to schedulix
Hi Patrik,

I agree, it doesn't sound "beautiful". Monitoring a healthy system is not the problem: a single script on the scheduling server that looks at the connections and uses the Jobserver's http thread to check whether it is alive would do.
That will work most of the time, until problems arise.

There is this problem of having two perspectives: the perspective of the scheduling server and the perspective of the Jobserver.
If only one of them is checked, you only have half the information (e.g. Jobserver running but not connected).
And then there is the perspective of the system administrator, who might have shut down the Jobserver deliberately.

Hence you need a system that checks whether the scheduling server is up and running, checks whether the Jobserver is up, running and connected, and checks whether the system administrator is busy maintaining the system.

The scheduling server can be checked by connecting to it. And while connected, you can ask the server for its perception of the state of a certain Jobserver (or set of Jobservers).
If you get an answer, you know the scheduling server is healthy. And if the scheduling server thinks that the Jobserver is connected, you can be pretty sure that there is a valid connection between the two.
If it tells you the Jobserver is not connected, it most certainly isn't.
All of this can be checked either from the machine running the scheduling server or from the machine running the Jobserver.

If the Jobserver is connected, it must be alive and you don't have an issue. You'll have to check this twice with a few seconds between the two observations though.
It's hard to see something move in a single picture. If the Jobserver is still connected _and_ still has the same session id, you are on the safe side.
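
A minimal sketch of the "check twice" idea in Python. How you actually ask the scheduling server is left open here: `probe` is a hypothetical callable you would have to write yourself (it is not a schedulix API), and `Observation`, `is_stably_connected` and the 5 second default are names and values made up for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Observation:
    """One answer from the scheduling server about a Jobserver (hypothetical shape)."""
    connected: bool
    session_id: Optional[str]


def is_stably_connected(
    probe: Callable[[], Observation],
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if two observations agree on a connected session.

    `probe` is an assumption: any function that asks the scheduling
    server for the Jobserver's state and returns an Observation.
    """
    first = probe()
    if not first.connected:
        return False
    sleep(delay_seconds)  # a few seconds between the two pictures
    second = probe()
    # A changed session id means the Jobserver reconnected in between,
    # which is what a restart loop looks like from the outside.
    return second.connected and second.session_id == first.session_id
```

A Jobserver caught in a restart loop may well be connected at the moment of a single check; comparing the session id of two observations is what exposes it.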

If the Jobserver is not connected, it might be down, in a restart loop (with some fatal error each time just after the process started), or shut down by the administrator.
The restart loop is mostly caused by some mistake in the configuration (if you don't believe me, try to define the jobexecutor with a relative path and see what happens).
This can happen while installing a Jobserver but also while installing or re-configuring another Jobserver by making a mistake higher up in the hierarchy.
And that's why you have to check the connection twice if you find one.

If it's just down, you might want to automatically restart it.
But that means you'll have to be able to distinguish between "just down" and "deliberately down".
If you extend the start and stop scripts, you can write a file when the Jobserver is shut down, and delete it when it is started again.
An administrator will use the stop script to halt the server, and a file is created. As soon as the administrator has finished his work, he'll start the Jobserver again and the file is removed.
Hence, if the Jobserver is down and there is no flag file, you can restart the Jobserver, else you do nothing.
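
The flag file logic could look roughly like this in Python. The function names and the flag file location are assumptions for illustration; the first two functions stand in for what the extended stop and start scripts would do, the third for the check in the monitoring script.

```python
from pathlib import Path


def mark_deliberate_stop(flag_file: Path) -> None:
    """Called from the (extended) stop script: leave a marker behind."""
    flag_file.touch()


def clear_deliberate_stop(flag_file: Path) -> None:
    """Called from the (extended) start script: remove the marker."""
    flag_file.unlink(missing_ok=True)  # requires Python 3.8+


def should_restart(jobserver_running: bool, flag_file: Path) -> bool:
    """Restart only if the Jobserver is down and nobody stopped it on purpose."""
    return (not jobserver_running) and not flag_file.exists()
```

For example, with a hypothetical path, `should_restart(False, Path("/var/run/jobserver.stopped"))` is true only as long as that file does not exist.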

If the scheduling server isn't running, you can't blame the Jobservers for not being connected.
And again this situation can be part of an administrator's action (upgrade, ...), or it can be a mistake.
Detecting the scheduling server's process isn't enough. It could be hanging, it could be working hard, it could be starting up.
I've seen it all: the hanging server was a Java issue; working hard is most likely done by the scheduling thread, but the effect is that the server seems unresponsive; and starting up can take several minutes.
In the first case, a restart is the only escape. In the second case, a manual decision is required. And in the last case, one can only wait.

The bottom line is that you'll need scripts on both ends.
The scripts should be administrator friendly and detect if the server has been shut down deliberately or not.
If the scheduling server is down or unresponsive, don't act on the Jobserver's side. The Jobservers can't change it.
If the scheduling server is starting up, waiting for it is best; from the Jobserver's side it will look as if the server is down, though.
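
Put together, the decision on the Jobserver's side might be sketched like this in Python. Everything here is illustrative: `ServerState`, `Action` and `decide` are made-up names, and how you determine the three inputs is up to your own scripts. Note that telling "starting" apart from "down" is only possible if the script has some knowledge from the scheduling server's machine.

```python
from enum import Enum


class ServerState(Enum):
    HEALTHY = "healthy"
    STARTING = "starting"
    DOWN_OR_UNRESPONSIVE = "down_or_unresponsive"


class Action(Enum):
    NONE = "none"
    WAIT = "wait"
    RESTART_JOBSERVER = "restart_jobserver"


def decide(
    server_state: ServerState,
    jobserver_connected: bool,
    deliberate_stop: bool,
) -> Action:
    """Decide what the Jobserver-side script should do.

    jobserver_connected should come from a repeated check (same
    session id twice); deliberate_stop from the flag file.
    """
    if server_state is ServerState.STARTING:
        return Action.WAIT  # startup can take several minutes
    if server_state is ServerState.DOWN_OR_UNRESPONSIVE:
        return Action.NONE  # the Jobserver can't change this
    if jobserver_connected:
        return Action.NONE  # healthy on both ends
    if deliberate_stop:
        return Action.NONE  # the administrator is at work
    return Action.RESTART_JOBSERVER
```

If the restart doesn't lead to a stable connection, you are probably looking at the restart loop, and restarting again won't help; that case needs a human.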

I guess the short answer to your question is: no, there's no slim and elegant way of monitoring the components.
(But I hope my considerations above will help you to build some waterproof scripts).

Best regards,
