Schedulix 2.9 job server has been stopped without any clue.

54 views
Skip to first unread message

Vorrarit Luengwattanakij

unread,
Oct 27, 2023, 2:28:26 AM10/27/23
to schedulix
Hi,

I'm running schedulix server 2.9 with postgresql and also has another running multiple jobservers.   Every couple of days jobservers were stopped without any information.  I'd added TraceLevel 3 in both jobserver.conf and web ui but no trace or debug information shown in the log files.

Screenshot 2566-10-27 at 13.18.01.png

Screenshot 2566-10-27 at 13.20.00.png

I see that jobserver.out files were generated multiple times in a few minute.  It seems like the jobserver is trying to restart it self or something.

Screenshot 2566-10-27 at 13.25.37.png

Anybody found the same issues like this? Or how can I get more detail log to troubleshoot the problem?


Best Regards,

Vorrarit

Ronald Jeninga

unread,
Oct 27, 2023, 4:43:51 AM10/27/23
to schedulix
Hi Vorrant,

thank you for asking. And I'm sure we'll find the cause of the problem.
It might take a while to investigate this though. I didn't see this pattern yet, I think.

Anyway, what I understand is that your jobservers run for a while without issues and then stop for no apparent reason.
Furthermore you see that a several jobserver.out files are created within a few minutes time.
Did I rephrase your issue correctly?

You are right, this is not a normal behaviour and we'll have to find out what is going on.
If you trust me, you can send me information by e-mail to ronald dot jeninga at independit dot de.
At this stage I'm interested in the scheduling server's log file (BICserver.out.somenumber) as well as a jobserver log file (any of them, as long as the time it is written is covered by the scheduling server's log).
I might need some more information, but I'll find that out after analyzing the log files.
Feel free to obfuscate sensitive information. I'll tell you if you've hidden too much.

And if I my ask, in what area of the world are you located? I'm just curious and you don't need to answer my question.

Best regards,

Ronald

Vorrarit Luengwattanakij

unread,
Oct 30, 2023, 2:57:20 AM10/30/23
to schedulix
Hi Ronald,

Thank you for your kindly response.  I've attached here the log BICserver log file during period that my jobservers were terminated (approximately 09:55 - 10:00 GMT).

I now start only 1 job servers and stop the others, so that I can send you a more readable log.  I'll update you soon, once the job server is terminated.
One thing I might not mentioned clearly in the first email is that,  the multiple jobservers, I mentioned earlier, were running on the same VM.

By the way, I'm from Thailand.

Best Regards,

BICserver.out.6.zip

Ronald Jeninga

unread,
Oct 30, 2023, 4:53:49 AM10/30/23
to schedulix
Hi Vorrarit,

thank you for the log file!
It is indeed very weird. I've looked at the log messages of session 2102:

ronald@oncilla:~/Downloads$ grep ,2102 BICserver.out.6
MESSAGE [7092,2102(Worker0)] 29 Oct 2023 09:13:07 GMT Server Execution time for class de.independit.scheduler.server.parser.Connect : 0 ms -- Start Committing
MESSAGE [7092,2102(Worker0)] 29 Oct 2023 09:13:07 GMT Execution time for class de.independit.scheduler.server.parser.Connect : 0 ms
MESSAGE [7092,2102(2102)] 29 Oct 2023 09:13:07 GMT register with pid = '12023';
MESSAGE [7092,2102(Worker0)] 29 Oct 2023 09:13:07 GMT Server Execution time for class de.independit.scheduler.server.parser.RegisterServer : 1 ms -- Start Committing
MESSAGE [7092,2102(Worker0)] 29 Oct 2023 09:13:07 GMT Execution time for class de.independit.scheduler.server.parser.RegisterServer : 1 ms
MESSAGE [7092,2102(2102)] 29 Oct 2023 09:13:07 GMT get next job;
MESSAGE [7092,2102(Worker0)] 29 Oct 2023 09:13:07 GMT Server Execution time for class de.independit.scheduler.server.parser.GetNextJob : 0 ms -- Start Committing
MESSAGE [7092,2102(Worker0)] 29 Oct 2023 09:13:07 GMT Execution time for class de.independit.scheduler.server.parser.GetNextJob : 0 ms
...
MESSAGE [7092,2102(Worker0)] 29 Oct 2023 09:14:32 GMT Server Execution time for class de.independit.scheduler.server.parser.GetNextJob : 0 ms -- Start Committing
MESSAGE [7092,2102(Worker0)] 29 Oct 2023 09:14:32 GMT Execution time for class de.independit.scheduler.server.parser.GetNextJob : 1 ms
MESSAGE [7092,2102(2102)] 29 Oct 2023 09:14:37 GMT UserConnection terminated

It runs for about 1.5 minutes and gets some work done (that is, it didn't execute jobs, it just asked the server for jobs every 5 seconds) and then, when the next "Get Next Job" is expected, it silently terminates the connection.
I definitely need to see this from the perspective of the jobserver itself. But any other than trace level 3 won't output enough information.
Please set the trace level of the jobserver to 3. You can do that on the Jobserver's config tab in the GUI. It will pickup the new value automatically.
If you also set the trace level to 3 in the Jobserver's config file, we'll also see any messages from the startup phase.
(If you don't set the trace level to 3 in the GUI, the jobserver will reset its trace level from the config file to the trace level as set in the GUI. Hence at least within the GUI, ideally both in the GUI and the config file).

It is possible that the problem comes from the outside. E.g. if your computer is short on memory, the Linux OOM Killer could decide to kill that "huge" Java process.

Best regards,

Ronald

Vorrarit Luengwattanakij

unread,
Oct 31, 2023, 4:22:16 AM10/31/23
to schedulix
Hi Ronald,

I've attached BICserver.out and jobserver.out with TraceLevel=3.  From jobserver.out, you can see that jobserver was terminated around 13:44 GMT+7.
Date in BICserver.out is GMT.  So it must be approximately 06:44 GMT.

The VM that runs jobserver has 2 vcpu and 8GB memory.  Since, we are in development cycle, so the data size is quite small, I don't think we have OOM issue.

schedulix-20231031.tar.gz

Ronald Jeninga

unread,
Oct 31, 2023, 11:34:08 AM10/31/23
to schedulix
Hi Vorrarit,

well that's really mystifying....
The jobserver starts anew about every 92 seconds and writes log files that are _all_ 31,553 bytes long.

This sounds a little as if the jobserver is installed as a service that needs to be restarted automatically and at the same time it has a limit set on the maximum file size.
When scrolllog attempts to write the next chunk of log output it gets an (unexpected) error because of file size exceeded.
The scrolllog program is pretty robust, but at the same time it is also fail-fast. If something happens it doesn't expect, it terminates.

Could you please verify if limits are set (ulimit -a)?
I'd expect a maximum file size of at most 32k, or 8, 16 or 32 blocks (maybe even 64 blocks of 512 bytes), if my hypothesis applies.
Of course you must do this as the user that runs the jobserver, maybe even within the jobserver start script (and save the output somewhere (ulimit -a > /tmp/js_ulimit.$$.log)).

If my hypothesis proves wrong, we'll have to try and find the real jobserver-murderer that haunts your place.
A real Halloween challenge :-P

Best regards,

Ronald

Vorrarit Luengwattanakij

unread,
Oct 31, 2023, 11:23:11 PM10/31/23
to schedulix
Hi Ronald,

This is the result of ulimit -a for the user that run jobserver.

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31200
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I've systemd unit file that start jobserver and found some unusual behavior after checking with journalctl -xe

Nov 01 06:19:26 batch-server systemd[1]: jobs...@uatddeadm.service stop-sigterm timed out. Killing.
Nov 01 06:19:26 batch-server systemd[1]: Unit jobs...@uatddeadm.service entered failed state.
Nov 01 06:19:26 batch-server systemd[1]: jobs...@uatddeadm.service failed.
Nov 01 06:19:27 batch-server systemd[1]: jobs...@uatddeadm.service holdoff time over, scheduling restart.
Nov 01 06:19:27 batch-server systemd[1]: Stopped JobServer service uatddeadm.
-- Subject: Unit jobs...@uatddeadm.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit jobs...@uatddeadm.service has finished shutting down.
Nov 01 06:19:27 batch-server systemd[1]: Started JobServer service uatddeadm.
-- Subject: Unit jobs...@uatddeadm.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit jobs...@uatddeadm.service has finished starting up.
--
-- The start-up result is done.
Nov 01 06:19:28 batch-server systemd[1]: jobs...@uatddeadm.service holdoff time over, scheduling restart.
Nov 01 06:19:28 batch-server systemd[1]: Stopped JobServer service uatddeadm.
-- Subject: Unit jobs...@uatddeadm.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit jobs...@uatddeadm.service has finished shutting down.
Nov 01 06:19:28 batch-server systemd[1]: Started JobServer service uatddeadm.
-- Subject: Unit jobs...@uatddeadm.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit jobs...@uatddeadm.service has finished starting up.
--
-- The start-up result is done.
-- Subject: Unit session-1644.scope has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit session-1644.scope has finished starting up.
--
-- The start-up result is done.
Nov 01 06:20:58 batch-server systemd[1]: jobs...@uatddeadm.service stop-sigterm timed out. Killing.
Nov 01 06:20:58 batch-server systemd[1]: Unit jobs...@uatddeadm.service entered failed state.
Nov 01 06:20:58 batch-server systemd[1]: jobs...@uatddeadm.service failed.
Nov 01 06:21:00 batch-server systemd[1]: jobs...@uatddeadm.service holdoff time over, scheduling restart.


It seems like systemd tried to stop jobserver.  Below is my systemd unit file.

[Unit]
Description=JobServer service %i
After=network.target
StartLimitIntervalSec=0

[Service]
Environment=BICSUITEHOME=/opt/schedulix/schedulix-2.9
Type=simple
Restart=always
RestartSec=1
User=%i
ExecStart=/usr/bin/bash /opt/schedulix/schedulix-2.9/bin/jobserver-run /app/config/%i/jobserver.conf "/app/logs/%i/jobserver-%i.out"

[Install]
WantedBy=multi-user.target



Best Regards,

Vorrarit L.

Ronald Jeninga

unread,
Nov 1, 2023, 2:22:25 AM11/1/23
to schedulix
Hi Vorrarit,

Dieter found the following post:


which shows that systemd is very well capable of killing services instead of keeping them running.
My suggestion, based on the above post, is to change a few things in the systemd configuration:

Type=idle          # instead of Type=simple
Restart=on-failure # instead of Restart=always

And maybe to relax the timing (StartLimitIntervalSec and RestartSec).

Note that the jobserver-run script is a simple wrapper thatstarts the scrolllog process.
scrolllog does a fork() twice to become a daemon and then spawns the java process that runs the Jobserver code.
This might not be detected correctly by systemd, hence its behavior.
I'm not a specialist on systemd though and I can be wrong.

In any case it will make sense to experiment with the systemd unit file a little.
And if you find a specification that works flawlessly, feel free to post it here (and the same is true for a unit description of the schedulix server).
Others might benefit from your efforts. And at the same time it will give me a starting point to add the unit files as a kind of suggestion to the schedulix source tree or even integrate them into the rpm packages. 

Best regards,

Ronald

Vorrarit Luengwattanakij

unread,
Nov 1, 2023, 2:49:49 AM11/1/23
to schedulix
Thank you very much.  I now set Type and Restart option as recommend and will update here after getting further information.

Best Regards,

Vorrarit L.

Vorrarit Luengwattanakij

unread,
Nov 1, 2023, 10:33:49 PM11/1/23
to schedulix
Hi Ronald,

It seems the result is even worse.  Jobserver seems to be terminated after a few hour.

So now I tried to run the jobserver with nohup.

How do you normally configure jobserver to start automatically (as user) after VM start.  Right now, I'm thinking of cornjob, but you might have a cleaner way.


Best Regards,

Ronald Jeninga

unread,
Nov 2, 2023, 6:33:19 AM11/2/23
to schedulix
Hi Vorrarit,

actually you don't need to use nohup.
If you run the jobserver-run script with 2 parameters, the config file and a basename of a logfile, the jobserver will be started by the utility scrolllog (which is part of our system).
The scrolllog utility has multiple tasks:
1. daemonize -- this is because Java, as portable programming language, can't do it itself
2. Write logging output into files of limited size. Delete the oldest file if a configurable number of files has been written
    (I've seen systems that stalled because of a file system full condition that was in fact caused by logging; that can't happen with the schedulix servers).
3. If the process started by scrolllog exits with an exit code unequal zero, start it again; a kind of watchdog functionality.

Being old-school (I'm working with Linux more than 25 years now, and before I started with Linux I already had several years of experience with other Unix systems), I'm still a fan of init.d logic.
That's easy to understand, pretty portable and it does what it should do. The only thing it doesn't do is to restart services after a failure.
But services shouldn't fail, unless something severe is the case wich would have to be resolved by an admin anyway. Restarting a service after resolving the issue isn't that much work; it's a one-liner.
Of course, this is my very personal opinion and there will be many others that see it differently.

Anyway, in the bin directory there are a few scripts that can be used as a starting point for init.d scripts.
Those scripts are written for users that install the system via rpm and have used the setup_jobserver script to create jobservers.
You can have a look at those scripts to see what they do. The scripts aren't bullet-proof though and certainly not compatible with every thinkable setup.
But as a basic guideline or idea they are certainly useful.

In your case, I would start with starting the jobservers by hand.
If that works (and I don't doubt that the least), you could write an init.d script and use that to start and stop the jobservers manually.
As a next step you can tell your systemd that there are a few init.d script you'd like to have executed at system startup. (e.g. systemctl enable schedulix-client).
systemd will scan the init.d directory, at least it used to do so, and find the script. It knows how dinosaurs used to run Unix systems and will mimic that.

If all of that works, you can try to find a more native (or modern) way to let systemd start, guard and stop your jobservers.
And though I'm very old-fashioned and very much a fan of simple solutions for simple problems, I'm also very interested in keeping schedulix modern and up-to-date.
Hence if you find a unit-definition that does exactly the right thing, I'm highly interested in the code/configuration you've found.

Best regards,

Ronald
Reply all
Reply to author
Forward
0 new messages