schedulix server cannot start

103 views
Skip to first unread message

Zhiming Su

unread,
Sep 18, 2017, 8:14:19 AM9/18/17
to schedulix
Hi,
      Today, when I backed up the server, I ran into an error. When I backed up mysql, file, and some configuration files, started server-start error and looked at the attached log in detail,
I can't start the service. I've decided that the service process information in the MySQL can not be started, but I don't have the exact information, and then I'll restore the last MySQL file after server-start is OK
      Please help me look at the log. What's the problem?
Thank you very much
log.txt

Ronald Jeninga

unread,
Sep 18, 2017, 8:49:43 AM9/18/17
to schedulix
Hi Zhiming,

I think the key message is

FATAL   [Thread-1] 18 Sep 2017 10:56:14 GMT java.lang.OutOfMemoryError: Java heap space

So please change the memory configuration and grant some more memory :-)
You find this in 

$BICSUITECONFIG/etc/java.comf

(probably /opt/schedulix/etc/java.conf; depending on where you installed the system).
There you'll find an entry like

 BICSUITEMEM=300

which is the memory size in MB. Maybe 1024 is a good starting point.

Regards,

Ronald

Zhiming Su

unread,
Sep 18, 2017, 10:29:40 PM9/18/17
to schedulix
Hi Ronald,
 
Thank you very much for your answer,

After I changed my memory, server started successfully. It was probably too much data in the mysql, and Java didn't load enough data memory.

Besides, I have another question, I want to change the time zone. I see the time zone in the log is GMT(FATAL   [Thread-1] 18 Sep 2017 10:56:14 GMT java.lang.OutOfMemoryError: Java heap space)

Regards,

Suzm

Ronald Jeninga

unread,
Sep 19, 2017, 5:32:14 AM9/19/17
to schedulix
Hi Suzm,

there are two views on the system. One is the user's view in which time zones play an important role.
For this view you can set the relevant time zone in the GUI. Different users can have different configurations, maybe because they simply live in different time zones.
There are several answers here that cover that topic.

The second view is the administrator's view. The server's log file is, more or less, for his eyes only.
And for server messages I find it definitely important to have a linear time. No daylight saving issues.
This is why we use GMT to format the time stamp. This is a time format that can be unambiguously interpreted (and compared) everywhere on the world.

I might be wrong in this opinion. If you think this is the case, please try to convince me. (BTW. I don't live in a place with GMT. It's more CEST (Central Europe Daylight-saving Time)).

Regards,

Ronald

Zhiming Su

unread,
Sep 19, 2017, 5:51:03 AM9/19/17
to schedulix


Hi Ronald,
 
I'll tell you more about the time zone problem
 
Now I have a server machine, 2 sets of client
 server  : system date is  Tue Sep 19 17:39:45 CST 2017   
clinet : system date is Tue Sep 19 17:43:11 CST 2017

 GUI is Asia/Shanghai



What's wrong with it?


Regards,

Suzm



Ronald Jeninga

unread,
Sep 19, 2017, 7:17:54 AM9/19/17
to schedulix
Hi Suzm,

the big mistake is to let politicians agree on time zone names:

CST = Central Standard Time = UTC-6
CST = China Standard Time = UTC+8
CST = Cuba Standard Time = UTC-5

Obviously the engine mixes up CST with CST, making a difference of UTC-6 to UTC+8 = 14 hours.
(I can't really blame the engine for this, but we'll definitely have to repair the issue).

Our first step now is to investigate where the data is correct and where the error occurs.
I hope you don't mind helping us.

Let's investigate the jobserver first.
If you have a standard schedulix installation, you'll have it installed in /opt/schedulix.
Below this directory there's a directory called "taskfiles".(at least on your clients).

Now create a job with some harmless run program that gives you some time to hack. e.g. "sleep 60".
We'll do two things:
a. We look at the times reported by the jobexecutor
b. We look at the times reported by the jobserver to the server

To achieve the second (b), please first set the trace level to 2.

echo "alter server with trace level = 2;" | sdmsh

Now you submit that test job and switch to that taskfiles directory.
As soon as the job gets a state RUNNING there'll be a file in the taskfiles directory that corresponds to the job.

-bash-4.2$ ls -l
total 48
-rw-rw-r--. 1 schedulix schedulix 16384 Sep 19 13:00 localhost-GLOBAL.'EXAMPLES'.'LOCALHOST'.'SERVER'-660148
-rw-rw-r--. 1 schedulix schedulix  6318 Sep 19 13:00 starttimes.GLOBAL.'EXAMPLES'.'HOST_1'.'SERVER'
-rw-rw-r--. 1 schedulix schedulix  6318 Sep 19 13:00 starttimes.GLOBAL.'EXAMPLES'.'HOST_2'.'SERVER'
-rw-rw-r--. 1 schedulix schedulix  5148 Feb 14  2017 starttimes.GLOBAL.'EXAMPLES'.'IRGENDWAS'.'SERVER'
-rw-rw-r--. 1 schedulix schedulix  6300 Sep 19 13:00 starttimes.GLOBAL.'EXAMPLES'.'LOCALHOST'.'SERVER'

in my case it is the first file in the list.
To save the file, we make a hard link to it:

-bash-4.2$ ln localhost-GLOBAL.\'EXAMPLES\'.\'LOCALHOST\'.\'SERVER\'-660148 xxx

After the job finished, we'll have a copy of the taskfile in the file named xxx
We can look at the content:

-bash-4.2$ cat xxx
[19-09-2017 13:00:45 CEST] incomplete
[19-09-2017 13:00:45 CEST] id=660148
[19-09-2017 13:00:45 CEST] run=0
[19-09-2017 13:00:45 CEST] status=STARTED
[19-09-2017 13:00:45 CEST] command=SDMSpopup.sh
[19-09-2017 13:00:45 CEST] argument=SYSTEM.EXAMPLES.E0010_SINGLEJOB.SINGLEJOB
[19-09-2017 13:00:45 CEST] argument=-c
[19-09-2017 13:00:45 CEST] argument=?:1=FAILURE:0=SUCCESS
[19-09-2017 13:00:45 CEST] workdir=/opt/schedulix/tmp
[19-09-2017 13:00:45 CEST] usepath
[19-09-2017 13:00:45 CEST] verboselogs
[19-09-2017 13:00:45 CEST] logfile=660148.log
[19-09-2017 13:00:45 CEST] logfile_append
[19-09-2017 13:00:45 CEST] errlog=660148.log
[19-09-2017 13:00:45 CEST] errlog_append
[19-09-2017 13:00:45 CEST] samelogs
[19-09-2017 13:00:45 CEST] complete
[19-09-2017 13:00:45 CEST] status_tx=STARTED
[19-09-2017 11:00:45 GMT] execpid=15302@N0+1505818845
[19-09-2017 11:00:45 GMT] extpid=15303@N0+1505818845
[19-09-2017 11:00:45 GMT] status=RUNNING
[19-09-2017 13:00:46 CEST] status_tx=RUNNING
[19-09-2017 11:03:01 GMT] returncode=0
[19-09-2017 11:03:01 GMT] status=FINISHED
[19-09-2017 13:03:01 CEST] status_tx=FINISHED

You'll see two different messages. Those with the local timestamp (CST for you, CEST for me) which are written by the Jobserver (Java) process, and those with a GMT timestamp written by the jobexecutor process.
If those times look OK, this part of the system isn't responsible for the confusion. If not, we've found the culprit.

Assuming the above was OK, we now look at the server's log file.
You can simply open it and search for the JobId (660148 in my example).
You'll find messages like

MESSAGE [1037,1001(1001)]       19 Sep 2017 11:00:45 GMT alter job 660148 with status = started, run = 0, timestamp = '19-09-2017 13:00:45 CEST';
...
MESSAGE [1037,1001(1001)]       19 Sep 2017 11:00:46 GMT alter job 660148 with status = running, run = 0, exec_pid = '15302@N0+1505818845', ext_pid = '15303@N0+1505818845', timestamp = '19-09-2017 11:00:45 GMT';
...
MESSAGE [1037,1001(1001)]       19 Sep 2017 11:03:01 GMT alter job 660148 with status = finished, run = 0, exit_code = 0, timestamp = '19-09-2017 11:03:01 GMT';

The timestamps in these commands should match the timestamps of the taskfile.

And if everything is OK so far, we'll have to have a look at the server.
But let's do this first, else things get confusing :-)

Regards,

Ronald

Zhiming Su

unread,
Sep 19, 2017, 10:03:33 AM9/19/17
to schedulix
Hi Ronald,
 
I see log information is no problem, so what is the problem?


[19-09-2017 21:35:24 CST] incomplete
[19-09-2017 21:35:24 CST] id=1094547
[19-09-2017 21:35:24 CST] run=0
[19-09-2017 21:35:24 CST] status=STARTED
[19-09-2017 21:35:24 CST] command=sh
[19-09-2017 21:35:24 CST] argument=/tmp/test/sh
[19-09-2017 21:35:24 CST] workdir=/home/etl-user/work
[19-09-2017 21:35:24 CST] usepath
[19-09-2017 21:35:24 CST] logfile=1094547.log
[19-09-2017 21:35:24 CST] logfile_append
[19-09-2017 21:35:24 CST] errlog=1094547.log
[19-09-2017 21:35:24 CST] errlog_append
[19-09-2017 21:35:24 CST] samelogs
[19-09-2017 21:35:24 CST] complete
[19-09-2017 21:35:24 CST] status_tx=STARTED
[19-09-2017 13:35:24 GMT] execpid=5397@N0+1505828124
[19-09-2017 13:35:24 GMT] extpid=5398@N0+1505828124
[19-09-2017 13:35:24 GMT] status=RUNNING
[19-09-2017 21:35:25 CST] status_tx=RUNNING



MESSAGE [2,SchedulingThread(Worker0)] 19 Sep 2017 13:31:16 GMT ---------------> Start Synchronize Resource Scheduling <-------------------
StartTime = 0
MESSAGE [2,SchedulingThread(Worker0)] 19 Sep 2017 13:31:16 GMT ---------------> Start System Resource Scheduling <-------------------
/1094547
...skipping
MESSAGE [923657,1002(Worker0)] 19 Sep 2017 13:34:45 GMT Server Execution time for class de.independit.scheduler.server.parser.GetNextJob : 0 ms -- Start Committing
MESSAGE [923657,1002(Worker0)] 19 Sep 2017 13:34:45 GMT Execution time for class de.independit.scheduler.server.parser.GetNextJob : 3 ms
MESSAGE [923657,1002(1002)] 19 Sep 2017 13:34:45 GMT alter job 1094547 with status = started, run = 0, timestamp = '19-09-2017 21:35:24 CST';
MESSAGE [923657,1002(Worker0)] 19 Sep 2017 13:34:45 GMT Server Execution time for class de.independit.scheduler.server.parser.AlterJob : 0 ms -- Start Committing
MESSAGE [923657,1002(Worker0)] 19 Sep 2017 13:34:45 GMT Execution time for class de.independit.scheduler.server.parser.AlterJob : 3 ms
MESSAGE [923657,1002(1002)] 19 Sep 2017 13:34:45 GMT get next job;


Regards,

Suzm

Zhiming Su

unread,
Sep 19, 2017, 10:14:17 AM9/19/17
to schedulix
Hi Regards,

I find batch no problem, but the details of job below batch are problematic


Ronald Jeninga

unread,
Sep 19, 2017, 10:38:02 AM9/19/17
to schedulix
Hi Suzm,

well, your experiment showed that it is the server itself that does the wrong conversion (or maybe the GUI, but if we'll find that out).
I'll have to do some digging first, but at least I know that the server gets the correct data as an input (you proved that).

I'll get back on you as soon as I know more. Thank you for your help so far!

Regards,

Ronald

PS: I hope you liked the little excursion through the internals :-)

Zhiming Su

unread,
Sep 19, 2017, 10:54:59 AM9/19/17
to schedulix
 Hi Ronald,

Thanks for your help. Let me know more

I'm looking forward to your answer :-)

Regards,

Suzm

Ronald Jeninga

unread,
Sep 19, 2017, 6:55:59 PM9/19/17
to schedulix
Hi Suzm,

I did some digging :-)

The problem is indeed the non uniqueness of the time zone abbreviations.
Fortunately there's a simple patch to correct this:

diff --git a/src/server/SystemEnvironment.java b/src/server/SystemEnvironment.java
index 6a38034..df54ec0 100644
--- a/src/server/SystemEnvironment.java
+++ b/src/server/SystemEnvironment.java
@@ -255,7 +255,7 @@ public class SystemEnvironment implements Cloneable
        public static final String noneString = "<none>";
        public static final TimeZone systemTimeZone = TimeZone.getTimeZone ("GMT");
        public static final Locale systemLocale = new Locale("EN", "GB");
-       public static final SimpleDateFormat staticJSCommDateFormat = new SimpleDateFormat ("dd-MM-yyyy HH:mm:ss z", systemLocale);
+       public static final SimpleDateFormat staticJSCommDateFormat = new SimpleDateFormat ("dd-MM-yyyy HH:mm:ss Z", systemLocale);
        public static final SimpleDateFormat staticSystemDateFormat = new SimpleDateFormat ("dd MMM yyyy HH:mm:ss z", systemLocale);
        public SimpleDateFormat systemDateFormat = (SimpleDateFormat) staticSystemDateFormat.clone();
        public SimpleDateFormat jsCommDateFormat = (SimpleDateFormat) staticJSCommDateFormat.clone();

Or, to explain what's happening, the change of the lower case 'z' to the upper case 'Z' changes the abbreviation of the time zone (which is nice to read) to the UTC offset.
JSCommDateFormat is the date format used for the communication with the jobserver.

So if you make this change and then restart all components (don't forget to copy the jar file to the client nodes), it should work.
For me it did at least.

If this works for you without problems, please tell me and I'll add the change to the repository.

Regards,

Ronald

Zhiming Su

unread,
Sep 19, 2017, 10:41:52 PM9/19/17
to schedulix
Hi Regards,

When I changed the SystemEnvironment.java of server and clinet: public static final SimpleDateFormat staticJSCommDateFormat = new SimpleDateFormat ("dd-MM-yyyy HH:mm:ss Z", systemLocale);

It looks like it's OK, but it's still wrong when the job is running




Are there any other places to change?


for example:

-       public static final SimpleDateFormat staticJSCommDateFormat = new SimpleDateFormat ("dd-MM-yyyy HH:mm:ss z", systemLocale);
+       public static final SimpleDateFormat staticJSCommDateFormat = new SimpleDateFormat ("dd-MM-yyyy HH:mm:ss Z", systemLocale);
        public static final SimpleDateFormat staticSystemDateFormat = new SimpleDateFormat ("dd MMM yyyy HH:mm:ss z", systemLocale);


Regards,

Suzm

Ronald Jeninga

unread,
Sep 20, 2017, 12:16:30 AM9/20/17
to schedulix
Hi Suzm,

hm, I changed the time zone on my machine and was able to reproduce the flaw.
After my change (and of course a recompile) and restarting all components with  the new jar file, the behaviour was correct.

Your proposed change isn't very wrong, but in fact I kept it the way it is to allow for "lazy" upgrades.
If the server can't interpret a jobserver message (timestamp) with the staticJSCommDateFormat, it'll try again with the staticSystemDateFormat.
This way your system would still work even if you didn't deploy the BICsuite.jar file and restart all jobservers yet.

It is also worth to point out that your "old" jobs will have the old timestamp stored. The change will only affect the newly submitted jobs.

I just submitted another "singlejob" and made a screen shot. As you can see, the time is displayed correctly


Maybe you simply forgot to compile after your change?

Regards,

Ronald

Zhiming Su

unread,
Sep 20, 2017, 1:03:35 AM9/20/17
to schedulix
Hi Ronald,

It is after I changed the java file without re make,When I re compile and generate new  BICsuite.jar file
Look no problem,But I found a new problem, the running time is negative, it is JObserver server with the time inconsistency caused? Why not JObserver time to calculate?


Regards,

Suzm

Ronald Jeninga

unread,
Sep 20, 2017, 1:55:37 AM9/20/17
to schedulix
Hi Suzm,

in a previous post you wrote:

server  : system date is  Tue Sep 19 17:39:45 CST 2017   
clinet : system date is Tue Sep 19 17:43:11 CST 2017

which means that the clocks aren't sync'ed.
The displayed start time is coming from the jobserver and it is compared with the current time on the server's machine.
Now since the client is some time ahead the result can be negative.

After the job has finished both the start and finish time will be times reported by the jobserver and the running time will show a correct value.

The better idea is to use some NTP server to keep your computer clocks synchronised.
You can either use some internal system as a reference (but not necessarily correct), or you can use some publicly available time server.
At least it would avoid "problems" like these.

Regards,

Ronald

Zhiming Su

unread,
Sep 20, 2017, 2:04:02 AM9/20/17
to schedulix
Hi Ronald,

Thanks for your help :-)


Regards,

Suzm
Reply all
Reply to author
Forward
0 new messages