Schedulix crashing on "Error executing command (get next job;): null (java.io.EOFException)"

Anthony Moyles

Aug 25, 2017, 5:21:58 AM
to schedulix
Hi guys,

I'm new to Schedulix and am supporting it in our Dev environment. It's been working great for a few months, very stable, but 2 days ago we ran into an issue.

For some reason, under Monitor Batches and Jobs when a user clicks on the magnifying glass (Start Find) Schedulix crashes.
We get the following error on screen:

COMMAND
list job with mode = TREE , ( HISTORY = 1 DAYS), PARAMETERS = ('LOG_DISPLAY_LIMIT'), EXPAND = NONE
ERRORCODE
ZSI-10001
ERRORMESSAGE
ConnectError(ConnectError)

and tailing the /home/schedulix/log/jobserver.out log I see the following:

startTimeJob = 1503650737 startTimePid = 1503650733
ERROR   [Jobserver]     25-08-2017 04:45:43 EDT (04504092058) Error executing command (get next job;): null (java.io.EOFException)
ERROR   [Jobserver]     25-08-2017 04:45:43 EDT (04301271455) localhost:2506: Connection refused (Connection refused) (java.net.ConnectException)

Tailing the postgres database log at the same time I see the following error come in:

STATEMENT:  INSERT INTO REPOSITORY_LOCK ( LOCKID, TS, TICKET ) VALUES ( 1 , $1, $2)
ERROR:  duplicate key value violates unique constraint "pk_repository_lock"
DETAIL:  Key (lockid)=(1) already exists.

The Connection refused errors continue for 5-15 minutes before it says it's connected again, e.g.:

INFO    [Jobserver]     25-08-2017 05:11:42 EDT (04305151959) Connected!

But as soon as the user tries the above, the whole process repeats. I'm stuck.

Any help appreciated,
Anthony

Anthony Moyles

Aug 25, 2017, 5:43:34 AM
to schedulix
I notice that in the BICServer.out it's looping, failing to start with an OutOfMemoryError, which I assume is related:

INFO    [Thread-1]      25 Aug 2017 09:40:16 GMT Read 25, Loaded 25 rows for RESSOURCE
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT java.lang.OutOfMemoryError: Java heap space
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT java.nio.CharBuffer.allocate(CharBuffer.java:335)
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT sun.nio.cs.StreamEncoder.flushLeftoverChar(StreamEncoder.java:233)
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:301)
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT sun.nio.cs.StreamEncoder.close(StreamEncoder.java:149)
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT java.io.OutputStreamWriter.close(OutputStreamWriter.java:233)
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT org.postgresql.core.PGStream.close(PGStream.java:531)
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT org.postgresql.core.v3.ProtocolConnectionImpl.close(ProtocolConnectionImpl.java:138)
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT org.postgresql.jdbc2.AbstractJdbc2Connection.close(AbstractJdbc2Connection.java:537)
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT de.independit.scheduler.server.repository.TableLoader.SDMSrun(SDMSRepository.java:393)
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT de.independit.scheduler.server.util.SDMSThread.run(SDMSThread.java:201)
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT ****************** Start Stacktrace *********************
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT de.independit.scheduler.server.util.SDMSThread.doTrace(SDMSThread.java:147)
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT de.independit.scheduler.server.util.SDMSThread.run(SDMSThread.java:220)
FATAL   [Thread-0]      25 Aug 2017 09:40:35 GMT ****************** End Stacktrace   *********************
INFO    [Shutdown]      25 Aug 2017 09:40:35 GMT ***********************************************
INFO    [Shutdown]      25 Aug 2017 09:40:35 GMT **                                           **
INFO    [Shutdown]      25 Aug 2017 09:40:35 GMT ** U N G R A C E F U L   S H U T D O W N ! ! **
INFO    [Shutdown]      25 Aug 2017 09:40:35 GMT **                                           **
INFO    [Shutdown]      25 Aug 2017 09:40:35 GMT ***********************************************
INFO    [Shutdown]      25 Aug 2017 09:40:35 GMT Release repository lock from 1503654014667
INFO    [Shutdown]      25 Aug 2017 09:40:35 GMT Shutdown completed

Is it possible some operation is now being run that exceeds the heap available to the Schedulix JVM, and I need to increase the heap space arguments somewhere?
The box itself has plenty of memory free:

             total       used       free     shared    buffers     cached
Mem:          7986       1174       6812         37        197        379
-/+ buffers/cache:        597       7389
Swap:            0          0          0

Ronald Jeninga

Aug 25, 2017, 6:27:01 AM
to schedulix
Hi Anthony,

first of all, it is good to hear you ran the system for several months without any problems.
(Not because this is something special, but because the system should be a help to its users, which it apparently was.)

While reading your first mail my first thought was that it is probably a memory issue, which was confirmed by your second mail.

The remedy is simple. If you look at the $BICSUITECONFIG/java.conf file, you'll see a parameter defining the maximum amount of memory the server is allowed to allocate.
It looks like:

...
#
# Additional Java Flags to use
#
# java flags used for BICsuite Server
#
BICSUITEMEM=1024
#
MAXMEM=${BICSUITEMEM}m
STARTMEM=`expr $BICSUITEMEM / 2`m
NEWSIZE=`expr $BICSUITEMEM / 16`m
MEMSIZES="-Xmx${MAXMEM} -Xms${STARTMEM} -XX:NewSize=${NEWSIZE}"
...

In my case (which is on my development machine), I have configured 1024M as the maximum amount of memory.
It is the BICSUITEMEM parameter you change. (You are allowed to change the Java memory configuration that follows that parameter, but please leave it alone if you are not a guru regarding Java memory management.)
The default is set to 300. That is a nice value for evaluation, but not a sane value for any serious use. At least you were able to run the system for a few months without issues, but in my world uptimes should be 365 days or more.
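
So the whole change boils down to a single line; the value below is only an example (it has to fit into the physical memory of your box), and the server has to be restarted for the new setting to take effect:

# maximum heap size for the scheduling server in MB (example value only)
BICSUITEMEM=1024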

To avoid this kind of problem in the future, it makes sense to monitor the memory usage periodically (e.g. once per day).
In order to do so, you can install a job that evaluates the output of

echo "show system;" | sdmsh | grep MEM

On my system that yields

-bash-4.2$ echo "show system;" | sdmsh | grep MEM
                     MEM_USED : 530186240
                     MEM_FREE : 358930104
                      MEM_MAX : 1038876672

Now if MEM_USED (the amount of memory allocated) equals MEM_MAX (the max that can be allocated) and MEM_FREE is less than, let me say, 20% of MEM_USED, the time has come to reconsider the memory configuration.
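
If you want to automate that check, a minimal daily job along these lines would do (just a sketch; the awk parsing of the sdmsh output and the non-zero exit are my own assumptions, so adapt them to your environment):

#!/bin/sh
# Daily memory check for the scheduling server (sketch only).
# Reads the MEM_* values from "show system;" and exits non-zero when the
# heap is fully grown and less than 20% of the used amount is still free,
# so the monitoring job turns red and someone reconsiders BICSUITEMEM.

OUT=`echo "show system;" | sdmsh | grep MEM`

MEM_USED=`echo "$OUT" | awk '/MEM_USED/ { print $3 }'`
MEM_FREE=`echo "$OUT" | awk '/MEM_FREE/ { print $3 }'`
MEM_MAX=`echo "$OUT" | awk '/MEM_MAX/ { print $3 }'`

# heap at its maximum and MEM_FREE below 20% of MEM_USED
if [ "$MEM_USED" -eq "$MEM_MAX" ] && [ `expr $MEM_FREE \* 5` -lt "$MEM_USED" ]; then
    echo "Server memory is getting tight: used=$MEM_USED free=$MEM_FREE max=$MEM_MAX"
    exit 1
fi
exit 0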

HTH

Regards,

Ronald

Anthony Moyles

Aug 25, 2017, 7:19:53 AM
to schedulix
Thanks Ronald for that detailed reply. Would you believe that just before you posted, I spotted where to increase the BICSUITEMEM argument, restarted, and got past my issue :)
Also, thank you for the tip on how to monitor the memory, I'll incorporate that into our monitoring.

We believe this happened because we run batch jobs every 5 minutes, which amounts to thousands of jobs a day. Many of these failed and needed manual intervention, and it seems the backlog grew so large that at some point it exceeded the JVM's memory limits and caused the crash. Can you tell me whether there is a way to limit this going forward, so that the system won't store or try to retrieve so many jobs that it exceeds the available memory?

Even though I've increased the memory, I'm worried that this issue may recur after a certain amount of time, if that makes sense.

Thanks so much,
Anthony

Ronald Jeninga

Aug 25, 2017, 8:11:43 AM
to schedulix
Hi Anthony,

well, your ability to solve the problem yourself more or less proves there's some logic behind the various conf files :-)

Jobs in a restartable state will remain in memory forever. And thousands of restartable jobs will indeed fill the memory.
I was pleased to read your understanding of a restartable state: it does indeed mean a manual intervention is required.
This means that it can be perfectly valid to have final error states.

schedulix can be regarded as a high-level distributed programming language. This means that if you have standard error situations, the treatment of such errors can possibly be automated.
Implementing an automated treatment of standard errors will significantly reduce the number of "red" jobs that require manual intervention. This in turn will reduce the backlog problem.

There are a few methods to treat errors automatically:

1. If your standard reaction is to cancel the failed job, you can just as well use a profile where FAILURE is defined to be a final state.
    The job will be "green" after running into a FAILURE and can be cleaned out of memory later.
2. If your standard reaction is to simply rerun the job, you could define a rerun trigger which retries the job after some period.
    This is a very good method to cope with "soft" errors, i.e. errors that occur because of timing issues.
3. If you have a standard repair procedure, that could be integrated into your job flow.
    This means, you'll build a loop:
    do
       run JOBX
       if ERROR; then
           run REPAIR
       fi
    while ERROR
    Example E0170 shows how to build a simple loop within schedulix.
    And maybe the rerun of JOBX isn't even required, which means the situation is already fine once the REPAIR job has run. So you'd only have to implement the "IF" (see the small shell sketch after this list).
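
If the repair is simple enough, it doesn't even need to be a separate schedulix job; the same "if error then repair" idea inside a single run program boils down to a few lines of shell (JOBX_CMD and REPAIR_CMD are placeholders for your own commands, not anything schedulix provides):

#!/bin/sh
# Sketch only: run the real work; if it fails, run the repair step and
# retry the work once. The job fails only if the retry fails as well.
# JOBX_CMD and REPAIR_CMD are placeholders for your own commands.
$JOBX_CMD || { $REPAIR_CMD && $JOBX_CMD; }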

Depending on the exact situation I might find more strategies, but I don't know the details of your system at the moment.

Apart from reducing the number of restartable jobs, it might make sense to have a look at the history configuration.
This defines which jobs can be cleaned out. The parameters can be found in the $BICSUITECONFIG/server.conf file:

In my system I don't care too much about jobs that ran successfully. I mainly use it to test my newest developments.
This is why the parameters are set to relatively low values. Still the relevant parameters are:

#
# History: In Core Job History in minutes. Finished masters are kept in memory
#          at least this long
#
History=60

#
# HistoryLimit: since the number of jobs to load can be specified, the previously 
#               defined History can be exceeded.
#               The HistoryLimit defines a hard limit. Final jobs elder than this
#               limit won't be loaded, disregarding any counts
#
HistoryLimit=120

#
#
# MinHistoryCount: Minimum number of masters loaded (if present), disregarding the
#                  History. 0 means no minimum
#
MinHistoryCount=1

#
# MaxHistoryCount: Maximum number of masters loaded, even if History is larger
#                  0 means ignore
#
MaxHistoryCount=3

In your system, where you have a lot of "heartbeat jobs", it'll probably make sense to keep only a few of them.
It won't make sense to keep the runs of the last 10 days in memory (History=14400; Min and Max at zero).
Although this doesn't reduce the problem caused by the restartable jobs, it allows a far more efficient use of the memory, which in itself relaxes the main problem somewhat.
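
Just as an illustration (the numbers are made up, not a recommendation), a history configuration for a workload like yours could look something like this:

# Keep finished masters for two hours, never load anything older than
# four hours, and cap the number of loaded masters so thousands of
# heartbeat runs don't pile up in memory (illustrative values only)
History=120
HistoryLimit=240
MinHistoryCount=1
MaxHistoryCount=100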

I hope these thoughts help you decide on a strategy to stabilise your system
(apart from throwing memory at it).

Regards,

Ronald

Anthony Moyles

Aug 25, 2017, 9:47:47 AM
to schedulix
Thanks Ronald,

That info is absolutely perfect. I think because it's Dev we'll go with option 1, as we don't mind failed jobs ending up green in this environment - especially when we have this many.

And thank you for the information on $BICSUITECONFIG/server.conf, I'll review that and possibly make a change there too.

I really appreciate your help here and quick responses, thanks again!
Anthony