[scrolllog] Child exited with state 65280
[scrolllog] Try to restart child (child terminated with exit code <> 0)
(04307111431) (04301271558) Cannot stat() /proc/1: (2) No such file or directory
[scrolllog] Waiting for child (19485) to terminate
[scrolllog] Child exited with state 65280
[scrolllog] Try to restart child (child terminated with exit code <> 0)
(04307111431) (04301271558) Cannot stat() /proc/1: (2) No such file or directory
[scrolllog] Waiting for child (19494) to terminate
Two things are different in a Solaris environment:
1. Identification of processes
Using only a PID is not a good idea, since PIDs are reused. So if a jobexecutor has died, we need more than just a PID to determine the state of a job (BROKEN_RUNNING or BROKEN_FINISHED).
2. File locking of the taskfile
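To make the first point concrete, here is a minimal sketch of what "more than just a PID" could look like. It is not the jobexecutor's actual code: the struct and function names are hypothetical, and using the mtime of /proc/<pid> as a second identifying attribute is an assumption that would have to be verified on each platform (Solaris zones included).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical identity record: the PID plus a second attribute that is
 * expected to differ once the PID has been reused by another process. */
struct proc_identity {
    pid_t pid;
    long  start_time;   /* assumption: mtime of /proc/<pid> */
};

/* Fill in the identity of a live process. Returns false if the process
 * cannot be inspected (it is gone, or there is no usable /proc). */
bool proc_identity_get(pid_t pid, struct proc_identity *out)
{
    char path[64];
    struct stat buf;

    assert(out != NULL);
    snprintf(path, sizeof path, "/proc/%ld", (long) pid);
    if (stat(path, &buf) != 0)
        return false;

    out->pid = pid;
    out->start_time = (long) buf.st_mtime;
    return true;
}

/* True only if the PID exists and its second attribute still matches
 * the value recorded when the job was started. */
bool proc_still_running(const struct proc_identity *recorded)
{
    struct proc_identity now;

    assert(recorded != NULL);
    if (!proc_identity_get(recorded->pid, &now))
        return false;
    return now.start_time == recorded->start_time;
}
```

With something like this, a recorded identity that no longer matches would point towards BROKEN_FINISHED, while a match would point towards BROKEN_RUNNING. The PID alone cannot make that distinction.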
Connect
CONNECT_TIME : 17 Oct 2013 10:35:03 GMT
Connected
[SYS...@10.0.4.216:2506] SDMS> begin multicommand
alter job server GLOBAL.'BL02_ZONE2'.'ZONE2'.'ZONE2_JS'
with
    group = 'PUBLIC',
    rawpassword = 'password',
    node = '10.0.4.216',
    config = (
        'JOBFILEPREFIX' = '/opt/schedulix/schedulix/taskfiles/zone2/task-',
        'NOTIFYPORT' = '45500',
        'HTTPPORT' = '8900',
        'BOOTTIME' = 'NONE'
    );
end multicommand;
1 Command(s) processed
[SYS...@10.0.4.216:2506] SDMS>
MESSAGE [1004(Listener)] 17 Oct 2013 10:35:03 GMT UserConnection initialized
MESSAGE [1004(1004)] 17 Oct 2013 10:35:03 GMT UserConnection started
MESSAGE [1004(1004)] 17 Oct 2013 10:35:03 GMT connect SYSTEM identified by '**********' with protocol = SERIAL, timeout = 0, session = 'sdmsh[root@bl02-zone2]';
MESSAGE [0,1004(Worker1)] 17 Oct 2013 10:35:03 GMT Execution time for class de.independit.scheduler.server.parser.Connect : 1 ms
MESSAGE [0,1004(1004)] 17 Oct 2013 10:35:03 GMT begin multicommand
alter job server GLOBAL.'BL02_ZONE2'.'ZONE2'.'ZONE2_JS'
with
    group = 'PUBLIC',
    rawpassword = '**********',
    node = '10.0.4.216',
    config = (
        'JOBFILEPREFIX' = '/opt/schedulix/schedulix/taskfiles/zone2/task-',
        'NOTIFYPORT' = '45500',
        'HTTPPORT' = '8900',
        'BOOTTIME' = 'NONE'
    );
end multicommand;
MESSAGE [0,1004(Worker0)] 17 Oct 2013 10:35:03 GMT Server Execution time for class de.independit.scheduler.server.parser.MultiCommand : 7 ms -- Start Committing
MESSAGE [0,1004(Worker0)] 17 Oct 2013 10:35:03 GMT Execution time for class de.independit.scheduler.server.parser.MultiCommand : 7 ms
MESSAGE [0,1004(Worker2)] 17 Oct 2013 10:35:03 GMT Execution time for class de.independit.scheduler.server.parser.Disconnect : 0 ms
MESSAGE [0,1004(1004)] 17 Oct 2013 10:35:03 GMT UserConnection terminated
#ifdef SOLARIS
{
    *bt=0;
    return true;
//  struct stat buf;
//  if (stat ("/proc/1", &buf))
//      RETURN_FALSE (errText ("(04301271558) Cannot stat() /proc/1", errno));
//
//  *bt = (long) buf.st_mtime;
}
return true;
#endif
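As an aside, setting the boot time to 0 is not the only way around the missing /proc/1. One possible alternative is to read the BOOT_TIME record from the utmpx database. This is only a sketch under assumptions: the function name boot_time_from_utmpx is hypothetical, it is not part of the schedulix sources, and it assumes the zone maintains a utmpx database with a boot record. If no record is found it falls back to 0, like the patch above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <utmpx.h>

/* Hypothetical replacement for the stat("/proc/1") lookup: scan the
 * utmpx database for the BOOT_TIME record and report its timestamp.
 * Returns false and leaves *bt at 0 when no such record exists. */
bool boot_time_from_utmpx(long *bt)
{
    struct utmpx *entry;
    bool found = false;

    assert(bt != NULL);
    *bt = 0;

    setutxent();                        /* rewind the database */
    while ((entry = getutxent()) != NULL) {
        if (entry->ut_type == BOOT_TIME) {
            *bt = (long) entry->ut_tv.tv_sec;
            found = true;
            break;
        }
    }
    endutxent();                        /* close the database */

    return found;
}
```

Whether the value obtained this way is stable enough for the jobserver's purposes inside a zone would need to be checked. With 'BOOTTIME' = 'NONE' in the jobserver config, the lookup may not be needed at all.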
DEBUG [Jobserver] 17-10-2013 17:06:36 SAST > reassure 23030;
DEBUG [Jobserver] 17-10-2013 17:06:36 SAST < container=[title="Jobserver Command", record=[COMMAND="STARTJOB", ID=23030, DIR="/opt/schedulix/schedulix/tmp", LOG="23030.log", LOGAPP=true, ERR="23030.log", ERRAPP=true, CMD="/var/smile/install/scripts/utils/SDMSpopup.sh", ARGS=["SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB", "-c", "?:1=FAILURE:0=SUCCESS"], ENV=["ERRORLOG", "23030.log", "EXPFINALTIME", "0", "EXPRUNTIME", "0", "FINISHTIME", "", "ISRESTARTABLE", "0", "JOBID", "23030", "JOBNAME", "SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB", "JOBSTATE", "", "JOBTAG", "", "KEY", "7764061444620056933", "LOGFILE", "23030.log", "MASTERID", "23030", "MERGEDSTATE", "", "PARENTID", "", "PID", "", "RERUNSEQ", "0", "RESOURCETIME", "20131017164750", "RUNNABLETIME", "20131017164750", "SCOPENAME", "GLOBAL.NG1BL02_ZONE2.ZONE2.ZONE2_JS", "SDMSHOST", "localhost", "SDMSPORT", "2506", "SEID", "21041", "STARTTIME", "20131017170636", "STATE", "STARTING", "SUBMITTIME", "20131017160013", "SYNCTIME", "20131017164749", "SYSDATE", "20131017170636", "TRIGGERBASE", "", "TRIGGERBASEID", "", "TRIGGERBASEJOBID", "", "TRIGGERNAME", "", "TRIGGERNEWSTATE", "", "TRIGGERORIGIN", "", "TRIGGERORIGINID", "", "TRIGGERORIGINJOBID", "", "TRIGGEROLDSTATE", "", "TRIGGERREASON", "", "TRIGGERREASONID", "", "TRIGGERREASONJOBID", "", "TRIGGERSEQNO", "0", "TRIGGERTYPE", "", "LAST_WARNING", "", "WORKDIR", "/opt/schedulix/schedulix/tmp"], RUN=0]]
DEBUG [Jobserver] 17-10-2013 17:06:36 SAST > get next job;
(04301271607) GetStaticMethodID() failed
$ cat task-GLOBAL.\'NG1BL02_ZONE2\'.\'ZONE2\'.\'ZONE2_JS\'-23030
[18-10-2013 08:39:13 SAST] incomplete
[18-10-2013 08:39:13 SAST] id=23030
[18-10-2013 08:39:13 SAST] run=3
[18-10-2013 08:39:13 SAST] status=STARTED
[18-10-2013 08:39:13 SAST] command=/var/smile/install/scripts/utils/SDMSpopup.sh
[18-10-2013 08:39:13 SAST] argument=SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB
[18-10-2013 08:39:13 SAST] argument=-c
[18-10-2013 08:39:13 SAST] argument=?:1=FAILURE:0=SUCCESS
[18-10-2013 08:39:13 SAST] workdir=/opt/schedulix/schedulix/tmp
[18-10-2013 08:39:13 SAST] usepath
[18-10-2013 08:39:13 SAST] verboselogs
[18-10-2013 08:39:13 SAST] logfile=23030.log
[18-10-2013 08:39:13 SAST] logfile_append
[18-10-2013 08:39:13 SAST] errlog=23030.log
[18-10-2013 08:39:13 SAST] errlog_append
[18-10-2013 08:39:13 SAST] samelogs
[18-10-2013 08:39:13 SAST] complete
[18-10-2013 08:39:13 SAST] status_tx=STARTED
$ ./stopZone2Jobserver.sh
Stopping Jobserver zone2
$ rm task-GLOBAL.\'NG1BL02_ZONE2\'.\'ZONE2\'.\'ZONE2_JS\'-23030
$ ./alterJobState.sh
Connect
CONNECT_TIME : 18 Oct 2013 06:42:36 GMT
Connected
[SYS...@10.0.0.216:2506] SDMS> begin multicommand
alter job 23030 with state = finished, exit_code = 1;
end multicommand;
1 Command(s) processed
[SYS...@10.0.0.216:2506] SDMS>
$ ./startZone2Jobserver.sh
Starting Jobserver zone2
When I submitted the job, it went back to the previous state, as the log shows:
DEBUG [Jobserver] 18-10-2013 08:48:07 SAST > reassure 23030;
DEBUG [Jobserver] 18-10-2013 08:48:07 SAST < container=[title="Jobserver Command", record=[COMMAND="STARTJOB", ID=23030, DIR="/opt/schedulix/schedulix/tmp", LOG="23030.log", LOGAPP=true, ERR="23030.log", ERRAPP=true, CMD="/var/smile/install/scripts/utils/SDMSpopup.sh", ARGS=["SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB", "-c", "?:1=FAILURE:0=SUCCESS"], ENV=["ERRORLOG", "23030.log", "EXPFINALTIME", "0", "EXPRUNTIME", "0", "FINISHTIME", "", "ISRESTARTABLE", "0", "JOBID", "23030", "JOBNAME", "SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB", "JOBSTATE", "", "JOBTAG", "", "KEY", "7764061444620056933", "LOGFILE", "23030.log", "MASTERID", "23030", "MERGEDSTATE", "", "PARENTID", "", "PID", "", "RERUNSEQ", "4", "RESOURCETIME", "20131018084732", "RUNNABLETIME", "20131018084732", "SCOPENAME", "GLOBAL.NG1BL02_ZONE2.ZONE2.ZONE2_JS", "SDMSHOST", "localhost", "SDMSPORT", "2506", "SEID", "21041", "STARTTIME", "20131018084807", "STATE", "STARTING", "SUBMITTIME", "20131017160013", "SYNCTIME", "20131018084727", "SYSDATE", "20131018084807", "TRIGGERBASE", "", "TRIGGERBASEID", "", "TRIGGERBASEJOBID", "", "TRIGGERNAME", "", "TRIGGERNEWSTATE", "", "TRIGGERORIGIN", "", "TRIGGERORIGINID", "", "TRIGGERORIGINJOBID", "", "TRIGGEROLDSTATE", "", "TRIGGERREASON", "", "TRIGGERREASONID", "", "TRIGGERREASONJOBID", "", "TRIGGERSEQNO", "0", "TRIGGERTYPE", "", "LAST_WARNING", "", "WORKDIR", "/opt/schedulix/schedulix/tmp"], RUN=4]]
DEBUG [Jobserver] 18-10-2013 08:48:07 SAST > get next job;
(04301271607) GetStaticMethodID() failed
DEBUG [Jobserver] 18-10-2013 08:48:07 SAST < container=[title="Jobserver Command", record=[COMMAND="NOP"]]
DEBUG [Jobserver] 18-10-2013 08:48:07 SAST registered thread 0