Cannot stat() /proc/1: (2) No such file or directory (Solaris)

Sabelo Dlangamandla

Oct 16, 2013, 8:49:43 AM
to sche...@googlegroups.com
Good-day everyone,

With Jason's help, I have successfully compiled the schedulix software on Solaris. He (Jason) promised to post on the forum how to compile schedulix on Solaris. I have installed a jobserver on the Solaris machine, and when I view its log I see the following:
[scrolllog] Child exited with state 65280
[scrolllog] Try to restart child (child terminated with exit code <> 0)
(04307111431) (04301271558) Cannot stat() /proc/1: (2) No such file or directory
[scrolllog] Waiting for child (19485) to terminate
[scrolllog] Child exited with state 65280
[scrolllog] Try to restart child (child terminated with exit code <> 0)
(04307111431) (04301271558) Cannot stat() /proc/1: (2) No such file or directory
[scrolllog] Waiting for child (19494) to terminate

I recall Ronald Jeninga giving me some advice from this post "asking for clarification regarding the 'run program' field of the Job/Batch object":
There are two things different in a Solaris environment:
1. Identification of processes
    Using only a PID is not a good idea, since PIDs are reused. So if a jobexecutor has died, we need more than just a PID in order to determine the state of a job (BROKEN_RUNNING or BROKEN_FINISHED)
2. File locking of the taskfile

Judging from the log output, I suspect this is what Ronald was warning me about. If so, how can I fix it? I thought I should check on the forum before I meddle with the software. It is true that the directory /proc/1 does not exist, so I would like to know whether a workaround is possible.

Thanks in advance,
Sabelo

Ronald Jeninga

Oct 16, 2013, 10:40:54 AM
to sche...@googlegroups.com
Hi Sabelo,

I've seen that error before. It depends on the version, or perhaps the configuration, of your Solaris.
What the jobserver is trying to do is get a good estimate of the boot time. The start time of the init process is normally a good estimate.
BUT: the init process does not have pid 1 on all Solaris machines :-(

You have several choices here:
1. find another method to determine boot time
2. forget about the boot time

The second option might seem strange, but you only lose a tiny bit of precision.
To explain this:

I already said that a pid alone is not enough to identify a process, since pids are reused.
So we add the start time of a process to make the pid unique. This works most of the time, but it's not forbidden to change the time on a computer (e.g. the clock is 3 hours ahead and you correct this as administrator), which makes the start time of a process + pid potentially non-unique.
To avoid funny effects, we also added the boot time. Just in case ...
Admittedly, we're still not waterproof here, but we figured: if some administrator manages to boot twice and, by manipulating the clock, gets the same boot time twice, and then also manages to get a process with the same pid and start time, he definitely deserves to see the error occur (you'll get a job in state BROKEN_RUNNING instead of BROKEN_FINISHED).

So, skipping the boot time for identification isn't a catastrophe.
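
To make the idea concrete, here is a minimal sketch of such a comparison. The struct and function names are made up for illustration and are not taken from the schedulix sources; the only thing it tries to show is that a process counts as "the same" only when the pid, the start time and (if recorded) the boot time all match.

```cpp
// Hypothetical illustration only; not the schedulix data structure.
struct ProcessIdentity {
    long pid;          // operating system process id (may be reused)
    long startTime;    // start time of the process
    bool hasBootTime;  // false corresponds to BOOTTIME = NONE
    long bootTime;     // boot time estimate, ignored when hasBootTime is false
};

// Two identities denote the same process only if every recorded part matches.
bool sameProcess(const ProcessIdentity& a, const ProcessIdentity& b) {
    if (a.pid != b.pid || a.startTime != b.startTime) {
        return false;
    }
    if (a.hasBootTime != b.hasBootTime) {
        return false;  // recorded under different BOOTTIME settings
    }
    // Without a boot time, pid + start time is all there is to compare.
    return !a.hasBootTime || a.bootTime == b.bootTime;
}
```

With BOOTTIME set to NONE the last check simply drops out, which is the "tiny bit of precision" that gets lost.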

You can configure this: set the configuration parameter BOOTTIME to NONE. You can't do this in the GUI; you'll have to use sdmsh for that. Alternatively, you can change the default value in .../src/jobserver/Config.java from SYSTEM to NONE. This might be easier.

Hope this helps,

Regards,

Ronald

Sabelo Dlangamandla

Oct 17, 2013, 6:55:08 AM
to sche...@googlegroups.com
Thanks, Ronald, for clarifying that part. I have changed the BOOTTIME property to NONE using sdmsh as advised; here is the output:

Connect
CONNECT_TIME : 17 Oct 2013 10:35:03 GMT
Connected
[SYS...@10.0.4.216:2506] SDMS> begin multicommand
alter job server GLOBAL.'BL02_ZONE2'.'ZONE2'.'ZONE2_JS'
with
        group = 'PUBLIC',
        rawpassword = 'password',
        node = '10.0.4.216',
        config = (
                'JOBFILEPREFIX' = '/opt/schedulix/schedulix/taskfiles/zone2/task-',
                'NOTIFYPORT' = '45500',
                'HTTPPORT' = '8900',
                'BOOTTIME' = 'NONE'
        );
end multicommand;
1 Command(s) processed

But it seems the change has not been propagated. While tailing the Schedulix Server log I could see that the server received the command, but the change has not taken effect, as the missing file error is still there. Here is the output of the Schedulix Server as well:

MESSAGE [1004(Listener)] 17 Oct 2013 10:35:03 GMT UserConnection initialized
MESSAGE [1004(1004)] 17 Oct 2013 10:35:03 GMT UserConnection started
MESSAGE [1004(1004)] 17 Oct 2013 10:35:03 GMT connect SYSTEM identified by '**********' with protocol = SERIAL, timeout = 0, session = 'sdmsh[root@bl02-zone2]';
MESSAGE [0,1004(Worker1)] 17 Oct 2013 10:35:03 GMT Execution time for class de.independit.scheduler.server.parser.Connect : 1 ms
MESSAGE [0,1004(1004)] 17 Oct 2013 10:35:03 GMT begin multicommand
alter job server GLOBAL.'BL02_ZONE2'.'ZONE2'.'ZONE2_JS'
with
        group = 'PUBLIC',
        rawpassword = '**********',
        node = '10.0.4.216',
        config = (
                'JOBFILEPREFIX' = '/opt/schedulix/schedulix/taskfiles/zone2/task-',
                'NOTIFYPORT' = '45500',
                'HTTPPORT' = '8900',
'BOOTTIME' = 'NONE'
        );
end multicommand;
MESSAGE [0,1004(Worker0)] 17 Oct 2013 10:35:03 GMT Server Execution time for class de.independit.scheduler.server.parser.MultiCommand : 7 ms -- Start Committing
MESSAGE [0,1004(Worker0)] 17 Oct 2013 10:35:03 GMT Execution time for class de.independit.scheduler.server.parser.MultiCommand : 7 ms
MESSAGE [0,1004(Worker2)] 17 Oct 2013 10:35:03 GMT Execution time for class de.independit.scheduler.server.parser.Disconnect : 0 ms
MESSAGE [0,1004(1004)] 17 Oct 2013 10:35:03 GMT UserConnection terminated

Did I run the command properly, or did I miss something? I have not changed the Config.java file for now, as I thought this should be good enough, as you suggested.

Regards
Sabelo 
 

Ronald Jeninga

Oct 17, 2013, 7:39:30 AM
to sche...@googlegroups.com
Hi Sabelo,

mh, the command you posted seems correct.
I just tried it myself (on a linux system, but that's not relevant, or at least shouldn't be relevant):

First I did a

[SYSTEM@localhost:2506] SDMS> alter scope GLOBAL.EXAMPLES.LOCALHOST.SERVER     
with config = ('BOOTTIME' = 'NONE');

and checked the result (show scope global.examples.localhost.server;)
Then I submitted a job and the taskfile reported

ronald@cheetah:~/SDMS/SDMS/sandbox/taskfiles$ cat localhost-GLOBAL.\'EXAMPLES\'.\'LOCALHOST\'.\'SERVER\'-24025
[17-10-2013 13:05:47 CEST] incomplete
[17-10-2013 13:05:47 CEST] id=24025
[17-10-2013 13:05:47 CEST] run=0
[17-10-2013 13:05:47 CEST] status=STARTED
[17-10-2013 13:05:47 CEST] command=sh
[17-10-2013 13:05:47 CEST] argument=-c
[17-10-2013 13:05:47 CEST] argument=sleep 60
[17-10-2013 13:05:47 CEST] workdir=/home/ronald/SDMS/SDMS/sandbox/tmp
[17-10-2013 13:05:47 CEST] usepath
[17-10-2013 13:05:47 CEST] verboselogs
[17-10-2013 13:05:47 CEST] logfile=24025.log
[17-10-2013 13:05:47 CEST] logfile_append
[17-10-2013 13:05:47 CEST] errlog=24025.log
[17-10-2013 13:05:47 CEST] errlog_append
[17-10-2013 13:05:47 CEST] samelogs
[17-10-2013 13:05:47 CEST] complete
[17-10-2013 11:05:47 GMT] execpid=26489@N0+724938526
[17-10-2013 11:05:47 GMT] extpid=26490@N0+724938526
[17-10-2013 11:05:47 GMT] status=RUNNING
[17-10-2013 13:05:47 CEST] status_tx=STARTED
[17-10-2013 13:05:52 CEST] status_tx=RUNNING


Then I did a
[SYSTEM@localhost:2506] SDMS> alter scope GLOBAL.EXAMPLES.LOCALHOST.SERVER     
with config = ('BOOTTIME' = 'SYSTEM');

and the taskfile (for a newly submitted job) reported

ronald@cheetah:~/SDMS/SDMS/sandbox/taskfiles$ cat localhost-GLOBAL.\'EXAMPLES\'.\'LOCALHOST\'.\'SERVER\'-24029
... (skipped lines)
[17-10-2013 11:07:38 GMT] execpid=26553@S1374758562+724949551
[17-10-2013 11:07:38 GMT] extpid=26554@S1374758562+724949552
[17-10-2013 11:07:38 GMT] status=RUNNING
[17-10-2013 13:07:43 CEST] status_tx=RUNNING


As you can see, the execpid (pid of job executor) and extpid (pid of user process) are built according to the configuration.
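
Judging only from the two samples above, the identifier seems to have the shape pid@<mode><boottime>+<starttime>, where the mode character is S for SYSTEM and N for NONE. The following is a minimal sketch of a parser for that shape; the layout is inferred from these samples, not taken from the schedulix sources, and the names ParsedProcessId and parseProcessId are made up for illustration.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical holder for the parts of an execpid/extpid string.
struct ParsedProcessId {
    long pid;
    char bootMode;   // 'S' or 'N' in the samples shown above
    long bootTime;   // 0 in the BOOTTIME = NONE sample
    long startTime;
};

// True if s is non-empty and consists of decimal digits only.
static bool allDigits(const std::string& s) {
    if (s.empty()) return false;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (!std::isdigit(static_cast<unsigned char>(s[i]))) return false;
    }
    return true;
}

// Parses "pid@<mode><boottime>+<starttime>"; returns false on any mismatch.
bool parseProcessId(const std::string& text, ParsedProcessId& out) {
    const std::string::size_type at = text.find('@');
    if (at == std::string::npos) return false;
    const std::string::size_type plus = text.find('+', at);
    if (plus == std::string::npos || plus < at + 2) return false;

    const std::string pid = text.substr(0, at);
    const char mode = text[at + 1];
    const std::string boot = text.substr(at + 2, plus - (at + 2));
    const std::string start = text.substr(plus + 1);
    if (!std::isalpha(static_cast<unsigned char>(mode))) return false;
    if (!allDigits(pid) || !allDigits(boot) || !allDigits(start)) return false;

    out.pid = std::strtol(pid.c_str(), nullptr, 10);
    out.bootMode = mode;
    out.bootTime = std::strtol(boot.c_str(), nullptr, 10);
    out.startTime = std::strtol(start.c_str(), nullptr, 10);
    return true;
}
```

Feeding it the two execpid values from the taskfiles above yields mode N with boot time 0 for the first and mode S with boot time 1374758562 for the second.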

We'll have to do some digging here, I guess. I'm sorry about this. :-(

So first change the trace level of the jobserver to some higher number than 0.
Something like

alter job server GLOBAL.BL02_ZONE2.ZONE2.ZONE2_JS
with config = ('TRACELEVEL' = '3');

Then change the BOOTTIME property:

alter job server GLOBAL.BL02_ZONE2.ZONE2.ZONE2_JS
with config = ('BOOTTIME' = 'NONE');


Now in the jobserver logfile the change of configuration should be visible:

...
DEBUG   [Jobserver]    17-10-2013 13:27:42 CEST < container=[title="Jobserver Command", record=[COMMAND="ALTER", CONFIG=table=[#0=[:BOOTTIME=1014, USEPATH="true", BOOTTIME="NONE", .NOTIFYPORT=1037, .REPOHOST=1014, .JOBFILEPREFIX=1037, .NAME_PATTERN_LOGFILES=1014, .HTTPPORT=1037, .REPOPORT=1014, HTTPPORT="8900", NOTIFYPORT="45500", HTTPHOST="localhost", .VERBOSELOGS=1014, .TRACELEVEL=1037, REPOUSER="GLOBAL.'EXAMPLES'.'LOCALHOST'.'SERVER'", .JOBEXECUTOR=1014, NAME_PATTERN_LOGFILES=".*\.log", VERBOSELOGS="true", .HTTPHOST=1032, DEFAULTWORKDIR="/home/ronald/SDMS/SDMS/sandbox/tmp", .DEFAULTWORKDIR=1014, REPOHOST="localhost", REPOPORT="2506", JOBEXECUTOR="/home/ronald/SDMS/SDMS/sandbox/ENTERPRISE-linux/bin/jobserver", ;BOOTTIME="NONE", .USEPATH=1014, JOBFILEPREFIX="/home/ronald/SDMS/SDMS/sandbox/taskfiles/localhost-", TRACELEVEL="3", .BOOTTIME=1037, REPOPASS="some md5 sum", ENV={SDMSHOST=SDMSHOST, .SDMSPORT=1014, .KEY=1014, .JOBID=1014, JOBID=JOBID, SDMSPORT=SDMSPORT, KEY=KEY, .SDMSHOST=1014}]]]]
...


If it is not, we'll have to search for a reason. If it is, we'll have to understand why it ignores the setting.
(You can also try this with a functioning jobserver; maybe you'll find a difference along the way.)

Regards,

Ronald

Sabelo Dlangamandla

Oct 17, 2013, 11:36:29 AM
to sche...@googlegroups.com
Thanks once again, Ronald. I followed your advice on a functioning server (running on Ubuntu) and I get the same results as you did. Modifying the BOOTTIME property on that server works as it should. I can safely say that on a properly functioning jobserver the command is executed as it should be.

On my Solaris jobserver it looks like no command is currently being accepted at all; it just keeps logging the error: (04307111431) (04301271558) Cannot stat() /proc/1: (2) No such file or directory.
In trying to debug I modified a part of the file "jobserver/libunix.cc" to look like this:

#ifdef SOLARIS
        {
                *bt=0;
                return true;
//              struct stat buf;
//              if (stat ("/proc/1", &buf))
//                      RETURN_FALSE (errText ("(04301271558) Cannot stat() /proc/1", errno));
//
//              *bt = (long) buf.st_mtime;
        }
        return true;
#endif
 
When I do that, at least the error goes away, and the job status becomes 'STARTING' (you explained what this means, so I am well aware). When I view the jobserver log I get this:
DEBUG   [Jobserver]     17-10-2013 17:06:36 SAST > reassure 23030;
DEBUG   [Jobserver]     17-10-2013 17:06:36 SAST < container=[title="Jobserver Command", record=[COMMAND="STARTJOB", ID=23030, DIR="/opt/schedulix/schedulix/tmp", LOG="23030.log", LOGAPP=true, ERR="23030.log", ERRAPP=true, CMD="/var/smile/install/scripts/utils/SDMSpopup.sh", ARGS=["SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB", "-c", "?:1=FAILURE:0=SUCCESS"], ENV=["ERRORLOG", "23030.log", "EXPFINALTIME", "0", "EXPRUNTIME", "0", "FINISHTIME", "", "ISRESTARTABLE", "0", "JOBID", "23030", "JOBNAME", "SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB", "JOBSTATE", "", "JOBTAG", "", "KEY", "7764061444620056933", "LOGFILE", "23030.log", "MASTERID", "23030", "MERGEDSTATE", "", "PARENTID", "", "PID", "", "RERUNSEQ", "0", "RESOURCETIME", "20131017164750", "RUNNABLETIME", "20131017164750", "SCOPENAME", "GLOBAL.NG1BL02_ZONE2.ZONE2.ZONE2_JS", "SDMSHOST", "localhost", "SDMSPORT", "2506", "SEID", "21041", "STARTTIME", "20131017170636", "STATE", "STARTING", "SUBMITTIME", "20131017160013", "SYNCTIME", "20131017164749", "SYSDATE", "20131017170636", "TRIGGERBASE", "", "TRIGGERBASEID", "", "TRIGGERBASEJOBID", "", "TRIGGERNAME", "", "TRIGGERNEWSTATE", "", "TRIGGERORIGIN", "", "TRIGGERORIGINID", "", "TRIGGERORIGINJOBID", "", "TRIGGEROLDSTATE", "", "TRIGGERREASON", "", "TRIGGERREASONID", "", "TRIGGERREASONJOBID", "", "TRIGGERSEQNO", "0", "TRIGGERTYPE", "", "LAST_WARNING", "", "WORKDIR", "/opt/schedulix/schedulix/tmp"], RUN=0]]
DEBUG   [Jobserver]     17-10-2013 17:06:36 SAST > get next job;
(04301271607) GetStaticMethodID() failed
 
As you can see, modifying that method resulted in some progress (though negative, as it leads to another error: GetStaticMethodID() failed). Does this give you any ideas as to how I can fix this properly, in a way that would be compliant with Solaris? Please note: the scope is NG1BL02_ZONE2 instead of BL02_ZONE2 as per my previous references.

Kind regards
Sabelo




Ronald Jeninga

Oct 17, 2013, 12:04:11 PM
to sche...@googlegroups.com
Hi Sabelo,

you're welcome :-)

The way you commented out the stat("/proc/1") is correct. Effectively BOOTTIME=SYSTEM now equals BOOTTIME=NONE.
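
A less drastic variant would keep the stat() call and only fall back to 0 when it fails. This is just a minimal sketch under that assumption: the function name bootTimeOrZero is made up, and whether the rest of the jobserver treats 0 the same way as BOOTTIME=NONE would still have to be checked.

```cpp
#include <sys/stat.h>

// Hypothetical helper: returns the modification time of path in seconds,
// or 0 when stat() fails (for example because /proc/1 does not exist).
long bootTimeOrZero(const char* path) {
    struct stat buf;
    if (stat(path, &buf) != 0) {
        return 0;  // assumption: 0 is an acceptable "no boot time" value
    }
    return static_cast<long>(buf.st_mtime);
}
```

Called as bootTimeOrZero("/proc/1"), it would behave like the original code where the directory exists and like the commented-out version where it does not.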

Today I actually discovered exactly the bug your jobserver runs into. I didn't fix it yet, but I can tell you how to fix it:

In libjni.cc, you'll find the error number (somewhere around line 67).
There you find a line

const jmethodID jmid = env->GetStaticMethodID (clazz, "abortProgram", "(Ljobserver/RepoIface;Ljava/lang/String;)V");

it has to be changed into

const jmethodID jmid =
   env->GetStaticMethodID (clazz, "abortProgram", "(Lde/independit/scheduler/jobserver/RepoIface;Ljava/lang/String;)V");


What happened is that, because of the open source launch, we changed the class names to conform to the de facto standards. In doing so we forgot to change this call to the JRE.
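
For reference, the JNI signature string follows mechanically from the fully qualified Java class names: dots become slashes and each class is wrapped in L...;. Here is a minimal sketch of that rule; the helper names jniClassDescriptor and abortProgramSignature are made up for illustration and are part of neither schedulix nor JNI.

```cpp
#include <algorithm>
#include <string>

// Turns a dotted Java class name into its JNI descriptor form,
// e.g. "java.lang.String" becomes "Ljava/lang/String;".
std::string jniClassDescriptor(std::string className) {
    std::replace(className.begin(), className.end(), '.', '/');
    return "L" + className + ";";
}

// Signature of a void method taking (RepoIface, String), given the
// fully qualified name of the RepoIface class.
std::string abortProgramSignature(const std::string& repoIfaceClass) {
    return "(" + jniClassDescriptor(repoIfaceClass)
               + jniClassDescriptor("java.lang.String") + ")V";
}
```

With the old class name jobserver.RepoIface this produces the string from the broken line, and with de.independit.scheduler.jobserver.RepoIface it produces the corrected one.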

After repairing this you won't be out of trouble, because the jobexecutor only gets here while it is busy reporting an error. It'll be quite interesting to know what the error message is, though.
(I hope it gives some valuable information. Anyway, the error number should be unique within the system, so there is a chance that we'll find the cause of the problem.)

The server reply to the "reassure job" looks healthy.

Regards,

Ronald

Ronald Jeninga

Oct 17, 2013, 12:56:52 PM
to sche...@googlegroups.com
Hi Sabelo,

I just got another idea.
Have a look at the taskfiles directory. It could be the case that a taskfile is lying around. In the extpid/execpid there's a character defining how the BOOTTIME for this process identification is to be determined. This could be the cause for your actual problem (not the program bug, but the reason you run into it).

If there's a taskfile present, do the following:

1. shut down the jobserver (too many cooks spoil the food)
2. remove the taskfile
3. alter job 23030 with state = finished, exit_code = 1;
4. restart the jobserver

Now the job (that didn't run yet) has a restartable state (FAILURE) and can be restarted.
(Since you're only running a SDMSpopup.sh, you can also set the exit code to 0 and submit again. It's not a very important job ;)

I hope I had the right idea. (And if so, you can revert your code change; all new jobs will have BOOTTIME=NONE.)

Regards,

Ronald

Sabelo Dlangamandla

Oct 18, 2013, 3:09:11 AM
to sche...@googlegroups.com
Good-day Ronald,

I followed the advice from your last post, but unfortunately I was unable to resolve the problem. I must add that I repeated the steps a couple of times with the same results. Here are the steps I took, as advised:

1. The task in the taskfile:
$ cat task-GLOBAL.\'NG1BL02_ZONE2\'.\'ZONE2\'.\'ZONE2_JS\'-23030 
[18-10-2013 08:39:13 SAST] incomplete
[18-10-2013 08:39:13 SAST] id=23030
[18-10-2013 08:39:13 SAST] run=3
[18-10-2013 08:39:13 SAST] status=STARTED
[18-10-2013 08:39:13 SAST] command=/var/smile/install/scripts/utils/SDMSpopup.sh
[18-10-2013 08:39:13 SAST] argument=SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB
[18-10-2013 08:39:13 SAST] argument=-c
[18-10-2013 08:39:13 SAST] argument=?:1=FAILURE:0=SUCCESS
[18-10-2013 08:39:13 SAST] workdir=/opt/schedulix/schedulix/tmp
[18-10-2013 08:39:13 SAST] usepath
[18-10-2013 08:39:13 SAST] verboselogs
[18-10-2013 08:39:13 SAST] logfile=23030.log
[18-10-2013 08:39:13 SAST] logfile_append
[18-10-2013 08:39:13 SAST] errlog=23030.log
[18-10-2013 08:39:13 SAST] errlog_append
[18-10-2013 08:39:13 SAST] samelogs
[18-10-2013 08:39:13 SAST] complete
[18-10-2013 08:39:13 SAST] status_tx=STARTED
 
2. Stopping the server:
$ ./stopZone2Jobserver.sh
Stopping Jobserver zone2
3. Remove taskfile:
 $ rm task-GLOBAL.\'NG1BL02_ZONE2\'.\'ZONE2\'.\'ZONE2_JS\'-23030
4. Alter Job:
 $ ./alterJobState.sh 
Connect
CONNECT_TIME : 18 Oct 2013 06:42:36 GMT
Connected
[SYS...@10.0.0.216:2506] SDMS> begin multicommand
alter job 23030 with state = finished, exit_code = 1;
end multicommand;
1 Command(s) processed
5. View of the Job in GUI:
6. Restart jobserver:
 $ ./startZone2Jobserver.sh 
Starting Jobserver zone2
7. Resubmit job:
 I submitted the job, it went back to the previous state as indicated by the log:
 DEBUG   [Jobserver]     18-10-2013 08:48:07 SAST > reassure 23030;
DEBUG   [Jobserver]     18-10-2013 08:48:07 SAST < container=[title="Jobserver Command", record=[COMMAND="STARTJOB", ID=23030, DIR="/opt/schedulix/schedulix/tmp", LOG="23030.log", LOGAPP=true, ERR="23030.log", ERRAPP=true, CMD="/var/smile/install/scripts/utils/SDMSpopup.sh", ARGS=["SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB", "-c", "?:1=FAILURE:0=SUCCESS"], ENV=["ERRORLOG", "23030.log", "EXPFINALTIME", "0", "EXPRUNTIME", "0", "FINISHTIME", "", "ISRESTARTABLE", "0", "JOBID", "23030", "JOBNAME", "SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB", "JOBSTATE", "", "JOBTAG", "", "KEY", "7764061444620056933", "LOGFILE", "23030.log", "MASTERID", "23030", "MERGEDSTATE", "", "PARENTID", "", "PID", "", "RERUNSEQ", "4", "RESOURCETIME", "20131018084732", "RUNNABLETIME", "20131018084732", "SCOPENAME", "GLOBAL.NG1BL02_ZONE2.ZONE2.ZONE2_JS", "SDMSHOST", "localhost", "SDMSPORT", "2506", "SEID", "21041", "STARTTIME", "20131018084807", "STATE", "STARTING", "SUBMITTIME", "20131017160013", "SYNCTIME", "20131018084727", "SYSDATE", "20131018084807", "TRIGGERBASE", "", "TRIGGERBASEID", "", "TRIGGERBASEJOBID", "", "TRIGGERNAME", "", "TRIGGERNEWSTATE", "", "TRIGGERORIGIN", "", "TRIGGERORIGINID", "", "TRIGGERORIGINJOBID", "", "TRIGGEROLDSTATE", "", "TRIGGERREASON", "", "TRIGGERREASONID", "", "TRIGGERREASONJOBID", "", "TRIGGERSEQNO", "0", "TRIGGERTYPE", "", "LAST_WARNING", "", "WORKDIR", "/opt/schedulix/schedulix/tmp"], RUN=4]]
DEBUG   [Jobserver]     18-10-2013 08:48:07 SAST > get next job;
(04301271607) GetStaticMethodID() failed
DEBUG   [Jobserver]     18-10-2013 08:48:07 SAST < container=[title="Jobserver Command", record=[COMMAND="NOP"]]
DEBUG   [Jobserver]     18-10-2013 08:48:07 SAST registered thread 0

I guess now I have to try your previous advice on modifying the  'libjni.cc' file.

Sincerely
Sabelo

Ronald Jeninga

Oct 18, 2013, 3:26:31 AM
to sche...@googlegroups.com
Hi Sabelo,

it was just an idea worth a try.
Anyway, the bug in libjni.cc must be fixed, because it hides the real error message. I think I'll do so today in the github repository.

And something else, as an aside:
The "begin multicommand ... end multicommand;" wrapper is only necessary if you want to execute more than one statement as _one_ transaction.
Normally each statement is a transaction in itself. But if you have a bunch of related statements, you might want to have them executed atomically (everything succeeds, or (logically) nothing happened).

Regards,

Ronald

Sabelo Dlangamandla

Oct 18, 2013, 3:39:33 AM
to sche...@googlegroups.com
Thanks, I will give it a go and send feedback. As a side note, I recall very well that you taught me about the transactional execution of commands in schedulix; it's just a bad habit I need to shake off. I will revert to best practices.

Regards
Sabelo

Sabelo Dlangamandla

Oct 18, 2013, 4:05:22 AM
to sche...@googlegroups.com
Hello Ronald once again, this is the error I get after modifying that line and executing the job following the steps discussed previously:

DEBUG   [Jobserver]     18-10-2013 09:55:14 SAST > reassure 23030;
DEBUG   [Jobserver]     18-10-2013 09:55:14 SAST < container=[title="Jobserver Command", record=[COMMAND="STARTJOB", ID=23030, DIR="/opt/schedulix/schedulix/tmp", LOG="23030.log", LOGAPP=true, ERR="23030.log", ERRAPP=true, CMD="/var/smile/install/scripts/utils/SDMSpopup.sh", ARGS=["SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB", "-c", "?:1=FAILURE:0=SUCCESS"], ENV=["ERRORLOG", "23030.log", "EXPFINALTIME", "0", "EXPRUNTIME", "0", "FINISHTIME", "", "ISRESTARTABLE", "0", "JOBID", "23030", "JOBNAME", "SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB", "JOBSTATE", "", "JOBTAG", "", "KEY", "7764061444620056933", "LOGFILE", "23030.log", "MASTERID", "23030", "MERGEDSTATE", "", "PARENTID", "", "PID", "", "RERUNSEQ", "5", "RESOURCETIME", "20131018095451", "RUNNABLETIME", "20131018095451", "SCOPENAME", "GLOBAL.NG1BL02_ZONE2.ZONE2.ZONE2_JS", "SDMSHOST", "localhost", "SDMSPORT", "2506", "SEID", "21041", "STARTTIME", "20131018095514", "STATE", "STARTING", "SUBMITTIME", "20131017160013", "SYNCTIME", "20131018095449", "SYSDATE", "20131018095514", "TRIGGERBASE", "", "TRIGGERBASEID", "", "TRIGGERBASEJOBID", "", "TRIGGERNAME", "", "TRIGGERNEWSTATE", "", "TRIGGERORIGIN", "", "TRIGGERORIGINID", "", "TRIGGERORIGINJOBID", "", "TRIGGEROLDSTATE", "", "TRIGGERREASON", "", "TRIGGERREASONID", "", "TRIGGERREASONJOBID", "", "TRIGGERSEQNO", "0", "TRIGGERTYPE", "", "LAST_WARNING", "", "WORKDIR", "/opt/schedulix/schedulix/tmp"], RUN=5]]
DEBUG   [Jobserver]     18-10-2013 09:55:14 SAST > get next job;
DEBUG   [Jobserver]     18-10-2013 09:55:14 SAST < container=[title="Jobserver Command", record=[COMMAND="NOP"]]
FATAL   [Jobserver]     18-10-2013 09:55:14 SAST (04402151824) isAlive() failed: (04301271612) Invalid pid: 
DEBUG   [Jobserver]     18-10-2013 09:55:14 SAST > alter jobserver with fatal error_text = '(04402151824) isAlive() failed: (04301271612) Invalid pid: ';
DEBUG   [Jobserver]     18-10-2013 09:55:14 SAST registered thread 0
DEBUG   [Jobserver]     18-10-2013 09:55:14 SAST < feedback="Job Server altered"
FATAL   [Jobserver]     18-10-2013 09:55:14 SAST ***ERROR*** (04402151824) isAlive() failed: (04301271612) Invalid pid: 
FATAL   [Jobserver]     18-10-2013 09:55:14 SAST Program aborted

Kind regards
Sabelo

Ronald Jeninga

Oct 18, 2013, 4:44:30 AM
to sche...@googlegroups.com
Hi Sabelo,

I've seen this error before, but I don't recall the cause at the moment. It's something simple, I know. Some configuration issue.
While writing, I'm still thinking ....

In the meantime it might interest you that we're planning to rewrite the jobexecutor. First of all we want to implement a clean Java implementation. This will enable schedulix/BICsuite to execute tasks in Java environments like Tomcat, JBoss and the like. This will be one of the improvements in 2.6. Having done that, I'll redo the C code.

Anyway, until then, I'll set the default for the BOOTTIME to NONE and repair the jni error as you did.

.... Done thinking now :-)

For some reason (which is one reason to rewrite the code), an invalid setting of BIC_LOCALE in $BICSUITECONFIG/bicsuite.conf leads to this error. I must admit I haven't found out why yet.
Could you check the setting and, if it is set to some value which isn't present on your box, correct it?
The default setting is en_US, I think; you might have some other language package installed.
(Running the "locale" or "locale -a" command, or doing something like "echo $LANG", might give you information about a valid setting.)
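
If you prefer to check a candidate value programmatically, the standard setlocale() call can serve as a probe, since it returns a null pointer when it cannot honour the request. This is only a minimal sketch: the function name localeAvailable is made up, it says nothing about how the jobexecutor itself evaluates BIC_LOCALE, and how strictly unknown names are rejected depends on the C library.

```cpp
#include <clocale>
#include <string>

// Hypothetical probe: true if the C library accepts the given locale name.
// The locale that was active before the call is restored afterwards.
bool localeAvailable(const char* name) {
    const char* current = std::setlocale(LC_ALL, nullptr);  // query only
    const std::string saved = current ? current : "C";      // copy before it changes
    const bool ok = std::setlocale(LC_ALL, name) != nullptr;
    std::setlocale(LC_ALL, saved.c_str());
    return ok;
}
```

The names "C" and "POSIX" should always be accepted; anything else, such as en_US, depends on which language packages are installed on the box.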

Regards,

Ronald

Sabelo Dlangamandla

Oct 18, 2013, 8:27:37 AM
to sche...@googlegroups.com
I am so glad to finally say it 'WORKS'. I changed the locale to C (the POSIX locale) as per your advice. Thank you so much for your help in solving my issues; this achievement deserves a cup of coffee on my side.

I am looking forward to the next version of Schedulix and once again thank you.

Regards
Sabelo

Sabelo Dlangamandla

Oct 18, 2013, 8:29:02 AM
to sche...@googlegroups.com
Here is the output:
DEBUG   [Jobserver]     18-10-2013 14:13:46 SAST > get next job;
DEBUG   [Jobserver]     18-10-2013 14:13:46 SAST < container=[title="Jobserver Command", record=[COMMAND="STARTJOB", ID=23030, DIR="/opt/schedulix/schedulix/tmp", LOG="23030.log", LOGAPP=true, ERR="23030.log", ERRAPP=true, CMD="/var/smile/install/scripts/utils/SDMSpopup.sh", ARGS=["SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB", "-c", "?:1=FAILURE:0=SUCCESS"], ENV=["ERRORLOG", "23030.log", "EXPFINALTIME", "0", "EXPRUNTIME", "0", "FINISHTIME", "", "ISRESTARTABLE", "0", "JOBID", "23030", "JOBNAME", "SYSTEM.NG1BL02_ZONE2.E0010_SINGLEJOB.SINGLEJOB", "JOBSTATE", "", "JOBTAG", "", "KEY", "7764061444620056933", "LOGFILE", "23030.log", "MASTERID", "23030", "MERGEDSTATE", "", "PARENTID", "", "PID", "", "RERUNSEQ", "9", "RESOURCETIME", "20131018141346", "RUNNABLETIME", "20131018141346", "SCOPENAME", "GLOBAL.NG1BL02_ZONE2.ZONE2.ZONE2_JS", "SDMSHOST", "localhost", "SDMSPORT", "2506", "SEID", "21041", "STARTTIME", "20131018141346", "STATE", "STARTING", "SUBMITTIME", "20131017160013", "SYNCTIME", "20131018141342", "SYSDATE", "20131018141346", "TRIGGERBASE", "", "TRIGGERBASEID", "", "TRIGGERBASEJOBID", "", "TRIGGERNAME", "", "TRIGGERNEWSTATE", "", "TRIGGERORIGIN", "", "TRIGGERORIGINID", "", "TRIGGERORIGINJOBID", "", "TRIGGEROLDSTATE", "", "TRIGGERREASON", "", "TRIGGERREASONID", "", "TRIGGERREASONJOBID", "", "TRIGGERSEQNO", "0", "TRIGGERTYPE", "", "LAST_WARNING", "", "WORKDIR", "/opt/schedulix/schedulix/tmp"], RUN=9]]
DEBUG   [Jobserver]     18-10-2013 14:13:47 SAST registered thread 0
DEBUG   [Jobserver]     18-10-2013 14:13:47 SAST Interrupting Thread : 0
DEBUG   [Jobserver]     18-10-2013 14:13:47 SAST unregistered thread 0
DEBUG   [Jobserver]     18-10-2013 14:13:47 SAST Thread found : Thread[main,5,main]
DEBUG   [Jobserver]     18-10-2013 14:13:47 SAST > alter job 23030 with status = running, run = 9, exec_pid = '5695@N0+1382098427', ext_pid = '5696@N0+1382098427', timestamp = '18-10-2013 12:13:47 GMT';
DEBUG   [Jobserver]     18-10-2013 14:13:47 SAST < feedback="Job altered"
DEBUG   [Jobserver]     18-10-2013 14:13:47 SAST > alter job 23030 with status = finished, run = 9, exit_code = 0, timestamp = '18-10-2013 12:13:47 GMT';
DEBUG   [Jobserver]     18-10-2013 14:13:47 SAST < feedback="Job altered"
DEBUG   [Jobserver]     18-10-2013 14:13:47 SAST > get next job;

Ronald Jeninga

Oct 18, 2013, 9:14:11 AM
to sche...@googlegroups.com
Hi Sabelo,

*Happy* :-)

We will have some nice new features in the 2.6 version. First of all, we'll make a nicer looking GUI. It will also have some usability enhancements.
Then we enhanced the "sticky" concept. It means you can have more than one critical region within one workflow. It'll even work with dynamic submits.
So imagine you have a large database table which is stored in several partitions. Now you want to process this table partition by partition, so you submit a dynamic child batch, one for each partition.
For some reason you need a critical region (exclusive access to the table) which spans several jobs within each child batch. That's a trivial task in 2.6, but a PITA in 2.5.1 (although possible).
We will rewrite the jobserver and add a jobserver in pure Java. (To defend the C version: it doesn't make sense to start up a JVM in order to execute a /bin/true or the like, which means that the pure Java implementation is not suited for short-running jobs.)

Of course we have some more ideas, but we didn't decide yet which of those will end up in 2.6.

If you happen to have feature requests (and ideally a use case), don't hesitate to tell us.

Apart from all those enhancements, I expect that the syntax documentation will be ready in the middle of November. I can't exactly promise it, because I'm not a clairvoyant.
That will be checked in both in v2.5.1 and the current master. And of course we'll publish it on our website.

Oh, and I *love* coffee. So if we ever meet, I'll remind you about that ;-)

Regards & enjoy your weekend

Ronald

Sabelo Dlangamandla

Oct 18, 2013, 10:03:57 AM
to sche...@googlegroups.com
Thanks for the detailed feedback on the lifecycle of Schedulix; 2.6 is going to be a massive upgrade. As for the syntax documentation, the current version is still readable, as the commands themselves are in English, so November doesn't sound bad at all.

Have a great one too, kind regards
Sabelo
