asking for clarification regarding the 'run program' field of the Job/Batch object


Sabelo Dlangamandla

Sep 30, 2013, 10:13:49 AM
to sche...@googlegroups.com
Schedulix team,

Good day. I need some clarity on the 'run program' field. I am not sure whether this qualifies as a question that can be posted to the schedulix team; if not, I sincerely apologise for taking everyone's time. Here is my dilemma:
With the 'run program' field we specify the command line to be executed by the jobserver. That works fine if the script/program resides on the same machine (e.g. Machine A) that the scheduling server is running on. Is it possible to reference a program (executable script) that is located on a different machine, say Machine B? If so, is there an example I can refer to, or an easier alternative for doing this?

I hope my question is not confusing.

Regards
Sabelo

Ronald Jeninga

Oct 1, 2013, 11:15:39 AM
to sche...@googlegroups.com
Hi Sabelo,

In order to execute command lines on another machine, you'll have to install a jobserver there.

First have a look at the basic architecture of the system:
http://www.independit.de/en/bicsuite/highlights/architektur

It shows the scheduling server running on some machine in the middle. To the right there are n machines running one or more jobservers.
There are several reasons for starting more than one jobserver on a machine, but the main one is that jobservers do not do a setuid(). This means that if you want to execute jobs as user tom and also jobs as user jodie, you'll need two jobservers on that machine: one started from the account tom, the other from the account jodie. Of course, if you have automated the startup procedure in some init.d script, you'll probably have lines like

su - tom -c start_toms_jobserver.sh
su - jodie -c start_jodies_jobserver.sh


The installation of a jobserver consists of two different parts. The first part is the physical installation of a jobserver. The second part is the logical installation.
The physical installation means that you need to put the schedulix software on the machine, write a configuration file for the jobserver and make sure you have a directory to write the taskfiles.

The logical installation means that you tell the scheduling system that there is another jobserver interested in getting jobs.

In the setup_example_jobservers.sh script you can see "how we did it".

There's another thread in this group which is called "scopes and servers", I think. You might also find some interesting facts there.

I hope this helps you a little.

Regards,

Ronald

Sabelo Dlangamandla

Oct 7, 2013, 6:29:13 AM
to sche...@googlegroups.com
Thanks Ronald, and apologies for the late feedback.
Firstly, looking at the architecture, the system addresses what I want to achieve. The part that I want to implement is represented by computer 1 in the architecture diagram (the jobserver agent).

What I understand so far about the system is the installation of the scheduling server as detailed in the installation guide, and I am happy with that. The installation can be summarized as follows:
1. create user schedulix
2. install Java
3. install RDBMS (MySQL)
4. create configuration (user environment and software environment)
5. set up database (schedulixdb)
6. boot the server
7. create '.sdmshrc' file
8. create convenience packages

What is a challenge for me now is installing a remote jobserver. As per your advice, this involves two different parts: physical and logical. The physical part is the hindrance for now, especially the step 'put the schedulix software on the machine' (e.g. computer 1).
Reading the posts on "scopes and servers", you say "You'll have to compile schedulix for every architecture on which you are going to install a jobserver." and you go on to expand on this with the following explanation: "If you are using the same architecture on all computers, you'll have to compile only once. Then you make a tarball of the result and unpack it where ever. The prerequisites for the installation still apply though. That is, for a jobserver you don't need a database system and you don't need zope. But you'll need a java. And if you want to run some of the examples on the new jobserver, you'll need the swt.jar too."

Based on the steps I mentioned above for installing the scheduling server (1-8), which ones do not apply to the installation of the jobserver? And when making the tarball, do I use all the files under $BICSUITEHOME?

Regards
Sabelo

Ronald Jeninga

Oct 7, 2013, 7:56:36 AM
to sche...@googlegroups.com
Hi Sabelo,

No problem, I'm not paying you for feedback, so you can write at your own pace. Although I must admit, I'm aiming for the final feedback (for each thread): "Thank you, it now all works".

Back to your challenge.

It definitely makes sense to create a schedulix user. You can put the software into /home/schedulix/schedulix or something similar and have everything owned by schedulix. Of course, if you only need one jobserver on some machine, you could install the software as the user running it. But that has some (small) disadvantages:
1. If you upgrade the software, you might find a different environment on every machine running one or more jobservers. That'll be a hassle (trust me, I've gone through this ;).
2. If you later decide to install another jobserver, you might have to juggle permissions from one production user to another. Keeping the software separate gives you an easier and symmetric setup for all users.

The jobserver itself consists of two parts:
a. The jobserver server process, which communicates with the central scheduling server. This part is written in Java.
b. The jobexecutor. This is a small C program which executes the RUN COMMAND. It basically doesn't do much more than
    - open a file for stdout and stderr, according to the specification in LOGFILE and ERRLOG
    - set zero or more environment variables
    - execve() or execvpe(), depending on the USE_PATH parameter
    - wait()
    - write the taskfile, which holds some information on which job is executed and the status of the entire process
It is the taskfile that carries the communication between the jobserver server process and the jobexecutor. Because the process state (from the jobserver's point of view) is persistent this way, it is always possible to shut down, kill or restart the jobserver server process without any loss of data and/or control.
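The jobexecutor steps listed above can be sketched roughly as follows. This is a minimal Python illustration, not the actual C implementation: the function name `run_job`, the taskfile layout (`pid=`, `status=`, `exit_code=` lines) and the way `use_path` is handled are all assumptions made for the example.

```python
import os
import subprocess
import sys
import tempfile


def run_job(argv, logfile, errlog, taskfile, extra_env=None, use_path=True):
    """Run one command roughly the way the post describes the jobexecutor."""
    env = dict(os.environ)
    env.update(extra_env or {})          # set zero or more environment variables

    # Approximation of execve() vs execvpe(): without PATH lookup,
    # insist on a path instead of a bare program name.
    if not use_path and os.sep not in argv[0]:
        raise ValueError("program must be given as a path when use_path is False")

    # Open the files for stdout and stderr before starting the command.
    with open(logfile, "ab") as out, open(errlog, "ab") as err:
        child = subprocess.Popen(argv, stdout=out, stderr=err, env=env)
        with open(taskfile, "w") as task:     # hypothetical taskfile layout
            task.write(f"pid={child.pid}\nstatus=RUNNING\n")
        exit_code = child.wait()              # wait()

    # Persist the outcome so a restarted jobserver process could still read it.
    with open(taskfile, "w") as task:
        task.write(f"pid={child.pid}\nstatus=FINISHED\nexit_code={exit_code}\n")
    return exit_code
```

Because the outcome lives in a file rather than in the memory of the jobserver server process, that process can be stopped and restarted while the command runs, which is the point the post makes about persistence.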
Because an essential part of the jobserver is written in Java, you'll need Java installed. The jobserver only needs a standard Java environment, which means that its footprint is comparatively small.

The jobserver itself doesn't need any DBMS. So there's no need to install one.

You'll have to do some configuring. First of all, all users that would like to start a jobserver need to know about BICSUITEHOME. Then you might want to edit the java.conf file.
The jobserver will use the "*_JS" parameters. This enables you to use different versions of Java, for whatever reason, and a separate set of Java flags. Normally you don't have to do much here.
You also might want to change the bicsuite.conf file. There's not too much magic in there, but it is handy if you want to install a (job)server on a non-standard system.
We're not done yet with the configuration, but I'll come back to that later.

Since you don't have a DBMS, you don't need a database. That's simple :-)

The server, which is the scheduling server, runs elsewhere. We don't start it here. We're not ready to start a jobserver yet.

Step 7 is something of your own choice. The system runs perfectly without it, but you make life easier if you intend to work a lot through "sdmsh".
Strictly speaking, it isn't necessary on the scheduling server either, but the installation of the convenience package as well as the examples and example jobservers rely on it.
You could compare it with the exchange of keys between two computers if you hop back and forth all the time. You can then just do a "slogin othersys" without specifying your password every 5 seconds.

The convenience package (and the rest) is already installed, so you don't have to do this either.

Now the jobserver specific extra steps.
I mentioned before that you're not done configuring yet. Let me first describe an environment where you might want to install a jobserver.

You have a machine that I'll call CLIENT. There you have 3 users: schedulix, john and mary. The machine running the scheduling server is called SERVER.
So far you have unpacked the software as schedulix (under /home/schedulix/schedulix) and made some small changes to the files. The value of BICSUITEHOME is /home/schedulix/schedulix.

Now we'll do a bit of the "logical" installation. You can do this either by sdmsh or by the GUI (easier).
First you create a scope called GLOBAL.CLIENT.
Now you create two servers: GLOBAL.CLIENT.JOHN and GLOBAL.CLIENT.MARY.  Remember the passwords :-)
Have a look at the configuration now.

One very important parameter is the location of the jobexecutor. This has to be a fully qualified file name. The use of environment variables isn't allowed. You set this parameter for scope GLOBAL.CLIENT, since all jobservers below it run on the same machine and use the same schedulix installation. The value is /home/schedulix/schedulix/bin/jobserver.

The next important parameter is the value of Jobfileprefix. This is a fully qualified path plus a prefix. Here you'll have to make a decision. From my experience it is a good idea to keep all taskfiles in one place (more or less). Ideally you create a separate file system for them. (We did have problems in a production environment because some user processes thought it'd be a good idea to fill up the file system. Because the file system was full, the jobservers couldn't create their taskfiles (normally only 16KB), and therefore couldn't start processes.)
So let me assume you have some directory for the taskfiles, say /home/schedulix/taskfiles. You now create two subdirectories, /home/schedulix/taskfiles/john and /home/schedulix/taskfiles/mary, and make sure john can write to the "john" directory and mary can write to the "mary" directory. (What you do is a "chmod 777 /home/schedulix/taskfiles", then you create one subdirectory as john and one subdirectory as mary, each with privileges 700, and then you do a "chmod 755 /home/schedulix/taskfiles".)
Now you can configure the Jobfileprefix. For the server GLOBAL.CLIENT.JOHN it'll be something like "/home/schedulix/taskfiles/john/task-", and correspondingly for GLOBAL.CLIENT.MARY.
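The directory layout described above could be prepared with something like the following sketch. It is only an illustration: the helper name `prepare_taskfile_dirs` is made up, and the code runs as a single user, whereas in the procedure from the post each subdirectory is created by john or mary themselves so that ownership comes out right.

```python
import os
import stat
import tempfile


def prepare_taskfile_dirs(root, users):
    """Create root/<user> directories (mode 700) and return each user's Jobfileprefix."""
    os.makedirs(root, exist_ok=True)
    prefixes = {}
    for user in users:
        user_dir = os.path.join(root, user)
        os.makedirs(user_dir, exist_ok=True)
        os.chmod(user_dir, 0o700)        # only the owner may read or write taskfiles
        prefixes[user] = os.path.join(user_dir, "task-")
    os.chmod(root, 0o755)                # others may enter, but not create entries
    return prefixes
```

Called with `/home/schedulix/taskfiles` and the users `john` and `mary`, it returns the two prefixes that would then be entered as Jobfileprefix for the respective jobservers.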

On the level GLOBAL.CLIENT you can now set the values for REPOHOST and REPOPORT.

Back to physics now.
Having prepared everything, you can now write two configuration files. One for each jobserver.

For example:
RepoHost= SERVER
RepoPort= 2506
RepoUser= "GLOBAL.CLIENT.JOHN"
RepoPass= MyExtremelySecretPasswordNotToBeToldToAnyone
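Since the two files differ only in RepoUser and RepoPass, they could also be generated. The following is a small sketch under that assumption; the function name `render_jobserver_config` and the placeholder passwords are invented for the example, and only the four keys shown above are used.

```python
def render_jobserver_config(repo_host, repo_port, repo_user, repo_pass):
    """Return the four-line configuration text in the layout shown in the post."""
    return (
        f"RepoHost= {repo_host}\n"
        f"RepoPort= {repo_port}\n"
        f'RepoUser= "{repo_user}"\n'
        f"RepoPass= {repo_pass}\n"
    )


# One configuration per jobserver; the passwords are placeholders.
configs = {
    user: render_jobserver_config(
        "SERVER", 2506, f"GLOBAL.CLIENT.{user.upper()}", f"<password for {user}>"
    )
    for user in ("john", "mary")
}
```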

You store those configuration files in, let's say, /home/john/etc and /home/mary/etc.
Now you create two small wrapper scripts which call the $BICSUITEHOME/bin/jobserver-run script.

If you did all this correctly (and I didn't make any mistakes or overlook something), you should be able to start both jobservers.

Since I'm now a bit tired of writing, I'd say, try this and see if you get them running. If you have them running, drop a note, and I'll explain the rest.

Happy hacking,

Ronald

PS. Our translator mentioned that he aims to be finished at the beginning of November. About a week later, the command reference will be available online.
PPS. A tarball of $BICSUITEHOME gives a very good starting point. Throw away all configuration files you don't need though.

Sabelo Dlangamandla

Oct 9, 2013, 12:28:19 PM
to sche...@googlegroups.com
Hello Ronald, thanks again for the step-by-step explanation. I was really feeling lucky this time around for a change; unfortunately my luck was cut short by an exception thrown by the jobserver. I was able to start the jobserver on the CLIENT computer, but when I check the log files of the jobserver I get this exception:
[scrolllog] Waiting for child (12959) to terminate
[scrolllog] Child exited with state 256
[scrolllog] Try to restart child (child terminated with exit code <> 0)
Exception in thread "main" java.lang.NoClassDefFoundError: de/independit/scheduler/server/ExecuteLock
at de.independit.scheduler.server.SystemEnvironment.<clinit>(SystemEnvironment.java:231)
at de.independit.scheduler.jobserver.Utils.<clinit>(Utils.java:50)
at de.independit.scheduler.jobserver.Config.scanFile(Config.java:204)
at de.independit.scheduler.jobserver.Config.<init>(Config.java:242)
at de.independit.scheduler.jobserver.Server.<init>(Server.java:52)
at de.independit.scheduler.jobserver.JobServer.main(JobServer.java:78)
Caused by: java.lang.ClassNotFoundException: de.independit.scheduler.server.ExecuteLock
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
... 6 more
I tried to compile schedulix/src several times but there was no change. I checked BICsuite.jar and the class is there. What is it that I am missing?
I would sincerely appreciate your help in bringing the lucky feeling back. Thanks in advance.
Below is the logical view of my jobserver:


Ronald Jeninga

Oct 9, 2013, 8:26:44 PM
to sche...@googlegroups.com
Hi Sabelo,

In some way you _are_ lucky, if you consider finding an error message I haven't seen so far as being lucky.

If I understand it correctly, you are able to start the example jobservers, correct?
This would mean, that there is some difference in the "environment" somewhere.

The obvious things to check are:
BICSUITEHOME set correctly
java.conf is correct, especially BICSUITEJAR

If everything seems correct, check whether you get the same error when you start the jobserver in the foreground:

LD_LIBRARY_PATH=$BICSUITEHOME/lib:$LD_LIBRARY_PATH java -cp $BICSUITEHOME/lib/BICsuite.jar de.independit.scheduler.jobserver.JobServer --version

and if that works

LD_LIBRARY_PATH=$BICSUITEHOME/lib:$LD_LIBRARY_PATH java -cp $BICSUITEHOME/lib/BICsuite.jar de.independit.scheduler.jobserver.JobServer <configfile>

Where <configfile> is the file defining the RepoHost, RepoPort and so on.
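The "obvious things to check" can be turned into a tiny preflight test before starting the jobserver. This is a hedged sketch: the function name `preflight` is made up, and it only looks at the jar location used in the commands above ($BICSUITEHOME/lib/BICsuite.jar); a BICSUITEJAR setting in java.conf that points elsewhere is not covered.

```python
import os
import tempfile


def preflight(env):
    """Return a list of obvious environment problems (empty if none were found)."""
    problems = []
    home = env.get("BICSUITEHOME")
    if not home:
        problems.append("BICSUITEHOME is not set")
        return problems
    jar = os.path.join(home, "lib", "BICsuite.jar")
    if not os.path.isfile(jar):
        problems.append(f"missing jar: {jar}")
    return problems
```

Running `preflight(os.environ)` as the user that starts the jobserver shows whether that account sees the same BICSUITEHOME as the account used for the installation.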

If this also works, you might have some error in the script calling $BICSUITEHOME/bin/jobserver-run.
By the way, you can simply terminate a jobserver running in the foreground by pressing ctrl-c.

Maybe I'll get another good idea overnight. If so, I'll tell you as soon as possible.

Still, I'm confident we'll find the error, so don't worry, be happy.

Regards,

Ronald

Sabelo Dlangamandla

Oct 10, 2013, 4:12:34 AM
to sche...@googlegroups.com
Hello Ronald, my luck is back: my jobserver started successfully. Here is an excerpt from the scheduling server (SERVER) log:
MESSAGE [1010(1010)] 10 Oct 2013 07:37:36 GMT UserConnection started
MESSAGE [1010(1010)] 10 Oct 2013 07:37:36 GMT connect jobserver GLOBAL.'ZAUBUNTU100'.'ZONE4'.'ZONE4_JS' identified by '**********' with protocol = serial, session = 'jobserver[root@ZAUbuntu100]';
MESSAGE [17164,1010(Worker0)] 10 Oct 2013 07:37:36 GMT Server Execution time for class de.independit.scheduler.server.parser.Connect : 2 ms -- Start Committing
MESSAGE [17164,1010(Worker0)] 10 Oct 2013 07:37:36 GMT Execution time for class de.independit.scheduler.server.parser.Connect : 64 ms
MESSAGE [17164,1010(1010)] 10 Oct 2013 07:37:36 GMT register with pid = '15395';
MESSAGE [17164,1010(Worker0)] 10 Oct 2013 07:37:36 GMT Server Execution time for class de.independit.scheduler.server.parser.RegisterServer : 2 ms -- Start Committing
MESSAGE [17164,1010(Worker0)] 10 Oct 2013 07:37:36 GMT Execution time for class de.independit.scheduler.server.parser.RegisterServer : 62 ms
MESSAGE [17164,1010(1010)] 10 Oct 2013 07:37:37 GMT get next job;
MESSAGE [17164,1010(Worker0)] 10 Oct 2013 07:37:37 GMT Server Execution time for class de.independit.scheduler.server.parser.GetNextJob : 2 ms -- Start Committing
MESSAGE [17164,1010(Worker0)] 10 Oct 2013 07:37:37 GMT Execution time for class de.independit.scheduler.server.parser.GetNextJob : 66 ms

Back to my previous post and your response: I tried to start the jobserver in the foreground and got the same exception as before, and the second command gave the same result. I had a second look at the BICsuite.jar I was using on the CLIENT, doing a side-by-side comparison with the BICsuite.jar I compiled when I was installing the scheduling server (SERVER). Well, the exception is correct: the class (de.independit.scheduler.server.ExecuteLock) did not exist in the jar created on the CLIENT. I think I was initially looking at the jar created on the SERVER when I said earlier that it was there.

I found that the classes de.independit.scheduler.server.Feature and de.independit.scheduler.server.MutableInteger were missing as well. To get around this, I simply copied the BICsuite.jar created on the SERVER to the CLIENT and the problem was gone.
I have no idea why these classes are missing; if you don't either, I would assume it's the CLIENT computer I am using.
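A side-by-side comparison of two jars like the one described here can be automated, because a jar is a zip archive. The sketch below uses only the Python standard library; the function name `missing_classes` is an assumption for the example and is not part of schedulix.

```python
import os
import tempfile
import zipfile


def missing_classes(reference_jar, candidate_jar):
    """Return class entries present in reference_jar but absent from candidate_jar."""
    def class_entries(path):
        with zipfile.ZipFile(path) as jar:
            return {name for name in jar.namelist() if name.endswith(".class")}

    return sorted(class_entries(reference_jar) - class_entries(candidate_jar))
```

With the SERVER jar as reference and the CLIENT jar as candidate, an incomplete build would show up as a non-empty list of entries such as de/independit/scheduler/server/ExecuteLock.class.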

I guess now we can continue with the next phase of the jobserver installation as the last time you got tired whilst you were still explaining the logical part of the installation.

Thank you for your help.
Regards
Sabelo

Ronald Jeninga

Oct 10, 2013, 5:41:09 AM
to sche...@googlegroups.com
Hi Sabelo,

first of all, I'm happy you got the jobserver running now. It seems that something has gone wrong compiling the system, but I can't figure out what.
If you want, you can send me the output of the compile ('make new' from $SDMSHOME/src) in a private mail, as I think that it is not relevant for the group. I'll have a look at it then and might find the cause.

Now for the second part of the logical installation.

So far you have a jobserver running, and it would execute jobs if only it got some.
This means that what you need is a way of addressing this (new) jobserver.

Now we get into the resource concept of the system. The main rule is: the scheduling server hands over a job to a jobserver which has sufficient resources available to start it.
There are three types of resources: Static, System and Synchronizing.

The idea behind this separation is:

1. Static resources define the (abstract) runtime environment.
    You could more or less describe the capabilities of a system by the presence of static resources.
    e.g. You could define resources like HAS_PERL, HAS_GCC, IS_MYSQL_CLIENT, HAS_LATEX, ...
    And a job can say, well I need a runtime environment with perl, mysql and LaTeX.
    This job now can run on _any_ system (resp. jobserver) offering at least the set of HAS_PERL, IS_MYSQL_CLIENT and HAS_LATEX.
    If you have many systems, you'll get load balancing for free this way. But it is an abstract way of looking at your hardware environment.

    For that reason we kept it simple in our examples and followed a more hardware oriented resource concept.
    So we created static resources called LOCALHOST, HOST_1 and HOST_2, logically pointing at three "different" computers.
    Then we defined a resource SERVER, which stands for the user.
    By specifying that a job needs both SERVER and LOCALHOST, you more or less say that you want to execute a job as user SERVER on the system LOCALHOST.

2. System Resources are counters representing "hardware resources".
    You could create a system resource called CPU_UNITS. You now define that each job "consumes" 1 CPU_UNIT.
    The maximum number of jobs started _at the same time_ will now be limited by the number of CPU_UNITS available.
    When a job terminates, its allocated CPU_UNITS are returned to the pool.
    This way, you get load control with only little effort.

3. Synchronizing Resources are complex objects for synchronizing activities.
    You can see them as semaphores, or, if this term is too abstract, traffic lights.
    Although synchronizing resources are relevant for the jobserver selection, I'll skip the concept here. (If a job doesn't require them, it won't miss them if they're not there ;)
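The first two resource types can be modelled in a few lines. This is a conceptual sketch only, not how the scheduling server is implemented: `can_run` and `SystemResource` are names invented for the illustration, and synchronizing resources are left out, as in the explanation above.

```python
def can_run(required_static, offered_static):
    """A jobserver qualifies if it offers at least every required static resource."""
    return set(required_static) <= set(offered_static)


class SystemResource:
    """A counter such as CPU_UNITS: allocated at job start, returned at job end."""

    def __init__(self, amount):
        self.free = amount

    def allocate(self, units=1):
        if units > self.free:
            return False          # the job has to wait until units are returned
        self.free -= units
        return True

    def release(self, units=1):
        self.free += units
```

A job needing HAS_PERL and HAS_LATEX passes `can_run` on any jobserver offering both, and a `SystemResource(2)` lets two jobs that each consume one unit run at the same time while a third has to wait.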

For the sake of simplicity let me assume that you want to be able to specifically address a certain jobserver.
(Side note: regardless of the entire concept you want to build, this is always a good idea and doesn't interfere with your, maybe more abstract, concept).
So you have this jobserver on your node CLIENT running as user john and a second jobserver running as user mary.

You now create three Named Resources: RESOURCE.CLIENT, RESOURCE.JOHN and RESOURCE.MARY.
(You can put them somewhere else in the category tree, but that's only an issue of ordering your resource definitions; it has no technical meaning).
The second step is to create _instances_ of the previously created named resources in the scope hierarchy.
You create an instance of RESOURCE.CLIENT in the scope GLOBAL.CLIENT. An instance of RESOURCE.JOHN in the jobserver GLOBAL.CLIENT.JOHN and an instance of RESOURCE.MARY in the jobserver GLOBAL.CLIENT.MARY.

The next step is to create two Environments. An Environment is simply a set of required static resources, describing the prerequisites for the execution of a job.
Each job requires the specification of an Environment. So you create an Environment called JOHN@CLIENT, requesting the resources JOHN and CLIENT, and a second environment MARY@CLIENT, requesting the resources MARY and CLIENT.
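How the resource instances and the two Environments select a jobserver can be sketched as follows. The sketch assumes that a resource instantiated in a scope is visible to every jobserver below that scope, which is what the setup described here implies; the dictionaries and function names are illustrative, not schedulix objects.

```python
# Resource instances per scope or jobserver, as created in the steps above.
INSTANCES = {
    "GLOBAL.CLIENT": {"CLIENT"},
    "GLOBAL.CLIENT.JOHN": {"JOHN"},
    "GLOBAL.CLIENT.MARY": {"MARY"},
}

# Each Environment is a set of required static resources.
ENVIRONMENTS = {
    "JOHN@CLIENT": {"JOHN", "CLIENT"},
    "MARY@CLIENT": {"MARY", "CLIENT"},
}


def visible_resources(jobserver):
    """Collect instances on the jobserver and on every enclosing scope."""
    parts = jobserver.split(".")
    found = set()
    for depth in range(1, len(parts) + 1):
        found |= INSTANCES.get(".".join(parts[:depth]), set())
    return found


def eligible_jobservers(environment):
    """Return the jobservers that offer everything the Environment requires."""
    required = ENVIRONMENTS[environment]
    servers = ("GLOBAL.CLIENT.JOHN", "GLOBAL.CLIENT.MARY")
    return [s for s in servers if required <= visible_resources(s)]
```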

Now you can test if you did everything right. You create a job definition (somewhere), make sure it is master submittable, and use one of the newly created environments.
If you submit the job, it will be executed by the corresponding jobserver. (For such tests, I love run commands like 'env').

If you have a look at the setup_example_jobservers.sh script, you'll probably see that it contains exactly the described procedure. Even if you don't know the ins and outs of the command language, I think it's verbose enough to be understood.

I hope I explained everything clearly. If not, just ask.

Regards,

Ronald

Sabelo Dlangamandla

Oct 10, 2013, 10:25:14 AM
to sche...@googlegroups.com
Hello, Ronald.

Thanks, that was CRYSTAL clear; it can't get any better than this. I have created the resources and the environments, and I was able to submit a job to the jobserver successfully. My job is not processed, though; it remains in the state 'RUNNABLE'. I guess it's something I missed while creating the jobserver; I will continue to investigate. Currently, when I tail the log I can see the job transitioning through different states and then stopping at 'RUNNABLE'. This is the log I see on the server:
MESSAGE [0,1040(1040)] 10 Oct 2013 14:04:09 GMT CONNECT 'SYSTEM' IDENTIFIED BY '**********' WITH PROTOCOL = PYTHON, TIMEOUT = 60, SESSION = 'schedulix!web[localhost:8080]', COMMAND = (submit SYSTEM.'ZAUBUNTU100'.'PLEASERUN' with unresolved = ERROR, group = 'PUBLIC');
DEBUG   [2,SchedulingThread(Worker0)] 10 Oct 2013 14:04:09 GMT : Job 19115 is re-evaluated
DEBUG   [0,1040(Worker1)] 10 Oct 2013 14:04:09 GMT End creator : 1381413849270
MESSAGE [0,1040(Worker1)] 10 Oct 2013 14:04:09 GMT Execution time for class de.independit.scheduler.server.parser.Connect/class de.independit.scheduler.server.parser.ListSubmitted : 1 ms
MESSAGE [2,TimerThread(Worker0)] 10 Oct 2013 14:04:23 GMT -----------> Start Time Scheduling <------------
MESSAGE [2,TimerThread(Worker0)] 10 Oct 2013 14:04:23 GMT -----------> End Time Scheduling   <------------
MESSAGE [2,TimerThread(Worker0)] 10 Oct 2013 14:04:23 GMT Server Execution time for class de.independit.scheduler.server.TimeSchedule : 1 ms -- Start Committing
MESSAGE [2,TimerThread(Worker0)] 10 Oct 2013 14:04:23 GMT Execution time for class de.independit.scheduler.server.TimeSchedule : 83 ms
MESSAGE [2,TriggerThread(Worker0)] 10 Oct 2013 14:04:23 GMT End Resuming Jobs (0 jobs resumed)
MESSAGE [2,TriggerThread(Worker0)] 10 Oct 2013 14:04:23 GMT Server Execution time for class de.independit.scheduler.server.DoCheckTrigger : 0 ms -- Start Committing
MESSAGE [2,TriggerThread(Worker0)] 10 Oct 2013 14:04:23 GMT Execution time for class de.independit.scheduler.server.DoCheckTrigger : 0 ms

DEBUG   [2,SchedulingThread(Worker0)] 10 Oct 2013 14:04:24 GMT Number of Job Server : 4
DEBUG   [2,SchedulingThread(Worker0)] 10 Oct 2013 14:04:24 GMT Number of Jobs in SYNCHRONIZE_WAIT : 1
DEBUG   [2,SchedulingThread(Worker0)] 10 Oct 2013 14:04:24 GMT Number of Jobs in RESOURCE_WAIT : 1
DEBUG   [2,SchedulingThread(Worker0)] 10 Oct 2013 14:04:24 GMT : Job 19115 added to Runnable Queue 17164
MESSAGE [2,SchedulingThread(Worker0)] 10 Oct 2013 14:04:24 GMT purgeLow = 338685
MESSAGE [2,SchedulingThread(Worker0)] 10 Oct 2013 14:04:24 GMT purgeSetSize = 4
MESSAGE [2,SchedulingThread(Worker0)] 10 Oct 2013 14:04:24 GMT Execution time for class de.independit.scheduler.server.DoSchedule : 99 ms

If you have a hint about what could be happening, please help. I browsed the GUI manual, but I couldn't find the exact meaning of that state (RUNNABLE).

Regards
Sabelo 

Ronald Jeninga

Oct 10, 2013, 11:18:43 AM
to sche...@googlegroups.com
Hi Sabelo,

we're making progress. Perfect :-)

In the GUI documentation there is a brief description of the state model of a job. Chapter 19 or so, I think.
Anyway, the causes for jobs remaining in certain states are only implicit information. I'll try to give a short overview.

1. SUBMITTED -> If a job becomes visible in this state, the server is severely broken. I haven't seen it for about 11 or 12 years now.
2. DEPENDENCY_WAIT -> well ... The job waits for some predecessor, or for the scheduling thread. If there is no predecessor and the job remains in this state for a (very) long time, check the server log, because the scheduling thread is likely to have huge problems. It might be a bug, or something as simple as trying to schedule 100,000 jobs at once.
3. SYNCHRONIZE_WAIT -> The job waits for its required synchronizing resources. Also here the scheduling thread is responsible.
4. RESOURCE_WAIT -> The job waits for system resources. This is like 2. and 3.
5. RUNNABLE -> The job waits until it is fetched by some suitable jobserver. If a job remains in this state, normally the (intended) jobserver isn't running or has other problems.
    So check if the new jobserver is running, i.e. issuing "get next job" commands.
6. STARTING -> The job has been fetched by the jobserver. If it doesn't proceed, your jobserver has problems.
7. STARTED -> The jobserver has confirmed the receipt of the job. If the job's state doesn't change to RUNNING, FINISHED or FINAL, check the jobserver, it's probably gone (for a beer in the pub around the corner).
    It's also worth investigating if the run command is visible (ps -ef), the jobexecutor is present and, if you don't find them, the contents of the taskfile (<prefix>-<jobid>).
8. RUNNING -> If you are trying to find the 200th Mersenne prime, better wait for the next few years. If you're doing something trivial ('env'), the jobserver has probably died. Proceed as in 7.
10. FINISHED -> The job either has children that haven't finished yet, or it has reached a restartable state.
11. FINAL -> If you wait a few days, this successful task will have been removed from memory :-)
12. BROKEN_RUNNING -> The jobexecutor has died, but the job is still running. If this lasts unexpectedly long, your jobserver might have died too (in the meantime).
13. BROKEN_FINISHED -> The jobexecutor died before it could collect and write the exit code of your job. You'll have to manage this situation by hand. Read the logfile and try to find out if the job has been successful.
    Then you can set the state of the job and (if it failed) do a rerun.
14. CANCELLED -> See 11., except that the task probably wasn't successful.
15. ERROR -> Some error occurred. You'll have to fix it and restart or cancel the job.
16. TO_KILL -> The scheduling system has received the command to execute the "kill program", but this program hasn't finished yet. Check for errors as in 7.
    A kill program is more or less processed like a job without dependencies or resources.
17. KILLED -> An attempt was made to kill the process by executing the "kill program", but this seems to have failed. The process will simply continue running. If it terminates, it will reach the FINISHED state.
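For quick troubleshooting, the advice in the list can be condensed into a lookup of where to look first when a job stays in a given state. The table below is only a paraphrase of the list for illustration; the names `FIRST_CHECK` and `first_check` are made up, and states without specific advice fall back to the list itself.

```python
# Where to look first when a job stays in a state, condensed from the list above.
FIRST_CHECK = {
    "DEPENDENCY_WAIT": "predecessors, then the scheduling thread (server log)",
    "SYNCHRONIZE_WAIT": "synchronizing resources, then the scheduling thread",
    "RESOURCE_WAIT": "system resources, then the scheduling thread",
    "RUNNABLE": "jobserver: is it running and issuing 'get next job'?",
    "STARTING": "jobserver",
    "STARTED": "jobserver, run command (ps -ef), jobexecutor, taskfile",
    "RUNNING": "jobserver, run command (ps -ef), jobexecutor, taskfile",
    "BROKEN_RUNNING": "jobserver",
    "BROKEN_FINISHED": "job logfile, then set the state by hand",
}


def first_check(state):
    """Return the first thing to check for a job stuck in the given state."""
    return FIRST_CHECK.get(state.upper(), "see the state list above")
```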

In short: check your jobserver :-)

Regards,

Ronald

Sabelo Dlangamandla

Oct 11, 2013, 2:28:33 AM
to sche...@googlegroups.com
Ronald,

It WORKS. You were correct: it was the jobserver that had issues, which resulted in the jobs being stuck in the 'RUNNABLE' state. After running the 'make new' command and properly setting the environment variable as per your advice, the jobs were processed successfully.
I can't describe how I feel, but it's a good feeling. Thank you, Ronald.

My next phase is to implement the jobserver agent on a Solaris machine, as this one was on Ubuntu. If I run into unusual issues, I hope you will still be open to clarifying them.

Kind regards
Sabelo

Ronald Jeninga

Oct 11, 2013, 2:44:07 AM
to sche...@googlegroups.com
Hi Sabelo,

of course it works! ;-) ;-) ;-)

Running the system on Solaris is very well possible, as the commercial edition of the software also runs on Solaris at several customers.
You'll need to do a port of the jobexecutor though (I think scrolllog will compile without problems, but I don't know by heart and I don't guarantee it now).

There are two things that are different in a Solaris environment:
1. Identification of processes
    Using only a PID is not a good idea, since PIDs are reused. So if a jobexecutor has died, we need more than just a PID in order to determine the state of a job (BROKEN_RUNNING or BROKEN_FINISHED).
2. File locking of the taskfile
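To illustrate the first point, here is a sketch of why a second attribute is needed next to the PID. The post does not say which extra information schedulix records; using the process start time is purely an assumption for the example, as are the names `ProcessIdentity` and `classify_after_executor_death`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessIdentity:
    pid: int
    start_time: float   # assumption: recorded when the process was started


def classify_after_executor_death(recorded, live_processes):
    """Decide between BROKEN_RUNNING and BROKEN_FINISHED for a dead jobexecutor.

    live_processes maps each currently running PID to its start time.
    """
    start = live_processes.get(recorded.pid)
    if start is not None and start == recorded.start_time:
        return "BROKEN_RUNNING"      # same PID and same start time: still our job
    return "BROKEN_FINISHED"         # PID gone, or reused by an unrelated process
```

With the PID alone, a reused PID would make a finished job look as if it were still running; the second attribute is what rules that case out.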

There's no magic involved, but it's a bit tricky. I'll be away over the weekend, but I can help you next week (from Tuesday).

Anyway I'm happy you're happy :-)

Enjoy your weekend!

Ronald

Sabelo Dlangamandla

Oct 11, 2013, 2:59:51 AM
to sche...@googlegroups.com
Thank you for the tips, will give it a shot. Have a great weekend too.

Regards
Sabelo