Problem in Batch Mode


Fernando Obed

Oct 13, 2022, 1:40:04 PM
to The irace package: Iterated Racing for Automatic Configuration
Hi,

I am trying to perform the parameter tuning of my algorithm with IRACE on a Compute Canada cluster, so I have to adapt IRACE to batch mode, where the script "target-runner-slurm" sends jobs to the queue and the script "target-evaluator" evaluates the results obtained.

When IRACE is executed in the SLURM environment, it sends a group of jobs to the queue; when the jobs are done, it analyzes the results and sends another group of jobs. However, at some point of the execution IRACE stops sending jobs: it sends a group of jobs, but after these jobs finish it does not send the next group, and no further progress is made.

The main problem is that I do not get an error message; IRACE just stops sending jobs and stops running. In order to find the cause of this problem, I have done the following:

* I checked the processes that are currently running to see whether the scripts "target-runner-slurm" or "target-evaluator" are stuck in an infinite loop; they are not running.

* I store all the *.stdout files produced when I execute IRACE; I checked them and all of them contain the result obtained with my algorithm.

* I also store all the *.stderr files; all of them contain the text "Picked up JAVA_TOOL_OPTIONS: -Xmx2g", which my algorithm (implemented in Java) prints to the terminal whenever it runs.

* When a job is executed, the system produces a file "slurm-<Job ID>.txt". I checked all the slurm files produced during the IRACE run, where each file corresponds to a job sent by IRACE, and all of them contain the text "OK".


I could not detect where the problem lies. What do you think the problem could be?

Manuel López-Ibáñez

Oct 13, 2022, 1:58:20 PM
to The irace package: Iterated Racing for Automatic Configuration
On Thursday, 13 October 2022 at 18:40:04 UTC+1 Fernando Obed wrote:
Hi,

I am trying to perform the parameter tuning of my algorithm with IRACE on a Compute Canada cluster, so I have to adapt IRACE to batch mode, where the script "target-runner-slurm" sends jobs to the queue and the script "target-evaluator" evaluates the results obtained.

This is not necessarily true.

You can reserve a number of CPUs within the same node (machine) and submit a job that runs irace using the --parallel option. This will be faster in terms of communication overhead (but you can only run as many jobs in parallel as there are CPUs in a single machine).
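
For example, a submission script along these lines (a minimal sketch; the resource limits, scenario file name, and the way R/irace are made available are placeholders for your site):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=64    # reserve 64 CPUs on one node
    #SBATCH --time=48:00:00
    # All target runs happen inside this single job, so batch mode and
    # target-runner-slurm are not needed:
    irace --scenario scenario.txt --parallel 64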

A second option, if your cluster supports it, is to use MPI to reserve CPUs in many machines, then submit irace to the cluster with the options --mpi 1 --parallel N. This requires setting up the environment using OpenMPI or similar so that Rmpi can find the nodes assigned to this job. The people administering your cluster should be able to help you with this.
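
Roughly (again only a sketch; the module names and the MPI launch command are assumptions that depend entirely on your cluster, so check with your admins):

    #!/bin/bash
    #SBATCH --ntasks=65           # 1 master + 64 workers
    #SBATCH --time=48:00:00
    module load openmpi r         # assumed module names
    # Rmpi spawns the workers itself, so irace starts as a single MPI task:
    mpirun -np 1 irace --scenario scenario.txt --mpi 1 --parallel 64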

I almost never use the --batchmode option because it is slower than all the other alternatives.

When IRACE is executed in the SLURM environment, it sends a group of jobs to the queue; when the jobs are done, it analyzes the results and sends another group of jobs. However, at some point of the execution IRACE stops sending jobs: it sends a group of jobs, but after these jobs finish it does not send the next group, and no further progress is made.

If irace stops sending jobs, it could be because irace believes the previous jobs are still running, so there must be a mismatch between how the cluster reports that a job has finished and what irace expects.
 
I could not detect where the problem lies. What do you think the problem could be?

* Try running with --debug-level 3, and also check whether the jobs are really not running or are somehow still being reported as running.

* For SLURM, the output of sbatch should match what is expected here: https://github.com/MLopez-Ibanez/irace/blob/master/inst/examples/batchmode-cluster/target-runner-slurm otherwise irace will not get the correct jobID. Make sure that the jobID returned by target-runner is indeed the jobID that allows squeue to know whether the job is running or not.
* For SLURM, the command that checks whether a jobID has completed is here: https://github.com/MLopez-Ibanez/irace/blob/master/R/cluster.R#L32 Make sure that it works on your system (launch a job manually and use that function within R to check whether it detects when the job finishes); see the sketch below.
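
For instance, from the command line (a sketch; test-job.sh is a placeholder for any small job script):

    # Submit a test job by hand and watch how the check used by irace sees it:
    JOBID=$(sbatch test-job.sh | grep -o '[0-9]\+')
    squeue -j "$JOBID" --no-header
    # This prints one line while the job is pending or running and nothing
    # once it has finished; irace considers the job done when the output is empty.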

A better solution would be to use https://mllg.github.io/batchtools/ since it would be more powerful, more reliable, and require less code within irace. Any contributions in that direction would be very much appreciated.

Cheers,

Manuel.


fobe...@gmail.com

Oct 13, 2022, 2:35:32 PM
to The irace package: Iterated Racing for Automatic Configuration
I adapted the scripts "target-runner-slurm" and "target-evaluator" from the "examples" directory. In which script do I have to check that a job has finished? And if I have to add an instruction to "target-runner-slurm" to check whether the jobs have finished, in which part of the script should that check go?

I do not understand what batchtools (from https://mllg.github.io/batchtools/) is for. Do you know where I can find a tutorial or some documentation about it? I checked the information at the link, but it is not clear to me how to use it, or what it is for, when performing parameter tuning with IRACE.

I am using the SLURM environment because the execution time of my algorithm on a given instance ranges from 1 hour to 10 hours: some jobs take one hour and others take 10 hours. Currently I am using the scripts "target-runner-slurm" and "target-evaluator" (from the examples directory), which I adapted to run IRACE on a Compute Canada cluster, where I can submit up to 1000 jobs at a time. When I run IRACE, it performs four iterations. At a given iteration, IRACE sends groups of approximately 65 jobs at a time, and up to 6 such groups per iteration, so one IRACE iteration takes on average up to 2 days (given the execution times of my algorithm), and the entire parameter tuning would take approximately 7 or 8 days. I am wondering whether it is possible for IRACE to send all the jobs of an iteration at once, in order to reduce the total execution time, since on Compute Canada I can submit 1000 jobs at a time.

Manuel Lopez-Ibanez

Oct 13, 2022, 3:56:31 PM
to Iterated Racing for Automatic Configuration
Hi Fernando,

The website of batchtools has tutorials and documentation. There's also a paper: https://doi.org/10.21105/joss.00135 But you will need to know R to use it. Otherwise you are stuck with what irace currently provides.

target-runner cannot check that jobs have finished; it is irace (or you) that checks, by using squeue. What target-runner needs to do correctly is report the JOBID of the job it launched. You can invoke target-runner outside irace to see whether it reports the JOBID correctly.
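
The jobID-reporting logic has roughly this shape (a sketch of the idea, not the exact example script; job.sh stands for whatever your target-runner submits):

    output=$(sbatch job.sh)       # e.g. "Submitted batch job 123456"
    jobid=$(echo "$output" | grep -o '[0-9]\+' | head -n 1)
    if [ -z "$jobid" ]; then
        echo "$0: cannot parse jobID from the output of sbatch!"
        exit 1
    fi
    echo "$jobid"                 # irace reads this as the JOBID to wait for

To test it outside irace, call the script by hand with the same arguments irace would pass and check that, on success, the only thing printed is the JOBID.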

You can also use squeue outside irace to check what happens while that JOBID is running and when it finishes. Your cluster should come with some documentation about squeue. irace uses 'squeue -j JOBID --no-header' internally for SLURM, as can be seen here: https://github.com/MLopez-Ibanez/irace/blob/master/R/cluster.R#L32

When using --debug-level 3 you can see which JOBID irace is waiting for and you can check yourself why irace thinks it is still running.

Currently, irace submits all jobs for one instance and waits for them to finish. It would be possible to submit jobs for multiple instances speculatively and, if enough evidence arrives to eliminate a configuration that is running speculatively, kill its jobs. Someone would need to implement this within irace (irace currently does not kill jobs, no matter how long they take). Any help implementing this would be welcome; I don't have the time to do it myself.

Since irace does not launch jobs speculatively yet, the amount of parallelism is limited by the number of configurations alive in the race, which can be as low as minAlive. That's why it may be easier and faster to reserve a node with 64 or 128 CPUs and launch irace directly on the node with --parallel 64, without using batchmode.

If your algorithm is an anytime algorithm that can report its progress, you may look into using adaptive capping approaches with irace, so that bad configurations are detected earlier, saving significant time. See https://www.sciencedirect.com/science/article/pii/S0305054821003300 (there's a link to the software and examples in the paper).

I hope the above helps!

Cheers,

Manuel.




fobe...@gmail.com

Oct 14, 2022, 12:55:42 PM
to The irace package: Iterated Racing for Automatic Configuration
Hi,

I checked the script target-runner-slurm that I am using and it prints the job ID correctly. However, I was thinking that perhaps the reason why IRACE stops sending jobs is that, when the sbatch command is used and the cluster is very busy, sometimes the message "sbatch: error: Batch job submission failed: Socket timed out on send/recv operation" is printed instead of "Submitted batch job <JobID>". In that case, the job ID cannot be read and printed when target-runner-slurm is called.

Do you think this is the reason why IRACE stops sending jobs and makes no more progress (it seems IRACE enters an infinite loop)?

Manuel López-Ibáñez

Oct 14, 2022, 1:03:32 PM
to The irace package: Iterated Racing for Automatic Configuration
Hi,

What does target-runner-slurm print when that error happens? I would expect it to return either:

"$0: cannot parse jobID from the output of sbatch!" or "$0: sbatch failed!

irace would receive those messages and report the error. Isn't target-runner-slurm doing that?

Cheers,

Manuel.

fobe...@gmail.com

Oct 14, 2022, 4:09:58 PM
to The irace package: Iterated Racing for Automatic Configuration
So, if target-runner-slurm is called and sbatch produces the message "sbatch: error: Batch job submission failed: Socket timed out on send/recv operation", then target-runner-slurm will print the message "$0: cannot parse jobID from the output of sbatch!". In this case, does IRACE stop running, or is the script target-runner-slurm called again in order to make a second attempt to send the job?

Manuel López-Ibáñez

Oct 14, 2022, 4:22:11 PM
to The irace package: Iterated Racing for Automatic Configuration
Hi,

A failed call to target-runner stops irace immediately. You can try it yourself: just change the target-runner to always produce the error. If irace doesn't stop, that would be a bug in irace.
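
If transient sbatch failures like the one you describe are the culprit, one possible workaround (a sketch only, not part of the shipped examples; the retry count and delay are arbitrary, and job.sh is a placeholder) is to retry sbatch inside target-runner-slurm before giving up:

    jobid=""
    for attempt in 1 2 3 4 5; do
        output=$(sbatch job.sh 2>&1)
        jobid=$(echo "$output" | grep -o '[0-9]\+' | head -n 1)
        [ -n "$jobid" ] && break
        sleep 60                  # wait out a busy scheduler before retrying
    done
    if [ -z "$jobid" ]; then
        echo "$0: sbatch failed after several attempts: $output"
        exit 1
    fi
    echo "$jobid"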

Cheers,

Manuel