Hello,
As promised, a description of the application (farmfarms) I'm working on.
The application runs in two stages. In the first stage it is given one or more <farmname> directories, from which it creates tarballs <parent>_<farmname>.tar.gz containing the inputs for a computational experiment. Prefixing <parent> ensures that each tarball name is unique.
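To make the naming scheme concrete, here is a minimal C++ sketch. It is not the actual farmfarms code: the function name tarball_name is made up for illustration, and it assumes that <parent> is the name of the directory immediately containing <farmname>.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not from farmfarms): build "<parent>_<farmname>.tar.gz"
// from the path of a farm directory. Assumes <parent> is the name of the
// directory that directly contains the farm directory.
std::string tarball_name(const fs::path& farm_dir) {
    // Normalise the path and drop a trailing separator, so that
    // filename() yields the farm name rather than an empty string.
    fs::path p = farm_dir.lexically_normal();
    if (!p.has_filename()) {
        p = p.parent_path();
    }
    const std::string farm = p.filename().string();
    const std::string parent = p.parent_path().filename().string();
    return parent + "_" + farm + ".tar.gz";
}
```

With this sketch, two farms that are both called farm1 but live under expA and expB map to different tarball names, which is the point of including <parent>.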
In the second stage the application runs a shell script (rubymeta.sh) that copies a ruby script (runmeta.rb) to the working directory and runs it. runmeta.rb copies two further ruby scripts (runner.rb, restart.rb) to the working directory, along with some other scripts that post-process the output in preparation for loading the data into an SQL database, although that functionality is currently disabled. runmeta.rb then starts runner.rb, the heart of the application. runner.rb creates <parent>_<farmname>.tar.gz and sends it to a slurm host, unpacks it there, submits the job to slurm and waits for completion, which is signalled by the creation of a results file <parent>_<farmname>.tgz. It then retrieves that file to the working directory, unpacks it and (if not disabled) processes the results and loads them into the SQL database. If the job timed out, restart.rb edits some input files, renames <parent>_<farmname> and invokes runner.rb with the 'new' <farmname>. The restart facility is currently disabled because it depends on querying the database to discover the outcome of the calculation, and database submission is disabled while I get my taskvine application running.
Most of this is working well, but at the moment I cannot get taskvine to distribute the stage 2 work to more than one worker at a time. One worker always gets all the work, even when more than one worker is available.
The taskvine application is written in C++ and I'm running on FreeBSD 12.4 and 13.2.
Thanks for reading this far. If you have suggestions on how to debug, or want to see my debug logs, please ask.
Thanks,
Roger