I've got a weird problem on our Slurm cluster. If I submit lots of R jobs to the queue, then as soon as I've got more than about 7 of them running at the same time I start to get failures saying:
/bi/apps/R/4.0.4/lib64/R/bin/exec/R: error while loading shared libraries: libpcre2-8.so.0: cannot open shared object file: No such file or directory
...which makes no sense, because that library is definitely there, and other jobs on the same nodes worked both before and after the failed jobs. I recently ran 500 identical jobs and 152 of them failed in this way.
There are no errors in the log files on the compute nodes where this failed, and it happens across multiple nodes, so it's not a single node misbehaving. The R binary is on an Isilon network share, but the libpcre2 library is on each node's local disk.
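In case it helps with diagnosis, one thing I can try is running a batch of these with loader debugging turned on, so the dynamic linker logs every directory it searches while resolving libraries. A rough sketch of the job script (the job name, output path and R script name are just placeholders, not my real job):

#!/bin/bash
#SBATCH -J r-ld-debug
# LD_DEBUG=libs makes the glibc loader print every directory it searches
# while resolving each shared library, so a failing run should show where
# the libpcre2-8.so.0 lookup goes wrong.
export LD_DEBUG=libs
# One log per process; the loader appends the pid to this filename.
export LD_DEBUG_OUTPUT=/tmp/ld-debug.$SLURM_JOB_ID
/bi/apps/R/4.0.4/lib64/R/bin/Rscript my_analysis.R   # placeholder script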
Anyone come across anything like this before? Any suggestions for fixes?
Thanks
Simon.
Interesting idea, thanks. I don't think this is the likely cause, though:
# lsof | wc -l
20675
# cat /proc/sys/fs/file-max
52325451
This is on one of the nodes that had failures. The number of open files is tiny compared to the limit. I know there's also a per-process limit, but given that the jobs are all identical, it should fail consistently if that were the cause.
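For completeness, the per-process limit can also be checked from inside an actual job step rather than on the node itself; something like the following (just a sketch, the srun flags are an example):

$ # soft open-files limit the shell in a job step actually gets
$ srun -N1 -n1 bash -c 'ulimit -n'
$ # both soft and hard limits as the kernel reports them for that process
$ srun -N1 -n1 bash -c 'grep "open files" /proc/self/limits'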
Simon.