[slurm-users] R jobs crashing when run in parallel

Simon Andrews

Mar 29, 2021, 8:47:33 AM
to slurm...@lists.schedmd.com

I've got a weird problem on our Slurm cluster. If I submit lots of R jobs to the queue then, as soon as I've got more than about 7 of them running at the same time, I start to get failures saying:

/bi/apps/R/4.0.4/lib64/R/bin/exec/R: error while loading shared libraries: libpcre2-8.so.0: cannot open shared object file: No such file or directory

...which makes no sense, because that library is definitely there, and other jobs on the same nodes worked both before and after the failed jobs. I recently ran 500 identical jobs and 152 of them failed in this way.

There are no errors in the log files on the compute nodes where this happened, and it occurs across multiple nodes, so it isn't a single node misbehaving. The R binary is on an Isilon network share, but the libpcre2 library is on the local disk of each node.
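
Something like this in the job script, just before R is invoked, should at least record what the loader sees when it next fails (a rough sketch; /usr/lib64 is only a guess at where the local copy of the library lives):

    # diagnostic only: capture the loader's view right before starting R
    echo "host: $(hostname)"
    echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
    ldd /bi/apps/R/4.0.4/lib64/R/bin/exec/R | grep -i pcre
    ls -l /usr/lib64/libpcre2-8.so.0    # guessed path; adjust to wherever it actually lives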


Has anyone come across anything like this before? Any suggestions for fixes?

Thanks

Simon.

Patrick Goetz

Mar 29, 2021, 12:35:14 PM
to slurm...@lists.schedmd.com
Could this be a function of the R script you're trying to run, or are
you saying you get this error running the same script which works at
other times?
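
One way to separate the two would be a batch of trivial jobs that only start R and exit, taking your script out of the picture entirely (a sketch; the Rscript path is a guess based on the install prefix in the error message):

    sbatch --array=1-100 --wrap '/bi/apps/R/4.0.4/bin/Rscript --version'

If some of those still die with the same loader error, it's the environment rather than the script.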

Prentice Bisbal

Mar 29, 2021, 1:29:00 PM
to Slurm User Community List
It sounds to me like configuration drift on your cluster. I would check that libpcre2 is actually (still?) installed on all of your cluster nodes. I'll bet that if you check the node(s) where the jobs are failing, it's a particular subset of nodes, or even a single node, and the library has somehow disappeared from them.
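
A quick way to rule that out across the whole cluster would be to probe every node through Slurm (a rough sketch; /usr/lib64 is an assumption, adjust to wherever the library lives on the local disks):

    for n in $(sinfo -N -h -o '%N' | sort -u); do
        srun -w "$n" -N1 -n1 sh -c 'ls -l /usr/lib64/libpcre2-8.so.0' 2>&1 | sed "s/^/$n: /"
    done

Any node that prints "No such file or directory" is the one to look at.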

Prentice

William Brown

Mar 29, 2021, 2:13:30 PM
to Slurm User Community List
Maybe you have run out of file handles. 
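
If it is that, the numbers a job actually sees (rather than what a login shell sees) can be checked from inside an allocation, e.g. (a sketch):

    srun -N1 -n1 sh -c 'echo "per-process open-file limit: soft $(ulimit -S -n), hard $(ulimit -H -n)"; cat /proc/sys/fs/file-nr'

/proc/sys/fs/file-nr reports the handles currently allocated and the system-wide maximum.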

William 

Simon Andrews

Mar 30, 2021, 6:40:17 AM
to Slurm User Community List

Interesting idea, thanks. I don't think this looks like the likely cause though:

# lsof | wc -l
20675

# cat /proc/sys/fs/file-max
52325451

This is on one of the nodes that had failures. The number of open files is tiny compared to the limit. I know there's a per-process limit too, but given that the jobs are all identical, they should fail consistently if that were the cause.
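
If it does turn out to matter, the per-process numbers as the jobs themselves see them could be logged from inside the job script, since the limit inside a job can differ from the login node depending on how PropagateResourceLimits is set (a sketch):

    grep 'Max open files' /proc/self/limits
    ls /proc/self/fd | wc -l    # descriptors already open in this shell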


Simon.
