Singularity with SLURM


Dmitri Chebotarov

Jul 28, 2017, 10:37:40 AM
to singu...@lbl.gov
Hi

I'm running into some issues with Singularity/SLURM. What seems to be happening is that SLURM kills Singularity jobs and leaves behind a zombie process of the application that was running inside the container.

Also, how does SLURM track memory usage with Singularity? Which process does it track: Singularity or the actual job?

Some Singularity jobs get killed because of memory usage, but sacct reports much lower memory usage than what was requested (e.g. 300M from sacct vs. --mem=16GB when submitting the job).

Are there any adjustments I need to make to SLURM config to support Singularity? 

Thank you.

David Godlove

Jul 28, 2017, 11:09:31 AM
to singu...@lbl.gov
I can't speak to SLURM leaving zombie processes lying around; that is unusual. Perhaps it has to do with what you are running inside the container? Are you using a new PID namespace when you run singularity?

SLURM should track the memory of the actual job, which includes the singularity process itself and any processes running inside the container. The problem you are noting with sacct is a general SLURM issue: SLURM logging is not instantaneous. Sometimes a job ramps up memory usage very quickly and gets killed by SLURM before proper logging occurs, so you don't see the actual amount of memory used in sacct. In these conditions it's usually best to submit a representative test job with much more memory than you think it will actually need, then record how much memory the job used and update your memory allocation for future jobs accordingly.
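If it helps, once a representative test job has finished, sacct can report the peak memory it actually used alongside what was requested (the job ID below is just a placeholder):

```shell
# Compare requested memory against the peak actually used by a finished job
# (12345 is a hypothetical job ID)
sacct -j 12345 --format=JobID,JobName,ReqMem,MaxRSS,State
```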

The SLURM config should be just fine to support Singularity.  Singularity is just another app!  Albeit a really awesome one.  

--
You received this message because you are subscribed to the Google Groups "singularity" group.
To unsubscribe from this group and stop receiving emails from it, send an email to singularity+unsubscribe@lbl.gov.

Dmitri Chebotarov

Jul 28, 2017, 12:29:42 PM
to singu...@lbl.gov
Hi Dave

Thank you for the explanation of SLURM sacct; it makes sense.

What do you mean by "a new PID namespace"?
To run an application in a container I set an alias via the corresponding module file, i.e.

module load R

where the R module has:
...
module load singularity
set-alias R "singularity exec /path/to/container/R-3.4.1 /opt/R/3.4.1/bin/R $*"

Users can use R as usual without needing to change their submit scripts.

Thank you.


Gregory M. Kurtzer

Jul 28, 2017, 12:36:59 PM
to singu...@lbl.gov
This looks fantastic, and I'm not sure why SLURM is doing that. Are you using CGroups with SLURM?

BTW, just as a minor shell nitpick, change the last bit of your alias to be "$@" instead of "$*" to preserve quoting and escapes.
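A tiny shell demo of the difference (the function names here are just for illustration):

```shell
# $* re-splits quoted arguments on whitespace, while "$@" passes them through intact.
with_star() { printf '[%s]\n' $*; }    # unquoted $*: word-splits "hello world" into two args
with_at()   { printf '[%s]\n' "$@"; }  # "$@": keeps "hello world" as a single argument

with_star "hello world"   # prints [hello] and [world] on separate lines
with_at   "hello world"   # prints [hello world]
```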

Greg
--
Gregory M. Kurtzer
CEO, SingularityWare, LLC.
Senior Architect, RStor
Computational Science Advisor, Lawrence Berkeley National Laboratory

David Godlove

Jul 28, 2017, 12:43:16 PM
to singu...@lbl.gov
I was wondering if you were using the -p/--pid option to run your container in a new PID namespace. It doesn't look like it. If you were, then running ps inside the container would only show processes running within the container, and from outside you would only see the PID of the parent process but none of the PIDs for the processes running inside the container.
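For illustration, that would look something like this (the container path is hypothetical, borrowed from your module file):

```shell
# Run ps inside a fresh PID namespace; processes in the container
# can then only see each other, not the host's processes.
singularity exec --pid /path/to/container/R-3.4.1 ps aux
```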

I thought that might explain your zombie process, but maybe not.  

Dmitri Chebotarov

Jul 28, 2017, 4:01:42 PM
to singu...@lbl.gov
Thanks for the nitpick - it's a helpful tip.

I ran a few tests with R, with and without Singularity, and it seems like Slurm (or the 'ps' command) doesn't know anything about Singularity - it only detects the actual process.
Singularity doesn't spawn a separate process.

For example, when I run R and check 'ps ax' output:

* R 3.4.1 with Singularity:
9870  8.8  0.0 231732 36092 pts/8    S+   15:44   0:00 /opt/R/3.4.1/lib64/R/bin/exec/R

* R 3.2.0 native:
7240  0.6  0.0 179356 30520 pts/8    S+   15:37   0:00 /cluster/shared/apps/R/3.2.0/lib64/R/bin/exec/R

The 'ps' command (and as a result Slurm) only 'sees' the R process.

R 3.4.1 with Singularity uses slightly more memory compared to R 3.2.0 native.
It's hard to tell whether that's related to Singularity or whether the newer R version simply uses more memory.

So far it looks OK and there should be no reason for Slurm to kill Singularity jobs.
It's possible the issue is related to Slurm itself - I think we are a few versions behind the current Slurm version.

Thank you.


Gregory M. Kurtzer

Jul 28, 2017, 7:02:40 PM
to singu...@lbl.gov
Singularity sets up the environment and then exec's itself out of existence, passing the PID and flushing itself from memory when the program within the container is running. You are finding that there is no Singularity process running on your system when a Singularity container is in fact running. This is correct because Singularity attaches the kernel namespaces to the actively running process(es). When that process completes, those namespaces collapse and the kernel cleans up, flushing the namespaces completely and leaving a clean system.
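You can see the same "replace yourself via exec" behavior with the shell's own exec builtin - the PID stays the same while the program image is swapped out, leaving no wrapper process behind:

```shell
# exec replaces the current process image in-place: the bash process
# becomes ps, keeping the same PID, so no parent wrapper lingers.
bash -c 'echo "bash PID: $$"; exec ps -o pid=,comm= -p $$'
# both lines report the same PID, but the second comes from ps,
# which has replaced bash entirely
```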

It really is quite elegant if I do say so myself. ;)

As far as memory consumption, it could be a few things, but it is not Singularity as Singularity is no longer taking up resident memory.

And for Slurm, check into using the CGroups support.

Hope that helps!

Greg
