[slurm-users] Memory allocation error


Mahmood Naderan

Mar 13, 2018, 4:12:44 PM3/13/18
to slurm...@lists.schedmd.com
Hi,
By specifying the following parameters in a Gaussian input file

%nprocshared=2
%mem=1GB

and a Slurm script as below

#!/bin/bash
#SBATCH --output=test.out
#SBATCH --job-name=gaus-test
#SBATCH --nodelist=compute-0-1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
g09 test.gjf


the run terminates with the following error in the log file

galloc: could not allocate memory.: Cannot allocate memory

However, running that Gaussian job file directly on compute-0-1 works
without any problem. In fact, compute-0-1 has 8GB of memory.

Any idea about that?


Regards,
Mahmood

Christopher Samuel

Mar 13, 2018, 4:20:31 PM3/13/18
to slurm...@lists.schedmd.com
On 14/03/18 07:11, Mahmood Naderan wrote:

> Any idea about that?

You've not requested any memory in your batch job and I guess your
default limit is too low.

To get the 1GB (and a little head room) try:

#SBATCH --mem=1100M

That's a per-node limit, so for MPI jobs (which Gaussian is not)
you'll need a different parameter.
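For reference, a minimal job-script sketch along these lines (the memory figures are illustrative, not site-specific):

```shell
#!/bin/bash
#SBATCH --output=test.out
#SBATCH --job-name=gaus-test
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=1100M            # per-node limit: %mem=1GB plus head room
# For an MPI code you would size memory per rank instead, e.g.:
# #SBATCH --mem-per-cpu=550M
g09 test.gjf
```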

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Mahmood Naderan

Mar 14, 2018, 1:46:04 AM3/14/18
to Slurm User Community List
Excuse me, but it doesn't work. I set --mem to 2GB and put the free
command in the script. I don't know why it failed.

[mahmood@rocks7 ~]$ sbatch sl.sh
Submitted batch job 19
[mahmood@rocks7 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[mahmood@rocks7 ~]$ cat test.out
compute-0-1.local
total used free shared buff/cache available
Mem: 7.6G 127M 6.8G 8.5M 729M 7.3G
Swap: 2.4G 0B 2.4G
galloc: could not allocate memory.: Cannot allocate memory
[mahmood@rocks7 ~]$ head -n 4 test.gjf
%nprocshared=2
%mem=1GB
# mp2/gen pseudo=read opt freq

[mahmood@rocks7 ~]$ cat sl.sh
#!/bin/bash
#SBATCH --output=test.out
#SBATCH --job-name=ga-test
#SBATCH --nodelist=compute-0-1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=2GB
hostname
free -mh
g09 test.gjf
[mahmood@rocks7 ~]$
Regards,
Mahmood

Mahmood Naderan

Mar 14, 2018, 4:38:40 AM3/14/18
to Slurm User Community List
Hi again
I tried with --mem=2000M in the Slurm script and put the strace command in front of g09. Please see the last few lines:


fstat(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
fstat(0, {st_mode=S_IFREG|0664, st_size=6542, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc647a3f000
read(0, "%nprocshared=2\r\n%mem=1GB\r\n# mp2/"..., 8192) = 6542
lseek(3, 0, SEEK_CUR)                   = 0
fstat(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc647a3d000
fstat(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
lseek(3, 0, SEEK_SET)                   = 0
read(0, "", 8192)                       = 0
write(3, "%nprocshared=2\n%mem=1GB\n# mp2/ge"..., 5668) = 5668
close(3)                                = 0
munmap(0x7fc647a3d000, 8192)            = 0
geteuid()                               = 1000
stat("/usr/local/chem/g09-64-D01/l1.exe", {st_mode=S_IFREG|0751, st_size=1673376, ...}) = 0
write(1, " Entering Gaussian System, Link "..., 212) = 212
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7fc646f69270}, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN, [], SA_RESTORER, 0x7fc646f69270}, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7fffe75ed3b0) = 2818
wait4(2818, galloc:  could not allocate memory.: Cannot allocate memory
[{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 0, NULL) = 2818
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7fc646f69270}, NULL, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x7fc646f69270}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=2818, si_uid=1000, si_status=SIGSEGV, si_utime=9, si_stime=4} ---
access("/home/mahmood/Gaussian/scratch/Gau-2817.inp", F_OK) = 0
unlink("/home/mahmood/Gaussian/scratch/Gau-2817.inp") = 0
exit_group(1)                           = ?
+++ exited with 1 +++



I think that Slurm wrongly detects/sets the memory requirements. Maybe it puts a limit and therefore g09 is unable to allocate the required space. I say that because I can ssh directly to that node and run the program with no error.

Any idea?



Regards,
Mahmood



Chris Samuel

Mar 14, 2018, 5:46:32 AM3/14/18
to slurm...@lists.schedmd.com
On Wednesday, 14 March 2018 7:37:19 PM AEDT Mahmood Naderan wrote:

> I tried with --mem=2000M in the slurm script and put strace command in front
> of g09. Please see some last lines

Gaussian is trying to allocate more than 2GB of RAM in that case.

Unfortunately your strace doesn't show anything useful, as the failure is in a
process forked off from the shell script that is g09, and your strace
isn't following child processes.

If you put this in your script rather than the g09 command, what does it say?

ulimit -a
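Assuming GNU strace is available on the node, following children would capture the failing link's syscalls; a hedged tweak to the last line of sl.sh:

```shell
# -f follows forked children; -ff additionally writes one trace file per PID
# (g09.trace.<pid>), so the crashing link executable can be inspected alone.
strace -f -ff -o g09.trace g09 test.gjf
```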
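One thing to keep in mind when reading that output: ulimit -v and ulimit -m report kilobytes, while --mem takes megabytes by default, so a quick conversion sketch (the 2000 MB figure is just the request used above):

```shell
#!/bin/bash
# ulimit reports memory limits in KB; --mem=2000M is 2000 * 1024 KB,
# which is the number to compare against the "virtual memory" line.
req_mb=2000
req_kb=$((req_mb * 1024))
echo "requested: ${req_kb} KB"
```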

Also this code can be useful to find out how much memory Gaussian is actually
using, though you'll need root access (or a helpful sysadmin there) to be able
to use it.

https://github.com/gsauthof/cgmemtime

It uses a special cgroup (which it can set up for you) to track the
memory usage of the whole Gaussian run.

Mahmood Naderan

Mar 14, 2018, 6:15:45 AM3/14/18
to Slurm User Community List
>If you put this in your script rather than the g09 command what does it say?
>
>ulimit -a

That was a very good hint. I first ssh'ed to compute-0-1 and saw
"unlimited" for both "max memory size" and "virtual memory". Then I
submitted the job with --mem=2000M and put the command in the Slurm
script.

I then noticed that "max memory size" was about 2000000 while
"virtual memory" was about 2500000. I guessed that the problem was the
virtual memory the program wants to allocate. After setting
--mem=4000M, the error disappeared and the job is now running.
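That would be consistent with Slurm's VSizeFactor setting in slurm.conf, which (when set) makes the address-space limit a percentage of the requested memory; the numbers above suggest roughly 125%, though that is only a guess about this site's configuration. A sketch of the arithmetic under that assumption:

```shell
#!/bin/bash
# Hypothetical: with VSizeFactor=125 in slurm.conf, --mem=2000M would give
# an address-space (virtual memory) limit of 2000 * 1.25 = 2500 MB.
req_mb=2000
vsize_factor=125               # percent; illustrative, not read from the site
vmem_mb=$((req_mb * vsize_factor / 100))
echo "virtual memory limit: ${vmem_mb} MB"
```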

Thank you very much.

Regards,
Mahmood

Chris Samuel

Mar 14, 2018, 6:21:42 AM3/14/18
to slurm...@lists.schedmd.com
On Wednesday, 14 March 2018 9:14:45 PM AEDT Mahmood Naderan wrote:

> Thank you very much.

My pleasure, so glad it helped!