"cannot allocate memory" problem

Dmitry Mishin

Jun 1, 2017, 8:41:10 PM

to golang-nuts

Hello all,

I'm trying to fix an out-of-memory issue in my Go app.

The app runs fine for 10-15 minutes with no increase in memory usage, then the system suddenly starts returning a "cannot allocate memory" error for any command, and the app hits the same error.

I attached the trace and heap dumps taken during the out-of-memory state.

The app finds files and folders in a Lustre mount, submits them as AMQP messages, and copies the discovered files.

Thanks for your help!

trace.txt

heap.txt

Dave Cheney

Jun 1, 2017, 8:54:51 PM

to golang-nuts

Does the machine (VM, container, etc.) you are running this application on have any swap configured?

What I think is happening is that the momentary spike in potential memory usage during the fork/exec cycle is causing the Linux memory system to refuse to permit the fork. There are several open issues for this (search for clone or vfork on the github.com/golang/go project), but the short version is: add swap.

Dmitry Mishin

Jun 1, 2017, 9:13:53 PM

to golang-nuts

The system has 1GB of swap, and I just tried setting:

sysctl -w vm.overcommit_memory=1

sysctl -w vm.swappiness=1

as suggested in https://stackoverflow.com/questions/35025338/cannot-allocate-memory-error, with no effect.

There's 64GB of RAM, which is not fully used at the time of the error:

top - 16:47:14 up 2 days,  4:35,  4 users,  load average: 1.43, 1.35, 1.12
Tasks: 32606 total,   1 running, 531 sleeping,   0 stopped, 32074 zombie
Cpu(s):  0.7%us, 14.0%sy,  0.0%ni, 80.6%id,  3.3%wa,  0.0%hi,  1.4%si,  0.0%st
Mem:  66067872k total, 43270952k used, 22796920k free,     5684k buffers
Swap:  1023996k total,      332k used,  1023664k free, 26004700k cached


  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1401 root      20   0  898m  26m 2936 S 93.6  0.0   9:30.36 pdm
 6494 root      20   0 38280  25m  880 R 20.1  0.0   1:20.34 top
 6145 root      20   0  112m 5468 4428 S  0.0  0.0   0:00.05 sshd
29324 root      20   0  112m 5092 4056 S  0.0  0.0   0:00.60 sshd
 2720 root      20   0  114m 3284  616 S  0.3  0.0   0:43.29 sshd
 2377 nobody    20   0  152m 3148  668 S  0.0  0.0   0:06.24 gmond
 9290 postfix   20   0 79868 2852 1996 S  0.0  0.0   0:00.01 pickup
29336 root      20   0  106m 1916 1432 S  0.0  0.0   0:00.03 bash
 6155 root      20   0  106m 1888 1428 S  0.0  0.0   0:00.02 bash
 2575 root      20   0  379m 1880  684 S  0.0  0.0   0:00.89 automount
 2240 haldaemo  20   0 38224 1744  632 S  0.0  0.0   0:01.58 hald
30608 root      20   0  112m 1672  636 S  0.0  0.0   0:00.24 sshd
 2724 root      20   0  106m 1416  912 S  0.0  0.0   0:00.31 bash
 2467 postfix   20   0 80036 1404  496 S  0.0  0.0   0:00.18 qmgr
30634 root      20   0  106m 1376  868 S  0.0  0.0   0:00.13 bash
 2456 root      20   0 80000 1352  460 S  0.0  0.0   0:00.74 master
 1911 root      20   0  245m 1300  588 S  0.0  0.0   0:00.27 rsyslogd
 2366 ntp       20   0 25440 1160  588 S  0.0  0.0   0:00.24 ntpd
 2323 nscd      20   0  877m 1148  644 S  0.0  0.0   0:02.14 nscd
 2357 root      20   0 90848 1052  392 S  0.0  0.0   0:00.01 sshd
 1628 root      18  -2 11260 1048  232 S  0.0  0.0   0:00.00 udevd
 2468 root      20   0  112m  980  392 S  0.0  0.0   0:00.40 crond

pprof shows around 2MB of memory in use the whole time.

I think the problem is somewhere here, since it goes away when I disable this part of the app:

https://github.com/sdsc/pdm/blob/3b5c7fcef24e9081f3bd0608efde1bdc10a65d17/lustre_backend.go#L73

I also thought I was creating too many goroutines too quickly, so I rewrote this part to use no goroutines or channels and to return a simple slice, but that didn't help:

https://github.com/sdsc/pdm/blob/master/lustre_backend.go#L73

What puzzles me is the time it takes to hit the error: very close to 10 minutes every time. It doesn't even depend on the number of workers (I have a setting for that and have tried 1-5 workers).

Dmitry Mishin

Jun 1, 2017, 9:24:48 PM

to golang-nuts

My current settings:

[root@mover-7-2 pdm]# sysctl -a | grep overco

vm.overcommit_memory = 1

vm.overcommit_ratio = 50

vm.overcommit_kbytes = 0

vm.nr_overcommit_hugepages = 0

So, as I understand it, the app can use up to ~32GB, which should be enough for everything?

Dmitry Mishin

Jun 1, 2017, 9:49:31 PM

to golang-nuts

I was wrong about the number of folder read workers. Adding more workers makes the error happen sooner.

Dave Cheney

Jun 1, 2017, 11:09:14 PM

to golang-nuts

IMO this is not a problem with the amount of memory in use inside your Go program, but with Linux's incorrect accounting.

Dmitry Mishin

Jun 2, 2017, 12:25:51 AM

to golang-nuts

I have the same impression...

I added a 300 ms sleep before each folder read task. I'll see how it goes; maybe that will let the OS recover whatever resources it's missing...

Thanks for your help!

Dmitry Mishin

Jun 4, 2017, 10:47:56 PM

to golang-nuts

I figured it out. My exec.Command calls were leaving a bunch of zombie processes. I had to call cmd.Wait() on them so they exit cleanly.
