The system has 1GB of swap, and I just tried enabling:
sysctl -w vm.overcommit_memory=1
sysctl -w vm.swappiness=1
There's 64GB of RAM, and it is nowhere near exhausted at the time of the error:
top - 16:47:14 up 2 days, 4:35, 4 users, load average: 1.43, 1.35, 1.12
Tasks: 32606 total, 1 running, 531 sleeping, 0 stopped, 32074 zombie
Cpu(s): 0.7%us, 14.0%sy, 0.0%ni, 80.6%id, 3.3%wa, 0.0%hi, 1.4%si, 0.0%st
Mem: 66067872k total, 43270952k used, 22796920k free, 5684k buffers
Swap: 1023996k total, 332k used, 1023664k free, 26004700k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1401 root 20 0 898m 26m 2936 S 93.6 0.0 9:30.36 pdm
6494 root 20 0 38280 25m 880 R 20.1 0.0 1:20.34 top
6145 root 20 0 112m 5468 4428 S 0.0 0.0 0:00.05 sshd
29324 root 20 0 112m 5092 4056 S 0.0 0.0 0:00.60 sshd
2720 root 20 0 114m 3284 616 S 0.3 0.0 0:43.29 sshd
2377 nobody 20 0 152m 3148 668 S 0.0 0.0 0:06.24 gmond
9290 postfix 20 0 79868 2852 1996 S 0.0 0.0 0:00.01 pickup
29336 root 20 0 106m 1916 1432 S 0.0 0.0 0:00.03 bash
6155 root 20 0 106m 1888 1428 S 0.0 0.0 0:00.02 bash
2575 root 20 0 379m 1880 684 S 0.0 0.0 0:00.89 automount
2240 haldaemo 20 0 38224 1744 632 S 0.0 0.0 0:01.58 hald
30608 root 20 0 112m 1672 636 S 0.0 0.0 0:00.24 sshd
2724 root 20 0 106m 1416 912 S 0.0 0.0 0:00.31 bash
2467 postfix 20 0 80036 1404 496 S 0.0 0.0 0:00.18 qmgr
30634 root 20 0 106m 1376 868 S 0.0 0.0 0:00.13 bash
2456 root 20 0 80000 1352 460 S 0.0 0.0 0:00.74 master
1911 root 20 0 245m 1300 588 S 0.0 0.0 0:00.27 rsyslogd
2366 ntp 20 0 25440 1160 588 S 0.0 0.0 0:00.24 ntpd
2323 nscd 20 0 877m 1148 644 S 0.0 0.0 0:02.14 nscd
2357 root 20 0 90848 1052 392 S 0.0 0.0 0:00.01 sshd
1628 root 18 -2 11260 1048 232 S 0.0 0.0 0:00.00 udevd
2468 root 20 0 112m 980 392 S 0.0 0.0 0:00.40 crond
pprof shows around 2MB of memory in use the whole time.
I think the problem is somewhere in here, since it goes away when I disable this part of the app:
I also thought I might be creating too many goroutines too quickly, so I rewrote this part to use no goroutines or channels and return a plain slice instead. It made no difference:
What puzzles me is how long it takes to hit the error: very close to 10 minutes every time. It doesn't even depend on the number of workers (I have a setting for that and have tried 1 to 5 workers).
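
One thing that stands out in the top output above is the task summary: 32074 of 32606 tasks are zombies. To see whether that number climbs steadily over the 10 minutes before the error, something like the following could be run periodically alongside the app. This is only a minimal sketch, assuming the Linux /proc layout; `stateFromStat` and `countZombies` are names made up for illustration and are not part of my app.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// stateFromStat extracts the process state character from the contents of
// /proc/<pid>/stat. The command name is wrapped in parentheses and may itself
// contain spaces or ')', so the state is the first field after the last ')'.
func stateFromStat(stat string) (byte, bool) {
	i := strings.LastIndexByte(stat, ')')
	if i < 0 {
		return 0, false
	}
	fields := strings.Fields(stat[i+1:])
	if len(fields) == 0 || len(fields[0]) != 1 {
		return 0, false
	}
	return fields[0][0], true
}

// countZombies scans procDir (normally "/proc") and counts the processes
// whose state is 'Z' (zombie: exited but not yet waited for by the parent).
func countZombies(procDir string) (int, error) {
	paths, err := filepath.Glob(filepath.Join(procDir, "[0-9]*", "stat"))
	if err != nil {
		return 0, err
	}
	n := 0
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // the process went away between listing and reading
		}
		if s, ok := stateFromStat(string(data)); ok && s == 'Z' {
			n++
		}
	}
	return n, nil
}

func main() {
	n, err := countZombies("/proc")
	if err != nil {
		fmt.Fprintln(os.Stderr, "scan failed:", err)
		os.Exit(1)
	}
	fmt.Println("zombie processes:", n)
}
```

Running it every few seconds (for example under `watch`) and noting the count at the moment the error appears would show whether the failure lines up with a particular number of unreaped processes rather than with memory use.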