Hey everyone,
We've been running MongoDB in production for several months without any problems, but yesterday two servers in my replica set (the primary and the secondary) went down at almost the same time. Due to my own stupidity, my local monitoring system (monit) had cleared mongodb.log, so I wasn't able to check it for error messages. All I could find was this message logged by syslog:
mongodb main process (24395) terminated with status 100
According to the source (https://github.com/mongodb/mongo/blob/master/pch.h), status 100 means "top level exception that wasn't caught," but that doesn't help much.
Then it happened again this morning (after I had fixed the bug in my monitoring system), and this time I could see that no errors were being logged to mongodb.log at all. Combined with the fact that /var/lib/mongodb/mongod.lock was being left behind (preventing an automatic restart), it seemed clear that something was forcibly killing MongoDB before it could shut down cleanly. (The manual cleanup I end up doing is sketched below, after the dmesg output.) Sure enough, when I checked the dmesg output, I found several traces like this one, indicating that the system had run out of memory:
[869138.653688] mongod invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
[869138.653693] mongod cpuset=/ mems_allowed=0
[869138.653696] Pid: 16290, comm: mongod Not tainted 2.6.32-309-ec2 #18-Ubuntu
[869138.653697] Call Trace:
[869138.653705] [<ffffffff8107c84c>] ? cpuset_print_task_mems_allowed+0x8c/0xc0
[869138.653709] [<ffffffff810b1333>] oom_kill_process+0xe3/0x210
[869138.653712] [<ffffffff810b14b0>] __out_of_memory+0x50/0xb0
[869138.653714] [<ffffffff810b156f>] out_of_memory+0x5f/0xc0
[869138.653716] [<ffffffff810b4121>] __alloc_pages_slowpath+0x4c1/0x560
[869138.653719] [<ffffffff810b4331>] __alloc_pages_nodemask+0x171/0x180
[869138.653723] [<ffffffff810caf4d>] do_anonymous_page+0x21d/0x540
[869138.653725] [<ffffffff810cda57>] handle_mm_fault+0x427/0x4f0
[869138.653729] [<ffffffff814b2e47>] do_page_fault+0x147/0x390
[869138.653731] [<ffffffff814b0c78>] page_fault+0x28/0x30
[869138.653733] Mem-Info:
[869138.653735] DMA per-cpu:
[869138.653736] CPU 0: hi: 0, btch: 1 usd: 0
[869138.653738] CPU 1: hi: 0, btch: 1 usd: 0
[869138.653739] DMA32 per-cpu:
[869138.653740] CPU 0: hi: 155, btch: 38 usd: 37
[869138.653741] CPU 1: hi: 155, btch: 38 usd: 0
[869138.653742] Normal per-cpu:
[869138.653744] CPU 0: hi: 155, btch: 38 usd: 137
[869138.653745] CPU 1: hi: 155, btch: 38 usd: 85
[869138.653748] active_anon:3920995 inactive_anon:380998 isolated_anon:0
[869138.653749] active_file:59 inactive_file:41 isolated_file:0
[869138.653750] unevictable:0 dirty:0 writeback:0 unstable:0
[869138.653750] free:21458 slab_reclaimable:1460 slab_unreclaimable:2861
[869138.653751] mapped:7 shmem:53 pagetables:0 bounce:0
[869138.653756] DMA free:16384kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16160kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[869138.653759] lowmem_reserve[]: 0 4024 17244 17244
[869138.653764] DMA32 free:56704kB min:3920kB low:4900kB high:5880kB active_anon:3281788kB inactive_anon:283648kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:4120800kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:20kB slab_unreclaimable:160kB kernel_stack:120kB pagetables:6232kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[869138.653768] lowmem_reserve[]: 0 0 13220 13220
[869138.653773] Normal free:12744kB min:12880kB low:16100kB high:19320kB active_anon:12402192kB inactive_anon:1240344kB active_file:236kB inactive_file:164kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:13537928kB mlocked:0kB dirty:0kB writeback:0kB mapped:28kB shmem:212kB slab_reclaimable:5820kB slab_unreclaimable:11284kB kernel_stack:1520kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1439 all_unreclaimable? yes
[869138.653776] lowmem_reserve[]: 0 0 0 0
[869138.653778] DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 4*4096kB = 16384kB
[869138.653784] DMA32: 24*4kB 20*8kB 10*16kB 5*32kB 3*64kB 1*128kB 2*256kB 2*512kB 1*1024kB 2*2048kB 12*4096kB = 56704kB
[869138.653789] Normal: 1138*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB = 12744kB
[869138.653794] 150 total pagecache pages
[869138.653795] 0 pages in swap cache
[869138.653797] Swap cache stats: add 0, delete 0, find 0/0
[869138.653798] Free swap = 0kB
[869138.653799] Total swap = 0kB
[869138.692474] 4482048 pages RAM
[869138.692477] 139091 pages reserved
[869138.692478] 224 pages shared
[869138.692479] 4320518 pages non-shared
[869138.692482] Out of memory: kill process 24051 (mongod) score 1628456 or a child
[869138.692497] Killed process 24051 (mongod)
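
For reference, the cleanup I have to do by hand after each of these kills looks roughly like this (just a sketch of my steps, not an exact script; the paths are the stock 10gen package layout, and the repair pass is the usual precaution after an unclean shutdown on this version):

# remove the stale lock left behind by the unclean shutdown
sudo rm /var/lib/mongodb/mongod.lock
# run a repair pass as the mongodb user before bringing it back up
sudo -u mongodb mongod --dbpath /var/lib/mongodb --repair
# then restart the upstart job
sudo start mongodb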
However, I have several ways of keeping an eye on memory (ganglia,
sar, monit), and I've never seen free memory drop below 15GB on this
machine:
03:35:01 PM  kbmemfree  kbmemused  %memused  kbbuffers  kbcached  kbcommit  %commit
03:45:01 PM   15801808    2118420     11.82     125688   1290476    426084     2.38
03:55:01 PM   15799880    2120348     11.83     126788   1292016    424564     2.37
04:05:01 PM   15797752    2122476     11.84     127904   1293500    423644     2.36
04:15:01 PM   15791768    2128460     11.88     129020   1295632    427764     2.39
04:25:01 PM   15790272    2129956     11.89     130160   1297364    425196     2.37
04:35:01 PM   15787864    2132364     11.90     131272   1298988    424552     2.37
04:45:01 PM   15783272    2136956     11.92     132360   1300804    427280     2.38
04:55:01 PM   15782408    2137820     11.93     133492   1302672    423908     2.37
05:05:01 PM   15772864    2147364     11.98     134608   1304420    434696     2.43
05:15:01 PM   15765672    2154556     12.02     135688   1306552   1081920     6.04
05:25:01 PM   15762464    2157764     12.04     136868   1308684   1109748     6.19
05:35:01 PM   15759080    2161148     12.06     138032   1310828   1130232     6.31
05:45:01 PM   15753520    2166708     12.09     139216   1313192   1143504     6.38
05:55:01 PM   15748568    2171660     12.12     140428   1317464   1151548     6.43
06:05:01 PM   15745896    2174332     12.13     141544   1319148   1192332     6.65
06:15:01 PM   17091640     828588      4.62       8080    145080    803664     4.48
06:25:01 PM   17076272     843956      4.71       9268    164444    920244     5.14
Average:      16134760    1785468      9.96      76164   1019515    472754     2.64
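
One thing I notice is that sar only samples every ten minutes, so in theory a fast allocation spike could come and go between samples. To rule that out I'm going to log memory usage at a much shorter interval, roughly like this (a quick sketch; the log path is arbitrary):

# append a timestamped used/free snapshot every 5 seconds
while true; do
  echo "$(date '+%F %T') $(free -m | awk 'NR==2 {print "used=" $3 "MB free=" $4 "MB"}')" >> /tmp/mem-sampler.log
  sleep 5
done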
We had a theory that the resource limits might need to be raised, but I couldn't find anything wrong in the output of ulimit -a for the mongodb user:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 20
file size (blocks, -f) unlimited
pending signals (-i) 16382
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
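
That said, ulimit -a from a login shell doesn't necessarily match what upstart actually hands the daemon, so next time mongod is up I'll also check the limits on the live process directly:

# limits of the running mongod itself, rather than of a login shell
cat /proc/$(pidof mongod)/limits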
Does anyone have a clue as to what could be going on here? These
machines are m2.xlarge EC2 instances running Ubuntu 10.04.1 LTS with
the 10gen Ubuntu packages (version 20101028). Uname output is as
follows:
Linux mongodb05 2.6.32-309-ec2 #18-Ubuntu SMP Mon Oct 18 21:00:50 UTC
2010 x86_64 GNU/Linux
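
P.S. One other thing I noticed in the dmesg output above: "Total swap = 0kB", so these instances are running with no swap at all. I'm also going to double-check the kernel's overcommit settings (standard sysctls and /proc files, nothing MongoDB-specific):

# overcommit policy, and the commit limit vs. what's currently committed
sysctl vm.overcommit_memory vm.overcommit_ratio
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
# confirm swap really is disabled
swapon -s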