So it froze again during a long write, here is some data:
#from 'df -h'
Filesystem Size Used Avail Use% Mounted on
/dev/sdh 493G 163G 305G 35% /data/db
#from 'mount'
/dev/sdh on /data/db type ext4 (rw,noexec,noatime,nodiratime)
Here is the last minute of iostat before it froze:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
04/25/12 23:00:34
xvdh 0.00 1341.00 40.00 163.00 2884.00 12032.00
73.48 10.60 52.22 1.80 36.50
04/25/12 23:00:36
xvdh 0.00 1530.50 47.50 182.50 3392.00 13704.00
74.33 3.77 16.39 0.89 20.50
04/25/12 23:00:38
xvdh 0.50 1508.50 47.00 173.50 3332.00 12828.00
73.29 5.04 18.10 1.41 31.00
04/25/12 23:00:40
xvdh 0.00 1398.50 44.00 179.00 3200.00 13248.00
73.76 13.19 63.88 1.70 38.00
04/25/12 23:00:42
xvdh 0.50 2768.00 35.00 964.00 2436.00 27288.00
29.75 103.43 78.52 0.76 76.00
04/25/12 23:00:44
xvdh 0.00 2133.00 19.50 1189.00 1408.00 26332.00
22.95 168.51 154.36 0.82 99.50
04/25/12 23:00:46
xvdh 0.50 3856.50 13.00 1253.00 900.00 41860.00
33.78 165.07 132.86 0.79 100.50
04/25/12 23:00:48
xvdh 0.00 3376.50 23.50 1617.50 1668.00 38056.00
24.21 214.64 116.03 0.61 100.00
04/25/12 23:00:50
xvdh 0.00 2462.00 0.00 1488.50 0.00 32488.00
21.83 213.21 98.61 0.67 100.00
04/25/12 23:00:52
xvdh 0.00 3532.00 0.00 1760.50 0.00 39584.00
22.48 214.31 80.45 0.57 100.00
04/25/12 23:00:54
xvdh 0.00 3272.00 19.00 1492.00 1344.00 39852.00
27.26 180.90 216.66 0.66 100.00
04/25/12 23:00:56
xvdh 0.00 905.00 0.00 1896.50 0.00 23812.00
12.56 205.24 76.82 0.53 100.00
04/25/12 23:00:58
xvdh 0.00 884.00 0.00 1958.00 0.00 21888.00
11.18 204.25 72.38 0.51 100.00
04/25/12 23:01:00
xvdh 0.00 2091.00 27.50 1479.00 1924.00 31664.00
22.30 169.34 205.77 0.66 100.00
04/25/12 23:01:02
xvdh 0.00 4195.00 0.00 1331.00 0.00 39656.00
29.79 193.21 106.39 0.75 100.00
04/25/12 23:01:04
xvdh 0.00 4693.50 0.00 1032.00 0.00 46580.00
45.14 193.16 137.96 0.97 100.00
04/25/12 23:01:06
xvdh 0.00 3408.50 10.00 1128.00 704.00 35644.00
31.94 180.10 240.06 0.88 100.00
04/25/12 23:01:08
xvdh 0.00 3595.00 0.00 1351.50 0.00 41744.00
30.89 193.31 107.53 0.74 100.00
04/25/12 23:01:10
xvdh 0.00 2204.50 0.00 1739.00 0.00 31772.00
18.27 193.03 82.01 0.58 100.00
04/25/12 23:01:12
xvdh 0.00 1501.50 17.50 1561.00 1280.00 23848.00
15.92 176.93 174.64 0.63 100.00
04/25/12 23:01:14
xvdh 0.00 2554.00 0.00 1431.50 0.00 31704.00
22.15 207.59 98.20 0.70 100.00
04/25/12 23:01:16
xvdh 0.00 617.00 0.00 1667.50 0.00 18476.00
11.08 209.16 86.22 0.60 100.00
04/25/12 23:01:18
xvdh 0.00 1638.00 22.50 1634.50 1600.00 27200.00
17.38 180.82 189.91 0.60 100.00
04/25/12 23:01:20
xvdh 0.00 1703.50 4.50 1353.00 320.00 22268.00
16.64 205.88 112.59 0.74 100.00
04/25/12 23:01:22
xvdh 0.00 4077.00 0.00 1333.00 0.00 43744.00
32.82 213.58 107.12 0.75 100.00
04/25/12 23:01:24
xvdh 0.00 1171.00 1.00 1503.50 64.00 24420.00
16.27 238.37 220.66 0.66 100.00
04/25/12 23:01:26
xvdh 0.00 386.50 0.00 1982.50 0.00 19132.00
9.65 301.59 73.25 0.50 100.00
04/25/12 23:01:28
xvdh 0.50 1261.50 23.50 1246.00 1668.00 17348.00
14.98 230.70 333.32 0.79 100.00
04/25/12 23:01:30
xvdh 0.00 370.50 0.00 1938.50 0.00 18564.00
9.58 213.29 72.86 0.52 100.00
04/25/12 23:01:32
xvdh 0.00 73.00 0.00 215.00 0.00 2240.00
10.42 203.50 70.86 4.65 100.00
04/25/12 23:01:34
xvdh 0.00 0.00 0.00 0.00 0.00 0.00
0.00 202.00 0.00 0.00 100.00
Then for the next half hour (before I finally rebooted the machine),
it stayed at 202 avgqu-sz and 100 %util.
Here is the same time period of mongostat:
3823 0 0 0 0 1 0 148g 296g
3.87g 12 56.4 0 0|0 0|1 813k 1k
8 23:00:34
4235 0 0 0 0 1 0 148g 296g
3.89g 14 84.8 0 0|0 0|1 899k 1k
8 23:00:35
4087 0 0 0 0 1 0 148g 296g
3.89g 14 75.9 0 0|0 0|1 870k 1k
8 23:00:36
4386 0 0 0 0 1 0 148g 296g
3.9g 16 70.3 0 0|0 0|1 930k 1k
8 23:00:37
4114 0 0 0 0 1 0 148g 296g
3.89g 16 74.6 0 0|0 0|1 874k 1k
8 23:00:38
insert query update delete getmore command flushes mapped vsize
res faults locked % idx miss % qr|qw ar|aw netIn netOut
conn time
3347 0 0 0 0 1 0 148g 296g
3.9g 11 80.1 0 0|0 0|1 710k 1k
8 23:00:39
3688 0 0 0 0 1 0 148g 296g
3.83g 12 76.9 0 0|0 0|1 787k 1k
8 23:00:40
3476 0 0 0 0 1 0 148g 296g
3.84g 14 83.5 0 0|0 0|1 738k 1k
8 23:00:41
4049 0 0 0 0 1 0 148g 296g
3.93g 17 91.5 0 0|1 0|1 858k 1k
8 23:00:42
262 0 0 0 0 1 0 148g 296g
3.84g 0 15.3 0 0|0 0|1 54k 1k
8 23:00:43
2148 0 0 0 0 1 0 148g 296g
3.89g 9 91.7 0 0|0 0|1 456k 1k
8 23:00:44
1481 0 0 0 0 1 0 148g 296g
3.92g 4 54.6 0 0|1 0|1 314k 1k
8 23:00:45
1375 0 0 0 0 1 0 148g 296g
3.87g 5 31.1 0 0|1 0|1 291k 1k
8 23:00:46
2425 0 0 0 0 1 0 148g 296g
3.89g 9 64.8 0 0|0 0|1 515k 1k
8 23:00:47
1710 0 0 0 0 1 0 148g 296g
3.93g 8 65.6 0 0|0 0|1 362k 1k
8 23:00:48
insert query update delete getmore command flushes mapped vsize
res faults locked % idx miss % qr|qw ar|aw netIn netOut
conn time
0 0 0 0 0 2 0 148g 296g
3.93g 0 0 0 1|1 1|1 62b 1k
8 23:00:49
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 2|1 1|1 62b 1k
8 23:00:50
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 2|1 1|1 62b 1k
8 23:00:51
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 2|1 1|1 62b 1k
8 23:00:52
1432 0 0 0 0 1 0 148g 296g
3.87g 5 554 0 0|0 0|1 303k 1k
8 23:00:53
2029 0 0 0 0 1 0 148g 296g
3.91g 8 59.8 0 0|1 0|1 430k 1k
8 23:00:54
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 0|1 0|1 62b 1k
8 23:00:55
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 0|1 0|1 62b 1k
8 23:00:56
0 0 0 0 0 2 0 148g 296g
3.91g 0 0 0 2|1 1|1 62b 1k
8 23:00:57
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 2|1 1|1 62b 1k
8 23:00:58
insert query update delete getmore command flushes mapped vsize
res faults locked % idx miss % qr|qw ar|aw netIn netOut
conn time
369 0 0 0 0 1 0 148g 296g
3.85g 1 11.2 0 0|1 0|1 78k 1k
8 23:00:59
3200 0 0 0 0 1 0 148g 296g
3.92g 14 89 0 0|0 0|1 678k 1k
8 23:01:00
873 0 0 0 0 1 0 148g 296g
3.94g 3 18.9 0 0|1 0|1 185k 1k
8 23:01:01
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 0|1 0|1 62b 1k
8 23:01:02
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 1|1 0|1 62b 1k
8 23:01:03
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 1|1 0|1 62b 1k
8 23:01:04
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 1|1 0|1 62b 1k
8 23:01:05
1267 0 0 0 0 1 0 148g 296g
3.87g 7 43.7 0 0|1 0|1 266k 1k
8 23:01:06
0 0 0 0 0 1 0 148g 296g
3.87g 0 0 0 0|1 0|1 62b 1k
8 23:01:07
0 0 0 0 0 1 0 148g 296g
3.87g 0 0 0 0|1 0|1 62b 1k
8 23:01:08
insert query update delete getmore command flushes mapped vsize
res faults locked % idx miss % qr|qw ar|aw netIn netOut
conn time
0 0 0 0 0 1 0 148g 296g
3.87g 0 0 0 0|1 0|1 62b 1k
8 23:01:09
0 0 0 0 0 1 0 148g 296g
3.87g 0 0 0 1|1 0|1 62b 1k
8 23:01:10
0 0 0 0 0 1 0 148g 296g
3.87g 0 0 0 1|1 0|1 62b 1k
8 23:01:11
3262 0 0 0 0 2 0 148g 296g
3.91g 10 72 0 0|0 0|1 690k 1k
8 23:01:12
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 0|1 0|1 62b 1k
8 23:01:13
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 0|1 0|1 62b 1k
8 23:01:14
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 0|1 0|1 62b 1k
8 23:01:15
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 1|1 0|1 62b 1k
8 23:01:16
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 1|1 0|1 62b 1k
8 23:01:17
3233 0 0 0 0 1 0 148g 296g
3.92g 12 608 0 0|1 0|1 686k 1k
8 23:01:18
insert query update delete getmore command flushes mapped vsize
res faults locked % idx miss % qr|qw ar|aw netIn netOut
conn time
1469 0 0 0 0 1 0 148g 296g
3.94g 6 35.5 0 0|0 0|1 312k 1k
8 23:01:19
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 0|1 0|1 62b 1k
8 23:01:20
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 0|1 0|1 62b 1k
8 23:01:21
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 1|1 0|1 62b 1k
8 23:01:22
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 1|1 0|1 62b 1k
8 23:01:23
549 0 0 0 0 1 0 148g 296g
3.86g 2 544 0 0|0 0|1 116k 1k
8 23:01:27
2408 0 0 0 0 1 0 148g 296g
3.91g 10 98.2 0 1|0 0|1 511k 1k
8 23:01:28
940 0 0 0 0 1 0 148g 296g
3.93g 4 44 0 0|1 0|1 199k 1k
8 23:01:29
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 0|1 0|1 62b 1k
8 23:01:30
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 0|1 0|1 62b 1k
8 23:01:31
insert query update delete getmore command flushes mapped vsize
res faults locked % idx miss % qr|qw ar|aw netIn netOut
conn time
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 1|1 0|1 62b 1k
8 23:01:32
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 1|1 0|1 62b 1k
8 23:01:33
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 1|1 0|1 62b 1k
8 23:01:34
You mentioned iostat await times - I see mostly 50-200ms in the recent
logs, but infrequent spikes of 7000-8000ms. The mongostat ar|aw
numbers are mostly 0 and 1 with a handful of 2s, so I don't think I'm
reproducing your issue, unfortunately.
I have 16 ops in currentOp() , most of them are like this (a query of
the 'files' collection from my monitoring solution)
{
"opid" : 7121854,
"active" : false,
"lockType" : "read",
"waitingForLock" : true,
"op" : "query",
"ns" : "",
"query" : {
"count" : "files",
"query" : {
}
},
"client" : "AA:41646",
"desc" : "conn",
"threadId" : "0x7f12635cc710",
"connectionId" : 1,
"numYields" : 0
},
All are waiting for a lock, none are active, and one is the write from
my updating client.
At this point, I'm just going to start looking at another database
solution because this is
a giant pain in the ass. Anyone have experience migrating to
SimpleDB? :)
Cheers,