Mongo 2.0.1 on 2.6.35 kernel hangs while writing

134 views
Skip to first unread message

Ed Norris

unread,
Apr 19, 2012, 9:05:16 AM4/19/12
to mongodb-user
This is a really simple setup - one database and one client

Server:
Mongo 2.0.1
Linux 2.6.35 x64
Amazon image - ubuntu-images/ubuntu-maverick-10.10-amd64-
server-20101007.1 (ami-548c783d)
8 GB RAM
plenty of disk space

Client:
using mongo-2.7.2.jar

Symptoms:
At some point during a long update (basically dumping several gigs of
data over a few hours), Mongo stops responding - the client actions do
not fault, just hang. This happens every few months for no apparent
reason and I end up having to reboot the server.

From a terminal session on the server while mongo is hung, I see that
the shell freezes on some actions
"ps -A" works but "ps -Af" freezes
"ls -al" in home dir works, "ls -al" in /data/db freezes

Then I open a mongo client from the shell in verbose mode:
~/mongodb-linux-x86_64-2.0.1/bin$ ./mongo --verbose admin
MongoDB shell version: 2.0.1
Thu Apr 19 01:15:54 versionCmpTest passed
Thu Apr 19 01:15:54 versionArrayTest passed
connecting to: admin
Thu Apr 19 01:15:54 creating new connection to:127.0.0.1
Thu Apr 19 01:15:54 BackgroundJob starting: ConnectBG
Thu Apr 19 01:15:54 connected connection!
[nothing]

At this point, I bounce the server, restart Mongo & let it recover so
that our service has minimal downtime.

Any suggestions? What other configuration information will help
diagnose what the heck is going on?

Thanks!

Scott Hernandez

unread,
Apr 19, 2012, 9:53:53 AM4/19/12
to mongod...@googlegroups.com
Where is mongod logging?

What does /data/db map to? What partition/mount is it on?

> --
> You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> To post to this group, send email to mongod...@googlegroups.com.
> To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
>

Ed Norris

unread,
Apr 19, 2012, 10:42:36 AM4/19/12
to mongodb-user
mongod is logging to a file in the same directory as mongod (it will
get pretty big, multiple gigs, but in this most recent case it was
about 150megs and there was plenty of free space)

/data/db maps to a different volume, my fstab entry is:
/dev/sdg /data/db ext3 defaults 0 0

/dev/sdg is a 500 GB disk that has 305 GB free right now

(and I'm setting up another server with 2.0.4 at the moment, but who
knows when that will make it into production)

Ed

Scott Hernandez

unread,
Apr 19, 2012, 11:19:52 AM4/19/12
to mongod...@googlegroups.com
Where are the mongod files stored?

Does /dev/sdg respond at all?

Ed Norris

unread,
Apr 19, 2012, 11:55:19 AM4/19/12
to mongodb-user
the mongod files are stored on my boot volume. I just extract the
mongo files to a subdir off of my home directory and run it from
there.

That's a good question about /dev/sdg responding - I'll have to check
that out next time it happens (I've already restored mongo and am re-
running the data dump that failed last night)

Ross Dickey

unread,
Apr 19, 2012, 12:14:57 PM4/19/12
to mongodb-user
I've personally seen this symptom a few times, for 2 different
reasons. I see that you're on ubuntu. put these two files in /etc/
init:

# Ubuntu upstart file at /etc/init/iostat.conf

pre-start script
mkdir -p /mnt/log/
chmod 1777 /mnt/log
touch /mnt/log/iostat.log
end script

start on runlevel [2345]
stop on runlevel [06]

respawn

script
exec start-stop-daemon --make-pidfile --pidfile /var/run/
iostat.pid --start --startas /usr/bin/iostat -- -txd 2 >> /mnt/log/
iostat.log
end script

# Ubuntu upstart file at /etc/init/mongostat.conf

pre-start script
mkdir -p /mnt/log/
chmod 1777 /mnt/log
end script

start on runlevel [2345]
stop on runlevel [06]

respawn

script
exec start-stop-daemon --make-pidfile --pidfile /var/run/
mongostat.pid --start --startas /data/bin/mongostat --chuid mongo -- --
port 27018 >> /mnt/log/mongostat.log
end script

Change the log locations as you see fit. I'm on amazon so it's nice
to use the ephemeral store for just logs, and put all real data on
ebs. If that means nothing to you: just log to a volume where your
mongo data isn't. Run them wherever you have a mongod (standalone or
rep server, whichever you use. If standalone you'll want port 27017)

You'll probably see one of two things, the next time the slow write
happens:
In the iostat log, if you have a slow volume, check the await (average
time spent waiting for a write to complete). That number is in
milliseconds. If you're seeing thousands, it's the disk's fault. All
you can do is wait and hope it fixes itself, or replace the disk.
In the mongostat log, check the queues (qr|qw is queued read/write, ar|
aw is active read/write). If those numbers are in the tens/hundreds
and not moving for tens or hundreds of seconds, and the iostat for the
same time period shows almost all 0's for your mongod volume, then you
may have hit a bug I've run into. do a db.currentOp() and capture the
output, then restart the mongod. It should come back.

I have a few CS support tickets for the stuck queue thing, and
honestly can't prove it's actually a mongo bug. If you see it, it
might be worth opening a public SERVER bug with the currentOp output
attached. That'd prove its not just me :)

hth.

Ed Norris

unread,
Apr 19, 2012, 3:12:40 PM4/19/12
to mongodb-user
I'll try that out, thanks

Ed Norris

unread,
Apr 25, 2012, 9:03:38 PM4/25/12
to mongodb-user
So it froze again during a long write, here is some data:

#from 'df -h'
Filesystem Size Used Avail Use% Mounted on
/dev/sdh 493G 163G 305G 35% /data/db

#from 'mount'
/dev/sdh on /data/db type ext4 (rw,noexec,noatime,nodiratime)

Here is the last minute of iostat before it froze:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
04/25/12 23:00:34
xvdh 0.00 1341.00 40.00 163.00 2884.00 12032.00
73.48 10.60 52.22 1.80 36.50
04/25/12 23:00:36
xvdh 0.00 1530.50 47.50 182.50 3392.00 13704.00
74.33 3.77 16.39 0.89 20.50
04/25/12 23:00:38
xvdh 0.50 1508.50 47.00 173.50 3332.00 12828.00
73.29 5.04 18.10 1.41 31.00
04/25/12 23:00:40
xvdh 0.00 1398.50 44.00 179.00 3200.00 13248.00
73.76 13.19 63.88 1.70 38.00
04/25/12 23:00:42
xvdh 0.50 2768.00 35.00 964.00 2436.00 27288.00
29.75 103.43 78.52 0.76 76.00
04/25/12 23:00:44
xvdh 0.00 2133.00 19.50 1189.00 1408.00 26332.00
22.95 168.51 154.36 0.82 99.50
04/25/12 23:00:46
xvdh 0.50 3856.50 13.00 1253.00 900.00 41860.00
33.78 165.07 132.86 0.79 100.50
04/25/12 23:00:48
xvdh 0.00 3376.50 23.50 1617.50 1668.00 38056.00
24.21 214.64 116.03 0.61 100.00
04/25/12 23:00:50
xvdh 0.00 2462.00 0.00 1488.50 0.00 32488.00
21.83 213.21 98.61 0.67 100.00
04/25/12 23:00:52
xvdh 0.00 3532.00 0.00 1760.50 0.00 39584.00
22.48 214.31 80.45 0.57 100.00
04/25/12 23:00:54
xvdh 0.00 3272.00 19.00 1492.00 1344.00 39852.00
27.26 180.90 216.66 0.66 100.00
04/25/12 23:00:56
xvdh 0.00 905.00 0.00 1896.50 0.00 23812.00
12.56 205.24 76.82 0.53 100.00
04/25/12 23:00:58
xvdh 0.00 884.00 0.00 1958.00 0.00 21888.00
11.18 204.25 72.38 0.51 100.00
04/25/12 23:01:00
xvdh 0.00 2091.00 27.50 1479.00 1924.00 31664.00
22.30 169.34 205.77 0.66 100.00
04/25/12 23:01:02
xvdh 0.00 4195.00 0.00 1331.00 0.00 39656.00
29.79 193.21 106.39 0.75 100.00
04/25/12 23:01:04
xvdh 0.00 4693.50 0.00 1032.00 0.00 46580.00
45.14 193.16 137.96 0.97 100.00
04/25/12 23:01:06
xvdh 0.00 3408.50 10.00 1128.00 704.00 35644.00
31.94 180.10 240.06 0.88 100.00
04/25/12 23:01:08
xvdh 0.00 3595.00 0.00 1351.50 0.00 41744.00
30.89 193.31 107.53 0.74 100.00
04/25/12 23:01:10
xvdh 0.00 2204.50 0.00 1739.00 0.00 31772.00
18.27 193.03 82.01 0.58 100.00
04/25/12 23:01:12
xvdh 0.00 1501.50 17.50 1561.00 1280.00 23848.00
15.92 176.93 174.64 0.63 100.00
04/25/12 23:01:14
xvdh 0.00 2554.00 0.00 1431.50 0.00 31704.00
22.15 207.59 98.20 0.70 100.00
04/25/12 23:01:16
xvdh 0.00 617.00 0.00 1667.50 0.00 18476.00
11.08 209.16 86.22 0.60 100.00
04/25/12 23:01:18
xvdh 0.00 1638.00 22.50 1634.50 1600.00 27200.00
17.38 180.82 189.91 0.60 100.00
04/25/12 23:01:20
xvdh 0.00 1703.50 4.50 1353.00 320.00 22268.00
16.64 205.88 112.59 0.74 100.00
04/25/12 23:01:22
xvdh 0.00 4077.00 0.00 1333.00 0.00 43744.00
32.82 213.58 107.12 0.75 100.00
04/25/12 23:01:24
xvdh 0.00 1171.00 1.00 1503.50 64.00 24420.00
16.27 238.37 220.66 0.66 100.00
04/25/12 23:01:26
xvdh 0.00 386.50 0.00 1982.50 0.00 19132.00
9.65 301.59 73.25 0.50 100.00
04/25/12 23:01:28
xvdh 0.50 1261.50 23.50 1246.00 1668.00 17348.00
14.98 230.70 333.32 0.79 100.00
04/25/12 23:01:30
xvdh 0.00 370.50 0.00 1938.50 0.00 18564.00
9.58 213.29 72.86 0.52 100.00
04/25/12 23:01:32
xvdh 0.00 73.00 0.00 215.00 0.00 2240.00
10.42 203.50 70.86 4.65 100.00
04/25/12 23:01:34
xvdh 0.00 0.00 0.00 0.00 0.00 0.00
0.00 202.00 0.00 0.00 100.00

Then for the next half hour (before I finally rebooted the machine),
it stayed at 202 avgqu-sz and 100 %util.

Here is the same time period of mongostat:
3823 0 0 0 0 1 0 148g 296g
3.87g 12 56.4 0 0|0 0|1 813k 1k
8 23:00:34
4235 0 0 0 0 1 0 148g 296g
3.89g 14 84.8 0 0|0 0|1 899k 1k
8 23:00:35
4087 0 0 0 0 1 0 148g 296g
3.89g 14 75.9 0 0|0 0|1 870k 1k
8 23:00:36
4386 0 0 0 0 1 0 148g 296g
3.9g 16 70.3 0 0|0 0|1 930k 1k
8 23:00:37
4114 0 0 0 0 1 0 148g 296g
3.89g 16 74.6 0 0|0 0|1 874k 1k
8 23:00:38
insert query update delete getmore command flushes mapped vsize
res faults locked % idx miss % qr|qw ar|aw netIn netOut
conn time
3347 0 0 0 0 1 0 148g 296g
3.9g 11 80.1 0 0|0 0|1 710k 1k
8 23:00:39
3688 0 0 0 0 1 0 148g 296g
3.83g 12 76.9 0 0|0 0|1 787k 1k
8 23:00:40
3476 0 0 0 0 1 0 148g 296g
3.84g 14 83.5 0 0|0 0|1 738k 1k
8 23:00:41
4049 0 0 0 0 1 0 148g 296g
3.93g 17 91.5 0 0|1 0|1 858k 1k
8 23:00:42
262 0 0 0 0 1 0 148g 296g
3.84g 0 15.3 0 0|0 0|1 54k 1k
8 23:00:43
2148 0 0 0 0 1 0 148g 296g
3.89g 9 91.7 0 0|0 0|1 456k 1k
8 23:00:44
1481 0 0 0 0 1 0 148g 296g
3.92g 4 54.6 0 0|1 0|1 314k 1k
8 23:00:45
1375 0 0 0 0 1 0 148g 296g
3.87g 5 31.1 0 0|1 0|1 291k 1k
8 23:00:46
2425 0 0 0 0 1 0 148g 296g
3.89g 9 64.8 0 0|0 0|1 515k 1k
8 23:00:47
1710 0 0 0 0 1 0 148g 296g
3.93g 8 65.6 0 0|0 0|1 362k 1k
8 23:00:48
insert query update delete getmore command flushes mapped vsize
res faults locked % idx miss % qr|qw ar|aw netIn netOut
conn time
0 0 0 0 0 2 0 148g 296g
3.93g 0 0 0 1|1 1|1 62b 1k
8 23:00:49
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 2|1 1|1 62b 1k
8 23:00:50
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 2|1 1|1 62b 1k
8 23:00:51
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 2|1 1|1 62b 1k
8 23:00:52
1432 0 0 0 0 1 0 148g 296g
3.87g 5 554 0 0|0 0|1 303k 1k
8 23:00:53
2029 0 0 0 0 1 0 148g 296g
3.91g 8 59.8 0 0|1 0|1 430k 1k
8 23:00:54
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 0|1 0|1 62b 1k
8 23:00:55
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 0|1 0|1 62b 1k
8 23:00:56
0 0 0 0 0 2 0 148g 296g
3.91g 0 0 0 2|1 1|1 62b 1k
8 23:00:57
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 2|1 1|1 62b 1k
8 23:00:58
insert query update delete getmore command flushes mapped vsize
res faults locked % idx miss % qr|qw ar|aw netIn netOut
conn time
369 0 0 0 0 1 0 148g 296g
3.85g 1 11.2 0 0|1 0|1 78k 1k
8 23:00:59
3200 0 0 0 0 1 0 148g 296g
3.92g 14 89 0 0|0 0|1 678k 1k
8 23:01:00
873 0 0 0 0 1 0 148g 296g
3.94g 3 18.9 0 0|1 0|1 185k 1k
8 23:01:01
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 0|1 0|1 62b 1k
8 23:01:02
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 1|1 0|1 62b 1k
8 23:01:03
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 1|1 0|1 62b 1k
8 23:01:04
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 1|1 0|1 62b 1k
8 23:01:05
1267 0 0 0 0 1 0 148g 296g
3.87g 7 43.7 0 0|1 0|1 266k 1k
8 23:01:06
0 0 0 0 0 1 0 148g 296g
3.87g 0 0 0 0|1 0|1 62b 1k
8 23:01:07
0 0 0 0 0 1 0 148g 296g
3.87g 0 0 0 0|1 0|1 62b 1k
8 23:01:08
insert query update delete getmore command flushes mapped vsize
res faults locked % idx miss % qr|qw ar|aw netIn netOut
conn time
0 0 0 0 0 1 0 148g 296g
3.87g 0 0 0 0|1 0|1 62b 1k
8 23:01:09
0 0 0 0 0 1 0 148g 296g
3.87g 0 0 0 1|1 0|1 62b 1k
8 23:01:10
0 0 0 0 0 1 0 148g 296g
3.87g 0 0 0 1|1 0|1 62b 1k
8 23:01:11
3262 0 0 0 0 2 0 148g 296g
3.91g 10 72 0 0|0 0|1 690k 1k
8 23:01:12
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 0|1 0|1 62b 1k
8 23:01:13
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 0|1 0|1 62b 1k
8 23:01:14
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 0|1 0|1 62b 1k
8 23:01:15
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 1|1 0|1 62b 1k
8 23:01:16
0 0 0 0 0 1 0 148g 296g
3.91g 0 0 0 1|1 0|1 62b 1k
8 23:01:17
3233 0 0 0 0 1 0 148g 296g
3.92g 12 608 0 0|1 0|1 686k 1k
8 23:01:18
insert query update delete getmore command flushes mapped vsize
res faults locked % idx miss % qr|qw ar|aw netIn netOut
conn time
1469 0 0 0 0 1 0 148g 296g
3.94g 6 35.5 0 0|0 0|1 312k 1k
8 23:01:19
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 0|1 0|1 62b 1k
8 23:01:20
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 0|1 0|1 62b 1k
8 23:01:21
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 1|1 0|1 62b 1k
8 23:01:22
0 0 0 0 0 1 0 148g 296g
3.94g 0 0 0 1|1 0|1 62b 1k
8 23:01:23
549 0 0 0 0 1 0 148g 296g
3.86g 2 544 0 0|0 0|1 116k 1k
8 23:01:27
2408 0 0 0 0 1 0 148g 296g
3.91g 10 98.2 0 1|0 0|1 511k 1k
8 23:01:28
940 0 0 0 0 1 0 148g 296g
3.93g 4 44 0 0|1 0|1 199k 1k
8 23:01:29
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 0|1 0|1 62b 1k
8 23:01:30
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 0|1 0|1 62b 1k
8 23:01:31
insert query update delete getmore command flushes mapped vsize
res faults locked % idx miss % qr|qw ar|aw netIn netOut
conn time
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 1|1 0|1 62b 1k
8 23:01:32
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 1|1 0|1 62b 1k
8 23:01:33
0 0 0 0 0 1 0 148g 296g
3.93g 0 0 0 1|1 0|1 62b 1k
8 23:01:34

You mentioned iostat await times - I see mostly 50-200ms in the recent
logs, but infrequent spikes of 7000-8000ms. The mongostat ar|aw
numbers are mostly 0 and 1 with a handful of 2s, so I don't think I'm
reproducing your issue, unfortunately.

I have 16 ops in currentOp() , most of them are like this (a query of
the 'files' collection from my monitoring solution)
{
"opid" : 7121854,
"active" : false,
"lockType" : "read",
"waitingForLock" : true,
"op" : "query",
"ns" : "",
"query" : {
"count" : "files",
"query" : {

}
},
"client" : "AA:41646",
"desc" : "conn",
"threadId" : "0x7f12635cc710",
"connectionId" : 1,
"numYields" : 0
},

All are waiting for a lock, none are active, and one is the write from
my updating client.

At this point, I'm just going to start looking at another database
solution because this is
a giant pain in the ass. Anyone have experience migrating to
SimpleDB? :)

Cheers,

Ed Norris

unread,
Apr 25, 2012, 9:18:07 PM4/25/12
to mongodb-user
Forgot to add:

1. Current mongo server version is 2.0.4
2. I've updated the filesystem of the data drive to ext4
3. I was able to access the data drive during this last crash

Ed Norris

unread,
May 2, 2012, 1:17:27 PM5/2/12
to mongodb-user
I've given up on fixing this issue and am creating a new Mongo
instance. New ext4 drive & new data. I'm hoping that there was some
underlying problem in the database or filesystem that will no longer
apply.

Scott H and Ross D had good questions and information, thanks guys!
Reply all
Reply to author
Forward
0 new messages