"access violation" MongoDB v2.0.4 on Windows 2008 R2

157 views
Skip to first unread message

Mo Cassidy

unread,
May 22, 2012, 5:18:19 PM5/22/12
to mongod...@googlegroups.com
Today, we had the following seemingly spontaneously occur:


Tue May 22 14:51:45 [conn27990] update default.instances query: { [...SNIP...] } idhack:1 upsert:1 124ms
Tue May 22 14:51:47 access violation
Tue May 22 14:52:15 [rsHealthPoll] replSet member iis-man-002:27021 is now in state PRIMARY
Tue May 22 14:52:17 [rsSync] replSet syncing to: iis-man-002:27021


***** SERVER RESTARTED *****


Tue May 22 14:52:19 [initandlisten] MongoDB starting : pid=1960 port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit host=iis-man-003
Tue May 22 14:52:19 [initandlisten] db version v2.0.4, pdfile version 4.5
Tue May 22 14:52:19 [initandlisten] git version: 329f3c47fe8136c03392c8f0e548506cb21f8ebf
Tue May 22 14:52:19 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42
Tue May 22 14:52:19 [initandlisten] options: { dbpath: "C:\SitesData\testing-website\mongodata", logappend: true, logpath: "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, replSet: "rsTest", rest: true, service: true }
Tue May 22 14:52:19 [initandlisten] journal dir=C:/SitesData/testing-website/mongodata/journal
Tue May 22 14:52:19 [initandlisten] recover begin
Tue May 22 14:52:19 [initandlisten] recover lsn: 258699461
Tue May 22 14:52:19 [initandlisten] recover C:/SitesData/testing-website/mongodata/journal/j._4
Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209344204 < lsn:258699461
Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209382318 < lsn:258699461
Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209420683 < lsn:258699461
Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209458887 < lsn:258699461
Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209497082 < lsn:258699461
Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209535336 < lsn:258699461
Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209573520 < lsn:258699461
Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209611794 < lsn:258699461
Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209649958 < lsn:258699461
Tue May 22 14:52:20 [initandlisten] recover skipping application of section more...
Tue May 22 14:52:58 [initandlisten] recover C:/SitesData/testing-website/mongodata/journal/j._5
Tue May 22 14:53:17 [initandlisten] recover cleaning up
Tue May 22 14:53:17 [initandlisten] removeJournalFiles
Tue May 22 14:53:17 [initandlisten] recover done
Tue May 22 14:53:18 [initandlisten] waiting for connections on port 27021
Tue May 22 14:53:18 [websvr] admin web console waiting for connections on port 28021
Tue May 22 14:53:19 [rsStart] trying to contact iis-man-002:27021
Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-002:27021 is up
Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-002:27021 is now in state PRIMARY
Tue May 22 14:53:19 [rsStart] replSet STARTUP2
Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-001:27021 is up
Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-001:27021 is now in state ARBITER
Tue May 22 14:53:19 [rsSync] replSet SECONDARY
Tue May 22 14:53:23 [rsSync] replSet syncing to: iis-man-002:27021
Tue May 22 14:53:24 access violation


***** SERVER RESTARTED *****


Tue May 22 14:53:25 [initandlisten] MongoDB starting : pid=2356 port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit host=iis-man-003
Tue May 22 14:53:25 [initandlisten] db version v2.0.4, pdfile version 4.5
Tue May 22 14:53:25 [initandlisten] git version: 329f3c47fe8136c03392c8f0e548506cb21f8ebf
Tue May 22 14:53:25 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42
Tue May 22 14:53:25 [initandlisten] options: { dbpath: "C:\SitesData\testing-website\mongodata", logappend: true, logpath: "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, replSet: "rsTest", rest: true, service: true }
Tue May 22 14:53:25 [initandlisten] journal dir=C:/SitesData/testing-website/mongodata/journal
Tue May 22 14:53:25 [initandlisten] recover begin
Tue May 22 14:53:25 [initandlisten] recover lsn: 844
Tue May 22 14:53:25 [initandlisten] recover C:/SitesData/testing-website/mongodata/journal/j._0
Tue May 22 14:53:25 [initandlisten] recover cleaning up
Tue May 22 14:53:25 [initandlisten] removeJournalFiles
Tue May 22 14:53:25 [initandlisten] recover done
Tue May 22 14:53:25 [initandlisten] waiting for connections on port 27021
Tue May 22 14:53:25 [websvr] admin web console waiting for connections on port 28021
Tue May 22 14:53:25 [rsStart] replSet STARTUP2
Tue May 22 14:53:25 [rsSync] replSet SECONDARY
Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-002:27021 is up
Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-002:27021 is now in state PRIMARY
Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-001:27021 is up
Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-001:27021 is now in state ARBITER
Tue May 22 14:53:30 [rsSync] replSet syncing to: iis-man-002:27021
Tue May 22 14:53:30 access violation


***** SERVER RESTARTED *****


Tue May 22 14:53:30 [initandlisten] MongoDB starting : pid=2268 port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit host=iis-man-003
Tue May 22 14:53:30 [initandlisten] db version v2.0.4, pdfile version 4.5
Tue May 22 14:53:30 [initandlisten] git version: 329f3c47fe8136c03392c8f0e548506cb21f8ebf
Tue May 22 14:53:30 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42
Tue May 22 14:53:30 [initandlisten] options: { dbpath: "C:\SitesData\testing-website\mongodata", logappend: true, logpath: "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, replSet: "rsTest", rest: true, service: true }
Tue May 22 14:53:30 [initandlisten] journal dir=C:/SitesData/testing-website/mongodata/journal
Tue May 22 14:53:30 [initandlisten] recover begin
Tue May 22 14:53:30 [initandlisten] info no lsn file in journal/ directory
Tue May 22 14:53:30 [initandlisten] recover lsn: 0
Tue May 22 14:53:30 [initandlisten] recover C:/SitesData/testing-website/mongodata/journal/j._0
Tue May 22 14:53:30 [initandlisten] recover cleaning up
Tue May 22 14:53:30 [initandlisten] removeJournalFiles
Tue May 22 14:53:30 [initandlisten] recover done
Tue May 22 14:53:30 [initandlisten] waiting for connections on port 27021
Tue May 22 14:53:30 [websvr] admin web console waiting for connections on port 28021
Tue May 22 14:53:30 [rsStart] replSet STARTUP2
Tue May 22 14:53:30 [rsSync] replSet SECONDARY
Tue May 22 14:53:30 [rsHealthPoll] replSet member iis-man-002:27021 is up
Tue May 22 14:53:30 [rsHealthPoll] replSet member iis-man-002:27021 is now in state PRIMARY
Tue May 22 14:53:30 [rsHealthPoll] replSet member iis-man-001:27021 is up
Tue May 22 14:53:30 [rsHealthPoll] replSet member iis-man-001:27021 is now in state ARBITER
Tue May 22 14:53:35 [rsSync] replSet syncing to: iis-man-002:27021
Tue May 22 14:53:35 access violation


***** SERVER RESTARTED *****


The last pattern above repeated many times until someone noticed the error in our website logging system.
After seeing that, I manually stopped the service, which gave the following log messages:




***** SERVER RESTARTED *****


Tue May 22 15:20:03 [initandlisten] MongoDB starting : pid=3132 port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit host=iis-man-003
Tue May 22 15:20:03 [initandlisten] db version v2.0.4, pdfile version 4.5
Tue May 22 15:20:03 [initandlisten] git version: 329f3c47fe8136c03392c8f0e548506cb21f8ebf
Tue May 22 15:20:03 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42
Tue May 22 15:20:03 [initandlisten] options: { dbpath: "C:\SitesData\testing-website\mongodata", logappend: true, logpath: "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, replSet: "rsTest", rest: true, service: true }
Tue May 22 15:20:03 [initandlisten] journal dir=C:/SitesData/testing-website/mongodata/journal
Tue May 22 15:20:03 [initandlisten] recover begin
Tue May 22 15:20:03 [initandlisten] info no lsn file in journal/ directory
Tue May 22 15:20:03 [initandlisten] recover lsn: 0
Tue May 22 15:20:03 [initandlisten] recover C:/SitesData/testing-website/mongodata/journal/j._0
Tue May 22 15:20:03 [initandlisten] recover cleaning up
Tue May 22 15:20:03 [initandlisten] removeJournalFiles
Tue May 22 15:20:03 [initandlisten] recover done
Tue May 22 15:20:03 [initandlisten] waiting for connections on port 27021
Tue May 22 15:20:03 [websvr] admin web console waiting for connections on port 28021
Tue May 22 15:20:03 [rsStart] replSet STARTUP2
Tue May 22 15:20:03 [rsSync] replSet SECONDARY
Tue May 22 15:20:03 [rsHealthPoll] replSet member iis-man-002:27021 is up
Tue May 22 15:20:03 [rsHealthPoll] replSet member iis-man-002:27021 is now in state PRIMARY
Tue May 22 15:20:03 [rsHealthPoll] replSet member iis-man-001:27021 is up
Tue May 22 15:20:03 [rsHealthPoll] replSet member iis-man-001:27021 is now in state ARBITER
Tue May 22 15:20:04 shutdown: going to close listening sockets...
Tue May 22 15:20:04 closing listening socket: 320
Tue May 22 15:20:04 closing listening socket: 368
Tue May 22 15:20:04 shutdown: going to flush diaglog...
Tue May 22 15:20:04 shutdown: going to close sockets...
Tue May 22 15:20:04 shutdown: waiting for fs preallocator...
Tue May 22 15:20:04 shutdown: lock for final commit...
Tue May 22 15:20:04 shutdown: final commit...
Tue May 22 15:20:04 shutdown: closing all files...
Tue May 22 15:20:04   Assertion failure inShutdown() db\dur.cpp 681
Tue May 22 15:20:04 [websvr] Listener: accept() returns -1 errno:10038 An operation was attempted on something that is not a socket.
Tue May 22 15:20:04 [websvr] select() failure: ret=-1 errno:10038 An operation was attempted on something that is not a socket.
Tue May 22 15:20:04 [initandlisten] Listener: accept() returns -1 errno:10038 An operation was attempted on something that is not a socket.
Tue May 22 15:20:04 [initandlisten] select() failure: ret=-1 errno:10038 An operation was attempted on something that is not a socket.
Tue May 22 15:20:04 [initandlisten] now exiting
Tue May 22 15:20:04 dbexit:
Tue May 22 15:20:04 [initandlisten] shutdown: going to close listening sockets...
Tue May 22 15:20:04 [initandlisten] shutdown: going to flush diaglog...
Tue May 22 15:20:04 [initandlisten] shutdown: going to close sockets...
Tue May 22 15:20:04 [initandlisten] shutdown: waiting for fs preallocator...
Tue May 22 15:20:04 [initandlisten] shutdown: lock for final commit...
Tue May 22 15:20:04 [initandlisten] shutdown: final commit...
Tue May 22 15:20:04 unhandled windows exception
Tue May 22 15:20:04 ec=0xe06d7363


***** SERVER RESTARTED *****


I saw mongod.lock in the data folder, so I renamed that, and tried again, but got the same messages, and saw mongo was automatically restarting over and over very quickly.
I manually tried to start mongo, using the command from HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\MongoDB\ImagePath:
"c:\SitesData\testing-website\mongo\mongod.exe" --quiet  --rest  --replSet  rsTest  --dbpath  "C:\SitesData\testing-website\mongodata"  --port  27021  --logpath  "C:\SitesData\testing-website\mongo.log"  --logappend  --service  
But that only gave the following logs messages:


***** SERVER RESTARTED *****


Tue May 22 15:29:04 BackgroundJob starting: DataFileSync
Tue May 22 15:29:04 dbexit:
Tue May 22 15:29:04 shutdown: going to close listening sockets...
Tue May 22 15:29:04 shutdown: going to flush diaglog...
Tue May 22 15:29:04 shutdown: going to close sockets...
Tue May 22 15:29:04 shutdown: waiting for fs preallocator...
Tue May 22 15:29:04 shutdown: lock for final commit...
Tue May 22 15:29:04 shutdown: final commit...
Tue May 22 15:29:04 shutdown: closing all files...
Tue May 22 15:29:04 closeAllFiles() finished
Tue May 22 15:29:04 dbexit: really exiting now


***** SERVER RESTARTED *****

The way this failure occurred prevented failover from happening, because the local mongo instance could be seen, and connected to (slaveok is true), but would immediately shutdown and close the connection, causing our ASP.Net website to keep throwing System.Net.Sockets.SocketException "An existing connection was forcibly closed by the remote host"

I'm not sure what to do from here. I'd like to avoid a full resync, since we only have two non-arbiter members, and this would mean a performance impact while this instance fully resynched with the primary. Does anyone have any advice?

Tad Marshall

unread,
May 23, 2012, 6:18:26 AM5/23/12
to mongod...@googlegroups.com
There are a couple of things going on here.  The main one is that it seems that something being replicated is causing an access violation on your secondary.  This is an error in memory addressing, the same thing as a segfault (signal 11) on Unix.  The second thing is that the Windows service settings that were created by mongod.exe when you used --install to install the service instruct the Windows Service Control Manager to restart mongod.exe if it ever fails.  Because you have journaling enabled, mongod is able to recover from each restart, but when it recovers it starts syncing from the primary again, reads the same oplog entry that killed it the previous time and so gets another access violation and the process repeats.

Version 2.0.4 gives very little information about the access violation, just the faulting address, which makes it hard to debug.  In addition, Address Space Layout Randomization (ASLR) makes the address hard to use in finding out what source code line triggered the crash.  Version 2.2 will behave a little better on access violations in Windows: it will show the thread name that crashed (which is almost certainly rsSync for your crash), the type of memory access and the address that was used.  In addition, it will generate a minidump that can help diagnose the problem and then shut down "cleanly" so no recovery will be required on restart.

There are a couple of things that could be tried to get you out of this hole.  One possibility is to temporarily try the most recent nightly build and see if the crashing problem has been fixed.  If it hasn't been fixed, you will get a minidump that we could use to analyze the crash.  If this fixes the problem, you could let it run long enough to fully resync from the primary and then switch back to your 2.0.4 version.

There are other options that could be tried, such as wiping the data and restoring from a dump from the primary, but it might be worth trying the nightly build first.  If it worked, it would be a lot faster.  The current nightly build is getting close to the version that will ship as 2.2.0-rc0 (the first release candidate for version 2.2) but it isn't there yet and we don't recommend running "unstable" nightly builds in production.  As a short term test to get you past this outage, it might be worth trying.

Mo Cassidy

unread,
Jun 29, 2012, 12:31:48 PM6/29/12
to mongod...@googlegroups.com
Wow, sorry Tad, I had totally forgotten about this until today (guess what happened again today). To get out of the hole, I just went ahead and did a full resync. It didn't affect performance a whole lot and was done relatively quickly (I think maybe 30 minutes). Thanks for all the info, it's really interesting. I'll just do the full resync again today, and I look forward to 2.2 so I'll have more info if this happens again! :)

Mo Cassidy

unread,
Aug 7, 2012, 11:01:12 AM8/7/12
to mongod...@googlegroups.com
Hmmm. It happened a third time, today. I noticed something fishy about the timing, so I calculated the time between failures (some of it is estimated, because my records aren't complete), and it was ~38 days each time (930 hours between the 1st and 2nd failure, and 910 hours between the 2nd and 3rd failure). It's always my secondary (I have two active nodes and an arbiter) that crashes like this, but the secondary is always running on our 8GB RAM server, while the primary is on a 12GB server. Both servers are shared by Mongo and fairly busy IIS nodes of a webfarm. I haven't been closely watching what's going on with 2.2, but hopefully this is useful information for the devs.
Reply all
Reply to author
Forward
0 new messages