After successful dropDatabase mongo became very slow and never recovered eventually causing outage.

173 views
Skip to first unread message

Sergei Bobovich

unread,
Oct 9, 2015, 11:02:13 AM10/9/15
to mongodb-user
Dear mongo gurus,
I really need your help and would appreciate any ideas on the issue below. 
While investigating problem in production with mongo server (v3.0.1 using wired tiger engine) I found that the trigger was dropDatabase command that finished fast and successful but resulted in a significant slow down of the following query processing that server has never managed to recover from. Below is a log from the primary member. The database aadh_prd has been deleted - the operation took just 476ms.  Right after that we see the spike of slow queries on other databases. The first few might be explained by dropDatabase holding global WRITE lock. But the slow queries continue further 10,20...s after dropDatabase has  completed. The slow response resulted in quick rise of connection counts form app tier eventually reaching connection limit. The mongo server never recovered and remained in this state for 2 hours until we forced switch to secondary. This happened during low traffic time - so server was not getting any significant load:  10op/sec at most. Normally these operations complete within under 3ms and we did not see any slow operations in the log for days before dropDatabase command.  Any idea or suggestion on steps to dig to the root cause will be greatly appreciated.
Thanks,
Sergei

2015-10-09T08:17:40.113+0000 I COMMAND  [conn2736988] dropDatabase aadh_prd starting
2015-10-09T08:17:40.516+0000 I NETWORK  [initandlisten] connection accepted from X.X.X.X:XXX #2736995 (23 connections now open)
2015-10-09T08:17:40.589+0000 I COMMAND  [conn2736988] dropDatabase aadh_prd finished
2015-10-09T08:17:40.589+0000 I COMMAND  [conn2736988] command aadh_prd.$cmd command: dropDatabase { dropDatabase: 1.0 } keyUpdates:0 writeConflicts:0 numYields:0 reslen
:59 locks:{} 476ms

2015-10-09T08:17:40.590+0000 I COMMAND  [conn2709391] command aaea_prd.$cmd command: count { count: "bask_bcIi6iaagRqmkaaacWQphRFS6L", query: { customer_id: "abbxU0aVYv
G1j6a4fMiblcZuTb" }, limit: -5 } planSummary: COUNT_SCAN { customer_id: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:44 locks:{} 279ms
2015-10-09T08:17:40.590+0000 I COMMAND  [conn2735382] command aace_prd.$cmd command: count { count: "bask_bd63giaag1s7caaaczcTQbS1qh", query: { customer_id: "ad3e7Xd5Ia
Zu9Dn9vyh0iWF0bP" }, limit: -5 } planSummary: COUNT_SCAN { customer_id: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:44 locks:{} 260ms
...
2015-10-09T08:17:50.869+0000 I COMMAND  [conn2737032] command aaea_prd.$cmd command: count { count: "bask_bdI8YiaagTolwaaacMUHoSGP0b", query: { customer_id: "abSsuASmsg
o2TKpTyg6rdXnSCF" }, limit: -5 } planSummary: COUNT_SCAN { customer_id: 1 } keyUpdates:0 writeConflicts:0 numYields:1 reslen:44 locks:{} 175ms
2015-10-09T08:17:51.171+0000 I NETWORK  [initandlisten] connection accepted from X.X.X.X:XXX #2737058 (86 connections now open)
2015-10-09T08:17:51.454+0000 I COMMAND  [conn2736294] command aaiz_prd.$cmd command: count { count: "bask_bcIWciaagXs8Uaaac1nGIhAfR3", query: { customer_id: "deXoS5yWCt
jlcPb5AfSa6SEcMp" }, limit: -5 } planSummary: COUNT_SCAN { customer_id: 1 } keyUpdates:0 writeConflicts:0 numYields:1 reslen:44 locks:{} 174ms
2015-10-09T08:17:51.462+0000 I NETWORK  [initandlisten] connection accepted from X.X.X.X:XXX #2737059 (87 connections now open)
2015-10-09T08:17:51.520+0000 I NETWORK  [initandlisten] connection accepted from X.X.X.X:XXX #2737060 (88 connections now open)
2015-10-09T08:17:51.531+0000 I NETWORK  [initandlisten] connection accepted from X.X.X.X:XXX #2737061 (89 connections now open)
2015-10-09T08:17:51.648+0000 I NETWORK  [initandlisten] connection accepted from X.X.X.X:XXX #2737062 (90 connections now open)
2015-10-09T08:17:51.657+0000 I NETWORK  [initandlisten] connection accepted from X.X.X.X:XXX #2737063 (91 connections now open)
2015-10-09T08:17:51.759+0000 I NETWORK  [initandlisten] connection accepted from X.X.X.X:XXX #2737064 (92 connections now open)
2015-10-09T08:17:51.810+0000 I NETWORK  [initandlisten] connection accepted from X.X.X.X:XXX #2737065 (93 connections now open)
2015-10-09T08:17:51.835+0000 I COMMAND  [conn2737063] command aaea_prd.$cmd command: count { count: "bask_bc8eYiaagZ3x6aaacSAHoRdGTt", query: { customer_id: "agbiFDy76z
zV2cfedga31B2Cpp" }, limit: -5 } planSummary: COUNT_SCAN { customer_id: 1 } keyUpdates:0 writeConflicts:0 numYields:1 reslen:44 locks:{} 176ms
2015-10-09T08:17:51.836+0000 I WRITE    [conn2737025] update aaea_prd.bask_bcvSkiaagTQGwaaacJzUEGuv8T query: { uuid: "bdTukiaagXRhkaaadoLRxEZb9F" } update: { XXXX } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:1 writeConflicts:0 numYields:1 locks:{} 258ms
...

  

Alexander Gorrod

unread,
Oct 12, 2015, 1:43:32 AM10/12/15
to mongodb-user
Hi Sergei,

There was a known issue with dropDatabase that caused the symptoms you describe. It was fixed in the 3.0.3 release of MongoDB. The issue was https://jira.mongodb.org/browse/SERVER-17907.

Please try upgrading to the latest version of MongoDB and let us know if the issue isn't resolved.

Regards,
Alex

Reply all
Reply to author
Forward
0 new messages