Mongos get "Operation Timeout" after recovering Config Server Replica set


Lukas Obermann

unread,
Sep 6, 2016, 2:14:36 PM9/6/16
to mongodb-user
So, I have a very weird issue.
On Sunday I had a server issue in which 2 of the 3 config servers in one replica set became bricked and the third got stuck in recovery...
The only way I could get it back was to restore its contents from the backup (Mongo Cloud).

So far it looked good: all the sharding data is visible via sh.status() on the config servers, and I can query the data on the two shard replica sets.

But on the mongos instances I am getting timeouts when trying to get the shard status:

    mongos> sh.status()
    2016-09-05T09:49:15.645+0000 E QUERY    [thread1] Error: error: { "code" : 50, "ok" : 0, "errmsg" : "Operation timed out" } :
    _getErrorWithCode@src/mongo/shell/utils.js:25:13
    DBCommandCursor@src/mongo/shell/query.js:689:1
    DBQuery.prototype._exec@src/mongo/shell/query.js:118:28
    DBQuery.prototype.hasNext@src/mongo/shell/query.js:276:5
    DBCollection.prototype.findOne@src/mongo/shell/collection.js:289:10
    printShardingStatus@src/mongo/shell/utils_sh.js:540:19
    sh.status@src/mongo/shell/utils_sh.js:78:5
    @(shell):1:1


I can connect to the config server with the mongo shell from the server running the mongos. We are using MongoDB 3.2.7. I have no idea how to solve this issue, as I do not see any logs pointing me in the right direction...

Running sh.status() on the config replica set's primary gives me the following:

    cfg:PRIMARY> sh.status()
    --- Sharding Status ---
      sharding version: {
        "_id" : 1,
        "minCompatibleVersion" : 5,
        "currentVersion" : 6,
        "clusterId" : ObjectId("5784eeaef6b7baafd8311861")
      }
      shards:
        {  "_id" : "rs1",  "host" : "rs1/mongodbreplicaset1_mongodb-rs1-srv1_1:27017,mongodbreplicaset2_mongodb-rs1-srv2_1:27017" }
        {  "_id" : "rs2",  "host" : "rs2/mongodbreplicaset1_mongodb-rs2-srv2_1:27017,mongodbreplicaset2_mongodb-rs2-srv1_1:27017" }
      active mongoses:
        "3.2.7" : 1
      balancer:
        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  5
        Last reported error:  could not get updated shard list from config server due to Operation timed out
        Time of Reported error:  Tue Sep 06 2016 13:05:53 GMT+0000 (UTC)
        Migration Results for the last 24 hours:
          No recent migrations



This shows that the balancer is also unable to run, again because of a timeout. I do not see any connection issues between the servers.


The only thing in the logs on the config servers is this:

    2016-09-05T10:09:35.549+0000 I COMMAND  [conn1243] Command on database config timed out waiting for read concern to be satisfied. Command: { find: "shards", readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1472281864000|2, t: 30 } }, maxTimeMS: 30000 }
    2016-09-05T10:09:35.551+0000 I COMMAND  [conn1243] command config.$cmd command: find { find: "shards", readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1472281864000|2, t: 30 } }, maxTimeMS: 30000 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:92 locks:{} protocol:op_command 30409ms


I hope somebody can help me on this. Thanks!

Thiago Leite

unread,
Sep 28, 2016, 12:46:53 PM9/28/16
to mongodb-user

Hi, 

I have the same issue with MongoDB 3.2.9.

Did you solve it?

Regards, 

Thiago Leite

Kevin Adistambha

unread,
Oct 10, 2016, 8:12:23 PM10/10/16
to mongodb-user

Hi Lukas, Thiago,

On Sunday I had a server issue which resulted in 2 of 3 config servers in one replica set became bricked and the third one stuck on recovery ..

It has been a while since you posted this question. Have you had any success in fixing the issue?

The main issue is this line in the log you posted:

2016-09-05T10:09:35.549+0000 I COMMAND [conn1243] Command on database config timed out waiting for read concern to be satisfied.

With config servers deployed as a replica set, MongoDB needs to ensure that any writes to and reads from the config servers are committed to a majority of the replica set, so that the config data is durable and will not be rolled back for any reason (see Read and Write Operations on Config Servers).

The timeout message you are seeing in the logs reflects this lack of a majority. That is, a majority of the config server replica set is not online at that point in time, so the config servers cannot reach a quorum on the latest data written to them. In this situation, MongoDB opts to return a timeout error instead of potentially returning the wrong data.

In order to restore operation to the cluster, you need to ensure that a majority of the config server replica set is online.
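The quorum arithmetic behind this can be sketched in plain JavaScript (the same logic applies whether you inspect rs.status() by hand or script it; canSatisfyMajority and the sample member documents below are illustrative, not part of MongoDB's API):

```javascript
// Given an array shaped like rs.status().members, decide whether the
// replica set can currently satisfy readConcern: "majority".
function canSatisfyMajority(members) {
  var healthyUp = members.filter(function (m) {
    // Only healthy PRIMARY/SECONDARY members can count toward a majority.
    return m.health === 1 &&
           (m.stateStr === "PRIMARY" || m.stateStr === "SECONDARY");
  }).length;
  var majority = Math.floor(members.length / 2) + 1;
  return healthyUp >= majority;
}

// The situation in this thread: of 3 config servers, 2 were bricked and
// only the restored member is up, so majority reads time out.
var degraded = [
  { stateStr: "PRIMARY",       health: 1 },
  { stateStr: "(unreachable)", health: 0 },
  { stateStr: "(unreachable)", health: 0 }
];
// canSatisfyMajority(degraded) is false: 1 healthy member < majority of 2
```

So getting at least 2 of the 3 config server members back into PRIMARY/SECONDARY state should clear the "Operation timed out" errors.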

For more information regarding reading/writing settings in a replica set, please see:

Best regards,
Kevin

enric.s...@kernel-analytics.com

unread,
Sep 26, 2017, 8:58:56 PM9/26/17
to mongodb-user
Hi everyone,

I am running into exactly the same issue on a MongoDB 3.2.10 sharded cluster. I have modified the write concern to "majority" in my config server replica set following https://docs.mongodb.com/v3.2/core/replica-set-write-concern/, and after relaunching my mongos instance the issue persists. I have also tried the --enableMajorityReadConcern option on the config server replica set mongod instances, and it did not solve the issue either.
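For reference, the flag Enric mentions can also be set in the mongod configuration file; a sketch of the relevant fragment for a 3.2.x config server member (the replica set name cfg is an assumption taken from Lukas's output):

```yaml
# mongod.conf fragment for a config server replica set member (3.2.x)
sharding:
  clusterRole: configsvr
replication:
  replSetName: cfg
  enableMajorityReadConcern: true
```

Note that this flag alone cannot fix a cluster where a majority of the config server members are down; majority reads still need a quorum of healthy members to succeed.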

Did you manage to solve the issue in the end?

Thank you,
Enric



Disclaimer: This message, including its attachments, contains confidential information addressed to a recipient for a specific purpose protected by law. Please note that if you are not the intended addressee of this message you must immediately delete the message and its attachments. Any disclosure, copying or distribution of the contents of this message is strictly forbidden. The execution of any action based on the contents of this message is strictly forbidden. If you are the intended recipient, the use of the information received shall be limited to the specific purpose for which it has been sent.


Kevin Adistambha

unread,
Sep 26, 2017, 9:25:59 PM9/26/17
to mongodb-user

Hi Enric

Please note that I have replied in your own thread here: https://groups.google.com/forum/#!topic/mongodb-user/b9okvbIS_A4.

The telltale sign of what’s happening in your deployment seems to be the growth of the WiredTigerLAS.wt file, which is reported in SERVER-26592, and you may be experiencing a similar issue.

Let’s keep the discussion in that thread.

Best regards,
Kevin

enric.s...@kernel-analytics.com

unread,
Sep 27, 2017, 4:13:27 AM9/27/17
to mongodb-user

Hi Kevin,

I had not seen this answer. Currently, after restarting my config replica set, the WiredTigerLAS.wt file in the config replica set data files is 4 KB (on all nodes), so its size should not be the issue now.

I will keep the discussion in the other thread; I thought my issue was now with the read concern, so I posted here.

Thanks,