Hi all,
We are using a sharding + replica set configuration, and recently we have
been seeing a strange error.
When all servers are online, everything is OK. However, if the master of a
replica set goes offline for a short time (with some operations happening
while it is offline), it is re-elected as the master of the shard as soon
as it comes back online.
After that, the problem appears:
In our application log, we see that the getlasterror for update operations
sometimes returns a shardRawGLE with "n" : 0 (but the update has somehow
already been applied, without anything being logged).
Because getlasterror returns n = 0, our application considers that the
update failed and retries the same update operation with the same
conditions. For these retries, we see a different getLastError in the log,
showing that updatedExisting is false (which would mean that the update
was already done by the first try?).
------------------
Here's the code for the update and the getLastError call:

// Single-document update: upsert = false, multi = false.
m_pDBConnection->GetConnection()->update(tableName, m_objQuery, m_objFields, false, false);

// Ask for getlasterror with w = "majority" and a 10 s wtimeout.
m_pDBConnection->GetConnection()->runCommand(
    m_pDBConnection->GetDBName(),
    BSON("getlasterror" << 1 << "w" << "majority" << "wtimeout" << 10000),
    objError);

// "n" is the number of documents affected, as reported by getlasterror.
int iRows = objError.getIntField("n");
if (iRows > 0)
{
    if (pcRowsAffected)
        *pcRowsAffected = iRows;
    EDPTRACE_DBG_INF(*********); // Application log for a normal execution
}
else
{
    if (pcRowsAffected)
        *pcRowsAffected = 0;
    EDPTRACE_DBG_ERR(**********); // Application log showing that an error occurred
}
--------------------
Here are the two different getLastError results:
1) First try :
Error description :
{
  "shards" : [
    "Shard1/10.1.6.130:2222,10.1.64.207:2222",
    "Shard2/10.1.64.102:2223,10.1.64.207:2223",
    "Shard3/10.1.6.130:2224,10.1.64.102:2224,10.1.64.207:2224"
  ],
  "shardRawGLE" : {
    "Shard1/10.1.6.130:2222,10.1.64.207:2222" : {
      "n" : 0, "lastOp" : 0, "connectionId" : 52600,
      "wnote" : "no write has been done on this connection",
      "wtime" : 0, "err" : null, "ok" : 1
    },
    "Shard2/10.1.64.102:2223,10.1.64.207:2223" : {
      "n" : 0, "lastOp" : 0, "connectionId" : 20070,
      "wnote" : "no write has been done on this connection",
      "wtime" : 0, "err" : null, "ok" : 1
    },
    "Shard3/10.1.6.130:2224,10.1.64.102:2224,10.1.64.207:2224" : {
      "writeback" : { "$oid" : "4f9459054657a7c59703e612" },
      "instanceIdent" : "LY-YANG:2224",
      "n" : 0, "lastOp" : 0, "connectionId" : 58,
      "wnote" : "no write has been done on this connection",
      "wtime" : 0, "err" : null, "ok" : 1
    }
  },
  "n" : 0, "err" : null, "ok" : 1
}
2) All tries after the first :
Error description :
{
  "singleShard" : "Shard3/10.1.6.130:2224,10.1.64.102:2224,10.1.64.207:2224",
  "updatedExisting" : false, "n" : 0, "lastOp" : 5734306103559192599,
  "connectionId" : 51, "err" : null, "ok" : 1,
  "writeback" : { "$oid" : "4f9459194657a7c59703e614" },
  "instanceIdent" : "LY-YANG:2224",
  "wnote" : "no write has been done on this connection", "wtime" : 0,
  "writebackGLE" : {
    "singleShard" : "Shard3/10.1.6.130:2224,10.1.64.102:2224,10.1.64.207:2224",
    "updatedExisting" : false, "n" : 0, "lastOp" : 5734306103559192599,
    "connectionId" : 51, "err" : null, "ok" : 1
  },
  "initialGLEHost" : "Shard3/10.1.6.130:2224,10.1.64.102:2224,10.1.64.207:2224"
}
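
To make the difference between the two responses concrete, here is a minimal sketch of how the application could tell these cases apart instead of treating every n = 0 as a failure. It is only an illustration: GleResult, UpdateOutcome and classify are hypothetical names, not part of the MongoDB C++ driver. The reading of the fields is also my assumption from the two responses above: I take "updatedExisting" : false to mean the update ran but matched no document, and a response with n = 0 and no updatedExisting field to mean that nothing can be concluded about the write.

```cpp
// Hypothetical, simplified view of the getlasterror fields seen in the logs.
// These types are NOT part of the MongoDB C++ driver.
struct GleResult {
    bool hasErr;              // true when "err" is non-null
    int  n;                   // value of "n"
    bool hasUpdatedExisting;  // true when "updatedExisting" is present
    bool updatedExisting;     // value of "updatedExisting" (if present)
};

enum class UpdateOutcome {
    Applied,  // at least one document was updated
    NoMatch,  // the update ran but matched no document
    Unknown,  // the response says nothing about this write
    Failed    // the server reported an error
};

// Assumed interpretation of a getlasterror response for a non-upsert update.
UpdateOutcome classify(const GleResult& r) {
    if (r.hasErr) {
        return UpdateOutcome::Failed;
    }
    if (r.n > 0) {
        return UpdateOutcome::Applied;
    }
    if (r.hasUpdatedExisting && !r.updatedExisting) {
        return UpdateOutcome::NoMatch;
    }
    // n == 0 and no "updatedExisting": e.g. the shardRawGLE response where
    // every shard reports "no write has been done on this connection".
    return UpdateOutcome::Unknown;
}
```

With this reading, the first response above would be classified as Unknown and the retries as NoMatch, so a blind retry on n = 0 would no longer be the only reaction.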
---------------------
We are using MongoDB 2.0.2 with the C++ driver.
Here is a similar issue, but I could not find the cause there. (It was
suggested that the error could be caused by data migration, but in my case
a long time passed between the master coming back online and the update
operations, so synchronisation had certainly finished before the updates.):
http://groups.google.com/group/mongodb-user/browse_thread/thread/600d3abcc6122b7e