Hi,
I am running a three-node MongoDB cluster, using MongoDB 3.0.11 with RocksDB as the storage engine. When I insert a new item into the store, I set w=3, j=True. According to the MongoDB documentation, a write issued with j=True is acknowledged only after it has been written to the journal and the journal has been fsynced. But that doesn't seem to be the case. When running strace on mongod, these are the file-system operations that happen on an insert:
---Client request here---
append("/data/ds-app-bugs/example/multi-mongo/workload_dir2/db/journal/000006.log", offset=1361, inode=7213977, count=16723)
----Client acked here-----
This could result in data loss if the node crashes before the write makes it to disk. If the crash happens on two nodes of a three-node cluster, one of these nodes could become the leader, and global data loss is possible. We have reproduced this particular data loss issue using our testing framework. As a fix, it would be safe to fsync the journal before acknowledging the client.
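The ordering I would expect (append, fsync, then acknowledge) can be sketched roughly as follows. This is not MongoDB or MongoRocks code: `append_durably` is a hypothetical helper built on plain POSIX calls, and it is only meant to illustrate the difference between the append-only sequence seen in strace and an append followed by fsync.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Hypothetical helper (illustration only): append one record to a journal
// file and return true only after the data has been fsynced.
// The caller would acknowledge the client only when this returns true.
bool append_durably(const std::string& path, const std::string& record) {
    int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return false;

    const char* p = record.data();
    size_t left = record.size();
    bool ok = true;
    while (left > 0) {
        ssize_t n = ::write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted, retry the write
            ok = false;
            break;
        }
        p += n;
        left -= static_cast<size_t>(n);
    }

    // Without this fsync the record may exist only in the OS page cache,
    // which is the state the strace output above suggests at ack time.
    // (A newly created file may also need an fsync on its parent directory;
    // that step is omitted in this sketch.)
    if (ok && ::fsync(fd) != 0) ok = false;
    if (::close(fd) != 0) ok = false;
    return ok;
}
```

With this shape, a crash after the function returns true should not lose the record, whereas acknowledging right after the write loop would leave a window for loss.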
In the same version of MongoDB with WiredTiger as the storage engine, the journal is fsynced before the client is acknowledged.
Except for one place in the code, the RocksDB writes are issued with default write options (which leave sync set to false). Could this be the cause of this bug?
rocks_counter_manager.cpp: auto s = _db->Write(rocksdb::WriteOptions(), &wb);
rocks_engine.cpp: auto s = _db->Write(rocksdb::WriteOptions(), &wb);
rocks_engine.cpp: syncOptions.sync = true;
auto s = _db->Write(syncOptions, &wb);
rocks_record_store.cpp: auto s = _db->Write(rocksdb::WriteOptions(), &wb);
rocks_record_store.cpp: auto s = _db->Write(rocksdb::WriteOptions(), &wb);
rocks_recovery_unit.cpp: auto status = _db->Write(rocksdb::WriteOptions(), wb);
I would be happy to file an issue if needed. If so, should I file it in the MongoDB repository or the Mongo-Rocks repository?
Thanks,
Aishwarya