MongoDB - RocksDB data loss bug

Aishwarya Ganesan

Sep 18, 2016, 1:43:43 PM

to mongodb-dev

Hi,

I am running a three-node MongoDB cluster, using MongoDB 3.0.11 with RocksDB as the storage engine. When I insert a new item into the store, I set w=3, j=True. According to the MongoDB documentation, a write issued with j=True is acknowledged only after it has been written to the journal and the journal has been fsynced. But that doesn't seem to be the case. Running strace on mongod shows the following file-system operations on an insert:

---Client request here---

append("/data/ds-app-bugs/example/multi-mongo/workload_dir2/db/journal/000006.log", offset=1361, inode=7213977, count=16723)

----Client acked here-----

This could result in data loss if the node crashes before the write makes it to disk. If two nodes of a three-node cluster crash this way, one of them could become the leader, and global data loss is possible. We have reproduced this data loss using our testing framework. As a fix, the journal should be fsynced before the client is acknowledged.
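To make the proposed ordering concrete, here is a minimal POSIX sketch of "append, then fsync, then acknowledge". The helper name `appendDurably` is hypothetical and this is not MongoDB or RocksDB code; it only illustrates the system-call sequence I would expect to see in the strace output.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration only: appends `record` to the file at
// `path` and returns only after fsync() succeeds. A caller that acknowledges
// the client after this returns has the ordering described above.
void appendDurably(const std::string& path, const std::string& record) {
    int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) throw std::runtime_error("open failed");

    const char* p = record.data();
    std::size_t left = record.size();
    while (left > 0) {
        ssize_t n = ::write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted, retry the write
            ::close(fd);
            throw std::runtime_error("write failed");
        }
        p += n;
        left -= static_cast<std::size_t>(n);
    }

    // Without this call the appended bytes may still be only in the page
    // cache when the client is acknowledged.
    if (::fsync(fd) != 0) {
        ::close(fd);
        throw std::runtime_error("fsync failed");
    }
    ::close(fd);
}
```

In the trace above, only the append step is visible before the acknowledgement; the fsync step is the missing part.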

In the same version of MongoDB with WiredTiger as the storage engine, the journal is fsynced before the client is acknowledged.

Except for one place in the code, the RocksDB writes are issued with the default write options. Could this be the cause of this bug?

rocks_counter_manager.cpp: auto s = _db->Write(rocksdb::WriteOptions(), &wb);

rocks_engine.cpp: auto s = _db->Write(rocksdb::WriteOptions(), &wb);

rocks_engine.cpp: syncOptions.sync = true;
rocks_engine.cpp: auto s = _db->Write(syncOptions, &wb);

rocks_record_store.cpp: auto s = _db->Write(rocksdb::WriteOptions(), &wb);

rocks_recovery_unit.cpp: auto status = _db->Write(rocksdb::WriteOptions(), wb);

I would be happy to file an issue if needed. If so, should I file it in the MongoDB repository or the Mongo-Rocks repository?

Thanks,
Aishwarya

Igor Canadi

Sep 19, 2016, 7:28:04 PM

to mongodb-dev

Hi Aishwarya,

Thanks for the great bug report. Looking through the code, it seems like this is a known problem in 3.0: https://github.com/mongodb-partners/mongo/blob/v3.0-fb/src/mongo/db/storage/rocks/rocks_recovery_unit.cpp#L241. The correct thing to do there would be to call db->SyncWAL().

The bug should be fixed in the 3.2 branch, however. In 3.2, RocksDB handles the durability with RocksDurabilityManager and this is where the sync happens: https://github.com/mongodb-partners/mongo-rocks/blob/master/src/rocks_durability_manager.cpp#L51
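As a rough illustration of why the sync matters, here is a toy model of a write-ahead log. It is not RocksDB code: `ToyDb` and `ToyWriteOptions` are hypothetical stand-ins that only borrow the names `Write`, `sync` and `SyncWAL`, and the model assumes the worst case, where a crash loses every record that was not synced.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for write options; only a `sync` flag is modeled.
struct ToyWriteOptions {
    bool sync = false;  // default: do not sync the log as part of the write
};

// Toy write-ahead log. Records count as durable only once synced, and a
// simulated crash discards everything written after the last sync.
class ToyDb {
public:
    void Write(const ToyWriteOptions& opts, const std::string& record) {
        wal_.push_back(record);
        if (opts.sync) SyncWAL();
    }

    // Marks every record written so far as durable.
    void SyncWAL() { durable_ = wal_.size(); }

    // Simulates a crash: un-synced records are lost (worst-case assumption).
    void Crash() { wal_.resize(durable_); }

    std::size_t Size() const { return wal_.size(); }

private:
    std::vector<std::string> wal_;
    std::size_t durable_ = 0;
};
```

In this model, a record written with the default options and acknowledged right away disappears after `Crash()`, while a record written with `sync = true`, or followed by `SyncWAL()` before the acknowledgement, survives.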

Would you mind checking whether either of these fixes your issue:

1) Calling SyncWAL() in the 3.0 branch

2) Using 3.2?

Thanks,

Igor

P.S. In the future you can also use the mongo-rocks Google Group: https://groups.google.com/forum/#!forum/mongo-rocks
