Re: [mongodb-dev] questions on how replication and crash recovery work

Eliot Horowitz

Sep 23, 2012, 8:44:49 PM
to mongo...@googlegroups.com
Your 3 assumptions are correct.

The synchronization between the 3 sections (oplog, data, indexes) is
guaranteed because it's all using the same storage engine + journal.
So once the journal commit happens, all 3 are guaranteed to be in sync.
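
A toy model of that guarantee, written in Python rather than MongoDB's
C++, may help. It is only a sketch under stated assumptions: ToyStore,
insert, journal_commit, and crash are hypothetical names, not MongoDB
APIs, and the "journal" is just an in-memory list. The point it shows is
that one logical write stages changes for all three sections, and a
single commit step makes them durable together or not at all.

```python
# Toy model only: NOT MongoDB code. All names here are hypothetical.
class ToyStore:
    def __init__(self):
        # State that would survive a crash.
        self.durable = {"data": {}, "index": {}, "oplog": []}
        # Changes staged since the last journal commit.
        self.pending = []

    def insert(self, doc_id, doc):
        # One logical write stages changes to all three sections.
        self.pending.append(("data", doc_id, doc))
        self.pending.append(("index", doc["name"], doc_id))
        self.pending.append(("oplog", None, {"op": "i", "_id": doc_id}))

    def journal_commit(self):
        # Assumption: the whole staged batch becomes durable in one step.
        for section, key, value in self.pending:
            if section == "oplog":
                self.durable["oplog"].append(value)
            else:
                self.durable[section][key] = value
        self.pending = []

    def crash(self):
        # Anything not yet committed is lost for all sections at once.
        self.pending = []


store = ToyStore()
store.insert(1, {"name": "a"})
store.journal_commit()
store.insert(2, {"name": "b"})  # staged but never committed
store.crash()
```

After the simulated crash, the data, index, and oplog sections each
reflect document 1 only, so they still agree with one another.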

Adding in a 2nd storage engine that has the same transactional
properties isn't going to be easy.
I'm not even sure it's possible to do in an elegant way.

On Sun, Sep 23, 2012 at 7:50 PM, Zardosht Kasheff <zard...@gmail.com> wrote:
> Hello all,
>
> I am a Tokutek engineer investigating the possible integration of a
> different storage engine into MongoDB, be it at the index level or the
> storage engine level.
>
> For the purpose of this email, suppose that a collection either:
> - has a secondary index that is using our engine.
> - the entire collection is implemented using our engine.
>
> I am trying to learn how crash safety/recovery and replication would
> work with a possible third-party engine. The problem I see right now is
> that, should MongoDB crash, I do not understand how we can ensure that we
> recover to a state that MongoDB finds acceptable. That being said, I was
> wondering if somebody could please help with these questions:
>
> After a crash and recovery, what is the expected state of the system? Here
> are my guesses, based on things I have read, but they are only guesses:
> - secondary indexes are in sync with the main data heap
> - the main data heap is in sync with the replication log (which I think is
> called the opLog)
> - the exact data in the database depends on when the last fsync of the
> journal occurred.
>
> Are my guesses correct? If not, what are the invariants of the system after
> a crash regarding the journal, data heap, and opLog (and anything else I may
> not know about)?
>
> If so, here is the challenge I am thinking about. Upon a crash, if we are
> just a secondary index, how do we ensure that we are in sync with the main
> data heap, and if we have the entire collection, how do we ensure that we
> are in sync with the opLog?
>
> To answer this, I am trying to learn the locking in the system that ensures
> these invariants hold. I see the following in instance.cpp and query.cpp:
> - receivedInsert, receivedUpdate, and receivedDelete call Lock::DBWrite
> lk(ns), which I guess grabs some database level lock, and releases the lock
> should there be a "PageFaultException" (which I guess is I/O). Is this a
> database level lock that gets yielded during I/O?
> - receivedInsert has a reference to "read locked in big log". What does
> this mean?
> - runQuery, through "Client::ReadContext ctx( ns , dbpath );" grabs some
> read lock? Is this a read lock on the same lock grabbed in receivedInsert
> etc...?
>
> I guess some locking needs to be in place to ensure that the opLog and
> journal are in sync with the data heap, but with the locking above, I do not
> understand how this is done. Is there a global rw lock that does this? If
> so, where in code can I read about it?
>
> Thanks
> -Zardosht

Zardosht Kasheff

Sep 25, 2012, 4:35:23 PM
to mongo...@googlegroups.com, el...@10gen.com
Hello Eliot,

Thank you for the confirmation. So it seems that the invariant that each machine needs to maintain is that it is in sync with the journal. I watched a recorded presentation on journaling by Dwight Merriman (I would link to it, but I cannot find it), and I think I understood journaling to work as follows:
 - Every N ms, some thread with exclusive access to the journal logs a record stating that we are ending one transaction and beginning another.
 - After this end and begin are logged, some (global?) lock is released and the journal is fsynced. (I am guessing at this step; I assume a global lock is needed for the logging, but that you would not want to hold it while fsyncing.)
 - All work that falls between a begin and the following end essentially makes up a transaction.
 - During recovery, if we see a begin but not an end, then whatever work is logged after that begin is not applied.
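
The recovery rule in the last bullet can be sketched as follows. This is a minimal Python illustration of the model as I have described it, not MongoDB's journal format or code: the "BEGIN"/"END" markers, the string records, and the recover function are all hypothetical.

```python
def recover(journal):
    """Return the operations to replay after a crash.

    Assumption: the journal is a list of records in which "BEGIN" and
    "END" bracket a group of operations. Only groups that were closed
    by an "END" are replayed; a trailing group with no "END" is dropped.
    """
    applied = []
    group = None  # operations of the group currently being read
    for record in journal:
        if record == "BEGIN":
            group = []
        elif record == "END":
            if group is not None:
                applied.extend(group)
            group = None
        elif group is not None:
            group.append(record)
    return applied


# The second group was begun but never ended, so "insert c" is not applied.
journal = ["BEGIN", "insert a", "update b", "END", "BEGIN", "insert c"]
replayed = recover(journal)
```

Here replayed contains only "insert a" and "update b".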

Is this accurate? If so, what locking is used to do this? What locking protects the journal and the opLog? I assume these must be some sort of global locks.

Also, how does fsyncing of data files and trimming of the journal work? What locks are used to make it work?

Once I understand the recovery framework of MongoDB, I hope to work out what theoretically needs to be done to fit a second storage engine into this framework.

Thanks
-Zardosht