Message from discussion questions on how replication and crash recovery work

Received: by 10.66.79.6 with SMTP id f6mr2755976pax.3.1348605325271;
        Tue, 25 Sep 2012 13:35:25 -0700 (PDT)
X-BeenThere: mongodb-dev@googlegroups.com
Received: by 10.68.141.133 with SMTP id ro5ls2320158pbb.1.gmail; Tue, 25 Sep
 2012 13:35:23 -0700 (PDT)
Received: by 10.68.211.6 with SMTP id my6mr4136600pbc.15.1348605323828;
        Tue, 25 Sep 2012 13:35:23 -0700 (PDT)
Date: Tue, 25 Sep 2012 13:35:23 -0700 (PDT)
From: Zardosht Kasheff <zardo...@gmail.com>
To: mongodb-dev@googlegroups.com
Cc: el...@10gen.com
Message-Id: <86a7c1ed-8107-45dd-9db7-4f7ad683893f@googlegroups.com>
In-Reply-To: <CA+C=T4D2_49+4vUTNe+P3TStR+nGUeDgciRADBhCeW1O1Q+VfQ@mail.gmail.com>
References: <e8925817-c3ae-4df7-a341-c7298e7a1291@googlegroups.com>
 <CA+C=T4D2_49+4vUTNe+P3TStR+nGUeDgciRADBhCeW1O1Q+VfQ@mail.gmail.com>
Subject: Re: [mongodb-dev] questions on how replication and crash recovery
 work
MIME-Version: 1.0
Content-Type: multipart/mixed; 
	boundary="----=_Part_22_10047944.1348605323346"

------=_Part_22_10047944.1348605323346
Content-Type: multipart/alternative; 
	boundary="----=_Part_23_29896476.1348605323346"

------=_Part_23_29896476.1348605323346
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

Hello Eliot,

Thank you for the confirmation. So it seems that the invariant that each 
machine needs to maintain is that it is in sync with the journal. I watched 
a recorded presentation on journaling by Dwight Merriman (I would link to 
it, but I cannot find it), and I think I understood journaling to work as 
follows:
 - Every N ms, some thread with exclusive access to the journal logs a 
record stating that we are ending one transaction and beginning another.
 - After this end and begin are logged, some (global?) lock is released and 
the journal is fsynced. (I am guessing at this step: I assume a global lock 
is needed for the logging, but that you would not want to hold it while 
fsyncing.)
 - All work that falls between a begin and the following end essentially 
makes up a transaction.
 - During recovery, if we see a begin but no matching end, then whatever 
work is logged after that begin is not applied.
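
The recovery rule in the last bullet can be sketched as follows. This is a 
minimal illustration in Python, assuming a hypothetical record format 
(("begin", id), ("write", key, value), ("end", id)) and a hypothetical 
recover() helper; it is not MongoDB's actual journal format or recovery code.

```python
# Hypothetical record format, for illustration only:
#   ("begin", group_id), ("write", key, value), ("end", group_id)

def recover(journal):
    """Replay only the writes that sit inside a complete begin..end group."""
    state = {}
    pending = None  # writes seen since the last "begin" that has no "end" yet
    for record in journal:
        kind = record[0]
        if kind == "begin":
            pending = []
        elif kind == "write" and pending is not None:
            pending.append(record[1:])
        elif kind == "end" and pending is not None:
            # The group is complete, so its writes are safe to apply.
            for key, value in pending:
                state[key] = value
            pending = None
    # Whatever is still pending belongs to a group whose "end" was never
    # logged before the crash, so it is dropped rather than applied.
    return state


journal = [
    ("begin", 1), ("write", "a", 1), ("write", "b", 2), ("end", 1),
    ("begin", 2), ("write", "a", 99),  # crash before the matching "end"
]
recovered = recover(journal)
```

Here the second group never reaches its end record, so its write to "a" is 
ignored and only the first group's writes survive recovery.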

Is this accurate? If so, what locking is used to do this? What locking 
protects the journal and the opLog? I assume these must be some sort of 
global locks.

Also, how does fsyncing of data files and trimming of the journal work? 
What locks are used to make it work?

Once I understand the recovery framework of MongoDB, I hope to work out 
what would theoretically need to be done to fit a second storage engine 
into it.

Thanks
-Zardosht


On Sunday, September 23, 2012 8:44:52 PM UTC-4, Eliot Horowitz wrote:
>
> Your 3 assumptions are correct. 
>
> The synchronization between the 3 sections (oplog, data, indexes) is 
> guaranteed because it's all using the same storage engine + journal. 
> So once the journal commit happens, all 3 are guaranteed in sync. 
>
> Adding in a 2nd storage engine that has the same transactional 
> properties isn't going to be easy. 
> Not even sure it's possible to do in an elegant way. 
>
> On Sun, Sep 23, 2012 at 7:50 PM, Zardosht Kasheff <zard...@gmail.com> 
> wrote: 
> > Hello all, 
> > 
> > I am a Tokutek engineer investigating the possible integration of a 
> > different storage engine into MongoDB, be it at the index level or the 
> > storage engine level. 
> > 
> > For the purpose of this email, suppose that a collection either: 
> >  - has a secondary index that is using our engine. 
> >  - the entire collection is implemented using our engine. 
> > 
> > I am trying to learn how crash safety/recovery works and replication would 
> > work with a possible third-party engine. The problem I see right now is 
> > should MongoDB crash, I do not understand how we can ensure that we recover 
> > to a state that MongoDB finds acceptable. That being said, I was wondering 
> > if somebody could please help with these questions: 
> > 
> > After a crash and recovery, what is the expected state of the system? Here 
> > are my guesses, based on things I have read, but they are only guesses: 
> >  - secondary indexes are in sync with the main data heap 
> >  - the main data heap is in sync with the replication log (which I think is 
> > called the opLog) 
> >  - the exact data in the database depends on when the last fsync of the 
> > journal occurred. 
> > 
> > Are my guesses correct? If not, what are the invariants of the system after 
> > a crash regarding the journal, data heap, and opLog (and anything else I may 
> > not know about)? 
> > 
> > If so, here is the challenge I am thinking about. Upon a crash, if we are 
> > just a secondary index, how do we ensure that we are in sync with the main 
> > data heap, and if we have the entire collection, how do we ensure that we 
> > are in sync with the opLog? 
> > 
> > To answer this, I am trying to learn the locking in the system that ensures 
> > these invariants hold? I see the following in instance.cpp and query.cpp: 
> >  - receivedInsert, receivedUpdate, and receivedDelete call Lock::DBWrite 
> > lk(ns), which I guess grabs some database level lock, and releases the lock 
> > should there be a "PageFaultException" (which I guess is I/O). Is this a 
> > database level lock that gets yielded during I/O? 
> >  - receivedInsert has a reference to "read locked in big log". What does 
> > this mean? 
> >  - runQuery, through "Client::ReadContext ctx( ns , dbpath );" grabs some 
> > read lock? Is this a read lock on the same lock grabbed in receivedInsert 
> > etc...? 
> > 
> > I guess some locking needs to be in place to ensure that the opLog and 
> > journal is in sync with the data heap, but with the locking above, I do not 
> > understand how this is done. Is there a global rw lock that does this? If 
> > so, where in code can I read about it? 
> > 
> > Thanks 
> > -Zardosht 
> > 
> > -- 
> > You received this message because you are subscribed to the Google Groups 
> > "mongodb-dev" group. 
> > To view this discussion on the web visit 
> > https://groups.google.com/d/msg/mongodb-dev/-/FxydGgHX4Q4J. 
> > To post to this group, send email to mongo...@googlegroups.com. 
> > To unsubscribe from this group, send email to 
> > mongodb-dev...@googlegroups.com. 
> > For more options, visit this group at 
> > http://groups.google.com/group/mongodb-dev?hl=en. 
>
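
The locking pattern guessed at in the quoted message (take a write lock, and 
give it up if the operation hits a page fault) can be sketched roughly as 
follows. This is only an illustration of that guess in Python, not MongoDB's 
code: PageFault, db_lock, resident, store, do_insert, and received_insert are 
hypothetical stand-ins, and the "I/O" is simulated by adding a page to a set.

```python
import threading


class PageFault(Exception):
    """Hypothetical stand-in for the PageFaultException named in the thread."""

    def __init__(self, page):
        super().__init__(page)
        self.page = page


db_lock = threading.Lock()  # stand-in for the lock taken by Lock::DBWrite
resident = set()            # pages currently "in memory"
store = {}                  # page -> list of documents
events = []                 # trace of what happened, for illustration


def do_insert(page, doc):
    # The write can only proceed if the page is already in memory.
    if page not in resident:
        raise PageFault(page)
    store.setdefault(page, []).append(doc)


def received_insert(page, doc):
    while True:
        try:
            with db_lock:
                events.append("locked")
                do_insert(page, doc)
                return
        except PageFault as fault:
            # Leaving the with-block released the lock, so the slow
            # "I/O" below does not block other operations.
            events.append(("fault", db_lock.locked()))
            resident.add(fault.page)  # simulate paging the data in
            # Loop around: retake the lock and retry the insert.


received_insert("p1", {"x": 1})
```

The first attempt faults, the lock is released while the page is brought in, 
and the second attempt succeeds under a freshly acquired lock.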

------=_Part_23_29896476.1348605323346--

------=_Part_22_10047944.1348605323346--