Just a few notes from last Sunday's meeting, and a few issues regarding making the ozdss more resilient against failure.
1. What we know.
a. About the fault model.
   ~ There are `ok`, `tempFail`, `localFail`, `permFail` states.
   ~ These states are associated with an 'entity' (i.e. variables and mutable things such as cells, arrays, ports, etc.).
   ~ Each 'entity' has an associated "fault stream" which emits these states over time (e.g. {GetFaultStream ACell} = ok|tempFail|ok|ok|tempFail|permFail|...)
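To make the fault stream idea concrete, here is a minimal Python sketch of an entity carrying its own stream of fault states. The names `FaultStream`, `Cell.faults` and `get_fault_stream` are made up for illustration and are not the Mozart API; the sketch also assumes every stream starts in `ok`.

```python
from enum import Enum


class FaultState(Enum):
    OK = "ok"
    TEMP_FAIL = "tempFail"
    LOCAL_FAIL = "localFail"
    PERM_FAIL = "permFail"


class FaultStream:
    """Append-only history of fault states for one entity (hypothetical helper)."""

    def __init__(self):
        # Assumption: an entity starts out in the ok state.
        self._states = [FaultState.OK]

    def push(self, state):
        """Record a new fault state at the end of the stream."""
        self._states.append(state)

    @property
    def current(self):
        """The most recently recorded state."""
        return self._states[-1]

    def history(self):
        """All recorded states, oldest first, using the names from the notes."""
        return [s.value for s in self._states]


class Cell:
    """Stand-in for a mutable entity; each one owns a fault stream."""

    def __init__(self, value=None):
        self.value = value
        self.faults = FaultStream()


def get_fault_stream(entity):
    """Rough analogue of {GetFaultStream ACell} for this sketch."""
    return entity.faults
```

Usage: after `c = Cell(0)`, pushing `FaultState.TEMP_FAIL` and then `FaultState.OK` makes `get_fault_stream(c).history()` read ok, tempFail, ok.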
b. About the delay determination.
   ~ The basic loop:
       while (true) {
           send message
           wait for reply
           if (has reply) {
               decrease delay
           } else {
               increase delay
           }
       }
   ~ We know of a more sophisticated algorithm in Mozart 1 at
     https://github.com/mozart/mozart/blob/a5a2f805955ca6156b9c0db3e3b421f8cddb5600/platform/emulator/libdp/glue_site.cc#L173
   ~ If the response time is > the delay, we turn 'something' [1] to a `tempFail` state.
   ~ A message arriving at 'something' [1] while it is in the `tempFail` state turns it back to the `ok` state.
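The simple loop plus the two state rules can be sketched in Python as below. This is only an illustration of the naive increase/decrease scheme, not the Mozart 1 algorithm linked above. The class name `LinkMonitor`, the multiplicative `factor` and the `min_delay`/`max_delay` bounds are all assumptions, and the sketch does not decide what the monitored 'something' [1] actually is.

```python
class LinkMonitor:
    """Tracks an adaptive delay and an ok/tempFail state for one peer (hypothetical)."""

    def __init__(self, delay=1.0, min_delay=0.25, max_delay=32.0, factor=2.0):
        self.delay = delay          # current timeout, in seconds
        self.min_delay = min_delay  # assumed lower bound on the delay
        self.max_delay = max_delay  # assumed upper bound on the delay
        self.factor = factor        # assumed multiplicative step
        self.state = "ok"

    def on_timeout(self):
        """No reply arrived within the current delay: suspect the peer, back off."""
        self.state = "tempFail"
        self.delay = min(self.delay * self.factor, self.max_delay)

    def on_message(self):
        """A message arrived: the peer is alive again, so tighten the delay."""
        self.state = "ok"
        self.delay = max(self.delay / self.factor, self.min_delay)
```

For example, starting from a delay of 1.0, two timeouts in a row give `tempFail` with a delay of 4.0, and the next message restores `ok` with a delay of 2.0. Note that a late reply (timeout first, message afterwards) leaves the delay where it started in this sketch; whether that is the desired behaviour is open.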
2. Questions.
a. What should the fault stream be associated with (i.e. what is that 'something' [1] mentioned above)?
   * entities?
   * connections?
   * sites (i.e. an entire Mozart VM process)?
   From the slides at
   http://www.info.ucl.ac.be/~pvr/LADA2012PVR.pdf
   it seems it should be a local property of the entities of a site, but the Mozart 1 source code seems to indicate it is a local property of the connection to another site (VM process).
b. What should we do when we send a message to a `tempFail` entity?
   * Should we send it anyway? If so, what is the point of `tempFail`?
   * Should we enqueue the message locally without sending it? If so, what happens when both sides of a connection mark each other as `tempFail`? Wouldn't that deadlock?
c. How do we execute the delay determination loop?
   * Do we really need to add a dedicated 'heartbeat' message? Would it flood the whole network?
   * Could we use ordinary messages for heartbeat purposes?
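As one possible reading of the "ordinary messages as heartbeat" option in question c, the Python sketch below treats any message received from a peer as evidence of liveness and asks for a dedicated probe only after the link has been silent for a whole interval. `HeartbeatScheduler`, `note_received` and `needs_probe` are hypothetical names, and time is passed in explicitly so the behaviour is easy to check; this is not a proposal for the actual design.

```python
class HeartbeatScheduler:
    """Decides when a dedicated probe is needed for one peer (hypothetical)."""

    def __init__(self, interval, start=0.0):
        self.interval = interval      # seconds of silence tolerated before probing
        self.last_received = start    # time of the last message seen from the peer

    def note_received(self, now):
        """Any ordinary message from the peer doubles as a heartbeat."""
        self.last_received = now

    def needs_probe(self, now):
        """True only if nothing has been received for a full interval."""
        return now - self.last_received >= self.interval
```

With an interval of 5 seconds and a message received at t = 1, no probe is needed at t = 3, but one is needed at t = 6. On a busy link this sends no extra traffic at all; on an idle link it costs one probe per interval per peer.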
-------
Sample pseudocode to start with. We may have more to add later, and more questions to ask. This code is for illustration purposes only; please correct/adjust where necessary.
Site A: /* purpose: as a server. examine the effect of tempFail. */
    create ticket for a Cell.
    create ticket for a Variable.
    thread
        sleep 5 seconds.
        bind Variable = 6   /* <- consider how to deal with tempFail here. */
    end
Site B: /* purpose: as a client. examine the effect of tempFail. */
    take ticket of Cell from Site A.
    take ticket of Variable from Site A.
    Assign Cell to 5.
    sleep 10 seconds.
    Assign Cell to 7.
Site C: /* purpose: make the Cell distributed to more than 2 parties. */
    take ticket of Cell from Site A.
Ideal long-term (eventual) end result:
    Cell set to 5 before Site B sleeps (before t = 10s).
    Cell set to 7 eventually (after t = 10s).
    Variable set to 6 eventually (after t = 5s).
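The expected end result can also be written down as a small checkable timeline in Python. This only models the fault-free case and assumes that all sites start at t = 0, that ticket exchange is instantaneous and that there is no network latency; the `EVENTS` table and `state_at` function are hypothetical helpers, not part of any Mozart code.

```python
# (time in seconds, acting site, entity, new value), taken from the pseudocode.
EVENTS = [
    (0, "B", "Cell", 5),       # Site B assigns the Cell before sleeping
    (5, "A", "Variable", 6),   # Site A's thread binds the Variable after 5 s
    (10, "B", "Cell", 7),      # Site B assigns the Cell again after its 10 s sleep
]


def state_at(t, events=EVENTS):
    """Return the value of each entity once every event up to time t has applied."""
    state = {"Cell": None, "Variable": None}
    for when, _site, name, value in sorted(events, key=lambda e: e[0]):
        if when <= t:
            state[name] = value
    return state
```

A test harness for the tempFail cases could compare the eventually observed values on each site against `state_at` for a sufficiently late t.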
Kind regards
Stewart