DSS heartbeat

Stewart Mackenzie

Oct 10, 2013, 4:07:54 AM

to mozart-...@googlegroups.com

Just a few notes from last Sunday's meeting, and a few issues regarding making the ozdss more resilient against failure.

1. What we know.

a. About the fault model.
  ~ There are `ok`, `tempFail`, `localFail`, `permFail` states.
  ~ These states are associated with an 'entity' (e.g. variables and mutable things such as cells, arrays, ports, etc.)
  ~ Actually each 'entity' has an associated "fault stream" which pumps out these states (i.e. {GetFaultStream ACell} = ok|tempFail|ok|ok|tempFail|permFail|...)
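
To make the fault model above concrete, here is a minimal sketch in Python (not Oz). The names `FaultState`, `Entity`, `report` and `fault_stream` are hypothetical and are not part of the DSS API. The sketch assumes that `permFail` is final and that only state changes are recorded.

```python
from enum import Enum
from typing import Iterator, List


class FaultState(Enum):
    OK = "ok"
    TEMP_FAIL = "tempFail"
    LOCAL_FAIL = "localFail"
    PERM_FAIL = "permFail"


class Entity:
    """Hypothetical stand-in for a distributed entity (cell, port, variable)."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Every entity starts out in the ok state.
        self._history: List[FaultState] = [FaultState.OK]

    @property
    def state(self) -> FaultState:
        return self._history[-1]

    def report(self, new_state: FaultState) -> None:
        """Record a state change (assumption: nothing follows permFail)."""
        if self.state is FaultState.PERM_FAIL:
            return
        if new_state is not self.state:
            self._history.append(new_state)

    def fault_stream(self) -> Iterator[FaultState]:
        """Snapshot of the states seen so far, oldest first."""
        return iter(list(self._history))


cell = Entity("ACell")
for s in (FaultState.TEMP_FAIL, FaultState.OK, FaultState.PERM_FAIL):
    cell.report(s)
print("|".join(s.value for s in cell.fault_stream()))
```

A real fault stream is an unbounded Oz stream that a thread can wait on; the list snapshot here only illustrates the sequence of states.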

b. About the delay determination.
  ~ The basic loop (pseudocode):

    while (true) {
        send message
        wait for reply
        if (has reply) {
            decrease delay
        } else {
            increase delay
        }
    }

  ~ We know of a more sophisticated algorithm in Mozart 1 at https://github.com/mozart/mozart/blob/a5a2f805955ca6156b9c0db3e3b421f8cddb5600/platform/emulator/libdp/glue_site.cc#L173
  ~ If the response time is > the delay, we turn 'something' [1] to a `tempFail` state.
  ~ Receiving a message to 'something' [1] in a `tempFail` state will turn it back to the `ok` state.
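
The simple loop above could be sketched as follows in Python. This is an illustration only, not the Mozart 1 algorithm linked above. The class name `DelayEstimator`, the multiplicative grow/shrink factors and the bounds are all assumptions chosen for the example.

```python
class DelayEstimator:
    """Hypothetical adaptive delay: shrink on a reply, grow on a miss."""

    def __init__(self, initial: float = 1.0, minimum: float = 0.1,
                 maximum: float = 30.0, grow: float = 2.0,
                 shrink: float = 0.9) -> None:
        self.delay = initial
        self.minimum = minimum
        self.maximum = maximum
        self.grow = grow
        self.shrink = shrink
        self.state = "ok"

    def on_probe(self, got_reply: bool) -> str:
        """Update delay and state after one send/wait round."""
        if got_reply:
            # A reply arrived within the delay: tighten it and go back to ok.
            self.delay = max(self.minimum, self.delay * self.shrink)
            self.state = "ok"
        else:
            # No reply within the delay: relax it and suspect the peer.
            self.delay = min(self.maximum, self.delay * self.grow)
            self.state = "tempFail"
        return self.state


est = DelayEstimator(initial=1.0)
for reply in (True, False, False, True):
    print(est.on_probe(reply), round(est.delay, 3))
```

Clamping the delay between a minimum and a maximum keeps a long outage from pushing the delay up without bound; whether the real implementation does this is not stated in this thread.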

2. Questions.

a. What should the fault stream be associated with (i.e. what is that 'something' [1] mentioned above?):
  * entities?
  * connections?
  * sites (i.e. an entire Mozart VM process)?
From the slides in http://www.info.ucl.ac.be/~pvr/LADA2012PVR.pdf it seems it should be a local property of the entities of a site, but the source code of Mozart 1 seems to indicate it is a local property of the connection to another site (VM process).

b. What to do when we send a message to a `tempFail` entity?
  * Should we send anyway? If so, what's the point of `tempFail` anyway?
  * Should we enqueue the message locally without sending it? If so, what happens when both sides of a connection mark the other one as `tempFail`? Wouldn't it go into deadlock?

c. How do we execute the delay determination loop?
  * Do we really need to add a dedicated 'heartbeat' message? Would it flood the whole network?
  * Could we use ordinary messages for heartbeat purposes?

-------

Sample pseudocode to start. We may add more later as further questions come up. This code is for illustration purposes only; please correct/adjust where necessary.

Site A: /* purpose: as a server. examine the effect of tempFail. */

create ticket for a Cell.
create ticket for a Variable.
thread
    sleep 5 seconds.
    bind Variable = 6 /* <- consider how to deal with tempFail here. */
end

Site B: /* purpose: as a client. examine the effect of tempFail. */

take ticket of Cell from Site A.
take ticket of Variable from Site A.
Assign Cell to 5.
sleep 10 seconds.
Assign Cell to 7.

Site C: /* purpose: make the Cell distributed to more than 2 parties. */

take ticket of Cell from Site A.

Ideal long-term (eventual) end result:

Cell set to 5 before Site B sleeps (before t = 10s).
Cell set to 7 eventually (after t = 10s).
Variable set to 6 eventually (after t = 5s).

Kind regards
Stewart

Peter Van Roy

Oct 11, 2013, 5:41:57 AM

to mozart-...@googlegroups.com

On Thursday, October 10, 2013 10:07:54 AM UTC+2, stewart mackenzie wrote:

> Just a few notes from last Sunday's meeting, and a few issues regarding making the ozdss more resilient against failure.
>
> 1. What we know.
>
> (...)
>
> 2. Questions.
>
> a. Whether the fault stream should be associated with (i.e. what is that 'something' [1] mentioned above?): * entities? * connections? * sites (i.e. an entire mozart VM process)? From the slides in http://www.info.ucl.ac.be/~pvr/LADA2012PVR.pdf it seems it should be a local property of the entities of a site, but the source code of Mozart 1 seems to indicate it is a local property of the connection to another site (VM process).

The failure detection is implemented by looking at a site, but a fault stream must be associated to an entity. This is actually quite important, since each entity can have different behavior even if they are on (mostly) the same sites. For example, it's possible for an application to Kill an entity (force it to permFail) - this is a very useful operation for building abstractions! Doing this should not break the connection to the site, though: other entities on the same sites will just continue to work.
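
A rough sketch of this split between site-level detection and per-entity fault streams, in Python rather than Oz. `RemoteSite`, `Entity`, `detector_says` and `kill` are hypothetical names used only for illustration and do not reflect the actual DSS code.

```python
from typing import Dict, List


class Entity:
    """Hypothetical entity with its own fault stream (a list of states)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.fault_stream: List[str] = ["ok"]

    def report(self, state: str) -> None:
        # permFail is final; a repeated state is not recorded twice.
        if self.fault_stream[-1] not in ("permFail", state):
            self.fault_stream.append(state)

    def kill(self) -> None:
        """Application-level Kill: force this entity alone to permFail."""
        self.report("permFail")


class RemoteSite:
    """Hypothetical site-level detector that feeds entity fault streams."""

    def __init__(self) -> None:
        self.connected = True
        self.entities: Dict[str, Entity] = {}

    def add(self, name: str) -> Entity:
        entity = Entity(name)
        self.entities[name] = entity
        return entity

    def detector_says(self, state: str) -> None:
        # One observation about the site is pushed to every entity on it.
        for entity in self.entities.values():
            entity.report(state)


site = RemoteSite()
cell, port = site.add("cell"), site.add("port")
cell.kill()                    # only the cell fails permanently
site.detector_says("tempFail")
site.detector_says("ok")
print(cell.fault_stream, port.fault_stream, site.connected)
```

Killing the cell leaves the connection and the port untouched, which is the behavior described above.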

> b. What to do when we send message to a `tempFail` entity? * Should we send anyway? If so, what's the point of `tempFail` anyway? * Should we enqueue the message locally without sending it? If so, what happens when both sides of connection mark the other one as `tempFail`, wouldn't it go into dead lock?

When you send a message to a failed entity, the message should be enqueued and the send should block (the sender should stop sending). There are two principles:
1. If the tempFail goes away eventually (perhaps the network has just slowed down a bit), everything should work normally as if there were no tempFails. So you can't just drop a message!
2. If the tempFail lasts long, the idea is never to do anything that would not be done if the entity were correct. So if you can't do anything correct, just wait indefinitely. Failure states will not cause anything wrong to be done, but they may cause an operation to block.

The idea is that there is another thread observing the fault stream. If this thread sees a tempFail it may decide to terminate the part of the application that has blocked, or else it may just wait. See the Erlang "let it crash" philosophy: when part of an application can't do reasonable work because of a failure, then the best thing is to just terminate it (instead of trying to go ahead anyway and handle the complexities of the failure mode).

Deadlock is not possible, since even if both sides of a connection block, there is another part of the system that is observing the fault stream and that can decide to break the deadlock.
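
The blocking send plus a separate observer can be sketched with Python threads. This is a simplified illustration under assumptions: `BlockingEntity`, `set_state` and `cancel` are hypothetical names, delivery order among several blocked senders is not modeled, and `cancel` stands in for the observer terminating the blocked part of the application.

```python
import threading
from typing import List


class BlockingEntity:
    """Hypothetical proxy: send() blocks while the entity is not ok."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self.state = "ok"
        self.cancelled = False
        self.delivered: List[str] = []

    def set_state(self, state: str) -> None:
        """Called by the failure detector when the state changes."""
        with self._cond:
            self.state = state
            self._cond.notify_all()

    def cancel(self) -> None:
        """Called by an observer that gives up on the blocked senders."""
        with self._cond:
            self.cancelled = True
            self._cond.notify_all()

    def send(self, msg: str) -> bool:
        """Deliver msg once the entity is ok; return False if cancelled."""
        with self._cond:
            while self.state != "ok" and not self.cancelled:
                self._cond.wait()   # never drop the message, just wait
            if self.cancelled:
                return False
            self.delivered.append(msg)
            return True


entity = BlockingEntity()
entity.set_state("tempFail")
sender = threading.Thread(target=entity.send, args=("hello",))
sender.start()
entity.set_state("ok")   # the tempFail goes away, the send completes
sender.join()
print(entity.delivered)
```

If the state never returns to `ok`, the sender stays blocked until some observer calls `cancel`, so nothing incorrect is done in the meantime.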

> c. How do we execute the delay determination loop? * Do we really need to add a dedicated 'heartbeat' message? Would it flood the whole network? * Could we use ordinary messages for heartbeat purposes?

Might be possible. But if there are no ordinary messages, then heartbeats are necessary. Heartbeats are important because they let the application react *quickly*: a tempFail should be detected quickly, it's not a time out!
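
One way to combine both ideas is to treat any incoming message as evidence of liveness and to send a dedicated heartbeat only when the link has been idle. The Python sketch below assumes this design; `LivenessMonitor`, `probe_interval` and `suspect_after` are hypothetical, and in practice `suspect_after` would presumably come from the adaptive delay rather than being a fixed constant.

```python
class LivenessMonitor:
    """Hypothetical monitor: ordinary traffic doubles as heartbeat."""

    def __init__(self, probe_interval: float, suspect_after: float) -> None:
        self.probe_interval = probe_interval
        self.suspect_after = suspect_after
        self.last_heard = 0.0

    def on_message(self, now: float) -> None:
        """Any incoming message, ordinary or heartbeat reply, counts."""
        self.last_heard = now

    def needs_heartbeat(self, now: float) -> bool:
        """True only when nothing was heard for a full probe interval."""
        return now - self.last_heard >= self.probe_interval

    def state(self, now: float) -> str:
        """Suspect the peer once it has been silent for too long."""
        silent_for = now - self.last_heard
        return "tempFail" if silent_for > self.suspect_after else "ok"


mon = LivenessMonitor(probe_interval=1.0, suspect_after=3.0)
mon.on_message(now=10.0)
print(mon.needs_heartbeat(now=10.5), mon.needs_heartbeat(now=11.0))
print(mon.state(now=12.0), mon.state(now=13.5))
```

With steady ordinary traffic no heartbeat is ever sent, which addresses the flooding concern; on an idle link the heartbeat rate is bounded by `probe_interval`.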

Hope this helps to understand the approach! You can also take a look at Raphael Collet's Ph.D. thesis (see the PLDC web site).

Peter

Peter Van Roy

Oct 11, 2013, 5:43:08 AM

to mozart-...@googlegroups.com

On Friday, October 11, 2013 11:41:57 AM UTC+2, Peter Van Roy wrote:

> On Thursday, October 10, 2013 10:07:54 AM UTC+2, stewart mackenzie wrote:
> > Just a few notes from last Sunday's meeting, and a few issues regarding making the ozdss more resilient against failure.
> >
> > 1. What we know.
> >
> > (...)
> >
> > 2. Questions.
> >
> > a. Whether the fault stream should be associated with (i.e. what is that 'something' [1] mentioned above?): * entities? * connections? * sites (i.e. an entire mozart VM process)? From the slides in http://www.info.ucl.ac.be/~pvr/LADA2012PVR.pdf it seems it should be a local property of the entities of a site, but the source code of Mozart 1 seems to indicate it is a local property of the connection to another site (VM process).
>
> The failure detection is implemented by looking at a site, but a fault stream must be associated to an entity. This is actually quite important, since each entity can have different behavior even if they are on (mostly) the same sites. For example, it's possible for an application to Kill an entity (force it to permFail) - this is a very useful operation for building abstractions! Doing this should not break the connection to the site, though: other entities on the same sites will just continue to work.

It's possible to have a distributed version of an entity even if it's not actually distributed over multiple nodes. For example, assume that you have a distributed port on two nodes and it is GC'ed on one node. Then the distributed port exists on one node. Another example is finalization: you can use fault streams to implement post-mortem finalization (which is the best way to do finalization): just create a fault stream on the entity and wait until the fault stream is terminated by nil (this should be done when the entity goes away through GC). At that point you can do the cleanup actions. Throughout all this, the entity does not have to be distributed.

Peter

Sébastien Doeraene

Oct 11, 2013, 5:49:37 AM

to Peter Van Roy, mozart-...@googlegroups.com

Hi,

On Fri, Oct 11, 2013 at 11:43 AM, Peter Van Roy <p...@info.ucl.ac.be> wrote:

> Another example is finalization: you can use fault streams to implement post-mortem finalization (which is the best way to do finalization): just create a fault stream on the entity and wait until the fault stream is terminated by nil (this should be done when the entity goes away through GC). At that point you can do the cleanup actions. Throughout all this, the entity does not have to be distributed.

Although it doesn't invalidate the discussion, I just want to point out that finalization is best implemented with {System.postmortem X P Y}. It doesn't require spawning a full-fledged distributed "profile" for the entity.

Cheers,
Sébastien
