Further DSS questions about fault and heartbeat

kennytm

Nov 3, 2013, 5:50:29 AM
to mozart-...@googlegroups.com
Hi all,

Stewart and I had another discussion on how to implement the heartbeat and the fault stream, and this is what we have so far:

1. Every entity has a fault stream. A distinct fault stream is associated with an entity on every site.
2. There are 3 parties that can set the fault state of an entity:

    - The site-specific "fault detector" can send `ok`, `tempFail` or `permFail` to an entity, indicating the network state of the home site the entity originates from;
    - The GC can end the stream with `nil` to indicate the entity has been garbage-collected;
    - User code can send `localFail` or `permFail` via `{Kill _}` and `{Break _}`.

3. When we want to send a message via an Entity, we check the fault stream first, using this algorithm:

        for break:Break State in {GetFaultStream Entity} do
            case State
            of ok then
                % Do send the message
                {DoSendMessage}
                {Break}
            [] tempFail then
                % Wait until we get back to the `ok` state
                skip
            else
                % localFail, permFail: drop the message and block this thread forever
                {Wait _}
            end
        end

4. The "fault detector" is installed for every site-to-site connection. It will also send heartbeat messages to determine whether we should put all associated entities into the `tempFail` state. We will use this algorithm:

        while still connected:
            record start time
            send a heartbeat message to remote
            wait until reply is received
            record end time
            rtt = end time - start time
            if rtt > rtt_timeout:
                set all associated entities to tempFail
            else:
                set all associated entities to ok
            recompute rtt_timeout
       
        # when the connection is lost, e.g. due to a "peer is disconnected" socket error:
        set all associated entities to permFail

        (The rtt_timeout algorithm is the same as in https://github.com/mozart/mozart/blob/a5a2f805955ca6156b9c0db3e3b421f8cddb5600/platform/emulator/libdp/glue_site.cc#L173)
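
For readers who do not read Oz, the send rule in item 3 can be modelled roughly as follows in Python. This is only a sketch under assumptions: the function name `send_when_ok` and its return values are made up for illustration, the fault stream is modelled as a plain iterable of state names, and a dropped message returns `"dropped"` instead of blocking the thread forever as the Oz version does.

```python
from typing import Callable, Iterable


def send_when_ok(fault_stream: Iterable[str], do_send: Callable[[], None]) -> str:
    """Consume fault states until the message is sent or has to be dropped.

    Hypothetical model of item 3. Returns "sent", "dropped", or
    "collected" (the stream ended, i.e. the entity was garbage-collected).
    """
    for state in fault_stream:
        if state == "ok":
            # The entity is reachable: send the message and stop looking.
            do_send()
            return "sent"
        if state == "tempFail":
            # Keep reading the stream until the state changes.
            continue
        # localFail or permFail: the message is dropped. The Oz version
        # blocks the calling thread here; this model just reports it.
        return "dropped"
    # The stream was closed with nil before any decisive state arrived.
    return "collected"
```

For example, `send_when_ok(["tempFail", "ok"], send)` calls `send` once, while `send_when_ok(["tempFail", "permFail", "ok"], send)` never calls it, because the `permFail` is seen first.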

So the questions are:

A. Are all of these sound?
B. In Mozart 1.4, it seems the tempFail/permFail determination depends on an entity's annotation, according to http://mozart.github.io/mozart-v1/doc-1.4.0/dstutorial/node4.html#label243. Do we have that in Mozart 2.0?

-- Kenny.

Ruma Paul

Nov 6, 2013, 10:54:02 AM
to mozart-...@googlegroups.com
Hi Kenny,

In the fault detector algorithm, if a site waits for a response from the remote site, it will block when the remote site is already down, because in that case it will never receive a response. It then either never modifies the fault state, or detects `tempFail` much later than it should.

I think that, for each site-to-site connection, a site should wait for rtt_timeout, and if no response has been received by then, set all the associated entities to `tempFail`. Later, when it receives a response from that remote site, it can calculate the RTT and update rtt_timeout based on it.
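
A minimal sketch of this deadline-based variant, written in Python to match the pseudocode earlier in the thread. Everything named here is hypothetical: the class `TimeoutFaultDetector`, its event methods, and the `set_state` callback are invented for illustration, and timestamps are passed in explicitly so that no real clock is needed. The `recompute_timeout` policy (twice the last RTT) is a placeholder assumption and not the algorithm in glue_site.cc. Returning to `ok` as soon as a late reply arrives is also an assumption.

```python
from typing import Callable, Optional


class TimeoutFaultDetector:
    """Hypothetical per-connection detector that never blocks on a reply."""

    def __init__(self, initial_timeout: float, set_state: Callable[[str], None]):
        self.rtt_timeout = initial_timeout
        self.set_state = set_state  # applies a state to all associated entities
        self.state = "ok"
        self.sent_at: Optional[float] = None  # send time of the pending heartbeat

    def _change(self, new_state: str) -> None:
        # Only report real transitions, so repeated ticks do not spam entities.
        if new_state != self.state:
            self.state = new_state
            self.set_state(new_state)

    def on_heartbeat_sent(self, now: float) -> None:
        self.sent_at = now

    def on_tick(self, now: float) -> None:
        # Called periodically: report tempFail once the deadline has passed
        # without a reply, instead of waiting for the reply itself.
        if self.sent_at is not None and now - self.sent_at > self.rtt_timeout:
            self._change("tempFail")

    def on_reply(self, now: float) -> None:
        if self.sent_at is None:
            return  # no heartbeat is pending; ignore a stray reply
        rtt = now - self.sent_at
        self.sent_at = None
        self.rtt_timeout = self.recompute_timeout(rtt)
        self._change("ok")  # assumption: any reply means the site is reachable

    def on_connection_lost(self) -> None:
        self._change("permFail")

    def recompute_timeout(self, rtt: float) -> float:
        # Placeholder policy (an assumption): twice the last measured RTT.
        return 2.0 * rtt
```

With an initial timeout of 1.0, a heartbeat sent at time 0.0 and a tick at 1.5 moves the entities to `tempFail` even though no reply has arrived. A late reply at 3.0 then measures an RTT of 3.0 and updates the timeout from it.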

Thanks,
Ruma