Hi all,
Stewart and I had another discussion on how to implement the heartbeat and the fault stream, and this is what we came up with:
1. Every entity has a fault stream. Each site keeps its own distinct fault stream for a given entity.
2. There are 3 parties that can set the failure state of an entity:
- The site-specific "fault detector" can send `ok`, `tempFail` or `permFail` to an entity, indicating the network state of the home site the entity originates from;
- The GC can end the stream with `nil` to indicate the entity has been garbage-collected;
- User code can send `localFail` or `permFail` via `{Kill _}` and `{Break _}`.
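To make the three writers concrete, here is a rough Python sketch of one per-site fault stream. The `FaultStream` class, the party names `"detector"` and `"user"`, and the `ALLOWED` table are made up for illustration and are not Mozart APIs. It only encodes which party may send which state, as listed above; it deliberately says nothing about precedence between the parties.

```python
# Hypothetical model of a per-site fault stream (not Mozart code).
# Which states each party may send, taken from the list above.
ALLOWED = {
    "detector": {"ok", "tempFail", "permFail"},
    "user": {"localFail", "permFail"},
}


class FaultStream:
    def __init__(self):
        self.states = []     # states sent so far, oldest first
        self.closed = False  # True once the GC has ended the stream

    def send(self, party, state):
        """Append a state if the stream is open and the party may send it."""
        if self.closed:
            raise ValueError("stream already ended by the GC")
        if state not in ALLOWED.get(party, set()):
            raise ValueError(f"{party} may not send {state}")
        self.states.append(state)

    def close(self):
        """GC ends the stream (the `nil` terminator in Oz)."""
        self.closed = True
```

For example, `FaultStream().send("detector", "localFail")` raises `ValueError`, since only user code sends `localFail`.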
3. When we want to send a message via an Entity, we check the fault stream first, using this algorithm:
   for break:Break State in {GetFaultStream Entity} do
      case State
      of ok then
         % Do send the message, then stop reading the fault stream
         {DoSendMessage}
         {Break}
      [] tempFail then
         % Wait until we get back to the `ok` state
         skip
      else
         % localFail, permFail: Drop the message and wait forever (drop the thread)
         {Wait _}
      end
   end
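The same check, sketched in Python so the three outcomes are easy to see. `try_send` and its return strings are hypothetical names. One deliberate difference from the Oz version: on `localFail`/`permFail` it returns `"dropped"` instead of blocking the thread forever, just so the sketch can be run and tested.

```python
def try_send(fault_states, do_send):
    """Walk the fault stream and decide what happens to one message.

    fault_states: iterable of state names; exhaustion stands in for `nil`.
    do_send: zero-argument callable that actually sends the message.
    """
    for state in fault_states:
        if state == "ok":
            do_send()
            return "sent"
        if state == "tempFail":
            # Keep reading until the state changes.
            continue
        # localFail or permFail: the message is dropped.
        return "dropped"
    # The stream ended before we saw `ok` (entity was garbage-collected).
    return "stream-ended"
```

With the states `["tempFail", "ok"]` the message is sent once; with `["tempFail", "permFail", "ok"]` it is dropped and the trailing `ok` is never looked at.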
4. The "fault detector" is installed for every site-to-site connection. It also sends heartbeat messages to determine whether we should put all associated entities into the `tempFail` state. We will use this algorithm:
    while still connected:
        record start time
        send a heartbeat message to remote
        wait until reply is received
        record end time
        rtt = end time - start time
        if rtt > rtt_timeout:
            set all associated entities to tempFail
        else:
            set all associated entities to ok
        recompute rtt_timeout
    # when the connection is lost, e.g. due to a "peer disconnected" socket error:
    set all associated entities to permFail
(The rtt_timeout algorithm is the same as in
https://github.com/mozart/mozart/blob/a5a2f805955ca6156b9c0db3e3b421f8cddb5600/platform/emulator/libdp/glue_site.cc#L173)
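A runnable Python sketch of that loop, with the clock, the heartbeat and the timeout update passed in as functions so it can be exercised without a network. All names here (`run_detector`, `ping`, `now`, `set_all`, `recompute_timeout`) are hypothetical. The timeout update is left to the caller and is not a reimplementation of the `glue_site.cc` code. A lost connection is assumed to surface as `ConnectionError` raised from `ping`.

```python
def run_detector(ping, now, set_all, recompute_timeout, rtt_timeout):
    """Heartbeat loop for one site-to-site connection.

    ping: sends a heartbeat and blocks until the reply arrives;
          assumed to raise ConnectionError when the connection is lost.
    now: returns the current time in seconds.
    set_all: sets the state of every entity tied to this connection.
    recompute_timeout: (old_timeout, rtt) -> new_timeout, caller-supplied.
    """
    try:
        while True:
            start = now()
            ping()
            rtt = now() - start
            set_all("tempFail" if rtt > rtt_timeout else "ok")
            rtt_timeout = recompute_timeout(rtt_timeout, rtt)
    except ConnectionError:
        # Connection lost: every associated entity fails permanently.
        set_all("permFail")
```

One thing the sketch makes visible: because the loop waits for the reply before comparing against `rtt_timeout`, entities only move to `tempFail` once a late reply has actually arrived.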
So the questions are:
A. Is all of this sound?
B. In Mozart 1.4, it seems the tempFail/permFail determination depends on an entity's annotation, according to
http://mozart.github.io/mozart-v1/doc-1.4.0/dstutorial/node4.html#label243. Do we have that in Mozart 2.0?
-- Kenny.