If I set up a watch on a remote actor (one on a remote actor system) and the network between me and the remote system fails, I get a Terminated message almost immediately. In fact, the remote actor hasn't terminated, and I can still use the ActorRef to send messages to it once comms are restored. (However, if comms fail a second time I don't get a second Terminated message.)
"Terminated" and "lost contact" are rather different states, and may need different handling. Does anyone know of a reliable way I can distinguish these?
Thanks
Alistair
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://akka.io/faq/
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/groups/opt_out.
Hi Alistair,
On Thursday, 16 January 2014 at 09:30, Alistair George wrote:
> If I set up a watch on a remote actor (one on a remote actor system) and the network between me and the remote system fails, I get a Terminated message almost immediately. In fact, the remote actor hasn't terminated, and I can still use the ActorRef to send messages to it once comms are restored. (However, if comms fail a second time I don't get a second Terminated message.)
> "Terminated" and "lost contact" are rather different states, and may need different handling. Does anyone know of a reliable way I can distinguish these?
Which version of Akka are you using?
When you say that the network fails, what do you mean? How long is it “failed”?
What does the log say?
Hi Akka,
Thanks for the reply. One question: if (in 2.3) a remote actor system becomes permanently quarantined, what do I have to do to re-establish communication once the comms problem is fixed?
Do I have to restart the remote actor system? Or the local one? Or both?
"At some point you just have to move on..."
Hi Alistair,
On Tue, Jan 21, 2014 at 8:38 AM, Alistair George <alistai...@gmail.com> wrote:
> Hi Akka,
> Thanks for the reply. One question: if (in 2.3) a remote actor system becomes permanently quarantined, what do I have to do to re-establish communication once the comms problem is fixed?
First of all, quarantining is a state in which the failure is no longer treated as just a communications problem: the remote system is declared dead. (It is declared, not proven, since all we know is that it does not reply.) Short communication failures do not trigger quarantine, and what counts as short is configurable.
> Do I have to restart the remote actor system? Or the local one? Or both?
From the remoting viewpoint it does not matter which one you restart. Obviously, if one of the systems genuinely crashed then that is the one to restart; otherwise it is application specific.
We respectfully disagree then. Akka's design in this regard is highly deliberate.
I'm not sure this is desirable behaviour. I shouldn't have to restart a process just to recover from a comms failure. After all, nothing in the process has failed, and it may be providing services to other clients that have not suffered any comms failure. They shouldn't have to take the impact of a restart.
One of the strengths of Akka is that it doesn't pretend to do things that can't be done in a distributed context; this is essential for transparent distribution. One of the things you can't do in a distributed system is give reliable, timely notification of a remote event, such as actor termination, and I don't think Akka should try.
What I'd prefer is this:
- Reconnect attempts should continue indefinitely.
- The DeathWatch protocol should be extended to include (possibly multiple) Reachable/Unreachable events.
- Terminated should only be delivered when the remote actor system is reachable and asserts that the watched actor does not exist. This might never happen: an actor might stay in an unreachable state forever.
I realise I can emulate this by setting the timeout before quarantine to be effectively infinite, and adding my own facility to detect reachability and termination, but this isn't trivial. I'd prefer this behaviour to be available out of the box, for both practical and conceptual reasons.
My plan for this was to have a proxy for NodeB watch state in NodeA. In normal (connected) operation it just remembers the current watch states (Actor Ax is/isn't watching Actor By) and passes the messages on to NodeB. If disconnected it just remembers the watch state. On reconnect, it sends a snapshot of the state to NodeB.
Because it's remembering state rather than messages it stays bounded. Need to make sure that "isn't watching" states get pruned, but that's just the usual sequence number/ack stuff (if disconnected, we can prune them immediately, because they don't form part of the snapshot).
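To make the proxy idea concrete, here is a minimal Java sketch of the bookkeeping only. It is an illustration under stated assumptions, not how Akka implements DeathWatch: `Watch`, `RemoteLink` and `WatchStateProxy` are hypothetical names, actors are identified by plain strings, and the sequence number/ack handling mentioned above is left out.

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

/** One "watcher is watching watchee" relation, keyed by actor path strings. */
final class Watch {
    final String watcher;
    final String watchee;

    Watch(String watcher, String watchee) {
        this.watcher = watcher;
        this.watchee = watchee;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Watch)) return false;
        Watch other = (Watch) o;
        return watcher.equals(other.watcher) && watchee.equals(other.watchee);
    }

    @Override public int hashCode() { return Objects.hash(watcher, watchee); }
}

/** Hypothetical stand-in for the connection to NodeB; not an Akka API. */
interface RemoteLink {
    void sendWatch(Watch w);
    void sendUnwatch(Watch w);
    void sendSnapshot(Set<Watch> current);
}

/** Lives in NodeA and remembers watch state rather than queued messages. */
final class WatchStateProxy {
    private final RemoteLink link;
    private final Set<Watch> watching = new LinkedHashSet<>();
    private boolean connected = true;

    WatchStateProxy(RemoteLink link) { this.link = link; }

    void watch(Watch w) {
        // Only forward when the state actually changed and NodeB is reachable.
        if (watching.add(w) && connected) link.sendWatch(w);
    }

    void unwatch(Watch w) {
        // While disconnected the entry is pruned at once: it is simply
        // absent from the snapshot sent on reconnect.
        if (watching.remove(w) && connected) link.sendUnwatch(w);
    }

    void disconnected() { connected = false; }

    void reconnected() {
        connected = true;
        // Send a copy so later changes do not alter the snapshot in flight.
        link.sendSnapshot(new LinkedHashSet<>(watching));
    }

    int size() { return watching.size(); }
}
```

The stored state grows with the number of live watch relations, not with the length of the outage, which is the sense in which it stays bounded.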
OK, so I think we're agreed that storage per remote node is bounded.
The next question is whether the number of remote nodes is bounded.
- In theory, yes: there are only a finite number of IP:port combinations
- In practice no: it's a very big number
- In my use case yes: I'm dealing with a fixed set of nodes
I can see that the possibility of new nodes continuously connecting and falling silent does require quarantine.
I'd be inclined to make this explicit in the DeathWatch protocol, giving message types Reachable, Unreachable, Terminated and Quarantined. However, I appreciate that you don't agree.
My takeaway from this is that, for my application, I need to set akka.remote.watch-failure-detector to have an effectively infinite timeout and use application heartbeats to detect reachability.
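For what it's worth, the application-heartbeat side of that plan might look roughly like the Java sketch below. It is only an outline under assumptions: `ReachabilityTracker` and `Reachability` are hypothetical names, nodes are identified by strings, and the caller supplies timestamps and does the actual sending of heartbeats. Nothing here is part of Akka.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Reachability as seen locally; deliberately separate from Terminated. */
enum Reachability { REACHABLE, UNREACHABLE }

/** Tracks application-level heartbeats per node and reports state changes. */
final class ReachabilityTracker {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();
    private final Map<String, Reachability> state = new HashMap<>();

    ReachabilityTracker(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    /** Records a heartbeat; returns true if the node just became reachable. */
    boolean heartbeat(String node, long nowMillis) {
        lastSeen.put(node, nowMillis);
        Reachability previous = state.put(node, Reachability.REACHABLE);
        return previous != Reachability.REACHABLE;
    }

    /** Call periodically; returns the nodes that just became unreachable. */
    List<String> tick(long nowMillis) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            boolean silent = nowMillis - e.getValue() > timeoutMillis;
            if (silent && state.get(e.getKey()) == Reachability.REACHABLE) {
                state.put(e.getKey(), Reachability.UNREACHABLE);
                changed.add(e.getKey());
            }
        }
        return changed;
    }

    /** Current state, or null if no heartbeat has ever been seen. */
    Reachability stateOf(String node) { return state.get(node); }
}
```

Because each transition is reported separately, a second comms failure produces a second Unreachable notification, and with a fixed set of nodes the two maps stay bounded.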
Hi Alistair,
On Wed, Jan 22, 2014 at 9:31 AM, Alistair George <alistai...@gmail.com> wrote:
> OK, so I think we're agreed that storage per remote node is bounded.
> The next question is whether the number of remote nodes is bounded.
> - In theory, yes: there are only a finite number of IP:port combinations
> - In practice no: it's a very big number
> - In my use case yes: I'm dealing with a fixed set of nodes
> I can see that the possibility of new nodes continuously connecting and falling silent does require quarantine.
> I'd be inclined to make this explicit in the DeathWatch protocol, giving message types Reachable, Unreachable, Terminated and Quarantined. However, I appreciate that you don't agree.
> My takeaway from this is that, for my application, I need to set akka.remote.watch-failure-detector to have an effectively infinite timeout and use application heartbeats to detect reachability.
Yes, a large watch-failure-detector acceptable-heartbeat-pause is most likely what you want. As for effectively infinite, it depends. Do you consider a system healthy if it has not responded in 1 hour? 1 day? 1 week? When you find your psychological limit (well, most likely the limit that makes sense from the operations viewpoint), you should use that value :)
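As a sketch of what that setting could look like in application.conf: the 1 h value below is purely an assumed example, not a recommendation, and the right figure is whatever limit makes sense for your operations.

```
akka.remote.watch-failure-detector {
  # Assumed example value; choose the longest silence you are willing
  # to tolerate before the remote system is considered dead.
  acceptable-heartbeat-pause = 1 h
}
```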
Hi Alistair,
On Wed, Jan 22, 2014 at 9:31 AM, Alistair George <alistai...@gmail.com> wrote:
> OK, so I think we're agreed that storage per remote node is bounded.
No, as the number of actors is not bounded. (You can go OOM from creating enough actors locally.)
> The next question is whether the number of remote nodes is bounded.
> - In theory, yes: there are only a finite number of IP:port combinations
> - In practice no: it's a very big number
> - In my use case yes: I'm dealing with a fixed set of nodes
You're forgetting the most important factor: time.
> I can see that the possibility of new nodes continuously connecting and falling silent does require quarantine.
> I'd be inclined to make this explicit in the DeathWatch protocol, giving message types Reachable, Unreachable, Terminated and Quarantined. However, I appreciate that you don't agree.
What would be the value of providing these?
Hi Victor,
On Wednesday, January 22, 2014 9:13:26 AM UTC, √ wrote:
> Hi Alistair,
> On Wed, Jan 22, 2014 at 9:31 AM, Alistair George <alistai...@gmail.com> wrote:
>> OK, so I think we're agreed that storage per remote node is bounded.
> No, as number of actors is not bounded. (You can go OOM from creating enough actors locally)
True, but I'm not sure how this is relevant to a discussion of the differential behaviour between connected and disconnected states.
>> The next question is whether the number of remote nodes is bounded.
>> - In theory, yes: there are only a finite number of IP:port combinations
>> - In practice no: it's a very big number
>> - In my use case yes: I'm dealing with a fixed set of nodes
> You're forgetting the most important factor: time.
There must be some misunderstanding here. When I talk of a node I mean an actor system that exposes a particular endpoint (IP:port, say). If the process that is listening on that endpoint is stopped and restarted, this does not (in my terminology) create a new node. The resources that refer to that endpoint do not need to build up indefinitely over restarts.
Of course, it could be that resources are building up indefinitely in normal operation, but again, I don't see the relevance of that question to the handling of disconnection.
>> I can see that the possibility of new nodes continuously connecting and falling silent does require quarantine.
>> I'd be inclined to make this explicit in the DeathWatch protocol, giving message types Reachable, Unreachable, Terminated and Quarantined. However, I appreciate that you don't agree.
> What would be the value of providing these?
Reachability notifications (such as are commonly derived from heartbeats) are used to provide notifications to operators and users of potential failures, and to initiate fallback behaviour (failover, say).
I think even you agree that Terminated has its uses.
As for Quarantined, I have two concerns about the proposed implementation of quarantining in 2.3:
- There is (if I understand correctly) no way of recovering from a quarantined state, short of a process restart.
- There's no programmatic notification of it, so no way to initiate recovery automatically.
The Quarantined message addresses the second of those.