On 13 May 2013, at 02:33, Facundo Domínguez wrote:
>> establishing "that all messages sent on the old connection have been received" is theoretically impossible without the presence of infinite storage.
>
> One way to address this is to have the process in the receiver side
> complain if an earlier connection from the same sender is still open.
> The sender would need to issue a reconnect call in such a case, and
> the receiver would be told to ignore the old connection on the next
> connection attempt.
That's exactly what we're suggesting. There's no work to do here - this is already how both the network-transport and the distributed-process layers do things today. What I was driving at is that without that explicit call to 'reconnect', there is no way to do this without turning our lightweight asynchronous message passing semantics into a heavyweight, synchronous protocol requiring some degree of consensus.
Note that even in the case where the sender *does* explicitly call 'reconnect', there is *still* no guarantee that the receiver will be able to release the connection in a timely fashion, because in the presence of communication failure, *ALL* methods for handling lost peers are highly timing dependent. For the simplest case, viz. a lost TCP connection, take a look at `man tcp` and consider the function of tcp_retries1 (defaults to 3 on Linux) and tcp_retries2, plus the configured RTO (and various other parameters that interact with this behaviour). This can mean delays of between 10 and 30 minutes before the OS networking layer "notices" that a peer socket is gone and returns ETIMEDOUT to the application layer. That's why I was talking about keep-alives, since they guarantee *some* timeliness in detecting that peers have disconnected.
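To put a rough number on that, here is a small sketch of the arithmetic. It assumes what I believe are the usual Linux defaults (tcp_retries2 = 15, a 200 ms minimum RTO and a 120 s maximum RTO) and a timeout that simply doubles after each unanswered retransmission. Those values are assumptions, so check your own sysctl settings; the function name is made up for illustration.

```haskell
import Text.Printf (printf)

-- | Hypothetical lower bound, in seconds, before TCP gives up on a peer:
-- the initial timeout plus one timeout per retry, each doubling from
-- 'rtoMin' and capped at 'rtoMax'. Illustrative only; the kernel's real
-- behaviour also depends on the measured round-trip time.
retransmitTimeout :: Int -> Double -> Double -> Double
retransmitTimeout retries rtoMin rtoMax =
  sum [ min rtoMax (rtoMin * 2 ^ k) | k <- [0 .. retries] ]

main :: IO ()
main =
  -- Assumed defaults: 15 retries, 0.2 s minimum RTO, 120 s maximum RTO.
  printf "%.1f seconds\n" (retransmitTimeout 15 0.2 120)
```

With those assumed defaults this comes to a little over 15 minutes on a fast link, and a larger measured RTT only pushes it up.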
We might handle this by writing an empty byte string on open connections every `n' seconds, thus stimulating a network-transport failure if the connection is down. This is *STILL* subject to the OS configuration however, since for TCP the various retry settings combined with the RTO are what determine when ETIMEDOUT is returned, as I mentioned previously. One can use TCP keep-alives, but these are of course transport specific. I don't know how other transports (such as CCI) deal with this, but we need a mechanism that works for everyone. The AMQP protocol uses "active keep-alives": the two peers agree on a timeout value and both transmit *and* receive 'heartbeats', considering the peer to be down if it does not "ping" within the configured delay. This increases load on the network, but guarantees we're not dependent on OS limits, because we handle the timeout ourselves without waiting for ETIMEDOUT. Of course, this means that if the network is saturated, we can lose connectivity even though the connection is still there, so there's a price to pay either way.
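The bookkeeping behind such an active keep-alive is small. The sketch below is only an illustration of the idea: every name in it is hypothetical, nothing here exists in network-transport or distributed-process, and it ignores the actual sending of heartbeats. It just tracks when each peer was last heard from and reports the ones that have been silent for longer than the agreed timeout.

```haskell
-- Hypothetical types for illustration only.
type PeerId  = String
type Seconds = Int

-- | When each peer was last heard from, in seconds on some monotonic clock.
type LastSeen = [(PeerId, Seconds)]

-- | Record a heartbeat from a peer at time 'now', replacing any older entry.
heartbeat :: PeerId -> Seconds -> LastSeen -> LastSeen
heartbeat peer now seen = (peer, now) : filter ((/= peer) . fst) seen

-- | Peers that have been silent for longer than the agreed timeout.
suspects :: Seconds -> Seconds -> LastSeen -> [PeerId]
suspects timeout now seen = [ p | (p, t) <- seen, now - t > timeout ]

main :: IO ()
main = do
  let timeout = 10
      -- "q" pings at t=3 and t=12; "p" pings only at t=0.
      seen = heartbeat "q" 12 (heartbeat "p" 0 (heartbeat "q" 3 []))
  -- At t=15 only "p" has exceeded the timeout.
  print (suspects timeout 15 seen)
```

The point of doing it this way is that the timeout is ours to choose, rather than whatever the OS retry settings happen to produce.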
> If no earlier connection is open, the receiver can
> accept the new connection.
Again, there's nothing to do here, because when the sender calls 'reconnect' the underlying connection is closed and the local state pertaining to that connection is discarded. So all this business about 'ignore the old connection' is irrelevant - there is no more connection after you call reconnect/disconnect, network interruptions notwithstanding.
Now in the absence of a call to reconnect/disconnect, if some sender P wants to connect to Q, it sends its own endpoint address to Q such that Q can reuse the connection. So a close from either end should do the trick. In the context of your original question - about resource handling, viz. leaking connections - the point you made, which I think has merit, is that it would be *nice* if we could detect early that the *receiver* has gone away, such that the connection (which we know is now useless) can be torn down early.
To simplify the common use case where the sender uses 'reconnect', we might add a primitive that sends a single message and then tears down the connection immediately. This guarantees that the connection is released quickly, and there is no further promise we can make about ordering between the two processes thereafter. We could leave implementing this to application developers (since it is ludicrously simple), but putting it in distributed-process allows us to document the semantics properly...
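import Control.Distributed.Process (Process, ProcessId, reconnect, send)
import Control.Distributed.Process.Serializable (Serializable)

-- | Sends a message to @them@. NB: <long dialogue about semantics>
sendOnce :: Serializable a => ProcessId -> a -> Process ()
sendOnce them msg = send them msg >> reconnect them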
Perhaps that's worth putting into the base library, since it's pretty useful for inter-process communications that don't rely on the ordering of subsequent deliveries, and documenting the behaviour will probably reduce confusion for new users. I can also add a "performance + resource management" wiki page and an additional simple tutorial on the subject, prior to the next release. Does that sound useful?
Cheers,
Tim