Cloud Haskell friends,
Our colleague Francesco Cesarini (founder and CTO of Erlang Solutions) has been writing a book. He writes (my emphasis):
Also, on an unrelated subject. I just recently finished my book for
O'Reilly on OTP and how to architect resilient and scalable systems.
The conversation we had at FP Days around Cloud Haskell with Duncan
Coutts came to mind. What I was trying to explain is that if you have a
network in between two nodes, you can lose messages, or
acknowledgments that messages have been received. And the reason I was
not too bothered about this as a developer is that this can happen even
when you lose a machine, a node or a process (or when the receiving
node or process is slow, triggering a timeout as a result).
In Erlang, you end
up handling all of these different errors in the same way, so it does
not matter what caused the issue. What I have tried to do in this book
is describe, once and for all, the programming model we use when
architecting for scalability and reliability. Our discussion and the
rationale are described in chapters 13-15 (and possibly some of 16):
https://www.dropbox.com/s/ibm4926rf73qrvc/DesigningForScalability160218.pdf?dl=0
Those were the hardest chapters in the book to write!
I thought that you would be interested – after all, Erlang is such an inspiration and there is such a wealth of experience in the Erlang community. Francesco is certainly interested in feedback, so I’ve cc’d him.
Thanks Francesco!
Simon
--
You received this message because you are subscribed to the Google Groups "Distributed Haskell" group.
To unsubscribe from this group and stop receiving emails from it, send an email to distributed-has...@googlegroups.com.
To post to this group, send email to distribut...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/distributed-haskell/fef9f7b41dca43a78651676c83f9b4ad%40DB4PR30MB030.064d.mgd.msft.net.
For more options, visit https://groups.google.com/d/optout.
Tim,
please forward this email to the relevant groups, as I am not allowed to post to them. Thx!
When speaking to Simon and Duncan, I recall reacting to their wanting guarantees that a message had reached the remote server. You cannot provide such guarantees. I also later heard that Cloud Haskell acknowledges messages sent across nodes. Once again, this does not make delivery reliable, as the acknowledgment itself can be lost.
What I was trying to explain back then was that the only way to scale your system is through asynchronous message passing. Sending acknowledgments for messages is superfluous, as message loss should be handled in the business logic of your system. This does not add complexity, because you are already handling potential loss: it could be caused by a process, a node or a host crashing, by a network failure, or by a slow node that triggers a timeout. As the error-propagation semantics are asynchronous, we use the same monitoring and recovery techniques locally within a node as we do across networks. This is all described in chapters 13-15.
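The point that one recovery path covers every failure cause can be sketched in plain Haskell. This is an illustrative in-process model, not Erlang or the Cloud Haskell API: callWithRetry and flakyPeer are made-up names, and the "peer" is a local IO action that may simply never reply, which is all the caller can ever observe, whatever the real cause.

```haskell
import Control.Concurrent (threadDelay)
import Data.IORef
import System.Timeout (timeout)

-- All failure modes (lost request, lost ack, dead peer, slow peer)
-- look the same to the caller: the reply does not arrive in time.
-- So one timeout-and-retry loop handles every one of them.
callWithRetry :: Int        -- retries left
              -> Int        -- per-attempt timeout, in microseconds
              -> IO String  -- one attempt; may block forever
              -> IO (Maybe String)
callWithRetry 0 _  _       = pure Nothing
callWithRetry n us attempt = do
  r <- timeout us attempt
  case r of
    Just reply -> pure (Just reply)
    Nothing    -> callWithRetry (n - 1) us attempt  -- same recovery, any cause

-- A flaky "peer": hangs on the first call (simulating a lost reply),
-- answers on the second.
flakyPeer :: IORef Int -> IO String
flakyPeer ref = do
  n <- atomicModifyIORef' ref (\k -> (k + 1, k))
  if n == 0 then threadDelay maxBound >> pure ""
            else pure "pong"

main :: IO ()
main = do
  ref <- newIORef 0
  r <- callWithRetry 3 100000 (flakyPeer ref)
  print r  -- Just "pong": first attempt timed out, second succeeded
```

The business logic only ever sees "no reply within the deadline", which is why adding a transport-level ack buys nothing here.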
Regards,
Francesco
Francesco, sorry, for some reason you were not added to CC, so resending this email.
On 27 Feb 2016 1:18 p.m., "Alexander V Vershilov" <alexander...@gmail.com> wrote:
Hi, friends.
It seems there is a bit of confusion here. If I understand correctly, the discussion about message guarantees was about the following problem (CC'ing Duncan so he can correct me).
The idea is that it is bad if *some* messages may be lost; this is not the same as a failed node or a host going down. It can happen when the network experiences a temporary loss of connectivity: in that case Erlang puts messages into a buffer, and if the buffer overflows, some messages are dropped (if I understand correctly). Once connectivity returns, delivery resumes, but some messages may have been lost and the program has no way to control that. This problem was discussed long ago. Cloud Haskell does not introduce any acknowledgment to avoid it, and it is true that acks would not help anyway; instead, it introduces an additional rule: if the connection between two nodes dies, then no more messages are delivered between processes that used the failed connection, unless an explicit reconnect is called. The explicit reconnect is a way for the programmer to say that the program is resilient to this kind of failure. We are not the only ones here; for example, in a slightly different field ("Using attestation to lift crash resilience to Byzantine resilience", Herzog et al.), the authors "hack" the program (automatically rewrite it) to obtain similar behaviour.
This problem could also be solved in a different way, by using special protocols that are resilient to temporary or persistent network failures. (Personal opinion; it may differ from the official one.) For example, at Tweag I/O we had such a solution, and the semantics described above interacted badly with our approach: calling reconnect explicitly introduced additional noise in the code, as well as complexity. As a result, starting from 0.5 Cloud Haskell has a family of 'unreliable' send methods with semantics similar to Erlang's, but they are not used by default.
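The reconnect rule described above can be modelled in a few lines of plain Haskell. This is a toy model of the semantics, not the real distributed-process API: Conn, sendMsg, connectionLost and reconnect are invented names; the real library tracks connection state per pair of endpoints internally.

```haskell
import Data.IORef

-- Toy model of the Cloud Haskell rule: once the connection to a peer
-- has failed, every later send to that peer is silently dropped until
-- the programmer calls reconnect explicitly.
data Conn = Conn { broken :: IORef Bool, delivered :: IORef [String] }

newConn :: IO Conn
newConn = Conn <$> newIORef False <*> newIORef []

sendMsg :: Conn -> String -> IO ()
sendMsg c msg = do
  b <- readIORef (broken c)
  if b then pure ()                             -- dropped: connection is dead
       else modifyIORef' (delivered c) (++ [msg])

connectionLost, reconnect :: Conn -> IO ()
connectionLost c = writeIORef (broken c) True
reconnect c      = writeIORef (broken c) False  -- programmer opts back in

main :: IO ()
main = do
  c <- newConn
  sendMsg c "m1"
  connectionLost c
  sendMsg c "m2"        -- dropped, as is every send until reconnect
  reconnect c
  sendMsg c "m3"        -- delivered again after the explicit reconnect
  readIORef (delivered c) >>= print  -- ["m1","m3"]
```

The design trade-off discussed in the thread is visible here: nothing after the failure is delivered by accident, but every send path must be prepared to call reconnect, which is the code noise the unreliable, Erlang-style send family was added to avoid.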
Sorry, I can comment only on this particular topic; I need to read the chapters more carefully before giving further feedback.
--
Alexander
You received this message because you are subscribed to the Google Groups "cloud-haskell-developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-haskell-deve...@googlegroups.com.