Cloud Haskell friends,
Our colleague Francesco Cesarini (founder and CTO of Erlang Solutions) has been writing a book. He writes (my emphasis):
Also, on an unrelated subject. I just recently finished my book for
O'Reilly on OTP and how to architect resilient and scalable systems.
The conversation we had at FP Days around Cloud Haskell with Duncan
Coutts came to mind. What I was trying to explain is that if you have a
network in between two nodes, you can lose messages. Or the
acknowledgments that messages have been received. And the reason I was not
too bothered about this as a developer is that this can happen even when
you lose a machine, a node or a process (or the receiving node or
process is slow, triggering a timeout as a result).
In Erlang, you end
up handling all of these different errors in the same way, so it does
not matter what caused the issue. What I have tried doing in this book
is once and for all describe the programming model we use when
architecting for scalability and reliability. Our discussion and the
rationale are described in chapters 13-15 (and possibly some of 16):
https://www.dropbox.com/s/ibm4926rf73qrvc/DesigningForScalability160218.pdf?dl=0
Those were the hardest chapters in the book to write!
I thought that you would be interested – after all, Erlang is such an inspiration and there is such a wealth of experience in the Erlang community. Francesco is certainly interested in feedback, so I’ve cc’d him.
Thanks Francesco!
Simon
--
You received this message because you are subscribed to the Google Groups "Distributed Haskell" group.
To view this discussion on the web, visit https://groups.google.com/d/msgid/distributed-haskell/fef9f7b41dca43a78651676c83f9b4ad%40DB4PR30MB030.064d.mgd.msft.net.
Tim,
please forward this email to the relevant groups, as I am not allowed to post to them. Thx!
When speaking to Simon and Duncan, I recall reacting to their wanting guarantees that a message had reached the remote server. You cannot provide these guarantees. I also later heard that Cloud Haskell acknowledges messages sent across nodes. Once again, this is not reliable, as the ack itself can be lost.
What I was trying to explain back then was that the only way to scale your system is through asynchronous message passing. Sending acknowledgments for messages is superfluous, because message loss should be handled in the business logic of your system. This does not add complexity: you are already handling this potential loss, since it could equally be caused by a process, a node or a host crashing, by a network failure, or by a slow node that triggers a timeout. As the error propagation semantics are asynchronous, we use the same monitoring and recovery techniques locally in a node as we would across networks. This is all described in chapters 13-15.
Regards,
Francesco
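
The argument in the message above (send asynchronously, skip the acks, and let the caller treat every failure as a timeout) can be illustrated with a small sketch. This is not Erlang/OTP or Cloud Haskell code; it uses Python threads and queues as stand-ins for processes and mailboxes. The names worker, call and outcome, the mode values and the timeout figures are all hypothetical, chosen only for this illustration.

```python
import queue
import threading
import time


def worker(inbox, mode):
    """Hypothetical server process; `mode` simulates one cause of failure."""
    if mode == "crash":
        return                      # the process dies before reading anything
    while True:
        reply_to, payload = inbox.get()
        if mode == "drop":
            continue                # request (or its reply) lost in transit
        if mode == "slow":
            time.sleep(1.0)         # reply arrives after the caller gave up
        reply_to.put(payload.upper())


def call(inbox, payload, timeout=0.2):
    """Send without waiting for an ack, then wait a bounded time for a reply.

    Returns the reply, or None on timeout. The caller cannot tell which
    failure caused the timeout, and its recovery logic does not need to.
    """
    reply_to = queue.Queue()
    inbox.put((reply_to, payload))  # asynchronous send, no acknowledgment
    try:
        return reply_to.get(timeout=timeout)
    except queue.Empty:
        return None                 # one code path for every kind of failure


def outcome(mode):
    """Start a worker in the given mode and make a single call to it."""
    inbox = queue.Queue()
    threading.Thread(target=worker, args=(inbox, mode), daemon=True).start()
    return call(inbox, "ping")


results = {mode: outcome(mode) for mode in ("ok", "crash", "drop", "slow")}
```

In this sketch only the "ok" mode yields a reply; the crashed worker, the lost message and the slow worker all surface to the caller as the same None, so whatever the business logic does next (retry, fail over, report an error) is written once.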