How Erlang deals with failure

Simon Peyton Jones

Feb 24, 2016, 5:19:56 PM

to Facundo Domínguez, parallel-haskell, cloud-haskel...@googlegroups.com, distribut...@googlegroups.com, Francesco Cesarini, Simon Peyton Jones

Cloud Haskell friends,

Our colleague Francesco Cesarini (founder and CTO of Erlang Solutions) has been writing a book. He writes (my emphasis):

Also, on an unrelated subject. I just recently finished my book for O'Reilly on OTP and how to architect resilient and scalable systems. The conversation we had at FP Days around Cloud Haskell with Duncan Coutts came to mind. What I was trying to explain is that if you have a network in-between two nodes, you can lose messages, or the acknowledgments that messages have been received. And the reason I was not too bothered about this as a developer is that this can happen even when you lose a machine, a node or a process (or the receiving node or process is slow, triggering a timeout as a result).

In Erlang, you end up handling all of these different errors in the same way, so it does not matter what caused the issue. What I have tried doing in this book is once and for all describe the programming model we use when architecting for scalability and reliability. Our discussion and the rationale is described in chapters 13-15 (and possibly some of 16):

https://www.dropbox.com/s/ibm4926rf73qrvc/DesigningForScalability160218.pdf?dl=0

Those were the hardest chapters in the book to write!

I thought that you would be interested – after all, Erlang is such an inspiration and there is such a wealth of experience in the Erlang community. Francesco is certainly interested in feedback, so I’ve cc’d him.

Thanks Francesco!

Simon

Tim Watson

Feb 26, 2016, 5:45:41 AM

to Simon Peyton Jones, Facundo Domínguez, parallel-haskell, cloud-haskel...@googlegroups.com, distribut...@googlegroups.com, Francesco Cesarini

This is very exciting stuff, thank you for sharing, Simon! I would strongly recommend this as reading for anyone who is considering using Cloud Haskell in real life. A lot of the Erlang capabilities Francesco talks about in that book already exist (in an early development phase) for Cloud Haskell: supervision trees, generic servers (distributed-process-client-server), and the like are available in the latest release. These are the underpinnings of Erlang's scalability model and, along with release management, they provide the reusable infrastructure components that Erlang developers rely on day to day.

Cheers,

Tim

Tim Watson

Feb 26, 2016, 6:23:13 AM

to Simon Peyton Jones, Facundo Domínguez, parallel-haskell, cloud-haskel...@googlegroups.com, distribut...@googlegroups.com, Francesco Cesarini

Oh, and I almost forgot to mention: the distributed-process-execution and distributed-process-task libraries are there to address issues like backpressure and system limits management, though there's not much in them at the moment. Execution has a gen_event lookalike, message exchange primitives à la messaging systems like RabbitMQ, and a copy of the Erlang `pobox` library (https://github.com/haskell-distributed/distributed-process-execution/blob/master/src/Control/Distributed/Process/Execution/Mailbox.hs). Task has a simple async blocking queue and the beginnings of a resource pool implementation, modelled on various Erlang pool implementations.

If someone wants to help reproduce Ulf's amazing job scheduler as part of distributed-process-task, I would be delighted to see that become available for Cloud Haskell! See https://github.com/uwiger/jobs for the Erlang implementation...

Cheers,

Tim

Tim Watson

Feb 27, 2016, 4:23:09 AM

to Francesco Cesarini, Simon Peyton Jones, Facundo Domínguez, parallel-haskell, cloud-haskel...@googlegroups.com, distribut...@googlegroups.com

Hi Francesco!

I'm cc'ing the groups here... I'm rather confused, because I was under the impression that Cloud Haskell copied Erlang's semantics closely and didn't use acks. Maybe this is happening in the network transport layers and I'm not aware of it!?

Certainly intra-node messaging requires no acks at all!!

Thanks very much for the feedback anyway - it's most welcome and helpful! :)

Cheers,

Tim

On Friday, 26 February 2016, Francesco Cesarini <fran...@erlang-solutions.com> wrote:

Tim,

please forward this email to the relevant groups, as I am not allowed to post to them. Thx!

When speaking to Simon and Duncan, I recall reacting that they wanted guarantees that the message reached the remote server. You cannot provide these guarantees. I also later heard that Cloud Haskell acknowledges messages sent across nodes. Once again, this is not secure, as the ack itself can be lost.

What I was trying to explain back then was that the only way to scale your system is through asynchronous message passing. And sending acknowledgments to messages is superfluous, as message loss should be handled in the business logic of your system. This does not result in extra complexity, as you are already handling this potential loss, as it could be caused by a process, a node or a host crashing. Or a network failure. Or a slow node which triggers a timeout. As the error propagation semantics is asynchronous, we use the same monitoring and recovery techniques locally in a node as we would across networks. This is all described in chapters 13-15.

Regards,
Francesco
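
As a rough illustration of the model Francesco describes (this sketch is not taken from the thread or the book), the caller below sends a request asynchronously and waits a bounded time for the reply. A lost message, a crashed server and a slow server all look the same to the caller, namely "no reply in time", and the recovery logic sits in the caller rather than in transport-level acknowledgments. All names (`call`, `call_with_retry`, `healthy`, `no_reply`, `slow`) and the timeout values are hypothetical, and Python threads stand in for Erlang or Cloud Haskell processes.

```python
import queue
import threading
import time


def call(server, request, timeout_s):
    """Send a request asynchronously and wait a bounded time for the reply.

    Returns the reply, or None if nothing arrived in time. The caller cannot
    tell whether the request was lost, the server crashed, or the server was
    merely slow, and it does not need to.
    """
    reply_box = queue.Queue(maxsize=1)
    threading.Thread(target=server, args=(request, reply_box), daemon=True).start()
    try:
        return reply_box.get(timeout=timeout_s)
    except queue.Empty:
        return None


# Hypothetical servers, each simulating one behaviour of the remote side.
def healthy(request, reply_box):
    reply_box.put(request * 2)


def no_reply(request, reply_box):
    # Stands in for both a lost message and a crashed process:
    # in either case no reply is ever sent.
    return


def slow(request, reply_box):
    time.sleep(1.0)  # replies, but long after the caller has given up
    reply_box.put(request * 2)


def call_with_retry(servers, request, timeout_s=0.2):
    """Recovery in the business logic: try each replica until one answers."""
    for server in servers:
        reply = call(server, request, timeout_s)
        if reply is not None:
            return reply
    return None


if __name__ == "__main__":
    print(call_with_retry([no_reply, slow, healthy], 21))
```

Note that `call_with_retry` has a single failure path for all three cases, which is the sense in which "it does not matter what caused the issue".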
