How Erlang deals with failure

Simon Peyton Jones

Feb 24, 2016, 5:19:56 PM2/24/16
to Facundo Domínguez, parallel-haskell, cloud-haskel...@googlegroups.com, distribut...@googlegroups.com, Francesco Cesarini, Simon Peyton Jones

Cloud Haskell friends,

 

Our colleague Francesco Cesarini (founder and CTO of Erlang Solutions) has been writing a book.  He writes (my emphasis):

 

Also, on an unrelated subject: I just recently finished my book for O'Reilly on OTP and how to architect resilient and scalable systems. The conversation we had at FP Days around Cloud Haskell with Duncan Coutts came to mind. What I was trying to explain is that if you have a network in between two nodes, you can lose messages, or acknowledgments that messages have been received. The reason I was not too bothered about this as a developer is that the same thing can happen when you lose a machine, a node or a process (or when the receiving node or process is slow, triggering a timeout as a result).

 

In Erlang, you end up handling all of these different errors in the same way, so it does not matter what caused the issue. What I have tried to do in this book is describe, once and for all, the programming model we use when architecting for scalability and reliability. Our discussion and the rationale are described in chapters 13-15 (and possibly some of 16):

https://www.dropbox.com/s/ibm4926rf73qrvc/DesigningForScalability160218.pdf?dl=0

 

Those were the hardest chapters in the book to write!
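
To make that concrete for readers on this list: in Cloud Haskell terms, the uniform handling Francesco describes looks roughly like the sketch below. This is my illustration, not code from the book; the worker is spawned locally for brevity, but a monitor on a remote process delivers the same kind of notification.

    import Control.Distributed.Process

    -- A crash, a lost node and a broken network all surface as the same
    -- ProcessMonitorNotification, so the supervising process recovers in
    -- the same way regardless of the cause.
    superviseWorker :: Process () -> Process ()
    superviseWorker worker = do
      pid  <- spawnLocal worker
      _    <- monitor pid
      note <- expect :: Process ProcessMonitorNotification
      case note of
        ProcessMonitorNotification _ _ DiedNormal -> return ()
        ProcessMonitorNotification _ _ _reason    ->
          superviseWorker worker  -- one recovery path for every failure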

 

I thought that you would be interested – after all, Erlang is such an inspiration and there is such a wealth of experience in the Erlang community.  Francesco is certainly interested in feedback, so I’ve cc’d him.

 

Thanks Francesco!

 

Simon

Tim Watson

Feb 26, 2016, 5:45:41 AM2/26/16
to Simon Peyton Jones, Facundo Domínguez, parallel-haskell, cloud-haskel...@googlegroups.com, distribut...@googlegroups.com, Francesco Cesarini
This is very exciting stuff, thank you for sharing, Simon! I would strongly recommend this as reading for anyone who is considering using Cloud Haskell in real life. A lot of the Erlang capabilities Francesco talks about in that book already exist (in an early development phase) for Cloud Haskell: supervision trees, generic servers (distributed-process-client-server), and the like are available in the latest release. These are the underpinnings of Erlang's scalability model and, along with release management, provide the reusable infrastructure components that Erlang developers rely on day to day.
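
For a flavour of the generic servers, a minimal counter written with distributed-process-client-server looks something like the sketch below (written from memory, so treat the exact module and field names as approximate):

    import Control.Distributed.Process
    import Control.Distributed.Process.Extras.Time (Delay (Infinity))
    import Control.Distributed.Process.ManagedProcess

    -- A counter server: a call with () reads the count and a cast with
    -- an Int adds to it. The state threading, call/cast plumbing and
    -- shutdown handling come from the library, as with Erlang's
    -- gen_server.
    counter :: Process ()
    counter = serve (0 :: Int) (\n -> return (InitOk n Infinity)) def
      where
        def = defaultProcess
          { apiHandlers =
              [ handleCall (\st () -> reply st st)
              , handleCast (\st n  -> continue (st + n :: Int))
              ]
          }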

Cheers,
Tim


Tim Watson

Feb 26, 2016, 6:23:13 AM2/26/16
to Simon Peyton Jones, Facundo Domínguez, parallel-haskell, cloud-haskel...@googlegroups.com, distribut...@googlegroups.com, Francesco Cesarini
Oh, and I almost forgot to mention: the distributed-process-execution and distributed-process-task libraries are there to address issues like backpressure and system-limits management, though there's not much in them at the moment. Execution has a gen_event lookalike and message-exchange primitives à la messaging systems like RabbitMQ, plus a copy of the Erlang `pobox` library - https://github.com/haskell-distributed/distributed-process-execution/blob/master/src/Control/Distributed/Process/Execution/Mailbox.hs - and task has a simple async blocking queue and the beginnings of a resource pool implementation, modelled on various Erlang pool implementations.
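
For those who haven't come across the term, the simplest form of backpressure is a bounded buffer that makes writers wait when it is full. Here is a toy, library-agnostic version in plain Haskell, just to show the shape of the problem these packages address (the real Mailbox and pobox designs add configurable limits and overflow policies):

    import Control.Concurrent.STM

    -- A bounded mailbox: writers block when the buffer is full, which is
    -- the blocking flavour of backpressure. Real implementations also
    -- offer dropping and notification policies.
    newtype Buffer a = Buffer (TBQueue a)

    newBuffer :: Int -> IO (Buffer a)
    newBuffer limit = Buffer <$> newTBQueueIO (fromIntegral limit)

    post :: Buffer a -> a -> IO ()
    post (Buffer q) x = atomically (writeTBQueue q x)  -- waits when full

    deliver :: Buffer a -> IO a
    deliver (Buffer q) = atomically (readTBQueue q)    -- waits when empty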

If someone wants to help reproduce Ulf's amazing job scheduler as part of distributed-process-task, I would be delighted to see that become available for Cloud Haskell! See https://github.com/uwiger/jobs for the Erlang implementation...

Cheers,
Tim 

Tim Watson

Feb 27, 2016, 4:23:09 AM2/27/16
to Francesco Cesarini, Simon Peyton Jones, Facundo Domínguez, parallel-haskell, cloud-haskel...@googlegroups.com, distribut...@googlegroups.com
Hi Francesco! 

I'm cc'ing the groups here... I'm rather confused, because I was under the impression CH copied Erlang's semantics closely and didn't use acks - maybe this is happening in the network transport layer and I'm not aware!?

Certainly intra-node messaging requires no acks at all!!

Thanks very much for the feedback anyway - it's most welcome and helpful! :)

Cheers,
Tim

On Friday, 26 February 2016, Francesco Cesarini <fran...@erlang-solutions.com> wrote:
Tim,

please forward this email to the relevant groups, as I am not allowed to post to them. Thx!

When speaking to Simon and Duncan, I recall reacting to the fact that they wanted guarantees that a message had reached the remote server. You cannot provide these guarantees. I also later heard that Cloud Haskell acknowledges messages sent across nodes. Once again, this is not safe, as the ack itself can be lost.

What I was trying to explain back then was that the only way to scale your system is through asynchronous message passing. Acknowledging messages is superfluous, as message loss should be handled in the business logic of your system. This does not add complexity, because you are already handling potential loss: it could be caused by a process, a node or a host crashing, by a network failure, or by a slow node triggering a timeout. Because the error propagation semantics are asynchronous, we use the same monitoring and recovery techniques locally within a node as we do across networks. This is all described in chapters 13-15.
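
In Cloud Haskell terms, handling the loss in your business logic can be as simple as a bounded retry around an asynchronous send. The sketch below is illustrative (the names are made up, and retries mean the receiver must tolerate duplicate requests):

    import Control.Distributed.Process
    import Control.Distributed.Process.Serializable (Serializable)

    -- Send a request, wait a bounded time for a reply, retry a few
    -- times. A crashed process, a dead node, a slow peer and a lost
    -- message all look the same from here: no reply arrived in time.
    callWithRetry :: (Serializable a, Serializable b)
                  => ProcessId -> a -> Int -> Process (Maybe b)
    callWithRetry _      _   0        = return Nothing
    callWithRetry server req attempts = do
      self <- getSelfPid
      send server (self, req)
      mreply <- expectTimeout 5000000  -- five seconds, in microseconds
      case mreply of
        Just r  -> return (Just r)
        Nothing -> callWithRetry server req (attempts - 1)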

Regards,
Francesco

Alexander V Vershilov

Feb 28, 2016, 5:54:44 PM2/28/16
to Tim Watson, cloud-haskel...@googlegroups.com, Duncan Coutts, parallel-haskell, fran...@erlang-solutions.com

Francesco, sorry, for some reason you were not added to CC, so I am resending this email.

On 27 Feb 2016 1:18 p.m., "Alexander V Vershilov" <alexander...@gmail.com> wrote:
>
> Hi, friends.
>
> It seems there is a bit of confusion here. If I understand correctly, the discussion about message guarantees was about the following problem (CC'ing Duncan so he can correct me).
> The concern is that *some* messages may be lost, which is not the same as a node or host going down. This can happen when the network experiences a temporary loss of connectivity: in that case Erlang buffers outgoing messages, and if the buffer overflows some of them are dropped (if I understand correctly). Once connectivity returns, delivery resumes, but some messages may have been lost and the program has no way to detect or control that. This problem was discussed long ago. C-H doesn't introduce acknowledgements to avoid it, and it's true that they would not help anyway, but it does introduce an additional rule: once the connection between two nodes is dead, no further messages between processes that used the failed connection are delivered until an explicit reconnect is called (see the sketch below). The explicit reconnect is the programmer's way of saying that the program is resilient to this kind of failure. We are not the only ones here; in a slightly different field ("Using attestation to lift crash resilience to Byzantine resilience", by J. Herzog et al.), the authors "hack" the program (automatically rewriting it) to achieve similar behaviour.
> This problem could instead be solved with special protocols that are resilient to temporary or persistent network failures. (Personal opinion, which may differ from the official one.) For example, at Tweag I/O we had such a solution, and the semantic difference described above interacted badly with our approach, since calling reconnect explicitly adds noise and complexity to the code. As a result, starting from 0.5, C-H has a family of 'unreliable' send operations with semantics similar to Erlang's, but they are not used by default.
> Sorry, I can only comment on this particular topic; I need to read the chapters more carefully before giving further feedback.
>
> --
> Alexander
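
In code, the reconnect rule mentioned above looks roughly like this (a sketch; the peer and the ping message are illustrative):

    import Control.Distributed.Process

    -- After a connection failure, further sends to the peer are silently
    -- dropped until the programmer opts back in with `reconnect`,
    -- accepting that messages in between may have been lost.
    watchPeer :: ProcessId -> Process ()
    watchPeer peer = do
      _    <- monitor peer
      note <- expect :: Process ProcessMonitorNotification
      case note of
        ProcessMonitorNotification _ _ DiedDisconnect -> do
          reconnect peer  -- acknowledge the failure explicitly
          send peer ()    -- sends to peer may now be delivered again
        _ -> return ()    -- the process itself died; recover elsewhere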

