delayed child restart with incremental back-off

Nicolas Martyanoff

May 2, 2021, 3:01:10 PM

to erlang-q...@erlang.org

Hi,

I originally posted this email on erlang-patches, but I just realized
most developers are on erlang-questions instead. I believe this could be
of interest.

Nine years ago, an interesting patch [1] was submitted by Richard Carlsson
that allowed delaying the re-creation of failed children in supervisors.

After a quick discussion, the official answer was that the OTP team
would discuss it [2]. There is no further message on the mailing
list.

Was there an official response?

I have various supervisors whose children handle network connections.
When something goes wrong with the connection, children die and are
immediately restarted. Most of the time, errors are transient (remote
server restarting, temporary network issue, etc.), but retrying without
any delay is pretty much guaranteed to fail again. And of course after a
few retries, the application dies, which is unacceptable.

This kind of behaviour is a huge problem: it fills logs with multiple
copies of identical errors and causes a system failure.
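
For readers unfamiliar with the mechanism described here: a supervisor gives up and terminates once its children exceed the configured restart intensity within the configured period. The following is a minimal sketch of where those limits are set; the module name `conn_sup`, the child module `conn_worker`, and the values 3 and 10 are assumptions chosen for illustration only.

```erlang
-module(conn_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% If more than `intensity` restarts happen within `period` seconds,
    %% the supervisor terminates all its children and then itself.
    SupFlags = #{strategy => one_for_one,
                 intensity => 3,
                 period => 10},
    %% conn_worker is a hypothetical worker module handling one connection.
    Child = #{id => conn_worker,
              start => {conn_worker, start_link, []},
              restart => permanent,
              shutdown => 5000,
              type => worker},
    {ok, {SupFlags, [Child]}}.
```

With these flags, a worker that fails again right after each immediate restart uses up the three allowed restarts almost at once, and the failure then propagates up the supervision tree.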

In general, if I could, I would use restart delays with exponential
backoff everywhere because in practice, restarting immediately is almost
never the right approach: code errors do not disappear when restarting
so they are going to get triggered again immediately, and external errors
are not magically fixed by retrying without any delay.

Is there still interest in this patch?

[1] https://erlang.org/pipermail/erlang-patches/2012-January/002575.html
[2] https://erlang.org/pipermail/erlang-patches/2012-January/002597.html

--
Nicolas Martyanoff
http://snowsyn.net
kha...@gmail.com

Loïc Hoguin

May 2, 2021, 3:27:40 PM

to Nicolas Martyanoff, erlang-q...@erlang.org

I have not looked at the patch, but something like this would be good to
have. Then we could get rid of supervisor2 in RabbitMQ (
https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit_common/src/supervisor2.erl#L15
for the delay part, non-backoff in our case ).

I was going to see if Maria/Jan had interest in providing a patch for
this as well, so I'm glad that there are others showing interest.

Cheers,

Loïc Hoguin
https://ninenines.eu

Tristan Sloughter

May 2, 2021, 8:11:10 PM

to Erlang Questions

I still think supervisors are the wrong place for this and Fred's blog post about it from back then is still the best explanation https://ferd.ca/it-s-about-the-guarantees.html

Michael Truog

May 2, 2021, 9:16:15 PM

to Tristan Sloughter, Erlang Questions

To put as many error checks in the initialization phase as possible, we
should be able to have connections established during process
initialization. That is best to keep the logic simple and reliable
(establishing requirements for the runtime as clear constraints). To
facilitate that use of Erlang processes it is advantageous to have
backoff in the supervisor source code with the understanding that it is
meant to be used for external failures (normally associated with network
connections, like a database not being up, that is determined to be
critical to the operation of the Erlang process).

The backoff would provide an increasing delay to the restart and the
Shutdown timeout value can remain constant (the termination time
wouldn't relate to external failures).

Nicolas Martyanoff

May 3, 2021, 1:13:47 AM

to Tristan Sloughter, Erlang Questions

"Tristan Sloughter" <t...@crashfast.com> writes:

> I still think supervisors are the wrong place for this and Fred's blog
> post about it from back then is still the best explanation
> https://ferd.ca/it-s-about-the-guarantees.html

This article focuses on initialization, while I am talking about the
entire lifetime of the process. A process crashing for any reason is
currently immediately restarted, which in a lot of cases causes tight
restart loops with multiple duplicate crash logs followed by the entire
program going down (which btw goes against everything I expect from a
server program). This proposal would give control over the process
restart mechanism to avoid this issue.

Ultimately the point is moot since the original proposal was for a
feature which is configurable, so developers would be free to ignore it
entirely.

Loïc Hoguin

May 3, 2021, 3:33:15 AM

to Tristan Sloughter, Erlang Questions

I don't disagree with the article.

* Connect (or other) outside of init: yes
* Callers getting a response consistent with the state: yes

But not everything is managing a connection with callers.

I can see restart delays in the supervisor being very useful in those cases:

* The process that is (re)started is just a worker. For example a
process that synchronizes data between two nodes (over the distribution,
or not; with/without handshake)

* The process uses a third party library that does an operation that may
crash and leave this process in a bad state (so it has to restart)

I can also see restart delays to be useful in the case where you just do
a file:open or similar, which can get you hitting a resource limit. Sure
you could do the backoff in your process, but doing a backoff in every
process that may get an emfile is a bit much.

The advantage of having this option in the supervisor is that you don't
have to implement the backoff everywhere, you can just implement it
where it provides value (such as HTTP/database connections).

Cheers,

Ingela Andin

May 3, 2021, 5:04:00 AM

to Nicolas Martyanoff, Erlang-Questions Questions

Hi!

See answer below,

On Sun, May 2, 2021 at 21:01, Nicolas Martyanoff <kha...@gmail.com> wrote:

> Hi,
>
> I originally posted this email on erlang-patches, but I just realized
> most developers are on erlang-questions instead. I believe this could be
> of interest.

Erlang-patches is legacy, we use GitHub instead, and yes erlang-questions is still a place for discussions.

> Nine years ago, an interesting patch [1] was submitted by Richard Carlsson
> that allowed delaying the re-creation of failed children in supervisors.
>
> After a quick discussion, the official answer was that the OTP team
> would discuss it [2]. There is no further message on the mailing
> list.
>
> Was there an official response?

Well, this was some time ago so I am unsure of how it was communicated. But the conclusion was that we did see merit in the idea but that we were not able to include something that would be backwards incompatible by default. To be able to change defaults we need to have a phasing out mechanism and a period of testing what problems it might cause for legacy code. We also did not have an immediate use case of our own for this that could motivate prioritizing it and putting much of our own time into it, and hence it requires a bigger effort from the contributor to motivate and test and think through all scenarios. Alas, we do not have the luxury to pursue all ideas that we think are good ones. One example of something that we had wanted to do for a long time, and actually finally got to do, is gen_statem. The recent contribution to supervisors of significant children is an example of a successful Open Source contribution where we also happened to have an immediate use case.

Regards Ingela - Erlang OTP/Team - Ericsson AB

Nicolas Martyanoff

May 3, 2021, 5:04:39 AM

to Ingela Andin, erlang-q...@erlang.org

Ingela Andin <ing...@andin.se> writes:

> Erlang-patches is legacy, we use GitHub instead, and yes erlang-questions
> is still a place for discussions.

Got it. It would make sense to send an email to people posting on
erlang-questions to inform them (instead of just telling them that the
mailing list is "moderated").

> Well, this was some time ago so I am unsure of how it was
> communicated. But the conclusion was that we did see merit in the idea
> but that we were not able to include something that would be backwards
> incompatible by default. To be able to change defaults we need to have
> a phasing out mechanism and period of testing what problems it might
> cause legacy code. We also did not have an immediate own use case for
> this that could motivate it to be prioritized for us to put much of
> our own time into it, and hence it requires a bigger effort from the
> contributor to motivate and test and think through all scenarios.

Thank you for explaining.

While I understand your point, I fear that this line of reasoning leads
to lots of developers having to skip various OTP components because they
simply cannot be patched. Backward compatibility is important; but
pushed to the extreme, it is tantamount to stagnation and death.

In this case, I am going to have to write a new supervisor module and
apparently I'm not the first one to do so. That is in addition to a new
gen_server so that I can get the right types and the infinite call
timeout by default, among other things.

The more I use Erlang, the more I realize I would love to have a
distribution containing only the language and related standard libraries
without most of OTP, because it simply does not match my needs and it is
almost impossible to change anything.

Maria Scott

May 3, 2021, 6:18:31 AM

to Loïc Hoguin, Nicolas Martyanoff, erlang-q...@erlang.org

Hi

> I have not looked at the patch,

Neither have I =^^=

> but something like this would be good to
> have. Then we could get rid of supervisor2 in RabbitMQ (
> https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit_common/src/supervisor2.erl#L15
> for the delay part, non-backoff in our case ).

I have only read the comment (4) explaining the delay behavior in supervisor2, and I guess it does things a bit different from what the OP seems to ask for. Specifically, it says that when a child exceeds the restart limit, another restart attempt will be delayed instead of the supervisor shutting down. What the OP asks for, if I understand correctly, is delays between restart attempts in general (right?)

> I was going to see if Maria/Jan had interest in providing a patch for
> this as well, so I'm glad that there's others showing interest.

Hm, not sure (yet). Since we're talking supervisor, another EEP will be required. This seems to be a somewhat controversial topic with a long history, and I think there are valid arguments for as well as against delays. As it is too late for OTP/24 now anyway (and I have no immediate use case for it myself), I would let the discussion run on for a while and see where it leads before attempting anything ;)

> > In general, if I could, I would use restart delays with exponential
> > backoff everywhere because in practice, restarting immediately is almost
> > never the right approach: code errors do not disappear when restarting

They won't disappear after a delay, either. Just saying ;)

Kind regards,
Maria

Nicolas Martyanoff

May 3, 2021, 7:04:05 AM

to Maria Scott, erlang-q...@erlang.org

Maria Scott <maria-1...@hnc-agency.org> writes:

> I have only read the comment (4) explaining the delay behavior in supervisor2,
> and I guess it does things a bit different from what the OP seems to ask for.
> Specifically, it says that when a child exceeds the restart limit, another
> restart attempt will be delayed instead of the supervisor shutting down. What
> the OP asks for, if I understand correctly, is delays between restart attempts
> in general (right?)

The two points I believe are important are:

1. Adding an optional delay before restarting a child that died
(exponential backoff would be better, ferd has a nice library
demonstrating how to do it right). Because filling log files really does
not help (and this is not the worst consequence).

2. Adding an option to always restart, without any limit, because
aborting the entire program is a huge problem.
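
To make point 1 concrete, here is a minimal sketch of how an exponentially growing, capped restart delay could be computed. The module name `restart_backoff` and its parameters are hypothetical; this is neither the code from the 2012 patch nor the API of any existing library.

```erlang
-module(restart_backoff).

-export([delay/3]).

%% Delay in milliseconds before restart attempt number Attempt (1-based):
%% Base * 2^(Attempt - 1), capped at Max.
-spec delay(pos_integer(), pos_integer(), pos_integer()) -> pos_integer().
delay(Attempt, Base, Max)
  when is_integer(Attempt), Attempt >= 1,
       is_integer(Base), Base >= 1,
       is_integer(Max), Max >= Base ->
    min(Max, Base bsl (Attempt - 1)).
```

With a base of 100 ms and a cap of 5000 ms, successive attempts would wait 100, 200, 400, 800, 1600, 3200 and then 5000 ms. A production version would probably add random jitter so that many children do not retry in lockstep.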

>> > In general, if I could, I would use restart delays with exponential
>> > backoff everywhere because in practice, restarting immediately is almost
>> > never the right approach: code errors do not disappear when restarting
>
> They won't disappear after a delay, either. Just saying ;)

No they will not. But the last thing I want is my entire server going
down because a specific context appears that results in recurrent errors
in a non-essential component.

Tristan Sloughter

May 3, 2021, 9:53:26 AM

to Nicolas Martyanoff, Erlang Questions

> This article focuses on initialization, while I am talking about the
> entire lifetime of the process. A process crashing for any reason is
> currently immediately restarted, which in a lot of cases causes tight
> restart loops

If it is crashing immediately after restart then it sounds like initialization to me? And I don't think it is moot since it changes the concept of supervisors and where/how they are to be used, even if optional. Now that I think about it... gen_statem is usually where I implement this over and over these days, and it already has so many options :), maybe it could be made to more easily support the implementation of certain failures resulting in a state change with a backoff, or some such thing.

As for "infinite retry", I actually don't see the use of this when crashes propagate up to the program itself which is then restarted, which could be done an infinite number of times.

Loïc Hoguin

May 3, 2021, 10:51:06 AM

to Maria Scott, erlang-q...@erlang.org

On 03/05/2021 12:18, Maria Scott wrote:
>> but something like this would be good to
>> have. Then we could get rid of supervisor2 in RabbitMQ (
>> https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit_common/src/supervisor2.erl#L15
>> for the delay part, non-backoff in our case ).
>
> I have only read the comment (4) explaining the delay behavior in supervisor2, and I guess it does things a bit different from what the OP seems to ask for. Specifically, it says that when a child exceeds the restart limit, another restart attempt will be delayed instead of the supervisor shutting down. What the OP asks for, if I understand correctly, is delays between restart attempts in general (right?)

I don't think the details of the implementation matter too much. A
backoff would work just as well, if not better, than what supervisor2 does.

Loïc Hoguin

May 3, 2021, 10:58:47 AM

to Tristan Sloughter, Nicolas Martyanoff, Erlang Questions

On 03/05/2021 15:52, Tristan Sloughter wrote:
>> This article focuses on initialization, while I am talking about the
>> entire lifetime of the process. A process crashing for any reason is
>> currently immediately restarted, which in a lot of cases causes tight
>> restart loops
>
> If it is crashing immediately after restart then it sounds like initialization to me? And I don't think it is moot since it changes the concept of supervisors and where/how they are to be used, even if optional. Now that I think about it... gen_statem is usually where I implement this over and over these days, and it already has so many options :), maybe it could be made to more easily support the implementation of certain failures resulting in a state change with a backoff, or some such thing.

Yes gen_statem has been very good at making this easier at least. But
the problem is of course to have to reimplement this over and over.
Perhaps something can be built on top of gen_statem or as part of
gen_statem that would make it easier. But it might be difficult to make
it generic enough considering you often want to handle other events
while the backoff is in progress.

It might be easier to just make supervisor do it.

The other option I have been thinking about is a module on top of
supervisor that would force supervisor to delay the restarts (instead of
supervisor starting the process immediately, it tells itself to start it
and the module on top of supervisor can delay the real start, or
something). That would likely be a better first step.

Cheers,

Tristan Sloughter

May 3, 2021, 1:04:15 PM

to Loïc Hoguin, Nicolas Martyanoff, Erlang Questions

> But it might be difficult to make
> it generic enough considering you often want to handle other events
> while the backoff is in progress.

Right, that could be done with the end of backoff being a timeout message. Handle other events in whatever state you want until you receive the backoff complete timeout. But how would you do this if it was handled by the supervisor? Difficulty being generic enough is another reason to not attempt to move this into the supervisor.

I don't know that this would actually fit well in gen_statem, I think an attempt at a generic version that is a module called from the user's statem would be a good way to find out.

Loïc Hoguin

May 3, 2021, 2:32:42 PM

to Tristan Sloughter, Nicolas Martyanoff, Erlang Questions

On 03/05/2021 19:03, Tristan Sloughter wrote:
>> But it might be difficult to make
>> it generic enough considering you often want to handle other events
>> while the backoff is in progress.
>
> Right, that could be done with the end of backoff being a timeout message. Handle other events in whatever state you want until you receive the backoff complete timeout. But how would you do this if it was handled by the supervisor? Difficulty being generic enough is another reason to not attempt to move this into the supervisor.

Yes they're two separate solutions.

Right now my immediate concern on this topic is that there's a duplicate
supervisor module in RabbitMQ and it would be good to get rid of it. The
exact solution to achieve that is not super important as long as there
isn't a duplicate anymore and it's simple enough to maintain.

Then there's the hypothetical best solution which may be via gen_statem
(but that sounds difficult to achieve) or via supervisor (easier, but
less potential use cases covered).

> I don't know that this would actually fit well in gen_statem, I think an attempt at a generic version that is a module called from the user's statem would be a good way to find out.

Yeah. I don't know if much can be done there. In Gun for example there
are three potential "init" states: 'domain_lookup', 'connecting' and
'tls_handshake'. Then a 'not_connected' state that does a state_timeout
and keeps track of how many retries it does. Perhaps this
'not_connected' state could be abstracted somewhat. But it's not much.
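
The 'not_connected' pattern described above can be sketched roughly as follows. This is not Gun's code: the module name `backoff_statem`, the `ConnectFun` argument (a fun returning `{ok, Conn}` or `{error, Reason}`), and the delay values are assumptions made for illustration.

```erlang
-module(backoff_statem).
-behaviour(gen_statem).

-export([start_link/1, status/1]).
-export([callback_mode/0, init/1, not_connected/3, connected/3]).

start_link(ConnectFun) ->
    gen_statem:start_link(?MODULE, ConnectFun, []).

status(Pid) ->
    gen_statem:call(Pid, status).

callback_mode() -> state_functions.

init(ConnectFun) ->
    Data = #{connect => ConnectFun, retries => 0},
    %% A zero state_timeout triggers the first attempt right after init.
    {ok, not_connected, Data, [{state_timeout, 0, retry}]}.

not_connected(state_timeout, retry,
              Data = #{connect := Connect, retries := N}) ->
    case Connect() of
        {ok, Conn} ->
            {next_state, connected, Data#{conn => Conn, retries := 0}};
        {error, _Reason} ->
            %% Hypothetical policy: 100 ms doubling up to 30 s.
            Delay = min(30000, 100 bsl N),
            {keep_state, Data#{retries := N + 1},
             [{state_timeout, Delay, retry}]}
    end;
%% Other events are still handled while the backoff timer is running.
not_connected({call, From}, status, #{retries := N}) ->
    {keep_state_and_data, [{reply, From, {not_connected, N}}]}.

connected({call, From}, status, _Data) ->
    {keep_state_and_data, [{reply, From, connected}]}.
```

The sketch only covers the initial connection; detecting a lost connection and returning to 'not_connected' depends on the transport and is left out.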

zxq9

May 3, 2021, 8:20:44 PM

to erlang-q...@erlang.org

You don't have to implement your own supervisor to get this kind of
behavior, simply move connection out of initialization. As a general
rule initialization should never be dependent on anything outside your
node's control -- especially not something across the network.

It is less complicated to either:
1. Write a service manager: A connection manager process whose job it is
to know what connections have failed and how long ago and implements
*exactly* the kind of backoff you want by having the workers start up
disconnected and have a connect/0 call.
2. Write smarter workers: The connection processes themselves are written
to handle the case where the connection is lost and implement reconnect
backoff themselves.

Which way you choose to do this is up to you. Neither is very complicated.
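
As a rough illustration of option 2, here is a minimal sketch of a worker that starts disconnected and retries with backoff on its own. The module name `backoff_worker`, the `ConnectFun` argument (a fun returning `{ok, Conn}` or `{error, Reason}`), and the delay constants are all hypothetical.

```erlang
-module(backoff_worker).
-behaviour(gen_server).

-export([start_link/1, status/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(BASE_DELAY, 100).    % ms, arbitrary value for the sketch
-define(MAX_DELAY, 30000).   % ms, arbitrary value for the sketch

start_link(ConnectFun) ->
    gen_server:start_link(?MODULE, ConnectFun, []).

status(Pid) ->
    gen_server:call(Pid, status).

init(ConnectFun) ->
    %% Nothing external is touched here, so init/1 cannot fail because
    %% of the network; the first attempt is scheduled as a message.
    self() ! connect,
    {ok, #{connect => ConnectFun, conn => undefined, attempt => 0}}.

handle_call(status, _From, State = #{conn := undefined}) ->
    {reply, disconnected, State};
handle_call(status, _From, State) ->
    {reply, connected, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(connect, State = #{connect := Connect, attempt := N}) ->
    case Connect() of
        {ok, Conn} ->
            {noreply, State#{conn := Conn, attempt := 0}};
        {error, _Reason} ->
            %% Exponential backoff, capped at ?MAX_DELAY.
            Delay = min(?MAX_DELAY, ?BASE_DELAY bsl N),
            erlang:send_after(Delay, self(), connect),
            {noreply, State#{attempt := N + 1}}
    end;
handle_info(_Info, State) ->
    {noreply, State}.
```

Callers always get an answer consistent with the current state, and the supervisor never sees a crash caused by the remote side being down. How a lost connection is detected depends on the transport, so the sketch omits it.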

Losing an external resource is not a *fault* in your program, but rather
an expected case that you know about and are discussing right now.

Putting this into supervisors is overloading and specializing
supervisors to handle a state management task that belongs either in a
sub-service manager process or inside the state of the workers
themselves. I tend to opt for the "write a service manager" approach
when we have a simple_one_for_one type supervisor structure (typical of
the case where we have multiple incoming connections, the
service->worker pattern), and the "write smarter workers" approach when
we have a predetermined number of connections of various types (often
meaning named workers connecting to specific external resources like one
connection each to a DB, an upstream feed, and a presence service, all
of which have totally different code internally).

"Write smarter children" sometimes becomes "write a backoff connection
behavior" so that the details of backoff can be implemented just once,
but if it is three or fewer modules... meh.

-Craig

Nicolas Martyanoff

May 4, 2021, 2:15:59 AM

to zxq9, erlang-q...@erlang.org

zxq9 <zx...@zxq9.com> writes:

> You don't have to implement your own supervisor to get this kind of behavior,
> simply move connection out of initialization. As a general rule initialization
> should never be dependent on anything outside your node's control --
> especially not something across the network.

I do not know why there is such a focus on initialization. Errors can
occur during the entire lifecycle of a process; it is common to end up
in a situation where a worker will fail *after* initialization, and this
failure will repeat due to external consequences or to a coding mistake.
In that situation, initialization tricks will not help you: the process
will crash N times in a row, filling the logs with duplicate error
messages, then the entire program will die. This is not acceptable for a
server.

Writing smart children works fine of course, I have written this kind of
logic for panic recovery in Go dozens of times. Until you realize you
are re-writing the exact same logic everywhere. If only we had an
abstraction designed to handle this kind of restart; some kind of
process which would, you know, supervise others and restart them in a
configurable way.

Michael Truog

May 4, 2021, 3:12:19 AM

to Nicolas Martyanoff, zxq9, erlang-q...@erlang.org

On 5/3/21 11:15 PM, Nicolas Martyanoff wrote:
> zxq9 <zx...@zxq9.com> writes:
>
>> You don't have to implement your own supervisor to get this kind of behavior,
>> simply move connection out of initialization. As a general rule initialization
>> should never be dependent on anything outside your node's control --
>> especially not something across the network.
> I do not know why there is such a focus on initialization. Errors can
> occur during the entire lifecycle of a process; it is common to end up
> in a situation where a worker will fail *after* initialization, and this
> failure will repeat due to external consequences or to a coding mistake.
> In that situation, initialization tricks will not help you: the process
> will crash N times in a row, filling the logs with duplicate error
> messages, then the entire program will die. This is not acceptable for a
> server.
>

The reason is due to initialization being a short period of time that
can have a timeout value to limit the execution (and being the
precondition for all later execution). It is better to have something
fail during initialization when compared to 5 days later. If a failure
after x days is difficult to replicate, you still don't want to wait
that length of time to test. That is why it is best to validate
everything during initialization to ensure the undefined runtime length
after initialization is valid. Otherwise you are just wasting
development time when bugs occur.

Best Regards,
Michael

Max Lapshin

May 5, 2021, 2:18:31 PM

to Michael Truog, Erlang-Questions Questions

> As a general rule initialization should never be dependent

I do not understand why we should treat this as sacred scripture.

Supervisor is just 1500 lines of code, which is not too much. We all
suffer from the same issue: we need to connect to a database and, if it
dies, connect to another one, and do it in a predictable way without
writing the same code over and over from project to project.

Perhaps it is possible to find some rather generic form of code that
would allow adding proper and convenient logic to supervisor restarts,
making programming easier, more predictable and more unified.

Richard Carlsson

May 10, 2021, 9:05:08 AM

to Nicolas Martyanoff, Erlang Questions

What happened at the time was that I met up with the OTP team and discussed it, and they eventually agreed that this was a good thing. However, it needed more work to be accepted (and I realized a couple of weaknesses in the implementation that I needed to address), but I never found time to do more work on it.

/Richard

Sean Hinde

May 10, 2021, 9:27:33 AM

to Richard Carlsson, Erlang Questions

This would still be incredibly useful.

Not having it throws a whole bunch of complexity every time a database or other type of potentially failing connection must be opened. Even more so if this process must start before other processes should start.

The current self() ! finish_startup recommendation drops all the startup ordering and failure recovery synchronisation on the programmer, and that is hard to get right.

gen_server:init is such a natural place to put this synchronous opening logic. The supervisor can then take care of the synchronous start of the connection and managing dependent processes.

/Sean

Fred Hebert

May 10, 2021, 1:57:10 PM

to Sean Hinde, Erlang Questions

On Mon, May 10, 2021 at 9:27 AM Sean Hinde <sean....@mac.com> wrote:

> The current self() ! finish_startup recommendation drops all the startup ordering and failure recovery synchronisation on the programmer, and that is hard to get right.

These aren't current anymore. There is now a 'continue' return option with a 'handle_continue' handler just for that: http://erlang.org/doc/man/gen_server.html#Module:handle_continue-2 -- do note that gen_statem doesn't require it because the whole FSM mechanism has even more powerful general mechanisms for that kind of thing.
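
A minimal sketch of the 'continue' mechanism, assuming OTP 21 or later. The module name `continue_worker` and the `ConnectFun` argument (a fun returning `{ok, Conn}` or `{error, Reason}`) are hypothetical.

```erlang
-module(continue_worker).
-behaviour(gen_server).

-export([start_link/1, get_conn/1]).
-export([init/1, handle_continue/2, handle_call/3, handle_cast/2]).

start_link(ConnectFun) ->
    gen_server:start_link(?MODULE, ConnectFun, []).

get_conn(Pid) ->
    gen_server:call(Pid, get_conn).

init(ConnectFun) ->
    %% Return at once so the supervisor is not blocked; the connection
    %% is opened in handle_continue/2 before any message is handled.
    State = #{connect => ConnectFun, conn => undefined},
    {ok, State, {continue, connect}}.

handle_continue(connect, State = #{connect := Connect}) ->
    %% In this sketch a failed connection attempt crashes the process.
    {ok, Conn} = Connect(),
    {noreply, State#{conn := Conn}}.

handle_call(get_conn, _From, State = #{conn := Conn}) ->
    {reply, Conn, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Unlike the `self() ! finish_startup` trick, the continue callback is guaranteed to run before the first message is taken from the mailbox, so no caller can observe the half-initialized state.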
