Process supervision/monitoring


Jason Harrelson

Jun 26, 2015, 7:28:49 PM
to elixir-l...@googlegroups.com
I understand that a process supervisor can restart processes if they crash, etc.  However, I am interested in the opposite.  Is there an OTP method/pattern for monitoring a process's uptime so that, if it does not finish and exit on its own within a given timeout, the supervisor/monitor will kill it?

Jason M Barnes

Jun 27, 2015, 9:20:17 AM
to elixir-l...@googlegroups.com
Take a look at Tasks (http://elixir-lang.org/docs/stable/elixir/Task.html).  You might be able to use them to do what you’re looking for.  You can use Task.await to set a timeout for the task process.  You can then use Process.exit(task_pid, :kill) to stop the process if the timeout occurs.
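Sketching the idea Jason describes (this is an illustration, not code from the thread): `Task.await/2` makes the *caller* exit with a `:timeout` reason when the task doesn't reply in time, so you can catch that exit and kill the task yourself:

```elixir
# A task that (for demonstration) takes far longer than we're willing to wait
task = Task.async(fn -> Process.sleep(60_000) end)

result =
  try do
    Task.await(task, 100)            # wait at most 100 ms for the result
  catch
    :exit, {:timeout, _} ->
      # The task overran its budget: kill it and move on
      Process.exit(task.pid, :kill)
      :timed_out
  end
```

Later Elixir versions also provide `Task.yield/2` and `Task.shutdown/2`, which express the same "wait, then kill" pattern without the try/catch.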

There might be a better pattern that the community knows, but that is the first one that comes to mind.

Jason

On Fri, Jun 26, 2015 at 7:28 PM, Jason Harrelson <cjhar...@gmail.com> wrote:
I understand that a process supervisor can restart processes if they crash, etc.  However, I am interested in the opposite.  Is there an OTP method/pattern for monitoring a process's uptime so that, if it does not finish and exit on its own within a given timeout, the supervisor/monitor will kill it?

--
You received this message because you are subscribed to the Google Groups "elixir-lang-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-ta...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-talk/11ae650c-055b-4504-aa9b-3d802b53216d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Chris McGrath

Jun 27, 2015, 10:59:10 AM
to elixir-l...@googlegroups.com, Christopher McGrath
Not quite what you were asking for, but GenServer.call takes a timeout as the last parameter. If the handle_call doesn't return within the timeout, the calling process exits with a timeout error. I was playing around with this in the context of your question and made a little scratch app to show one way it might be done.
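In isolation, the call timeout works like this (a minimal sketch; note that the expired timeout makes the *caller* exit, while the server itself keeps running unless it is linked to the caller):

```elixir
defmodule SlowServer do
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, nil)

  def init(state), do: {:ok, state}

  # Deliberately slower than any caller is willing to wait
  def handle_call(:work, _from, state) do
    Process.sleep(60_000)
    {:reply, :done, state}
  end
end

{:ok, pid} = SlowServer.start_link()

# The third argument is the caller's timeout in milliseconds; when it
# expires, the *caller* exits with a :timeout reason, which we catch here.
result =
  try do
    GenServer.call(pid, :work, 100)
  catch
    :exit, {:timeout, _} -> :timed_out
  end
```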


It requires Erlang 18.0 due to how it seeds the RNG. It uses a simple_one_for_one supervisor to start job workers and allows you to pass the timeout through. There's a perform_job_async API method as well that uses Task.async to simulate using cast rather than call, as cast doesn't have a timeout.

I threw this together pretty quickly as a sketch and I hope it gives you some ideas or makes someone else chime in with a simpler method!

iex -S mix and TimeMeOut.perform_job("test") or TimeMeOut.perform_job_async("test") will let you see what happens.

Cheers,

Chris

On 27 Jun 2015, at 00:28, Jason Harrelson <cjhar...@gmail.com> wrote:

I understand that a process supervisor can restart processes if they crash, etc.  However, I am interested in the opposite.  Is there an OTP method/pattern for monitoring a process's uptime so that, if it does not finish and exit on its own within a given timeout, the supervisor/monitor will kill it?

Jason Harrelson

Jun 27, 2015, 11:10:37 AM
to elixir-l...@googlegroups.com
Gentlemen, thanks for the replies.  Both of your suggestions are very good and have shed some light on my understanding.  However, I will go a little further into the details of an implementation I am working on so that you have a little better context.

I have a Phoenix web request that will be spawning a process to contact an asynchronous API using 0mq.  This process (or group of several processes if necessary) will listen for a web timeout and messages from a 0mq pub/sub socket.  When the web timeout is reached, this process will message to the web request process with the results it has already gathered and the web process will return a response and cease to exist.  However, the async subscribing process will stick around and continue to listen for more messages until a second timeout is reached.  If more messages arrive, it will write this data to the system's DB.  

The main concern I have is: what if the 2nd timeout fails to occur for some reason and this process sticks around forever?  I am trying to find an OTP'ish way to monitor this process not for crashes, but for staying alive too long, as you can imagine the system ramifications if this happened repeatedly.

Of additional concern is that the obvious process to monitor this process is the web request.  However, the fact that the web request should finish well before this process by design kills it as a candidate.  This is why I am looking at supervision for the answer, but I have yet to find a supervision pattern that is concerned with a process's length of life and not simply with restarting crashed processes.

Please bear in mind that I have been an Elixir user for a little over 2 weeks now and may have an incorrect understanding of how processes work.  Thus my fears may be illogical.  Thanks for using your time to help me!

Chris McGrath

Jun 28, 2015, 9:15:32 AM
to elixir-l...@googlegroups.com, Christopher McGrath
So I’ve been thinking about how you would implement something similar in general terms, where you’d reply to the client within a certain time budget but still perform some work after that. Also perhaps sending “late” data to the browser via websockets or something like that.

I sketched out one possible solution in my scratch repo: https://github.com/chrismcg/time_me_out

I’ve added a handle_request API method that will return to the client (e.g. the Phoenix process) within ~200ms and wait until ~1000ms have passed before tearing down the workers.

The code starts a simple_one_for_one supervisor like before, and this starts a WebRequest GenServer to handle each request. The handle_call function first queues up some timeout messages, then starts a fake ZeromqWorker process and stores its pid in the state. The key thing here is returning {:noreply, state} from the handle_call. This allows the WebRequest GenServer to go back into its message receive loop but keeps the client still waiting for a reply. The ZeromqWorker GenServer process is linked to the WebRequest worker, so both will die if one of them dies, thanks to OTP. The WebRequest then uses some handle_info callbacks to deal with data from the ZeromqWorker and the timeout messages it queued up for itself. When the first timeout message is received it uses GenServer.reply to send what data it has back to the caller, and when the second is received it tells OTP to terminate it and clean itself up.

You can see it working by running something like TimeMeOut.handle_request(%{foo: :bar}) in iex -S mix
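In outline, the deferred-reply trick Chris describes looks something like this (a sketch reconstructed from the description above, not the actual repo code; the module name, message names, and timings are illustrative):

```elixir
defmodule WebRequest do
  use GenServer

  def start_link(args), do: GenServer.start_link(__MODULE__, args)

  def init(_args) do
    {:ok, %{caller: nil, results: []}}
  end

  def handle_call(:fetch, from, state) do
    # Queue the two timeout messages up front...
    Process.send_after(self(), :reply_timeout, 200)
    Process.send_after(self(), :teardown_timeout, 1_000)
    # ...and return :noreply so the caller keeps waiting while this
    # GenServer goes back to its receive loop.
    {:noreply, %{state | caller: from}}
  end

  # Data trickling in from the (hypothetical) ZeromqWorker
  def handle_info({:data, chunk}, state) do
    {:noreply, %{state | results: [chunk | state.results]}}
  end

  # First deadline: reply to the caller with whatever we have so far
  def handle_info(:reply_timeout, state) do
    GenServer.reply(state.caller, Enum.reverse(state.results))
    {:noreply, state}
  end

  # Second deadline: shut the whole thing down
  def handle_info(:teardown_timeout, state) do
    {:stop, :normal, state}
  end
end
```

A call such as `GenServer.call(pid, :fetch, 5_000)` would return the collected data after roughly 200 ms, and the server would stop itself once the 1000 ms deadline fires.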

It was fun to play with this; I couldn’t tell you if it’s the best way of doing things, but I hope it will give you some ideas.

I did have to set Process.flag(:trap_exit, true) on the ZeromqWorker to get it to terminate properly and I’m not sure why yet. If anyone could point me to why I’d really appreciate it!

HTH,

Chris


Booker Bense

Jun 28, 2015, 12:31:04 PM
to elixir-l...@googlegroups.com


On Saturday, June 27, 2015 at 8:10:37 AM UTC-7, Jason Harrelson wrote:
Gentlemen, thanks for the replies.  Both of your suggestions are very good and have shed some light on my understanding.  However, I will go a little further into the details of an implementation I am working on  so that you have a little better context.

I have a Phoenix web request that will be spawning a process to contact an asynchronous API using 0mq.  This process (or group of several processes if necessary) will listen for a web timeout and messages from a 0mq pub/sub socket.  When the web timeout is reached, this process will message to the web request process with the results it has already gathered and the web process will return a response and cease to exist.  However, the async subscribing process will stick around and continue to listen for more messages until a second timeout is reached.  If more messages arrive, it will write this data to the system's DB.  

The main concern I have is: what if the 2nd timeout fails to occur for some reason and this process sticks around forever?  I am trying to find an OTP'ish way to monitor this process not for crashes, but for staying alive too long, as you can imagine the system ramifications if this happened repeatedly.


FWIW, I find it very helpful not to call them processes. Call them Eprocs or something else. You are applying intuitions from expensive, heavyweight unix processes that simply don't apply to BEAM processes. If the process is simply waiting on a message, it's going to take almost no resources until that message shows up. You don't want to leak processes if you don't have to, but you can leak processes for much longer than in the traditional unix fork/exec model.

I found this blog post very helpful in starting to think about the "let it crash" methodology for BEAM processes. 


To me it sounds like you are attempting to engineer around flaws in the BEAM VM. Why do you think the process won't get its 2nd timeout?

The BEAM VM does a lot to isolate the runtime from the vagaries of unix processes. There are special drivers for all the I/O functions that engineer around the kind of I/O hangs that you'd have to worry about with normal unix processes.

I'm not saying things can't go wrong, just that they will go wrong in completely different ways than you are used to in dealing with either unix processes or threads. BEAM processes are a very different beast, and until you've got actual running code and experience in how that code fails, it's very hard to know in advance what the failure modes will be. The more you can educate yourself on the Erlang scheduler the better. This blog post covers the basic philosophy.


- Booker C. Bense 

Saša Jurić

Jun 28, 2015, 2:21:29 PM
to elixir-l...@googlegroups.com


On Saturday, June 27, 2015 at 5:10:37 PM UTC+2, Jason Harrelson wrote:
The main concern I have is: what if the 2nd timeout fails to occur for some reason and this process sticks around forever?  I am trying to find an OTP'ish way to monitor this process not for crashes, but for staying alive too long, as you can imagine the system ramifications if this happened repeatedly.

Two options not mentioned here (at least I think so; I didn't look through all the links provided).

1) If your process is a GenServer, you can use the timeout feature to be informed when the process has been idle longer than some time. Note: this is not the same timeout Chris McGrath mentioned. This timeout operates inside the server process and allows the process to do something when it has been idle for "too long".

Essentially, in every handle_* callback you can include one more value in the return tuple. For example, you can return {:reply, response, new_state, timeout_ms} from handle_call. If this value is an integer, it represents the timeout in milliseconds. If the process gets no message in the given time, handle_info will be called with :timeout as the message value. You can handle this message and, for example, stop the server process with the :normal exit reason. Keep in mind that the timeout is cleared on every callback (handle_*), so you probably want to return the timeout from every callback function.
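Option 1 as a minimal sketch (the module name and the 5-second limit are illustrative):

```elixir
defmodule SelfStoppingServer do
  use GenServer

  @idle_timeout 5_000  # stop after 5 seconds without any message

  def start_link(args), do: GenServer.start_link(__MODULE__, args)

  # Returning the timeout from init/1 arms it right away
  def init(state), do: {:ok, state, @idle_timeout}

  # Every callback must return the timeout again, or it is cleared
  def handle_call(:ping, _from, state) do
    {:reply, :pong, state, @idle_timeout}
  end

  # No message arrived within @idle_timeout: shut down normally
  def handle_info(:timeout, state) do
    {:stop, :normal, state}
  end
end
```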

2) Another option is to send yourself a message in the future (after some time). You can use :timer.send_after or :erlang.send_after. You could queue such a message, for example, from the init/1 callback of the GenServer. When/if the corresponding message arrives, you can again handle it by stopping the server.
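Option 2 as a sketch, using Process.send_after (Elixir's wrapper for :erlang.send_after) armed once in init/1; the module name and the 60-second lifetime are illustrative:

```elixir
defmodule HardDeadlineServer do
  use GenServer

  def start_link(args), do: GenServer.start_link(__MODULE__, args)

  def init(state) do
    # Absolute lifetime limit, armed once at startup; unlike the idle
    # timeout above, this fires regardless of how busy the process is.
    Process.send_after(self(), :deadline, 60_000)
    {:ok, state}
  end

  def handle_info(:deadline, state) do
    {:stop, :normal, state}
  end
end
```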

If you're afraid you'll idle for too long and thus waste resources, then I'd probably suggest option number 1.

Jason Harrelson

Jun 29, 2015, 1:29:27 PM
to elixir-l...@googlegroups.com
Chris,

Thanks so much for taking this much time to help me, as it is exactly what I needed.  Your design is spot on.  In addition, you have covered several concepts for GenServers that I did not know were possible (i.e. the :noreply in the call to effectively block the calling process and then responding later in handle_info).  This makes it so that you do not have to enter a receive block in the Phoenix request, which greatly simplifies the design.  I feel I have finally reached a critical mass of understanding that will allow me to move forward with confidence.  Thanks again!