How to degrade gracefully when a task fails?

Myron Marston

Oct 31, 2015, 1:12:23 AM
to elixir-l...@googlegroups.com

I’m working on a bit of code that offloads a DB query to a task. The result of the DB query isn’t absolutely necessary to provide a response but provides some extra metadata that the user can do without in a pinch. If the DB query takes too long or fails, I’d like to degrade gracefully and provide an incomplete response.

I’ve got it degrading nicely when the query takes too long by using Task.yield:

task = Task.async(fn -> run_query() end)
# do other work
results =
  case Task.yield(task, timeout) do
    {:ok, results} -> results
    nil ->
      Task.shutdown(task)
      Logger.info("Task timed out...")
      []
  end
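For anyone who wants to try the timeout behaviour outside a real app, here is a minimal, self-contained sketch of the same yield-then-shutdown pattern. `MetadataFetch` and `fetch/2` are hypothetical names, and the query is just a function passed in. It uses the `Task.yield(task, timeout) || Task.shutdown(task)` idiom from current Elixir releases, so a reply that arrives right as the timeout fires is kept rather than thrown away.

```elixir
defmodule MetadataFetch do
  require Logger

  # Hypothetical helper: runs `query_fun` in a task and waits at most
  # `timeout` ms. Returns the query result, or [] if it doesn't finish in time.
  def fetch(query_fun, timeout) do
    task = Task.async(query_fun)

    # ... the caller's other work would go here ...

    # If yield/2 gives up (nil), shutdown/1 stops the task. shutdown/1 may
    # still return {:ok, result} if the reply arrived in the meantime.
    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, results} ->
        results

      {:exit, reason} ->
        # The task exited without replying; degrade the same way.
        Logger.info("Task exited: #{inspect(reason)}")
        []

      nil ->
        Logger.info("Task timed out, returning an incomplete response")
        []
    end
  end
end
```

Note that this only covers the timeout case: a crash inside `query_fun` still takes the caller down, because `Task.async/1` links the two processes.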

I can’t figure out how to deal with errors (e.g. DB connection failures) in the query, though. I don’t want my main process to crash if there’s a query failure. Task.async/1 links the processes so they mutually crash. The Task docs suggest using Task.start_link/1 when you don’t want the processes linked, but I’m not sure how to get the result back when using Task.start_link/1 as it does not return a task.

Any suggestions for how to do this?

Thanks,
Myron

Saša Jurić

Oct 31, 2015, 4:48:51 AM
to elixir-lang-talk
I've blogged a bit about similar scenarios here.

In your case, the easiest thing would probably be to explicitly catch the query error, something like:

Task.async(fn ->
  try do
    {:ok, run_query()}
  catch
    type, error ->
      {:error, {type, error}}
  end
end)
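A self-contained version of that snippet, as a rough sketch: `SafeTask.async/1` is a hypothetical wrapper, and the query is passed in as a function. Whatever is raised, thrown, or exited from inside `fun` becomes an `{:error, {kind, reason}}` value, so the task itself finishes normally and `Task.await/1` returns a tuple instead of crashing the caller.

```elixir
defmodule SafeTask do
  # Hypothetical wrapper: anything raised, thrown or exited from inside
  # `fun` is turned into {:error, {kind, reason}} instead of a task crash.
  def async(fun) do
    Task.async(fn ->
      try do
        {:ok, fun.()}
      catch
        kind, reason -> {:error, {kind, reason}}
      end
    end)
  end
end
```

For example, `Task.await(SafeTask.async(fn -> throw(:oops) end))` yields `{:error, {:throw, :oops}}`.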

There's also Task.Supervisor.async_nolink on Elixir master, which could be used to achieve the same thing.

José Valim

Oct 31, 2015, 7:54:15 AM
to elixir-l...@googlegroups.com
The idea behind async/await is that it won't change the semantics of your code. So the way to keep perform_some_query() from blowing up without a task is also the way to do it when a task is involved: rescue the expected error.
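To make the "same solution with or without a task" point concrete, here is a small sketch. `MyApp.DBError` is a hypothetical exception standing in for whatever your DB driver actually raises, and `MyApp.Metadata` is a hypothetical module. Only the expected error is rescued; anything else still propagates as before.

```elixir
defmodule MyApp.DBError do
  # Hypothetical exception standing in for the driver's real error.
  defexception message: "database unavailable"
end

defmodule MyApp.Metadata do
  # Plain function: rescues only the error we expect, nothing else.
  def fetch(query_fun) do
    try do
      query_fun.()
    rescue
      MyApp.DBError -> []
    end
  end

  # The task version just calls the same function; no extra handling needed.
  def fetch_async(query_fun) do
    Task.async(fn -> fetch(query_fun) end)
  end
end
```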

Note that async_nolink still won't help in this case, because yield/await and friends will still error if the monitored process is dead.



José Valim
Skype: jv.ptec
Founder and Director of R&D

--
You received this message because you are subscribed to the Google Groups "elixir-lang-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-ta...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-talk/011e2e4a-3864-4588-b995-4a5df748e8f1%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Myron Marston

Oct 31, 2015, 3:23:16 PM
to elixir-lang-talk, jose....@plataformatec.com.br
Thank you both! I had tried rescuing the error without success, but I was wrapping the `Task.yield` call rather than the query itself inside the task function. That didn't occur to me for some reason.

Saša, in your code snippet you use `catch`, but I'd expect `rescue`, since I thought `rescue` was for errors and `catch` was for throws, which, as in Ruby, are used for control flow in rare situations. Did I learn the distinction between rescue/catch wrong, or is there some other reason you used catch there?

Thanks again,
Myron

José Valim

Oct 31, 2015, 4:03:56 PM
to Myron Marston, elixir-lang-talk
catch would also handle exits. I would use rescue though unless you really need to handle exits too.

Saša Jurić

Oct 31, 2015, 6:45:03 PM
to elixir-l...@googlegroups.com
tl;dr: I would most often use catch

There are a couple of layers here, so I’ll go gradually. Things will become more involved as I progress, but most of the later stuff is usually not relevant, so keep in mind my first sentence :-)

First, if I understand correctly, you want to ignore task errors: whatever goes wrong in the task, you want the “main” process (the caller) to still produce some output. I’ll assume this is the case; if it’s not, the rest of the discussion doesn’t make much sense.

If you don’t want an error to propagate, then catch type, error is IMO the way to go, because it deals with anything that happens within the do block. Be it a throw, an error, or an exit, catch will catch it. In contrast, rescue deals only with errors, so throws and internally induced exits will bubble up. If that’s what you want, fine, but I just wanted to clear it up.
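A quick way to see the rescue/catch difference for yourself, as a minimal sketch (`RescueVsCatch` is just an illustrative module): `rescue` handles raised exceptions only, while `catch kind, reason` also handles throws and exits. A throw passes straight through `rescue` and has to be caught further up.

```elixir
defmodule RescueVsCatch do
  # rescue: only handles exceptions (kind :error).
  def with_rescue(fun) do
    try do
      fun.()
    rescue
      e -> {:rescued, e}
    end
  end

  # catch kind, reason: handles errors, throws and exits alike.
  def with_catch(fun) do
    try do
      fun.()
    catch
      kind, reason -> {:caught, kind, reason}
    end
  end
end
```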

For the record, I use a catch-all only occasionally, and when I do, I almost always log an error to make sure it doesn’t go unnoticed.

It’s worth mentioning that catch can’t handle something that is triggered from the outside. If another process sends an exit signal to your task process, catch won’t deal with it, and the error will bubble up to the task caller. This can happen if your task process links to some other process, that other process crashes (exits with a non-normal reason), and the task process doesn’t trap exits.

With a lot of hand waving, I’d say 9 times out of 10 (maybe even 99/100 or more) this won’t be the case.
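The external-exit caveat can be demonstrated with a short sketch. `ExternalExit.demo/0` is a hypothetical name, and a `spawn_monitor` process stands in for the task so the demo doesn't take the caller down. The process wraps its work in `try/catch`, but the work links to another process that exits with `:boom`; the incoming exit signal kills the wrapper regardless of the catch clause.

```elixir
defmodule ExternalExit do
  # Returns the exit reason of a process whose work is wrapped in try/catch
  # but which is killed by an exit signal from a crashing linked process.
  def demo do
    {pid, ref} =
      spawn_monitor(fn ->
        try do
          # Linked process crashes with a non-normal reason...
          spawn_link(fn -> exit(:boom) end)
          Process.sleep(1_000)
          :finished
        catch
          # ...and this clause never runs: the exit signal is not an
          # exception raised inside the try block.
          _kind, _reason -> :caught
        end
      end)

    receive do
      {:DOWN, ^ref, :process, ^pid, reason} -> reason
    end
  end
end
```

Here `ExternalExit.demo()` returns `:boom`, not `:caught`.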

However, if you want to be super-paranoid and say “I don’t want the ‘main’ process to crash, no matter what happens to the task process”, then there are two options:

1. Use async_nolink and catch in the caller on every yield/await
2. Per my blog, reinvent most of the async stuff

Option 1 is fairly simple, but it has two downsides:

1. async_nolink is on master but not in the latest release
2. If the caller crashes, the task will still keep running. 

If these properties are not a problem, then I’d use option 1.
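Option 1 might look roughly like this sketch against current Elixir releases (`NoLinkFetch` is a hypothetical module). In those releases, `Task.yield/2` returns `{:exit, reason}` for a crashed `async_nolink` task instead of crashing the caller, while `Task.await/2` still exits the caller, so with `await` you would need to catch `:exit` yourself. The second downside above still applies: if the caller dies, this task keeps running.

```elixir
defmodule NoLinkFetch do
  require Logger

  # Runs `query_fun` under a Task.Supervisor without linking it to the
  # caller; a crash in the task is reported to the caller, not propagated.
  def fetch(sup, query_fun, timeout) do
    task = Task.Supervisor.async_nolink(sup, query_fun)

    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, result} ->
        result

      {:exit, reason} ->
        Logger.error("metadata query failed: #{inspect(reason)}")
        []

      nil ->
        # Timed out and was shut down.
        []
    end
  end
end
```

Usage: start a supervisor once (for example `{:ok, sup} = Task.Supervisor.start_link()`, or as part of your supervision tree) and pass it in.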


If you want asymmetrical error propagation where:

1. The caller doesn’t crash when the task crashes but
2. The task should crash if the caller crashes

then I’d go for wheel reinvention as described in the blog.
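To give an idea of what such "wheel reinvention" involves, here is one possible sketch of asymmetric propagation built from plain processes. It is a hypothetical design, not the code from the blog post. The caller only *monitors* a watcher process, so it never crashes with the work. The watcher *monitors* the caller and is *linked* to the worker running `fun`: if the worker crashes, the watcher exits with the same reason (the caller just sees `{:exit, reason}`); if the caller dies, the watcher kills the worker.

```elixir
defmodule OneWayTask do
  # caller --monitor--> watcher --link--> worker (runs fun)
  #                     watcher --monitor--> caller
  def async(fun) do
    caller = self()
    tag = make_ref()

    {pid, ref} =
      spawn_monitor(fn ->
        # Trap exits so a worker crash arrives as a message.
        Process.flag(:trap_exit, true)
        caller_ref = Process.monitor(caller)
        watcher = self()
        worker = spawn_link(fn -> send(watcher, {:result, fun.()}) end)
        watch(caller, caller_ref, worker, tag)
      end)

    {pid, ref, tag}
  end

  defp watch(caller, caller_ref, worker, tag) do
    receive do
      {:result, value} ->
        # Forward the result; the watcher then exits normally.
        send(caller, {tag, value})

      {:EXIT, ^worker, reason} ->
        # Worker crashed: exit with the same reason; the caller only
        # receives a :DOWN message through its monitor.
        exit(reason)

      {:DOWN, ^caller_ref, :process, _pid, _reason} ->
        # Caller died: stop the work as well.
        Process.exit(worker, :kill)
        exit(:caller_down)
    end
  end

  # Returns {:ok, value}, {:exit, reason}, or nil on timeout.
  def yield({pid, ref, tag}, timeout) do
    receive do
      {^tag, value} ->
        Process.demonitor(ref, [:flush])
        {:ok, value}

      {:DOWN, ^ref, :process, ^pid, reason} ->
        {:exit, reason}
    after
      timeout -> nil
    end
  end

  # Kills the watcher (and, through the link, the worker).
  def shutdown({pid, ref, _tag}) do
    Process.demonitor(ref, [:flush])
    Process.exit(pid, :kill)
    :ok
  end
end
```

The price is exactly the boilerplate mentioned above: an extra process per task and hand-written message handling that `Task` would otherwise do for you.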

If at this point your head hurts, don’t feel bad. The second half of my discussion deals with situations that are not likely to happen. Assuming I correctly understand the desired behaviour, most of the time I’d go for plain catch type, error, as I mentioned in my original answer.

José Valim

Oct 31, 2015, 7:03:43 PM
to elixir-l...@googlegroups.com
tl;dr: explicitly list the cases you want to rescue/catch. Avoid catch-alls.

I have to strongly disagree with this one, Saša. :)

You generally want to explicitly list the cases you want to rescue. Otherwise, there is a very high chance you end up catching errors that come from actual bugs, which may have other consequences in the system. And if you forget to log, you may never find out about them.

From the original message by Myron: "I’ve got it degrading nicely when the query takes too long". He knows exactly the case he wants to handle. Surely there are some cases where you want to use "catch", but they are by and large the exception. Furthermore, 99% of the time I use the "catch kind, reason" syntax, I end up re-raising what I caught anyway.

Finally, once you catch the error, you change the exit status of the Task, which is particularly worrying when you spawn other tasks from the parent task. And catching everything just increases the chances you end up catching something you really should not, affecting the processes involved.
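The catch-and-re-raise shape mentioned above might look something like this minimal sketch (`LogAndReraise` is a hypothetical module). It logs whatever was caught and then re-raises it with the same kind, reason, and stacktrace via `:erlang.raise/3`, so the task's exit status is left unchanged.

```elixir
defmodule LogAndReraise do
  require Logger

  # Catches anything, logs it, then re-raises it unchanged so the failure
  # still propagates with its original kind, reason and stacktrace.
  def run(fun) do
    try do
      fun.()
    catch
      kind, reason ->
        Logger.error(Exception.format(kind, reason, __STACKTRACE__))
        :erlang.raise(kind, reason, __STACKTRACE__)
    end
  end
end
```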

Myron Marston

Nov 1, 2015, 1:14:01 AM
to elixir-l...@googlegroups.com

This is a very enlightening discussion, so thank you both :).

> He knows exactly the case he wants to handle.

I don’t know the exact exceptions I want to handle. I’d have to dig through Ecto and Mariaex to see what all the possible exceptions are, and that’s not a particularly appealing approach here. I don’t see how any throws or process exits could happen here, so using catch type, error feels like overkill. I’m thinking of using the following:

task =
  Task.async(fn ->
    try do
      run_query()
    rescue
      ex ->
        log_exception(ex)
        []
    end
  end)

One part I’m unsure about is log_exception: I can format a message for Logger.error easily enough, but exceptions raised in tasks are usually already logged with a well-formatted message including the stacktrace, etc. Is there something in the stdlib that will format the exception for me, so I don’t have to format it myself when logging it?
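One stdlib option for the formatting question is `Exception.format/3`, which renders the kind, message, and stacktrace in the same style as crash reports. A minimal sketch, assuming a current Elixir release: `QueryWithFallback` and its private `log_exception/2` are hypothetical names, and `__STACKTRACE__` (available inside `rescue`/`catch` since Elixir 1.7) replaces the older `System.stacktrace/0`.

```elixir
defmodule QueryWithFallback do
  require Logger

  # Runs `query_fun`; on any exception, logs it with a full stacktrace and
  # returns [] as the degraded result.
  def run(query_fun) do
    try do
      query_fun.()
    rescue
      ex ->
        log_exception(ex, __STACKTRACE__)
        []
    end
  end

  # Exception.format/3 produces "** (RuntimeError) ..." plus the stacktrace.
  defp log_exception(ex, stacktrace) do
    Logger.error(Exception.format(:error, ex, stacktrace))
  end
end
```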

On a side note, I read through the Getting Started page on try, catch and rescue, as well as the docs for try, and I don’t see anything that mentions that catch can be used to handle exceptions. Perhaps the docs could be improved to mention this? I’d work up a PR myself if not for the fact that my understanding of this area of Elixir is still very limited.

Thanks,
Myron


Saša Jurić

Nov 1, 2015, 4:03:11 AM
to elixir-l...@googlegroups.com
On 01 Nov 2015, at 00:03, José Valim <jose....@plataformatec.com.br> wrote:

> tl;dr: explicitly list the cases you want to rescue/catch. Avoid catch-alls.
>
> I have to strongly disagree with this one, Saša. :)
>
> You generally want to explicitly list the cases you want to rescue. Otherwise, there is a very high chance you end up catching errors that come from actual bugs, which may have other consequences in the system. And if you forget to log, you may never find out about them.
>
> From the original message by Myron: "I’ve got it degrading nicely when the query takes too long". He knows exactly the case he wants to handle. Surely there are some cases where you want to use "catch", but they are by and large the exception. Furthermore, 99% of the time I use the "catch kind, reason" syntax, I end up re-raising what I caught anyway.

We’re not talking generally though. The original problem is stated as: “If the DB query takes too long or fails, I’d like to degrade gracefully and provide an incomplete response.”

How can a DB query fail? I don’t know, and to find out I’d need to look through all of the code of the task and any dependency involved, consider every possible input, and detect possible bugs. It’s hard to get a reliable answer today, and it’s impossible to be certain that answer will always hold. Explicitly listing possible failures doesn’t really cut it here.

So what I’m suggesting is not a general pattern. It’s an easy (though not a proper) fix when I don’t want some crash to propagate. Most often I don’t do that, but there are exceptions (no pun intended).


> Finally, once you catch the error, you change the exit status of the Task, which is particularly worrying when you spawn other tasks from the parent task. And catching everything just increases the chances you end up catching something you really should not, affecting the processes involved.

You raise some good points here. However, the same issues hold when rescuing all or even some errors.

A proper solution would of course be to let the task crash without crashing the caller. It may require some more boilerplate, but it gives you the strongest guarantees. More on that can be found in my blog post, especially in the section “Explicitly handling errors”.