Dealing with runtime panics

Felix Lange

Jun 9, 2014, 9:38:21 AM
to golan...@googlegroups.com
Hi,

I've been using Go in our startup project for about 7 months now.
The majority of my day job programming so far has been in Erlang.

Comparing the two environments, the one thing I keep thinking about
is the handling of runtime panics. In Erlang, processes are first class
objects that can monitor each other and behave appropriately when
a monitored/linked process crashes (commit suicide, restart the crashed process,
log something, etc.). This facility is at the core of Erlang's bespoke suitability
for systems that 'cannot fail'.

In Go, since goroutines are invisible (except in stack dumps), no such thing
exists. This is a design decision, the apparent reason being that code should
not assume which goroutine it is running in. This decision results in several kinds
of failures that can happen (and do happen in my code, at least during development).

  • Goroutine leaks are easy to produce and hard to find. Given that the
    only way to actually 'see' goroutines is panicking, a recommended debugging
    technique is to put panic("show me the stacks") somewhere into the program
    during development.

  • A panicking goroutine can bring down the whole process, unless panics are
    recovered. However, recover is local to a specific goroutine and thus cannot
    prevent crashes in all cases. This too is by design.
    Following go tip, I have seen a discussion where the runtime will raise
    unrecoverable panics in some situations.

I have three questions about this that I cannot answer myself:

  1. Is it true that panics are not supposed to happen, ever, in a production program?

    I know that panic/recover are used by the json decoder to make its internal
    error handling simpler. I'm not talking about those uses.

  2. Is it worthwhile to wrap certain goroutines with recover simply because
    I suspect that the code that they run might be subtly wrong? net/http does this ;)
    Since that code might create new goroutines that do not recover, I feel like
    I'm risking a crash all the time.

  3. How do others deal with runtime panics in production? I watched Peter Bourgon's
    talk at GopherCon where he spoke about production issues, but panics were
    not mentioned at all!

    I know that it is possible to make the runtime raise SIGABRT on panic with
    GOTRACEBACK=crash, but that doesn't help very much given that gdb still
    doesn't work very well for debugging Go programs, especially from a core dump
    (it seems to be impossible to display the stack of a specific goroutine from a core dump,
    at least for non-wizards).

Please help me find reasonable answers to those questions.

egon

Jun 9, 2014, 10:18:13 AM
to golan...@googlegroups.com


On Monday, 9 June 2014 16:38:21 UTC+3, fjl wrote:
>   • A panicking goroutine can bring down the whole process, unless panics are
>     recovered. However, recover is local to a specific goroutine and thus cannot
>     prevent crashes in all cases. This too is by design.
>     Following go tip, I have seen a discussion where the runtime will raise
>     unrecoverable panics in some situations.

There's nothing you can do about that. What if your computer runs out of RAM? Or some other program writes to another program's memory (because of some bug in the OS)?

> I have three questions about this that I cannot answer myself:
>
>   1. Is it true that panics are not supposed to happen, ever, in a production program?
>
>     I know that panic/recover are used by the json decoder to make its internal
>     error handling simpler. I'm not talking about those uses.
Yes, panic should mean that something has gone very, very wrong.
>   2. Is it worthwhile to wrap certain goroutines with recover simply because
>     I suspect that the code that they run might be subtly wrong? net/http does this ;)
>     Since that code might create new goroutines that do not recover, I feel like
>     I'm risking a crash all the time.
Yes, if you can handle/reason about the panics at that level, you can do that.
>   3. How do others deal with runtime panics in production? I watched Peter Bourgon's
>     talk at GopherCon where he spoke about production issues, but panics were
>     not mentioned at all!

The only way to properly protect your program is to handle the program/server crash rather than separate panics. Basically crash the server and bring it up again. (Have some sort of monitoring program that restarts it, if the program has crashed.)
>     I know that it is possible to make the runtime raise SIGABRT on panic with
>     GOTRACEBACK=crash, but that doesn't help very much given that gdb still
>     doesn't work very well for debugging Go programs, especially from a core dump
>     (it seems to be impossible to display the stack of a specific goroutine from a core dump,
>     at least for non-wizards).
>
> Please help me find reasonable answers to those questions.


Programs and computers crash, there's nothing you can do about it... just have multiple computers and try to restart the systems quickly.

+ egon

akwi...@gmail.com

Jun 9, 2014, 11:45:47 AM
to golan...@googlegroups.com
This blog may help:

Sometimes panics are recoverable, even some serious ones. You just have to pick your battles, but inevitably some will be fatal.

Peter Bourgon

Jun 9, 2014, 3:42:02 PM
to Felix Lange, golang-nuts
> Is it true that panics are not supposed to happen, ever, in a production
> program?

In the general case, yes, that's correct.

> Is it worthwhile to wrap certain goroutines with recover simply because
> I suspect that the code that they run might be subtly wrong?

In the general case, no. Use of recover would immediately elicit
raised eyebrows and pointed questions in a code review.

> How do others deal with runtime panics in production? I watched Peter
> Bourgon's talk at GopherCon where he spoke about production issues, but
> panics were not mentioned at all!

In the general case, production code should never use panic to
indicate anything other than unrecoverable error.

And a +1 to egon's statement:

> Programs and computers crash, there's nothing you can
> do about it... just have multiple computers and try to
> restart the systems quickly.

Hope this helps!
Peter.

Thomas Bushnell, BSG

Jun 9, 2014, 4:08:51 PM
to Felix Lange, golang-nuts
On Mon, Jun 9, 2014 at 6:38 AM, Felix Lange <f...@twurst.com> wrote:
>   • Goroutine leaks are easy to produce and hard to find. Given that the
>     only way to actually 'see' goroutines is panicking, a recommended debugging
>     technique is to put panic("show me the stacks") somewhere into the program
>     during development.
Goroutines are also visible via the runtime.Stack function. It is common to have a debugging interface to your program (maybe make it accessible across HTTP for convenience) which runs things like that so that you can inspect a running program. This is much better than inserting a panic.
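
For illustration, here is a minimal sketch of such a debugging interface. The helper names (allStacks, stacksHandler) and the /debug/stacks path are assumptions made up for this example, not something from the thread. Passing true to runtime.Stack asks for the traces of every goroutine rather than only the calling one:

```go
package main

import (
	"fmt"
	"net/http"
	"runtime"
)

// allStacks returns the stack traces of all goroutines.
// runtime.Stack truncates output that does not fit into the buffer,
// so the buffer is doubled until the whole dump fits.
func allStacks() []byte {
	buf := make([]byte, 64<<10)
	for {
		n := runtime.Stack(buf, true) // true = include every goroutine
		if n < len(buf) {
			return buf[:n]
		}
		buf = make([]byte, 2*len(buf))
	}
}

// stacksHandler serves the dump as plain text (hypothetical endpoint).
func stacksHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; charset=utf-8")
	w.Write(allStacks())
}

func main() {
	http.HandleFunc("/debug/stacks", stacksHandler)
	// A real program would now call http.ListenAndServe on a private
	// address. For this sketch the dump is simply printed once.
	fmt.Printf("%s", allStacks())
}
```

The standard net/http/pprof package registers a comparable dump at /debug/pprof/goroutine?debug=2, which may be enough on its own.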
>   • A panicking goroutine can bring down the whole process, unless panics are
>     recovered. However, recover is local to a specific goroutine and thus cannot
>     prevent crashes in all cases. This too is by design.
>     Following go tip, I have seen a discussion where the runtime will raise
>     unrecoverable panics in some situations.
Computers can crash. Entire buildings can suddenly and unpredictably lose power. You need to write code which deals with that possibility. Having done so, it turns out that the kind of recovery you're talking about is relatively unimportant. 

> I have three questions about this that I cannot answer myself:
>
>   1. Is it true that panics are not supposed to happen, ever, in a production program?

Yes. 

Thomas

Felix Lange

Jun 9, 2014, 4:26:00 PM
to Peter Bourgon, golang-nuts

>> How do others deal with runtime panics in production? I watched Peter
>> Bourgon's talk at GopherCon where he spoke about production issues, but
>> panics were not mentioned at all!
>
> In the general case, production code should never use panic to
> indicate anything other than unrecoverable error.

You're misreading the question. That's kind of my fault because
I didn't phrase it well enough.

Let's face it: panics do happen, also in production code. In Go, any panic
(be it from a call to the built-in function panic, or e.g. a nil pointer deref) will
kill the operating system process. What I'm asking about would be tips/experiences
around handling that case. I could be asking this question on a sysadmin mailing
list, it's not specific to Go. I'm asking on golang-nuts because Go does some things
that e.g. C programs don't do:

  • It prints a giant dump of all goroutines to stderr
  • It produces a core dump that doesn't help me most of the time.

How do you handle that in production? This is mostly about tooling around
Go programs...

With Erlang, when a lightweight process (= goroutine) crashes, a detailed crash
report is sent to the VM-wide error logger. The report includes the stacktrace, the
process state (for OTP servers) and some memory statistics. The error logger
can be extended to send this somewhere else.

net/http does this too. In Go 1.3, it even includes a way to set the logger
that receives error messages and handler panic dumps.
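
A rough sketch of what that looks like, assuming the setting meant here is the ErrorLog field of http.Server. The lockedBuffer type, the demo function and the /boom and /ok paths are invented for this example. The server recovers the handler panic, writes the dump to ErrorLog, and keeps serving:

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"sync"
)

// lockedBuffer is a bytes.Buffer that is safe for concurrent use.
type lockedBuffer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

func (b *lockedBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Write(p)
}

func (b *lockedBuffer) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.String()
}

// demo starts a test server whose /boom handler panics and then checks
// that the server still answers /ok. It returns the /ok status code and
// whatever the server wrote to its ErrorLog.
func demo() (int, string, error) {
	var logged lockedBuffer

	mux := http.NewServeMux()
	mux.HandleFunc("/boom", func(w http.ResponseWriter, r *http.Request) {
		panic("handler bug")
	})
	mux.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "still alive")
	})

	srv := httptest.NewUnstartedServer(mux)
	srv.Config.ErrorLog = log.New(&logged, "server: ", 0)
	srv.Start()
	defer srv.Close()

	// net/http recovers the panic and drops the connection, so the
	// client normally sees an error here; it is ignored on purpose.
	if resp, err := http.Get(srv.URL + "/boom"); err == nil {
		resp.Body.Close()
	}

	resp, err := http.Get(srv.URL + "/ok")
	if err != nil {
		return 0, logged.String(), err
	}
	defer resp.Body.Close()
	return resp.StatusCode, logged.String(), nil
}

func main() {
	status, logged, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("status after panic:", status)
	fmt.Print(logged)
}
```

This only covers panics in the goroutine that net/http started for the connection. A goroutine spawned by the handler is not protected.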

Felix Lange

Jun 9, 2014, 4:39:51 PM
to Thomas Bushnell, BSG, golang-nuts
On 9 Jun 2014, at 22:08, Thomas Bushnell, BSG wrote:
> On Mon, Jun 9, 2014 at 6:38 AM, Felix Lange <f...@twurst.com> wrote:
>> - Goroutine leaks are easy to produce and hard to find. Given that
>> the only way to actually 'see' goroutines is panicking, a recommended
>> debugging technique is to put panic("show me the stacks") somewhere
>> into the program during development.
>
> Goroutines are also visible via the runtime.Stack function. It is
> common to have a debugging interface to your program (maybe make it
> accessible across HTTP for convenience) which runs things like that
> so that you can inspect a running program. This is much better than
> inserting a panic.

Good tip. I'll use that more often.

>> - A panicking goroutine can bring down the whole process, unless
>> panics are recovered. However, recover is local to a specific
>> goroutine and thus cannot prevent crashes in all cases. This too
>> is by design. Following go tip, I have seen a discussion where
>> the runtime will raise unrecoverable panics in some situations
>> <https://codereview.appspot.com/97620043#msg5>.
>
> Computers can crash. Entire buildings can suddenly and unpredictably
> lose power. You need to write code which deals with that possibility.
> Having done so, it turns out that the kind of recovery you're talking
> about is relatively unimportant.

So what you're saying is: if my process can handle being killed
at any time, I don't have to worry so much about handling crashes internally.

That does answer my second question.

Peter Bourgon

Jun 9, 2014, 4:40:49 PM
to Felix Lange, golang-nuts
> Let's face it: panics do happen, also in production code. In Go, any panic
> (be it from a call to the built-in function panic, or e.g. a nil pointer deref)
> will kill the operating system process. What I'm asking about would be
> tips/experiences around handling that case.

In the general case, the correct way to handle that case is to allow
the operating system process to be killed.

Your processes should probably be monitored[0] and supervised[1], so
that crashes trigger alerts and restarts, respectively.

[0] e.g. http://pagerduty.com
[1] e.g. http://smarden.org/runit

There are some special cases in the Go stdlib which do provide
recover-wrapping, like (as you point out) the net/http server. In
general it's not appropriate to extend those exceptional cases to a
general idiom, like "all goroutines should be wrapped in
recover-blocks". Your specific use-case may warrant further
consideration, but without knowing more, it's hard to say.

> With Erlang, when a lightweight process (= goroutine) crashes, a detailed
> crash report is sent to the VM-wide error logger. The report includes the
> stacktrace, the process state (for OTP servers) and some memory statistics.

A goroutine shouldn't be considered equivalent to an Erlang/OTP
lightweight process.

Cheers,
Peter.

Felix Lange

Jun 9, 2014, 4:53:08 PM
to Peter Bourgon, golang-nuts
On 9 Jun 2014, at 22:40, Peter Bourgon wrote:
> There are some special cases in the Go stdlib which do provide
> recover-wrapping, like (as you point out) the net/http server. In
> general it's not appropriate to extend those exceptional cases to a
> general idiom, like "all goroutines should be wrapped in
> recover-blocks". Your specific use-case may warrant further
> consideration, but without knowing more, it's hard to say.

K.

> A goroutine shouldn't be considered equivalent to an Erlang/OTP
> lightweight process.

Diversion: why not?

>> It prints a giant dump of all goroutines to stderr
>> It produces a core dump that doesn't help me most of the time.

How does your Heroku-like thing handle this?

Thank you for taking the time to answer my questions. I will
buy you a beer at the next Berlin Go meetup (if you're there).

Jesse McNelis

Jun 9, 2014, 6:07:36 PM
to Felix Lange, Peter Bourgon, golang-nuts
On Tue, Jun 10, 2014 at 6:52 AM, Felix Lange <f...@twurst.com> wrote:
>> A goroutine shouldn't be considered equivalent to an Erlang/OTP
>> lightweight process.
>
> Diversion: why not?

Erlang's processes are isolated from each other. They can't share
memory and communication is asynchronous. They can be killed and
recover without leaving the program in an undefined state.
The price for this is a lot of copying and a more complicated
communication model.

Goroutines have synchronous communication, share memory and hold
locks. If you recover from a panic that you didn't expect then you
don't know the state of the program any more. The goroutine could have
been holding a lock at the time of the panic, or other goroutines
might be waiting to receive from a channel it was going to send on. So
if you don't trust code not to panic then you also can't trust it to
handle that panic properly.
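
The lock case can be shown in a few lines. This is a sketch with made-up names (buggyUpdate, lockLeakedAfterRecover), and it assumes Go 1.18 or later because it uses sync.Mutex.TryLock to probe the mutex. The panic is recovered, the program keeps running, and the mutex stays locked:

```go
package main

import (
	"fmt"
	"sync"
)

// buggyUpdate takes the lock and then panics before releasing it.
// There is deliberately no "defer mu.Unlock()" here.
func buggyUpdate(mu *sync.Mutex, counts map[string]int) {
	mu.Lock()
	counts["x"]++ // panics when counts is a nil map
	mu.Unlock()
}

// lockLeakedAfterRecover runs buggyUpdate, swallows its panic, and
// reports whether the mutex was left locked.
func lockLeakedAfterRecover() bool {
	var mu sync.Mutex
	func() {
		defer func() { recover() }()
		buggyUpdate(&mu, nil)
	}()
	if mu.TryLock() { // TryLock needs Go 1.18 or later
		mu.Unlock()
		return false
	}
	return true
}

func main() {
	fmt.Println("lock leaked:", lockLeakedAfterRecover())
}
```

Any other goroutine that later calls mu.Lock() on such a mutex would block forever, which is the undefined state described above.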

akwi...@gmail.com

Jun 9, 2014, 9:35:30 PM
to golan...@googlegroups.com, f...@twurst.com, pe...@bourgon.org
>> How do others deal with runtime panics in production? I watched Peter
>> Bourgon's talk at GopherCon where he spoke about production issues, but
>> panics were not mentioned at all!
>
> In the general case, production code should never use panic to
> indicate anything other than unrecoverable error.
 
In that case, gut json from 1.3 before the rollout:) In all seriousness, panics do have use cases.

Jesse McNelis

Jun 9, 2014, 10:28:00 PM
to akwi...@gmail.com, golang-nuts, f...@twurst.com, Peter Bourgon
A panic you receive from code you call into should be considered an
unrecoverable error.
If you're doing both the panic and the recover then you know it's a
recoverable error and it doesn't matter.

If you don't know the state of the program at the point of the panic
then you shouldn't be recovering.
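
A small sketch of the "you do both the panic and the recover" case, in the spirit of the json decoder mentioned earlier in the thread. The parseError type and the Sum and mustAtoi functions are invented for this example. Only the package's own panic type is turned into an error; anything else is re-panicked because its state is unknown:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseError marks panics raised on purpose by this code.
type parseError struct{ err error }

// mustAtoi panics with a parseError instead of returning an error, which
// keeps the inner parsing code free of error plumbing.
func mustAtoi(s string) int {
	n, err := strconv.Atoi(s)
	if err != nil {
		panic(parseError{err})
	}
	return n
}

// Sum is the entry point. It converts parseError panics back into an
// ordinary error and lets every other panic keep propagating.
func Sum(fields []string) (total int, err error) {
	defer func() {
		if r := recover(); r != nil {
			pe, ok := r.(parseError)
			if !ok {
				panic(r) // not ours: state unknown, do not swallow it
			}
			total, err = 0, pe.err
		}
	}()
	for _, f := range fields {
		total += mustAtoi(f)
	}
	return total, nil
}

func main() {
	fmt.Println(Sum([]string{"1", "2", "3"}))
	fmt.Println(Sum([]string{"1", "oops"}))
}
```

Callers never see a panic from Sum for bad input, so from the outside this is plain error handling.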

andrewc...@gmail.com

Jun 9, 2014, 11:59:48 PM
to golan...@googlegroups.com

My general thoughts on this after reading these posts:

  • If you can't guarantee that a crashed goroutine didn't hold a lock or exclusive access to a resource, then the whole program must crash, since you can't recover.
  • If each goroutine is independent, as a server handling requests often is, then perhaps a top-level recover is acceptable.
  • Goroutines branching from independent requests must follow the same rule recursively.
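
As a sketch of the second and third points, assuming the requests really share no state: every goroutine is started through one wrapper that recovers at the top level and reports the panic as an error. The names safeGo and serveAll are made up for this example:

```go
package main

import (
	"fmt"
	"sync"
)

// safeGo runs fn in a new goroutine and reports a panic as an error on
// errs instead of letting it kill the process. Goroutines started from
// inside fn need to go through safeGo as well.
func safeGo(wg *sync.WaitGroup, id int, fn func(), errs chan<- error) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer func() {
			if r := recover(); r != nil {
				errs <- fmt.Errorf("request %d panicked: %v", id, r)
			}
		}()
		fn()
	}()
}

// serveAll handles every request in its own recovered goroutine and
// returns the number of requests that panicked.
func serveAll(requests []func()) int {
	var wg sync.WaitGroup
	errs := make(chan error, len(requests)) // buffered: senders never block
	for i, req := range requests {
		safeGo(&wg, i, req, errs)
	}
	wg.Wait()
	close(errs)

	failed := 0
	for err := range errs {
		fmt.Println(err)
		failed++
	}
	return failed
}

func main() {
	requests := []func(){
		func() { fmt.Println("request 0 ok") },
		func() {
			var p *struct{ n int }
			p.n = 1 // nil pointer dereference
		},
		func() { fmt.Println("request 2 ok") },
	}
	fmt.Println("failed:", serveAll(requests))
}
```

If a request can hold a shared lock or resource, the first point applies and this wrapper is no longer safe.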

egon

Jun 10, 2014, 2:46:38 AM
to golan...@googlegroups.com, andrewc...@gmail.com
... oh I forgot; also read 
"Why Do Computers Stop and What Can Be Done About It?" by Jim Gray http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf

+ egon

akwi...@gmail.com

Jun 12, 2014, 4:07:36 PM
to golan...@googlegroups.com, akwi...@gmail.com, f...@twurst.com, pe...@bourgon.org, jes...@jessta.id.au
What is your point when responding to my post? That panics do have use cases? That was already stated - twice.

Felix Lange

Jun 16, 2014, 5:06:41 PM
to Peter Bourgon, golang-nuts
>> It prints a giant dump of all goroutines to stderr
>>
>> How do you handle that in production? This is mostly about tooling around
>> Go programs...

I've found https://github.com/mitchellh/panicwrap to solve that problem.
