Error handling: Rust, Google C++, Java... and Rune?


Bill Cox

May 1, 2023, 11:30:02 AM
to Rune language discussion
TL;DR I think Python gets this right.

Rust: Error shaming
I'm reading a lot of Rust code right now.  A surprising fraction is devoted to reporting errors, and it impacts the overall efficiency of the Cloud Hypervisor project.  As if error handling in Rust weren't complex enough, there's the anyhow library to help throw a hierarchy of errors, where each module has the chance to wrap the error in a higher level of context.  At the same time, error handling is far from perfect.  For example, in the api_client library, all errors are handled except "response too short", which results in a panic.  Every other problem parsing a response returns an error.  The funny thing is that a panic generally results in a more useful error message, since it always includes a stack trace.

There are multiple problems here:
  • Error handling is in the coder's face everywhere, requiring more work than I've seen in any other language.
  • It slows down execution even when there are no errors
  • No stack trace is printed unless 1) the application calls panic, or 2) some rare custom work is done to manually call a library to print a stack trace (I've not seen an example of this so far).
We all want logs to be useful, especially in a cloud context where we can't simply fire up gdb and break on the error.  Rust's approach leaves responsibility for this to the programmer, resulting in what I'm calling "Error shaming".  The result is far too much effort put into error handling, and the error messages still leave a lot to be desired.

Google C++
Google does not use catch/throw in C++ for our services, which I think is a good thing.  Instead, we use Status and StatusOr<T>, similar to Rust's Result<T, E>, and try to return an error all the way to the Stubby RPC handler.  This is similar to Rust, but I feel like there is a significant difference.

We use absl::StatusCode, with only 16 different error types, which were selected after years of RPC error handling development.  In 99% of the cases, custom errors are not used, and I would argue that in 70% of the cases where custom errors are used, it is simply a random Googler kicking the tires on the custom error system.  With StatusOr, we return an error message as a string, not an arbitrary type.  In Rust, we return a type E, and almost always define the Display trait for the error type so it can simply be converted to a string.  I've not seen a counter-example so far.

Google's scheme still allows an error message to be wrapped: simply return a new message that includes the old.  I see this all the time in log messages when there are errors handling an RPC.
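To make the "code plus string, wrapped by adding context" idea concrete, here is a minimal Python sketch. It is not Google's or Abseil's API: the Status class, its wrap method, and the two functions are hypothetical, and the enum lists only a few codes named after absl::StatusCode values.

```python
from dataclasses import dataclass
from enum import Enum


class StatusCode(Enum):
    # Hypothetical subset, named after a few absl::StatusCode values.
    OK = 0
    INVALID_ARGUMENT = 3
    NOT_FOUND = 5
    INTERNAL = 13


@dataclass(frozen=True)
class Status:
    """An error is just a standard code plus a human-readable string."""
    code: StatusCode
    message: str = ""

    def ok(self) -> bool:
        return self.code is StatusCode.OK

    def wrap(self, context: str) -> "Status":
        # Keep the code, build a new message that includes the old one.
        return Status(self.code, f"{context}: {self.message}")


def read_config(path: str) -> Status:
    # Hypothetical low-level failure.
    return Status(StatusCode.NOT_FOUND, f"no such file {path}")


def start_server(path: str) -> Status:
    status = read_config(path)
    if not status.ok():
        # Each layer may add context without defining a new error type.
        return status.wrap("failed to start server")
    return Status(StatusCode.OK)
```

The caller sees one code and one string, however many layers added context along the way.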

Google C++ error handling still leaves a lot to be desired:
  • It still impacts the speed of execution even when errors do not occur
  • Ugly macros litter our code, like ASSIGN_OR_RETURN, which Rust builds in as the ? operator.
However, it's not too bad.  IMO, Rust can be saved from error shaming, by simply printing stack traces from the lowest level error, returning one of the 16 standard codes and a string, and almost never defining custom Error types.

Java
Java's error handling is pretty good but a lot of work for the programmer.  Having to declare all the error types that can be thrown by a function requires merging all the error types from every function you call.  In short, it is a PITA.  That said, errors reported in Java typically include good log messages, often stack traces.

Pros:
  • Works well
  • Doesn't lead to error shaming where every error thrown requires a custom type and custom Display traits
  • Does not slow down execution
Cons:
  • Too much work for the coder.
  • Bloats executable size a bit
Python
I do not have Google Python readability, and my views on Python are from the open source projects I've worked on.  Python does not require declaration of error types, and for the most part, folks simply raise exceptions with a plain string message, and don't write custom __repr__ methods for their custom error classes.  I would say a healthy culture exists here: we all hate Python functions that throw exotic errors that we can't simply catch and print.  It goes against the spirit of Python to require that the user understand complex details of dependencies like that.

Rune
Rune's error handling is TBD.  We have the throw keyword, but so far, no catch.  Instead, throw always panics and currently doesn't even print a stack trace.  I would like for error handling in Rune to:
  • Be lightweight in terms of how much code users need to write
  • Be lightweight in terms of impact on runtime
  • Automatically include stack trace info, even when catching errors
  • By default, throw a standard absl::StatusCode if users call throw with a status code
  • Only allow an error message, not an arbitrary class.  I never felt constrained when using Google C++ to return StatusOr.
  • Allow for alternative status codes for those rare cases where it is appropriate, e.g. returning an HTTP response status code.
  • Automatically catch errors in RPC handlers so errors are propagated to the caller.
Implementing catch can be done a number of ways.  Unwinding the Rune stack requires only:
  • freeing of temporary dynamic arrays
  • decrementing reference counts of ref counted objects
Freeing of temporary arrays can be almost free.  Instead of allocating them on the stack, we can have a separate stack for just temp arrays.  try statements would save the temp array stack position on the stack, and when catching an error, free all the ones that were allocated after the try statement was executed.  I think this would have zero, or maybe even a positive impact on runtime speed.
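A rough Python model of the separate temp-array stack may help here. This is only a sketch of the idea, not Rune's runtime: TempArrayStack, RuneError, and the functions below are hypothetical names, and a Python exception stands in for Rune's throw.

```python
class TempArrayStack:
    """Hypothetical model of a separate stack that holds only temp arrays."""

    def __init__(self):
        self._arrays = []

    def alloc(self, size):
        array = [0] * size
        self._arrays.append(array)
        return array

    def mark(self):
        # What a try statement would save: the current stack position.
        return len(self._arrays)

    def release_to(self, position):
        # Free every temp array allocated after the saved position.
        freed = len(self._arrays) - position
        del self._arrays[position:]
        return freed

    def depth(self):
        return len(self._arrays)


class RuneError(Exception):
    """Stand-in for an error raised by Rune's throw."""


def parse(temps):
    temps.alloc(16)
    temps.alloc(32)
    raise RuneError("response too short")


def run(temps):
    temps.alloc(8)             # allocated before the try, must survive
    position = temps.mark()    # saved on entry to the try statement
    try:
        parse(temps)
    except RuneError as err:
        temps.release_to(position)  # one truncation frees both temp arrays
        return f"caught: {err}"
    return "ok"
```

Catching costs a single truncation of the temp stack, no matter how many frames were abandoned.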

Decrementing ref counts is trickier.  The most efficient way in terms of runtime is to define "landing pads" for stack unwinding in every function that references ref-counted objects.  This won't work for a C/C++ backend code generator, it bloats the binary for the LLVM IR backend, and the extra complexity is not clearly worth it, because ref counted objects being referenced in Rune should be rare.

An alternative to landing pads is a stack of object references and their unref functions.  This would be compatible with a C/C++ backend where we use setjmp and longjmp to implement catch and throw.  Good Rune data models should have almost no ref counted classes.
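The stack of object references and unref functions could look roughly like the Python sketch below. All names are hypothetical, and the try/except pair only marks where setjmp and longjmp would sit in a C backend.

```python
class RefCounted:
    """Hypothetical ref-counted object; Rune's real layout is not shown."""

    def __init__(self, name):
        self.name = name
        self.ref_count = 1


def unref(obj):
    obj.ref_count -= 1


class UnwindStack:
    """Side stack of (object, unref function) pairs."""

    def __init__(self):
        self._entries = []

    def push(self, obj, unref_fn):
        self._entries.append((obj, unref_fn))

    def mark(self):
        return len(self._entries)

    def unwind_to(self, position):
        # Drop every reference recorded after the saved position.
        while len(self._entries) > position:
            obj, unref_fn = self._entries.pop()
            unref_fn(obj)


class Thrown(Exception):
    """Stand-in for Rune's throw."""


def worker(stack, obj):
    obj.ref_count += 1        # take a reference and record how to undo it
    stack.push(obj, unref)
    raise Thrown("boom")      # the normal cleanup path is never reached


def guarded(stack, obj):
    position = stack.mark()   # roughly where setjmp would be called
    try:
        worker(stack, obj)
    except Thrown:            # roughly where longjmp would land
        stack.unwind_to(position)
        return False
    return True
```

No per-function landing pad is needed: the catcher alone walks the side stack.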

Bill

Aiden Hall

May 3, 2023, 8:55:28 AM
to Bill Cox, Rune language discussion
I'd add Go to this list as well; I've been writing a fair amount of it recently. Go also supports panic, but most errors are returned and checked directly. The most common thing to see in Go is this:

x, err := foo()
if err != nil {
  return nil, err
}
[...]

Seeing this all over makes the code much harder to read, IMO, because it's effectively boilerplate code injected into every function. Using the C++ StatusOr stuff is much more readable and does essentially the same thing. Go's approach does make tests somewhat easier to write in my experience, perhaps because there are no special exception-catching harnesses in the testing frameworks, which also makes the tests more uniform.

In general I prefer using stack traces because that's the first thing the programmer will have to do when they debug anyways, so doing that work for them seems valuable to the developer experience. I definitely prefer Python's approach personally since I mostly just want the error message and type, which is similar to the Google C++ way with the caveat that Python errors are much easier to code than the Google C++ way. I'm not familiar with the details of how error handling would be implemented, but I'd certainly consider this an area to continue borrowing from the Python philosophy from a syntax/programmer experience perspective.


Andrew Wilson

May 3, 2023, 11:44:11 AM
to Aiden Hall, Bill Cox, Rune language discussion
My turn!

Python exceptions are just fine, IMHO.  While they may be implemented in a bunch of ways, ultimately they're just syntactic sugar for returning StatusOr everywhere. 

I like Herb Sutter's "Zero-overhead deterministic exceptions" proposal for C++: https://wg21.link/p0709r4  It boils down to an invisible 'StatusOr' return type.







--
Andrew Wilson
Software Engineer, Android TV Eng Prod

Bill Cox

May 3, 2023, 4:05:17 PM
to Andrew Wilson, Aiden Hall, Rune language discussion
Wow, Herb Sutter's paper shows just how wonky error handling has become.  I agree with both of you: Python error handling is pretty good.

What should we do for out-of-memory situations?  Even printing in Rune uses dynamic arrays, so handling OOM like other exceptions may significantly impact performance and executable size.  Also, it is not clear whether handlers can succeed when out of memory.

Aiden Hall

May 3, 2023, 6:52:45 PM
to Bill Cox, Andrew Wilson, Rune language discussion
If the process is already hopeless, could we just free some memory and use that? It's not pretty, but I bet that would work, assuming we could identify memory that could be safely cannibalized. I'm not sure this is a good idea, but it is a good strawman.

Andrew Wilson

May 4, 2023, 12:19:42 PM
to Aiden Hall, Bill Cox, Rune language discussion
In my previous jobs, I usually reserved a small amount of memory (a page, I think) solely for handling OOM error messaging.

Bill Cox

May 4, 2023, 6:09:49 PM
to Andrew Wilson, Aiden Hall, Rune language discussion
Either of those solutions works.  I just sent out an initial CL for mostly broken try/catch support.  It only allows errors to be strings, and it does not unwind the stack, causing memory leaks.  Still, it should work OK for the io module.

I went pretty deep into the error handling rabbit hole, and I'm not happy with what I found.  For example, a simple C program with setjmp/longjmp resulted in 67 lines of LLVM IR code, while the equivalent C++ with try/catch was over 250 lines.  LLVM's landing pad stuff is very heavyweight, and I see why Google avoids it.

Maybe we could look at exception handling from a generic "what should happen" point of view:
  • Panic: Do what we can to cleanly shutdown/restart.  Try/catch statements should be avoided, and the stack should not be unwound.
  • OOM: Could panic, but maybe let the user override this?  E.g. a Borg job might notify the Borglet and wait for more memory to become available.
  • A bug was detected: Should act like panic, but maybe let the user catch this panic?  E.g. an RPC exposed a bug, but this might be rarely sensitized.  Do we want to try and avoid RPCs of death?
  • Most of the situations described in absl::StatusCode: resource exhausted, invalid arguments, unauthenticated, permission denied, etc.  These are where we probably need to unwind the stack to the enclosing try/catch, which may be in the RPC event handler.
I don't really see a case for throwing more than an enumerated error code and a string.  Do you?  Clearly panic should print or log a stack trace.  Is it worth it for any of the other cases?  Maybe the other cases should just include the file name and line number?

I think Rust made a mistake in standardizing on returning Result<T, E>, where Result is a tagged union that holds either a T or an E.  Rust's convention has only one error type returnable from a function, and so what I'm seeing in Cloud Hypervisor is a ton of error type translators to translate from one enum to another.  In Python, Java, and other languages, the try statement can catch any number of error types, and this eliminates most of the error code translation.
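For comparison, here is a small Python sketch of one try statement catching several unrelated error types with no translation layer. ParseError, NetworkError, and the helper functions are made up for illustration.

```python
class ParseError(Exception):
    pass


class NetworkError(Exception):
    pass


def fetch(url):
    # Hypothetical transport layer with its own error type.
    if not url.startswith("http"):
        raise NetworkError(f"bad url {url}")
    return b"ok" if url.endswith("/short") else b"pong!"


def parse(data):
    # Hypothetical parser with a different error type.
    if len(data) < 4:
        raise ParseError("response too short")
    return data[:4]


def handle(url):
    try:
        return parse(fetch(url)).decode()
    except (ParseError, NetworkError) as err:
        # One handler covers both types; neither is converted into the other.
        return f"{type(err).__name__}: {err}"
```

Neither layer needs to know about the other's error type, which is the translation code the Result<T, E> convention tends to require.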

So, here's a proposal: Be similar to Python, but require errors to be tuples of the form (Enum, string), where Enum can be any enum.  Matching cases could list any number of Enum types (the whole enum class) and Enum values (entries in the Enum).  Do we need more than this?
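The proposal can be mocked up in Python to see how matching might feel. This is a sketch under assumptions, not Rune syntax: RuneError, matches, and the two enums are hypothetical, and a pattern is either a whole Enum class or a single Enum entry.

```python
from enum import Enum


class IoError(Enum):
    NOT_FOUND = 1
    PERMISSION_DENIED = 2


class RpcError(Enum):
    UNAVAILABLE = 1
    DEADLINE_EXCEEDED = 2


class RuneError(Exception):
    """An error is always the pair (Enum entry, string)."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code
        self.message = message


def matches(err, *patterns):
    """True if err.code is a listed entry or belongs to a listed Enum class."""
    for pattern in patterns:
        if isinstance(pattern, type):
            if isinstance(err.code, pattern):   # whole Enum class
                return True
        elif err.code is pattern:               # single Enum entry
            return True
    return False


def open_file(path):
    raise RuneError(IoError.NOT_FOUND, f"cannot open {path}")


def load(path):
    try:
        open_file(path)
    except RuneError as err:
        # Catch any RpcError, plus exactly one IoError entry.
        if matches(err, RpcError, IoError.NOT_FOUND):
            return f"handled {err.code.name}: {err.message}"
        raise
    return "loaded"
```

Entries from different enums never match each other even when their numeric values coincide, which is why the class and the entry are both part of the error.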

The error thrown could be stored in thread-owned global memory.  It would be a u32 for the Enum class, a u32 for the Enum entry, and a dynamic array for the error message.  These don't need to be passed all the way back up the stack.  For each function that has data needing unwinding (either arrays or ref-counted objects), we can generate an unwind label at the end of the function that will free all of them if they are still valid (non-empty arrays and non-null object refs).

In the LLVM IR backend we can use "invoke" instead of "call", which takes a label for the unwinding code in the function.  In C, we could have a conditional branch after each call that may throw.  This isn't too bad, because the branch predictor should make these nearly zero-time branches.  The main speed impact would be due to less efficient use of the instruction cache, since the code would be larger.
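Here is a Python model of the C-style variant: the error lives in a thread-owned slot, and every call that may throw is followed by a check that jumps to the function's unwind code. Everything in it is hypothetical, including the integer ids standing in for the two u32 fields and the freed list that merely records which arrays the unwind code would release.

```python
import threading

# Thread-owned slot for the pending error:
# (enum class id, enum entry id, message).
_pending = threading.local()


def throw(enum_class, enum_entry, message):
    _pending.error = (enum_class, enum_entry, message)


def has_error():
    return getattr(_pending, "error", None) is not None


def take_error():
    error = _pending.error
    _pending.error = None
    return error


def read_block(size, freed):
    buffer = [0] * size
    if size > 4:
        throw(1, 2, "block too large")
        freed.append("read_block.buffer")   # unwind label: free local arrays
        return None
    return buffer


def read_file(size, freed):
    header = [0] * 2
    block = read_block(size, freed)
    if has_error():                          # branch after a call that may throw
        freed.append("read_file.header")     # this function's unwind label
        return None
    return header + block


def handler(size):
    freed = []
    result = read_file(size, freed)
    if has_error():                          # the enclosing "catch"
        return take_error(), freed
    return result, freed
```

The error itself is never passed up the stack; each frame only tests the slot and runs its own cleanup.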