How to handle errors

It seems I’m going through my old articles and rewriting them. The first version of How to estimate was written in 2008, and I wrote the first version of this article around 2004. Originally it focused on exceptions, but here I want to talk about:

Checking errors
Handling errors
Designing errors

Checking errors

In the OCaml programming language, you can define so-called variant types. A variant type is a composite of several other types; an expression or function result of that type holds a value belonging to one of those types:

type int_or_float = Int of int | Float of float

(* This type can for example be used like this: *)
let pi_value war_time =
    if war_time then Int(4) else Float(3.1415)

# pi_value true;;
- : int_or_float = Int 4

# pi_value false;;
- : int_or_float = Float 3.1415

(You can try this code online at http://try.ocamlpro.com/)

OCaml is a statically typed, very safe language. This means that if you use this function, the compiler forces you to handle the Int and the Float cases separately:

# pi_value true + 10;;  (* do you expect the answer 14? no, you'll get an error: *)
Error: This expression has type int_or_float
       but an expression was expected of type int

(* what you have to do is, for example, this: *)
# match pi_value true with
      Int(x) -> x + 10
    | Float(y) -> int_of_float(y) + 10;;
- : int = 14

The match expression checks whether the value returned by pi_value true is an Int or a Float, and evaluates a different expression in each case. If you forget to handle one of the possible cases in your match expression, the OCaml compiler will remind you with a warning that the pattern matching is not exhaustive.

One of the idiomatic usages of variant types in OCaml is error handling. You define the following type:

type my_integer = Int of int | Error

Now, you can write a function returning a value of type “integer or error”:

let foobar some_int =
    if some_int < 5 then Int(5 - some_int) else Error

# foobar 3;;
- : my_integer = Int 2

# foobar 7;;
- : my_integer = Error

Now, if you want to call foobar, you have to use a match expression, and you should handle all possible cases. For example:

let blah a b =
    match (foobar a, foobar b) with
          (Int(x), Int(y)) -> x + y
        | (Error , Int(y)) -> y
        | (Int(x), Error)  -> x
        | (Error , Error)  -> 42

Not only does this language design feel very clean, it also helps to understand that errors are just return values of functions. They are part of the function's codomain (its range), together with the "useful" return values. From this point of view, not checking and not being able to process the error values returned by a function should feel just as strange as not being able to process some particular integer return value.
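As a minimal sketch of the same idea: recent OCaml versions ship a built-in result type (with the constructors Ok and Error) in the standard library, so you don't have to define a new variant for every function. The parse_error type and the functions parse_int and describe below are made up for illustration; they are not part of the examples above.

```ocaml
(* Hypothetical error type for this sketch: the possible failures
   are spelled out as ordinary values. *)
type parse_error =
  | Empty_input
  | Not_a_number of string

(* The return type says it all: either an int, or one of the errors. *)
let parse_int (s : string) : (int, parse_error) result =
  if s = "" then Error Empty_input
  else
    match int_of_string_opt s with
    | Some n -> Ok n
    | None -> Error (Not_a_number s)

(* The caller has to cover the whole codomain, errors included;
   leaving out a case triggers a non-exhaustive match warning. *)
let describe (s : string) : string =
  match parse_int s with
  | Ok n -> Printf.sprintf "parsed %d" n
  | Error Empty_input -> "empty input"
  | Error (Not_a_number raw) -> Printf.sprintf "not a number: %s" raw
```

Here the error cases carry their own data (the offending input), which is something a plain integer error code cannot do.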

Still, I often don't check for errors. I think this is related to the design of many mainstream languages, which makes it harder to emulate variant types or to return several values. Let's take C for example:

int error_code;
float actual_result = 0;

error_code = foo(input_param, &actual_result);

if(error_code < 0) {
  // handle error
} else {
  // use actual_result
}

and compare it with OCaml, which is both safer and more concise:

type safe_float = Float of float | ErrorCode of int

match foo input_param with
      Float(f) -> (* use result *)
   |  ErrorCode(e) -> (* handle the error *)

Unfortunately, most of us have to use mainstream languages. Error checking makes source code less readable, so I try to counteract this by using one uniform code style for error handling (e.g. the same variable names for error codes and the same code formatting).

Recap: checking for errors is the same as being able to handle the whole spectrum of possible return values. Make it part of your code style guide.

Handling errors

In Smalltalk, exceptions (and many other things) are not implemented as a magical part of the language itself. They are just normal classes in the standard library. (If you're OK with Windows, try Smalltalk with my favorite free implementation; otherwise go for the cross-platform market leader. In Smalltalk, the REPL is traditionally called a Workspace.)

Processor "This global constant gives 
           you the scheduler of Smalltalk 
           green threads"

Processor activeProcess "This returns the green thread 
                         being currently executed"

Processor activeProcess exceptionEnvironment "This gives the 
                                              current ExceptionHandler"

my_faulty_code := [2 + 2 / 0] "This produces a BlockClosure, 
                               which is also known as closure, 
                               lambda or anonymous method 
                               in other languages"

my_faulty_code on: ZeroDivide 
               do: [ :ex | Transcript 
                              display: ex; 
                              cr] "This will print the 
                                   ZeroDivide exception 
                                   to the console"

The last statement does roughly the following:

The method #on:do: of the class BlockClosure creates a new ExceptionHandler object, passing it ZeroDivide as the class of exceptions this handler cares about, and the second BlockClosure, which will be evaluated when the exception happens.
It temporarily saves the current value of Processor activeProcess exceptionEnvironment.
It sets the newly created ExceptionHandler as the new Processor activeProcess exceptionEnvironment.
It stores the previously saved exception handler in the outer property of the new ExceptionHandler.

This effectively creates a stack of ExceptionHandlers, based on a trivial linked list, and having its head (the top) in Processor activeProcess exceptionEnvironment.

Now, when you throw an exception:

ZeroDivide signal

the signal method of the Exception class, which ZeroDivide inherits from, starts with the ExceptionHandler currently stored in Processor activeProcess exceptionEnvironment and walks the linked list until it finds an ExceptionHandler that handles this class of exception. Then the exception object passes itself to that handler.
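To make the mechanics more tangible, here is a toy model of the handler stack in OCaml. It is only a sketch of the lookup described above, not Smalltalk's actual implementation: it does not unwind the call stack, and the names exn_class, handler, exception_environment, on_do and signal are my own inventions for this illustration.

```ocaml
(* Two made-up exception classes for the toy model. *)
type exn_class = Zero_divide | Key_not_found

type handler = {
  handles : exn_class;           (* class of exceptions this handler cares about *)
  action : exn_class -> string;  (* block evaluated when the exception happens *)
  outer : handler option;        (* the handler that was active before this one *)
}

(* Plays the role of "Processor activeProcess exceptionEnvironment":
   the head of the linked list of handlers. *)
let exception_environment : handler option ref = ref None

(* Rough counterpart of #on:do: push a new handler, evaluate the body,
   then restore the previously active handler. *)
let on_do body handles action =
  let saved = !exception_environment in
  exception_environment := Some { handles; action; outer = saved };
  Fun.protect ~finally:(fun () -> exception_environment := saved) body

(* Rough counterpart of #signal: walk the list from its head until
   a handler for this exception class is found. *)
let signal e =
  let rec find = function
    | None -> "unhandled"
    | Some h when h.handles = e -> h.action e
    | Some h -> find h.outer
  in
  find !exception_environment

(* The inner handler only cares about Key_not_found, so the lookup
   moves on to the outer one. *)
let demo =
  on_do
    (fun () ->
      on_do (fun () -> signal Zero_divide) Key_not_found (fun _ -> "inner handler"))
    Zero_divide
    (fun _ -> "outer handler")
(* demo = "outer handler" *)
```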

Not only does this look very clean and serve as a brilliant example of proper OOD, it also helps to understand that an exception is just a construct that lets you skip checking for the error in the immediate caller and propagate it back up the call stack instead.

Now why is it important?

Because checking for an error is one thing, and handling it meaningfully is another. The latter is not always possible in the immediate caller.
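A small sketch of this distinction, using OCaml exceptions. All names here (Config_missing, read_setting, connection_string, start) are hypothetical and only serve the illustration: the lowest layer detects the problem, the middle layer has no idea what to do about it, and only the top level knows enough to react.

```ocaml
(* Hypothetical exception carrying the name of the missing setting. *)
exception Config_missing of string

(* Lowest layer: detects the error, but cannot decide what it means. *)
let read_setting table key =
  match List.assoc_opt key table with
  | Some value -> value
  | None -> raise (Config_missing key)

(* Immediate caller: it has no meaningful way to handle the error,
   so it neither checks for it nor catches it. *)
let connection_string table =
  read_setting table "host" ^ ":" ^ read_setting table "port"

(* Top level: knows the overall context, so the exception is handled here. *)
let start table =
  try "connecting to " ^ connection_string table
  with Config_missing key -> "cannot start: missing setting " ^ key
```

With return values instead of exceptions, connection_string would have to check and forward the error explicitly, even though it cannot do anything useful with it.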

Deciding how to handle errors meaningfully is one of the advanced aspects of software development. It requires an understanding of the software system I'm working on as a whole, and the motivation to make the code as user-friendly as possible -- in the most generic sense: my code can be used by linking and calling it from other code; an end user may execute it and interact with it; or somebody may try to read, understand, debug and modify my code.

What makes things worse is the realization that, most of the time, sporadic run-time errors happen in a very small percentage of use cases, and are therefore usually associated with a fairly small estimated loss of business value. So, from the business perspective, only a small time budget can be provided for error handling. We all know that, and when under time pressure, the first thing most developers compromise on is error handling.

Therefore, every time I decide how to handle an error, I try to answer all of the following questions:

How far should we go trying to recover from the error, given the limited time budget?
If the user is blocked waiting for the results of our execution, how do we unblock him without (if possible) making him angry?
If the user is not blocked, should we inform him at all?
If we assume that a software bug is the reason for the error, how do we help testers to find it, and developers to fix it?
If we assume an issue with the installation or environment, how do we help admins to fix it?

Usually, this all boils down to one of the following error handling strategies (or a combination of them):

Silently swallow the error.
Just log the exception.
Immediately fully crash the app.
Just try again (max. N times, or indefinitely).
Try to recover, or at least to degrade gracefully.
Inform the user.

I'll try to describe a typical situation for each of the handling strategies.

I'm using a third-party library that throws an exception in 20% of cases when I use it. When this exception is thrown, the required function will still be somehow performed by the library. I will catch this specific class of exceptions and swallow them, writing a comment about it in the exception handler.

I'm writing a tracking function, which will be called 10 times a second to send the user's mouse position from the web browser back to the web server. When posting to the server fails for the first time, I will log the error (including all information available), and then either swallow all subsequent errors or log only every 10th error.

The technology I'm using for my app allows me to define what to do if an exception is unhandled. In this handler, I will detect whether my app is running in development or staging, or in production. When running in production, I will log the error and then swallow it. If it is not possible to swallow the error, I'll inform the user about an unexpected error (if possible, using some calming graphic). Outside production, I will crash the app immediately and write a core dump, or at least log the most accurate and complete information about the exception and the current app state. This will help both me and the testers to detect even smaller problems, because it helps create a no-bug-tolerance mindset.

I'm writing a logger class for an app working in the browser. The log is stored on the web server. If sending the log message fails, I will repeat the post 3 times with some delay. If it still fails, I'll try to write it into the local offline storage. If writing in this storage fails, I will output it to the console.

I'm writing some boring enterprise app with a lot of forms. The user clicks on a button to pre-populate the form with data from the previous week. In this event handler, I will place a try/catch block, and my exception handler will log the exception. I will then go chat with the UX designer to decide if and how exactly to inform the user about the issue.
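The logger scenario above (retry a few times, then degrade step by step) could be sketched like this. This is an illustration under simplifying assumptions, not a real logging API: the three sinks are passed in as plain functions, the remote and local sinks report success as a boolean, and the delay between retries is left out. The names sink, retry and log_with_fallback are made up.

```ocaml
(* Which sink finally accepted the message. *)
type sink = Remote | Local | Console

(* Call [send] up to [attempts] times; stop at the first success.
   A real implementation would wait between the attempts. *)
let rec retry ~attempts send msg =
  if attempts <= 0 then false
  else if send msg then true
  else retry ~attempts:(attempts - 1) send msg

(* Try the web server 3 times, then the local offline storage,
   and fall back to the console as the last resort. *)
let log_with_fallback ~remote ~local ~console msg =
  if retry ~attempts:3 remote msg then Remote
  else if local msg then Local
  else begin
    console msg;
    Console
  end
```

Returning the sink that was actually used is a design choice of this sketch; it keeps the degradation visible to the caller instead of hiding it completely.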

Recap: handling errors is not trivial -- there are no hard and fast rules, and you need to know about the whole system and think about usability to do it properly.

Designing errors

Designing errors means deciding when and how to signal error conditions from a reusable component (framework). If handling errors is complicated because you have to know the overall context and think about usability, designing errors is an order of magnitude more complicated, because you have to imagine all possible systems, contexts and situations where your code will or can be used, and design your errors so that they can be handled easily (or at the very least, reasonably).

Frankly speaking, I haven't (yet) designed an error system I was particularly proud of, and I think this complicated topic is pretty subjective and a question of style.

My personal style is to believe that my framework or library is just a guest, and the calling code is the host. As a guest, one must respect the decisions of the host and not try to force any specific usage pattern. This is why most (but not all) of my properties and methods are public. I don't know how the host is going to communicate with me, and I'm not going to force one specific style on him, or declare some topics taboo. I still specifically mark the preferred class members though, so that I can indicate my own communication preferences (or suggested API) to the host. I also warn the host in the comments that all members outside of the suggested API are subject to change without notice. But ultimately, it is up to the host what to use and how.

This approach has the drawback that the developer writing the calling code has to understand much more about my component, and gets less support from the IDE and the compiler in detecting usages of class members that I don't recommend. But it has the advantage of giving the greatest possible freedom to the host. And I believe that I have to trust the host not to use his freedom to shoot himself in the foot.

When designing errors, I think I should follow the same idea. This means:

If possible, do not force the caller to check for errors. It is his choice. Maybe he is just prototyping some quick and dirty idea and needs my component for throw-away code.
If possible, do not handle errors in the component, but let the calling code handle them. Or at least, make it configurable. Specifically, do not retry or try to recover from an error, because retrying or recovering takes time and resources, and the host might have a different opinion about what error handling is appropriate. Provide an easy API for retrying/recovering though, so that if the caller decides to recover, it is easy for him. Another example: do not log errors, or at least make the logging configurable. In one of my recent components I violated this recommendation. When the server didn't respond in time, 3 lines were added to the log instead of one: the first one by the transport layer, the second one by the business logic that actually needed the data from the server, and the third one by the UI event handler. This is unnecessary. Logs must be readable, and the usability of the logging is one of the most important factors separating well designed error handling from badly designed error handling.
Do not provide text messages describing the error. The caller might be tempted to use them "as is" and show them in a message box. And then he'll get problems when his app needs to be translated into another language.
Provide different kinds of errors when you anticipate that their handling might be different. For example, if the component is a persistence layer, provide one kind of exception for problems related to network communication with the database, and another kind of exception for logical problems like a non-unique primary key when inserting a new record, or a non-existent primary key when updating a record.
Add as much additional information to the error as possible. In one of my projects I went this far: when downloading and parsing some feed from the web server failed, I added the full HTTP request (including headers and body) and the full HTTP response, along with the corresponding timestamps, to the error.
If possible, always signal errors using one and the same mechanism. In my recent project, some of my functions signaled an error by returning 0, other functions by returning -1, and yet another one accepted a pointer to the result code as an argument.
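Several of these recommendations can be sketched together in OCaml, with its variant types. This is one possible shape under my own assumptions, not a finished design: the persistence-layer types and the functions insert and should_retry are invented for the example, and the "database" is just a list of existing keys.

```ocaml
(* Additional information attached to the error, for whoever debugs it. *)
type request_info = {
  url : string;
  sent_at : float;               (* timestamp of the request *)
  response_status : int option;  (* None if no response arrived *)
}

(* Different kinds of errors, because the caller will probably handle
   them differently. No text messages: wording is up to the host. *)
type db_error =
  | Network_failure of request_info
  | Duplicate_key of { table : string; key : string }
  | Missing_key of { table : string; key : string }

(* One and the same signalling mechanism everywhere: a result value.
   The component neither logs nor retries on its own. *)
let insert ~table ~key existing : (unit, db_error) result =
  if List.mem key existing then Error (Duplicate_key { table; key })
  else Ok ()

(* A helper the host may use if it decides to retry; nothing is
   retried unless the host asks for it. *)
let should_retry = function
  | Network_failure _ -> true
  | Duplicate_key _ | Missing_key _ -> false
```

Note that a result return value does nudge the caller to look at the error, which is in tension with the first recommendation; for a throw-away prototype, an exception-raising variant of insert could be offered alongside.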

Recap: Designing errors is even more complicated than handling them.

Maxim Fridental
