Error Handling & Failure Modes
Failure is normal, not rare. A resilient system and a fragile one often hit faults at about the same rate. The difference is what the code does right after a fault. Design those moments on purpose.
Every operation that can fail forces a choice: recover, retry, escalate, or stop. Much production pain comes from code that never made that choice. It caught an error and carried on, leaving the system in a state nobody planned for. The rules below cover the three decisions that matter most: where you handle a failure, how you keep information about it, and what you show to the outside world.
Where to handle failure
- Do: Handle an error only where you can make a real decision about it. If a function cannot decide, let the error pass up.
- Do: Give one layer clear ownership of recovery. That layer decides whether to retry or fail, instead of every layer guessing.
- Consider: Splitting transient failures (network blips, lock contention → retry with backoff) from permanent ones (validation, auth → fail fast).
- Do not: Catch broad exceptions in low-level helpers just to "be safe". This takes away the caller's chance to respond correctly.
- Never: Retry a non-idempotent operation without a guard. A retried payment is a duplicate charge.
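One common guard for the last rule is an idempotency key: the caller sends the same key on every retry, and the operation runs at most once per key. The sketch below is a minimal, single-threaded illustration in C#. The names IdempotentCharger and idempotencyKey and the in-memory dictionary are assumptions made for this example; a real system would keep the keys in a durable, shared store.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical guard: remembers the receipt produced for each idempotency key.
// Not thread-safe and not durable; for illustration only.
public sealed class IdempotentCharger
{
    private readonly Dictionary<string, string> _receipts = new();
    private readonly Func<string, decimal, string> _charge;

    // charge stands in for the real payment call and returns a receipt id.
    public IdempotentCharger(Func<string, decimal, string> charge) => _charge = charge;

    // The same key always yields the same receipt and at most one real charge.
    public string Charge(string idempotencyKey, string cardToken, decimal amount)
    {
        if (_receipts.TryGetValue(idempotencyKey, out var existing))
        {
            return existing; // a retry of a completed charge: do not charge again
        }

        var receipt = _charge(cardToken, amount);
        _receipts[idempotencyKey] = receipt;
        return receipt;
    }
}
```

If the payment call throws, no receipt is recorded, so a later retry with the same key attempts the charge again. Whether that is safe depends on the gateway, which is why many payment APIs accept the key themselves.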
try { chargeCard(order); } catch (Exception) { /* ignore */ }
If the charge throws, the code carries on as though it had succeeded. The order is marked paid, but no charge happened, and there is no log, alert, or trace. This one pattern causes a large share of "it just silently didn't work" incidents.
try { chargeCard(order); }
catch (TransientGatewayError e) { enqueueRetry(order, e); }
catch (CardDeclinedError e) { markUnpaid(order); notify(order, e); throw; }
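Transient and permanent failures are handled differently: a transient gateway error is queued for retry, while a declined card marks the order unpaid, notifies, and rethrows. The order never ends up in a wrong state, and the declined-card error still passes up unchanged, so callers see the original failure.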
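The retry side of that split can be sketched as a small helper that retries only failures marked transient and lets everything else fail fast. This is a minimal C# sketch under stated assumptions: TransientException and Retry.WithBackoff are hypothetical names, the delay doubles on each attempt, and jitter is left out for brevity.

```csharp
using System;
using System.Threading;

// Hypothetical marker type for failures that are worth retrying.
public class TransientException : Exception
{
    public TransientException(string message) : base(message) { }
}

public static class Retry
{
    // Runs operation up to maxAttempts times, retrying only TransientException.
    // Any other exception is treated as permanent and propagates immediately.
    public static T WithBackoff<T>(Func<T> operation, int maxAttempts, TimeSpan baseDelay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (TransientException) when (attempt < maxAttempts)
            {
                // Exponential backoff: baseDelay, then 2x, 4x, ...
                var factor = Math.Pow(2, attempt - 1);
                Thread.Sleep(TimeSpan.FromMilliseconds(baseDelay.TotalMilliseconds * factor));
            }
        }
    }
}
```

On the last attempt the exception filter is false, so the final TransientException reaches the caller with its stack trace intact. The operation passed in still has to be safe to repeat.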
Preserving information
- Always: Keep the original cause when you wrap an error. Add context; never throw the cause away.
- Do: Make every failure visible (structured logs, metrics, or traces) so you can find it without a debugger.
- Consider: Defining a function's failure contract (what it throws or returns on error) as carefully as its success type.
- Do not: Use errors or exceptions for normal control flow. It hides intent and, in many runtimes, is slow.
- Never: Swallow an error silently. If ignoring it is truly correct, say so clearly and write down why.
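As an illustration of the rule about control flow, an expected outcome such as invalid user input can be reported through the return value instead of a caught exception. The sketch below uses the standard int.TryParse method; the Quantity class and TryRead method are hypothetical names chosen for this example.

```csharp
using System;

public static class Quantity
{
    // Invalid input is an expected case here, so it is reported through the
    // return value rather than by throwing and catching FormatException.
    public static bool TryRead(string text, out int quantity)
    {
        if (int.TryParse(text, out var parsed) && parsed > 0)
        {
            quantity = parsed;
            return true;
        }

        quantity = 0; // callers must check the return value before using this
        return false;
    }
}
```

The signature also serves as the failure contract: a caller can see from the bool return and the out parameter that failure is possible and how it is reported.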
catch (IOException e) { throw new AppError("save failed"); }
The real reason (disk full? permission denied?) is gone. The new error's stack trace starts at this throw, so the place where the failure actually happened is lost as well.
catch (IOException e) { throw new AppError("save failed", e); }
The original IOException is kept as the inner cause, so the real reason still shows up in logs and traces.
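For completeness, here is a minimal C# sketch of what the wrapper above could look like end to end. The AppError definition, the Storage.Save helper, and the documentId parameter are assumptions made for this example, since the original does not show them.

```csharp
using System;
using System.IO;

// Hypothetical definition of the AppError used above.
public class AppError : Exception
{
    // Passing inner to the base constructor stores it as InnerException.
    public AppError(string message, Exception inner) : base(message, inner) { }
}

public static class Storage
{
    // write stands in for the real I/O call.
    public static void Save(string documentId, Action write)
    {
        try
        {
            write();
        }
        catch (IOException e)
        {
            // Add context (which document failed) and keep e as the cause.
            throw new AppError($"save failed for document {documentId}", e);
        }
    }
}
```

A handler further up can log the wrapper with ToString(), which includes the inner exception and its stack trace, or inspect InnerException directly to tell a full disk from a permissions problem.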
What you expose
- Do: Return stable, useful error shapes to callers (a code plus a safe message), not raw internals.
- Consider: Designing for graceful degradation. A reduced but working response is better than a hard failure, where the domain allows it.
- Never: Leak stack traces, SQL, secrets, or file paths to an external caller.
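One way to apply these rules is a single mapping step at the system boundary that logs the full exception internally and returns only a stable code and a safe message. The C# sketch below is an assumption-laden illustration: ApiError, ErrorMapper, the error codes, and the message texts are all hypothetical, and the log callback stands in for whatever logging the system already uses.

```csharp
using System;

// Hypothetical public error shape: a stable code plus a message that is safe to show.
public sealed record ApiError(string Code, string Message);

public static class ErrorMapper
{
    // log receives the full exception for internal diagnostics only.
    public static ApiError ToApiError(Exception ex, Action<Exception> log)
    {
        log(ex); // stack trace and inner causes stay on the server side

        // Nothing from ex.Message or ex.StackTrace is copied into the response.
        return ex switch
        {
            ArgumentException => new ApiError("invalid_request", "The request was not valid."),
            UnauthorizedAccessException => new ApiError("forbidden", "You are not allowed to do that."),
            TimeoutException => new ApiError("unavailable", "The service is busy. Please try again."),
            _ => new ApiError("internal_error", "Something went wrong on our side."),
        };
    }
}
```

Because unknown exception types fall through to a generic internal_error, a new failure mode added later cannot leak its details by accident.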
Self-review checklist
- Ask: If this fails, who finds out, and how?
- Ask: Is the system in a valid state on every failure path, not just the happy path?
- Ask: Could this operation be retried safely, and if so, is it?
- Ask: Does anything I show on error reveal more than the caller should see?