
The Error Nobody Reads


“Error: null”

It’s 2 AM. PagerDuty wakes you up. You open the dashboard. The error message staring back at you is:

Error: null

That’s it. No stack trace. No request ID. No hint about which of your 14 microservices decided to ruin your night. Just null, the most honest and least helpful word in computing.

You spend 45 minutes grepping through logs across three services before you find the real cause: a downstream API changed a field from accountId to account_id, and some catch block three layers up swallowed the original exception and replaced it with… nothing. Someone wrote catch (e) {} and went to lunch. That someone was you, eight months ago.

We’ve all been that person. Here’s how to stop.

Catch Me If You Can (But Please Don’t Ignore Me)

The single most destructive error handling pattern is the catch-and-swallow. Every language makes it easy. Every developer has done it “just for now” and never come back.

try {
    riskyOperation();
} catch (Exception e) {
    // TODO: handle this later
}

fetchData().catch(() => {}); // the "it's fine" handler

begin
  risky_operation
rescue
  # shrug emoji
end

That empty catch block is a lie. It tells the calling code that everything succeeded. It tells the monitoring system that nothing happened. It tells future-you that past-you was either lazy or optimistic, and neither is a good look at 2 AM.

The fix is almost offensively simple: if you can’t handle it, don’t catch it. Let it propagate. Let it crash. A loud failure you can debug is infinitely better than a silent one you can’t.
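Here's a minimal sketch of what "catch only what you can handle" might look like in TypeScript. Everything in it is made up for illustration: CacheMissError, readFromCache, loadProfile, and the keys are hypothetical names, not from any real library.

```typescript
// Hypothetical error type for the one failure this code knows how to handle.
class CacheMissError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "CacheMissError";
  }
}

// Hypothetical lookup: throws CacheMissError for unknown keys,
// and a plain Error for anything unexpected.
function readFromCache(key: string): string {
  if (key === "corrupt") throw new Error("cache backend unreachable");
  if (key !== "user:42") throw new CacheMissError(`no entry for ${key}`);
  return "cached-profile";
}

function loadProfile(key: string): string {
  try {
    return readFromCache(key);
  } catch (e: unknown) {
    if (e instanceof CacheMissError) {
      // A miss has a real recovery path, so handling it here is honest.
      return "profile-from-database";
    }
    // Anything else is not ours to handle: let it propagate, loudly.
    throw e;
  }
}
```

The catch block has exactly one job. Anything it doesn't recognize goes back up the stack with its original message and stack trace intact.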

The Overeager Net

Slightly more insidious than catching and ignoring is catching everything. Java has a hierarchy of throwables, and most developers treat catch (Exception e) as a safety net. Some go further:

try {
    processOrder();
} catch (Throwable t) {
    // Catches OutOfMemoryError, StackOverflowError...
    // You know, the ones you definitely can't recover from
    log.error("oops", t);
}

Ruby has an equivalent footgun that’s bitten me personally. rescue without a class catches StandardError, which is usually fine. But rescue Exception catches everything — including SignalException, which means your graceful Ctrl+C gets swallowed and your process becomes immortal. Not in a good way.

begin
  long_running_job
rescue Exception => e
  # Congratulations, you just caught Ctrl+C
  # Your process now ignores SIGINT
  retry # And this makes it an infinite loop. Chef's kiss.
end

TypeScript took an interesting approach: since version 4.4, strict mode turns on useUnknownInCatchVariables, so the catch variable is typed as unknown, not any. This means you have to narrow it before you can do anything useful:

try {
  await fetchUser();
} catch (e: unknown) {
  if (e instanceof HttpError) {
    // Now you can safely access .status, .message
    logger.warn(`HTTP ${e.status}: ${e.message}`);
  } else {
    throw e; // Don't know what this is — rethrow
  }
}

It’s a small friction that prevents a lot of sins. The compiler is literally saying: “You don’t know what this is. Act accordingly.”
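If you want that narrowing in one place instead of in every catch block, a small helper can do it. This is only a sketch: the HttpError class below is a hypothetical stand-in for the one used above, and describeError is a name I made up.

```typescript
// Hypothetical HttpError, standing in for the class used in the example above.
class HttpError extends Error {
  readonly status: number;

  constructor(status: number, message: string) {
    super(message);
    this.name = "HttpError";
    this.status = status;
  }
}

// Turns an unknown thrown value into a loggable string by narrowing it first.
function describeError(e: unknown): string {
  if (e instanceof HttpError) return `HTTP ${e.status}: ${e.message}`;
  if (e instanceof Error) return `${e.name}: ${e.message}`;
  // JavaScript lets you throw anything, so strings and plain objects land here.
  return `non-Error value thrown: ${String(e)}`;
}
```

The last branch is the one people forget. `throw "boom"` is legal JavaScript, and unknown is the type system's way of reminding you.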

The Log That Cried Wolf

You caught the error. You didn’t swallow it. Great. But what did you log?

[ERROR] Something went wrong
[ERROR] Something went wrong
[ERROR] Something went wrong

I’ve seen production logs that are just this. Hundreds of lines. No stack trace, no request context, no correlation ID. Just a string that confirms something went wrong — which you already knew, because the alerts fired.

The difference between a useless log and a useful one is usually about 30 extra characters:

// This helps nobody
log.error("Payment failed");

// This saves you 45 minutes at 2 AM
log.error("Payment failed for order={} user={} provider={} trace={}",
    orderId, userId, provider, traceId, exception);

Structured logging (JSON instead of plaintext) and correlation IDs (a trace_id that follows a request across services) aren’t glamorous. Nobody writes blog posts about how exciting their log format is. But teams I’ve worked with that adopted structured logging with trace IDs cut their mean time to debug by more than half. The error still happens — you just find it in minutes instead of hours.
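For a rough idea of what a structured log line looks like, here's a sketch that builds one JSON object per event. The field names (ts, level, traceId, errorName and friends) and the formatLogLine helper are my assumptions, not a standard. Use whatever your log pipeline actually indexes.

```typescript
// Hypothetical context shape: a trace ID plus any flat key/value fields.
interface LogContext {
  traceId: string;
  [field: string]: string | number | boolean;
}

// Builds one JSON log line. `now` is injectable so the output is testable.
function formatLogLine(
  level: "info" | "warn" | "error",
  message: string,
  context: LogContext,
  error?: Error,
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ts: now.toISOString(),
    level,
    message,
    ...context,
    // Keep the stack: it is the part you will want at 2 AM.
    ...(error
      ? { errorName: error.name, errorMessage: error.message, stack: error.stack }
      : {}),
  });
}
```

Calling `formatLogLine("error", "Payment failed", { traceId, orderId, provider }, err)` gives you a line you can filter by any field, instead of a sentence you have to regex.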

Checked Out of Checked Exceptions

Java is the only mainstream language that tried checked exceptions — forcing you to either handle or declare every exception a method can throw. The idea was noble: make error handling explicit, make it impossible to ignore.

The result was throws Exception on everything, catch (Exception e) { throw new RuntimeException(e); } wrappers everywhere, and an industry-wide migration toward unchecked exceptions with global handlers. Spring’s @ExceptionHandler and @ControllerAdvice fit that world: errors propagate unchecked and get handled in one place, because the checked exception experiment produced more boilerplate than safety.

The lesson isn’t that checked exceptions were wrong in theory — it’s that forcing developers to handle errors at every call site produces worse error handling, not better. People take the path of least resistance, and the path of least resistance was wrapping everything in RuntimeException.

Go took a different bet with if err != nil at every call site, which has its own trade-offs. Rust went with Result<T, E> and the ? operator, which is probably the closest anyone’s gotten to making error handling both explicit and ergonomic. But that’s a whole separate post (foreshadowing…).

When Errors Cross the Wire

All of the above gets harder when your error originates in Service A, passes through Service B, and surfaces in Service C — which is where you happen to be looking. Distributed systems turn error handling from a coding problem into an observability problem.

A few patterns that have saved me:

Circuit breakers. If Service B is failing, stop calling it. Resilience4j in Java, Polly in .NET, or even a simple counter that trips after N failures. Better to return a degraded response than to cascade timeouts across your entire system.
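The "simple counter that trips after N failures" version might look something like the sketch below. It is not how Resilience4j or Polly implement it. It is synchronous to keep it short, the class and parameter names are hypothetical, and the clock is injected so the cooldown can be tested.

```typescript
// Thrown instead of calling the dependency while the breaker is open.
class CircuitOpenError extends Error {
  constructor() {
    super("circuit open: call skipped");
    this.name = "CircuitOpenError";
  }
}

class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAtMs: number | null = null;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(maxFailures: number, cooldownMs: number, now: () => number = Date.now) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  call<T>(operation: () => T): T {
    if (this.openedAtMs !== null && this.now() - this.openedAtMs < this.cooldownMs) {
      throw new CircuitOpenError(); // fail fast; the caller picks the degraded response
    }
    // Closed, or cooldown elapsed: let the call through (a trial call if we were open).
    try {
      const result = operation();
      this.consecutiveFailures = 0;
      this.openedAtMs = null;
      return result;
    } catch (e: unknown) {
      this.consecutiveFailures += 1;
      if (this.consecutiveFailures >= this.maxFailures) {
        this.openedAtMs = this.now(); // trip, or re-trip after a failed trial
      }
      throw e; // count the failure, but never hide it
    }
  }
}
```

Note that the breaker rethrows instead of returning a fallback itself. The caller catches CircuitOpenError and decides what "degraded" means, which keeps the original failure visible to whoever is logging.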

Dead letter queues. When an async message fails processing, don’t just retry forever. Put it somewhere you can inspect it later. The number of times I’ve seen a poison message bring down an entire queue consumer is… too many.
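In practice your message broker usually provides the dead letter queue, but the logic is easy to show in memory. The sketch below is exactly that, an in-memory illustration with hypothetical names (Message, DeadLetter, consume), not a real queue client.

```typescript
interface Message {
  id: string;
  body: string;
}

interface DeadLetter {
  message: Message;
  attempts: number;
  lastError: string;
}

// Processes each message with a bounded number of attempts.
// Messages that keep failing are parked, not retried forever.
function consume(
  messages: Message[],
  handler: (message: Message) => void,
  maxAttempts: number,
): DeadLetter[] {
  const deadLetters: DeadLetter[] = [];
  for (const message of messages) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        handler(message);
        break; // processed, move on to the next message
      } catch (e: unknown) {
        if (attempt === maxAttempts) {
          // Out of retries: keep enough context to debug it later,
          // then carry on so one poison message cannot block the rest.
          deadLetters.push({
            message,
            attempts: attempt,
            lastError: e instanceof Error ? e.message : String(e),
          });
        }
      }
    }
  }
  return deadLetters;
}
```

The important property is that the loop always moves on. The poison message ends up somewhere you can inspect it, along with how many times it was tried and why it failed.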

Timeout cascading. If Service A gives Service B a 5-second timeout, and Service B gives Service C a 5-second timeout, B has no time left to react when C is slow: A gives up at the same moment B does, so whatever fallback or error B had planned never arrives. Set your timeouts to decrease as you go deeper: if A allows 5 seconds total, B should get 3, and C should get 1. Otherwise a slow C makes A look broken.
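One way to get decreasing timeouts without hand-tuning every hop is to pass a budget down and have each service keep a reserve for itself. The helper below is a sketch of that arithmetic. The name childBudgetMs and the 2-second reserve are assumptions, picked to reproduce the 5/3/1 split above.

```typescript
// Returns the timeout to hand to the next hop: what is left of this service's
// budget after time already spent, minus a reserve kept for its own work
// (fallbacks, error responses, serialization).
function childBudgetMs(budgetMs: number, elapsedMs: number, reserveMs: number): number {
  return Math.max(0, budgetMs - elapsedMs - reserveMs); // 0 means: skip the call, fail fast
}

// A allows 5 s in total and keeps 2 s for itself.
const budgetForB = childBudgetMs(5000, 0, 2000); // 3000
// B received 3 s and keeps 2 s for itself.
const budgetForC = childBudgetMs(budgetForB, 0, 2000); // 1000
```

If B has already burned most of its budget before calling C, the result drops to zero and B can fail fast instead of starting a call it cannot finish in time.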

Correlation IDs. I mentioned these already, but they’re worth repeating: generate a trace_id at the edge, propagate it through every service call, include it in every log line. When something breaks, you grep for one ID and get the full story across every service. It’s the single highest-ROI observability investment you can make.
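The mechanics are small enough to sketch. The header name x-trace-id, the helper names, and the random-hex ID format below are all assumptions for illustration. Real systems often use UUIDs or a tracing library's own header.

```typescript
type HeaderMap = Record<string, string | undefined>;

const TRACE_HEADER = "x-trace-id"; // hypothetical header name

// Random hex is enough for a sketch; it is not meant to be unguessable.
function newTraceId(): string {
  let id = "";
  for (let i = 0; i < 16; i++) {
    id += Math.floor(Math.random() * 16).toString(16);
  }
  return id;
}

// At the edge: reuse the caller's ID if there is one, otherwise mint one.
function traceIdFor(incoming: HeaderMap, generate: () => string = newTraceId): string {
  const existing = incoming[TRACE_HEADER];
  return existing ? existing : generate();
}

// On every outgoing call: copy the headers and attach the same ID.
function withTrace(outgoing: HeaderMap, traceId: string): HeaderMap {
  return { ...outgoing, [TRACE_HEADER]: traceId };
}

// In every log line: lead with the ID so one grep tells the whole story.
function logLine(traceId: string, service: string, message: string): string {
  return `trace=${traceId} service=${service} ${message}`;
}
```

Only the first service ever generates an ID. Everyone downstream calls traceIdFor on the incoming headers, gets the same value back, and passes it along with withTrace.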

The Catch-All (Pun Intended)

Error handling is one of those things that separates code that works from code that works in production. It’s not hard to do well. It’s just easy to skip — especially when you’re focused on the happy path, which is every developer, always.

The rules are simple enough to fit on an index card: don’t swallow errors, don’t catch too broadly, log with context, and propagate trace IDs. If you’re doing those four things, you’re ahead of most codebases I’ve worked in.

More to come on the Rust and Go side of error handling — they each made fundamentally different bets, and the ergonomics are worth comparing. But that’s for next time.

Aaron Yong
Author
Building things for the web. Writing about development, Linux, cloud, and everything in between.
