The Lies We Tell Ourselves


Eight Assumptions That Are All Wrong

In 1994, Peter Deutsch at Sun Microsystems wrote down a set of assumptions that developers make about distributed systems, all of them wrong (Deutsch listed seven; James Gosling added the eighth in 1997). They became known as the Fallacies of Distributed Computing. Thirty years later, every one of them still shows up in production.

If you’ve followed this series from Docker through Kubernetes to ArgoCD and monitoring, you’ve built a distributed system. And every component we set up exists, in part, because one or more of these assumptions is false.

The Fallacies

1. The Network Is Reliable

It’s not. Packets get lost. Connections drop. DNS goes down. That cloud region you depend on? It had an outage last month.

Where you’ve seen this: ArgoCD’s retry policy (duration: 5s, factor: 2, maxDuration: 3m). Five retries with exponential backoff exist because the sync between Git and the cluster sometimes fails due to network blips. Not bugs. Not bad config. Just unreliable networks.
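In an ArgoCD Application manifest, that policy lives under syncPolicy.retry. A sketch with the values above (the limit of 5 matches the five retries):

```yaml
syncPolicy:
  automated: {}
  retry:
    limit: 5            # give up after five attempts
    backoff:
      duration: 5s      # first retry waits 5 seconds
      factor: 2         # each wait doubles: 5s, 10s, 20s, 40s, 80s
      maxDuration: 3m   # but never wait longer than 3 minutes
```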

2. Latency Is Zero

The call from your API pod to the database takes 2ms inside the same cluster. The call to a third-party API takes 200ms. The call to a service in another region takes 50ms on a good day and 500ms on a bad one.

Where you’ve seen this: health check timing. Liveness probes have a 30-second initial delay because the app needs time to start, connect to the database, and warm up. Readiness probes have a 10-second delay because you want to serve traffic as soon as possible but not before the service is actually ready. These delays exist because connections aren’t instant.
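In a Deployment manifest, those delays look roughly like this. The endpoint paths and port here are illustrative, not the series’ actual values:

```yaml
livenessProbe:
  httpGet:
    path: /healthz            # illustrative endpoint
    port: 3000
  initialDelaySeconds: 30     # wait for startup, DB connection, warm-up
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready              # illustrative endpoint
    port: 3000
  initialDelaySeconds: 10     # serve traffic early, but not too early
  periodSeconds: 5
```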

3. Bandwidth Is Infinite

Every Prometheus scrape is data over the network. Every log line shipped to Loki. Every container image pulled from the registry. At scale, this adds up. We set scrape intervals to 15 seconds — not because faster wouldn’t be useful, but because scraping 50 targets every second would saturate the network.
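A Prometheus config that encodes that tradeoff might look like this sketch (the job name is illustrative):

```yaml
global:
  scrape_interval: 15s            # every target, every 15 seconds
scrape_configs:
  - job_name: "kubernetes-pods"   # illustrative job name
    scrape_interval: 15s          # can be overridden per job if needed
```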

4. The Network Is Secure

It’s not, which is why we have Cloudflare in front of everything, TLS on all connections, the Cloud SQL Proxy using IAM authentication instead of passwords, and Authentik enforcing 2FA. Every layer assumes the network is hostile.

5. Topology Doesn’t Change

Pods come and go. Nodes get added and removed. The Cluster Autoscaler adds nodes under load and removes them when idle. ArgoCD continuously reconciles the cluster state. Kubernetes Services abstract away pod IPs that change with every restart. Nothing in a Kubernetes cluster has a fixed address.
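The Service abstraction is worth seeing concretely: it targets pods by label, never by IP, so it keeps working no matter how often pods restart. Names and ports in this sketch are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api                # illustrative name
spec:
  selector:
    app: api               # matches pods by label, not by IP
  ports:
    - port: 80             # stable address for clients
      targetPort: 3000     # wherever the pod happens to listen
```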

6. There Is One Administrator

Different teams manage different parts. The platform team manages the cluster. The app team manages deployments. Google manages the GKE control plane. Cloudflare manages the CDN. Your monitoring alerts go to different channels based on severity because different people own different systems.

7. Transport Cost Is Zero

Cloud egress costs money. Every byte leaving your GCP project is billed. Cross-region traffic costs more than same-region. This is why our container registry is in asia-southeast1 — same region as the cluster. Pulling images cross-region would be slower and more expensive.

8. The Network Is Homogeneous

Your cluster runs Linux pods, while some clusters mix in Windows containers. The load balancer speaks HTTP/2 to clients and HTTP/1.1 to backends. Prometheus scrapes over HTTP, AlertManager sends webhooks, and the Cloud SQL Proxy uses a proprietary Google protocol. Nothing uses the same protocol end to end.

The Three Concepts That Make It Work

Despite all these fallacies, distributed systems work. Three concepts make that possible.

The CAP Theorem (It’s Not “Pick Two”)

You’ve probably seen the Venn diagram: Consistency, Availability, Partition Tolerance — pick two. I certainly had. Turns out, even Eric Brewer — the person who formulated the theorem — says the “pick 2 out of 3” framing is misleading.

Here’s what’s actually going on. Partitions (network splits between nodes) are rare. Most of the time, your system is not partitioned, and you can have both consistency and availability. The tradeoff between C and A only kicks in during a partition — and partitions are temporary.

More importantly:

  • The choice is granular, not global. Different parts of your system can make different tradeoffs. Your payment service can prioritize consistency while your recommendation engine prioritizes availability. You can even choose per-operation or per-user.
  • All three properties are a spectrum, not binary. “Consistent” doesn’t mean “perfectly consistent everywhere instantly.” There’s strong consistency, eventual consistency, read-your-writes, causal consistency — each with different performance costs.
  • The real tension is consistency vs latency, not consistency vs availability. Achieving strong consistency requires waiting for all nodes to agree, which takes time.

Brewer’s actual recommendation: don’t accept the tradeoff as permanent. Detect the partition, limit operations during it, and recover once it heals — compensating for any inconsistencies that occurred. Design for the common case (no partition = full C and A), and handle the rare case (partition = graceful degradation) explicitly.

In practice: our PostgreSQL database prioritizes consistency — during a network issue, writes will fail rather than risk divergent data. Our Redis cache prioritizes availability — serving slightly stale data is fine, going down is not. But neither permanently sacrifices the other — it’s about what happens during the (rare) moments when you can’t have both.

Eventual Consistency (It’ll Get There)

Not everything needs strong consistency. When ArgoCD syncs a deployment, there’s a window where some pods run the old version and some run the new. When the CDN cache serves a response, it might be 30 seconds behind the database. When a read replica streams changes from the primary, there are milliseconds of lag.

All of these converge to consistency eventually. The question is whether your application can tolerate the window. For a blog? Absolutely. For a bank balance? Probably not.

Idempotency (Run It Twice, Same Result)

In a world where requests get retried, operations need to be safe to repeat.

# Idempotent — safe to run twice
def perform(url_id)
  url = Url.find_by(id: url_id)
  return unless url            # deleted? skip
  return if url.title.present? # already done? skip
  title = UrlMetadataService.call(url.target_url)
  url.update!(title: title) if title.present?
end

This is from the background jobs pattern. If the job runs twice (because the queue delivered it twice, or the first execution timed out and it got retried), the second run does nothing. No duplicate side effects.

For API endpoints, idempotency keys solve the same problem. The client sends a unique key with each request. The server checks: “Have I seen this key before? Yes → return the original response. No → process the request.”
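A minimal sketch of that server-side check, using an in-memory hash as the key store. A real service would persist keys in Redis or a database row with a TTL; the class and variable names here are illustrative:

```ruby
# Run an operation at most once per idempotency key; replay the
# cached response for any retry that carries the same key.
class IdempotencyStore
  def initialize
    @responses = {}   # key => cached response (sketch: in-memory only)
  end

  def run(key)
    return @responses[key] if @responses.key?(key)   # seen before? replay
    @responses[key] = yield                          # first time? process
  end
end

store   = IdempotencyStore.new
charges = 0

first    = store.run("req-abc123") { charges += 1; "charged $10" }
replayed = store.run("req-abc123") { charges += 1; "charged $10" }
# The retry returns the original response; the charge happens once.
```

The key insight is that the block (the real work) only runs on a cache miss, so a duplicate delivery or client retry can never produce a second side effect.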

Everything Connects

Each fallacy maps to something we built to handle it:

  • Network unreliable: ArgoCD retries, health checks, circuit breakers
  • Latency not zero: readiness probes, timeouts on every call
  • Bandwidth finite: 15s scrape intervals, regional registries
  • Network insecure: TLS everywhere, Cloud SQL Proxy, 2FA
  • Topology changes: K8s Services, Cluster Autoscaler, ArgoCD
  • Multiple admins: severity-based alert routing, RBAC
  • Transport has cost: same-region registries, efficient image layers
  • Network not homogeneous: protocol-aware load balancing, sidecars

This is why the infrastructure series exists. Each component — Docker, Kubernetes, GKE, ArgoCD, Prometheus, Grafana — solves a specific distributed systems problem. The bouncer handles rate limiting. The traffic cop handles load balancing. The monitoring stack handles observability. The GitOps pipeline handles deployment consistency.

None of it eliminates the fallacies. All of it manages them.

Aaron Yong