Scaling the Right Thing

·1336 words·7 mins
Photograph By Kenny Eliason
Infrastructure From Scratch - This article is part of a series.
Part 8: This Article

More Pods, Same Problem

Our backend APIs were slowing down. Users were noticing. The response was immediate and instinctive — scale up. More instances, more pods, more capacity. We’d built this whole autoscaling setup for exactly this scenario, right?

The pods scaled. The response times didn’t improve. We now had more instances making the same slow queries in parallel, which actually made the database situation worse. More concurrent connections, same unoptimized queries, same bottleneck. We scaled the wrong layer.

This is the post I wish I'd read before that happened.

Three Layers, One Bill

Scaling isn’t one thing. It happens at three distinct layers, and each has wildly different costs, speeds, and tradeoffs:

| Layer          | What Scales                          | Speed            | Cost   |
|----------------|--------------------------------------|------------------|--------|
| Application    | Pod count or pod size                | Seconds          | Low    |
| Infrastructure | Node count or node size              | Minutes          | Medium |
| Database       | Instance size, replicas, connections | Minutes to weeks | High   |

The instinct is always to scale the easiest layer first — app pods are cheap and fast. But if the bottleneck is in the database, adding more app pods is like adding more lanes to a highway that ends in a single-lane bridge. You just moved the traffic jam.

Layer 1: Application (The Easy Button)

This is what HPA does. Traffic goes up, pods scale out. Traffic drops, pods scale back. For stateless services — APIs, web servers — this is the right first move if the pods themselves are the bottleneck.

We use dual-metric HPA — both CPU and memory thresholds:

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # scale out if average CPU > 70%
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80 # scale out if average memory > 80%

CPU-only thresholds bit us early on. Some of our services involve headless browser rendering — they barely touch CPU but consume a lot of memory. With only CPU-based scaling, those pods would get OOM-killed without HPA ever triggering. Adding the memory metric fixed that.

But not everything should scale horizontally. Our scheduler deployment runs as a single replica. Scaling it to 3 would mean 3 instances all trying to fire the same scheduled job. Same with the generator — it processes background tasks that shouldn’t be duplicated. For those, it’s one pod with a Pod Disruption Budget to keep it alive during cluster maintenance.
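A Pod Disruption Budget for a singleton like that is only a few lines of YAML. A sketch, with illustrative names — the selector just has to match the deployment's pod labels:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: scheduler-pdb        # hypothetical name
spec:
  minAvailable: 1            # with one replica: block voluntary evictions
  selector:
    matchLabels:
      app: scheduler         # must match the scheduler deployment's labels
```

With `minAvailable: 1` and a single replica, a node drain waits until the pod is safely rescheduled elsewhere instead of just killing it.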

Layer 2: Infrastructure (The Money Layer)

Pods need nodes. When HPA wants to add a pod but no node has capacity, the pod sits in Pending and nothing happens until the Cluster Autoscaler adds a node. This takes a few minutes (provisioning a VM isn’t instant), so there’s a lag between “we need more capacity” and “we have more capacity.”

The real cost conversation happens here. Nodes are VMs, and VMs cost money whether your pods use them or not. Two strategies keep this in check:

Spot VMs for anything that can tolerate interruption — batch jobs, CI runners, stateless workers. They're 60-91% cheaper than regular VMs, but Google can reclaim them with 30 seconds' notice, so your workload needs to handle that gracefully.
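On GKE, steering a tolerant workload onto Spot nodes is a node selector plus a toleration. A sketch, assuming a node pool created with Spot enabled and a matching taint:

```yaml
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"   # only schedule onto Spot nodes
      tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule                # tolerate the Spot pool's taint
```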

Right-sizing is the boring but effective one. We caught ourselves running n1-standard-4 nodes for workloads that would’ve been fine on e2-medium. The 2025 stat that 30% of enterprise cloud spending is addressable waste? Felt personally attacked by that one.

Layer 3: Database (The Real Bottleneck)

This is where our scaling story actually starts. The API slowdown wasn’t a CPU problem or a memory problem. It was a query problem.

Step 0: Optimize First (Free)

Before touching any infrastructure:

EXPLAIN ANALYZE SELECT * FROM transactions WHERE user_id = 12345;
-- Result: Seq Scan on transactions (cost=0.00..45123.00)
-- Translation: full table scan, no index, reading every row

One CREATE INDEX statement later:

-- Result: Index Scan using idx_transactions_user_id (cost=0.42..8.44)
-- Translation: 5000x fewer rows scanned
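For reference, the index implied by that plan would be something along these lines (column inferred from the WHERE clause above):

```sql
CREATE INDEX idx_transactions_user_id ON transactions (user_id);
```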

That single index did more for our response times than any amount of pod scaling would have. This is the lesson that sticks with me: “optimize before you scale” is advice everyone gives and nobody follows — until you see the numbers.

Other free wins: fixing N+1 queries (one service was making 200+ database calls per request instead of 2), adding Redis for frequently-read data, and running EXPLAIN ANALYZE on every slow query the Prometheus alerts flagged.
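The N+1 fix is usually just collapsing a per-row query loop into a single JOIN. A minimal sketch with sqlite3 standing in for the real database (table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# N+1: one query for users, then one query per user — 4 round trips here,
# 200+ on a real service with a big result set.
def totals_n_plus_one():
    users = conn.execute("SELECT id FROM users").fetchall()
    return {uid: conn.execute(
        "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?", (uid,)
    ).fetchone()[0] for (uid,) in users}

# Fixed: one JOIN, one round trip, regardless of how many users there are.
def totals_single_query():
    rows = conn.execute("""
        SELECT u.id, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """).fetchall()
    return dict(rows)

assert totals_n_plus_one() == totals_single_query()
```

Same answer either way; the difference only shows up in round-trip count, which is exactly what the per-request latency was paying for.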

Step 1: Vertical Scaling (Simple, Expensive)

After optimization, if the database CPU is still consistently above 80%, make it bigger. More cores, more RAM, faster disks. No app changes needed — everything just gets faster.

The downside: exponential cost curve. Doubling your Cloud SQL instance doesn’t double the price — it more than doubles it. And there’s a hard ceiling. The biggest instance available is the biggest instance available.

Step 2: Read Replicas (For Read-Heavy Workloads)

Most web apps are 80-95% reads. Instead of one database handling everything, route reads to replicas that stream changes from the primary:

App writes → Primary DB
App reads  → Read Replica (one of several)

This requires app-level changes (routing reads vs writes to different connections), but the payoff is significant. For context, OpenAI runs ChatGPT on a single PostgreSQL primary with ~50 read replicas. One primary, many readers.
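The app-level change boils down to a small router that picks a connection based on the statement type. A toy sketch with string placeholders standing in for real connection pools (class and method names are invented):

```python
import itertools

class ReadWriteRouter:
    """Send writes to the primary, spread reads round-robin across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # round-robin iterator

    def connection_for(self, sql: str):
        # Crude but common heuristic: anything that isn't a SELECT mutates.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary

router = ReadWriteRouter(primary="primary-db",
                         replicas=["replica-1", "replica-2"])
print(router.connection_for("SELECT * FROM users"))    # replica-1
print(router.connection_for("INSERT INTO users ..."))  # primary-db
print(router.connection_for("SELECT 1"))               # replica-2
```

A real implementation also has to deal with replication lag — a read issued right after a write may need to go to the primary to see its own data.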

Step 3: Connection Pooling (The Hidden Bottleneck)

Each database connection eats ~10MB of memory. When Kubernetes scales your app to 50 pods, each opening 10 connections, that’s 500 connections to your database. Most of them are idle most of the time, but the database doesn’t know that.

Connection pooling (PgBouncer, or Cloud SQL’s new managed pooling) sits between your app and the database, reusing connections instead of opening new ones. Connection setup drops from 50ms to 5ms. The database handles queries instead of handshakes.
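A minimal PgBouncer config sketch — hostnames and sizes here are placeholders, and transaction pooling is the usual choice for web workloads:

```ini
[databases]
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction   ; release the server connection after each transaction
max_client_conn = 1000    ; 50 pods x 10 connections fits comfortably
default_pool_size = 20    ; actual connections held open to Postgres
```

The asymmetry is the point: a thousand app-side connections multiplex onto a couple dozen real ones.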

Step 4: Sharding (Last Resort, Probably Never)

Splitting your database across multiple instances based on a key. I’m including it for completeness, but if you’re reading this blog, you probably don’t need it. Most applications never will. OpenAI scaled to 800 million users before needing to think about it seriously.
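For completeness, the core of sharding is just a deterministic key-to-instance mapping. A toy sketch (shard names invented) that ignores all the hard parts — resharding, cross-shard queries, transactions:

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(user_id: int) -> str:
    """Map a user to a shard via a stable hash (not Python's built-in
    hash(), which is randomized per process)."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key must always land on the same shard — that determinism is
# the whole contract, and why changing len(SHARDS) later is so painful.
assert shard_for(12345) == shard_for(12345)
```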

The Diagnostic Framework

When response times increase, resist the urge to immediately scale. Ask where the bottleneck actually is:

Response times increasing
  │
  ├─ App pods CPU/memory high, DB metrics fine
  │  → HPA (add pods) — this is the right time to scale app layer
  │
  ├─ Pods stuck in Pending
  │  → Cluster Autoscaler (add nodes) — infra bottleneck
  │
  ├─ App pods fine, DB CPU/connections high
  │  → Optimize queries → vertical scale → read replicas
  │
  └─ Everything looks fine on paper
     → Profile the app code. It's almost always an N+1 query.

This is why the monitoring stack matters. Without Prometheus metrics and Grafana dashboards, you’re guessing. With them, you see exactly which layer is saturated and can make an informed decision instead of an expensive one.
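The per-layer questions in the flowchart map onto queries you'd likely run, assuming standard cAdvisor, kube-state-metrics, and postgres_exporter metrics (the `namespace` filter is illustrative):

```promql
# App layer: per-pod CPU burn (cAdvisor)
sum(rate(container_cpu_usage_seconds_total{namespace="prod"}[5m])) by (pod)

# Infra layer: any pods stuck Pending? (kube-state-metrics)
sum(kube_pod_status_phase{phase="Pending"}) > 0

# DB layer: connection pressure by state (postgres_exporter)
sum(pg_stat_activity_count) by (state)
```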

The Over-Provisioning Dilemma

There’s a constant tension between running lean (cost-efficient, but fragile under load) and running generous (resilient, but burning money on idle resources). I’ve been on both sides.

Run too lean: traffic spikes, pods scale, but nodes take minutes to provision. Users experience slowdowns during the gap. You get paged. It’s 3 AM. You question your life choices.

Run too generous: everything is fine, always. Until someone looks at the cloud bill and asks why you’re paying for 10 nodes when average utilization is 30%.

The honest answer is there’s no perfect setting. Committed Use Discounts cover the baseline you always need. Spot VMs handle burst capacity cheaply. Autoscaling profiles tune how aggressively nodes scale down. And monitoring tells you whether your current settings are working or wasting money.

The Golden Rule
#

Always optimize before you scale. Always scale the cheapest layer first.

  1. Fix queries and indexes (free, immediate)
  2. Add app pods (cheap, seconds)
  3. Right-size infrastructure (medium cost, minutes)
  4. Scale database vertically (expensive, minutes)
  5. Add read replicas (expensive, requires app changes)
  6. Shard (very expensive, months of work)

Scaling is a diagnostic skill. Know where the bottleneck is before you spend money on it. We learned that the hard way, and now it’s the first thing I check.

Author: Aaron Yong
Building things for the web. Writing about development, Linux, cloud, and everything in between.
