
# Watching the Watchers

Infrastructure From Scratch - This article is part of a series.
Part 6: This Article

## The Silence Before the Storm

You’ve got a self-healing cluster with ArgoCD. Deployments happen through Git. Pods restart themselves when they crash. Everything is automated and beautiful. And then a service starts responding slowly. Memory creeps up. A pod gets OOM-killed, restarts, gets killed again. Kubernetes is technically self-healing — the pod keeps restarting — but the underlying problem isn’t going away. And you have no idea any of this is happening.

I wrote about logging before. Logs tell you what happened. Metrics tell you what’s happening right now. This post is about the latter.

## Prometheus: The Pull Model

Most monitoring tools work by having agents push data to a central server. Prometheus flips this — it pulls metrics by scraping HTTP endpoints on a schedule. Every 15 seconds, Prometheus hits the /metrics endpoint on every target and stores the time-series data.

Why pull instead of push? Because if a target goes down, the absence of data is itself a signal. You don’t need the target to tell you it’s dead — Prometheus notices that it can’t scrape it.
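This shows up concretely in the synthetic `up` metric: Prometheus records a 1 for every successful scrape and a 0 for every failed one, so a dead target is a one-line query away. A sketch, using the same `kubernetes-pods` job label as the alert rules later in this post:

```promql
# 1 if the last scrape of this target succeeded, 0 if it failed
up{job="kubernetes-pods"}

# Targets that failed every scrape over the last 5 minutes
max_over_time(up{job="kubernetes-pods"}[5m]) == 0
```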

## What Gets Scraped

In a Kubernetes cluster, Prometheus scrapes several layers:

| Component | What It Reports | Exporter |
|---|---|---|
| Nodes | CPU, memory, disk, network at OS level | node-exporter |
| Containers | CPU, memory, filesystem per container | cAdvisor (built into kubelet) |
| K8s objects | Deployment replicas, pod status, job completions | kube-state-metrics |
| API server | Request latency, etcd health | Built-in |
| Your apps | Whatever you expose on /metrics | Your code |

For your own apps, opting in takes a few pod annotations:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"   # Prometheus will find and scrape this pod
    prometheus.io/port: "8080"     # on this port
    prometheus.io/path: "/metrics" # at this endpoint
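```

Those annotations only work because the Prometheus scrape config translates them into relabeling rules. A sketch of the standard annotation-based setup (this matches the widely used community config; your job name may differ):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod # discover every pod via the Kubernetes API
    relabel_configs:
      # Only keep pods that set prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honor prometheus.io/path if set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```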

## The Metric Zoo

Prometheus has four metric types. You only really need to understand two to start:

Counters — only go up (or reset to 0 on restart). Total requests, total errors, bytes sent. Use rate() to get per-second values:

```promql
# Requests per second over the last 5 minutes
rate(http_requests_total[5m])
```

Gauges — go up and down. Current memory usage, active connections, temperature. Use directly:

```promql
# Current memory usage as a percentage
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100
```

(There are also histograms and summaries for measuring distributions like request latency, but counters and gauges cover 80% of what you’ll need.)
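If you do need latency percentiles, histograms pair with `histogram_quantile()`. A sketch, assuming your app exports a standard `http_request_duration_seconds` histogram:

```promql
# 95th-percentile request latency over the last 5 minutes
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```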

## PromQL: A Taste

PromQL is Prometheus’s query language. It looks weird at first, but it’s powerful:

```promql
# CPU usage per pod in a namespace
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Top 5 pods by memory usage
topk(5, container_memory_usage_bytes{namespace="production"})
```

You won’t need to write PromQL daily — most of it lives in dashboards and alert rules. But knowing the basics helps when you’re debugging at 2 AM and need an answer the dashboard doesn’t have.

## Alert Rules: The 36 Watchers

We have 36 alert rules. That sounds like a lot, but they fall into clear categories:

```yaml
# Pod has been down for 5 minutes
- alert: PodDown
  expr: up{job="kubernetes-pods"} == 0
  for: 5m
  labels:
    severity: critical

# Container restarting repeatedly (crash loop)
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: warning

# Memory usage above 85% of limit
- alert: HighMemoryUsage
  expr: container_memory_usage_bytes / container_spec_memory_limit_bytes * 100 > 85
  for: 5m
  labels:
    severity: warning
```
| Category | What It Watches | Severity |
|---|---|---|
| Pod health | Down, crash looping, restarting | Critical / Warning |
| Resources | Memory > 85%, CPU > 85% | Warning |
| Nodes | NotReady, memory pressure, disk pressure | Critical |
| Storage | Persistent volume < 15% free | Warning |
| Deployments | Desired replicas ≠ actual replicas | Warning |
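The deployment and storage rows translate into rules the same way as the examples above. Sketches, assuming the standard kube-state-metrics and kubelet metric names (thresholds and `for` durations are illustrative):

```yaml
# Deployment has fewer available replicas than desired
- alert: DeploymentReplicasMismatch
  expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
  for: 10m
  labels:
    severity: warning

# Persistent volume below 15% free space
- alert: PersistentVolumeLowSpace
  expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 15
  for: 5m
  labels:
    severity: warning
```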

## AlertManager: The Traffic Cop

Prometheus evaluates alert rules. AlertManager decides what to do about them. It handles routing (where to send alerts), grouping (combine related alerts), and inhibition (suppress noise).

The key concept is severity-based routing:

- Critical alerts → immediate webhook notification
- Warning alerts → standard notification channel
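In the AlertManager config, that routing is a tree of matchers. A sketch, with receiver names assumed:

```yaml
route:
  receiver: default-notifications # fallback for anything unmatched
  group_by: ["alertname", "cluster"]
  routes:
    - match:
        severity: critical
      receiver: critical-webhook # assumed receiver name
    - match:
        severity: warning
      receiver: default-notifications
```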

And the inhibition rule that keeps things sane: when a critical alert fires, matching warning alerts are suppressed. If a node is down (critical), you don’t also need 15 warnings about pods being unreachable on that node. You already know.

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["alertname", "cluster", "service"]
```

Without inhibition, alert fatigue sets in fast. You start ignoring notifications because there are too many, and eventually you miss the one that actually matters. Severity routing + inhibition keeps the signal-to-noise ratio manageable.

## The Resolution Timeout

One small but important config: `resolve_timeout: 5m`. When an alert stops firing, AlertManager waits 5 minutes before sending a “resolved” notification. This prevents flapping — a pod that’s on the edge of its memory limit might trigger and resolve repeatedly, and you don’t want a notification every time it bounces.
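In the AlertManager config, this lives in the global block:

```yaml
global:
  resolve_timeout: 5m # wait 5 minutes after an alert clears before sending "resolved"
```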

## Raw Numbers to Real Understanding

Prometheus tells you what’s happening. But staring at PromQL output in a terminal at 2 AM is nobody’s idea of a good time. You want dashboards — visual, color-coded, with big numbers that tell you at a glance whether things are fine or on fire. That’s Grafana, and that’s the next post.

Author: Aaron Yong
