
# Watching the Watchers

Infrastructure From Scratch - This article is part of a series.
Part 6: This Article

## The Silence Before the Storm

You’ve got a self-healing cluster with ArgoCD. Deployments happen through Git. Pods restart themselves when they crash. Everything is automated and beautiful. And then a service starts responding slowly. Memory creeps up. A pod gets OOM-killed, restarts, gets killed again. Kubernetes is technically self-healing — the pod keeps restarting — but the underlying problem isn’t going away. And you have no idea any of this is happening.

I wrote about logging before. Logs tell you what happened. Metrics tell you what’s happening right now. This post is about the latter.

## Prometheus: The Pull Model

Most monitoring tools work by having agents push data to a central server. Prometheus flips this — it pulls metrics by scraping HTTP endpoints on a schedule. Every 15 seconds, Prometheus hits the /metrics endpoint on every target and stores the time-series data.

Why pull instead of push? Because if a target goes down, the absence of data is itself a signal. You don’t need the target to tell you it’s dead — Prometheus notices that it can’t scrape it.
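This shows up concretely in the synthetic `up` metric: Prometheus records a 1 for every successful scrape and a 0 for every failed one, so a dead target is a one-line query away. A sketch, using the same `kubernetes-pods` job label as the alert rules later in this post:

```promql
# 1 if the last scrape of this target succeeded, 0 if it failed
up{job="kubernetes-pods"}

# Targets that failed every scrape over the last 5 minutes
max_over_time(up{job="kubernetes-pods"}[5m]) == 0
```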

## What Gets Scraped

In a Kubernetes cluster, Prometheus scrapes several layers:

| Component | What It Reports | Exporter |
|---|---|---|
| Nodes | CPU, memory, disk, network at OS level | node-exporter |
| Containers | CPU, memory, filesystem per container | cAdvisor (built into kubelet) |
| K8s objects | Deployment replicas, pod status, job completions | kube-state-metrics |
| API server | Request latency, etcd health | Built-in |
| Your apps | Whatever you expose on /metrics | Your code |

For your own apps, opting in takes a few pod annotations:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"   # Prometheus will find and scrape this pod
    prometheus.io/port: "8080"     # on this port
    prometheus.io/path: "/metrics" # at this endpoint
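```

Those annotations only work because the Prometheus scrape config translates them into relabeling rules. A sketch of the standard annotation-based setup (this matches the widely used community config; your job name may differ):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod # discover every pod via the Kubernetes API
    relabel_configs:
      # Only keep pods that set prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honor prometheus.io/path if set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```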

## The Metric Zoo

Prometheus has four metric types. You only really need to understand two to start:

Counters — only go up (or reset to 0 on restart). Total requests, total errors, bytes sent. Use rate() to get per-second values:

```promql
# Requests per second over the last 5 minutes
rate(http_requests_total[5m])
```

Gauges — go up and down. Current memory usage, active connections, temperature. Use directly:

```promql
# Current memory usage as a percentage
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100
```

(There are also histograms and summaries for measuring distributions like request latency, but counters and gauges cover 80% of what you’ll need.)
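If you do need latency percentiles, histograms pair with `histogram_quantile()`. A sketch, assuming your app exports a standard `http_request_duration_seconds` histogram:

```promql
# 95th-percentile request latency over the last 5 minutes
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```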

## PromQL: A Taste

PromQL is Prometheus’s query language. It looks weird at first, but it’s powerful:

```promql
# CPU usage per pod in a namespace
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Top 5 pods by memory usage
topk(5, container_memory_usage_bytes{namespace="production"})
```

You won’t need to write PromQL daily — most of it lives in dashboards and alert rules. But knowing the basics helps when you’re debugging at 2 AM and need an answer the dashboard doesn’t have.

## Alert Rules: The 36 Watchers

We have 36 alert rules. That sounds like a lot, but they fall into clear categories:

```yaml
# Pod has been down for 5 minutes
- alert: PodDown
  expr: up{job="kubernetes-pods"} == 0
  for: 5m
  labels:
    severity: critical

# Container restarting repeatedly (crash loop)
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: warning

# Memory usage above 85% of limit
- alert: HighMemoryUsage
  expr: container_memory_usage_bytes / container_spec_memory_limit_bytes * 100 > 85
  for: 5m
  labels:
    severity: warning
```
| Category | What It Watches | Severity |
|---|---|---|
| Pod health | Down, crash looping, restarting | Critical / Warning |
| Resources | Memory > 85%, CPU > 85% | Warning |
| Nodes | NotReady, memory pressure, disk pressure | Critical |
| Storage | Persistent volume < 15% free | Warning |
| Deployments | Desired replicas ≠ actual replicas | Warning |
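The deployment and storage rows translate into rules the same way as the examples above. Sketches, assuming the standard kube-state-metrics and kubelet metric names (thresholds and `for` durations are illustrative):

```yaml
# Deployment has fewer available replicas than desired
- alert: DeploymentReplicasMismatch
  expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
  for: 10m
  labels:
    severity: warning

# Persistent volume below 15% free space
- alert: PersistentVolumeLowSpace
  expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 15
  for: 5m
  labels:
    severity: warning
```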

## AlertManager: The Traffic Cop

Prometheus evaluates alert rules. AlertManager decides what to do about them. It handles routing (where to send alerts), grouping (combine related alerts), and inhibition (suppress noise).

The key concept is severity-based routing:

- Critical alerts → immediate webhook notification
- Warning alerts → standard notification channel
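In the AlertManager config, that routing is a tree of matchers. A sketch, with receiver names assumed:

```yaml
route:
  receiver: default-notifications # fallback for anything unmatched
  group_by: ["alertname", "cluster"]
  routes:
    - match:
        severity: critical
      receiver: critical-webhook # assumed receiver name
    - match:
        severity: warning
      receiver: default-notifications
```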

And the inhibition rule that keeps things sane: when a critical alert fires, matching warning alerts are suppressed. If a node is down (critical), you don’t also need 15 warnings about pods being unreachable on that node. You already know.

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["alertname", "cluster", "service"]
```

Without inhibition, alert fatigue sets in fast. You start ignoring notifications because there are too many, and eventually you miss the one that actually matters. Severity routing + inhibition keeps the signal-to-noise ratio manageable.

## The Resolution Timeout

One small but important config: `resolve_timeout: 5m`. When an alert stops firing, AlertManager waits 5 minutes before sending a “resolved” notification. This prevents flapping — a pod that’s on the edge of its memory limit might trigger and resolve repeatedly, and you don’t want a notification every time it bounces.
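In the AlertManager config, this lives in the global block:

```yaml
global:
  resolve_timeout: 5m # wait 5 minutes after an alert clears before sending "resolved"
```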

## Raw Numbers to Real Understanding

Prometheus tells you what’s happening. But staring at PromQL output in a terminal at 2 AM is nobody’s idea of a good time. You want dashboards — visual, color-coded, with big numbers that tell you at a glance whether things are fine or on fire. That’s Grafana, and that’s the next post.

Author: Aaron Yong
