A Wall of Numbers#
Prometheus is collecting everything. CPU usage, memory, request rates, error counts, latency distributions — thousands of time-series data points, updating every 15 seconds. And all of it is accessible through PromQL, which is great for precise queries but terrible for getting a quick read on whether your cluster is healthy.
What you actually want at 2 AM (and yes, this is a theme) is a screen with big green numbers that tells you everything is fine. Or big red numbers that tell you exactly where the fire is. That’s Grafana.
The Dashboard Anatomy#
Grafana doesn’t store metrics — it connects to Prometheus (or any other data source) and visualizes the data. A dashboard is a page of panels, organized in rows:
```
┌──────────────────────────────────────────────────┐
│ Overview (stat panels — big numbers)             │
│ [Pods: OK] [CPU: 45%] [Mem: 62%] [Errors: 0.3%]  │
├──────────────────────────────────────────────────┤
│ Resource Usage (time-series graphs)              │
│ [CPU by Pod ~~~~]       [Memory by Pod ~~~~]     │
├──────────────────────────────────────────────────┤
│ Request Metrics (RED)                            │
│ [Request Rate]       [Error Rate + Latency]      │
├──────────────────────────────────────────────────┤
│ Active Alerts (table)                            │
│ [alert name | severity | duration | pod]         │
└──────────────────────────────────────────────────┘
```
The most important information goes at the top. Stat panels with color-coded thresholds (green below 70%, yellow 70–85%, red above 85%) give you health at a glance. Details live below in time-series graphs for when you need to dig deeper.
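Those thresholds live in each panel's definition in the dashboard JSON. A trimmed sketch of a stat panel (field names follow Grafana's dashboard schema; the title and values are illustrative):

```json
{
  "type": "stat",
  "title": "Mem",
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 70 },
          { "color": "red", "value": 85 }
        ]
      }
    }
  }
}
```

The `null` value marks the base step — everything below the first threshold is green.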
The RED Method (For Services)#
When monitoring services (APIs, microservices), the RED method gives you three metrics that answer the question “is this service healthy?”:
| Metric | What It Tells You | PromQL |
|---|---|---|
| Rate | How many requests per second | `rate(http_requests_total[5m])` |
| Errors | What percentage are failing | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100` |
| Duration | How fast are responses (p95) | `histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))` |
If the rate drops suddenly, something might be blocking requests. If errors spike, something is broken. If duration increases, something is slow. Three panels tell you everything about a service’s health.
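On a dashboard that covers several services, you'd typically aggregate these queries by a label rather than plotting every raw series. A sketch of the error-rate panel, assuming your app exports `http_requests_total` with a `service` label (the label name is an assumption, not something from this series):

```promql
# Error percentage per service over the last 5 minutes
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m])) * 100
```

The `sum by (service)` collapses per-pod and per-path series into one line per service, which is what you want on an overview panel.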
For infrastructure (nodes, containers), there’s the USE method — Utilization, Saturation, Errors — but RED covers what you’ll check most often.
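If you do add a USE-style infrastructure panel, node CPU utilization is one query away. A sketch, assuming node_exporter's `node_cpu_seconds_total` is being scraped:

```promql
# Utilization: fraction of time each node's CPUs spend non-idle
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```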
The Magic of Variables#
A `$namespace` dropdown at the top of your dashboard makes one dashboard work for every environment:
```promql
# Without variables — hardcoded, one dashboard per environment
container_memory_usage_bytes{namespace="production"}

# With variables — one dashboard, dropdown selects environment
container_memory_usage_bytes{namespace="$namespace"}
```
You define the variable with a query — using Grafana's `label_values` helper, which isn't strictly PromQL but runs against the Prometheus data source — that returns all available namespaces:

```promql
label_values(namespace)
```
Grafana renders it as a dropdown. Select “production” and every panel filters. Select “staging” and the same dashboard shows staging data. One dashboard to rule them all.
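In the dashboard JSON, that variable lives in the `templating` section. A trimmed sketch (keys follow Grafana's dashboard schema; `refresh: 2` re-runs the query when the time range changes):

```json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "label": "Namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(namespace)",
        "refresh": 2
      }
    ]
  }
}
```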
Provisioning: GitOps for Monitoring#
Here’s where it gets interesting. Instead of manually configuring Grafana through the UI (datasources, dashboards, plugins), you define everything as config files:
```yaml
# provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-service:9090   # internal K8s service
    isDefault: true
```

```yaml
# provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: "default"
    type: file
    options:
      path: /var/lib/grafana/dashboards   # JSON dashboard files here
```
Build a dashboard in the UI, export it as JSON, commit it to Git, and it auto-loads on Grafana startup. Delete your Grafana pod? It comes back with all your dashboards intact. New cluster? Same dashboards, no manual setup. It’s GitOps for your monitoring stack.
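On Kubernetes, the usual way to get those files into the pod is ConfigMaps mounted at Grafana's provisioning paths. A sketch, assuming ConfigMaps named `grafana-datasources` and `grafana-dashboards` exist (names and image tag are illustrative, not from the series):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  selector:
    matchLabels: { app: grafana }
  template:
    metadata:
      labels: { app: grafana }
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.4.0
          volumeMounts:
            # Grafana scans this path for datasource provisioning files
            - name: datasources
              mountPath: /etc/grafana/provisioning/datasources
            # The dashboard provider's `path` option points here
            - name: dashboards
              mountPath: /var/lib/grafana/dashboards
      volumes:
        - name: datasources
          configMap: { name: grafana-datasources }
        - name: dashboards
          configMap: { name: grafana-dashboards }
```

Update the ConfigMap in Git, let ArgoCD sync it, and the dashboards follow — no clicking required.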
Grafana Alerting vs AlertManager#
Grafana has its own alerting system. So does AlertManager (which we set up with Prometheus). Which do you use?
| | Grafana Alerting | AlertManager |
|---|---|---|
| Config | UI-driven (click to create) | YAML files (config-as-code) |
| Grouping | Basic | Advanced (grouping + inhibition) |
| Severity routing | Basic | Full routing tree |
| Best for | Ad-hoc, team-specific alerts | Production alerts, GitOps workflows |
We use AlertManager for production alerts because it’s config-as-code (fits our GitOps approach) and has proper inhibition rules (critical suppresses warning). Grafana alerting is there for quick, ad-hoc alerts that don’t need the full routing tree.
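That "critical suppresses warning" behavior is a few lines of AlertManager config. A sketch, assuming alerts carry `severity` and `namespace` labels (matcher syntax requires AlertManager ≥ 0.22):

```yaml
# alertmanager.yml — when a critical alert fires, mute the matching warning
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["alertname", "namespace"]
```

The `equal` list scopes the suppression: only warnings for the same alert in the same namespace get muted, so an unrelated warning elsewhere still pages you.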
The Full Stack#
Let’s zoom out and see what we’ve built across this series:
```
Code → GitHub Actions → Docker Image → Container Registry
          ↓
ArgoCD Image Updater detects new tag
          ↓
ArgoCD syncs cluster from Git
          ↓
GKE runs the pods
          ↓
Prometheus scrapes metrics
          ↓
AlertManager routes alerts
          ↓
Grafana visualizes everything
```
We went from Docker on a laptop to a CI/CD pipeline to Kubernetes on GKE with GitOps deployments and a full monitoring stack. Each piece solves a specific problem, and together they form an infrastructure that deploys itself, heals itself, and tells you when something is wrong.
Is it overkill for a personal project? Absolutely. Did I learn more building it than any course or tutorial could teach? Also absolutely.
