Making Sense of Metrics

·753 words·4 mins
Infrastructure From Scratch - This article is part of a series.
Part 7: This Article

A Wall of Numbers

Prometheus is collecting everything. CPU usage, memory, request rates, error counts, latency distributions — thousands of time-series data points, updating every 15 seconds. And all of it is accessible through PromQL, which is great for precise queries but terrible for getting a quick read on whether your cluster is healthy.

What you actually want at 2 AM (and yes, this is a theme) is a screen of big green numbers telling you everything is fine. Or big red numbers telling you exactly where the fire is. That’s Grafana.

The Dashboard Anatomy

Grafana doesn’t store metrics — it connects to Prometheus (or any other data source) and visualizes the data. A dashboard is a page of panels, organized in rows:

┌──────────────────────────────────────────────────┐
│  Overview (stat panels — big numbers)            │
│  [Pods: OK] [CPU: 45%] [Mem: 62%] [Errors: 0.3%] │
├──────────────────────────────────────────────────┤
│  Resource Usage (time-series graphs)             │
│  [CPU by Pod ~~~~]  [Memory by Pod ~~~~]         │
├──────────────────────────────────────────────────┤
│  Request Metrics (RED)                           │
│  [Request Rate]  [Error Rate + Latency]          │
├──────────────────────────────────────────────────┤
│  Active Alerts (table)                           │
│  [alert name | severity | duration | pod]        │
└──────────────────────────────────────────────────┘

The most important information goes at the top. Stat panels with color-coded thresholds (green below 70%, yellow 70–85%, red above 85%) give you instant health at a glance. Details live below in time-series graphs for when you need to dig deeper.
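Concretely, those thresholds live in the panel definition itself. A rough sketch of a stat panel in dashboard JSON (field names per recent Grafana versions; the query is illustrative, not our exact one):

```json
{
  "type": "stat",
  "title": "Memory",
  "targets": [
    { "expr": "sum(container_memory_usage_bytes) / sum(machine_memory_bytes) * 100" }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 70 },
          { "color": "red", "value": 85 }
        ]
      }
    }
  }
}
```

The `value: null` step is the base color; each later step takes over once the metric crosses its value.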

The RED Method (For Services)

When monitoring services (APIs, microservices), the RED method gives you three metrics that answer the question “is this service healthy?”:

| Metric | What It Tells You | PromQL |
| --- | --- | --- |
| Rate | How many requests per second | `rate(http_requests_total[5m])` |
| Errors | What percentage are failing | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100` |
| Duration | How fast are responses (p95) | `histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))` |

If the rate drops suddenly, something might be blocking requests. If errors spike, something is broken. If duration increases, something is slow. Three panels tell you everything about a service’s health.

For infrastructure (nodes, containers), there’s the USE method — Utilization, Saturation, Errors — but RED covers what you’ll check most often.
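For a taste of USE, here are sketches of node-level utilization and saturation queries (assuming standard node_exporter metric names):

```promql
# Utilization: fraction of CPU time spent non-idle, per node
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 5-minute load average relative to core count
node_load5 / count by (instance) (node_cpu_seconds_total{mode="idle"})
```

A load average above the core count means work is queueing — saturation — even if utilization looks fine.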

The Magic of Variables

A $namespace dropdown at the top of your dashboard makes one dashboard work for every environment:

# Without variables — hardcoded, one dashboard per environment
container_memory_usage_bytes{namespace="production"}

# With variables — one dashboard, dropdown selects environment
container_memory_usage_bytes{namespace="$namespace"}

You define a variable as a query against the data source that returns all available namespaces:

label_values(namespace)

Grafana renders it as a dropdown. Select “production” and every panel filters. Select “staging” and the same dashboard shows staging data. One dashboard to rule them all.
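Under the hood, the variable is just another piece of the dashboard JSON — something like this (structure approximate; exact fields vary by Grafana version):

```json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(namespace)",
        "refresh": 2
      }
    ]
  }
}
```

`"refresh": 2` re-runs the query on every time-range change, so newly created namespaces show up in the dropdown without a page reload.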

Provisioning: GitOps for Monitoring

Here’s where it gets interesting. Instead of manually configuring Grafana through the UI (datasources, dashboards, plugins), you define everything as config files:

# provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-service:9090 # internal K8s service
    isDefault: true

# provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: "default"
    type: file
    options:
      path: /var/lib/grafana/dashboards # JSON dashboard files here

Build a dashboard in the UI, export it as JSON, commit it to Git, and it auto-loads on Grafana startup. Delete your Grafana pod? It comes back with all your dashboards intact. New cluster? Same dashboards, no manual setup. It’s GitOps for your monitoring stack.
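In Kubernetes, one way to wire this up (a sketch; names here are hypothetical) is to ship the exported dashboard JSON as a ConfigMap and mount it at the path the dashboard provider watches:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards # hypothetical name
data:
  cluster-overview.json: |
    { "title": "Cluster Overview", "panels": [] }
---
# Abridged Grafana Deployment spec — mount the ConfigMap where the
# file provider looks for dashboards:
# volumes:
#   - name: dashboards
#     configMap:
#       name: grafana-dashboards
# volumeMounts:
#   - name: dashboards
#     mountPath: /var/lib/grafana/dashboards
```

The placeholder JSON above stands in for a real exported dashboard; the point is that the whole chain lives in Git.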

Grafana Alerting vs AlertManager

Grafana has its own alerting system. So does AlertManager (which we set up with Prometheus). Which do you use?

|  | Grafana Alerting | AlertManager |
| --- | --- | --- |
| Config | UI-driven (click to create) | YAML files (config-as-code) |
| Grouping | Basic | Advanced (grouping + inhibition) |
| Severity routing | Basic | Full routing tree |
| Best for | Ad-hoc, team-specific alerts | Production alerts, GitOps workflows |

We use AlertManager for production alerts because it’s config-as-code (fits our GitOps approach) and has proper inhibition rules (critical suppresses warning). Grafana alerting is there for quick, ad-hoc alerts that don’t need the full routing tree.
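For reference, an inhibition rule in alertmanager.yml looks roughly like this (matcher syntax per Alertmanager 0.22+; labels are illustrative):

```yaml
# A firing critical alert suppresses warnings that share the same
# alertname and namespace labels
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["alertname", "namespace"]
```

Without the `equal` clause, any critical alert would silence every warning in the cluster, so scope it to the labels that actually relate the two alerts.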

The Full Stack

Let’s zoom out and see what we’ve built across this series:

Code → GitHub Actions → Docker Image → Container Registry
                                            ↓
                              ArgoCD Image Updater detects new tag
                                            ↓
                              ArgoCD syncs cluster from Git
                                            ↓
                              GKE runs the pods
                                            ↓
                              Prometheus scrapes metrics
                                            ↓
                              AlertManager routes alerts
                                            ↓
                              Grafana visualizes everything

We went from Docker on a laptop to a CI/CD pipeline to Kubernetes on GKE with GitOps deployments and a full monitoring stack. Each piece solves a specific problem, and together they form an infrastructure that deploys itself, heals itself, and tells you when something is wrong.

Is it overkill for a personal project? Absolutely. Did I learn more building it than any course or tutorial could teach? Also absolutely.

Aaron Yong
Building things for the web. Writing about development, Linux, cloud, and everything in between.
