The rsync Era#
The first deployment system I worked with in production was SSH + rsync. A GitHub Actions workflow would SSH into the server, rsync the code, run npm install, execute Prisma migrations, and restart the service with PM2. For each service. In sequence. One at a time.
```yaml
# The old way (simplified, obfuscated)
deploy:
  steps:
    - name: Deploy service
      run: |
        rsync -avz ./src $SERVER_HOST:/app/service/
        ssh $SERVER_HOST "cd /app/service && npm install && npx prisma migrate deploy && pm2 restart service"
```
It worked. Deployments took a few minutes. If something broke, you’d SSH in and pm2 restart manually. Rollback meant reverting the commit and running the pipeline again (if you were lucky) or SSHing in and git checkout the previous version (if you were desperate).
The problem wasn’t that it was bad — it was that it was fragile. One failed SSH connection and the deployment was half-done. One missed migration and the app crashed. One developer running a quick fix directly on the server and the deployed code diverged from Git.
The Docker Compose Chapter#
The next evolution was Docker-based deployments. Instead of syncing files, the pipeline built a Docker image, pushed it to a registry, and ran docker compose up on the server.
```yaml
deploy:
  needs: test
  steps:
    - name: Tear down
      run: docker compose -f compose.prod.yaml down
    - name: Build and deploy
      run: docker compose -f compose.prod.yaml up --build -d
```
Better. The image was immutable — what you tested is what you deployed. No more “it works on the server but not in CI” because the environment was the same Docker image everywhere. Rollback was still “deploy the previous version,” but at least the previous version was a known, tested image.
We also started testing inside Docker containers — a test Dockerfile ran the test suite, and if it exited non-zero, the pipeline stopped. No deployment without passing tests.
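A sketch of how that gate looks in the workflow — `Dockerfile.test` and the image name are illustrative, not the real files:

```yaml
test:
  steps:
    - name: Run test suite in container
      # Builds the test image and runs it; a non-zero exit code fails
      # this job, so the deploy job below never starts.
      run: |
        docker build -f Dockerfile.test -t service-tests .
        docker run --rm service-tests

deploy:
  needs: test   # deploy is gated on the test job succeeding
```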
But this was still a single machine. One server, one Docker daemon. And the deployment had a brief downtime window — docker compose down followed by docker compose up.
The Matrix Evolution#
As services multiplied, we started using GitHub Actions matrix builds to build them in parallel:
```yaml
strategy:
  matrix:
    service: [api-service, auth-service, scheduler, worker]
  fail-fast: false
```
fail-fast: false was a deliberate choice — if the auth-service build fails, you don’t want to cancel the api-service build. They’re independent. This cut total build time significantly because services built simultaneously instead of sequentially.
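Inside the job, each matrix entry is available as ${{ matrix.service }}. A sketch of how the build step might consume it — the registry host and per-service directory layout are assumptions:

```yaml
build:
  strategy:
    matrix:
      service: [api-service, auth-service, scheduler, worker]
    fail-fast: false
  steps:
    - uses: actions/checkout@v4
    - name: Build and push image
      # One parallel job per matrix entry; the SHA tag ties the
      # image back to the exact commit that produced it.
      run: |
        docker build -t registry.internal/${{ matrix.service }}:${{ github.sha }} ./${{ matrix.service }}
        docker push registry.internal/${{ matrix.service }}:${{ github.sha }}
```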
Self-Hosted Runners#
We moved to self-hosted GitHub Actions runners for two reasons:
- Cost — cloud runner minutes add up, especially for Windows container builds
- Access — self-hosted runners can talk directly to the Docker daemon and internal registry without exposing credentials externally
Some of our services need Windows containers (legacy .NET workloads). Self-hosted Windows runners handle those builds while Linux runners handle everything else. Same pipeline, different runner labels.
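One way to express that routing is a matrix include that carries the runner labels alongside each service — the label names and service names here are illustrative:

```yaml
strategy:
  matrix:
    include:
      - service: api-service
        runner: [self-hosted, linux]
      - service: legacy-dotnet-service
        runner: [self-hosted, windows]
runs-on: ${{ matrix.runner }}   # each entry lands on its own runner pool
```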
The Tagging Strategy#
This was a surprisingly important decision. We needed to distinguish between QA and production builds without ambiguity:
| Environment | Tag Pattern | Trigger |
|---|---|---|
| QA | qa-v1.0.X (auto-increment) | PR merge to develop |
| Production | v1.3.0 (explicit semver) | Manual git tag |
QA tags auto-increment: the pipeline reads the latest qa-v1.0.* tag, bumps the patch, and tags the build. This means every merge to develop automatically deploys to QA — no human in the loop.
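The bump itself is just string arithmetic. A sketch under the assumption that the latest tag has already been read (in the real pipeline it would come from something like `git tag -l 'qa-v1.0.*' | sort -V | tail -1`):

```shell
#!/bin/sh
# Illustrative stand-in for the latest QA tag read from git.
latest="qa-v1.0.47"

patch="${latest##*.}"          # strip everything up to the last dot -> 47
next="qa-v1.0.$((patch + 1))"  # bump the patch component

echo "$next"                   # prints qa-v1.0.48
```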
Production requires someone to explicitly create a semantic version tag. This is intentional — deploying to production should be a deliberate, auditable action. The ArgoCD Image Updater filters tags with regexp:^v[0-9]+\.[0-9]+\.[0-9]+$, so only proper semver tags trigger production deployments. QA tags (qa-v1.0.47) are invisible to production.
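With ArgoCD Image Updater, that filter lives in annotations on the Application manifest — the image name and alias below are illustrative:

```yaml
metadata:
  annotations:
    # Which image to watch, under the alias "api".
    argocd-image-updater.argoproj.io/image-list: api=registry.internal/api-service
    # Pick the highest semver among allowed tags.
    argocd-image-updater.argoproj.io/api.update-strategy: semver
    # Only plain vX.Y.Z tags qualify; qa-v1.0.* never matches.
    argocd-image-updater.argoproj.io/api.allow-tags: regexp:^v[0-9]+\.[0-9]+\.[0-9]+$
```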
The ArgoCD Chapter#
The final evolution was GitOps with ArgoCD. The pipeline no longer deploys anything — it builds an image, pushes it, and walks away. ArgoCD watches the registry, detects the new tag, and syncs the cluster.
Before: Pipeline → SSH → server → restart service
Middle: Pipeline → Docker build → push → docker compose up
Now: Pipeline → Docker build → push → done
ArgoCD → detects new image → syncs cluster → deployed
Nobody runs kubectl apply. Nobody SSH-es into anything. The cluster state is defined in Git and enforced by ArgoCD. If someone manually modifies the cluster, ArgoCD reverts it.
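A minimal sketch of what such an Application looks like with self-healing enabled — the repo URL, paths, and names are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.internal/platform/deploy-manifests.git
    targetRevision: main
    path: api-service
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made directly to the cluster
```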
Deployment Strategies I Haven’t Used (But Should Know)#
Our current setup uses Kubernetes rolling updates — the default. Pods replace one at a time, zero downtime, works fine. But there are more sophisticated strategies:
Blue-Green: two identical environments. Deploy to the inactive one, test it, switch traffic instantly. Rollback = switch back. The cost: double the infrastructure during deployments.
Canary: route 5% of traffic to the new version. Monitor. If it’s stable, increase to 20%, then 50%, then 100%. The smallest blast radius, but you need traffic splitting and good monitoring to detect issues in that 5%.
Feature Flags: deploy the code but hide it behind a flag. Enable for specific users or percentages. Decouple deployment (code is live) from release (feature is enabled). Kill switch if something breaks.
We use rolling updates because they’re simple and sufficient for our scale. If I were starting fresh with higher traffic and stricter SLAs, I’d look at canary deployments with Prometheus-based automated rollback.
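For reference, the rolling update behavior described above is configured on the Deployment itself; the values shown are the Kubernetes defaults:

```yaml
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # extra pods allowed above replicas during rollout
      maxUnavailable: 25%  # pods allowed to be down at any moment
```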
The Pattern#
Looking back, each evolution solved a specific pain point:
| Era | Pain | Solution |
|---|---|---|
| SSH + rsync | Fragile, manual, divergent state | Docker (immutable images) |
| Docker Compose | Single machine, brief downtime | Kubernetes (multi-node, rolling updates) |
| kubectl apply | Drift, no audit trail, manual | ArgoCD (Git-driven, self-healing) |
| Manual tagging | Human error in deploys | Auto-tagging QA, semver filtering prod |
Each step removed a manual intervention. The pipeline got shorter (from the CI/CD perspective) while the delivery system got more sophisticated. The current state: a developer merges a PR, and 5 minutes later, it’s running in QA. No human touched the deployment.
