#devops

21 posts

API Dependency Health Checks: Why /health Is Not Enough
Your /health endpoint returns 200 OK while your database is unreachable. Kubernetes keeps routing traffic. Users see 500s. Here is how to build dependency-aware health checks that actually protect your uptime.

May 24, 2026
api node.js reliability devops
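
As a rough illustration of the dependency-aware idea in the teaser above, here is a minimal Python sketch of a readiness-style check that reports 503 when any dependency probe fails. The function name `readiness` and the shape of the `checks` mapping are assumptions made for this example, not code from the post.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run each dependency probe and derive an HTTP status from the results.

    `checks` maps a dependency name to a zero-argument callable that returns
    True when the dependency is reachable (hypothetical shape for this sketch).
    """
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A probe that raises is treated the same as one that reports failure.
            ok = False
        results[name] = "ok" if ok else "fail"

    # 200 only when every dependency passed; otherwise 503 so traffic is withheld.
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

A web handler would call `readiness({...})` and return the status code and the per-dependency map as the response body.
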
Graceful Shutdown in Node.js: The 40 Lines That Stop 502s During Deploys
On every redeploy, your users see a 4–7 second window of 502s. Here is exactly why, the 40 lines of Node code that eliminate it, and how to verify the fix with a real load test.

May 9, 2026
node.js devops reliability
Your Docker Image Is 1.2GB. Here Is How To Get It Under 80MB.
A step-by-step optimization of a real Node.js Docker image, from a 1.2GB monster to a 78MB production container. Each technique is benchmarked, copy-paste ready, and explained with the trade-offs.

May 1, 2026
docker devops productivity
CI/CD From Zero to Production in 30 Minutes With GitHub Actions
A no-fluff guide to shipping a real CI/CD pipeline that lints, tests, builds, and deploys automatically, without the enterprise boilerplate.

February 14, 2025
ci-cd devops productivity
Kubernetes Resource Requests And Limits: The Numbers That Decide If Your Cluster Is Stable
Most teams set CPU and memory requests by guessing. The result is over-provisioning that wastes money or under-provisioning that causes evictions. Here is the practical method for picking each number, the difference between requests and limits, and why CPU limits are often a mistake.

October 11, 2024
kubernetes devops performance
Terraform Modules As An Internal Platform: How To Build A Self-Service Infrastructure Layer
A new service requires a database, a queue, secrets, alerts, IAM roles, and monitoring. Without modules, every team copies a previous service's Terraform and modifies it. With well-designed modules, "new service" is 10 lines of HCL. Here is the module design that scales, the testing approach, and the four traps.

July 19, 2024
terraform devops platform-engineering
The Five-Stage Rollout: How To Ship A Risky Change Without Holding Your Breath
Most teams ship features as “merge to main and deploy.” The result is that a bug affects 100% of users immediately. Five-stage rollouts (internal, 1%, 10%, 50%, 100%) turn “oh no” into “catch it at 1%.” Here is the working pattern, the metrics that gate each stage, and the rollback procedure.

May 24, 2024
process reliability devops
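
One common way to implement the staged percentages named in the teaser above (internal, 1%, 10%, 50%, 100%) is deterministic hash bucketing. The sketch below is an assumption about how such a gate could look, not the post's own code; `STAGES`, `bucket`, and `is_enabled` are hypothetical names.

```python
import hashlib
from typing import AbstractSet

# Percentage of non-internal users exposed at each stage (stage names from the teaser).
STAGES = {"internal": 0, "1%": 1, "10%": 10, "50%": 50, "100%": 100}


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(
    user_id: str,
    feature: str,
    stage: str,
    internal_users: AbstractSet[str] = frozenset(),
) -> bool:
    """Return True if the feature is on for this user at the given stage."""
    if user_id in internal_users:
        # Internal users see the change at every stage.
        return True
    return bucket(user_id, feature) < STAGES[stage]
```

Because the bucket is derived from the user ID and feature name, a user who sees the change at 1% keeps seeing it at 10% and beyond, and moving back a stage narrows exposure again without touching user state.
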
GitHub Actions In A Monorepo: Caching, Path Filters, And Secret Boundaries That Actually Work
A naive monorepo CI runs all jobs on every PR, takes 25 minutes, and burns money. The version that works has path-filtered jobs, cross-job caching, and reusable workflows. Here is the working setup that runs in 4 minutes for a typical PR.

May 10, 2024
ci-cd devops github-actions
Service Mesh: When Istio Or Linkerd Earns Its Operational Cost, And When Not
Service mesh promises automatic mTLS, traffic shifting, and observability. The operational cost is real: Istio doubles a cluster's control-plane complexity. Here is the honest framework for whether your team needs a mesh, the lighter alternatives, and the migration that doesn't break production.

April 12, 2024
kubernetes devops distributed-systems
Terraform State In A Team: The Setup That Stops Two Engineers From Corrupting Prod
Local Terraform state on a laptop is fine until somebody else pushes infra changes too. Then you have a corrupted state file and a long debugging session. Here is the remote-state-with-locking setup, the workspaces vs directories debate, and the four habits that keep IaC sane.

January 19, 2024
terraform devops infrastructure-as-code
Pod Disruption Budgets: The K8s Object That Keeps Your Service Up During Cluster Maintenance
You set up rolling deploys carefully. Then a node drains during a cluster upgrade and takes 80% of your pods at once. PodDisruptionBudget is the manifest that says “never evict more than N at a time.” Three lines of YAML, real production benefits.

January 5, 2024
kubernetes devops reliability
Chaos Engineering Starter Kit: The Five Drills That Don't Need Netflix-Scale
Most “chaos engineering” discussions are about Chaos Monkey at Netflix and have nothing to do with how a 20-engineer team should test resilience. The five drills here are practical, scoped, runnable in an afternoon, and will surface the broken assumption your monitoring missed.

November 24, 2023
reliability devops distributed-systems
Secrets Management For Real Teams: Vault, SOPS, And The .env File You Should Burn
Almost every team starts with a .env file in 1Password and ends with secrets in Slack. Here are the three credible options for production secrets (Vault, SOPS-encrypted-in-git, and the cloud-native secret managers on AWS/GCP) with the trade-offs, the migration paths, and the rotation policy that survives a year.

October 27, 2023
security devops reliability
Kubernetes Autoscaling Beyond CPU: The Custom-Metric HPA Pattern That Actually Works
Default HPA scales on CPU, which is wrong for most modern workloads. Memory, queue depth, request rate, and custom business metrics are what actually correlate with “need more pods.” Here is the working setup with custom metrics, the formula HPA uses, and the four mistakes that cause flapping.

September 15, 2023
kubernetes devops reliability
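
For reference, the scaling formula the teaser above alludes to is documented by Kubernetes as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, with scaling skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The Python sketch below restates only that core calculation; the function name is hypothetical, and it ignores min/max replica bounds, stabilization windows, and pod readiness handling.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
) -> int:
    """Core HPA calculation: scale replicas by the ratio of current to target metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

For example, 4 pods each handling 200 requests per second against a target of 100 gives a ratio of 2.0, so the sketch asks for 8 pods.
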
Pre-Commit Hooks That Pay For Themselves: Husky, lint-staged, And The Five Rules That Stick
Most teams install Husky, configure ten pre-commit checks, and disable the whole thing within a month because commits take 30 seconds. Here is the minimal pre-commit setup that catches real bugs, runs in under 2 seconds on the changed files only, and does not need a `--no-verify` workaround.

March 17, 2023
productivity devops tools
Zero-Downtime Database Migrations: The Six-Step Pattern That Rules Them All
Renaming a column on a 50-million-row table looks like a one-line SQL change and is actually a six-step deploy spread across two PRs. Here is the pattern (expand, migrate, contract) applied to renames, type changes, and NOT NULL backfills, with the locks each step takes and the rollback at every stage.

March 3, 2023
database postgres devops reliability
Kubernetes Liveness And Readiness Probes: The Difference That Causes Half Your Outages
Most teams configure liveness and readiness probes identically and wonder why a slow database makes Kubernetes restart their pods in a death spiral. Here is what each probe is actually for, the right endpoint shape for each, and the four-line config that turns an outage into a non-event.

January 20, 2023
kubernetes devops reliability
Load Testing With k6: The Three Scenarios That Find Real Bugs (Not Synthetic Numbers)
Most load tests slam one endpoint with a constant rate of requests and report a percentile. That graph means almost nothing. Real bugs live in ramp-up, soak, and spike scenarios. Here are the k6 scripts for each, the metric to read, and why the constant-load test you ran last quarter missed the regression.

November 11, 2022
performance reliability devops
Feature Flags That Pay Rent: The 4 Flag Types And When To Delete Each
Most teams have one feature-flag system and four kinds of flags pretending to live in it. Release toggles, ops toggles, permission toggles, and experiments behave differently, decay differently, and need different cleanup rules. Here is the taxonomy that prevents flag debt from eating your codebase.

September 16, 2022
productivity devops process
Connection Pooling Without the Cargo Cult: pgbouncer in 100 Lines of Config
Postgres falls over not because of slow queries but because of too many connections. Most teams reach for pgbouncer and copy a config they do not understand. Here is the actual job each setting does, the three pool modes ranked by what they break, and the rule for sizing pool_size that holds at any traffic level.

August 19, 2022
database postgres reliability devops
Bash Strict Mode: The Three Lines That Stop Your Deploy Script From Lying To You
Half the production incidents that start with “but the script said it succeeded” come from the same three missing lines at the top of a bash file. Here is what `set -euo pipefail` actually does, where it can trip you up, and the deploy-script pattern that fails loudly instead of quietly succeeding.

July 22, 2022
devops productivity bash