The Practical Developer

Chaos Engineering Starter Kit: The Five Drills That Don't Need Netflix-Scale

Most “chaos engineering” discussions are about Chaos Monkey at Netflix and have nothing to do with how a 20-engineer team should test resilience. The five drills here are practical, scoped, runnable in an afternoon, and will surface the broken assumption your monitoring missed.


The team’s PagerDuty page count is growing. The on-call goes home Friday tense. Somebody suggests “we should do chaos engineering” and the room nods, but nobody knows what it means in practice for a 20-engineer team that does not have Netflix’s SRE org.

Chaos engineering is not “randomly break production.” It is “deliberately introduce a known failure mode and verify the system handles it as expected.” Done well, the failure modes you care about are the ones that paged you last quarter — DB hiccups, downstream timeouts, deploy interruptions. You don’t need a fancy framework. You need a list of five drills, an afternoon, and willingness to find out something embarrassing.

This post is those five drills, the playbook for running them safely, and the metric that tells you whether you actually got better.

The setup: what you need before drill 1

Three prerequisites:

  1. A staging environment that mirrors prod’s topology. Different scale is fine; same components and connections.
  2. Monitoring on the staging environment. You need to see the failure. If a chaos drill runs and no dashboard shows it, you learned nothing.
  3. A rollback / kill switch for the chaos. The drill must be reversible. If the drill turns into a real incident, you stop.
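One way to make the kill switch hard to forget is to wrap every drill in a helper that always runs the rollback, even if the drill script crashes, and that stops the chaos by itself after a time limit. The sketch below is a minimal illustration, not a framework: `chaos`, `start`, `stop`, and `max_seconds` are hypothetical names, and `start`/`stop` stand in for whatever commands your drill actually uses.

```python
import contextlib
import threading


@contextlib.contextmanager
def chaos(start, stop, max_seconds):
    """Run start(), always run stop() exactly once, and auto-stop after max_seconds.

    start and stop are caller-supplied callables (hypothetical placeholders for
    the real drill commands). The yielded event is set once stop() has run.
    """
    stopped = threading.Event()
    lock = threading.Lock()

    def stop_once():
        # The lock makes "check, then stop" atomic between the timer thread
        # and the main thread, so the rollback never runs twice.
        with lock:
            if stopped.is_set():
                return
            try:
                stop()
            finally:
                stopped.set()

    timer = threading.Timer(max_seconds, stop_once)
    timer.daemon = True
    try:
        start()
        timer.start()
        yield stopped
    finally:
        # Reached on normal exit and on any exception inside the with-block.
        timer.cancel()
        stop_once()
```

Used as `with chaos(block_email_service, unblock_email_service, max_seconds=600): ...`, the body of the `with` block is where you watch dashboards; leaving the block for any reason rolls the drill back.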

The five drills below are ordered by complexity. Do them in order.

Drill 1: Kill a pod

The simplest chaos action: pick a pod, kill it. Validate that the system recovers.

# -o name prints pod/<name>, which kubectl delete accepts without a separate resource type.
kubectl get pods -l app=api -o name | shuf -n 1 | xargs kubectl delete

Watch:

  • Does the load balancer stop sending traffic to the killed pod within one readiness-probe interval?
  • Does a new pod come up and rejoin?
  • Are any in-flight requests dropped or do they get clean errors?
  • Is there any user-visible impact?

Common findings: readiness probe is checking the wrong thing (pod stays in LB after kill), graceful shutdown is broken (in-flight requests dropped instead of completing), startup probe is too slow (replacement pod takes minutes).

This is the drill that earns a proper graceful shutdown setup and the right probe configuration.
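For reference, the core of a graceful shutdown is small: on SIGTERM (the signal Kubernetes sends when a pod is deleted), start failing the readiness check, refuse new requests, and wait for in-flight requests to finish before exiting. The sketch below shows that bookkeeping only, under the assumption that your web framework lets you hook request start and end. `ShutdownGate` and its method names are hypothetical.

```python
import signal
import threading


class ShutdownGate:
    """Tracks in-flight requests so the process can drain before exiting."""

    def __init__(self):
        self._stopping = threading.Event()
        self._idle = threading.Condition()
        self._in_flight = 0

    def install(self):
        # Call from the main thread at startup; SIGTERM then begins the drain.
        signal.signal(signal.SIGTERM, lambda signum, frame: self.begin_shutdown())

    def begin_shutdown(self):
        self._stopping.set()

    def is_ready(self):
        # Wire this into the readiness endpoint: not ready once shutdown starts.
        return not self._stopping.is_set()

    def try_enter(self):
        # Call at the start of each request; False means "reject, we are draining".
        with self._idle:
            if self._stopping.is_set():
                return False
            self._in_flight += 1
            return True

    def leave(self):
        # Call at the end of each request that try_enter() admitted.
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def wait_for_drain(self, timeout):
        # True if all in-flight requests finished within the timeout.
        with self._idle:
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout)
```

The drain timeout should stay below the pod's termination grace period; otherwise the process is killed before `wait_for_drain` returns.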

Drill 2: Break a downstream dependency

Pick a non-critical downstream service. Make it return errors or timeouts.

If the dependency is internal, scale it to zero or block it with a NetworkPolicy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: chaos-block-email }
spec:
  podSelector: { matchLabels: { app: email-service } }
  policyTypes: [Ingress]
  ingress: []   # deny all ingress

If the dependency is third-party, use a toxiproxy sidecar to inject latency or errors:

toxiproxy-cli toxic add stripe -t latency -a latency=5000

Watch:

  • Does the calling service’s circuit breaker trip cleanly?
  • Does the fallback path actually work, or does it 500 too?
  • Does the dependent traffic queue up at the load balancer or fail fast?
  • Do user-facing endpoints degrade gracefully or stop entirely?

Common findings: the team thought they had circuit breakers but discovered they only have retries. Cascading failures show up here that don’t show up in load tests.
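The difference between retries and a circuit breaker is easy to blur, so here is a minimal sketch of the latter. A retry calls the broken dependency again; a breaker stops calling it for a while after repeated failures and goes straight to the fallback. The class name, thresholds, and the injectable `clock` are illustrative assumptions, not any particular library's API.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors instead of hammering a dead dependency."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # seconds to stay open before a trial call
        self._clock = clock                 # injectable so tests can control time
        self._failures = 0
        self._opened_at = None              # None means the circuit is closed

    @property
    def is_open(self):
        if self._opened_at is None:
            return False
        return self._clock() - self._opened_at < self.reset_after

    def call(self, fn, fallback):
        if self.is_open:
            # Open: do not touch the dependency at all.
            return fallback()
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                # Trip (or re-trip after a failed trial call).
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

During the drill, the signal to look for is that calls to the broken dependency stop after the threshold is reached, rather than continuing at full request rate.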

Drill 3: Database failover

Promote the read replica to primary. Or, less aggressively, restart the primary.

# AWS RDS (Multi-AZ): reboot the primary and force a failover to the standby
aws rds reboot-db-instance --db-instance-identifier prod-db --force-failover

Watch:

  • How long does the application take to detect the new primary?
  • Are there errors during the cutover window?
  • Does pgbouncer (or your pooler) reconnect cleanly?
  • Are any in-flight transactions lost?

An RDS Multi-AZ instance failover typically means one to two minutes of database downtime (AWS documents 60-120 seconds as typical, longer under heavy write load). The application should recover automatically; if it does not, you have a connection pool problem. (See pgbouncer post.)
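To put a number on the cutover window, one option is to run a trivial query against the application's database path once per second during the drill and record whether it succeeded. The helper below turns those samples into outage windows. It is a sketch: `outage_windows` is a hypothetical name, and the `(timestamp, ok)` sample format is an assumption about how you log the probe results.

```python
def outage_windows(samples):
    """Return (first_failure, first_success_after) pairs from probe samples.

    samples is an iterable of (timestamp_seconds, ok) tuples in time order.
    A window that never recovers is reported with None as its end.
    """
    windows = []
    start = None
    for timestamp, ok in samples:
        if not ok and start is None:
            start = timestamp              # outage begins at the first failed probe
        elif ok and start is not None:
            windows.append((start, timestamp))
            start = None                   # recovered; wait for the next failure
    if start is not None:
        windows.append((start, None))      # still failing when sampling stopped
    return windows
```

Comparing the window measured from the application with the failover time reported by the database shows how much of the outage is the pool or driver failing to reconnect.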

Drill 4: Latency injection

Inject 500ms or 2000ms of latency on a network path. Toxiproxy or Linux tc:

# Add 500ms of delay to all outbound traffic on eth0 (this affects every destination, not only the database).
sudo tc qdisc add dev eth0 root netem delay 500ms

# Remove
sudo tc qdisc del dev eth0 root

Alternatively, a service mesh (Istio, Linkerd) can inject faults through configuration, without touching the pods.

Watch:

  • Does p99 of every endpoint blow up?
  • Do timeouts kick in cleanly or do connections pile up until OOM?
  • What’s the user experience — slow but works, or hangs?

Common findings: timeouts are configured at one layer (HTTP client) but not another (database connection). A 500ms downstream latency causes 30s end-to-end latency because the connection pool is starved and requests queue behind it.
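The jump from 500ms to 30s looks implausible until you do the arithmetic. A connection pool can serve at most pool_size / query_time requests per second; once arrivals exceed that, the backlog grows every second and new requests wait behind it. The sketch below works this through with made-up numbers (a pool of 10, 20ms queries, 100 requests per second). All names and figures are illustrative assumptions, and the model ignores timeouts and retries, which in practice make things worse.

```python
def pool_throughput(pool_size, query_seconds):
    """Maximum requests per second a pool can serve if every query takes query_seconds."""
    return pool_size / query_seconds


def queue_wait(elapsed_seconds, arrival_rate, pool_size, query_seconds):
    """Approximate wait for a request arriving elapsed_seconds after the slowdown began.

    Simple fluid model: backlog grows at (arrival_rate - capacity) per second
    and drains at capacity per second.
    """
    capacity = pool_throughput(pool_size, query_seconds)
    backlog = max(0.0, (arrival_rate - capacity) * elapsed_seconds)
    return backlog / capacity


# Healthy: 10 connections, 20ms queries, about 500 req/s of capacity.
healthy = queue_wait(7, arrival_rate=100, pool_size=10, query_seconds=0.02)

# Drill: the same pool with 500ms injected, so queries take 520ms (about 19 req/s).
degraded = queue_wait(7, arrival_rate=100, pool_size=10, query_seconds=0.52)
```

With these numbers `healthy` is zero, while `degraded` comes out just under 30 seconds only seven seconds into the drill. A bounded wait on pool checkout is what turns that queue into fast, clean errors.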

Drill 5: Disk fill

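Fill the disk to at least 95% and watch what happens:

# Inside the pod (only on a chaos test, never on prod). Allocates 95% of the free space on /,
# so usage ends at 95% or more. Assumes /tmp is on the root filesystem.
fallocate -l $(df --output=avail / | tail -1 | awk '{print int($1*0.95)}')K /tmp/chaos-fill
# Rollback when the drill is over: rm /tmp/chaos-fill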

Watch:

  • Do logs continue to write? Or does the app crash?
  • Does the database (if local to the host) stop accepting writes?
  • Does monitoring even fire on disk-full?

Common findings: nothing alerts on disk-full because disks have been over-provisioned for years. When the alert finally fires, the on-call learns there is no runbook for clearing space and three services break.
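If the drill shows that nothing fires on disk-full, the missing check is small. The sketch below shows the shape of it using only the standard library. The function names and the 80% / 90% thresholds are assumptions for illustration; in practice this logic usually lives in your monitoring system's alert rules rather than in application code.

```python
import shutil


def used_fraction(path="/"):
    """Fraction of the filesystem containing path that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert(fraction_used, warn_at=0.80, page_at=0.90):
    """Map a usage fraction to an alert level; thresholds are illustrative."""
    if fraction_used >= page_at:
        return "page"
    if fraction_used >= warn_at:
        return "warn"
    return "ok"
```

Whatever the thresholds, the page needs to link to a runbook that says which directories are safe to clear. The drill is a good moment to write it.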

The drill playbook

Every drill follows the same six-step structure:

  1. Hypothesize. Write down what you expect to happen. “Killing a pod will cause ~30s of elevated 5xx; the LB will route to other pods after ~10s.”
  2. Plan rollback. Write the exact command to stop the chaos.
  3. Schedule. Tell the team. Pick a low-traffic window. Page yourself for the duration.
  4. Run. Start the chaos. Time it.
  5. Observe. Watch the dashboards. Note discrepancies between hypothesis and reality.
  6. Document. Capture what surprised you. File tickets for the gaps.

The hypothesis step is the highest leverage — it forces you to articulate your model of the system before testing it. Most “chaos engineering finds bugs” stories are actually “the team’s mental model didn’t match reality.”
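Writing the hypothesis as numbers makes the comparison mechanical. Below is one possible shape for a drill record: each expectation is a metric with an upper bound, the observed value is filled in during the run, and anything over the bound (or never measured) becomes a ticket. The class and field names are hypothetical, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Expectation:
    metric: str                       # e.g. "seconds of elevated 5xx"
    expected_max: float               # the hypothesis, written before the drill
    observed: Optional[float] = None  # filled in while watching the dashboards


@dataclass
class Drill:
    name: str
    rollback_command: str             # the exact command that stops the chaos
    expectations: List[Expectation] = field(default_factory=list)

    def surprises(self):
        """Expectations that were violated or never measured."""
        return [
            e for e in self.expectations
            if e.observed is None or e.observed > e.expected_max
        ]


# Invented example values for a pod-kill drill.
drill = Drill(
    name="kill a pod",
    rollback_command="none needed; the Deployment recreates the pod",
    expectations=[
        Expectation("seconds of elevated 5xx", expected_max=30, observed=95),
        Expectation("seconds until LB reroutes", expected_max=10, observed=8),
        Expectation("dropped in-flight requests", expected_max=0),
    ],
)
```

Here `drill.surprises()` returns the 5xx expectation (95 observed against 30 expected) and the dropped-requests expectation, which nobody measured. An unmeasured metric is itself a finding: it usually means a dashboard is missing.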

What “production chaos” looks like

Most teams never do production chaos. The progression is:

  1. Drills in staging.
  2. Drills in production during business hours, with the team standing by.
  3. Drills in production at random times (e.g., Chaos Monkey).

Don’t jump to step 3 without having lived through 1 and 2 for at least a few quarters. The cost of a bad chaos drill in production is real; the team has to be confident the system is resilient before introducing automated chaos.

The metric that tracks improvement

Pick a single number: mean time to recovery (MTTR) over the last 10 incidents. As your chaos drills surface and fix issues, MTTR should drop.

Other useful metrics:

  • Number of unique incident causes per quarter. Chaos drills should reduce repeated causes (the team learned, fixed, the same thing doesn’t happen twice).
  • % of incidents where the on-call’s first action was correct. Chaos drills build muscle memory.
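Both MTTR and repeated causes fall out of a plain incident log. The sketch below assumes each incident is a dict with `detected`, `resolved`, and `cause` keys. That schema and the function names are hypothetical; adapt them to whatever your incident tracker exports.

```python
from collections import Counter
from datetime import timedelta


def mttr(incidents, last_n=10):
    """Mean time to recovery over the most recent last_n incidents."""
    recent = sorted(incidents, key=lambda i: i["detected"])[-last_n:]
    if not recent:
        raise ValueError("no incidents to average")
    total = sum((i["resolved"] - i["detected"] for i in recent), timedelta())
    return total / len(recent)


def repeated_causes(incidents):
    """Causes that appear more than once, with their counts."""
    counts = Counter(i["cause"] for i in incidents)
    return {cause: n for cause, n in counts.items() if n > 1}
```

Recomputing both at the end of each quarter gives a trend line to compare against the drill schedule.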

A team that does chaos drills consistently has incidents that are short and uneventful. A team that doesn’t has incidents that produce 4-hour postmortems.

Tools that help

For most teams’ first 6 months of chaos work, you don’t need a tool. kubectl delete pod, tc, iptables, scaling a deployment to 0 — these are enough.

When NOT to do chaos

Three cases:

  • You don’t have monitoring for what you’re testing. The drill is pointless if you can’t see the result.
  • You don’t have a kill switch. Don’t introduce chaos you can’t stop.
  • The team is in firefighting mode. Chaos is for steady-state systems where you want to find weaknesses. Adding chaos to a system already on fire just causes confusion.

If any of those is true, work on monitoring or the immediate fires first.

The takeaway

Chaos engineering is a practice, not a tool. Five practical drills — kill a pod, break a downstream, fail over the database, inject latency, fill the disk — surface most of the resilience gaps a typical mid-size system has. Run them in staging first; promote to production drills with the team standing by once you trust the system.

The team that runs one of these drills per month is dramatically better at handling real incidents six months later. The drill is not the point — the muscle memory and documented runbook gaps are.

