The Practical Developer

Load Testing With k6: The Three Scenarios That Find Real Bugs (Not Synthetic Numbers)

Most load tests slam one endpoint with a constant rate of requests and report a percentile. That graph means almost nothing. Real bugs live in ramp-up, soak, and spike scenarios — here are the k6 scripts for each, the metric to read, and why the constant-load test you ran last quarter missed the regression.


The week before launch, the team runs a load test. They point it at the staging environment, ramp to 1000 RPS, hold for ten minutes, and the graph is flat at 80ms p95. Everyone goes home. On launch day, the system falls over at 400 RPS — half the load — within the first three minutes.

The reason the test passed and production failed has nothing to do with the load level. It is that “1000 RPS for ten minutes” tests one of the easiest scenarios for a system to handle: steady-state, fully warmed-up, pre-allocated connections. The actual load on launch day was a ramp — zero to 400 RPS in thirty seconds — which exposed cold-start, connection-pool saturation, and autoscaler lag in that order.

A useful load test is not “how fast can you go.” It is “what shape of traffic breaks you.” This post walks through three k6 scripts that cover the realistic shapes, and the metric to watch for each.

Why k6

k6 is a load-testing tool with the right tradeoff for application teams. Tests are JavaScript, the runner is a single Go binary that can generate on the order of 100k RPS from one machine (given a tuned load generator and a lightweight script), and results can be streamed to Prometheus via remote write, to Grafana, to InfluxDB, or to plain JSON and CSV files. Compared to JMeter, k6 is dramatically more pleasant to author. Compared to Artillery, k6 has a saner concurrency model.

# Install (macOS)
brew install k6
# Or use Docker
docker run --rm -i grafana/k6 run - < script.js

A test is a JS file that exports a default function describing what one virtual user (VU) does, plus an options object describing how many VUs and how long.

Scenario 1: Ramp-up — finds the bottleneck during scale-out

The most useful test you can run on a real service is a slow ramp from zero to peak. The point is not to confirm the peak number — it is to find the exact load level at which the system starts misbehaving, and what fails first.

// ramp.js
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    ramp_to_peak: {
      executor: 'ramping-arrival-rate',
      startRate: 10,                         // RPS at start
      timeUnit: '1s',
      preAllocatedVUs: 200,                  // VUs available
      maxVUs: 1000,
      stages: [
        { duration: '2m', target: 100 },     // 0–2m: 10 → 100 RPS
        { duration: '5m', target: 500 },     // 2–7m: 100 → 500 RPS
        { duration: '5m', target: 1000 },    // 7–12m: 500 → 1000 RPS
        { duration: '3m', target: 1000 },    // hold
      ],
    },
  },
  thresholds: {
    http_req_failed:   ['rate<0.01'],        // <1% errors
    http_req_duration: ['p(95)<500'],        // p95 < 500ms
  },
};

export default function () {
  const res = http.get('https://staging.example.com/api/feed', {
    headers: { Authorization: `Bearer ${__ENV.TOKEN}` },
  });
  check(res, { 'status 200': (r) => r.status === 200 });
  // No sleep(): the arrival-rate executor paces iterations itself.
  // A trailing sleep only ties up VUs (1000 RPS x ~1 s would need 1000+ VUs).
}

The ramping-arrival-rate executor is the right tool: it starts iterations at a target rate and allocates as many VUs as it needs (up to maxVUs) to keep up. With one request per iteration, that rate is your RPS. (The closed-model counterpart, ramping-vus, gives you a target concurrency instead, which is rarely what you actually want for an HTTP test.)
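How many VUs an arrival-rate scenario needs follows from Little's Law: concurrent iterations are roughly the arrival rate multiplied by how long each iteration takes. The sketch below is plain JavaScript rather than a k6 API; the function name `requiredVUs` and the 30% headroom default are assumptions chosen for illustration.

```javascript
// Rough VU estimate for an arrival-rate executor (Little's Law):
// concurrent iterations ≈ arrival rate × iteration duration.
// Hypothetical helper, not part of k6. Integer inputs keep the math exact.
function requiredVUs(ratePerSec, iterationMs, headroomPercent = 30) {
  if (ratePerSec < 0 || iterationMs < 0) {
    throw new RangeError('rate and duration must be non-negative');
  }
  return Math.ceil((ratePerSec * iterationMs * (100 + headroomPercent)) / 100000);
}

// 1000 RPS with 200 ms iterations: 200 concurrent, plus 30% headroom.
console.log(requiredVUs(1000, 200));  // 260
// Same rate when every iteration lasts a full second.
console.log(requiredVUs(1000, 1000)); // 1300
```

The second call shows why anything that lengthens an iteration (slow responses, think time) raises the VU ceiling you need to configure.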

What to look for in the graph:

  • The cliff. At some RPS (call it X), http_req_failed shoots up and p95 doubles. That is the first bottleneck. It is almost never the CPU. It is usually a connection pool, a downstream rate limit, or thread starvation in the runtime.
  • The plateau. After the cliff, throughput stops increasing even though VUs keep climbing. The system is at its capacity ceiling.
  • The recovery. When the ramp stops and the rate holds steady, does latency settle? If it keeps climbing, you have a queue building up faster than it drains.

A constant-RPS test misses all three.
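Once the ramp's time series is exported, finding the cliff can be automated rather than eyeballed. This is a minimal sketch under assumptions: the sample shape `{ rps, errorRate, p95 }`, the function name `findCliff`, and the use of the first sample as the latency baseline are all illustrative, not something k6 produces in this form.

```javascript
// Find the first sample where errors exceed a limit or p95 at least
// doubles relative to the baseline (the first sample of the ramp).
// The sample shape is an assumption; adapt it to your metrics backend.
function findCliff(samples, { maxErrorRate = 0.01, p95Factor = 2 } = {}) {
  if (samples.length === 0) return null;
  const baselineP95 = samples[0].p95;
  for (const s of samples) {
    if (s.errorRate > maxErrorRate || s.p95 >= baselineP95 * p95Factor) {
      return s;
    }
  }
  return null; // no cliff within the tested range
}

// Made-up numbers, for illustration only.
const ramp = [
  { rps: 100, errorRate: 0.0, p95: 80 },
  { rps: 300, errorRate: 0.0, p95: 95 },
  { rps: 500, errorRate: 0.002, p95: 130 },
  { rps: 700, errorRate: 0.04, p95: 410 },
  { rps: 900, errorRate: 0.21, p95: 1900 },
];
console.log(findCliff(ramp).rps); // 700
```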

Scenario 2: Soak — finds memory leaks and slow degradations

Run the system at moderate load (say 60% of peak) for an hour or longer. The point is not “is it fast” — it is “does it stay fast.” Soak tests catch memory leaks, file-descriptor leaks, slow-growing GC pressure, and the kind of bug where p99 starts at 80ms and is at 800ms by minute 45.

// soak.js
import http from 'k6/http';

export const options = {
  scenarios: {
    soak: {
      executor: 'constant-arrival-rate',
      rate: 200,                             // 200 RPS, steady
      timeUnit: '1s',
      duration: '1h',
      preAllocatedVUs: 200,
      maxVUs: 400,
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<300', 'p(99)<800'],
  },
};

export default function () {
  http.get('https://staging.example.com/api/feed');
  // No sleep(): the executor already holds the rate at 200 RPS.
}

The metric to graph alongside k6’s output is memory of the application process over time. If RSS climbs steadily over the hour with no plateau, you have a leak. If it climbs and then drops with GC, fine. The shape matters more than the absolute number.

A soak test is also where you catch things like “the cache fills up and then performance collapses” or “the connection pool slowly leaks connections because of an unhandled rejection on timeout.” These never show up in a 10-minute test.
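A crude way to turn "RSS climbs steadily" into a number is a least-squares slope over the memory samples. The sketch below assumes samples of the form `{ tMin, rssMb }` collected by whatever monitors the application process; the names and the data are made up for illustration.

```javascript
// Least-squares slope of RSS over time, in MB per minute.
// Sample shape { tMin, rssMb } is an assumption about your monitoring export.
function slopePerMinute(samples) {
  const n = samples.length;
  if (n < 2) throw new Error('need at least two samples');
  const meanT = samples.reduce((sum, s) => sum + s.tMin, 0) / n;
  const meanY = samples.reduce((sum, s) => sum + s.rssMb, 0) / n;
  let cov = 0;
  let varT = 0;
  for (const s of samples) {
    cov += (s.tMin - meanT) * (s.rssMb - meanY);
    varT += (s.tMin - meanT) ** 2;
  }
  if (varT === 0) throw new Error('samples must span more than one timestamp');
  return cov / varT;
}

// Steady climb: the leak shape.
const leaking = [
  { tMin: 0, rssMb: 100 }, { tMin: 10, rssMb: 120 },
  { tMin: 20, rssMb: 140 }, { tMin: 30, rssMb: 160 },
];
// Climb and drop with GC: the healthy sawtooth.
const sawtooth = [
  { tMin: 0, rssMb: 100 }, { tMin: 10, rssMb: 140 }, { tMin: 20, rssMb: 100 },
  { tMin: 30, rssMb: 140 }, { tMin: 40, rssMb: 100 },
];
console.log(slopePerMinute(leaking));  // 2
console.log(slopePerMinute(sawtooth)); // 0
```

A slope is only a first filter. On a short or oddly sampled sawtooth it can come out non-zero, so the graph still gets the final say.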

Scenario 3: Spike — finds autoscaler and circuit-breaker bugs

A spike test is the inverse of a ramp: jump from zero (or low) to peak in seconds. This is the launch-day shape, the post-tweet shape, the “we got featured” shape. It exercises autoscaling, cold-start cost, and any in-process initialization (TLS handshakes, JIT warmup, lazy module loading).

// spike.js
import http from 'k6/http';

export const options = {
  scenarios: {
    spike: {
      executor: 'ramping-arrival-rate',
      startRate: 10,
      timeUnit: '1s',
      preAllocatedVUs: 500,
      maxVUs: 2000,
      stages: [
        { duration: '30s', target: 1000 },   // 0:00–0:30: ramp to 1000 RPS
        { duration: '2m',  target: 1000 },   // hold
        { duration: '30s', target: 10 },     // drop back
        { duration: '2m',  target: 10 },     // recover
        { duration: '30s', target: 1000 },   // second spike: does the system handle it better?
        { duration: '2m',  target: 1000 },
      ],
    },
  },
};

export default function () {
  http.get('https://staging.example.com/api/health');
  // No sleep(): the executor controls the request rate through the stages above.
}

What you are looking for:

  • First-spike error rate vs second-spike error rate. The first spike will probably trigger autoscaling. The second one, after the system has scaled, should be handled cleanly. If both fail equally, autoscaling is broken or too slow.
  • p99 in the first 30 seconds. Cold-start cost shows up here. If your service initializes slowly (TLS, large modules, DB pool warm-up), every fresh instance contributes to the p99.
  • Circuit breakers tripping. If you have circuit breakers between services, a sudden spike will trip them. That is the design — but you want to verify they reset cleanly.

Autoscaling that takes 60 seconds is often hidden by the steady-state test. A spike test makes it obvious within the first run.
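The first-spike versus second-spike comparison is easy to compute from an exported time series. In this sketch the windows follow the stages in spike.js (first spike 0:00 to 2:30, second spike 5:00 to 7:30); the sample shape `{ t, total, failed }`, the function names, and the numbers are assumptions for illustration.

```javascript
// Aggregate error rate over a time window [startSec, endSec).
// Samples are per-interval request counts; the shape is an assumption.
function errorRateInWindow(samples, startSec, endSec) {
  let total = 0;
  let failed = 0;
  for (const s of samples) {
    if (s.t >= startSec && s.t < endSec) {
      total += s.total;
      failed += s.failed;
    }
  }
  return total === 0 ? 0 : failed / total;
}

// Windows derived from the spike.js stages:
// 30s ramp + 2m hold = 0..150s, then 30s drop + 2m recovery, then 300..450s.
function compareSpikes(samples) {
  const first = errorRateInWindow(samples, 0, 150);
  const second = errorRateInWindow(samples, 300, 450);
  return { first, second, improved: second < first };
}

// Made-up numbers, for illustration only.
const samples = [
  { t: 30, total: 1000, failed: 80 },
  { t: 120, total: 1000, failed: 20 },
  { t: 200, total: 100, failed: 0 },
  { t: 330, total: 1000, failed: 5 },
  { t: 420, total: 1000, failed: 5 },
];
console.log(compareSpikes(samples));
// { first: 0.05, second: 0.005, improved: true }
```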

What to test beyond the hot endpoints

A common mistake is to load-test only the “hot” endpoints. The endpoints that bite you in production are usually:

  • Authentication. Every request goes through it. A 50ms regression on /auth/verify is a 50ms regression on everything.
  • Health checks. If /healthz is heavy (queries a DB), the load balancer or k8s probe may take down healthy pods during a load spike.
  • Webhook receivers. Often unloved. Almost always synchronous when they should be async.
  • Search endpoints. Higher tail latency than other reads.
  • Anything with a query parameter that the cache cannot handle. Filters, sorts, pagination cursors. Edge cases in caching show up under load.

Build a test profile that mimics the real mix:

// realistic-mix.js
import http from 'k6/http';
import { sleep, group } from 'k6';

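export const options = {
  scenarios: {
    realistic: {
      executor: 'constant-arrival-rate',
      rate: 100,                             // 100 new sessions per second, not 100 RPS
      timeUnit: '1s',
      duration: '10m',
      preAllocatedVUs: 600,                  // sessions last 5 s or more, so 500+ overlap
      maxVUs: 1000,
    },
  },
};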

export default function () {
  // Each VU runs a session of multiple actions.
  group('feed', () => {
    http.get('https://staging.example.com/api/feed');
  });
  sleep(2);

  group('post-detail', () => {
    http.get('https://staging.example.com/api/posts/abc');
  });
  sleep(3);

  if (Math.random() < 0.1) {
    group('search', () => {
      http.get('https://staging.example.com/api/search?q=hello');
    });
    sleep(2);
  }

  if (Math.random() < 0.05) {
    group('comment', () => {
      http.post('https://staging.example.com/api/comments',
        JSON.stringify({ post: 'abc', body: 'hi' }),
        { headers: { 'Content-Type': 'application/json' } });
    });
  }
}

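group tags every request made inside it with the group name and records a group_duration metric, so results can be broken down per section in a metrics backend or through tag-based thresholds. Probabilistic branching approximates the natural mix of an active session.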
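One way to get per-group numbers enforced is to attach a threshold to each group's tag. The sketch below builds those threshold keys using k6's threshold-on-tag syntax; the helper name `groupThresholds` and the millisecond budgets are placeholders, not measured values.

```javascript
// Per-group latency budgets, keyed by k6's threshold-on-tag syntax.
// Group tag values start with "::", hence the triple colon in the key.
// The budgets below are placeholders; set them from your own baseline.
const budgetsMs = { feed: 300, 'post-detail': 300, search: 800, comment: 500 };

function groupThresholds(budgets) {
  const thresholds = {};
  for (const [name, ms] of Object.entries(budgets)) {
    thresholds[`http_req_duration{group:::${name}}`] = [`p(95)<${ms}`];
  }
  return thresholds;
}

const thresholds = groupThresholds(budgetsMs);
console.log(Object.keys(thresholds));
```

In realistic-mix.js, the resulting object would go under `thresholds` in `options`, next to `scenarios`.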

Reading the output

k6 prints a summary at the end (abridged here; p(99) only appears if you add it through the summaryTrendStats option):

http_req_duration............: avg=120.4ms p(90)=180ms  p(95)=240ms  p(99)=520ms
http_req_failed..............: 0.32%   ✓ 4      ✗ 1234
iterations...................: 1238
http_reqs....................: 1238    198/s

The numbers worth reading, in order of importance:

  1. http_req_failed: anything above 1% during a “should be fine” test means you are not actually testing the system, you are testing its error behavior. Investigate first.
  2. p(99) of http_req_duration: the long tail. The average and p50 are mostly noise; p99 is where users feel pain.
  3. http_reqs/sec: actual throughput vs. the target. If you asked for 1000 RPS and got 600, check dropped_iterations first. If k6 ran out of VUs, the load generator was the limit; otherwise the system was the bottleneck, not your test.
  4. iterations: the number of completed runs of the default function. Useful for sanity-checking “did my test actually run.”
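To see why the average hides what p99 exposes, it helps to compute both on a small sample. This sketch uses a nearest-rank percentile, which is an assumption made for simplicity; k6's own percentile calculation may give slightly different values. The latencies are made up.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of
// samples at or below it. Simplified; k6's own numbers may differ slightly.
function percentile(values, p) {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// 97 fast requests and 3 very slow ones (milliseconds).
const latencies = [...Array(97).fill(80), ...Array(3).fill(2000)];
const avg = latencies.reduce((sum, v) => sum + v, 0) / latencies.length;

console.log(avg);                       // 137.6
console.log(percentile(latencies, 50)); // 80
console.log(percentile(latencies, 95)); // 80
console.log(percentile(latencies, 99)); // 2000
```

The average looks mildly elevated and p95 looks perfect, yet three users in a hundred waited two seconds.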

For richer output, send k6 metrics to Prometheus or a Grafana Cloud k6 instance. The summary is fine for CI; for actual investigation you want time-series.

CI integration: a short load test on every PR

You do not need to run an hour-long soak in CI. You do want a 60-second smoke test that fails the build when p95 exceeds a fixed budget, for example one set about 20% above your current baseline.

# .github/workflows/load.yml
- name: smoke load test
  run: |
    docker run --rm -i \
      -e BASE_URL=https://pr-${{ github.event.number }}.preview.example.com \
      grafana/k6 run --duration 60s --vus 50 --quiet \
      -e P95_THRESHOLD=300 \
      - < tests/load/smoke.js

In smoke.js, fail the test if the threshold is exceeded:

export const options = {
  thresholds: {
    http_req_duration: [`p(95)<${__ENV.P95_THRESHOLD}`],
    http_req_failed:   ['rate<0.01'],
  },
  // ...
};

A failing smoke test does not have to block the PR — it can be advisory. But the data from every PR builds a baseline you can chart, and a sudden regression is visible the moment it lands.
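Once a baseline exists, the check can be relative instead of a fixed number. A minimal sketch, assuming the baseline p95 is stored somewhere your CI job can read; the function name `p95Regressed` and the 20% default tolerance are illustrative choices.

```javascript
// Compare a PR's p95 against a stored baseline with a relative tolerance.
// Where the baseline comes from (an artifact, a metrics store) is left open.
function p95Regressed(baselineMs, currentMs, tolerance = 0.2) {
  if (baselineMs <= 0) throw new RangeError('baseline must be positive');
  return currentMs > baselineMs * (1 + tolerance);
}

console.log(p95Regressed(250, 290)); // false: within 20% of the baseline
console.log(p95Regressed(250, 320)); // true: more than 20% slower
```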

The takeaway

A single constant-load test is not a load test — it is a confidence-building exercise. Real systems break under ramp (where bottlenecks emerge), under soak (where leaks accumulate), and under spike (where autoscaling fails). Three k6 scripts, three different graphs, three different bug classes.

The next time somebody asks “have we load-tested this?” — the answer is not a number. It is “we ramped to peak, soaked at 60%, and spiked from cold start, and here is what we found.” That is the difference between a system that holds up on launch day and one that holds up only in the test environment.


A note from Yojji

The kind of pre-production work that turns “we hope it scales” into “we know what breaks first” — load profiles, soak tests, spike scenarios — is the discipline that decides whether a launch is uneventful or front-page. It is the kind of engineering Yojji’s teams build into the products they deliver for clients.

Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. Their dedicated teams ship across the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, GCP), and full-cycle product engineering — including the load and reliability testing that decides whether a system survives its first viral moment.