Kubernetes Autoscaling Beyond CPU: The Custom-Metric HPA Pattern That Actually Works

The team’s API has an HPA configured to scale at 80% CPU. Traffic doubles. Pods do not scale up. Latency goes up, customers complain. The reason: the workload is I/O-bound, CPU never crosses 50%, but the request rate has doubled — and CPU was a poor proxy for “we need more pods.”

The default-CPU HPA is a relic. Most modern workloads — Node.js APIs, Python web apps, queue consumers — are not CPU-bound. They are I/O-bound, queue-bound, or memory-bound. Scaling on CPU misses the actual signal. Custom metrics — request rate, queue depth, p95 latency — are what should drive scaling.

This post is the working setup for a custom-metric HPA, the formula HPA uses internally (so you can predict what it does), and the four mistakes that cause autoscaler flap.

The HPA formula

For any metric m, HPA computes:

desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric))

CPU example: 4 pods, current avg CPU 80%, target 50%:

desiredReplicas = ceil(4 * (80/50)) = ceil(6.4) = 7

The formula is identical for any metric. Pick a metric that linearly correlates with load and a target that represents “we want each pod to be at this utilization.” Then HPA does the math.
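To make the arithmetic concrete, here is a minimal Python sketch of the formula above. The function name is illustrative, not part of any Kubernetes client. It also models one detail the formula line leaves out: the controller skips changes when the ratio is within a tolerance of 1.0 (10% by default).

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """HPA core formula, with the controller's default 10% tolerance band."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(4, 80, 50))   # 7, matching the CPU example above
print(desired_replicas(4, 52, 50))   # 4, ratio 1.04 is inside the tolerance band
print(desired_replicas(10, 25, 50))  # 5, scale-down uses the same formula
```

The same function works for RPS, queue depth, or connections: only the units of the metric and target change.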

The right metric for the workload

A few common workloads and the right scaling metric:

HTTP API, light per-request work. Scale on requests per second per pod. Each pod handles roughly the same RPS at saturation; doubling RPS means double the pods. CPU is a poor proxy because an I/O-bound Node.js process can be saturated with in-flight requests while its CPU stays low.

Queue consumer. Scale on queue depth. Backlog = need more workers. CPU is irrelevant.

Memory-heavy app (Java, Python with large in-memory state). Scale on memory utilization or GC pause time.

WebSocket / long-lived connection server. Scale on active connections per pod.

Latency-sensitive API. Scale on p95 latency. When latency rises, add capacity. (Be careful: latency does not track load linearly, and scaling on it can mask underlying problems.)

For most teams, the right combination is “RPS per pod” plus a memory ceiling.

Wiring up custom metrics

Out of the box (with metrics-server installed), HPA can scale on CPU and memory. For anything else, you need a metrics adapter. The standard combo: Prometheus for collection, Prometheus Adapter to expose those series through the custom metrics API (custom.metrics.k8s.io), and HPA reading from that API.

# prometheus-adapter ConfigMap (excerpt)
rules:
- seriesQuery: 'http_requests_total{namespace!="", pod!=""}'
  resources:
    overrides:
      namespace: { resource: "namespace" }
      pod:       { resource: "pod" }
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: |
    sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)

This exposes http_requests_per_second as a per-pod metric the HPA can read. Then:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric: { name: http_requests_per_second }
      target:
        type: AverageValue
        averageValue: "100"           # target 100 RPS per pod
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # don't scale down for 5min after a peak
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30                # double quickly if needed
      - type: Pods
        value: 5
        periodSeconds: 30
      selectPolicy: Max

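The two metrics work together: RPS per pod is the primary signal and memory utilization is the safety net. HPA computes a desired replica count for each metric and uses the largest, so either one can trigger a scale-up, and scale-down happens only when both agree. The behavior block tunes how aggressively each direction reacts.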
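The interaction between the two metrics and the scaleUp policies can be sketched in a few lines of Python. This is a simplified model, not the controller's code: the function names are made up for illustration, it ignores the tolerance band, scale-down policies, and the stabilization window, and it assumes no other scaling happened in the current 30-second period.

```python
import math

def proposal(current_replicas: int, current_value: float, target_value: float) -> int:
    """One metric's desired replica count, using the HPA ratio formula."""
    return math.ceil(current_replicas * current_value / target_value)

def scale_up_ceiling(current_replicas: int, percent: int = 100, pods: int = 5) -> int:
    """Largest count allowed in one period; selectPolicy: Max takes the looser policy."""
    by_percent = math.ceil(current_replicas * (1 + percent / 100))
    by_pods = current_replicas + pods
    return max(by_percent, by_pods)

def next_replicas(current: int, rps_per_pod: float, memory_util: float,
                  min_replicas: int = 3, max_replicas: int = 50) -> int:
    # Each metric proposes a count; the largest proposal wins.
    wanted = max(proposal(current, rps_per_pod, 100),   # target 100 RPS per pod
                 proposal(current, memory_util, 80))    # target 80% memory
    if wanted > current:
        wanted = min(wanted, scale_up_ceiling(current))
    return max(min_replicas, min(max_replicas, wanted))

print(next_replicas(4, rps_per_pod=375, memory_util=60))   # 9: RPS wants 15, capped at 4 + 5
print(next_replicas(4, rps_per_pod=75, memory_util=90))    # 5: memory wants 5, RPS only 3
print(next_replicas(40, rps_per_pod=300, memory_util=50))  # 50: clamped by maxReplicas
```

With only 4 pods, the Pods policy (add 5) is looser than the Percent policy (double to 8), which is why both are listed: the fixed step helps small deployments, the percentage helps large ones.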

The four mistakes that cause flap

1. Target too tight. Setting averageValue: 50 when steady-state load is 60 per pod means HPA scales out by 20% as soon as it is applied, and it keeps adding pods every time per-pod load drifts back above 50. Pick a target that represents peak healthy utilization, with headroom: typically 70-80% of the saturation point.

2. No stabilization window. The default behavior.scaleDown.stabilizationWindowSeconds is 300 for a reason. Without it, a brief traffic dip causes scale-down, and the next spike causes a scramble. Keep it at 5 minutes or more for scale-down. Scale-up can be 0.

3. Scaling on a noisy metric. A 1-minute average of latency is noisy at low traffic. Use a longer window or scale on a steadier metric (RPS).

4. Scaling on a metric the application controls. Scaling on CPU while the app spends its time yielding to the event loop is a circular dependency: the metric reflects how busy the pod is internally, and adding pods does not reduce per-pod CPU if the bottleneck is elsewhere (database, downstream API). Pick metrics that represent demand, not internal state.
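To see why the scale-down stabilization window in mistake 2 damps flapping, here is a small Python sketch of the idea: for scale-down, act on the highest recommendation seen during the window rather than the latest one. The class name and structure are illustrative assumptions, and the sketch leaves out the scale-down rate policies.

```python
from collections import deque

class ScaleDownStabilizer:
    """Keeps recent recommendations and returns the highest one in the window."""

    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.history = deque()  # (timestamp_seconds, recommended_replicas)

    def stabilize(self, now: int, recommendation: int) -> int:
        self.history.append((now, recommendation))
        # Drop recommendations that have aged out of the window.
        while self.history and self.history[0][0] < now - self.window:
            self.history.popleft()
        return max(rec for _, rec in self.history)

stabilizer = ScaleDownStabilizer(window_seconds=300)
raw = [10, 6, 6, 6, 6, 6, 6]  # traffic peak at t=0, then a dip
applied = [stabilizer.stabilize(t * 60, rec) for t, rec in enumerate(raw)]
print(applied)  # [10, 10, 10, 10, 10, 10, 6]
```

The replica count stays at 10 for the full five minutes after the peak, so a spike that returns during that window finds the capacity still in place.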

The KEDA alternative

KEDA is an operator that creates and manages HPAs for you through its own CRDs (such as ScaledObject), with a richer set of scalers: Kafka, RabbitMQ, AWS SQS, Redis, Prometheus, cron, even Postgres. For queue-based workloads it is much simpler than wiring up Prometheus Adapter.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker
spec:
  scaleTargetRef:
    name: queue-worker
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/.../my-queue
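      queueLength: "10"     # target 10 messages per replica
      awsRegion: us-east-1  # required by the scaler; matches the region in queueURL
      # AWS credentials are supplied separately (TriggerAuthentication or pod identity), not shown here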

For “scale workers based on queue depth,” KEDA needs a dozen or so lines of YAML, versus a full Prometheus Adapter and custom-metrics setup. It is a strong default for queue-driven systems.
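As a rough mental model of what queueLength: "10" means, the steady-state worker count is the queue depth divided by the per-replica target, clamped to the min and max. The Python sketch below assumes exactly that. The function name is hypothetical, and the real count also passes through the HPA's tolerance and behavior rules.

```python
import math

def workers_for_queue(queue_depth: int, queue_length_target: int = 10,
                      min_replicas: int = 1, max_replicas: int = 30) -> int:
    """Approximate replica count for a queue-depth trigger with a per-replica target."""
    wanted = math.ceil(queue_depth / queue_length_target)
    return max(min_replicas, min(max_replicas, wanted))

print(workers_for_queue(0))     # 1: never below minReplicaCount
print(workers_for_queue(95))    # 10: ceil(95 / 10)
print(workers_for_queue(1000))  # 30: capped at maxReplicaCount
```

This makes the tuning knob obvious: halving queueLength doubles the workers for the same backlog.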

Vertical Pod Autoscaler: the other dimension

HPA scales horizontally (more pods). VPA scales vertically (resize pods). They are not interchangeable: in its automatic modes VPA evicts and recreates pods to resize them, which is disruptive.

VPA is most useful in recommendation mode: it watches usage and suggests right-sized resource requests, and you apply them yourself at deploy time. This is dramatically better than guessing.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # recommendations only

After a few days of real traffic, kubectl describe vpa api shows recommended CPU and memory requests. Copy them into the Deployment and redeploy. Free right-sizing.

Cluster autoscaler: the layer above

HPA decides how many pods. The cluster autoscaler decides how many nodes. For full elasticity, both need to be on. Make sure your cluster has node autoscaling configured (managed Kubernetes services typically offer it, but it often has to be enabled per node pool).

Watch for: pods stuck pending because no node has room. That is a cluster-autoscaler issue, not an HPA issue.

Monitoring the autoscaler itself

Three signals to watch:

kube_horizontalpodautoscaler_status_current_replicas: how many pods HPA actually has.
kube_horizontalpodautoscaler_status_desired_replicas: how many it wants.
kube_horizontalpodautoscaler_status_condition{condition="ScalingLimited", status="true"}: HPA is at min or max and cannot adjust (the series is 1 while the condition holds).

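If desired stays above current for more than a minute, the scale-up is not being carried out. Also compare against the Deployment's available replicas (kube_deployment_status_replicas_available) and look for Pending pods (kube_pod_status_phase{phase="Pending"}): HPA can ask for more capacity than the cluster can give, and unschedulable pods usually mean a node-pool issue.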

Common-sense limits

Always set:

minReplicas ≥ 2 for production. Single-pod services cannot survive any disruption.
maxReplicas sane (10× minReplicas is a reasonable starting point). The field is required, so the risk is setting it carelessly high: a runaway metric can then spawn 1000 pods and burn money.
PodDisruptionBudget alongside HPA so voluntary disruptions (node drains) respect minimums. (See: PDB post once published.)

The takeaway

Default-CPU HPA is the wrong choice for most modern workloads. Pick metrics that represent demand: RPS, queue depth, active connections. Use KEDA for queue-driven systems. Tune the behavior block to prevent flap. Set min/max sanely. Test in staging by simulating load.

The next time the team is debugging “why didn’t the HPA scale,” the question is “what metric is it scaling on?” Nine times out of ten, the answer is “CPU, and that is not the right signal.”

A note from Yojji

The kind of platform-engineering judgment that picks the right autoscaling metric — and prevents the “we have 50 pods but the dashboard says we still need more” flap — is the kind of detail Yojji’s DevOps and platform engineers build into the Kubernetes deployments they ship for clients.

Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. They specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, GCP), and Kubernetes operations — including the autoscaling and capacity-planning work that decides whether a service stays up under load.