
HPA in Production: What the Docs Don't Tell You

5 min read · Kubernetes Docs · Apr 27, 2026 · Reviewed for accuracy

Practitioner — Hands-on experience recommended

Keeping a workload responsive under varying load means changing how many pods serve it. The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pods in a Deployment or StatefulSet based on observed metrics such as CPU utilization. Your application scales out during peak usage and back in when demand drops, so you are not paying for peak capacity all the time.

In its simplest form, the HPA monitors the average CPU utilization across all pods of the target workload, measured as a percentage of each pod's CPU request. Pods without CPU requests give the controller nothing to compute a percentage from, so set requests first. The example here targets 50% average utilization with a minimum of 1 and a maximum of 10 replicas. These are values chosen for the example, not defaults: the minimum defaults to 1 if omitted, but the maximum must always be set. When average utilization rises above the target, the HPA controller adds replicas; when it falls below the target and the replica count is above the minimum, it removes them. To set this up, deploy the Metrics Server, which collects resource metrics from the kubelets and exposes them through the Kubernetes Metrics API.
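The scaling decision described above comes down to one ratio. The following is a minimal sketch of that arithmetic in shell, not the controller's code: desired_replicas is a hypothetical helper, and it deliberately ignores the controller's tolerance band, pod readiness, and missing metrics.

```shell
#!/bin/sh
# Sketch of the HPA replica calculation:
#   desired = ceil(current_replicas * current_utilization / target_utilization)
# clamped to the configured minimum and maximum.
# desired_replicas is a hypothetical helper, not part of kubectl.
desired_replicas() {
  current_replicas=$1   # replicas running now
  current_util=$2       # observed average CPU utilization, % of requests
  target_util=$3        # HPA target, % of requests
  min_replicas=$4
  max_replicas=$5

  # Integer ceiling division: ceil(a / b) == (a + b - 1) / b for positive b.
  product=$((current_replicas * current_util))
  desired=$(( (product + target_util - 1) / target_util ))

  if [ "$desired" -lt "$min_replicas" ]; then
    desired=$min_replicas
  fi
  if [ "$desired" -gt "$max_replicas" ]; then
    desired=$max_replicas
  fi
  echo "$desired"
}

desired_replicas 2 90 50 1 10   # 2 * 90 / 50 = 3.6, rounded up to 4
desired_replicas 4 10 50 1 10   # 4 * 10 / 50 = 0.8, rounded up to 1
```

Rounding up is why a modest overshoot of the target still adds a whole pod, and the clamp is why the maximum replica count acts as a hard ceiling on spend.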

In production, expect the replica count to take a few minutes to settle after a scaling action. Scale-down in particular is held back by a stabilization window that defaults to five minutes. With no clients sending requests, the HPA reports current CPU consumption as 0%; that reflects idle pods, not a broken metrics pipeline. The commands here assume a Kubernetes server at version 1.23 or later, the release in which the autoscaling/v2 API became stable; HPA itself predates it. Run them on a cluster with at least two nodes that are not acting as control plane hosts.
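Part of the settling time comes from how scale-down is damped: the controller keeps its recent replica recommendations and applies the highest one from the stabilization window, so a brief dip in load does not remove pods straight away. A minimal sketch of that rule, assuming the recommendations are already available as a list (stabilized_replicas is a hypothetical helper, not a kubectl command):

```shell
#!/bin/sh
# Sketch of scale-down stabilization: pick the highest replica recommendation
# seen inside the window. stabilized_replicas is a hypothetical helper that
# takes the recent recommendations as arguments.
stabilized_replicas() {
  highest=0
  for recommendation in "$@"; do
    if [ "$recommendation" -gt "$highest" ]; then
      highest=$recommendation
    fi
  done
  echo "$highest"
}

# Recommendations computed during the window while load is falling.
stabilized_replicas 7 6 4 2 1
```

The replica count only drops once the older, higher recommendations have aged out of the window.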

Key takeaways

→Treat the 50% average CPU target as the example's starting point, not a universal optimum; tune it against your pods' CPU requests.
→Set minimum and maximum replicas to control scaling behavior effectively.
→Deploy the Metrics Server so the HPA can read resource metrics through the Metrics API.
→Monitor the stabilization time after scaling actions to manage expectations.
→Use Kubernetes 1.23 or later if you rely on the stable autoscaling/v2 API.

Why it matters

In production, effective autoscaling can lead to significant cost savings and improved application performance. By dynamically adjusting resources, you can handle traffic spikes without over-provisioning.

Code examples

Bash

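# Create an HPA for the php-apache Deployment: target 50% average CPU, 1 to 10 replicas.
# Older kubectl releases spell the CPU flag --cpu-percent=50.
kubectl autoscale deployment php-apache --cpu=50% --min=1 --max=10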
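The same autoscaler can be kept in version control as a manifest instead of being created imperatively. The sketch below prints an autoscaling/v2 HorizontalPodAutoscaler for the example's values. hpa_manifest is a hypothetical helper introduced for illustration, and it assumes the target is a Deployment with the same name as the autoscaler.

```shell
#!/bin/sh
# hpa_manifest is a hypothetical helper that prints an autoscaling/v2 manifest.
# Arguments: name, target CPU utilization (%), min replicas, max replicas.
# Assumption: the scale target is a Deployment named like the HPA.
hpa_manifest() {
  name=$1
  target_cpu=$2
  min_replicas=$3
  max_replicas=$4
  cat <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ${name}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${name}
  minReplicas: ${min_replicas}
  maxReplicas: ${max_replicas}
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: ${target_cpu}
EOF
}

hpa_manifest php-apache 50 1 10
```

To apply it directly, pipe the output into kubectl apply -f -; to review it first, redirect it to a file.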

Bash

# Show current versus target utilization and the current replica count.
kubectl get hpa

Bash

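# Run in a separate terminal to generate load against the php-apache Service; press Ctrl+C to stop.
kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"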