OpsCanary

GPU Autoscaling in Kubernetes: Mastering KEDA with External Scalers

5 min read · CNCF Blog · May 27, 2026 · Reviewed for accuracy
Practitioner: hands-on experience recommended

Kubernetes workloads that demand high computational power benefit from GPU scaling that tracks actual demand. KEDA (Kubernetes Event-driven Autoscaling) scales workloads on external metrics, which makes it a good fit for applications whose replica count should follow GPU usage. With KEDA, a workload scales up when demand spikes and scales down when the GPUs are no longer needed.

To implement GPU autoscaling, you can build a custom DaemonSet that runs on GPU nodes. Each pod in the DaemonSet calls NVML (NVIDIA Management Library) to read local GPU metrics and serves them over gRPC through KEDA's ExternalScaler interface. The KEDA operator connects to your scaler and drives HPA decisions from the metrics it returns. The key trigger parameters are scalerAddress, which defaults to keda-gpu-scaler.gpu-scaler.svc.cluster.local:6000, and profile, which selects a scaling profile such as vllm-inference. minReplicaCount and maxReplicaCount on the ScaledObject bound how far the workload can scale.
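The decisions the scaler has to answer for KEDA can be sketched in plain Go. This is a minimal sketch, not the project's implementation: KEDA's external scaler protocol defines the calls IsActive, GetMetricSpec, and GetMetrics (plus a streaming variant), but the types below are local stand-ins rather than the generated gRPC types, the NVML read is replaced by a stub, and the metric name, the 80% target, and the 5% activation threshold are all assumptions chosen for illustration.

```go
package main

import "fmt"

// GPUSample is one reading from a local GPU. In the article's design the
// values come from NVML; here they are hard-coded stand-ins.
type GPUSample struct {
	Index          int
	UtilizationPct float64
}

// GPUReader abstracts the metric source so NVML can be swapped for a stub.
type GPUReader interface {
	Read() ([]GPUSample, error)
}

// stubReader returns fixed samples (hypothetical data, not real NVML output).
type stubReader struct{ samples []GPUSample }

func (s stubReader) Read() ([]GPUSample, error) { return s.samples, nil }

// MetricSpec and MetricValue mirror the shape of the messages in KEDA's
// external scaler protocol. They are local types, not the generated ones.
type MetricSpec struct {
	MetricName string
	TargetSize float64
}

type MetricValue struct {
	MetricName string
	Value      float64
}

// Scaler holds the logic behind the unary calls KEDA makes.
// All field values used in main are assumptions.
type Scaler struct {
	reader              GPUReader
	metricName          string
	targetPerReplica    float64 // utilization each replica should sit at
	activationThreshold float64 // below this, the workload counts as idle
}

// averageUtilization returns the mean utilization across local GPUs.
func (s Scaler) averageUtilization() (float64, error) {
	samples, err := s.reader.Read()
	if err != nil {
		return 0, err
	}
	if len(samples) == 0 {
		return 0, nil
	}
	sum := 0.0
	for _, g := range samples {
		sum += g.UtilizationPct
	}
	return sum / float64(len(samples)), nil
}

// IsActive tells KEDA whether the workload should be scaled up from zero.
func (s Scaler) IsActive() (bool, error) {
	avg, err := s.averageUtilization()
	return avg > s.activationThreshold, err
}

// GetMetricSpec tells KEDA which metric the HPA should target and at what value.
func (s Scaler) GetMetricSpec() MetricSpec {
	return MetricSpec{MetricName: s.metricName, TargetSize: s.targetPerReplica}
}

// GetMetrics returns the current value of that metric.
func (s Scaler) GetMetrics() (MetricValue, error) {
	avg, err := s.averageUtilization()
	return MetricValue{MetricName: s.metricName, Value: avg}, err
}

func main() {
	s := Scaler{
		reader:              stubReader{samples: []GPUSample{{0, 90}, {1, 70}}},
		metricName:          "gpu_utilization", // hypothetical metric name
		targetPerReplica:    80,
		activationThreshold: 5,
	}
	active, _ := s.IsActive()
	value, _ := s.GetMetrics()
	fmt.Printf("active=%v spec=%+v value=%+v\n", active, s.GetMetricSpec(), value)
}
```

In a real scaler these methods would sit behind the gRPC server generated from KEDA's externalscaler.proto, and the reader would wrap NVML calls. Keeping the metric source behind an interface lets the scaling logic be unit-tested on machines without a GPU.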

In production, integrating KEDA with your existing GPU workloads can save significant resources, but test the setup thoroughly first, especially the communication between the DaemonSet and KEDA. Install the GPU scaler with the provided Helm command and define your ScaledObject with the provided YAML configuration. As of May 27, 2026, this setup is still evolving, so watch the KEDA project for changes that may affect your implementation.

Key takeaways

  • Build a custom DaemonSet to read GPU metrics using NVML.
  • Serve GPU metrics over gRPC with KEDA's ExternalScaler interface.
  • Configure scaling limits with minReplicaCount and maxReplicaCount.
  • Use the provided Helm command for easy deployment of the GPU scaler.
  • Stay updated on changes in KEDA for optimal performance.

Why it matters

Effective GPU autoscaling can reduce costs and improve performance for compute-intensive applications by matching allocated GPU resources to real-time demand.

Code examples

Bash
helm install gpu-scaler deploy/helm/keda-gpu-scaler \
  --namespace gpu-scaler --create-namespace
YAML
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-gpu-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment   # workload to scale
  minReplicaCount: 0        # allows scale to zero
  maxReplicaCount: 8        # upper scaling limit
  triggers:
    - type: external        # KEDA external scaler, reached over gRPC
      metadata:
        scalerAddress: "keda-gpu-scaler.gpu-scaler.svc.cluster.local:6000"
        profile: "vllm-inference"
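To see how minReplicaCount and maxReplicaCount bound the result, the replica calculation can be sketched as follows. This is a simplified model under stated assumptions: it uses the HPA rule for average-value targets (desired replicas is the metric total divided by the per-replica target, rounded up), ignores HPA tolerance and stabilization windows, and treats scale to zero as gated by the scaler's active signal. The function name and the numbers in main are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas estimates the replica count for a metric total and a
// per-replica target, clamped to [minReplicas, maxReplicas].
// When the scaler reports inactive, the workload drops to minReplicas
// (which may be 0). When active, at least one replica is kept.
func desiredReplicas(active bool, metricTotal, targetPerReplica float64, minReplicas, maxReplicas int) int {
	if !active || targetPerReplica <= 0 {
		return minReplicas
	}
	n := int(math.Ceil(metricTotal / targetPerReplica))
	floor := minReplicas
	if floor < 1 {
		floor = 1 // an active workload needs at least one replica
	}
	if n < floor {
		n = floor
	}
	if n > maxReplicas {
		n = maxReplicas
	}
	return n
}

func main() {
	// Bounds taken from the ScaledObject above; metric values are made up.
	const minReplicas, maxReplicas = 0, 8
	fmt.Println(desiredReplicas(false, 0, 80, minReplicas, maxReplicas))
	fmt.Println(desiredReplicas(true, 340, 80, minReplicas, maxReplicas))
	fmt.Println(desiredReplicas(true, 900, 80, minReplicas, maxReplicas))
}
```

With these bounds, an idle workload falls to zero replicas and a demand spike stops at eight, however high the metric climbs.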
Bash
go test -v -tags=e2e -race ./tests/e2e/

When NOT to use this

The official docs don't call out specific anti-patterns here. Use your judgment based on your scale and requirements.
