OpsCanary

Mastering AKS Upgrades: Strategies for Zero Downtime

5 min read | Source: Microsoft Learn | Apr 21, 2026
Level: Practitioner (hands-on experience recommended)

Upgrading AKS clusters is crucial for maintaining security, performance, and access to new features. However, poorly planned upgrades can lead to downtime and service disruptions. By understanding the upgrade options available, you can minimize risks and ensure your applications remain available during the process.

Before an upgrade starts, AKS runs pre-upgrade validations on the cluster, including checks for breaking API changes, Kubernetes version compatibility, and Pod Disruption Budget (PDB) configuration. Four settings shape how the upgrade then proceeds. maxSurge sets how many extra (surge) nodes AKS adds during the upgrade, and maxUnavailable caps how many existing nodes can be unavailable at once. A Pod Disruption Budget limits how many pods of a workload can be down at the same time, and the node drain timeout sets how long AKS waits for pods to be evicted from a node.
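To make the maxSurge setting concrete, here is a minimal sketch of how a percentage value translates into surge nodes. It relies on the AKS behavior that fractional surge nodes are rounded up. The function names `surge_nodes` and `set_upgrade_settings` are hypothetical helpers introduced for illustration; the `az aks nodepool update` flags are the same ones used in the code examples in this article.

```shell
#!/usr/bin/env bash

# surge_nodes POOL_SIZE PERCENT
# Prints ceil(POOL_SIZE * PERCENT / 100), i.e. the surge node count
# for a percentage maxSurge, with fractional nodes rounded up.
surge_nodes() {
  local pool_size=$1 percent=$2
  echo $(( (pool_size * percent + 99) / 100 ))
}

# set_upgrade_settings RESOURCE_GROUP CLUSTER POOL MAX_SURGE DRAIN_TIMEOUT_MINUTES
# Hypothetical wrapper: applies max surge and drain timeout to one node pool.
# MAX_SURGE may be a node count ("1") or a percentage ("33%").
set_upgrade_settings() {
  az aks nodepool update \
    --resource-group "$1" \
    --cluster-name "$2" \
    --name "$3" \
    --max-surge "$4" \
    --drain-timeout "$5"
}
```

For example, `surge_nodes 10 33` prints 4: a 33% surge on a 10-node pool works out to 3.3 nodes, which rounds up to 4. Larger surge values generally finish sooner but move more workloads at once.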

Be cautious with the force upgrade option in production. It bypasses PDB constraints and can drain all pods simultaneously, which causes service disruption, so review your PDB settings before you use it. The max blocked nodes feature requires the Azure CLI aks-preview extension, version 18.0.0b9 or later. Staggering upgrades with a node soak time also helps: pausing between nodes gives workloads time to stabilize before the next node is drained.
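One way to review PDB settings before a forced upgrade is to list the budgets that currently allow zero disruptions, since those are the ones a normal drain would wait on. The sketch below assumes `kubectl` is pointed at the cluster; `list_pdbs` and `blocking_pdbs` are hypothetical helper names, and the filter reads the standard `status.disruptionsAllowed` field of each PDB.

```shell
#!/usr/bin/env bash

# list_pdbs
# Emits one line per PDB: namespace<TAB>name<TAB>disruptionsAllowed
list_pdbs() {
  kubectl get pdb --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}'
}

# blocking_pdbs
# Reads the tab-separated lines from stdin and prints namespace/name
# for every PDB that currently allows zero disruptions.
blocking_pdbs() {
  awk -F '\t' '$3 == 0 { print $1 "/" $2 }'
}
```

Run it as `list_pdbs | blocking_pdbs`. An empty result means no PDB is currently at its limit. To see which aks-preview version is installed, `az extension show --name aks-preview --query version --output tsv` prints it.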

Key takeaways

  • Configure maxSurge to speed up upgrades while being mindful of workload disruptions.
  • Set maxUnavailable to manage capacity effectively during upgrades.
  • Use Pod Disruption Budgets to limit the number of pods down during upgrades.
  • Adjust node drain timeout to control pod eviction wait duration.
  • Install Azure CLI aks-preview extension version 18.0.0b9 or later before using the max blocked nodes feature.
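The Pod Disruption Budget takeaway can be sketched as a small manifest generator. This is an illustration under assumptions: `pdb_manifest` is a hypothetical helper, and it assumes the workload's pods carry an `app` label; adjust the selector to match your own labels.

```shell
#!/usr/bin/env bash

# pdb_manifest NAME APP_LABEL MIN_AVAILABLE
# Prints a policy/v1 PodDisruptionBudget that keeps at least MIN_AVAILABLE
# pods with the label app=APP_LABEL running during voluntary disruptions
# such as node drains.
pdb_manifest() {
  local name=$1 app=$2 min_available=$3
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${name}
spec:
  minAvailable: ${min_available}
  selector:
    matchLabels:
      app: ${app}
EOF
}
```

To apply it, pipe the output to kubectl, for example `pdb_manifest web-pdb web 2 | kubectl apply -f -`. Setting minAvailable equal to the replica count leaves no room for eviction, so a drain would wait until the node drain timeout expires.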

Why it matters

Properly managing AKS upgrades can significantly reduce downtime and improve application reliability, which is critical for maintaining user trust and operational efficiency.

Code examples

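```bash
# Force an upgrade, bypassing PDB constraints.
# The --upgrade-override-until value is an example UTC timestamp; replace it with your own.
az aks upgrade \
  --name "$CLUSTER_NAME" \
  --resource-group "$RESOURCE_GROUP_NAME" \
  --kubernetes-version "$KUBERNETES_VERSION" \
  --enable-force-upgrade \
  --upgrade-override-until 2023-10-01T13:00:00Z
```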
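```bash
# Cordon nodes that cannot be drained, allow up to 2 blocked nodes,
# and wait up to 30 minutes for pod eviction on each node.
az aks nodepool update \
  --resource-group <resource-group-name> \
  --cluster-name <cluster-name> \
  --name <node-pool-name> \
  --undrainable-node-behavior Cordon \
  --max-blocked-nodes 2 \
  --drain-timeout 30
```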
```bash
# The same settings with concrete example names, a max surge of one node,
# and a 5-minute drain timeout.
az aks nodepool update \
  --cluster-name jizenMC1 \
  --name nodepool1 \
  --resource-group jizenTestMaxBlockedNodesRG \
  --max-surge 1 \
  --undrainable-node-behavior Cordon \
  --max-blocked-nodes 2 \
  --drain-timeout 5
```

When NOT to use this

The official docs don't call out specific anti-patterns here. Use your judgment based on your scale and requirements.
