Mastering AKS Upgrades: Strategies for Zero Downtime
Upgrading AKS clusters is crucial for maintaining security, performance, and access to new features. However, poorly planned upgrades can lead to downtime and service disruptions. By understanding the upgrade options available, you can minimize risks and ensure your applications remain available during the process.
AKS runs pre-upgrade validations to confirm the cluster is healthy. It checks for breaking API changes, Kubernetes version compatibility for the upgrade, Pod Disruption Budget (PDB) configurations, and more. Key parameters to configure include maxSurge, which controls the number of surge nodes added during an upgrade, and maxUnavailable, which limits the number of nodes that can be unavailable at once. Additionally, a Pod Disruption Budget limits how many pods can go down during an upgrade, and the node drain timeout controls how long AKS waits for pod eviction.
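When maxSurge is given as a percentage, it helps to know how many surge nodes that means for a given pool size. The sketch below assumes fractional nodes are rounded up, which is how the AKS API describes percentage values. The function name surge_nodes is hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: estimate the surge node count for a percentage-based maxSurge.
# Assumption: a fractional node count is rounded up to the next whole node.
surge_nodes() {
  local pool_size=$1 percent=$2
  # Integer ceiling of pool_size * percent / 100
  echo $(( (pool_size * percent + 99) / 100 ))
}

surge_nodes 10 33   # prints 4 (3.3 nodes rounded up)
```

A higher value shortens the upgrade but drains more nodes at once, so more workloads are disrupted at the same time.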
In production, you need to be cautious with the force upgrade option. It bypasses PDB constraints and can drain all pods simultaneously, leading to service disruption. Always check your PDB settings before using this option. Also, ensure you have the Azure CLI aks-preview extension version 18.0.0b9 or later to utilize the max blocked nodes feature effectively. Remember, staggered upgrades with node soak time can help minimize downtime and improve user experience.
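One way to check PDB settings before an upgrade is to list the budgets that currently allow zero disruptions, since those are the ones that would block a normal node drain. The following is a minimal sketch, not an official procedure. The helper name blocking_pdbs is hypothetical, and the kubectl invocation in the comment assumes you have access to the cluster.

```shell
#!/usr/bin/env bash
# Sketch: print PDBs that currently allow zero disruptions.
# Input on stdin: one line per PDB, "<namespace> <name> <disruptionsAllowed>".
blocking_pdbs() {
  # NF >= 3 skips blank or incomplete lines
  awk 'NF >= 3 && $3 == 0 { print $1 "/" $2 }'
}

# Hypothetical usage against a live cluster (requires kubectl access):
#   kubectl get pdb --all-namespaces --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#     | blocking_pdbs
```

Any budget this prints is worth reviewing before you decide whether a force upgrade is acceptable.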
Key takeaways
- Configure maxSurge to speed up upgrades while being mindful of workload disruptions.
- Set maxUnavailable to manage capacity effectively during upgrades.
- Use Pod Disruption Budgets to limit the number of pods down during upgrades.
- Adjust node drain timeout to control pod eviction wait duration.
- Install Azure CLI aks-preview extension version 18.0.0b9 or later to use the max blocked nodes feature.
Why it matters
Properly managing AKS upgrades can significantly reduce downtime and improve application reliability, which is critical for maintaining user trust and operational efficiency.
Code examples
az aks upgrade \
  --name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --kubernetes-version $KUBERNETES_VERSION \
  --enable-force-upgrade \
  --upgrade-override-until 2023-10-01T13:00:00Z

az aks nodepool update \
  --resource-group <resource-group-name> \
  --cluster-name <cluster-name> \
  --name <node-pool-name> \
  --undrainable-node-behavior Cordon \
  --max-blocked-nodes 2 \
  --drain-timeout 30

az aks nodepool update \
  --cluster-name jizenMC1 \
  --name nodepool1 \
  --resource-group jizenTestMaxBlockedNodesRG \
  --max-surge 1 \
  --undrainable-node-behavior Cordon \
  --max-blocked-nodes 2 \
  --drain-timeout 5

When NOT to use this
The official docs don't call out specific anti-patterns here. Use your judgment based on your scale and requirements.