Fix AKS Node Pool Version Drift | Lensix

TL;DR

This check flags AKS clusters where node pools run mismatched Kubernetes versions, which leads to scheduling surprises, API incompatibilities, and unsupported version drift. Align your node pools to a single minor version (or keep them at most one minor behind the control plane) and gate version changes through IaC.

Running an Azure Kubernetes Service cluster usually means juggling several node pools: a system pool for cluster-critical add-ons, one or more user pools for workloads, maybe a spot pool for batch jobs. It is easy for those pools to drift apart over time. You upgrade one pool during an incident, forget the others, and six weeks later half your fleet is on 1.28 while the rest sits on 1.30.

The aks_nodepoolversion check catches exactly that situation: node pools within the same cluster running different Kubernetes versions.

What this check detects

Every AKS node pool has its own Kubernetes version, separate from the control plane version. The control plane (the managed API server, scheduler, and controller manager) is upgraded independently from the agent pools that run your kubelet and container runtime.

This check inspects all node pools in a cluster and compares their orchestratorVersion values. If two or more pools report different versions, the check fails.

You can see the same thing manually:

az aks nodepool list \
  --resource-group my-rg \
  --cluster-name my-aks \
  --query "[].{Name:name, Version:orchestratorVersion, Mode:mode}" \
  --output table

Output that triggers this finding looks like this:

Name        Version    Mode
----------  ---------  ------
systempool  1.30.3     System
userpool    1.30.3     User
spotpool    1.28.9     User

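Here spotpool is two minor versions behind the rest of the cluster. That is a compatibility risk, it sits at the very edge of the version skew AKS allows, and a minor version that old may already be outside the AKS supported version window.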

Note: AKS supports node pools that are at most two minor versions behind the control plane, and they can never be ahead of it. Crossing that boundary blocks upgrades entirely until you reconcile the versions.
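
The skew rule in the note above can be sketched as a small shell check. This is a minimal illustration, not part of the az CLI: minor_skew and skew_ok are hypothetical helper names, and the code assumes versions shaped like MAJOR.MINOR.PATCH that share the same major version.

```shell
# Hypothetical helpers, not part of the az CLI.
# Both functions assume versions shaped like MAJOR.MINOR.PATCH that share
# the same major version.

# Print how many minor versions the pool ($2) trails the control plane ($1).
# A negative number means the pool is ahead of the control plane.
minor_skew() {
  control_minor=$(echo "$1" | cut -d. -f2)
  pool_minor=$(echo "$2" | cut -d. -f2)
  echo $((control_minor - pool_minor))
}

# Succeed when the pool is not ahead and at most MAX_SKEW minors behind.
# MAX_SKEW defaults to the two-minor window from the note above.
skew_ok() {
  skew=$(minor_skew "$1" "$2")
  [ "$skew" -ge 0 ] && [ "$skew" -le "${MAX_SKEW:-2}" ]
}

# Compare a few pool versions against a 1.30.3 control plane.
for pool_version in 1.30.3 1.28.9 1.27.9; do
  if skew_ok 1.30.3 "$pool_version"; then
    echo "$pool_version: within skew of control plane 1.30.3"
  else
    echo "$pool_version: outside skew of control plane 1.30.3"
  fi
done
```

With these inputs, 1.30.3 and 1.28.9 pass and 1.27.9 fails. Setting MAX_SKEW=1 turns the same check into the stricter "at most one minor behind" policy.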

Why it matters

Kubernetes promises compatibility within a narrow version skew window, not across arbitrary gaps. When your node pools drift apart, you start operating outside the tested envelope, and the failure modes are rarely obvious.

API and feature skew

A kubelet on an older node may not understand features or API behaviors that newer workloads expect. API removals happen at the control plane, so they apply cluster-wide, but node-level behavior does not: the kubelet, kube-proxy, and the container runtime all follow the node's version. A Deployment that schedules onto a 1.28 node can therefore behave differently from the same Deployment on a 1.30 node. These are the bugs that take a full day to reproduce because they depend on which node the pod landed on.

Inconsistent behavior under load

Kubelet defaults, cgroup handling, and container runtime versions change between releases. A pod that runs fine on the newer pool may hit subtly different memory accounting or eviction thresholds on the older pool. When the autoscaler shifts workloads between pools during a spike, you get nondeterministic incidents.

Loss of support and blocked upgrades

If a node pool falls outside the supported skew, Azure will refuse further control plane upgrades until you bring the pool forward. During an active CVE response, that delay is the difference between patching in an hour and being stuck for a day untangling version dependencies.

Warning: Old node pool versions also fall out of the AKS supported version window. Once a minor version is deprecated, you lose security patches and Azure support for those nodes while they keep serving production traffic.

Audit and compliance findings

Frameworks like CIS Azure and internal patch-management policies expect a consistent, supported Kubernetes version across the cluster. Mixed versions show up as findings and force you to write exceptions you would rather not own.

How to fix it

The fix is to bring lagging node pools up to the target version. The recommended target is the control plane version, with all pools matching each other.

Step 1: Check the control plane version first

az aks show \
  --resource-group my-rg \
  --name my-aks \
  --query "kubernetesVersion" \
  --output tsv

Node pools should never be upgraded past the control plane. If the control plane itself is behind, upgrade it first:

az aks upgrade \
  --resource-group my-rg \
  --name my-aks \
  --control-plane-only \
  --kubernetes-version 1.30.3

Step 2: Upgrade the lagging node pool

Danger: A node pool upgrade cordons and drains nodes in batches sized by the pool's max surge setting, evicting pods as it goes. On production workloads, confirm you have PodDisruptionBudgets in place and enough surge capacity before running this, or you risk dropping availability during the rollout.

az aks nodepool upgrade \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name spotpool \
  --kubernetes-version 1.30.3

Repeat for any other lagging pool. AKS rolls nodes with a configurable surge so the upgrade is incremental rather than all-at-once.

Step 3: Tune the surge for safer rollouts

Set max-surge so AKS adds new nodes before draining old ones. A higher surge is faster but costs more during the upgrade window:

az aks nodepool update \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name spotpool \
  --max-surge 33%

Warning: A surge value spins up extra nodes for the duration of the upgrade, so you pay for additional VMs temporarily. On large pools, even 33% surge can mean a noticeable spike on your bill.
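
To estimate that temporary cost before an upgrade, the arithmetic can be sketched in shell. This is a rough illustration: surge_nodes is a hypothetical helper, and it assumes a percentage surge is applied to the pool's node count at upgrade time with fractional nodes rounded up.

```shell
# Hypothetical helper: surge_nodes NODE_COUNT PERCENT
# Prints the number of extra nodes a percentage surge adds, rounding any
# fractional node up (integer ceiling of NODE_COUNT * PERCENT / 100).
surge_nodes() {
  echo $(( ($1 * $2 + 99) / 100 ))
}

surge_nodes 10 33   # prints 4  (3.3 rounded up)
surge_nodes 30 33   # prints 10 (9.9 rounded up)
```

Multiply the result by the hourly price of the pool's VM size and the expected upgrade duration to get a ballpark figure for the spike.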

Step 4: Verify alignment

az aks nodepool list \
  --resource-group my-rg \
  --cluster-name my-aks \
  --query "[].{Name:name, Version:orchestratorVersion}" \
  --output table

All pools should now report the same version.

Fixing it in Terraform

If you manage AKS with Terraform, pin the version explicitly on both the cluster and each node pool, and use the same variable so they cannot drift:

variable "k8s_version" {
  type    = string
  default = "1.30.3"
}

resource "azurerm_kubernetes_cluster" "this" {
  name                = "my-aks"
  resource_group_name = azurerm_resource_group.this.name
  location            = azurerm_resource_group.this.location
  dns_prefix          = "myaks"
  kubernetes_version  = var.k8s_version

  default_node_pool {
    name                 = "systempool"
    vm_size              = "Standard_D4s_v5"
    orchestrator_version = var.k8s_version
    node_count           = 3
  }

  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "spot" {
  name                  = "spotpool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.this.id
  vm_size               = "Standard_D4s_v5"
  orchestrator_version  = var.k8s_version
  node_count            = 2
}

Driving every orchestrator_version from a single variable means a version bump is one PR, and a plan will surface any pool left behind.
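
One way to keep that single-variable convention honest is a small lint step that flags version attributes assigned a quoted literal. The sketch below is an assumption-laden illustration: hardcoded_versions is a hypothetical helper, it only understands the simple one-attribute-per-line layout shown above, and it would also flag a quoted interpolation such as "${var.k8s_version}".

```shell
# Hypothetical lint helper: print (with line numbers) every
# kubernetes_version or orchestrator_version attribute whose value starts
# with a quote, meaning it is a literal instead of a variable reference.
# Reads the files given as arguments, or stdin when none are given.
# Exits non-zero when nothing is found, following grep's convention.
hardcoded_versions() {
  grep -nE '^[[:space:]]*(kubernetes_version|orchestrator_version)[[:space:]]*=[[:space:]]*"' "$@"
}

# Demo on two sample lines: only the literal assignment is reported.
printf '  orchestrator_version = "1.28.9"\n  orchestrator_version = var.k8s_version\n' |
  hardcoded_versions
```

In a pipeline you might run it as hardcoded_versions *.tf and fail the build when it prints anything.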

Tip: Wrap the upgrade in a script that loops over az aks nodepool list, finds any pool whose version does not match the control plane, and upgrades it. Run that script on a schedule and you turn drift remediation into a no-op instead of a manual checklist.
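
A minimal sketch of such a script follows. The function names lagging_pools and remediate are hypothetical, and the sketch assumes the cluster and its pools both report full patch versions (for example 1.30.3, not the alias 1.30) so that a plain string comparison is meaningful.

```shell
# Read "name<TAB>version" lines on stdin and print the names of pools
# whose version differs from the target version passed as $1.
lagging_pools() {
  awk -F'\t' -v target="$1" '$2 != "" && $2 != target { print $1 }'
}

# Hypothetical wrapper: remediate RESOURCE_GROUP CLUSTER_NAME
# Upgrades every pool that does not match the control plane version.
# Defined but not called here, because it changes real infrastructure.
remediate() {
  rg=$1
  cluster=$2
  target=$(az aks show --resource-group "$rg" --name "$cluster" \
    --query kubernetesVersion --output tsv)
  az aks nodepool list --resource-group "$rg" --cluster-name "$cluster" \
    --query "[].[name, orchestratorVersion]" --output tsv |
    lagging_pools "$target" |
    while read -r pool; do
      echo "Upgrading $pool to $target"
      # Detach stdin so the upgrade cannot consume the remaining pool names.
      az aks nodepool upgrade --resource-group "$rg" --cluster-name "$cluster" \
        --name "$pool" --kubernetes-version "$target" < /dev/null
    done
}

# Offline demo of the filtering step with sample data: prints "spotpool".
printf 'systempool\t1.30.3\nspotpool\t1.28.9\n' | lagging_pools 1.30.3
```

Calling remediate my-rg my-aks would run the real upgrades, one pool at a time. On an IaC-managed cluster, consider running only the detection half on a schedule and raising the version through a pull request, so the script does not fight your Terraform state.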

How to prevent it from happening again

Drift creeps in when version changes happen by hand. Close that gap by making version a managed, reviewed value.

Use a single source of truth for the version

As shown above, bind the cluster and all node pools to one variable in Terraform, Bicep, or your tooling of choice. Manual az aks nodepool upgrade runs against an IaC-managed cluster will be flagged as drift on the next plan.

Add a CI/CD gate

Run a check in your pipeline that fails if pool versions diverge. A simple script works well in a pull request workflow:

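#!/usr/bin/env bash
set -euo pipefail

# Distinct Kubernetes versions across all node pools in the cluster.
versions=$(az aks nodepool list \
  --resource-group my-rg \
  --cluster-name my-aks \
  --query "[].orchestratorVersion" -o tsv | sort -u)

# Fail closed if the query returned nothing (wrong names, missing access).
if [ -z "$versions" ]; then
  echo "FAIL: no node pool versions returned"
  exit 1
fi

count=$(printf '%s\n' "$versions" | wc -l)

if [ "$count" -gt 1 ]; then
  echo "FAIL: node pools on multiple versions:"
  echo "$versions"
  exit 1
fi

echo "OK: all node pools on $versions"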

Enforce with Azure Policy

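Azure Policy has built-in definitions for AKS, including one that restricts clusters to allowed Kubernetes versions. Assigning it makes any non-conforming version visible as a compliance violation and can block deployment outright, depending on the effect you choose. The parameter name passed in --params must match the one declared by the definition you assign; allowedVersions below is a placeholder.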

az policy assignment create \
  --name "aks-allowed-versions" \
  --scope "/subscriptions/<sub-id>/resourceGroups/my-rg" \
  --policy "<allowed-k8s-versions-policy-definition-id>" \
  --params '{"allowedVersions": {"value": ["1.30.3"]}}'

Adopt cluster auto-upgrade channels

AKS auto-upgrade channels keep the control plane and node pools moving together on a schedule, which removes the human step that causes drift in the first place:

az aks update \
  --resource-group my-rg \
  --name my-aks \
  --auto-upgrade-channel patch \
  --node-os-upgrade-channel NodeImage

Note: The patch channel keeps the cluster on its current minor version while applying patches automatically. Use stable if you also want minor version upgrades managed for you. Pair this with a maintenance window so upgrades land at predictable, low-traffic times.

Best practices

Keep all node pools on one version. Treat a mixed-version cluster as a temporary state that exists only during a rolling upgrade, never as a steady state.
Upgrade the control plane first, then the pools. Node pools can lag the control plane but never lead it. Always move the control plane forward before the agents.
Stay within the supported version window. Track the AKS supported versions list and plan upgrades before your version is deprecated, not after.
Configure PodDisruptionBudgets. They are what keep an upgrade drain from taking down a workload. Without them, a node drain can evict every replica at once.
Set a maintenance window. Combine auto-upgrade channels with a defined window so version changes happen on your terms, not in the middle of peak traffic.
Test upgrades in a non-production cluster first. Minor version bumps occasionally change behavior. Validate against staging before rolling into production.
Continuously monitor for drift. A one-time fix does not stay fixed. Let Lensix watch for reappearing version mismatches so you catch the next drift before it blocks an upgrade or trips an audit.

Version consistency is one of those things nobody notices until it breaks an urgent security upgrade. A few minutes spent aligning your node pools and gating future changes saves you from discovering the problem mid-incident.
