
GKE Node Auto-Repair Disabled: Why It Matters and How to Fix It

Learn why disabled GKE node auto-repair puts your cluster at risk, how to enable it via gcloud and Terraform, and how to enforce it in CI to prevent drift.

TL;DR

This check flags GKE node pools running without automatic node repair, which means unhealthy nodes stay broken until someone notices and acts. Turn it on with gcloud container node-pools update POOL --cluster CLUSTER --enable-autorepair so GKE drains and recreates failed nodes for you.

Node auto-repair is one of those GKE features that does nothing visible until the day it saves you. When it is off, a node that drops out of the cluster, runs out of disk, or stops reporting a healthy kubelet status just sits there, soaking up scheduling decisions and quietly degrading your workloads. This check catches node pools where that safety net is missing.


What this check detects

The gke_noautorepair check inspects each node pool in your GKE clusters and flags any pool where the autoRepair management setting is set to false. Auto-repair is configured per node pool, not per cluster, so it is entirely possible to have one cluster where some pools are protected and others are not.

When auto-repair is enabled, GKE continuously runs health checks against each node. If a node reports an unhealthy status for a sustained period, or fails to report any status at all over consecutive checks, GKE marks it for repair. Repair means the node is drained and then recreated from the node pool template.

Note: GKE considers a node unhealthy if it reports a NotReady status on consecutive checks for roughly 10 minutes, or reports no status at all over a similar window. A boot disk that stays out of space for an extended period (roughly 30 minutes) also triggers a repair.
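
To get a rough view of what those health checks would see, you can list the nodes that are not currently Ready. The sketch below is a minimal filter over kubectl output. The function name not_ready_nodes is made up for illustration, and it shows only the current status, not how long a node has been unhealthy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints the names of nodes whose STATUS column does not begin with "Ready".
# A cordoned but healthy node shows "Ready,SchedulingDisabled" and is not printed.
not_ready_nodes() {
  awk 'NF >= 2 && $2 !~ /^Ready/ { print $1 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every node currently reports Ready. It says nothing about whether auto-repair is enabled.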

Node pools created through the Google Cloud console have auto-repair enabled by default, and Autopilot clusters manage it for you. The pools that fail this check are usually older pools, pools created via the API or Terraform without the setting specified, or pools where someone explicitly disabled it during troubleshooting and never turned it back on.


Why it matters

A Kubernetes cluster is only as reliable as the nodes underneath it. When a node goes unhealthy and nothing repairs it, you get a slow bleed of capacity and a pile of operational pain.

Workloads pile onto fewer healthy nodes

When a node becomes NotReady, the scheduler stops placing new pods on it, and the control plane eventually evicts the pods that were running there (by default, about five minutes after the node is tainted as not ready or unreachable). Those pods reschedule onto the remaining healthy nodes. If you do not notice and the node is never repaired, you are running the same workload on less hardware. During a traffic spike, that headroom you thought you had is gone.

Silent failures become incidents at 3am

Without auto-repair, the recovery path is manual. Someone has to notice the node is unhealthy, decide it is not coming back, cordon and drain it, and then delete or recreate it. If your team is not staring at node metrics around the clock, the gap between a node failing and a human responding can be hours. That gap is where outages live.

Warning: Auto-repair is not a replacement for proper pod-level resilience. If a single node failing causes user-facing downtime, you also have a problem with replica counts, pod disruption budgets, or anti-affinity rules. Auto-repair limits the blast radius; it does not eliminate it.
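
As one piece of that pod-level resilience, a pod disruption budget can be created directly with kubectl. This is a sketch only: the wrapper name create_pdb, the budget name web-pdb, the label app=web, and the threshold of 2 are placeholders, not values from this article.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: create a PodDisruptionBudget that limits voluntary
# evictions (such as node drains) for the pods matching a label selector.
# Arguments: budget name, label selector, minimum number of available pods.
create_pdb() {
  local name="$1" selector="$2" min_available="$3"
  kubectl create poddisruptionbudget "$name" \
    --selector="$selector" \
    --min-available="$min_available"
}

# Usage (placeholder values):
#   create_pdb web-pdb app=web 2
```

A budget only helps if the workload runs more replicas than the minimum it demands. With a single replica and a minimum of 1, a drain cannot evict the pod voluntarily at all.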

Disk pressure and kubelet drift

Long-lived nodes accumulate problems. Container image layers fill the boot disk, log files grow, and occasionally the kubelet enters a bad state that a restart will not fix. Auto-repair recreates the node from a clean template, which clears all of that. Pools without it tend to develop a handful of "weird" nodes that engineers learn to avoid scheduling onto, which is not a healthy pattern.


How to fix it

Auto-repair is a node pool management setting, so you enable it per pool. The change is non-disruptive on its own, since enabling the flag does not recreate any nodes immediately. It only changes what GKE does when a node later becomes unhealthy.

Find the affected node pools

List your clusters and inspect the management config for each node pool:

gcloud container clusters list --format="table(name,location)"

gcloud container node-pools list \
  --cluster CLUSTER_NAME \
  --location LOCATION \
  --format="table(name,management.autoRepair,management.autoUpgrade)"

Any pool showing autoRepair as empty or False needs fixing.
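
If you would rather have the failing pools picked out for you, a small filter can do it. The sketch below assumes gcloud's csv output prints the boolean as True or False and leaves an unset field empty. The function name pools_missing_autorepair is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "pool_name,autoRepair" CSV rows on stdin and
# prints the names of pools whose autoRepair value is anything other than
# true (compared case-insensitively; an empty column counts as disabled).
pools_missing_autorepair() {
  awk -F',' '$1 != "" && tolower($2) != "true" { print $1 }'
}

# Usage (CLUSTER_NAME and LOCATION are placeholders):
#   gcloud container node-pools list \
#     --cluster CLUSTER_NAME --location LOCATION \
#     --format="csv[no-heading](name,management.autoRepair)" \
#     | pools_missing_autorepair
```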

Enable auto-repair with gcloud

gcloud container node-pools update POOL_NAME \
  --cluster CLUSTER_NAME \
  --location LOCATION \
  --enable-autorepair

Recent gcloud versions accept --location for both zonal and regional clusters. On older versions, use --zone for a zonal cluster and --region for a regional one.

Note: Release channel enrollment affects these settings. On a cluster enrolled in a release channel, GKE turns on auto-upgrade and auto-repair for node pools by default and manages them together. On a cluster with no release channel (a static version), you can enable auto-repair independently of auto-upgrade.

Enable it in the console

  1. Open Kubernetes Engine → Clusters and select your cluster.
  2. Go to the Nodes tab and click the node pool name.
  3. Click Edit.
  4. Under Management, check Enable auto-repair.
  5. Click Save.

Fix it in Terraform

If you manage GKE with Terraform, set auto_repair in the node pool management block. This is the durable fix, because the next terraform apply would otherwise revert a manual change.

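resource "google_container_node_pool" "primary" {
  name     = "primary-pool"
  cluster  = google_container_cluster.primary.id
  location = "us-central1"

  # Management settings are per node pool, so set them on every pool you define.
  management {
    auto_repair  = true # recreate nodes that fail GKE health checks
    auto_upgrade = true # keep the node version up to date
  }
}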

Tip: If you have several pools across many clusters, loop over them in a small script rather than clicking through the console. Pipe the output of gcloud container node-pools list into a loop that runs the update command for each pool where autoRepair is false.
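
The loop described in this tip could look like the sketch below. It covers every cluster in the active gcloud project and assumes the csv output prints the boolean as True or False. The function name enable_autorepair_everywhere and the DRY_RUN variable are hypothetical, so treat it as a starting point rather than a finished tool.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: enable auto-repair on every node pool in the active
# project that does not already report autoRepair as true.
# Set DRY_RUN=1 to print the update commands instead of running them.
enable_autorepair_everywhere() {
  local cluster location pool repair
  # One "cluster,location" row per cluster in the active project.
  gcloud container clusters list --format="csv[no-heading](name,location)" |
  while IFS=',' read -r cluster location; do
    [ -n "$cluster" ] || continue
    # One "pool,autoRepair" row per node pool in this cluster.
    # </dev/null keeps gcloud from consuming the outer loop's input.
    gcloud container node-pools list \
      --cluster "$cluster" --location "$location" \
      --format="csv[no-heading](name,management.autoRepair)" </dev/null |
    while IFS=',' read -r pool repair; do
      [ -n "$pool" ] || continue
      # Skip pools that already have auto-repair on.
      case "$repair" in [Tt]rue) continue ;; esac
      if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "gcloud container node-pools update $pool --cluster $cluster --location $location --enable-autorepair"
      else
        gcloud container node-pools update "$pool" \
          --cluster "$cluster" --location "$location" \
          --enable-autorepair </dev/null
      fi
    done
  done
}

# Usage: preview first, then apply.
#   DRY_RUN=1 enable_autorepair_everywhere
#   enable_autorepair_everywhere
```

Running with DRY_RUN=1 first lets you review the list of pools before anything changes. If the pools are managed by Terraform, fix them there instead so the next apply does not revert the change.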


How to prevent it from happening again

Fixing the pools you have today is the easy part. Keeping them fixed as new clusters and pools appear is where most teams slip. Build a guardrail at the point where infrastructure is defined.

Gate it in CI with Terraform policy

If you use Terraform, an OPA or Conftest policy can reject any plan that creates a node pool with auto-repair disabled. Here is a Rego rule for Conftest:

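package main

# rego.v1 syntax works on OPA 0.59 and later, including OPA 1.x.
import rego.v1

# Deny any planned node pool whose management block explicitly sets
# auto_repair to false. Deleted pools have no "after" state and are skipped.
deny contains msg if {
  resource := input.resource_changes[_]
  resource.type == "google_container_node_pool"
  management := resource.change.after.management[_]
  management.auto_repair == false
  msg := sprintf("Node pool '%s' has auto_repair disabled", [resource.address])
}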

Run it against a plan output in your pipeline:

terraform plan -out=tfplan
terraform show -json tfplan > plan.json
conftest test plan.json

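Enforce it at runtime with Org Policy or a scheduled scan

For clusters not managed by your IaC pipeline, a custom Organization Policy constraint on node pools or a scheduled scan keeps you honest. Gatekeeper is not the right tool here: it admits or rejects Kubernetes API objects, while auto-repair is a node pool setting in the GKE API. The most reliable approach is a continuous check that runs against the live API, because that catches drift no matter how a pool was created or modified.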

Tip: Lensix runs the gke_noautorepair check continuously across every project and cluster in your GCP organization, so a newly created pool with auto-repair off shows up in your findings without anyone having to remember to scan for it.


Best practices

  • Enable auto-repair and auto-upgrade together. They solve complementary problems: auto-upgrade keeps nodes patched, auto-repair keeps them healthy, and a node recreated by either ends up on a clean template.
  • Configure pod disruption budgets. Auto-repair drains nodes before recreating them. A PDB ensures the drain does not take down all replicas of a workload at once.
  • Use a release channel. Regular, Stable, or Rapid channels let GKE coordinate maintenance, and they make auto-repair and auto-upgrade behavior predictable.
  • Spread workloads across nodes and zones. Topology spread constraints and a regional cluster mean a single repaired node never represents your entire capacity for a service.
  • Watch your repair activity. A node pool that repairs nodes constantly is telling you something, often disk pressure or a bad image. Treat frequent repairs as a signal, not background noise.
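
One way to watch repair activity is to count auto-repair operations per target. The sketch below assumes that repairs appear in gcloud container operations list with the operation type AUTO_REPAIR_NODES and that the targetLink field identifies what was repaired. The function name count_repairs is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads one repair target per line on stdin and prints
# "COUNT TARGET" rows, with the most frequently repaired target first.
count_repairs() {
  awk 'NF { n[$0]++ } END { for (t in n) print n[t], t }' | sort -k1,1nr -k2
}

# Usage: list auto-repair operations and count them per target.
#   gcloud container operations list \
#     --filter="operationType=AUTO_REPAIR_NODES" \
#     --format="value(targetLink)" \
#     | count_repairs
```

A target that keeps showing up at the top of this list is worth investigating before the next repair, not after.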

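Note: Check the GKE maintenance documentation before relying on maintenance windows or exclusions to hold back repairs. They primarily govern automatic upgrades, and a repair of an unhealthy node may not wait for a window, so protect sensitive deploys and known traffic peaks with disruption budgets and spare capacity as well.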

Enabling auto-repair costs nothing and asks almost nothing of you. The only real decision is whether you want a human in the loop every time a node fails, and for the vast majority of workloads, the answer is no. Turn it on, pair it with sensible disruption budgets, and let GKE handle the boring recovery work so your team can stay focused on the problems that actually need a person.