Instance Group Has No Auto-Healing: Why It Matters and How to Fix It on GCP

Learn why GCP managed instance groups without auto-healing leave broken VMs serving traffic, plus step-by-step gcloud and Terraform fixes and CI/CD prevention.

TL;DR

This check flags GCP managed instance groups that have no auto-healing policy, which means a hung or crashed VM keeps serving traffic instead of being replaced. Add a health check and attach it as an auto-healing policy so the group recreates broken instances automatically.

A managed instance group (MIG) is supposed to keep your fleet of VMs healthy and at the size you asked for. Most teams assume that just means "if a VM disappears, recreate it." That part is true by default. What is not true by default is that the MIG will notice when an instance is technically running but no longer doing its job. That is the gap auto-healing fills, and the compute_noautoheal check exists to catch MIGs that never closed it.


What this check detects

The check looks at every managed instance group in your GCP project and inspects its auto-healing policy. If the group has no autoHealingPolicies configured, or has a policy with no associated health check, Lensix flags it.

It is worth being precise about what auto-healing is and is not, because the terms get blurred:

  • Instance recreation happens automatically whenever a VM is deleted, preempted, or its host fails. This is built into every MIG and does not require any configuration.
  • Auto-healing is the behavior that watches an application-level health check and recreates instances that fail that check, even if the VM is still technically running.

Note: A MIG without auto-healing still maintains your target instance count. The difference is what counts as "unhealthy." Without auto-healing, an instance is only replaced when the VM itself goes away. With auto-healing, an instance is replaced when your health check says the application inside it is broken.
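
To get a feel for what the check is looking at, here is a minimal sketch that lists MIGs in the current project with no health check attached through autoHealingPolicies. The function name is hypothetical, and the csv projection assumes the field layout the Compute API returns for instance group managers, so treat it as an approximation of the check rather than how Lensix implements it.

```shell
# Print the name of every MIG that has no auto-healing health check.
# Assumes gcloud is authenticated and a default project is set.
list_migs_without_autohealing() {
  gcloud compute instance-groups managed list \
    --format="csv[no-heading](name,autoHealingPolicies[0].healthCheck)" |
    # Column 2 is empty when there is no policy, or a policy with no health check.
    awk -F, '$2 == "" { print $1 }'
}

# Usage:
#   list_migs_without_autohealing
```

An empty result means every MIG in the project has a health check wired into its auto-healing policy.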


Why it matters

The failure modes that hurt most in production are rarely a VM vanishing. They are the slow, quiet ones: a process that deadlocks, a memory leak that makes an app stop responding, a dependency timeout that leaves a service returning 500s. In all of these cases the VM is still up, the MIG sees a running instance, and it does nothing. Meanwhile the load balancer keeps routing real users to a node that cannot serve them.

Concretely, here is what a missing auto-healing policy costs you:

  • Shrinking capacity. If the backend service has its own health check but the MIG has no auto-healing, the load balancer stops sending traffic to a bad instance, but the instance is never replaced. Your capacity quietly shrinks and nobody restores it until someone gets paged.
  • Longer incidents. Recovery becomes manual. An on-call engineer has to notice the degradation, identify the bad instances, and delete them by hand so the MIG recreates them. That is minutes to hours of avoidable downtime.
  • Masked failures. Without auto-healing, a single bad instance can silently drag down a percentage of requests. With ten instances and one stuck, ten percent of users get errors and your dashboards may not make it obvious.
  • Cascading overload. If unhealthy instances are still counted toward capacity, the healthy ones absorb more load than they were sized for, which can push them into failure too.

Warning: Auto-healing is configured separately from your load balancer's backend health check. People often assume the load balancer health check also drives healing. It does not. You can have a perfectly working load balancer health check and still have zero auto-healing.


How to fix it

Fixing this is a three-step job: create a health check, attach it to the MIG as an auto-healing policy with an initial delay, and make sure the firewall lets the health check probes through.

Step 1: Create a health check

Pick a check that reflects whether the application is actually serving, not just whether the OS booted. An HTTP check against a real readiness endpoint is usually the right call.

gcloud compute health-checks create http app-autoheal-hc \
  --port=8080 \
  --request-path=/healthz \
  --check-interval=10s \
  --timeout=5s \
  --healthy-threshold=2 \
  --unhealthy-threshold=3

Tip: Use a dedicated health endpoint like /healthz that checks the app's real dependencies (database connection, cache, etc.) rather than a static 200 OK on /. A health check that always passes is worse than none because it gives false confidence.
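
As a rough guide to how the interval and threshold flags in Step 1 translate into reaction time, the sketch below multiplies the two. It is only an approximation: it ignores the probe timeout and the fact that several probers may be checking in parallel, and the function name is hypothetical.

```shell
# Approximate seconds of continuous failure before an instance is marked unhealthy.
# Args: check interval in seconds, unhealthy threshold (consecutive failures).
time_to_unhealthy() {
  echo $(( $1 * $2 ))
}

time_to_unhealthy 10 3   # prints 30
```

With the values from Step 1, an instance has to fail for roughly half a minute before the MIG considers replacing it.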

Step 2: Attach the auto-healing policy to the MIG

The initial delay is critical. It tells the MIG how long to wait after an instance starts before health checks begin counting against it. Set it longer than your application's worst-case boot and warm-up time, otherwise healthy instances get killed mid-startup in a restart loop.

gcloud compute instance-groups managed update app-mig \
  --zone=us-central1-a \
  --health-check=app-autoheal-hc \
  --initial-delay=180

For a regional MIG, swap --zone for --region:

gcloud compute instance-groups managed update app-mig \
  --region=us-central1 \
  --health-check=app-autoheal-hc \
  --initial-delay=180

Warning: If your initial-delay is too short, the MIG will start recreating instances before they finish booting, producing a loop where instances are continuously destroyed and rebuilt. This can rack up disk and instance churn costs and take the service fully offline. Measure your real startup time and add a margin.
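
One way to measure startup time is to poll the health endpoint of a freshly created instance until it answers, then add a margin. The sketch below assumes curl is available and that you start polling at roughly the moment the instance is created. Both function names and the 50 percent margin are illustrative, not recommendations from GCP.

```shell
# Poll a URL until it returns HTTP 200, then print the elapsed seconds.
# Loops until the endpoint is ready, so interrupt it if the app never comes up.
seconds_until_ready() {
  start=$(date +%s)
  until [ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")" = "200" ]; do
    sleep 2
  done
  echo $(( $(date +%s) - start ))
}

# Add a percentage margin to a measured startup time.
# Args: measured seconds, margin in percent.
suggest_initial_delay() {
  echo $(( $1 + $1 * $2 / 100 ))
}

suggest_initial_delay 120 50   # prints 180
```

Run seconds_until_ready against a few cold starts and feed the slowest result into suggest_initial_delay rather than relying on a single measurement.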

Step 3: Make sure the firewall allows health check probes

GCP health check probes come from fixed source ranges. If your firewall blocks them, every instance will look unhealthy and auto-healing will tear your whole group apart.

gcloud compute firewall-rules create allow-health-checks \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:8080 \
  --source-ranges=35.191.0.0/16,130.211.0.0/22 \
  --target-tags=app-mig

Danger: Before enabling auto-healing on a production MIG, confirm the health check passes for currently healthy instances. If the check is misconfigured (wrong port, wrong path, blocked firewall), enabling auto-healing will delete and recreate every instance in the group at once, causing a full outage. Verify with gcloud compute instance-groups managed list-instances and watch the HEALTH_STATE column before walking away.
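
If you would rather script that verification than watch a table, a sketch along these lines counts the instances that are not reporting HEALTHY. The function name is hypothetical, and the instanceHealth[0].detailedHealthState projection is an assumption about the list-instances output, so confirm it against your own gcloud output first.

```shell
# Print how many instances in a zonal MIG are not in the HEALTHY state.
# Args: MIG name, zone.
count_not_healthy() {
  gcloud compute instance-groups managed list-instances "$1" --zone="$2" \
    --format="csv[no-heading](instance.basename(),instanceHealth[0].detailedHealthState)" |
    # Instances with a missing or non-HEALTHY state are counted.
    awk -F, '$2 != "HEALTHY" { n++ } END { print n + 0 }'
}

# Usage:
#   count_not_healthy app-mig us-central1-a
```

A result of 0 that stays at 0 for a few check intervals is a reasonable signal that the health check, port, path, and firewall rule all line up.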

Terraform example

If you manage infrastructure as code, define the health check and the auto-healing policy directly on the MIG resource:

resource "google_compute_health_check" "app_autoheal" {
  name                = "app-autoheal-hc"
  check_interval_sec  = 10
  timeout_sec         = 5
  healthy_threshold   = 2
  unhealthy_threshold = 3

  http_health_check {
    port         = 8080
    request_path = "/healthz"
  }
}

resource "google_compute_instance_group_manager" "app" {
  name               = "app-mig"
  zone               = "us-central1-a"
  base_instance_name = "app"
  target_size        = 3

  version {
    instance_template = google_compute_instance_template.app.id
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.app_autoheal.id
    initial_delay_sec = 180
  }
}

How to prevent it from happening again

One-off fixes drift. New MIGs get created without auto-healing, someone copies an old template, and the gap reappears. Close it with policy and automation.

Gate it in CI/CD with Terraform validation

If you provision MIGs through Terraform, reject any plan that creates a MIG without an auto_healing_policies block. You can enforce this with an OPA/Conftest policy run in your pipeline:

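package main

mig_types := {
  "google_compute_instance_group_manager",
  "google_compute_region_instance_group_manager",
}

deny[msg] {
  resource := input.resource_changes[_]
  mig_types[resource.type]
  resource.change.actions[_] != "delete"
  # Plan JSON typically renders an omitted block as an empty list, so test for emptiness.
  count(object.get(resource.change.after, "auto_healing_policies", [])) == 0
  msg := sprintf("MIG '%s' has no auto_healing_policies block", [resource.address])
}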

Wire that into the pipeline so a missing policy fails the build:

terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
conftest test plan.json

Catch the rest with continuous scanning

Policy-as-code in CI only catches what flows through CI. For the rest, lean on continuous detection. Lensix runs compute_noautoheal across all your projects on a schedule, so a MIG created by hand in the console or by a script that bypassed Terraform still gets flagged. Pair that with alerting so a new finding opens a ticket rather than sitting in a dashboard.

Tip: Make auto-healing part of your MIG module rather than something each team configures. If every team consumes a shared internal Terraform module that requires a health check input, it becomes impossible to create a non-compliant MIG by accident.


Best practices

  • Always set a realistic initial delay. Time a cold start of your instance under load and set initial-delay comfortably above it. Too low causes restart loops, too high delays recovery.
  • Use application-aware health checks. A check that hits a real readiness endpoint catches deadlocks and dependency failures. A TCP-port-open check only catches a fully dead process.
  • Keep auto-healing and load balancer health checks separate but consistent. The load balancer check should drain traffic quickly. The auto-healing check should be slightly more tolerant so you replace instances that are genuinely broken, not ones with a brief blip.
  • Set sensible thresholds. An unhealthy-threshold of 1 reacts to transient failures and causes unnecessary churn. Require a few consecutive failures before healing.
  • Combine auto-healing with a maxUnavailable update policy. During rolling updates, cap how many instances can be unavailable at once so that an update overlapping with healing never drops you below safe capacity.
  • Monitor healing events. Frequent auto-healing is a symptom, not a cure. If a group heals constantly, the underlying app or sizing problem needs fixing.

Note: Auto-healing is a recovery mechanism, not a substitute for fixing root causes. It buys time and keeps users served while you investigate why instances are failing. Treat a spike in healing events as you would any other incident signal.
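
To keep an eye on healing frequency, one option is to count auto-healing operations in the project's operation history. The sketch below assumes repair operations show up with an operationType matching compute.instances.repair.*, which is the filter form I recall from Google's documentation. The function name is hypothetical, and the count only covers whatever window the operations list retains.

```shell
# Print the number of auto-healing (repair) operations visible in the current project.
count_healing_events() {
  gcloud compute operations list \
    --filter='operationType~compute.instances.repair.*' \
    --format="value(name)" |
    awk 'END { print NR }'
}

# Usage: alert when the count crosses a threshold you choose, for example
#   [ "$(count_healing_events)" -gt 5 ] && echo "frequent auto-healing, investigate"
```

A steadily rising count for one group is the cue to look at the application or instance sizing rather than at the healing policy.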

Configured correctly, auto-healing turns a class of silent, traffic-eating failures into a non-event. The MIG notices the bad instance, drains it, recreates it, and your on-call engineer never gets paged. That is the whole point of running a managed group in the first place, so it is worth the five minutes to set up.