Fix Single-Zone Instance Groups on GCP | Lensix

TL;DR

This check flags managed instance groups (MIGs) pinned to a single GCP zone, which means a zonal outage takes your whole service offline. Convert the zonal MIG to a regional one so instances spread across multiple zones in the region.

A managed instance group in Google Cloud can be either zonal or regional. The difference sounds like an implementation detail, but it decides whether your application survives a zone failure or goes dark with it. This check catches MIGs that live entirely inside one zone, leaving you with a single point of failure that often goes unnoticed until the zone has a bad day.

What this check detects

The compute_singlezone check inspects your GCP managed instance groups and flags any that are deployed in a single zone. A zonal MIG creates and manages all of its instances within one zone, for example us-central1-a. If that zone experiences a hardware failure, network partition, or maintenance event, every instance in the group becomes unreachable at the same time.

Regional MIGs, by contrast, distribute instances across multiple zones within a region (typically three). When one zone fails, the instances in the other zones keep serving traffic, and the MIG can recreate the lost capacity in healthy zones.

Note: A GCP region is a geographic area like us-central1. Each region contains multiple isolated zones (us-central1-a, -b, -c, and so on). Zones are designed to fail independently, so spreading workloads across them is the primary way to survive infrastructure-level outages.

Why it matters

Zonal outages are not hypothetical. Google publishes incident reports regularly, and individual zones do go down for hours at a time due to power, cooling, or network issues. If your MIG is zonal, a single one of these events means full downtime for whatever it serves, whether that is a customer-facing API, an internal service, or a batch processing fleet.

The risk is amplified by how easy it is to create a zonal MIG by accident. The gcloud compute instance-groups managed create command creates a zonal group whenever you pass --zone, and plenty of Terraform examples online use google_compute_instance_group_manager (zonal) rather than google_compute_region_instance_group_manager (regional). Teams copy a working snippet, ship it, and never revisit it.

Real-world impact looks like this:

Availability SLA breaches. A zonal MIG cannot meet a 99.9% or higher uptime target if a single zone failure causes complete loss of service.
Failed autoscaling recovery. When the zone is unhealthy, a zonal MIG has nowhere to recreate instances. It waits for the zone to come back.
Cascading failures. If a load balancer points only at a zonal backend, health checks fail across the board with no fallback capacity.
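
To put a number on the first point, the downtime budget behind an availability target is simple arithmetic. The sketch below assumes a 30-day measurement window by default, and the function name downtime_budget_minutes is made up for illustration. It is not tied to any particular SLA definition.

```shell
#!/usr/bin/env bash
# Allowed downtime, in minutes, for an availability target over a window.
#   $1 = availability target in percent (for example 99.9)
#   $2 = window length in days (assumed default: 30)
downtime_budget_minutes() {
  LC_ALL=C awk -v sla="$1" -v days="${2:-30}" \
    'BEGIN { printf "%.1f\n", days * 24 * 60 * (100 - sla) / 100 }'
}

downtime_budget_minutes 99.9    # prints 43.2
downtime_budget_minutes 99.99   # prints 4.3
```

With a budget of roughly 43 minutes per month at 99.9%, a single zone outage lasting a few hours uses up several months of allowance for a zonal MIG.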

Warning: A regional MIG spreads instances across zones, so your data egress and inter-zone traffic patterns may change slightly. Inter-zone traffic within the same region is billed, though the cost is usually small compared to the availability benefit. Review network-heavy workloads before assuming the change is free.

How to fix it

The fix is to run your workload as a regional MIG instead of a zonal one. GCP does not let you convert an existing zonal MIG to regional in place, so you create a new regional MIG and migrate traffic to it. Here is the practical path.

1. Confirm which MIGs are zonal

List your managed instance groups and check the scope column:

gcloud compute instance-groups managed list \
  --format="table(name, location():label=LOCATION, location_scope():label=SCOPE, size)"

Any row whose SCOPE is zone is a zonal MIG and a candidate for this check.
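
If you would rather get only the offenders, one option is to work from resource URIs, since a zonal MIG's URI contains /zones/ while a regional one contains /regions/. The sketch below assumes the standard Compute Engine URI layout and that the list command's --uri flag prints one URI per line. The function name zonal_migs_from_uris is hypothetical.

```shell
#!/usr/bin/env bash
# Print "<name> <zone>" for every zonal MIG URI read from stdin.
# Assumes URIs shaped like:
#   .../projects/<project>/zones/<zone>/instanceGroupManagers/<name>
# Regional MIGs (".../regions/<region>/...") are skipped.
zonal_migs_from_uris() {
  local uri zone name
  while IFS= read -r uri; do
    case "$uri" in
      */zones/*/instanceGroupManagers/*)
        zone="${uri#*/zones/}"   # drop everything up to the zone name
        zone="${zone%%/*}"       # keep only the zone segment
        name="${uri##*/}"        # last path segment is the MIG name
        printf '%s %s\n' "$name" "$zone"
        ;;
    esac
  done
}

# Usage (needs an authenticated gcloud, so it is not run here):
#   gcloud compute instance-groups managed list --uri | zonal_migs_from_uris
```

The parsing is kept separate from the gcloud call so it can be checked against sample URIs without touching a project.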

2. Create a regional MIG from the same instance template

Reuse your existing instance template so the new group runs identical instances. The key difference is --region instead of --zone:

gcloud compute instance-groups managed create my-app-regional \
  --region=us-central1 \
  --template=my-app-template \
  --size=3 \
  --target-distribution-shape=EVEN

The --target-distribution-shape=EVEN flag tells GCP to keep instances balanced across zones, which is what gives you the resilience you want. Set --size to a multiple of the number of zones so distribution stays even.
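
Picking a size that is a multiple of the zone count is easy to get wrong when the number comes from a capacity estimate. As a small sketch, a helper like the following (the name round_up_to_zone_multiple is hypothetical) rounds a desired instance count up to the next multiple of the zone count:

```shell
#!/usr/bin/env bash
# Round a desired instance count up to the nearest multiple of the zone count.
#   $1 = desired instance count (positive integer)
#   $2 = number of zones the regional MIG spans (positive integer)
round_up_to_zone_multiple() {
  local size="$1" zones="$2"
  echo $(( (size + zones - 1) / zones * zones ))
}

round_up_to_zone_multiple 4 3   # prints 6
round_up_to_zone_multiple 3 3   # prints 3
```

The result can be passed as --size when creating the group, for example --size="$(round_up_to_zone_multiple 4 3)".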

3. Re-attach autoscaling and named ports

Autoscalers, health checks, and named ports are configured per MIG, so recreate them on the regional group:

gcloud compute instance-groups managed set-autoscaling my-app-regional \
  --region=us-central1 \
  --min-num-replicas=3 \
  --max-num-replicas=12 \
  --target-cpu-utilization=0.6

gcloud compute instance-groups managed set-named-ports my-app-regional \
  --region=us-central1 \
  --named-ports=http:8080

4. Add the regional MIG to your load balancer backend and shift traffic

Add the new regional MIG as a backend, verify it passes health checks, then drain and remove the old zonal backend. Doing it in this order avoids a gap in capacity.

# Add the regional MIG as a backend
gcloud compute backend-services add-backend my-app-backend \
  --instance-group=my-app-regional \
  --instance-group-region=us-central1 \
  --global

# Once healthy, remove the old zonal backend
gcloud compute backend-services remove-backend my-app-backend \
  --instance-group=my-app-zonal \
  --instance-group-zone=us-central1-a \
  --global
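
The "once healthy" condition can be checked by a script rather than by eye. The sketch below reads the text printed by gcloud compute backend-services get-health and assumes its default output contains one "healthState: <STATE>" line per instance. That output shape is an assumption worth verifying against your gcloud version, and the function name report_is_healthy is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed only if the health report on stdin lists at least one HEALTHY
# instance and no instance in any other state.
report_is_healthy() {
  local report healthy other
  report="$(cat)"
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  healthy="$(printf '%s\n' "$report" | grep -c 'healthState: HEALTHY' || true)"
  other="$(printf '%s\n' "$report" | grep 'healthState:' \
    | grep -vc 'healthState: HEALTHY' || true)"
  [ "$healthy" -gt 0 ] && [ "$other" -eq 0 ]
}

# Usage (needs an authenticated gcloud, so it is not run here):
#   until gcloud compute backend-services get-health my-app-backend --global \
#       | report_is_healthy; do
#     sleep 10
#   done
```

Because the report covers every backend on the service, the loop only exits when the old zonal backend is healthy too, which is the state you want before removing it.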

5. Delete the old zonal MIG

Danger: Deleting a managed instance group terminates all of its instances immediately. Confirm the regional MIG is serving production traffic and passing health checks before you delete the zonal one. There is no undo.

gcloud compute instance-groups managed delete my-app-zonal \
  --zone=us-central1-a

Fixing it in Terraform

If you manage infrastructure as code, the cleaner approach is to switch resource types. Replace the zonal resource with the regional one rather than patching attributes.

Before (zonal, the thing this check flags):

resource "google_compute_instance_group_manager" "app" {
  name               = "my-app"
  zone               = "us-central1-a"
  base_instance_name = "my-app"
  target_size        = 3

  version {
    instance_template = google_compute_instance_template.app.id
  }
}

After (regional, spread across zones):

resource "google_compute_region_instance_group_manager" "app" {
  name                      = "my-app"
  region                    = "us-central1"
  base_instance_name        = "my-app"
  target_size               = 3
  distribution_policy_zones = [
    "us-central1-a",
    "us-central1-b",
    "us-central1-c",
  ]

  version {
    instance_template = google_compute_instance_template.app.id
  }

  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "REPLACE"
    max_surge_fixed       = 3
    max_unavailable_fixed = 0
  }
}

Warning: Changing the resource type causes Terraform to destroy the zonal MIG and create the regional one. Plan this as a controlled migration with traffic routed through a load balancer, not a casual terraform apply during business hours.

How to prevent it from happening again

One fix does not stop the next zonal MIG from being created. Bake the rule into your pipeline so it never ships in the first place.

Catch it in CI with policy-as-code

Use Open Policy Agent (OPA) with Conftest to fail any Terraform plan that defines a zonal instance group manager:

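package main

# "import rego.v1" keeps the rule valid on both recent OPA 0.x and OPA 1.x.
import rego.v1

# The plan JSON lists planned resources under resource_changes.
# Flag every zonal instance group manager that is not simply being deleted.
deny contains msg if {
  some rc in input.resource_changes
  rc.type == "google_compute_instance_group_manager"
  rc.change.actions != ["delete"]
  msg := sprintf(
    "Instance group '%s' is zonal. Use google_compute_region_instance_group_manager for multi-zone resilience.",
    [rc.name]
  )
}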

Run it against the plan output in your pipeline:

terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
conftest test tfplan.json --policy ./policy
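
On runners where Conftest is not installed, a much coarser fallback is to search the plan JSON for the zonal resource type. This is only a sketch: it assumes the resource type appears as a "type":"..." pair in the JSON, it also fires on zonal MIGs that the plan is deleting, and the function name plan_mentions_zonal_mig is hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 if the given plan JSON file mentions the zonal MIG resource type.
# The opening quote before the type name keeps the regional type
# (google_compute_region_instance_group_manager) from matching.
plan_mentions_zonal_mig() {
  grep -Eq '"type":[[:space:]]*"google_compute_instance_group_manager"' "$1"
}

# Usage in a pipeline step (not run here):
#   if plan_mentions_zonal_mig tfplan.json; then
#     echo "Zonal instance group manager found in plan" >&2
#     exit 1
#   fi
```

Treat this as a stopgap. The policy-based check is more precise because it looks at each resource change and its planned actions.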

Enforce distribution at runtime

For environments where developers create resources directly, GCP Organization Policy constraints and custom Cloud Asset Inventory feeds can detect zonal MIGs as they appear. Pair that with a scheduled Lensix scan so the compute_singlezone check runs continuously rather than once.

Tip: Set the regional resource as the default in your internal Terraform modules. If the only MIG module your teams can import is regional, nobody has to remember the rule, and the zonal path simply does not exist in your codebase.

Best practices

Default to regional MIGs. Treat zonal MIGs as the exception that needs justification, not the starting point. Most production workloads should be regional.
Use even distribution. Set target-distribution-shape=EVEN and size your group as a multiple of the zone count so capacity stays balanced after recovery.
Pair with a load balancer. A regional MIG behind a regional or global load balancer gives you automatic health-check-based routing across zones.
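Test zone failure. Periodically take one zone's instances out of service, ideally in a non-production copy of the group, and confirm the remaining zones carry the load and capacity is restored as expected. Keep in mind that abandoning instances also lowers the group's target size, so check the target size after the test.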
Mind quotas. Spreading across zones can hit per-zone resource quotas differently than concentrating in one. Check CPU and IP quotas in each target zone before migrating.
Document the exceptions. Some workloads genuinely need to stay in one zone, for example latency-sensitive jobs that talk to a zonal resource. When that is the case, record why, so the check finding is a known accepted risk rather than an oversight.

Moving from zonal to regional is one of the highest-leverage reliability changes you can make in GCP Compute. It costs little, requires no application changes, and turns a full outage during a zone failure into a brief, self-healing dip.
