
VM Terminates During Host Maintenance: Why Your GCP VMs Should Migrate, Not Stop

Learn why Compute Engine VMs set to TERMINATE on host maintenance cause avoidable downtime, and how to switch them to MIGRATE with CLI, Terraform, and CI gates.

TL;DR

This check flags Compute Engine VMs set to TERMINATE on host maintenance instead of MIGRATE, which means Google will shut the instance down during routine infrastructure maintenance. Set onHostMaintenance to MIGRATE unless the VM genuinely cannot be live migrated.

Google Compute Engine runs your VMs on physical hosts that need regular maintenance: kernel patches, hardware swaps, microcode updates, and security fixes. When the host running your VM needs work, Compute Engine does one of two things, depending on how the instance is configured: it live migrates the VM to a healthy host without a reboot, or it terminates the VM outright. This check catches instances configured for the second option.

For most workloads, terminating a VM during routine maintenance is an unforced outage. It is the kind of misconfiguration that sits quietly until Google schedules maintenance on the host running your VM, and then your service goes dark at a time you did not choose.


What this check detects

The compute_nomaintmigration check inspects the scheduling configuration of each Compute Engine VM and reports any instance where the host maintenance behavior is set to TERMINATE rather than MIGRATE.

The relevant field lives under the instance's scheduling block:

{
  "scheduling": {
    "onHostMaintenance": "TERMINATE",
    "automaticRestart": true,
    "preemptible": false
  }
}

When onHostMaintenance is TERMINATE, Compute Engine stops the VM during a maintenance event instead of moving it to another host. If automaticRestart is true, the VM is restarted afterward, but it still incurs a hard stop and a fresh boot. The instance loses everything in memory, drops every in-flight connection, and reboots from scratch.

Note: Live migration is the default for standard VMs. You usually end up with TERMINATE either because someone set it deliberately, or because the VM uses a feature that is incompatible with live migration, such as GPUs or certain sole-tenant configurations. The check surfaces both cases so you can decide whether the setting is intentional.
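The detection rule described above can be sketched in a few lines of Python. This is an illustration only, not the actual Lensix implementation: it assumes the instance is available as a dict shaped like the output of `gcloud compute instances describe --format=json`, and the function name `maintenance_finding` is hypothetical.

```python
from typing import Optional


def maintenance_finding(instance: dict) -> Optional[str]:
    """Return a finding for a TERMINATE instance, or None if nothing to report."""
    scheduling = instance.get("scheduling", {})
    # Assumption: a missing field is treated as "not TERMINATE" and not flagged.
    if scheduling.get("onHostMaintenance") != "TERMINATE":
        return None

    name = instance.get("name", "<unnamed>")
    # guestAccelerators is the instance field that lists attached GPUs.
    if instance.get("guestAccelerators"):
        return f"{name}: TERMINATE, accelerator attached (likely required)"
    return f"{name}: TERMINATE, no accelerator found (review whether deliberate)"


example = {
    "name": "app-server",
    "scheduling": {
        "onHostMaintenance": "TERMINATE",
        "automaticRestart": True,
        "preemptible": False,
    },
}
print(maintenance_finding(example))
```

Both cases are reported rather than suppressed, which mirrors the point of the note: the finding tells you where to look, and you decide whether the setting is intentional.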


Why it matters

Host maintenance is not rare. Google performs it on a rolling basis across its fleet, and any given VM can be affected several times a year. With MIGRATE, you never notice. With TERMINATE, every maintenance event becomes a reboot.

The concrete risks

  • Unscheduled downtime. The VM goes down when Google decides to maintain the host, not when your change window opens. A stateful single-instance service, a self-hosted database, or a license server can take an outage at 3pm on a Tuesday.
  • Lost in-memory state. Caches, session data, queued work held only in RAM, and anything not yet flushed to disk disappear on termination.
  • Dropped connections. Long-lived gRPC streams, websocket sessions, database connections, and file transfers all break.
  • Cascading failures. If the VM is a singleton dependency for other services, its restart can ripple outward into retries, timeouts, and degraded performance across your stack.
  • Slow recovery. Boot time plus application warm-up can mean minutes of unavailability, not seconds. For VMs with attached local SSDs, the data on those disks is lost entirely on termination.

Warning: If your VM has local SSD scratch disks, a TERMINATE maintenance event wipes that data. Local SSDs do not survive a stop. If you rely on local SSD for anything you cannot afford to lose, you need a fundamentally different design, not just a scheduling flag.

When TERMINATE is actually the right call

This is not always a mistake. Some VMs cannot be live migrated, and forcing MIGRATE on them will fail. GPU-attached instances are the classic example. In those cases TERMINATE with automaticRestart enabled is the correct configuration, and the goal is to make sure your application handles the restart gracefully rather than to flip the flag blindly.


How to fix it

The fix is to set onHostMaintenance to MIGRATE. The instance does not need to be deleted and recreated, but it does need to be stopped to change this setting.

Warning: Changing scheduling options requires the VM to be stopped first. This means a brief, planned outage. Do it during a maintenance window, and make sure you have a path to drain traffic or fail over before you stop the instance.

Option 1: gcloud CLI

First, confirm the current setting:

gcloud compute instances describe my-instance \
  --zone=us-central1-a \
  --format="value(scheduling.onHostMaintenance)"

Stop the instance, update the scheduling, then start it again:

# Stop the VM
gcloud compute instances stop my-instance --zone=us-central1-a

# Set host maintenance behavior to MIGRATE
gcloud compute instances set-scheduling my-instance \
  --zone=us-central1-a \
  --maintenance-policy=MIGRATE \
  --restart-on-failure

# Start it back up
gcloud compute instances start my-instance --zone=us-central1-a

Verify the change took effect:

gcloud compute instances describe my-instance \
  --zone=us-central1-a \
  --format="value(scheduling.onHostMaintenance)"
# Expected output: MIGRATE

Tip: If you have many instances to fix, loop over them. List affected VMs with a filter, then iterate. Always test on a non-production instance first so you understand the stop and start timing for your environment.

# Find all VMs in a zone set to TERMINATE
gcloud compute instances list \
  --filter="scheduling.onHostMaintenance=TERMINATE AND zone:us-central1-a" \
  --format="value(name)"
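The iteration step could look something like the following Python sketch. It only wraps the same three gcloud commands shown in Option 1; the function names `remediation_commands` and `remediate` are hypothetical, and the script defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess


def remediation_commands(name: str, zone: str) -> list[list[str]]:
    """Build the stop / set-scheduling / start sequence for one VM."""
    base = ["gcloud", "compute", "instances"]
    zone_flag = f"--zone={zone}"
    return [
        base + ["stop", name, zone_flag],
        base + [
            "set-scheduling", name, zone_flag,
            "--maintenance-policy=MIGRATE",
            "--restart-on-failure",
        ],
        base + ["start", name, zone_flag],
    ]


def remediate(names: list[str], zone: str, dry_run: bool = True) -> None:
    """Apply the sequence to each VM in turn; print only when dry_run is set."""
    for name in names:
        for cmd in remediation_commands(name, zone):
            if dry_run:
                print(" ".join(cmd))
            else:
                # check=True aborts the loop on the first failing command.
                subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Names would come from the `gcloud compute instances list` filter above.
    remediate(["my-instance"], "us-central1-a")
```

Processing one VM at a time keeps the blast radius small: if a command fails, the remaining instances are left untouched rather than half-stopped.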

Option 2: Google Cloud Console

  1. Open Compute Engine and select the VM instance.
  2. Stop the instance if it is running.
  3. Click Edit.
  4. Scroll to Management, then Availability policies.
  5. Set On host maintenance to Migrate VM instance.
  6. Save, then start the instance.

Option 3: Terraform

If you manage instances with Terraform, set the scheduling block explicitly so the configuration is enforced and visible in code:

resource "google_compute_instance" "app" {
  name         = "app-server"
  machine_type = "e2-standard-4"
  zone         = "us-central1-a"

  scheduling {
    on_host_maintenance = "MIGRATE"
    automatic_restart   = true
    preemptible         = false
  }

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }
}

Danger: Changing on_host_maintenance in Terraform may force a stop or, depending on other attributes, a replacement of the instance. Run terraform plan and read the output carefully before applying. A replacement destroys the VM and any local SSD data attached to it.

What about VMs that cannot migrate?

For GPU instances and other workloads where MIGRATE is not supported, keep TERMINATE but make the behavior safe and explicit:

  scheduling {
    on_host_maintenance = "TERMINATE"
    automatic_restart   = true
  }

  guest_accelerator {
    type  = "nvidia-tesla-t4"
    count = 1
  }

The goal here is graceful handling, not avoidance. Make sure automatic_restart is on, persist state to durable storage, and design the application to recover cleanly from a reboot.
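One way to approach graceful handling is to watch the Compute Engine metadata server, which exposes an `instance/maintenance-event` value that changes from `NONE` to `TERMINATE_ON_HOST_MAINTENANCE` shortly before a terminate event. The sketch below is a minimal illustration under that assumption; `should_flush`, `check_once`, and the `flush` callback are hypothetical names, and what "flush" means (checkpointing, draining, writing state to a persistent disk) depends on your application.

```python
import urllib.request

# Metadata server path for upcoming host maintenance events.
MAINTENANCE_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/maintenance-event"
)


def fetch_maintenance_event(timeout: float = 5.0) -> str:
    """Read the current maintenance event from the metadata server."""
    request = urllib.request.Request(
        MAINTENANCE_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def should_flush(event: str) -> bool:
    """Only a pending terminate event requires persisting state."""
    return event == "TERMINATE_ON_HOST_MAINTENANCE"


def check_once(fetch=fetch_maintenance_event, flush=lambda: None) -> bool:
    """Poll once; call flush() and return True if termination is pending."""
    if should_flush(fetch()):
        flush()
        return True
    return False
```

In a real service you would call `check_once` from a background loop or a sidecar, passing a `flush` function that writes state to durable storage. The `fetch` parameter is injectable so the logic can be exercised off-GCP, where the metadata server is not reachable.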


How to prevent it from happening again

Fixing the VMs you have today is half the job. The other half is stopping new TERMINATE instances from creeping in.

Policy-as-code without an Organization Policy constraint

Google Cloud does not have a built-in org policy constraint for maintenance behavior, so the practical enforcement point is your IaC pipeline and a CSPM tool like Lensix scanning continuously. Treat the scheduling block as a required, reviewed field.

OPA / Conftest gate in CI

Add a policy to your CI pipeline that rejects Terraform plans setting on_host_maintenance to TERMINATE unless the instance has an accelerator attached:

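package compute.maintenance

# Keeps the policy valid on OPA 1.x while still loading on recent 0.x releases.
import rego.v1

# Reject TERMINATE on any instance that has no accelerator attached.
deny contains msg if {
  resource := input.resource_changes[_]
  resource.type == "google_compute_instance"
  sched := resource.change.after.scheduling[_]
  sched.on_host_maintenance == "TERMINATE"
  not has_accelerator(resource.change.after)
  msg := sprintf(
    "Instance '%s' uses TERMINATE on host maintenance without a GPU. Use MIGRATE.",
    [resource.change.after.name]
  )
}

# True when the planned instance declares at least one guest_accelerator block.
has_accelerator(after) if {
  count(after.guest_accelerator) > 0
}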

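Wire it into your pipeline against the JSON plan. Conftest evaluates only the main namespace by default, so pass the policy's package name explicitly:

terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
conftest test tfplan.json --policy ./policy --namespace compute.maintenance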

Tip: The GPU exception in the policy above keeps the gate from blocking legitimate workloads. Encode the exception in code rather than disabling the check, so the reasoning stays visible to the next engineer who reads it.

Continuous scanning

IaC gates only cover resources created through IaC. Click-ops VMs, instances spun up by scripts, and drift all slip past. Run a continuous check across your live GCP environment so any VM that ends up on TERMINATE, regardless of how it got there, is flagged. This is exactly what the Lensix compute_nomaintmigration check does on every scan.
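As a rough illustration of what a scan over the live environment involves, the Python sketch below walks a fleet snapshot and reports TERMINATE VMs that carry no documented exception. It assumes the snapshot has the shape of `gcloud compute instances list --format=json` output; the label key `maintenance-exception` and the function name are hypothetical conventions, not something GCP or Lensix defines.

```python
import json

# Hypothetical label used to mark a deliberate, documented TERMINATE setting.
EXCEPTION_LABEL = "maintenance-exception"


def undocumented_terminate_vms(instances: list[dict]) -> list[str]:
    """Return zone/name for TERMINATE VMs that lack the exception label."""
    flagged = []
    for inst in instances:
        policy = inst.get("scheduling", {}).get("onHostMaintenance")
        if policy != "TERMINATE":
            continue
        if EXCEPTION_LABEL in (inst.get("labels") or {}):
            continue
        # The API returns zone as a full URL; keep only the final segment.
        zone = inst.get("zone", "").rsplit("/", 1)[-1]
        flagged.append(f"{zone}/{inst.get('name')}")
    return sorted(flagged)


snapshot = json.loads("""
[
  {"name": "app-server", "zone": "projects/p/zones/us-central1-a",
   "scheduling": {"onHostMaintenance": "TERMINATE"}},
  {"name": "gpu-trainer", "zone": "projects/p/zones/us-central1-b",
   "scheduling": {"onHostMaintenance": "TERMINATE"},
   "labels": {"maintenance-exception": "gpu"}},
  {"name": "web-1", "zone": "projects/p/zones/us-central1-a",
   "scheduling": {"onHostMaintenance": "MIGRATE"}}
]
""")
print(undocumented_terminate_vms(snapshot))
```

Because the input is a snapshot of what is actually running, this kind of scan catches click-ops instances and drift that never pass through a Terraform plan.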


Best practices

  • Default to MIGRATE. Unless a workload genuinely cannot live migrate, MIGRATE should be the standard across your fleet. It is the lowest-friction way to avoid maintenance-driven outages.
  • Set scheduling explicitly in IaC. Do not rely on provider defaults. Spell out on_host_maintenance and automatic_restart in every instance and instance template so the behavior is reviewable in code.
  • Design for restarts anyway. Live migration is not a guarantee of perfect uptime, and migration can briefly degrade performance. Build applications that tolerate restarts, persist state to durable storage, and recover automatically.
  • Use managed instance groups for stateless workloads. MIGs with autohealing and rolling updates make individual VM disruptions a non-event, which matters far more than any single scheduling flag.
  • Never trust local SSD as durable. Local SSD data does not survive a stop or a terminate. If you use it, treat it strictly as scratch space.
  • Document the exceptions. When a VM legitimately uses TERMINATE, record why. A GPU instance with automaticRestart on is fine, but the next person should not have to guess whether it was deliberate.

The cheapest outage to prevent is the one you cause yourself. A single scheduling field is the difference between Google quietly moving your VM and Google quietly shutting it down.

Audit your fleet, flip the workloads that should be on MIGRATE, keep an explicit and justified exception list for the rest, and put a gate in CI so the problem does not come back. Lensix will keep watching the live environment for the cases that bypass your pipeline.