
VM Automatic Restart Disabled on GCP Compute Engine

Learn why GCP Compute Engine VMs with automatic restart disabled risk silent outages, plus gcloud and Terraform fixes and CI policy gates to prevent drift.

TL;DR

This check flags Compute Engine VMs that have automatic restart turned off, which means GCP will not bring the instance back after a host failure or a maintenance event that terminates it. Re-enable it with the --restart-on-failure flag of gcloud compute instances set-scheduling (or automatic_restart = true in Terraform) to avoid silent, unrecovered outages.

Automatic restart is one of those settings that nobody thinks about until a VM disappears at 3 AM and never comes back. On Google Cloud, Compute Engine instances have a scheduling configuration that controls how they behave when the underlying host needs maintenance or crashes. When automatic restart is disabled, Google will not relaunch your instance after these events. The VM just stays stopped until a human notices.

This Lensix check, compute_noautorestart, looks at the scheduling block of each Compute Engine VM and reports any instance where automaticRestart is set to false.


What this check detects

Every Compute Engine instance has a scheduling object that includes three relevant fields:

  • automaticRestart: whether Compute Engine restarts the VM if it is terminated by a non-user event (host failure, host maintenance, internal error).
  • onHostMaintenance: whether the VM migrates to another host (MIGRATE) or is terminated (TERMINATE) during maintenance.
  • preemptible / provisioningModel: whether the VM is a Spot or preemptible instance.

The check fires when automaticRestart is false on a standard (non-Spot) VM. You can confirm the current value with:

gcloud compute instances describe INSTANCE_NAME \
  --zone=ZONE \
  --format="yaml(scheduling)"

A flagged instance looks like this:

scheduling:
  automaticRestart: false
  onHostMaintenance: MIGRATE
  preemptible: false

Note: Spot and preemptible VMs cannot have automatic restart enabled by design. Google reclaims them on demand, so a value of false is expected and correct there. Lensix scopes this check to standard instances where the setting actually changes recovery behavior.


Why it matters

The default for a standard VM is automaticRestart: true. Someone has to actively turn it off, often by copying an old IaC template or by misreading a Spot configuration. The consequences depend on what the VM does, and they are rarely good.

Host failures are not rare at scale

Physical hosts fail. Google runs a massive fleet and individual machines crash, lose power, or hit hardware faults regularly. When that happens to a VM with automatic restart enabled, Compute Engine brings it back on healthy hardware within minutes, usually with no intervention. With automatic restart disabled, the VM stays TERMINATED until someone manually starts it.

Host maintenance becomes an outage

Google performs live infrastructure maintenance on a rolling basis. If onHostMaintenance is set to TERMINATE and automatic restart is off, a routine maintenance window shuts your VM down and leaves it down. You did not cause it, you cannot schedule it, and you will not get it back automatically.
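The two failure paths above combine with the two scheduling fields into a small decision table. The sketch below encodes it as a shell function. The function name recovery_outcome and the event labels host-failure and maintenance are hypothetical, chosen only for illustration, and the outcomes simply restate the behavior described in this section rather than query any API.

```shell
# Hypothetical helper: summarizes what happens to a standard VM for a
# given event and scheduling configuration, as described in this article.
# Usage: recovery_outcome EVENT AUTOMATIC_RESTART ON_HOST_MAINTENANCE
recovery_outcome() {
  local event="$1" restart="$2" policy="$3"

  # MIGRATE only helps with planned maintenance, not with a host crash.
  if [ "$event" = "maintenance" ] && [ "$policy" = "MIGRATE" ]; then
    echo "migrated to another host, no reboot"
    return 0
  fi

  # Host failure, or maintenance under TERMINATE: the VM is terminated
  # and automaticRestart decides whether it comes back on its own.
  if [ "$restart" = "true" ]; then
    echo "terminated, then restarted automatically"
  else
    echo "terminated, stays down until started manually"
  fi
}
```

Calling recovery_outcome maintenance false TERMINATE prints "terminated, stays down until started manually", which is the case this check exists to catch.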

Real-world impact

  • A database VM goes down during host maintenance and stays offline, taking dependent services with it.
  • A batch processing node fails overnight and jobs silently stop. Nobody notices until reports are missing the next morning.
  • An auto-healing assumption breaks: teams build runbooks expecting VMs to recover, then discover they never will.

Warning: Disabled automatic restart is especially dangerous on stateful single-instance workloads, the kind without a load balancer or managed instance group in front of them. There is no second node to absorb the failure, so the outage is total until manual recovery.


How to fix it

You enable automatic restart by updating the instance scheduling configuration. The VM does not need to be stopped to change this on most machine types, but stopping and starting guarantees the new policy is applied cleanly.

Using gcloud

gcloud compute instances set-scheduling INSTANCE_NAME \
  --zone=ZONE \
  --restart-on-failure

The --restart-on-failure flag sets automaticRestart to true. If you ever need to revert, the opposite flag is --no-restart-on-failure.

While you are in there, it is worth pairing automatic restart with host maintenance migration so routine maintenance never terminates the VM:

gcloud compute instances set-scheduling INSTANCE_NAME \
  --zone=ZONE \
  --restart-on-failure \
  --maintenance-policy=MIGRATE

Note: MIGRATE is the default maintenance policy for most general purpose machine types and means the VM moves to another host with no reboot. Some configurations, such as VMs with GPUs or certain confidential computing setups, only support TERMINATE. For those, automatic restart is what brings the VM back after maintenance, so enabling it matters even more.

Verify the change

gcloud compute instances describe INSTANCE_NAME \
  --zone=ZONE \
  --format="value(scheduling.automaticRestart)"

You want the output to read True.
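If the verification runs inside a script rather than at a prompt, it helps to turn the printed value into an exit code. The following is a minimal sketch under that assumption. The function name assert_auto_restart is hypothetical, and it relies on the describe command printing True for an enabled instance, as noted above.

```shell
# Hypothetical helper: succeeds only if automatic restart is enabled.
# Returns 1 if the setting is off, 2 if the describe call itself fails.
# Usage: assert_auto_restart INSTANCE_NAME ZONE
assert_auto_restart() {
  local value
  value="$(gcloud compute instances describe "$1" \
    --zone="$2" \
    --format="value(scheduling.automaticRestart)")" || return 2

  if [ "$value" = "True" ]; then
    echo "OK: $1 has automatic restart enabled"
  else
    echo "FAIL: $1 reports automaticRestart='$value'" >&2
    return 1
  fi
}
```

A deployment script could then run assert_auto_restart INSTANCE_NAME ZONE || exit 1 right after the set-scheduling call.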

Fixing it in Terraform

If your VMs are managed with Terraform, change the configuration at the source so the next apply does not undo your manual fix:

resource "google_compute_instance" "app" {
  name         = "app-server"
  machine_type = "e2-standard-4"
  zone         = "us-central1-a"

  scheduling {
    automatic_restart   = true
    on_host_maintenance = "MIGRATE"
    preemptible         = false
    provisioning_model  = "STANDARD"
  }

  # ... boot_disk, network_interface, etc.
}

Then apply:

terraform plan
terraform apply

Warning: If the VM uses a machine type or accelerator that forces on_host_maintenance = "TERMINATE", do not try to set MIGRATE. The apply will fail because Compute Engine rejects that combination. Keep TERMINATE and rely on automatic_restart = true to recover the instance.

Fixing many VMs at once

If the check flagged dozens of instances, loop over them rather than fixing each by hand:

gcloud compute instances list \
  --filter="scheduling.automaticRestart=false AND scheduling.preemptible=false" \
  --format="csv[no-heading](name,zone)" | while IFS=, read -r NAME ZONE; do
    echo "Enabling automatic restart on $NAME in $ZONE"
    gcloud compute instances set-scheduling "$NAME" \
      --zone="$ZONE" \
      --restart-on-failure
done

Danger: Run the list command on its own first and review the output before piping it into a loop that modifies instances. A bad filter could touch Spot VMs or production hosts you did not intend to change. Test against a single non-production VM before running fleet-wide.

Tip: The set-scheduling call does not reboot a running VM, so this loop is safe to run during business hours for the automatic restart change alone. If you also flip the maintenance policy, schedule it during a quiet window since some policy changes require the instance to be stopped.
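One way to make the review step hard to skip is to have the loop default to a dry run. The sketch below wraps the same filter and set-scheduling call in a function that only prints what it would change unless told otherwise. The function name enable_restart_fleet and the DRY_RUN variable are hypothetical. The zone.basename() projection is an added precaution that trims the zone to its bare name in case a full resource URL is returned.

```shell
# Hypothetical wrapper: prints the planned changes by default.
# Set DRY_RUN=0 to actually modify instances.
enable_restart_fleet() {
  gcloud compute instances list \
    --filter="scheduling.automaticRestart=false AND scheduling.preemptible=false" \
    --format="csv[no-heading](name,zone.basename())" |
  while IFS=, read -r name zone; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "DRY RUN: would enable automatic restart on $name in $zone"
    else
      # stdin is redirected so the inner command cannot consume the
      # remaining lines of the instance list.
      gcloud compute instances set-scheduling "$name" \
        --zone="$zone" \
        --restart-on-failure < /dev/null
    fi
  done
}
```

Run enable_restart_fleet first, read the output, and only then repeat it as DRY_RUN=0 enable_restart_fleet.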


How to prevent it from happening again

Fixing the existing VMs is the easy part. Keeping the setting from drifting back is what actually protects you.

Enforce it with policy-as-code

GCP does not have a dedicated built-in Organization Policy constraint for automatic restart, so the reliable enforcement path is policy-as-code at the IaC layer plus a CI gate. Block bad configs before they reach the API.

Catch it in Terraform with a policy check

Using Conftest and OPA, you can fail any plan that disables automatic restart on a standard VM. Add a Rego policy:

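package main

# rego.v1 syntax needs OPA 0.59 or newer.
import rego.v1

# Spot and preemptible VMs are exempt: they cannot use automatic restart.
is_spot(sched) if sched.provisioning_model == "SPOT"

is_spot(sched) if sched.preemptible == true

deny contains msg if {
  some resource in input.resource_changes
  resource.type == "google_compute_instance"
  some sched in resource.change.after.scheduling
  # "not is_spot" also holds when provisioning_model is absent from the
  # plan (for example, not yet known), a case that a direct
  # != "SPOT" comparison would silently skip.
  not is_spot(sched)
  sched.automatic_restart == false
  msg := sprintf(
    "Instance '%s' has automatic_restart disabled on a standard VM",
    [resource.change.after.name]
  )
}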

Wire it into your pipeline against a Terraform plan in JSON form:

terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
conftest test tfplan.json --policy policy/

Tip: Run this as a required status check on pull requests. Engineers get the feedback in the PR before merge, which is far cheaper than discovering the misconfiguration during an incident postmortem.

Use managed instance groups for critical workloads

Where the workload can be stateless or horizontally scaled, a regional managed instance group with autohealing gives you recovery that does not depend on a single VM setting. The group recreates failed instances against a health check, which covers far more failure modes than automatic restart alone.
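As a rough sketch of what enabling autohealing involves, the function below creates an HTTP health check and attaches it to an existing managed instance group. The function name enable_autohealing is hypothetical, and the port, the /healthz path, the thresholds, and the 300-second initial delay are placeholder assumptions to adapt to the workload.

```shell
# Hypothetical helper: attaches an HTTP health check to an existing
# regional managed instance group so unhealthy VMs are recreated.
# Usage: enable_autohealing MIG_NAME REGION HEALTH_CHECK_NAME
enable_autohealing() {
  local mig="$1" region="$2" check="$3"

  # Port and path are placeholders; point them at the app's health endpoint.
  gcloud compute health-checks create http "$check" \
    --port=80 \
    --request-path=/healthz \
    --check-interval=10s \
    --unhealthy-threshold=3 || return 1

  # The initial delay gives new VMs time to boot before autohealing
  # starts judging them.
  gcloud compute instance-groups managed update "$mig" \
    --region="$region" \
    --health-check="$check" \
    --initial-delay=300
}
```

Health check probes also need a firewall rule that lets them reach the instances, which this sketch does not create.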

Continuous detection with Lensix

Policy gates only cover resources created through your pipeline. Anything provisioned by hand in the console, by another team, or by a script that bypasses Terraform can still drift. Keep compute_noautorestart running on a schedule so any flagged instance, no matter how it was created, surfaces quickly rather than waiting for the next outage to reveal it.


Best practices

  • Treat automatic restart as the default, not the exception. Legitimate reasons to disable it on a standard VM are rare. If you find it off, assume it was a mistake until proven otherwise.
  • Pair restart with the right maintenance policy. Use MIGRATE where supported so maintenance is invisible, and rely on automatic restart as the safety net where only TERMINATE is available.
  • Do not confuse Spot configs with standard ones. Many disabled-restart findings come from teams reusing a Spot VM template for a workload that should be standard. Review the provisioning_model at the same time.
  • Add health monitoring on top of recovery. Automatic restart brings a VM back, but it does not tell you it failed. Alert on instance termination events through Cloud Monitoring so you know a recovery happened.
  • Design for failure beyond a single VM. For anything business-critical, put it behind a load balancer or in a managed instance group. A single VM, however well configured, is still a single point of failure.
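
Before alerting is in place, a quick manual check can show whether host errors or automatic restarts have already happened in a project. The sketch below lists such events from the Compute Engine operations log. The function name recent_host_events is hypothetical, and the operation type names compute.instances.hostError and compute.instances.automaticRestart are assumptions about how these system events are labeled, so confirm them against your own operations list before relying on the output.

```shell
# Hypothetical helper: lists host-error and automatic-restart system
# events recorded as Compute Engine operations in the current project.
# The operationType values are assumed; verify them in your environment.
recent_host_events() {
  gcloud compute operations list \
    --filter="operationType=compute.instances.hostError OR operationType=compute.instances.automaticRestart" \
    --sort-by="~insertTime" \
    --format="table(insertTime, operationType, targetLink.basename(), zone.basename())"
}
```

An empty result only means no such operations are currently retained. It does not replace alerting.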

Automatic restart costs nothing to enable and removes an entire class of avoidable outages. Turn it on, gate it in CI, and let continuous checks catch the stragglers.