Enabling Automatic Repairs on Azure VM Scale Sets

Learn why Azure VM Scale Sets need automatic instance repairs enabled, the outage risk of leaving it off, and how to fix it with CLI, Terraform, and Azure Policy.

TL;DR

This check flags Azure VM Scale Sets that have automatic instance repairs turned off, which means unhealthy instances stay in rotation and quietly degrade your service. Enable automatic repairs with a health probe or application health extension so the scale set replaces broken instances on its own.

A VM Scale Set is supposed to be self-healing. You set a capacity, define a health signal, and the platform keeps that many healthy instances running. But the self-healing part is not on by default. If you skip the automatic repairs setting, the scale set will happily keep routing traffic to instances that have crashed, hung, or failed their health checks. This Lensix check, vm_autorepairs, looks at every scale set in your subscription and reports the ones where automatic instance repairs are disabled.


What this check detects

The check inspects the automaticRepairsPolicy property on each Azure VM Scale Set. When automatic repairs are enabled, the scale set continuously monitors instance health and automatically repairs (by default, replaces) any instance that still reports as unhealthy once the configured grace period has passed. When the policy is missing or set to enabled: false, no such repair happens.

Lensix marks the scale set as failing when either of the following is true:

  • The automaticRepairsPolicy block is absent entirely.
  • The policy exists but enabled is set to false.
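
The two conditions above can be sketched in a few lines of Python. This is an illustration of the rule, not Lensix's actual implementation: the function name fails_vm_autorepairs and the sample scale sets are hypothetical, and the input is assumed to be a scale set as JSON, either flattened as in az vmss show output or nested under "properties" as in a raw ARM resource.

```python
def fails_vm_autorepairs(scale_set: dict) -> bool:
    """Return True when the scale set should be flagged (hypothetical helper)."""
    # The policy may sit at the top level (flattened CLI output) or under
    # "properties" (raw ARM resource); both layouts are handled here.
    props = scale_set.get("properties") or {}
    policy = scale_set.get("automaticRepairsPolicy") or props.get("automaticRepairsPolicy")
    if not policy:
        return True  # condition 1: the block is absent entirely
    return policy.get("enabled") is not True  # condition 2: present but not enabled


# Hypothetical sample data, one entry per case.
examples = {
    "no-policy": {"name": "vmss-a"},
    "disabled": {"name": "vmss-b", "automaticRepairsPolicy": {"enabled": False}},
    "enabled": {
        "name": "vmss-c",
        "automaticRepairsPolicy": {"enabled": True, "gracePeriod": "PT30M"},
    },
}
flagged = sorted(name for name, vmss in examples.items() if fails_vm_autorepairs(vmss))
print(flagged)  # prints ['disabled', 'no-policy']
```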

Note: Automatic repairs are different from automatic OS image upgrades and from VM availability auto-recovery on single VMs. This setting is specific to scale sets and operates on instance health, not host hardware failures, which Azure handles separately through service healing.


Why it matters

Scale sets are usually the workhorses behind production traffic: web frontends, API tiers, batch workers, AKS node pools. The whole point of running a scale set instead of a handful of standalone VMs is that the platform manages instance lifecycle for you. Without automatic repairs, that promise is broken in a way that is easy to miss until it bites.

Consider a common failure mode. One instance in a five-node web tier runs out of memory and the application process dies. The instance is still running from Azure's point of view, so it stays in the load balancer backend pool. Roughly one in five requests now hits a dead instance and times out or returns a 502. Your dashboards show partial errors, on-call gets paged, and someone manually reimages or deletes the instance at 3 a.m. With automatic repairs enabled, the health probe would have marked that instance unhealthy and the scale set would have replaced it without anyone lifting a finger.

The business impact stacks up across a few dimensions:

  • Availability: Dead-but-running instances silently erode your effective capacity and serve errors to a fraction of users.
  • Toil: Engineers burn time manually identifying and cycling bad instances, work the platform was supposed to do.
  • Slow incidents: Partial failures are harder to detect than full outages, so they tend to run longer before someone notices the pattern.
  • Autoscaling skew: Unhealthy instances still count toward your desired capacity, so autoscale may not add replacements even though real capacity has dropped.

Warning: Automatic repairs require a working health signal. A repair policy with no load balancer health probe or Application Health Extension behind it has nothing to act on, and Azure rejects the configuration. The fix has two parts: a health signal and the repair policy that acts on it.


How to fix it

There are two pieces you need: a health source so the scale set knows which instances are unhealthy, and the repair policy itself. The health source can be either a load balancer health probe or the Application Health Extension. The extension is more flexible because it works without a load balancer and can check an application endpoint directly.

Option A: Enable automatic repairs with the Azure CLI

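If your scale set already sits behind a load balancer with a health probe, you can turn on repairs in one command. The --automatic-repairs-grace-period flag sets how long repairs are held off after an instance changes state (for example, right after it is created or reimaged), which gives the instance time to become healthy. The CLI takes this value in minutes; the underlying gracePeriod property stores it as an ISO 8601 duration such as PT30M, with a minimum of 10 minutes.

# Grace period is given in minutes on the CLI
az vmss update \
  --resource-group my-rg \
  --name my-scaleset \
  --enable-automatic-repairs true \
  --automatic-repairs-grace-period 30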

Option B: Add the Application Health Extension first

If there is no load balancer probe, install the Application Health Extension so the scale set has a health signal. Create an extension config file, saved here as health-extension.json:

{
  "protocol": "http",
  "port": 8080,
  "requestPath": "/health"
}

Then apply it and enable repairs:

az vmss extension set \
  --resource-group my-rg \
  --vmss-name my-scaleset \
  --name ApplicationHealthLinux \
  --publisher Microsoft.ManagedServices \
  --version 1.0 \
  --settings @health-extension.json

# Grace period is given in minutes on the CLI
az vmss update \
  --resource-group my-rg \
  --name my-scaleset \
  --enable-automatic-repairs true \
  --automatic-repairs-grace-period 30

Note: Use ApplicationHealthLinux for Linux instances and ApplicationHealthWindows for Windows. Your /health endpoint should return a 200 only when the application is genuinely ready to serve traffic, not just when the process is alive.
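
To make the "ready, not just alive" distinction concrete, here is a minimal sketch of a /health endpoint using only the Python standard library. The names readiness, make_handler, and serve are hypothetical, as is the idea of passing readiness checks in as a dictionary of callables. It assumes the extension treats a 200 as healthy and any other status as unhealthy.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict


def readiness(checks: Dict[str, Callable[[], bool]]) -> int:
    """Return 200 only if every readiness check passes, otherwise 503."""
    for check in checks.values():
        try:
            if not check():
                return 503
        except Exception:
            return 503  # a check that raises counts as not ready
    return 200


def make_handler(checks: Dict[str, Callable[[], bool]]):
    """Build a request handler class bound to the given readiness checks."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Only /health is served; every other path is a 404.
            status = readiness(checks) if self.path == "/health" else 404
            self.send_response(status)
            self.end_headers()

        def log_message(self, *args):
            pass  # keep probe traffic out of stderr

    return HealthHandler


def serve(checks: Dict[str, Callable[[], bool]], port: int = 8080) -> None:
    """Serve /health on the given port until the process is stopped."""
    HTTPServer(("0.0.0.0", port), make_handler(checks)).serve_forever()
```

In a real service you would call something like serve({"db": db_is_reachable}), where db_is_reachable is a function you write to confirm a dependency the app cannot run without. A check that only confirms the process is alive defeats the purpose.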

Option C: Bake it into your IaC

The durable fix is in your infrastructure code, so the setting survives recreation. Here is a Terraform example using azurerm_linux_virtual_machine_scale_set:

resource "azurerm_linux_virtual_machine_scale_set" "web" {
  name                = "my-scaleset"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "Standard_D2s_v5"
  instances           = 5

  # Health signal via the Application Health Extension
  extension {
    name                       = "HealthExtension"
    publisher                  = "Microsoft.ManagedServices"
    type                       = "ApplicationHealthLinux"
    type_handler_version       = "1.0"
    auto_upgrade_minor_version = true

    settings = jsonencode({
      protocol    = "http"
      port        = 8080
      requestPath = "/health"
    })
  }

  # The repair policy itself
  automatic_instance_repair {
    enabled      = true
    grace_period = "PT30M"
  }

  # ... os_profile, network, etc.
}

And the Bicep equivalent for teams on native Azure tooling:

resource scaleSet 'Microsoft.Compute/virtualMachineScaleSets@2023-09-01' = {
  name: 'my-scaleset'
  location: location
  properties: {
    automaticRepairsPolicy: {
      enabled: true
      gracePeriod: 'PT30M'
    }
    // virtualMachineProfile with health extension, etc.
  }
}

Danger: If your health probe is misconfigured and reports healthy instances as unhealthy, automatic repairs will keep replacing them in a loop. Test the health endpoint thoroughly before enabling repairs, and start with a generous grace period like PT30M so transient blips during deploys or startup do not trigger unnecessary repairs.

Verify the change

az vmss show \
  --resource-group my-rg \
  --name my-scaleset \
  --query "automaticRepairsPolicy"

You should see "enabled": true with your grace period reflected back.
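
If you want to script this verification rather than eyeball it, the following sketch parses the JSON printed by the command above. The function names and the 10-minute floor passed as min_minutes are assumptions for illustration, and the duration parser only handles hour-and-minute forms such as PT30M or PT1H30M.

```python
import json
import re


def grace_period_minutes(duration: str) -> int:
    """Convert an ISO 8601 duration such as 'PT30M' or 'PT1H30M' to minutes."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", duration)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {duration!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def repairs_ok(policy_json: str, min_minutes: int = 10) -> bool:
    """True when the policy is enabled and its grace period is at least min_minutes."""
    policy = json.loads(policy_json) or {}  # the CLI prints null when no policy is set
    if policy.get("enabled") is not True:
        return False
    # A missing gracePeriod is treated as zero minutes, so it fails the floor.
    return grace_period_minutes(policy.get("gracePeriod", "PT0M")) >= min_minutes


sample = '{"enabled": true, "gracePeriod": "PT30M"}'
print(repairs_ok(sample))  # prints True
```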


How to prevent it from happening again

Fixing one scale set by hand is fine. Stopping the next ten from shipping without repairs is the real win. Push the guardrail as far left as you can.

Azure Policy

Azure Policy can audit or deny scale sets that lack automatic repairs. A custom policy with a deny effect blocks non-compliant deployments at the control plane, before resources are even created.

{
  "if": {
    "allOf": [
      {
        "field": "type",
        "equals": "Microsoft.Compute/virtualMachineScaleSets"
      },
      {
        "field": "Microsoft.Compute/virtualMachineScaleSets/automaticRepairsPolicy.enabled",
        "notEquals": "true"
      }
    ]
  },
  "then": {
    "effect": "deny"
  }
}

Assign it at the subscription or management group scope. Start with audit effect to find existing offenders, then switch to deny once you have cleaned them up.

CI/CD policy-as-code

For IaC pipelines, run a static check before terraform apply or az deployment. A Conftest/OPA rule against your Terraform plan keeps the bad config out of the merge:

# In your pipeline, after terraform plan -out=tfplan
terraform show -json tfplan | conftest test -
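
The Rego rule itself is not shown here. As a rough sketch of the same logic in Python, the function below walks the JSON form of a Terraform plan and lists scale sets with no enabled automatic_instance_repair block. The function names and sample plan are hypothetical. It assumes the plan JSON layout in which nested blocks appear as lists under change.after, and it does not account for values that are unknown until apply.

```python
import json

VMSS_TYPES = {
    "azurerm_linux_virtual_machine_scale_set",
    "azurerm_windows_virtual_machine_scale_set",
}


def load_plan(path: str) -> dict:
    """Load the output of `terraform show -json tfplan` saved to a file."""
    with open(path) as handle:
        return json.load(handle)


def missing_repairs(plan: dict) -> list:
    """Return addresses of planned scale sets without automatic repairs enabled."""
    offenders = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") not in VMSS_TYPES:
            continue
        after = (rc.get("change") or {}).get("after")
        if after is None:
            continue  # the resource is being deleted, nothing to check
        blocks = after.get("automatic_instance_repair") or []
        if not any(block.get("enabled") is True for block in blocks):
            offenders.append(rc["address"])
    return offenders


# Hypothetical plan fragment with one compliant and one non-compliant scale set.
sample_plan = {
    "resource_changes": [
        {
            "address": "azurerm_linux_virtual_machine_scale_set.web",
            "type": "azurerm_linux_virtual_machine_scale_set",
            "change": {"after": {"automatic_instance_repair": [
                {"enabled": True, "grace_period": "PT30M"}
            ]}},
        },
        {
            "address": "azurerm_linux_virtual_machine_scale_set.batch",
            "type": "azurerm_linux_virtual_machine_scale_set",
            "change": {"after": {"automatic_instance_repair": []}},
        },
    ]
}
print(missing_repairs(sample_plan))
# prints ['azurerm_linux_virtual_machine_scale_set.batch']
```

A pipeline step could fail the build whenever the returned list is non-empty.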

Tip: Pair the policy with continuous detection in Lensix so drift gets caught even when someone edits a scale set outside of CI/CD. Policy-as-code stops new bad deploys, while a scheduled scan catches manual changes in the console that bypass your pipeline entirely.


Best practices

  • Make health checks meaningful. A probe that just checks whether the port is open will miss an app that is up but broken. Have the endpoint verify real readiness, including downstream dependencies it cannot run without.
  • Tune the grace period to your workload. Apps with long startup times need a longer grace period so instances are not repaired before they finish booting. PT30M is a safe default; shorten it only once you trust your health signal.
  • Combine repairs with rolling upgrades. Automatic repairs plus an automatic rolling upgrade policy gives you a scale set that heals itself and patches itself with controlled batches.
  • Monitor repair events. Repairs are good, but a spike in repairs means something is wrong upstream. Alert on the rate of repair actions so a flapping health check or a bad deploy surfaces quickly.
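  • Apply the same thinking everywhere, including AKS. AKS node pools are scale sets under the hood, but AKS manages them and has its own node auto-repair, so rely on that rather than editing the underlying scale set directly. The same health-and-repair thinking keeps your cluster nodes from sitting around in a broken state.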
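
As one way to picture the "monitor repair events" advice, the sketch below flags a spike when too many repairs land inside a trailing time window. Everything here is an assumption for illustration: the function name, the one-hour window, the threshold of three, and the idea that you have already collected repair timestamps from your own logging or monitoring.

```python
from datetime import datetime, timedelta
from typing import Iterable


def repair_spike(
    events: Iterable[datetime],
    now: datetime,
    window: timedelta = timedelta(hours=1),
    threshold: int = 3,
) -> bool:
    """True when at least `threshold` repairs occurred in the trailing window."""
    recent = [t for t in events if now - window <= t <= now]
    return len(recent) >= threshold


# Hypothetical repair timestamps: three in the last hour, one three hours ago.
now = datetime(2024, 1, 1, 12, 0)
events = [now - timedelta(minutes=m) for m in (5, 20, 45, 180)]
print(repair_spike(events, now))  # prints True
```

Tune the window and threshold to the size of the scale set; three repairs an hour is alarming on five instances and unremarkable on five hundred.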

Automatic instance repairs are one of those settings that cost nothing, take five minutes to enable, and save you from a category of slow, painful, partial outages. Turn them on across every scale set, gate new deployments with policy, and let the platform do the job it was built to do.