Fix Azure Scale Set Single-Zone Risk | Lensix

TL;DR

This check flags Azure VM Scale Sets deployed into a single availability zone, leaving your workload exposed to a full outage if that zone fails. Fix it by redeploying the scale set across at least two zones, ideally three, with instances balanced across them.

A VM Scale Set pinned to one availability zone looks fine right up until that zone has a problem. Then every instance in the set goes down at the same time, and the autoscaling you carefully configured does nothing to save you, because there is nowhere healthy left to scale into. This check exists to catch that single point of failure before a zone incident finds it for you.

What this check detects

The vm_scaleset_singleaz check inspects every Azure VM Scale Set in your subscription and reports any that are not spread across multiple availability zones. A scale set is flagged when its zones property is empty (regional, no zone awareness) or contains only a single zone value.

In practical terms, the check looks at the zone configuration on the scale set resource. A compliant set lists two or more zones, for example ["1", "2", "3"], and distributes its instances across them. A non-compliant set either omits zones entirely or specifies just one.
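The rule described above can be sketched as a small shell function. This illustrates the logic only and is not the check's actual implementation; the function name classify_zones and the three verdict strings are assumptions made for the example.

```shell
# classify_zones: hypothetical helper, not part of any Azure or Lensix tooling.
# Takes the scale set's zone IDs as arguments and prints a verdict.
classify_zones() {
  # Count distinct, non-empty zone IDs so a repeated value is not double counted.
  distinct=$(printf '%s\n' "$@" | sed '/^$/d' | sort -u | wc -l | tr -d ' ')
  case "$distinct" in
    0) echo "regional" ;;      # no zones set: flagged
    1) echo "single-zone" ;;   # pinned to one zone: flagged
    *) echo "multi-zone" ;;    # two or more zones: compliant
  esac
}

classify_zones          # prints: regional
classify_zones 1        # prints: single-zone
classify_zones 1 2 3    # prints: multi-zone
```

Only the multi-zone verdict passes the check; the other two are the cases it reports.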

Note: Availability zones are physically separate locations within an Azure region, each made up of one or more datacenters with independent power, cooling, and networking. Not every region supports them, and the zones available vary by region. A scale set with no zone configuration is "regional": Azure places its instances anywhere in the region without guaranteeing zone separation.

Why it matters

Availability zones are Azure's answer to datacenter-level failures. When a single zone loses power or network connectivity, resources in other zones in the same region keep running. A scale set confined to one zone gives up that protection entirely.

Here are the failure modes that bite teams in production:

Correlated failure. Every instance in a single-zone scale set shares the same physical fate. A zone outage takes down 100 percent of your capacity at once, not a fraction of it.
Autoscaling cannot rescue you. When the zone is down, new instances cannot be provisioned there either. Your autoscale rules fire and fail, so the workload stays dark until the zone recovers.
SLA gaps. Azure offers its higher VM availability SLA (99.99 percent) only when instances are spread across two or more zones. Run single-zone and you are not covered by the stronger guarantee, which matters when you are explaining downtime to customers.
Stateful surprises. If the scale set backs a stateful service or holds a leader election, losing the whole set simultaneously can corrupt state or trigger a cold start across the board.

Zone outages are not theoretical. Azure has had multiple publicized single-zone incidents where customers running zone-redundant workloads sailed through while single-zone deployments went offline. The cost of spreading across zones is close to zero. The cost of not doing it is your entire fleet at the worst possible moment.

Warning: Spreading across zones can introduce small cross-zone data transfer charges for traffic between instances in different zones. For most workloads this is negligible compared to the reliability gain, but high-throughput east-west traffic is worth measuring before you assume it is free.

How to fix it

The catch with scale sets is that the zones property is set at creation time and cannot be changed on an existing set. To make a single-zone scale set zone-redundant, you create a new scale set with the correct zone configuration and migrate traffic to it.

1. Confirm the current zone configuration

Check what zones, if any, the scale set is using:

az vmss show \
  --resource-group my-rg \
  --name my-scaleset \
  --query "zones" \
  --output json

An empty result (null or []) means regional with no zone guarantees. A single value like ["1"] means pinned to one zone. Both are flagged by this check.
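To run the same inspection across a whole subscription rather than one set at a time, something like the following could work. It is a sketch: flag_single_zone is a hypothetical helper, and the az vmss list query shown in the comment is an assumed way to produce its input that needs an authenticated Azure CLI session.

```shell
# flag_single_zone: hypothetical filter. Reads tab-separated lines of
# "name, resource group, zone count" on stdin and prints the scale sets
# that have fewer than two zones.
flag_single_zone() {
  awk -F '\t' '($3 + 0) < 2 { printf "%s\t%s\tzones=%d\n", $1, $2, $3 }'
}

# Sample input standing in for real CLI output:
printf 'web\tmy-rg\t0\napi\tmy-rg\t1\nworker\tmy-rg\t3\n' | flag_single_zone

# Assumed real-world usage (not run here):
#   az vmss list \
#     --query "[].[name, resourceGroup, length(zones || \`[]\`)]" \
#     --output tsv | flag_single_zone
```

With the sample input, the web and api sets are printed and the three-zone worker set is not.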

2. Verify the region supports multiple zones

Not all regions or VM sizes support zones. Confirm before you plan the migration:

az vm list-skus \
  --location eastus \
  --size Standard_D2s_v5 \
  --query "[].locationInfo[].zones" \
  --output json

3. Create a zone-redundant scale set

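Create the replacement set spread across the zones the region supports. The --zones flag takes a space-separated list of zone numbers. Azure balances instances across the listed zones on a best-effort basis by default; if you want a strictly even split, add --zone-balance true to the command below (check that your CLI version supports it):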

az vmss create \
  --resource-group my-rg \
  --name my-scaleset-zr \
  --image Ubuntu2204 \
  --instance-count 3 \
  --zones 1 2 3 \
  --vm-sku Standard_D2s_v5 \
  --upgrade-policy-mode Automatic \
  --load-balancer my-lb \
  --admin-username azureuser \
  --generate-ssh-keys

Note: Use an instance count that divides evenly across your chosen zones, for example a multiple of three when using zones 1, 2, and 3. This avoids lopsided distribution where one zone carries more load than the others.
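To see why the instance count matters, here is a minimal sketch of how a count splits across zones when instances are spread as evenly as possible. zone_spread is a hypothetical helper written for this illustration; it models even spreading in general and is not Azure's placement algorithm.

```shell
# zone_spread COUNT ZONES: prints the per-zone instance counts when COUNT
# instances are divided as evenly as possible across ZONES zones.
zone_spread() {
  count=$1
  zones=$2
  base=$((count / zones))    # every zone gets at least this many
  extra=$((count % zones))   # leftover instances, one each to the first zones
  i=1
  out=""
  while [ "$i" -le "$zones" ]; do
    n=$base
    if [ "$i" -le "$extra" ]; then
      n=$((base + 1))
    fi
    out="${out}${out:+ }${n}"
    i=$((i + 1))
  done
  echo "$out"
}

zone_spread 3 3   # prints: 1 1 1
zone_spread 4 3   # prints: 2 1 1
zone_spread 6 3   # prints: 2 2 2
```

With four instances, one zone carries twice the load of the others, and losing that zone removes half the capacity rather than a third.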

4. Migrate traffic and retire the old set

Attach the new scale set to the same load balancer or Application Gateway backend pool, drain connections from the old set, then validate the new one is taking traffic and healthy across all zones.
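Before retiring the old set, it helps to confirm the new one actually has instances in every zone. A minimal sketch, assuming you can list one zone ID per instance: count_per_zone is a hypothetical helper, and the az vmss list-instances query in the comment is an assumed way to feed it whose output shape may differ by orchestration mode.

```shell
# count_per_zone: reads one zone ID per line on stdin and prints
# "zone count" pairs, sorted by zone. Blank lines are ignored.
count_per_zone() {
  awk 'NF { seen[$1]++ } END { for (z in seen) print z, seen[z] }' | sort
}

# Sample input standing in for real CLI output (four instances):
printf '1\n2\n3\n1\n' | count_per_zone
# prints:
# 1 2
# 2 1
# 3 1

# Assumed real-world usage (not run here):
#   az vmss list-instances \
#     --resource-group my-rg \
#     --name my-scaleset-zr \
#     --query "[].zones[0]" \
#     --output tsv | count_per_zone
```

A zone missing from the output means no instance landed there, which is worth resolving before you cut traffic over.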

Danger: The command below permanently deletes the old scale set and all its instances. Confirm the zone-redundant replacement is serving production traffic and passing health checks before you run it. There is no undo.

az vmss delete \
  --resource-group my-rg \
  --name my-scaleset

Infrastructure as Code

If you manage scale sets with Terraform, set the zones argument and enable zone balancing. Note that changing zones on an existing resource forces a replacement, so plan the rollout accordingly:

resource "azurerm_linux_virtual_machine_scale_set" "app" {
  name                = "my-scaleset-zr"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "Standard_D2s_v5"
  instances           = 3

  zones        = ["1", "2", "3"]
  zone_balance = true

  # ... image, network, and os profile config
}

The Bicep equivalent sets the zones array on the resource and enables zone balancing through the scale set properties:

resource scaleSet 'Microsoft.Compute/virtualMachineScaleSets@2023-09-01' = {
  name: 'my-scaleset-zr'
  location: location
  zones: [ '1', '2', '3' ]
  sku: {
    name: 'Standard_D2s_v5'
    capacity: 3
  }
  properties: {
    zoneBalance: true
    // ... vm profile and upgrade policy
  }
}

How to prevent it from happening again

Recreating scale sets is painful, so the real win is making single-zone deployments impossible to ship in the first place.

Enforce with Azure Policy

Azure Policy can audit or deny scale sets that lack a zone configuration. A deny policy stops non-compliant deployments at the control plane before any resource is created:

{
  "if": {
    "allOf": [
      {
        "field": "type",
        "equals": "Microsoft.Compute/virtualMachineScaleSets"
      },
      {
        "count": {
          "field": "Microsoft.Compute/virtualMachineScaleSets/zones[*]"
        },
        "less": 2
      }
    ]
  },
  "then": {
    "effect": "deny"
  }
}

Gate it in CI/CD

Catch the misconfiguration in pull requests before it reaches Azure. Run a policy-as-code scan against your Terraform plan or Bicep templates:

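# Scan Terraform for the missing-zones pattern with a custom Checkov check.
# CKV_CUSTOM_VMSS_ZONES is a placeholder ID for a check you define in ./checkov-checks;
# the built-in CKV_AZURE_97 covers encryption at host, not zone configuration.
checkov -d ./infra --external-checks-dir ./checkov-checks --check CKV_CUSTOM_VMSS_ZONES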

# Or run a custom conftest policy against the plan JSON
terraform show -json plan.out | conftest test -

Tip: Pair the deny policy with a separate audit policy scoped to existing resources. The deny rule keeps new scale sets compliant, while the audit rule surfaces the legacy single-zone sets you still need to migrate. Lensix continuously evaluates both so you see drift the moment it appears rather than at the next manual review.

Best practices

Use at least two zones, three where available. Two zones survive a single zone failure. Three gives you headroom to lose a zone and still have N+1 capacity in what remains.
Enable zone balancing. Set zoneBalance to true so Azure keeps instance counts even. Without it, scaling events can drift capacity toward one zone over time.
Size for zone loss, not just average load. If losing a zone drops you below the capacity needed to serve traffic, you have spread the risk but not removed it. Plan instance counts so the surviving zones can carry the load.
Make load balancers zone-redundant too. A zone-redundant scale set behind a zonal load balancer or public IP still has a single point of failure at the front door. Use Standard SKU zone-redundant load balancers and public IPs.
Match zones to dependencies. If your scale set talks to a database or storage account, prefer zone-redundant SKUs for those too so the whole path survives a zone outage, not just the compute layer.
Decide between zonal and zone-redundant deliberately. Pinning a set to specific zones is occasionally valid for latency-sensitive pairing with zonal resources. If you do it on purpose, document why, because this check cannot tell intent from accident.
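The "size for zone loss" practice above comes down to simple arithmetic. As a sketch, assuming instances are spread evenly and that you know how many instances peak load needs, the hypothetical helper below works out how many to run so that any one zone can fail:

```shell
# instances_for_zone_loss NEEDED ZONES: prints the total instance count so that
# the zones left after losing one can still run NEEDED instances.
# Assumes an even spread and ZONES >= 2.
instances_for_zone_loss() {
  needed=$1
  zones=$2
  surviving=$((zones - 1))
  # Ceiling division: instances each zone must hold.
  per_zone=$(( (needed + surviving - 1) / surviving ))
  echo $((per_zone * zones))
}

instances_for_zone_loss 6 3   # prints: 9  (3 per zone, 6 left after a zone loss)
instances_for_zone_loss 6 2   # prints: 12 (6 per zone, 6 left after a zone loss)
```

For the same peak load, three zones need 50 percent headroom where two zones need 100 percent, which is one practical reason to prefer three.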

Spreading a scale set across zones is one of the cheapest reliability improvements available in Azure. It costs a configuration change at creation time and protects you from the kind of outage that takes down everything at once. Build it into your templates, enforce it with policy, and the next zone incident becomes a non-event instead of an incident review.
