Azure VM Not in an Availability Set or Zone: Why It Matters and How to Fix It

Learn why Azure VMs outside an availability set or zone miss out on Azure's high-availability SLAs, the failure risks involved, and how to redeploy and prevent it with policy.

TL;DR

This check flags Azure VMs that sit outside both an availability set and an availability zone, which means a single hardware fault or datacenter outage can take them offline with nothing to fail over to and, at best, a weak single-instance SLA. Redeploy critical workloads across availability zones, or place them in an availability set, to earn Azure's high-availability SLA.

A standalone Azure VM with no redundancy is one of those configurations that looks fine until the day it isn't. The machine runs, the app responds, everyone moves on. Then Azure performs planned maintenance on the underlying host, or a rack loses power, and your single VM disappears for minutes or hours with no fallback. This Lensix check catches that exposure before an incident does.

The vm_availabilityset check inspects each Azure VM and reports any that are deployed neither into an availability set nor pinned to an availability zone. Both are mechanisms Azure provides to spread infrastructure across independent fault domains, and a VM that uses neither has no protection against localized hardware failures.


What this check detects

Every Azure VM can be associated with one of two high-availability constructs at deployment time:

  • Availability set: distributes VMs across multiple fault domains (separate racks with independent power and network) and update domains (groups patched at different times) inside a single datacenter.
  • Availability zone: places the VM in one of several physically separate locations within an Azure region, each made up of one or more datacenters with its own power, cooling, and networking.

The check fails when a VM has neither. In Azure Resource Manager terms, that means the VM has no availabilitySet reference and an empty or absent zones array.
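To make that condition concrete, here is a minimal Python sketch of the same test applied to a VM resource document. It assumes the ARM REST shape, where the availability set reference lives under `properties.availabilitySet.id` and `zones` is a top-level array. The function name `lacks_redundancy` and the sample resources are hypothetical, and this is not how Lensix itself is implemented.

```python
from typing import Any, Mapping


def lacks_redundancy(vm: Mapping[str, Any]) -> bool:
    """Return True when a VM has neither an availability set nor a zone."""
    properties = vm.get("properties") or {}
    availability_set = properties.get("availabilitySet") or {}
    has_availability_set = bool(availability_set.get("id"))
    # An absent key, None, and an empty list all count as "no zone".
    has_zone = bool(vm.get("zones"))
    return not has_availability_set and not has_zone


# Hypothetical sample resources, trimmed to the fields the check reads.
standalone = {"name": "app-vm", "properties": {}}
zonal = {"name": "app-vm-z1", "zones": ["1"], "properties": {}}
in_set = {
    "name": "app-vm-1",
    "properties": {
        "availabilitySet": {
            "id": "/subscriptions/0000/resourceGroups/prod-rg/providers/"
                  "Microsoft.Compute/availabilitySets/app-avset"
        }
    },
}

flagged = [vm["name"] for vm in (standalone, zonal, in_set) if lacks_redundancy(vm)]
print(flagged)
```

Only the standalone VM is flagged. Note that `az vm show` flattens `properties`, so CLI output would need the lookup adjusted.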

Note: An availability set and an availability zone are mutually exclusive for a given VM. You pick one or the other at creation time. You cannot place a single VM in both, and you cannot change either after the VM is created without redeploying it.


Why it matters

The headline reason is the SLA. Microsoft only offers a financially backed uptime SLA when you meet specific redundancy requirements:

  • 99.99% when you spread two or more VMs across two or more availability zones in the same region.
  • 99.95% when you deploy two or more VMs in the same availability set.
  • 99.9% for a single VM, but only if it uses premium SSD or ultra disks for all OS and data disks.

A single VM on standard disks with no availability set or zone falls to the weakest tiers (at the time of writing, 99.5% on Standard SSD and 95% on Standard HDD), which is not a guarantee to build production on. If it goes down during maintenance, you have little contractual recourse and no architectural cushion.
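Those percentages are easier to compare as minutes of permitted downtime. The following sketch does the arithmetic for the three tiers listed above. It assumes a flat 30-day month for simplicity, whereas the actual SLA is measured against each billing month. The helper name `allowed_downtime_minutes` is illustrative.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime an SLA percentage tolerates over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


tiers = [
    ("Two or more VMs across zones", 99.99),
    ("Two or more VMs in an availability set", 99.95),
    ("Single VM on premium or ultra disks", 99.9),
]

for label, sla in tiers:
    minutes = allowed_downtime_minutes(sla)
    print(f"{label}: {sla}% allows about {minutes:.1f} minutes per 30 days")
```

That works out to roughly 4.3, 21.6, and 43.2 minutes respectively, so the zonal tier tolerates about a tenth of the downtime of the single-VM tier.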

Beyond the paperwork, the operational risk is concrete:

  • Planned maintenance. Azure regularly patches and reboots physical hosts. A VM in an availability set is rebooted in one update domain at a time, so the rest of your fleet stays up. A standalone VM has nothing to fail over to.
  • Unplanned hardware failure. Disks, power supplies, and top-of-rack switches fail. Without fault domain separation, the failure of a single rack takes your VM with it.
  • Datacenter-level outages. Power events, cooling failures, and fiber cuts have all caused individual Azure datacenters to go dark. Only availability zones protect against this, because they put your VMs in physically separate buildings.

Warning: This is not a hypothetical. Azure has had multiple region-level and zone-level incidents over the years where workloads pinned to a single zone or a single datacenter were affected while zone-redundant workloads stayed online. If a VM hosts anything customers depend on, treat this finding as real production risk.

The business impact scales with what the VM does. A throwaway dev box failing the check is noise. A standalone VM running a payment gateway, an Active Directory domain controller, or a database primary is a single point of failure that can take down an entire service.


How to fix it

The honest part first: you cannot add an existing VM to an availability set or a zone in place. Both are set at creation. Fixing this finding means redeploying the VM, so plan a maintenance window and back up your disks first.

Danger: Moving a VM into an availability set or zone requires deleting the VM and recreating it from its disks. The VM will be offline during this process, and a mistake in the disk reattachment can lose data. Snapshot every disk before you begin, and never run these steps blind against production.

Decide: availability set or availability zone?

For new and redeployed workloads, prefer availability zones. They protect against a wider class of failures (full datacenter loss, not just rack loss) and carry the higher 99.99% SLA. Use an availability set only when the target region does not support zones, or when you have a legacy dependency that requires it.

Option 1: Redeploy across availability zones (recommended)

For real redundancy you want at least two VMs in two different zones, usually behind a load balancer. Create the new VMs and pin each to a zone with the --zone flag.

# VM in zone 1
az vm create \
  --resource-group prod-rg \
  --name app-vm-z1 \
  --image Ubuntu2204 \
  --zone 1 \
  --size Standard_D2s_v5 \
  --admin-username azureuser \
  --generate-ssh-keys

# VM in zone 2
az vm create \
  --resource-group prod-rg \
  --name app-vm-z2 \
  --image Ubuntu2204 \
  --zone 2 \
  --size Standard_D2s_v5 \
  --admin-username azureuser \
  --generate-ssh-keys

To rebuild an existing VM from its current OS disk into a zone, snapshot the disk, create a zonal copy, then build a new VM from it:

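# 0. Deallocate the VM so the snapshot captures a consistent, final disk state
az vm deallocate --resource-group prod-rg --name app-vm

# 1. Snapshot the existing OS disk
az snapshot create \
  --resource-group prod-rg \
  --name app-vm-os-snap \
  --source $(az vm show -g prod-rg -n app-vm --query "storageProfile.osDisk.managedDisk.id" -o tsv)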

# 2. Create a zonal managed disk from the snapshot
az disk create \
  --resource-group prod-rg \
  --name app-vm-os-z1 \
  --source app-vm-os-snap \
  --zone 1

# 3. Create the new VM from the zonal disk
az vm create \
  --resource-group prod-rg \
  --name app-vm-z1 \
  --attach-os-disk app-vm-os-z1 \
  --os-type linux \
  --zone 1 \
  --size Standard_D2s_v5

Option 2: Place VMs in an availability set

If zones are not an option, create an availability set and deploy your VMs into it. The set must exist before the VM is created.

# Create the availability set
az vm availability-set create \
  --resource-group prod-rg \
  --name app-avset \
  --platform-fault-domain-count 2 \
  --platform-update-domain-count 5

# Create VMs into the set
az vm create \
  --resource-group prod-rg \
  --name app-vm-1 \
  --availability-set app-avset \
  --image Ubuntu2204 \
  --size Standard_D2s_v5 \
  --admin-username azureuser \
  --generate-ssh-keys

Note: An availability set only earns its SLA when it contains two or more VMs. A set with a single VM gives you nothing. The fault and update domains exist to spread multiple instances apart, so always pair this with at least two identical VMs behind a load balancer.
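A toy model shows why one VM in a set buys nothing. The sketch below assigns VMs to fault and update domains round-robin and counts how many survive the loss of a single fault domain. The round-robin placement is a simplifying assumption, since Azure decides the real placement, and the function names are hypothetical.

```python
from typing import List, Tuple


def place_vms(vm_count: int, fault_domains: int = 2,
              update_domains: int = 5) -> List[Tuple[int, int]]:
    """Assign each VM a (fault_domain, update_domain) pair, round-robin."""
    return [(i % fault_domains, i % update_domains) for i in range(vm_count)]


def worst_case_survivors(vm_count: int, fault_domains: int = 2,
                         update_domains: int = 5) -> int:
    """VMs still running after the most damaging single fault-domain loss."""
    placements = place_vms(vm_count, fault_domains, update_domains)
    return min(
        sum(1 for fd, _ in placements if fd != failed)
        for failed in range(fault_domains)
    )


for count in (1, 2, 4):
    print(f"{count} VM(s): worst case leaves {worst_case_survivors(count)} running")
```

With a single VM the worst case leaves zero instances running. From two VMs upward, at least one instance survives any single fault-domain failure in this model.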

Infrastructure as Code

If you manage VMs with Terraform, set the zone directly on the resource so the configuration is reproducible:

resource "azurerm_linux_virtual_machine" "app" {
  for_each            = toset(["1", "2"])
  name                = "app-vm-z${each.value}"
  resource_group_name = azurerm_resource_group.prod.name
  location            = azurerm_resource_group.prod.location
  size                = "Standard_D2s_v5"
  zone                = each.value
  admin_username      = "azureuser"

  network_interface_ids = [azurerm_network_interface.app[each.value].id]

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }

  admin_ssh_key {
    username   = "azureuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }
}

Tip: Front your zonal VMs with an Azure Standard Load Balancer or Application Gateway v2, both of which can be deployed zone-redundant in regions that support zones. Without a load balancer distributing traffic, two zonal VMs are just two single points of failure rather than one resilient service.


How to prevent it from happening again

One-off remediation does not stick. The same configuration drifts back in the next time someone spins up a VM through the portal. Lock it down with policy and pipeline gates.

Azure Policy

Azure Policy can audit or deny VM deployments that specify neither an availability set nor a zone. Here is an audit policy rule that flags any VM with no availability set reference and no zone assignment:

{
  "if": {
    "allOf": [
      {
        "field": "type",
        "equals": "Microsoft.Compute/virtualMachines"
      },
      {
        "field": "Microsoft.Compute/virtualMachines/availabilitySet.id",
        "exists": "false"
      },
      {
        "field": "zones",
        "exists": "false"
      }
    ]
  },
  "then": {
    "effect": "audit"
  }
}

Start with audit to measure your current exposure without breaking deployments, then move to deny for production subscriptions once teams have adjusted their templates.

Warning: Switching the effect to deny will block any new single-VM deployment in scope, including legitimate ones like short-lived test boxes. Scope deny policies to production management groups or subscriptions, and keep audit-only in non-production so engineers are not fighting the policy during experiments.
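One way to make the audit-to-deny transition a configuration change rather than a code change is to parameterize the effect. The sketch below builds the same rule with an `effect` parameter and prints it as JSON. It assumes the usual policy definition layout (`mode`, `parameters`, `policyRule`). The function name `build_policy` is hypothetical, so validate the output against your own tenant before assigning it.

```python
import json

ALLOWED_EFFECTS = ["Audit", "Deny", "Disabled"]


def build_policy(default_effect: str = "Audit") -> dict:
    """Build a policy definition body whose effect is set per assignment."""
    if default_effect not in ALLOWED_EFFECTS:
        raise ValueError(f"effect must be one of {ALLOWED_EFFECTS}")
    return {
        "mode": "Indexed",
        "parameters": {
            "effect": {
                "type": "String",
                "allowedValues": ALLOWED_EFFECTS,
                "defaultValue": default_effect,
            }
        },
        "policyRule": {
            # Same conditions as the audit rule in this article.
            "if": {
                "allOf": [
                    {"field": "type",
                     "equals": "Microsoft.Compute/virtualMachines"},
                    {"field": "Microsoft.Compute/virtualMachines/availabilitySet.id",
                     "exists": "false"},
                    {"field": "zones", "exists": "false"},
                ]
            },
            # Resolved from the assignment's parameter at evaluation time.
            "then": {"effect": "[parameters('effect')]"},
        },
    }


print(json.dumps(build_policy(), indent=2))
```

With this shape, a production assignment can pass `Deny` while non-production assignments keep the `Audit` default, all from one definition.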

CI/CD gates

Catch the problem before it reaches Azure by scanning IaC in your pipeline. Tools like Checkov and tfsec scan Terraform for misconfigurations, and both accept custom policies if the built-in rules you run do not cover missing zone or availability set configuration. A minimal pipeline step:

# Run Checkov against Terraform on every PR.
# To run only specific rules, add --check with IDs confirmed from `checkov --list`.
checkov -d ./infra --framework terraform

Wire this into your pull request checks so a VM definition without redundancy never merges.
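If your scanner has no rule for this, a small custom gate over the Terraform plan is one option. The sketch below reads the JSON form of a plan and lists VM resources that set neither `zone` nor `availability_set_id`. It assumes the `azurerm` provider's attribute names and the `resource_changes` layout of `terraform show -json`. The function name and sample plan are hypothetical.

```python
from typing import Any, List, Mapping

VM_TYPES = {"azurerm_linux_virtual_machine", "azurerm_windows_virtual_machine"}


def unprotected_vms(plan: Mapping[str, Any]) -> List[str]:
    """Return addresses of planned VMs with no zone and no availability set."""
    offenders = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") not in VM_TYPES:
            continue
        change = rc.get("change") or {}
        after = change.get("after")
        if after is None:  # resource is being destroyed, nothing to check
            continue
        # Values only known after apply appear in after_unknown, not after.
        unknown = change.get("after_unknown") or {}

        def is_set(attr: str) -> bool:
            return bool(after.get(attr)) or bool(unknown.get(attr))

        if not is_set("zone") and not is_set("availability_set_id"):
            offenders.append(rc["address"])
    return offenders


# Hypothetical plan excerpt, trimmed to the fields the gate reads.
sample_plan = {
    "resource_changes": [
        {"address": "azurerm_linux_virtual_machine.app[\"1\"]",
         "type": "azurerm_linux_virtual_machine",
         "change": {"after": {"zone": "1"}, "after_unknown": {}}},
        {"address": "azurerm_linux_virtual_machine.legacy",
         "type": "azurerm_linux_virtual_machine",
         "change": {"after": {"zone": None}, "after_unknown": {}}},
    ]
}

print(unprotected_vms(sample_plan))
```

In a pipeline you would load the output of `terraform show -json plan.out` with `json.load` and fail the job when the returned list is non-empty.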

Tip: Let Lensix run the vm_availabilityset check continuously across all your subscriptions so you catch resources created outside your IaC pipeline, like portal click-ops or scripts run by other teams. Policy and pipeline gates only cover the paths you control. Continuous scanning covers the rest.


Best practices

  • Default to zones for anything production. Availability zones give the broadest protection and the highest SLA. Reach for availability sets only when zones are unavailable in your region.
  • Two is the minimum. A single VM in a zone or set is still a single point of failure. Run at least two instances, in two zones, behind a load balancer.
  • Match data tier redundancy. A zone-redundant app tier in front of a single-zone database has just moved the single point of failure. Use zone-redundant managed services (Azure SQL with zone redundancy, zone-redundant storage) for the data layer too.
  • Use zone-redundant disks and storage. Pair zonal VMs with ZRS managed disks where supported so a zone loss does not strand your data.
  • Bake redundancy into templates, not runbooks. Make the redundant configuration the default in your modules so the easy path is also the resilient path.
  • Don't ignore the dev boxes forever. They are low priority, but they teach habits. If your standard module is redundant by default, dev and prod both come out right.
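The "two is the minimum" and "match data tier redundancy" points can be checked with basic availability arithmetic. The sketch below combines component availabilities in parallel (redundant instances) and in series (tiers that depend on each other). The 99.9% per-component figure is a made-up illustration, and the model assumes failures are independent, which real outages do not always respect.

```python
def parallel(*availabilities: float) -> float:
    """Availability of redundant components: down only if all are down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1 - a)
    return 1 - all_down


def serial(*availabilities: float) -> float:
    """Availability of a chain: up only if every component is up."""
    all_up = 1.0
    for a in availabilities:
        all_up *= a
    return all_up


single = 0.999  # hypothetical availability of one VM or one database node

app_tier = parallel(single, single)             # two VMs in two zones
with_single_db = serial(app_tier, single)       # redundant app, lone database
with_redundant_db = serial(app_tier, parallel(single, single))

print(f"redundant app tier alone:     {app_tier:.6f}")
print(f"app tier + single database:   {with_single_db:.6f}")
print(f"app tier + redundant database: {with_redundant_db:.6f}")
```

Under these assumptions the redundant app tier in front of a single database ends up slightly less available than the database alone, which is the "moved the single point of failure" effect in numbers.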

The fix here is rarely difficult. The hard part is that it has to happen at creation time, which is exactly why prevention through policy and IaC pays off more than chasing individual findings. Catch it once in the pipeline and you never deploy an unprotected VM again.