Fix Single-AZ Auto Scaling Groups in AWS | Lensix

TL;DR

This check flags Auto Scaling groups that run in only one availability zone, which means a single AZ outage can take your whole service offline. Fix it by attaching subnets from at least two, and ideally three, AZs to the ASG.

Auto Scaling groups are supposed to keep your application healthy by replacing failed instances and adjusting capacity to match demand. That promise falls apart the moment all of your instances live in a single availability zone. If that one AZ has a problem, your ASG cannot launch replacements there, your healthy capacity drops to zero, and your customers feel it.

The asg_singleaz check looks at every Auto Scaling group in your AWS account and flags any group whose configured subnets all map to the same availability zone. It is a low-effort fix with an outsized reliability payoff.

What this check detects

An Auto Scaling group launches instances into the subnets you assign to it through the VPCZoneIdentifier property. Each subnet lives in exactly one availability zone. When all of the subnets attached to an ASG belong to the same AZ, the group is effectively pinned to that zone.

Lensix marks the group as failing when:

The ASG has a single subnet attached, and therefore a single availability zone, or
The ASG has multiple subnets but every one of them resolves to the same AZ

Note: An Auto Scaling group does not pick availability zones directly. It infers them from the subnets you attach. So "spanning multiple AZs" really means "attaching subnets from multiple AZs." A common mistake is attaching three subnets that all happen to sit in us-east-1a.
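
The rule described above can be sketched in a few lines of Python. This is an illustration of the logic only, not Lensix's actual implementation: the function names and the `subnet_to_az` lookup are assumptions, and in practice the mapping would come from something like `aws ec2 describe-subnets`.

```python
def distinct_azs(subnet_ids, subnet_to_az):
    """Return the set of AZs covered by the given subnet IDs."""
    # Raises KeyError if a subnet ID is missing from the (hypothetical) lookup.
    return {subnet_to_az[subnet_id] for subnet_id in subnet_ids}


def is_single_az(vpc_zone_identifier, subnet_to_az):
    """True when the attached subnets resolve to fewer than two AZs."""
    # VPCZoneIdentifier is a comma-separated string of subnet IDs.
    subnet_ids = [s.strip() for s in vpc_zone_identifier.split(",") if s.strip()]
    return len(distinct_azs(subnet_ids, subnet_to_az)) < 2


# Two subnets attached, but both sit in us-east-1a, so the group still fails.
subnet_to_az = {
    "subnet-0aaa111": "us-east-1a",
    "subnet-0bbb222": "us-east-1a",
    "subnet-0ccc333": "us-east-1b",
}
print(is_single_az("subnet-0aaa111,subnet-0bbb222", subnet_to_az))  # True
print(is_single_az("subnet-0aaa111,subnet-0ccc333", subnet_to_az))  # False
```

Counting distinct zones rather than subnets is the point: the number of subnets says nothing about how many zones they cover.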

Why it matters

Availability zones are AWS's unit of physical fault isolation. Each AZ is one or more discrete data centers with independent power, cooling, and networking. AWS designs them so that a failure in one zone does not cascade into another. That design only helps you if your workload actually runs in more than one zone.

The single-AZ failure scenario

Picture a web tier behind a load balancer, served by an ASG of four instances, all in eu-west-1b. AWS reports a power event in that zone. Here is what happens:

Your four instances become unreachable or terminate.
The ASG tries to launch replacements, but it can only launch them in eu-west-1b, which is the zone that is down.
Replacement launches fail or the new instances are also unhealthy.
Your service has zero healthy capacity until the AZ recovers, which could be minutes or hours.

Spread those same four instances across eu-west-1a, eu-west-1b, and eu-west-1c, and the loss of one zone removes one or two of the four instances (25 to 50 percent of capacity, depending on which zone fails) instead of all of them. The remaining instances keep serving traffic, and the ASG launches replacements in the healthy zones.
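
The arithmetic behind that comparison can be sketched as follows. It assumes the ASG spreads instances as evenly as possible across its zones, which is its default goal but not a guarantee at every moment. The helper names are illustrative.

```python
def spread(instances, az_count):
    """Distribute instances as evenly as possible across az_count zones."""
    base, extra = divmod(instances, az_count)
    # The first `extra` zones each get one additional instance.
    return [base + 1 if i < extra else base for i in range(az_count)]


def worst_case_loss(instances, az_count):
    """Fraction of capacity lost if the fullest zone fails."""
    return max(spread(instances, az_count)) / instances


print(spread(4, 1), worst_case_loss(4, 1))  # [4] 1.0
print(spread(4, 3), worst_case_loss(4, 3))  # [2, 1, 1] 0.5
```

With six instances across three zones the worst case drops to a third, which is why capacity that divides evenly by the zone count is easier to reason about.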

Business impact

Downtime during AZ events. AZ-level disruptions are rare but real, and they are exactly what multi-AZ design protects against.
Failed deployments and rebalancing. Even routine instance refreshes can leave you exposed if there is no second zone to fall back on.
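SLA exposure. Some AWS availability commitments assume a multi-AZ deployment. The Amazon EC2 SLA, for example, applies its higher region-level commitment only to instances deployed across two or more AZs, so a single-AZ deployment falls back to the weaker instance-level commitment.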
Load balancer health. An Application or Network Load Balancer routes across zones. If all targets sit in one zone, you lose the cross-zone resilience the load balancer was meant to provide.

How to fix it

The fix is to attach subnets from at least two, ideally three, availability zones to the Auto Scaling group. You almost never need to touch the instances themselves.

Step 1: Find the AZs your ASG currently uses

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-app-asg \
  --query "AutoScalingGroups[0].AvailabilityZones"

If this returns a single zone, the group is single-AZ.

Step 2: Find available subnets in other AZs

List subnets in the same VPC, grouped by availability zone, so you can pick one per zone:

aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=vpc-0abc123456789def0" \
  --query "Subnets[].{ID:SubnetId,AZ:AvailabilityZone,CIDR:CidrBlock}" \
  --output table

Warning: Make sure the subnets you choose match the tier the instances belong to. Do not attach public subnets to a group of private backend instances, and confirm each subnet has the routing it needs. Mixing subnet types can break connectivity or unexpectedly expose instances.

Step 3: Update the ASG to span multiple AZs

Set VPCZoneIdentifier to a comma-separated list of subnets, one per availability zone:

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-app-asg \
  --vpc-zone-identifier "subnet-0aaa111,subnet-0bbb222,subnet-0ccc333"

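The update itself does not terminate anything. Existing instances keep running, and the ASG begins using the new zones for future launches. Be aware that the group's AZ rebalancing may then gradually launch instances in the new zones and retire some in the original zone to even out the distribution.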
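
If you have many groups to update, building the subnet list by hand gets tedious. The sketch below picks one subnet per zone from the Step 2 output, assuming you ran that command with `--output json` so each entry has the `ID`, `AZ`, and `CIDR` keys from its `--query`. The function name and the `max_azs` parameter are hypothetical, and the code does not check subnet tier or routing, so filter the input to the right tier first.

```python
def one_subnet_per_az(subnets, max_azs=3):
    """Build a VPCZoneIdentifier string with one subnet from each AZ."""
    chosen = {}
    # Sort so the choice within each AZ is deterministic (lowest subnet ID wins).
    for subnet in sorted(subnets, key=lambda s: (s["AZ"], s["ID"])):
        chosen.setdefault(subnet["AZ"], subnet["ID"])
    # Keep at most max_azs zones, in alphabetical order.
    return ",".join(chosen[az] for az in sorted(chosen)[:max_azs])


subnets = [
    {"ID": "subnet-0bbb222", "AZ": "us-east-1b", "CIDR": "10.0.2.0/24"},
    {"ID": "subnet-0aaa111", "AZ": "us-east-1a", "CIDR": "10.0.1.0/24"},
    {"ID": "subnet-0aaa999", "AZ": "us-east-1a", "CIDR": "10.0.9.0/24"},
]
print(one_subnet_per_az(subnets))  # subnet-0aaa111,subnet-0bbb222
```

The resulting string can be passed as the value of `--vpc-zone-identifier`.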

Step 4: Rebalance capacity across the new zones

The ASG will rebalance on its own over time, but you can speed it up with an instance refresh, which replaces instances in batches and places the new ones across all configured zones.

Danger: An instance refresh terminates and replaces running instances in batches. On a production workload, set a conservative MinHealthyPercentage and confirm your load balancer health checks are working before you start, or you risk dropping traffic mid-refresh.

Once those safeguards are in place, start the refresh:

aws autoscaling start-instance-refresh \
  --auto-scaling-group-name my-app-asg \
  --preferences '{"MinHealthyPercentage": 90, "InstanceWarmup": 120}'

Fixing it in Terraform

If you manage infrastructure as code, point the ASG at multiple subnets and let AWS spread instances across zones:

resource "aws_autoscaling_group" "app" {
  name                = "my-app-asg"
  min_size            = 3
  max_size            = 9
  desired_capacity    = 3

  # One subnet per AZ
  vpc_zone_identifier = [
    aws_subnet.private_a.id,
    aws_subnet.private_b.id,
    aws_subnet.private_c.id,
  ]

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }

  health_check_type         = "ELB"
  health_check_grace_period = 120
}

Fixing it in CloudFormation

AppAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "3"
    MaxSize: "9"
    DesiredCapacity: "3"
    VPCZoneIdentifier:
      - !Ref PrivateSubnetA
      - !Ref PrivateSubnetB
      - !Ref PrivateSubnetC
    LaunchTemplate:
      LaunchTemplateId: !Ref AppLaunchTemplate
      Version: !GetAtt AppLaunchTemplate.LatestVersionNumber
    HealthCheckType: ELB
    HealthCheckGracePeriod: 120

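Tip: Set the minimum size (min_size in Terraform, MinSize in CloudFormation) to a multiple of the number of AZs, for example 3 instances across 3 zones. Because the ASG tries to balance instances evenly across zones, this keeps at least one instance in every zone at minimum capacity, so the loss of any single zone still leaves instances serving traffic while replacements launch.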
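
As a quick illustration of that sizing rule, the helper below rounds a required instance count up to the next multiple of the zone count. The function name is hypothetical, and the result is only a starting point for min_size.

```python
def min_size_for_azs(required_instances, az_count):
    """Round required_instances up to the next multiple of az_count."""
    # Integer ceiling division avoids floating-point rounding.
    return ((required_instances + az_count - 1) // az_count) * az_count


print(min_size_for_azs(4, 3))  # 6
print(min_size_for_azs(3, 3))  # 3
```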

How to prevent it from happening again

One-off fixes drift back. The reliable way to stop single-AZ groups from reappearing is to enforce multi-AZ at the point where infrastructure is defined.

Policy-as-code with OPA / Conftest

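Run this Rego policy against your Terraform plan in CI to block any ASG with fewer than two subnets. It counts subnets rather than AZs, so it will not catch two subnets that share a zone. Treat it as a first-line guard, not a complete check: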

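package terraform.asg

# rego.v1 syntax: requires OPA 0.59 or newer and also parses under OPA 1.x.
import rego.v1

# Deny any planned ASG that lists fewer than two subnets.
deny contains msg if {
  resource := input.resource_changes[_]
  resource.type == "aws_autoscaling_group"
  subnets := resource.change.after.vpc_zone_identifier
  count(subnets) < 2
  msg := sprintf("ASG '%s' must span at least 2 subnets in different AZs", [resource.address])
}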
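
For pipelines that do not run Conftest, the same rule can be sketched in Python against the plan JSON. It relies on the `resource_changes`, `change.after`, and `change.after_unknown` fields of Terraform's JSON plan format. The function names are hypothetical, and unlike the Rego rule this version also reports ASGs whose subnet list is not yet known at plan time, so a reviewer can check them by hand.

```python
def _has_unknown(value):
    """True if an after_unknown entry marks any part of the value as unknown."""
    if isinstance(value, list):
        return any(_has_unknown(v) for v in value)
    return value is True


def check_plan(plan):
    """Return (violations, unverifiable) lists of ASG resource addresses."""
    violations, unverifiable = [], []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_autoscaling_group":
            continue
        change = rc.get("change", {})
        after = change.get("after")
        if after is None:  # resource is being destroyed
            continue
        unknown = (change.get("after_unknown") or {}).get("vpc_zone_identifier")
        if _has_unknown(unknown):
            unverifiable.append(rc["address"])  # subnet IDs computed at apply time
            continue
        subnets = after.get("vpc_zone_identifier") or []
        if len(subnets) < 2:
            violations.append(rc["address"])
    return violations, unverifiable
```

A CI step would load tfplan.json with `json.load`, call `check_plan`, and exit non-zero when the first list is non-empty.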

Catch it in CI/CD

Add a gate to your pipeline so plans fail before they reach production:

terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
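# Conftest evaluates the "main" package by default, so name the policy's package explicitly.
conftest test tfplan.json --policy ./policies --namespace terraform.asg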

Detect drift continuously

Policy-as-code only catches changes that flow through your pipeline. Anything created by hand in the console, by a script, or by another team slips past it. Lensix scans your live AWS accounts and flags any ASG that has drifted into a single-AZ state, so you catch the gaps that static checks miss.
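
To make the idea of a live scan concrete, here is a minimal sketch that flags single-AZ groups in the JSON returned by `aws autoscaling describe-auto-scaling-groups --output json`. It is not how Lensix works internally. It only reads the `AutoScalingGroups`, `AutoScalingGroupName`, and `AvailabilityZones` fields of that response, and the function name is an assumption.

```python
def single_az_groups(response):
    """Names of ASGs whose AvailabilityZones list has fewer than two distinct zones."""
    return [
        group["AutoScalingGroupName"]
        for group in response.get("AutoScalingGroups", [])
        if len(set(group.get("AvailabilityZones", []))) < 2
    ]


# Sample data shaped like the CLI response; real input would come from json.load().
sample = {
    "AutoScalingGroups": [
        {"AutoScalingGroupName": "my-app-asg", "AvailabilityZones": ["eu-west-1b"]},
        {"AutoScalingGroupName": "api-asg", "AvailabilityZones": ["eu-west-1a", "eu-west-1b"]},
    ]
}
print(single_az_groups(sample))  # ['my-app-asg']
```

A script like this covers one account and one region per run, which is where a continuous scanner earns its keep.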

Best practices

Use three AZs where the region supports it. Two AZs survive one zone failure, but three give you better capacity headroom and smoother rebalancing during an outage.
Keep one subnet per AZ per tier. A clean pattern is one private subnet per AZ for application instances and one public subnet per AZ for load balancers.
Pair multi-AZ ASGs with cross-zone load balancing. Application Load Balancers have cross-zone load balancing on by default. Network Load Balancers have it off by default, so enable it there if you want traffic to spread evenly regardless of how many targets live in each zone.
Use ELB health checks, not just EC2 checks. Setting health_check_type to ELB lets the ASG replace instances that fail the load balancer's application-level checks, not just instances that fail EC2 status checks.
Confirm capacity exists in each AZ. Certain instance types are not available in every AZ. Use multiple instance types in a mixed instances policy so the ASG can still launch when one type is constrained in a zone.
Test it. Periodically simulate an AZ loss by detaching or terminating all instances in one zone and confirming the group recovers. Resilience you have never exercised is resilience you cannot trust.

Note: Traffic between AZs is cheaper than cross-region traffic, but it is not free. For most workloads the reliability gain far outweighs the modest inter-AZ data transfer charges, and spreading across zones is the standard AWS-recommended pattern.

Single-AZ Auto Scaling groups are one of the cheapest reliability bugs to fix and one of the most painful to discover during an actual outage. Attach subnets from multiple zones, gate it in your pipeline, and let continuous scanning catch the drift.
