
NAT Gateway Not Highly Available: Fixing Single-AZ Egress in AWS VPCs

Learn why a single-AZ NAT gateway is a hidden VPC-wide outage risk, and how to deploy one NAT gateway per availability zone with CLI and Terraform fixes.

TL;DR

This check flags VPCs that route outbound traffic through a NAT gateway in a single availability zone. If that AZ goes down, every private subnet loses internet access. The fix is to deploy one NAT gateway per AZ and point each private subnet's route table at the gateway in its own zone.

NAT gateways are one of those pieces of AWS plumbing that work so reliably you forget they exist, right up until an availability zone has a bad day. A NAT gateway is a zonal resource: it is created in exactly one AZ and is redundant only within that zone. When that AZ has problems, private instances in every zone lose their path to the internet if they all funnel through that one gateway.

This Lensix check, vpc_natmultiaz, looks at how your private subnets reach the outside world and warns you when the answer is "through a single point of failure."


What this check detects

The check inspects each VPC and its route tables to determine whether outbound internet traffic from private subnets depends on a NAT gateway located in only one availability zone. It flags a VPC when:

  • One or more private subnets route 0.0.0.0/0 traffic to a NAT gateway, and
  • All of those gateways sit in a single AZ, or private subnets in more than one AZ share the same gateway.

In practice the most common failing pattern looks like this: a VPC spread across us-east-1a, us-east-1b, and us-east-1c, with private subnets in all three zones, but a single NAT gateway in us-east-1a that every subnet routes through.
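The two conditions above can be sketched as a small Python function. This is a minimal illustration that follows the bullets literally, not the actual Lensix implementation; the function name `is_single_az_egress` and the three dictionary shapes are assumptions made for the example.

```python
from collections import defaultdict


def is_single_az_egress(subnet_az, subnet_nat, nat_az):
    """Return True when private-subnet egress depends on a single AZ.

    subnet_az:  private subnet ID -> AZ of that subnet
    subnet_nat: private subnet ID -> NAT gateway ID on its 0.0.0.0/0 route
    nat_az:     NAT gateway ID -> AZ the gateway lives in
    (All three shapes are hypothetical, chosen for this sketch.)
    """
    if not subnet_nat:
        return False  # no private subnet routes through a NAT gateway

    # Condition: every gateway in use sits in the same AZ.
    gateway_azs = {nat_az[nat_id] for nat_id in subnet_nat.values()}
    if len(gateway_azs) == 1:
        return True

    # Condition: subnets in more than one AZ share a single gateway.
    azs_served = defaultdict(set)
    for subnet_id, nat_id in subnet_nat.items():
        azs_served[nat_id].add(subnet_az[subnet_id])
    return any(len(azs) > 1 for azs in azs_served.values())


# The failing pattern described above: three zones, one gateway in us-east-1a.
subnet_az = {"subnet-a": "us-east-1a", "subnet-b": "us-east-1b", "subnet-c": "us-east-1c"}
shared = {"subnet-a": "nat-a", "subnet-b": "nat-a", "subnet-c": "nat-a"}
flagged = is_single_az_egress(subnet_az, shared, {"nat-a": "us-east-1a"})
```

Here `flagged` is true because all three subnets depend on the one gateway in us-east-1a.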

Note: A NAT gateway provides outbound connectivity for resources in private subnets: patching servers, pulling container images, calling third-party APIs, or reaching AWS services for which you have not configured a VPC endpoint. The gateway itself is highly available within its AZ, but AWS does not replicate it across zones for you. Cross-AZ resilience is your responsibility.


Why it matters

The risk here is not theoretical. AZ-level disruptions happen, and when they do, a single-AZ NAT gateway turns a localized problem into an outage that spans your whole VPC.

The failure mode

Say your NAT gateway is in us-east-1a and that AZ experiences a network or power event. Instances in us-east-1b and us-east-1c are completely healthy, but their route tables still send outbound traffic to a gateway that is now unreachable. The result:

  • Background jobs that call external APIs start timing out.
  • Container hosts can no longer pull images, so autoscaling and deployments stall.
  • Instances fail to fetch OS updates or talk to package mirrors.
  • Anything depending on egress to a SaaS dependency (payment processors, logging vendors, identity providers) degrades.

Healthy compute in two out of three zones does you no good if its only door to the internet is locked.
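To make the blast radius concrete, the following Python sketch lists the subnets that sit in healthy zones but would still lose egress when a given AZ fails. The function name `subnets_losing_egress` and the dictionary inputs are hypothetical, chosen only for illustration.

```python
def subnets_losing_egress(failed_az, subnet_az, subnet_nat, nat_az):
    """List subnets in healthy AZs whose default route points into the failed AZ.

    subnet_az:  private subnet ID -> AZ of that subnet
    subnet_nat: private subnet ID -> NAT gateway ID on its default route
    nat_az:     NAT gateway ID -> AZ the gateway lives in
    (Hypothetical shapes for this sketch.)
    """
    return sorted(
        subnet_id
        for subnet_id, nat_id in subnet_nat.items()
        if nat_az[nat_id] == failed_az and subnet_az[subnet_id] != failed_az
    )


subnet_az = {"subnet-a": "us-east-1a", "subnet-b": "us-east-1b", "subnet-c": "us-east-1c"}
subnet_nat = {"subnet-a": "nat-a", "subnet-b": "nat-a", "subnet-c": "nat-a"}
nat_az = {"nat-a": "us-east-1a"}

collateral = subnets_losing_egress("us-east-1a", subnet_az, subnet_nat, nat_az)
```

With the single-gateway layout, `collateral` contains the subnets in us-east-1b and us-east-1c. With one gateway per AZ the list is empty, because each subnet depends only on its own zone.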

Warning: Cross-AZ data processing through a NAT gateway also costs more. When a subnet in us-east-1b routes through a gateway in us-east-1a, you pay inter-AZ data transfer charges on top of the standard NAT processing fee. A single shared gateway is both less reliable and more expensive at scale.

Business impact

For workloads with uptime commitments, a single-AZ NAT gateway quietly undermines the multi-AZ architecture you paid for everywhere else. You can run your databases in Multi-AZ mode, spread your instances across three zones, and still take a VPC-wide egress outage because of one route table decision.


How to fix it

The remediation is straightforward: deploy a NAT gateway in each AZ that has private subnets, then update each private subnet's route table to use the gateway in its own zone. Traffic stays local, and the failure of one AZ no longer affects the others.

Step 1: Identify your current setup

List your NAT gateways and the subnets they live in (each subnet belongs to exactly one AZ):

aws ec2 describe-nat-gateways \
  --filter "Name=vpc-id,Values=vpc-0abc123456789def0" \
  --query 'NatGateways[].{ID:NatGatewayId,Subnet:SubnetId,State:State}' \
  --output table

Then check which route tables point at a NAT gateway:

aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-0abc123456789def0" \
  --query 'RouteTables[].{RT:RouteTableId,Routes:Routes[?NatGatewayId!=`null`].NatGatewayId}' \
  --output json
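The CLI output shows subnet IDs rather than zones, so the results need to be joined against subnet data to see which AZ each gateway occupies. The Python sketch below shows one way to do that. It assumes you have the unfiltered JSON (no `--query`) from `describe-nat-gateways`, `describe-route-tables`, and additionally `aws ec2 describe-subnets`, which is not part of the commands above. Both helper names are hypothetical.

```python
def nat_gateway_azs(nat_gateways, subnets):
    """Map each live NAT gateway ID to the AZ of the subnet it sits in.

    nat_gateways: the "NatGateways" list from describe-nat-gateways
    subnets:      the "Subnets" list from describe-subnets
    """
    az_by_subnet = {s["SubnetId"]: s["AvailabilityZone"] for s in subnets}
    return {
        gw["NatGatewayId"]: az_by_subnet[gw["SubnetId"]]
        for gw in nat_gateways
        if gw["State"] in ("pending", "available")  # skip deleted or failed gateways
    }


def default_nat_by_route_table(route_tables):
    """Map route table ID -> NAT gateway ID used by its 0.0.0.0/0 route."""
    result = {}
    for table in route_tables:
        for route in table.get("Routes", []):
            if route.get("DestinationCidrBlock") == "0.0.0.0/0" and "NatGatewayId" in route:
                result[table["RouteTableId"]] = route["NatGatewayId"]
    return result


# Hand-written sample data shaped like the CLI responses.
nat_gateways = [
    {"NatGatewayId": "nat-1", "SubnetId": "subnet-pub-a", "State": "available"},
    {"NatGatewayId": "nat-old", "SubnetId": "subnet-pub-b", "State": "deleted"},
]
subnets = [
    {"SubnetId": "subnet-pub-a", "AvailabilityZone": "us-east-1a"},
    {"SubnetId": "subnet-pub-b", "AvailabilityZone": "us-east-1b"},
]
route_tables = [
    {"RouteTableId": "rtb-1", "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-1"},
    ]},
    {"RouteTableId": "rtb-pub", "Routes": [
        {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-1"},
    ]},
]

gateway_zones = nat_gateway_azs(nat_gateways, subnets)
defaults = default_nat_by_route_table(route_tables)
```

If every entry in `defaults` resolves to a gateway in the same zone in `gateway_zones`, the VPC matches the pattern this check flags.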

Step 2: Allocate an Elastic IP and create a NAT gateway per AZ

Each NAT gateway needs an Elastic IP and a public subnet in the target AZ (the gateway sits in a public subnet but serves the private subnets in the same zone).

# Allocate an EIP for the new gateway
ALLOC_ID=$(aws ec2 allocate-address --domain vpc \
  --query 'AllocationId' --output text)

# Create a NAT gateway in the public subnet for us-east-1b
aws ec2 create-nat-gateway \
  --subnet-id subnet-0public1bxxxxxxxx \
  --allocation-id "$ALLOC_ID" \
  --tag-specifications 'ResourceType=natgateway,Tags=[{Key=Name,Value=nat-use1b}]'

Repeat for every AZ that hosts private subnets. Wait for each gateway to reach the available state before routing traffic to it.
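One way to gate the next step on gateway readiness is to inspect the `State` field returned by `describe-nat-gateways`. The Python sketch below sorts gateways into those still pending and those that failed. The function name is hypothetical, and the input is assumed to be the unfiltered `NatGateways` list from the CLI.

```python
def gateways_not_ready(nat_gateways):
    """Return (pending_ids, failed_ids) from a describe-nat-gateways result.

    A gateway is safe to route through only once its State is "available".
    """
    pending = [gw["NatGatewayId"] for gw in nat_gateways if gw["State"] == "pending"]
    failed = [gw["NatGatewayId"] for gw in nat_gateways if gw["State"] == "failed"]
    return pending, failed


# Hand-written sample data shaped like the CLI response.
sample = [
    {"NatGatewayId": "nat-a", "State": "available"},
    {"NatGatewayId": "nat-b", "State": "pending"},
    {"NatGatewayId": "nat-c", "State": "failed"},
]
pending, failed = gateways_not_ready(sample)
```

Proceed only when both lists are empty. A failed gateway has to be recreated, not waited on. The AWS CLI also provides an `aws ec2 wait nat-gateway-available` waiter if you prefer to block in a shell script.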

Step 3: Point each private subnet's route table at its local gateway

Update (or create) a route table per AZ so the default route uses the NAT gateway in that same zone.

# For the us-east-1b private route table
aws ec2 replace-route \
  --route-table-id rtb-0privateb1xxxxxxxx \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id nat-0newgatewayb1xxxx

Danger: replace-route changes live egress routing the moment it runs. Existing connections through the old gateway may be dropped. Run this during a maintenance window or roll it out one route table at a time, validating connectivity before moving to the next AZ.
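A staged rollout is easier to review when the list of changes is computed up front. The Python sketch below builds an ordered list of route table updates, one entry per `replace-route` call, and skips tables that already point at their local gateway. The function name and input mappings are hypothetical, and it assumes each private route table serves subnets in a single AZ.

```python
def plan_route_updates(route_table_az, current_nat, nat_by_az):
    """Return [(route_table_id, target_nat_id), ...] ordered by AZ, then table ID.

    route_table_az: route table ID -> AZ of the subnets it serves
    current_nat:    route table ID -> NAT gateway ID on its default route today
    nat_by_az:      AZ -> ID of the available NAT gateway in that AZ
    (Hypothetical shapes for this sketch.)
    """
    steps = []
    for table_id in sorted(route_table_az, key=lambda t: (route_table_az[t], t)):
        az = route_table_az[table_id]
        if az not in nat_by_az:
            raise ValueError(f"no NAT gateway in {az} for route table {table_id}")
        target = nat_by_az[az]
        if current_nat.get(table_id) != target:
            steps.append((table_id, target))  # needs a replace-route call
    return steps


route_table_az = {"rtb-a": "us-east-1a", "rtb-b": "us-east-1b", "rtb-c": "us-east-1c"}
current_nat = {"rtb-a": "nat-a", "rtb-b": "nat-a", "rtb-c": "nat-a"}
nat_by_az = {"us-east-1a": "nat-a", "us-east-1b": "nat-b", "us-east-1c": "nat-c"}

steps = plan_route_updates(route_table_az, current_nat, nat_by_az)
```

Apply one step, validate egress from that zone, then move to the next. Raising an error for a zone without a gateway prevents a half-finished migration from leaving a route table pointed across AZs.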

Doing it properly with Terraform

Manual fixes drift. The durable answer is to express the per-AZ pattern in infrastructure as code so it can never regress to a single gateway:

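# Assumes aws_vpc.main, aws_subnet.public, and aws_subnet.private are declared elsewhere.
# Both subnet resources use for_each, and aws_subnet.public is keyed by AZ name.
variable "azs" {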
  default = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

resource "aws_eip" "nat" {
  for_each = toset(var.azs)
  domain   = "vpc"
}

resource "aws_nat_gateway" "this" {
  for_each      = toset(var.azs)
  allocation_id = aws_eip.nat[each.key].id
  subnet_id     = aws_subnet.public[each.key].id

  tags = {
    Name = "nat-${each.key}"
  }
}

resource "aws_route_table" "private" {
  for_each = toset(var.azs)
  vpc_id   = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.this[each.key].id
  }

  tags = {
    Name = "private-${each.key}"
  }
}

resource "aws_route_table_association" "private" {
  for_each       = aws_subnet.private
  subnet_id      = each.value.id
  route_table_id = aws_route_table.private[each.value.availability_zone].id
}

Tip: The community terraform-aws-modules/vpc module handles this for you. Set enable_nat_gateway = true and one_nat_gateway_per_az = true (and leave single_nat_gateway = false), and the module wires up one gateway per AZ with matching route tables automatically.


How to prevent it from happening again

Fixing one VPC is easy. Keeping every future VPC from shipping with a single NAT gateway is the real work. A few layers help here.

Policy as code in CI/CD

Catch the misconfiguration before it merges. With Checkov or OPA/Conftest you can scan Terraform plans and fail the build when the per-AZ pattern is missing. A simple OPA rule can assert that the number of aws_nat_gateway resources is at least equal to the number of AZs in use.

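package vpc

# Expects the plan JSON produced by `terraform show -json <planfile>` as input.
# Counting resource_changes counts every for_each instance, not just resource blocks.
deny[msg] {
  nat_count := count([r |
    r := input.resource_changes[_]
    r.type == "aws_nat_gateway"
    r.change.actions != ["delete"]
  ])
  az_count := count(input.variables.azs.value)
  nat_count < az_count
  msg := sprintf("Found %d NAT gateways for %d AZs; need one per AZ", [nat_count, az_count])
}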

Continuous monitoring

IaC gates only cover infrastructure that flows through your pipeline. Click-ops changes, console experiments, and resources created by other teams slip past. Running vpc_natmultiaz continuously in Lensix closes that gap by scanning live account state and alerting when any VPC drifts back to a single-AZ NAT setup.

Note: Pair this check with route table monitoring. A VPC can have multiple NAT gateways yet still funnel every subnet through one of them because the route tables were never updated. Resilience comes from the routing, not just from the count of gateways.


Best practices

  • One NAT gateway per AZ in production. Treat single-gateway VPCs as a non-production pattern, useful for dev or sandbox accounts where saving the gateway cost outweighs resilience.
  • Keep egress traffic in-zone. Each private subnet should route to the gateway in its own AZ. This improves resilience and avoids inter-AZ transfer charges.
  • Use VPC endpoints to cut NAT dependence. Gateway and interface endpoints for S3, DynamoDB, ECR, and other AWS services let traffic bypass the NAT gateway entirely, reducing both cost and your blast radius if a gateway fails.
  • Right-size cost expectations. Running three gateways instead of one roughly triples your fixed NAT hourly cost. For most production workloads that is a fair trade for AZ independence, but measure it against your actual egress volume.
  • Test the failure. If you run game days or chaos exercises, simulate the loss of an AZ and confirm that egress from the surviving zones keeps working.
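The cost trade-off in the list above can be put into numbers. The Python sketch below compares a shared gateway with a per-AZ layout and computes the cross-AZ volume at which the extra gateways pay for themselves. Every rate is a placeholder argument, not a quoted AWS price, and the 730-hour month is an approximation; substitute current figures for your region.

```python
HOURS_PER_MONTH = 730  # assumption: common billing approximation


def monthly_nat_cost(gateways, gb_processed, gb_cross_az,
                     hourly_rate, per_gb_rate, cross_az_per_gb):
    """Fixed gateway hours + NAT data processing + inter-AZ transfer."""
    fixed = gateways * hourly_rate * HOURS_PER_MONTH
    processing = gb_processed * per_gb_rate
    transfer = gb_cross_az * cross_az_per_gb
    return fixed + processing + transfer


def break_even_cross_az_gb(extra_gateways, hourly_rate, cross_az_per_gb):
    """Monthly cross-AZ GB at which extra gateways cost the same as the transfer they avoid."""
    return extra_gateways * hourly_rate * HOURS_PER_MONTH / cross_az_per_gb


# Placeholder rates for illustration only.
RATES = dict(hourly_rate=0.045, per_gb_rate=0.045, cross_az_per_gb=0.02)

# 6,000 GB of egress per month, 4,000 GB of it from zones without a local gateway.
single = monthly_nat_cost(1, gb_processed=6000, gb_cross_az=4000, **RATES)
per_az = monthly_nat_cost(3, gb_processed=6000, gb_cross_az=0, **RATES)
threshold = break_even_cross_az_gb(2, RATES["hourly_rate"], RATES["cross_az_per_gb"])
```

With these placeholder rates, `threshold` works out to about 3,285 GB of cross-AZ traffic per month. Above that volume the per-AZ layout is the cheaper option as well as the more resilient one.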

Warning: Do not replace NAT gateways with self-managed NAT instances just to save money. NAT instances add patching, scaling, and failover burden, and a single NAT instance is even less reliable than a single NAT gateway. If cost is the concern, lean on VPC endpoints first.

A NAT gateway in one AZ is the kind of decision that looks harmless on the architecture diagram and stays invisible until the day an availability zone fails. Running one gateway per zone is cheap insurance against an outage that would otherwise take down egress for your entire VPC.