Fix RDS Not Highly Available (Single-AZ) on AWS

TL;DR

This check flags RDS instances and Aurora clusters running in a single Availability Zone, which means a single zone outage takes your database down with no automatic failover. Fix it by enabling Multi-AZ deployment on standard RDS, which provisions a standby in a second AZ and fails over automatically, or by adding a reader in a second AZ for Aurora.

A database is usually the hardest part of your stack to make resilient. You can run a dozen stateless web servers behind a load balancer and shrug off a node failure, but your primary database holds state, and losing it means losing the heart of your application. The RDS Not Highly Available check catches one of the most common and most preventable ways that happens: an RDS instance or Aurora cluster confined to a single Availability Zone.

When everything lives in one AZ, an outage in that zone, a hardware failure on the underlying host, or even routine maintenance can knock your database offline with no automatic recovery. This post walks through what the check looks at, why single-AZ databases are a real liability, and exactly how to fix and prevent it.

What this check detects

The check inspects your RDS instances and Aurora clusters and flags any that do not span multiple Availability Zones. Concretely, it looks at two things depending on the engine:

RDS instances (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server) where the MultiAZ property is set to false.
Aurora clusters that have no reader instance in a second Availability Zone, leaving the writer as a single point of failure.
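
As a rough illustration of those two conditions, here is a minimal Python sketch. It assumes you have already fetched the `DBInstances` and `DBClusters` lists (for example from `aws rds describe-db-instances` and `aws rds describe-db-clusters`) and parsed them into dictionaries. The function name `find_single_az` is hypothetical, and this is not necessarily how the check itself is implemented.

```python
from typing import Dict, List, Optional, Set


def find_single_az(instances: List[dict], clusters: List[dict]) -> List[str]:
    """Return identifiers of databases confined to one Availability Zone.

    `instances` and `clusters` are assumed to be the DBInstances and
    DBClusters lists from the RDS describe calls.
    """
    az_by_instance: Dict[str, Optional[str]] = {
        i["DBInstanceIdentifier"]: i.get("AvailabilityZone") for i in instances
    }
    findings: List[str] = []

    # Standard RDS: standalone instances with MultiAZ set to false.
    for inst in instances:
        if inst.get("DBClusterIdentifier"):
            continue  # cluster members are judged at the cluster level below
        if not inst.get("MultiAZ", False):
            findings.append(inst["DBInstanceIdentifier"])

    # Aurora: the cluster needs a reader in a different AZ from the writer.
    for cluster in clusters:
        if not cluster.get("Engine", "").startswith("aurora"):
            continue
        writer_azs: Set[str] = set()
        reader_azs: Set[str] = set()
        for member in cluster.get("DBClusterMembers", []):
            az = az_by_instance.get(member["DBInstanceIdentifier"])
            if az is None:
                continue  # unknown placement cannot count as a second AZ
            (writer_azs if member.get("IsClusterWriter") else reader_azs).add(az)
        if not (reader_azs - writer_azs):
            findings.append(cluster["DBClusterIdentifier"])

    return findings
```

Aurora member instances are skipped in the first loop on purpose, since their own `MultiAZ` flag says nothing about whether the cluster has a failover target elsewhere.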

Note: Multi-AZ for standard RDS is not the same as a read replica. In a Multi-AZ DB instance deployment, the standby is a synchronous copy in another AZ that you cannot read from or query. It exists purely for failover. Read replicas are asynchronous and meant for scaling reads, not for high availability.

An Availability Zone is a physically isolated datacenter (or group of datacenters) within an AWS region, with its own power, cooling, and networking. AWS designs zones to fail independently, so the whole point of spanning multiple AZs is that one going dark does not take your workload with it.

Why it matters

A single-AZ database is a single point of failure dressed up as a managed service. RDS being "managed" lulls teams into thinking AWS handles resilience for them. It does not, unless you ask for it.

What actually goes wrong

AZ outages happen. AWS zones do go down. When they do, every single-AZ resource in that zone is unreachable until the zone recovers, which can be minutes or hours.
Hardware fails. The physical host running your instance can die. With a single-AZ setup, AWS has to recover your instance on new hardware, which means extended downtime while it boots and replays logs.
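Maintenance causes downtime. Patching the OS or database engine on a single-AZ instance takes the database offline for the duration. With Multi-AZ, AWS applies OS maintenance to the standby first, then fails over, cutting the outage to roughly a minute. Engine version upgrades are the exception: they update the primary and standby together, so expect downtime there even with Multi-AZ.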

Warning: Single-AZ instances also have a longer recovery window during a storage failure. Recovery generally means restoring to a new instance from your most recent backup or snapshot (a point-in-time restore), so your RPO and RTO depend entirely on your backup configuration.

The business impact

For most applications the database outage is the outage. If checkout, login, or your core API depends on that database, a single-AZ failure becomes a full incident, complete with revenue loss, breached SLAs, and an on-call engineer paged at 3 a.m. Multi-AZ does not make your database invincible, but it turns a multi-hour scramble into a roughly 60 to 120 second automated failover that most users never notice.

There are also compliance angles. Frameworks like SOC 2 and ISO 27001 expect documented availability and resilience controls. A production database with no failover capability is a finding waiting to happen during an audit.

How to fix it

The remediation depends on whether you are running standard RDS or Aurora. Both are straightforward, but they behave differently.

Standard RDS: enable Multi-AZ

You can convert an existing single-AZ instance to Multi-AZ in place. AWS creates the standby and synchronizes it for you.

Warning: Converting to Multi-AZ adds a synchronous standby, which roughly doubles the instance cost for that database. AWS builds the standby from a snapshot of the primary and then synchronizes it, which can raise latency on the primary while it runs, especially for write-heavy workloads. Plan to run the conversion during a low-traffic window.

First, check the current state of an instance:

aws rds describe-db-instances \
  --db-instance-identifier my-prod-db \
  --query 'DBInstances[0].{MultiAZ:MultiAZ,AZ:AvailabilityZone,Status:DBInstanceStatus}'

If MultiAZ is false, enable it:

aws rds modify-db-instance \
  --db-instance-identifier my-prod-db \
  --multi-az \
  --apply-immediately

Note: Without --apply-immediately, the change waits for your next maintenance window. Applying immediately starts the standby creation right away, but there is no downtime for the modification itself. The database stays available throughout.
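
Because the standby is built in the background, it helps to wait until the conversion has actually finished before treating the instance as highly available. The sketch below polls until `MultiAZ` is true and the status is `available`. It assumes `client` is an object exposing `describe_db_instances(DBInstanceIdentifier=...)`, such as a boto3 RDS client. The function name, timeout, and poll interval are illustrative choices, not values from AWS.

```python
import time
from typing import Callable


def wait_for_multi_az(
    client,
    db_id: str,
    timeout_s: float = 3600.0,
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the instance reports MultiAZ=True and status 'available'.

    Returns True on success and False if the timeout elapses first.
    `sleep` and `clock` are injectable so the loop can be tested offline.
    """
    deadline = clock() + timeout_s
    while True:
        resp = client.describe_db_instances(DBInstanceIdentifier=db_id)
        inst = resp["DBInstances"][0]
        if inst.get("MultiAZ") and inst.get("DBInstanceStatus") == "available":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

With boto3 installed and credentials configured, a call might look like `wait_for_multi_az(boto3.client("rds"), "my-prod-db")`.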

Aurora: add a reader in a second AZ

Aurora handles availability differently. Its storage is already replicated across three AZs, but the compute layer is not highly available unless you add at least one reader instance in a different AZ. If the writer fails, Aurora promotes a reader to take over.

aws rds create-db-instance \
  --db-instance-identifier my-aurora-reader-1 \
  --db-cluster-identifier my-aurora-cluster \
  --engine aurora-postgresql \
  --db-instance-class db.r6g.large \
  --availability-zone us-east-1b

Place the reader in a different AZ from the writer. With at least one reader present, Aurora's failover target is ready and promotion typically completes in under 30 seconds.
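
Choosing the reader's zone can be automated rather than hardcoded. The following sketch picks an AZ other than the writer's from the zones covered by the cluster's DB subnet group. It assumes `subnet_group` is one entry from the `DBSubnetGroups` list returned by `aws rds describe-db-subnet-groups`. The helper name `pick_reader_az` is hypothetical.

```python
from typing import Optional


def pick_reader_az(writer_az: str, subnet_group: dict) -> Optional[str]:
    """Return an AZ from the subnet group that differs from the writer's AZ.

    Returns None when the subnet group only covers the writer's zone, in
    which case the subnet group needs another subnet before a reader can
    add any availability.
    """
    zones = sorted(
        {s["SubnetAvailabilityZone"]["Name"] for s in subnet_group.get("Subnets", [])}
    )
    for zone in zones:
        if zone != writer_az:
            return zone
    return None
```

Sorting the zones keeps the choice deterministic, which makes repeated runs of the same automation produce the same placement.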

Doing it with infrastructure as code

If your databases are managed with Terraform, the fix is a one-line property change. For standard RDS:

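resource "aws_db_instance" "prod" {
  identifier        = "my-prod-db"
  engine            = "postgres"
  instance_class    = "db.r6g.large"
  allocated_storage = 100

  # Credentials, networking, and other required arguments omitted for brevity.

  multi_az = true   # the fix: provision a synchronous standby in a second AZ

  backup_retention_period   = 7
  skip_final_snapshot       = false
  final_snapshot_identifier = "my-prod-db-final" # needed when skip_final_snapshot is false
}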

For Aurora, define the cluster plus at least two instances across different zones:

resource "aws_rds_cluster" "prod" {
  cluster_identifier = "my-aurora-cluster"
  engine             = "aurora-postgresql"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

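resource "aws_rds_cluster_instance" "prod" {
  count              = 2
  identifier         = "my-aurora-${count.index}"
  cluster_identifier = aws_rds_cluster.prod.id
  instance_class     = "db.r6g.large"
  engine             = aws_rds_cluster.prod.engine

  # Pin each instance to a different AZ so a reader always sits apart from the writer.
  availability_zone = ["us-east-1a", "us-east-1b"][count.index]
}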

Tip: After enabling Multi-AZ, actually test a failover so you know it works and your application reconnects cleanly. You can trigger one on demand with aws rds reboot-db-instance --db-instance-identifier my-prod-db --force-failover. Run it in staging first.
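
To put a number on "reconnects cleanly", you can probe the database while the forced failover runs and time the gap. This is a minimal sketch: `probe` is a hypothetical callable you supply (for example, one that opens a connection and runs `SELECT 1`) that returns True on success and returns False or raises on failure. The function name and defaults are illustrative.

```python
import time
from typing import Callable, Optional


def measure_outage(
    probe: Callable[[], bool],
    max_wait_s: float = 300.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Time the gap between the first failed probe and the next success.

    Returns the outage in seconds, 0.0 if no probe failed during the
    window, or None if the database was still down when the window ended.
    """
    start = clock()
    down_since: Optional[float] = None
    while clock() - start < max_wait_s:
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # a connection error counts as a failed probe
        now = clock()
        if not ok and down_since is None:
            down_since = now
        elif ok and down_since is not None:
            return now - down_since
        sleep(interval_s)
    return None if down_since is not None else 0.0
```

Start the probe loop first, then trigger the failover from another terminal. The result is only as precise as `interval_s`, so treat it as an approximation of what your application would see.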

How to prevent it from happening again

Fixing the instances you have today is half the job. The other half is making sure no one ships a single-AZ production database next quarter.

Gate it in CI/CD with policy as code

If you use Terraform, run a policy check on every plan. Here is an Open Policy Agent (Conftest) rule that fails any RDS instance without Multi-AZ:

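package main

# Deny any aws_db_instance that will exist after apply without Multi-AZ.
deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_db_instance"
  resource.change.actions[_] != "delete"   # skip resources that are only being destroyed
  not resource.change.after.multi_az       # matches false and an unset multi_az
  msg := sprintf("RDS instance '%s' must have multi_az enabled", [resource.address])
}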

Wire that into your pipeline so a non-compliant plan never reaches apply:

terraform plan -out=tfplan
terraform show -json tfplan > plan.json
conftest test plan.json --policy ./policies
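
If your pipeline cannot run Conftest, the same gate can be approximated in a few lines of Python over the plan JSON. This sketch assumes `plan` is the parsed output of `terraform show -json`. The function name `single_az_violations` is hypothetical. It treats a missing or not-yet-known `multi_az` as a violation, which is a deliberately strict assumption on my part.

```python
from typing import List


def single_az_violations(plan: dict) -> List[str]:
    """Return addresses of aws_db_instance resources planned without Multi-AZ."""
    violations: List[str] = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_db_instance":
            continue
        after = rc.get("change", {}).get("after")
        if after is None:
            continue  # the resource is being destroyed, nothing to enforce
        if after.get("multi_az") is not True:
            violations.append(rc["address"])
    return violations
```

In a pipeline step you would load `plan.json` with `json.load`, call the function, print any addresses it returns, and exit non-zero when the list is not empty.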

Catch drift continuously

Policy gates only cover infrastructure that flows through your pipeline. Resources created by hand in the console, or instances whose Multi-AZ setting was flipped off later, slip past. A continuous check across your live accounts catches those. This is exactly what the Lensix rds_not_highly_available check does, scanning your real RDS and Aurora estate and flagging any single-AZ database regardless of how it was created.

Tip: Pair a build-time policy gate with a continuous runtime check. The gate stops bad config from merging, and the runtime scan catches anything created out of band or changed after deployment. Neither alone is enough.

Use AWS Config for an account-level guardrail

The managed AWS Config rule rds-multi-az-support evaluates instances against the Multi-AZ requirement and reports non-compliant resources automatically. You can route findings to Security Hub or trigger an EventBridge alert when something drifts.
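
To pull the rule's findings into your own tooling, you can page through its non-compliant evaluations. The sketch below assumes `config_client` exposes `get_compliance_details_by_config_rule`, as a boto3 AWS Config client does, and that the rule is deployed under its default name. The function name `non_compliant_rds` is hypothetical.

```python
from typing import List


def non_compliant_rds(config_client, rule_name: str = "rds-multi-az-support") -> List[str]:
    """Collect resource IDs that the Config rule currently marks NON_COMPLIANT."""
    resource_ids: List[str] = []
    kwargs = {"ConfigRuleName": rule_name, "ComplianceTypes": ["NON_COMPLIANT"]}
    while True:
        resp = config_client.get_compliance_details_by_config_rule(**kwargs)
        for result in resp.get("EvaluationResults", []):
            qualifier = result["EvaluationResultIdentifier"]["EvaluationResultQualifier"]
            # ResourceId is whatever identifier AWS Config recorded for the resource.
            resource_ids.append(qualifier["ResourceId"])
        token = resp.get("NextToken")
        if not token:
            return resource_ids
        kwargs["NextToken"] = token  # fetch the next page of results
```

With boto3 installed, a call might look like `non_compliant_rds(boto3.client("config"))`.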

Best practices

Treat Multi-AZ as the default for anything production. Single-AZ is fine for dev, test, and ephemeral environments where downtime costs nothing. Anywhere users or revenue depend on the database, span at least two zones.
Do not confuse high availability with backups. Multi-AZ protects against infrastructure failure. It does not protect against accidental deletes, bad migrations, or corruption, because those replicate to the standby instantly. Keep automated backups and point-in-time recovery enabled alongside it.
Consider Multi-AZ DB clusters for read scaling plus HA. The newer Multi-AZ DB cluster deployment for RDS gives you two readable standbys across three AZs, combining failover with read capacity. It is worth evaluating for high-throughput workloads.
Test failover on a schedule. A failover capability you have never exercised is a capability you do not actually know works. Run forced failovers in staging regularly and confirm your connection pooling handles the endpoint cutover.
Use the cluster endpoints, not instance endpoints. For Aurora, always connect through the cluster (writer) endpoint and the reader endpoint so failover is transparent. Hardcoding an instance endpoint defeats the whole point.

Note: For workloads that cannot tolerate even a full region outage, Multi-AZ is the floor, not the ceiling. Look at cross-region read replicas or Aurora Global Database for regional resilience. That is a separate, more expensive tier of protection, but the principle is the same: never let one failure domain hold your only copy of the data.

Enabling Multi-AZ is one of the cheapest insurance policies in AWS relative to what it protects. The cost is a second instance and a few minutes of planning. The payoff is turning an AZ outage from a full-blown incident into a brief blip nobody notices. Find your single-AZ databases, enable failover, gate it in your pipeline, and keep a continuous check running so it stays that way.
