Aurora Single Instance: Add a Reader for Failover

TL;DR

An Aurora cluster with a single DB instance has no failover target, so a host failure or AZ outage takes your database down until AWS finishes a slow recovery. Add at least one reader instance in a second Availability Zone to enable fast automatic failover.

Aurora is built around a clever split between compute and storage. Your data lives in a distributed storage layer spread across three Availability Zones, while the database instances that read and write to it are a separate, scalable tier. That design is powerful, but it lulls a lot of teams into a false sense of safety. A cluster can be fully replicating its storage six ways and still go offline the moment its single writer instance dies.

The rds_nonredundant check flags exactly this situation: an Aurora cluster that has only one DB instance attached to it. No reader, no standby, no failover target.

What this check detects

Lensix walks through your Aurora (RDS) clusters and counts the number of DB instances in each one. If a cluster has exactly one instance, the check fails with rds_nonredundant.
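To make the rule concrete, here is a sketch of the same test applied to the JSON returned by aws rds describe-db-clusters. It is an illustration, not Lensix's actual implementation. It relies on the documented response fields DBClusters, DBClusterIdentifier, Engine and DBClusterMembers. The Aurora-only engine filter and the helper name find_nonredundant are assumptions of this sketch.

```python
# Sketch only: flag Aurora clusters with exactly one DB instance, given the
# parsed JSON from `aws rds describe-db-clusters`. find_nonredundant is a
# hypothetical helper, not part of any AWS SDK.


def find_nonredundant(describe_output: dict) -> list:
    """Return identifiers of Aurora clusters that have exactly one instance."""
    flagged = []
    for cluster in describe_output.get("DBClusters", []):
        # Assumption: only Aurora engines (aurora-mysql, aurora-postgresql) are
        # in scope; Neptune, DocumentDB and Multi-AZ DB clusters are skipped.
        if not cluster.get("Engine", "").startswith("aurora"):
            continue
        # One member means one writer and no reader to promote.
        if len(cluster.get("DBClusterMembers", [])) == 1:
            flagged.append(cluster["DBClusterIdentifier"])
    return flagged


sample = {
    "DBClusters": [
        {
            "DBClusterIdentifier": "orders",
            "Engine": "aurora-postgresql",
            "DBClusterMembers": [
                {"DBInstanceIdentifier": "orders-1", "IsClusterWriter": True},
            ],
        },
        {
            "DBClusterIdentifier": "auth",
            "Engine": "aurora-mysql",
            "DBClusterMembers": [
                {"DBInstanceIdentifier": "auth-1", "IsClusterWriter": True},
                {"DBInstanceIdentifier": "auth-2", "IsClusterWriter": False},
            ],
        },
    ]
}
print(find_nonredundant(sample))  # ['orders']
```

To run it against a real account, load the output of aws rds describe-db-clusters --output json with json.load and pass the resulting dict in. A cluster with zero members is not flagged, because the rule as described targets exactly one instance.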

It is worth being precise about the terminology here, because Aurora and standard RDS Multi-AZ behave differently:

Aurora cluster: A logical grouping that owns the shared storage volume. You attach one or more instances (one writer, zero or more readers) to it.
Writer instance: The single instance that accepts writes. A cluster has at most one writer at a time.
Reader instance (replica): Read-only instances that share the same storage volume. Any reader can be promoted to writer during a failover.

Note: Aurora replicas do not copy data the way traditional read replicas do. They attach to the same underlying storage volume as the writer, so replication lag is typically in the tens of milliseconds. That shared volume is why failover to a reader is fast, usually under 30 seconds. A single-instance cluster has no equivalent target.

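With a single instance, your only recovery path after a failure is for Aurora to create a brand new instance and attach it to the storage volume. AWS describes that replacement as best-effort. It is much slower than promoting a reader, and it may not succeed at all during an event that broadly affects the Availability Zone. That is the gap this check is trying to close.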

Why it matters

The whole point of running a managed database cluster is to survive the failure of any single component. A single-instance cluster throws that away.

Hardware and host failures

Physical hosts fail. When the host running your only Aurora instance dies, AWS has to provision a replacement instance, attach it to the storage volume, and warm it up before it can serve traffic. On a cluster with a healthy reader, Aurora simply promotes the reader, often in 15 to 30 seconds. On a single-instance cluster you are looking at several minutes of hard downtime, and sometimes longer under broad infrastructure events.

Availability Zone outages

If your single instance lives in us-east-1a and that AZ has a problem, your database is unreachable. The storage layer survives because it is replicated across AZs, but with no instance in a healthy zone there is nothing to fail over to. A reader in a second AZ turns an AZ outage into a brief blip instead of a sustained incident.

Patching and maintenance windows

AWS occasionally needs to apply mandatory updates, such as operating system patches, that require an instance restart. With a reader in place, you can update the reader first, fail over to it, and then update the former writer. Writes are interrupted only for the seconds a failover takes. With one instance, the restart is your downtime.

Warning: A single-instance Aurora cluster is not covered by the Aurora SLA's higher availability tiers. AWS only commits to the stronger Multi-AZ availability guarantees when you run instances across at least two Availability Zones. Single-instance clusters fall outside that commitment.

The business impact

For a production system, the difference is stark. A failover to a reader is a sub-30-second hiccup that most users never notice. The same failure on a single-instance cluster can mean a 5 to 10 minute outage, with queued requests, dropped connections, and an incident channel full of pages. If that database backs checkout, authentication, or anything customer-facing, the cost adds up fast.
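To see why those minutes matter, here is the arithmetic for a monthly downtime budget. It assumes a 30-day month and uses example availability targets that are illustrative, not quotes of any AWS SLA. At 99.99%, the month's budget is about 4.3 minutes. One 5-minute single-instance recovery spends all of it, while the same budget absorbs eight 30-second failovers.

```python
# Illustrative arithmetic only: the availability targets below are example
# figures, not quotes of any AWS SLA. Assumes a 30-day month.

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime a period can absorb at the given availability target."""
    return days * MINUTES_PER_DAY * (1 - availability_pct / 100)


def outages_that_fit(outage_minutes: float, availability_pct: float, days: int = 30) -> int:
    """How many outages of a given length fit inside that budget."""
    return int(downtime_budget_minutes(availability_pct, days) // outage_minutes)


for target in (99.9, 99.95, 99.99):
    budget = downtime_budget_minutes(target)
    print(
        f"{target}%: {budget:.2f} min/month, "
        f"30 s failovers that fit: {outages_that_fit(0.5, target)}, "
        f"5 min outages that fit: {outages_that_fit(5, target)}"
    )
```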

How to fix it

The fix is to add at least one reader instance, placed in a different Availability Zone from the writer. You can do this without downtime.

Option 1: AWS Console

Open the RDS console and go to Databases.
Select your Aurora cluster (the row with the cluster icon, not an individual instance).
Choose Actions, then Add reader.
Pick an instance class that matches your writer so the reader can take over without a performance cliff.
Under Availability Zone, explicitly choose a zone different from the writer.
Set Failover priority to tier 0 or 1 so this reader is preferred during failover.
Create the reader. Aurora attaches it to the existing storage volume, so there is no data copy step.

Option 2: AWS CLI

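First, confirm the cluster currently has one instance:

aws rds describe-db-clusters \
  --db-cluster-identifier my-aurora-cluster \
  --query 'DBClusters[0].DBClusterMembers[*].[DBInstanceIdentifier,IsClusterWriter]' \
  --output table

DBClusterMembers does not include Availability Zones, so look up the writer's AZ separately:

aws rds describe-db-instances \
  --filters Name=db-cluster-id,Values=my-aurora-cluster \
  --query 'DBInstances[*].[DBInstanceIdentifier,AvailabilityZone]' \
  --output table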

Then add a reader in a second AZ:

aws rds create-db-instance \
  --db-instance-identifier my-aurora-reader-1 \
  --db-cluster-identifier my-aurora-cluster \
  --engine aurora-postgresql \
  --db-instance-class db.r6g.large \
  --availability-zone us-east-1b \
  --promotion-tier 1

Note: Set --engine to match your cluster (aurora-mysql or aurora-postgresql). The --promotion-tier value controls failover preference, where lower numbers are promoted first. Keep your most capable readers in tier 0 or 1.
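The tier rule can be made concrete. Aurora's documentation describes the order as lowest promotion tier first, then the largest instance among the readers in that tier, then an arbitrary choice. The sketch below encodes that ordering. Reader and size_rank are hypothetical names, and size_rank stands in for however you rank instance classes, for example by vCPU count.

```python
# Sketch of the documented Aurora failover-target ordering:
# lowest promotion tier first, then the largest instance in that tier.
# Reader and size_rank are hypothetical names for illustration.
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Reader:
    identifier: str
    promotion_tier: int  # 0 (most preferred) to 15
    size_rank: int  # hypothetical: larger instance class -> larger number


def failover_target(readers: Sequence[Reader]) -> Optional[Reader]:
    """Return the reader this ordering would promote, or None if there is none."""
    if not readers:
        # A single-instance cluster has no readers, so nothing can be promoted.
        return None
    # Sort key: lower tier wins; within a tier, larger size wins.
    # Aurora breaks remaining ties arbitrarily; min() just keeps list order.
    return min(readers, key=lambda r: (r.promotion_tier, -r.size_rank))


readers = [
    Reader("reader-small-t0", promotion_tier=0, size_rank=2),
    Reader("reader-large-t1", promotion_tier=1, size_rank=8),
    Reader("reader-large-t0", promotion_tier=0, size_rank=8),
]
print(failover_target(readers).identifier)  # reader-large-t0
```

The ordering also shows why a small reader in tier 0 is risky. Tier beats size, so that reader is promoted ahead of a larger reader in tier 1.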

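Wait until the reader is available, then confirm its AZ and that it belongs to the cluster:

aws rds wait db-instance-available \
  --db-instance-identifier my-aurora-reader-1

aws rds describe-db-instances \
  --db-instance-identifier my-aurora-reader-1 \
  --query 'DBInstances[0].[DBInstanceStatus,AvailabilityZone,DBClusterIdentifier]' \
  --output table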
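If you want to script this verification, for example as a post-deploy step, the sketch below checks the parsed JSON from aws rds describe-db-instances. You can narrow that output to one cluster with --filters Name=db-cluster-id,Values=my-aurora-cluster. It uses the documented DBInstances fields DBClusterIdentifier, DBInstanceStatus and AvailabilityZone. The helper name has_multi_az_instances is hypothetical.

```python
# Sketch: confirm a cluster has available instances in at least two AZs,
# given the parsed JSON from `aws rds describe-db-instances`.
# has_multi_az_instances is a hypothetical helper name.


def has_multi_az_instances(describe_output: dict, cluster_id: str) -> bool:
    """True if the cluster's available instances span two or more AZs."""
    zones = {
        inst["AvailabilityZone"]
        for inst in describe_output.get("DBInstances", [])
        if inst.get("DBClusterIdentifier") == cluster_id
        # Only count instances that can actually serve traffic right now.
        and inst.get("DBInstanceStatus") == "available"
    }
    # Two distinct AZs implies at least two instances, i.e. a failover target
    # that survives the loss of the writer's zone.
    return len(zones) >= 2
```

Requiring the available status keeps a reader that is still creating from counting as protection it cannot yet provide.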

Option 3: Terraform

If you manage Aurora in Terraform, the writer is usually one aws_rds_cluster_instance resource. Add a second one in a different AZ:

resource "aws_rds_cluster_instance" "writer" {
  identifier         = "my-aurora-writer"
  cluster_identifier = aws_rds_cluster.this.id
  instance_class     = "db.r6g.large"
  engine             = aws_rds_cluster.this.engine
  availability_zone  = "us-east-1a"
  promotion_tier     = 0
}

resource "aws_rds_cluster_instance" "reader" {
  identifier         = "my-aurora-reader-1"
  cluster_identifier = aws_rds_cluster.this.id
  instance_class     = "db.r6g.large"
  engine             = aws_rds_cluster.this.engine
  availability_zone  = "us-east-1b"
  promotion_tier     = 1
}

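A cleaner pattern is to drive the instance count and AZs from variables, so every cluster built from the module gets redundancy by default. Because element() wraps around the list, instances are spread round-robin across the zones you supply:

variable "instance_count" {
  type    = number
  default = 2
}

variable "availability_zones" {
  type    = list(string)
  default = ["us-east-1a", "us-east-1b"]
}

resource "aws_rds_cluster_instance" "instances" {
  count              = var.instance_count
  identifier         = "my-aurora-${count.index}"
  cluster_identifier = aws_rds_cluster.this.id
  instance_class     = "db.r6g.large"
  engine             = aws_rds_cluster.this.engine
  availability_zone  = element(var.availability_zones, count.index)
  promotion_tier     = count.index
}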

Warning: A second instance roughly doubles the compute cost of the cluster, since you pay per instance-hour. Storage and I/O are shared and not duplicated. For non-production clusters where downtime is acceptable, it is reasonable to accept this finding rather than pay for a reader. Make that decision deliberately, not by accident.
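The cost trade-off is simple arithmetic. Compute scales with the number of instances, while the shared storage volume is billed once. The sketch below uses placeholder rates, not real AWS prices, and leaves I/O charges out. It shows that the compute line doubles while the total bill grows by less than that.

```python
# Illustrative cost arithmetic. The rates are placeholders, not real AWS
# prices; substitute current pricing for your Region and instance class.
# I/O charges are left out for simplicity.
HOURS_PER_MONTH = 730  # common approximation for a month of instance-hours


def monthly_cost(instance_hourly: float, instance_count: int,
                 storage_gb: float, storage_gb_month_rate: float) -> float:
    """Compute scales with instance count; the shared storage volume is billed once."""
    compute = instance_hourly * instance_count * HOURS_PER_MONTH
    storage = storage_gb * storage_gb_month_rate
    return compute + storage


PLACEHOLDER_HOURLY = 0.25  # hypothetical $/instance-hour
PLACEHOLDER_STORAGE = 0.10  # hypothetical $/GB-month
single = monthly_cost(PLACEHOLDER_HOURLY, 1, 500, PLACEHOLDER_STORAGE)
with_reader = monthly_cost(PLACEHOLDER_HOURLY, 2, 500, PLACEHOLDER_STORAGE)
print(f"one instance: ${single:.2f}, writer + reader: ${with_reader:.2f}")
```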

Test the failover before you trust it

Adding a reader is only half the job. Confirm failover actually works by triggering one in a maintenance window:

Danger: The command below forces a failover and will briefly drop connections to the writer endpoint. Run it against production only during a planned window, and make sure your application reconnects cleanly.

aws rds failover-db-cluster \
  --db-cluster-identifier my-aurora-cluster \
  --target-db-instance-identifier my-aurora-reader-1

Watch how long your application takes to recover. If it hangs for minutes, your connection pool or DNS caching is the bottleneck, not Aurora. Point your application at the cluster's writer and reader endpoints rather than instance endpoints so failover is transparent.
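To put a number on "how long your application takes to recover", you can poll a health probe while the failover runs. The sketch below does that. The probe argument is a hypothetical callable you supply, for example one that runs SELECT 1 through your application's own connection pool. The clock and sleep parameters only exist so the function can be tested without waiting.

```python
# Sketch: time how long an application-level probe takes to succeed again
# after a failover. `probe` is a hypothetical callable you supply, for example
# one that runs SELECT 1 through your application's own connection pool.
import time
from typing import Callable, Optional


def measure_recovery(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll probe() until it returns True; return elapsed seconds, or None on timeout."""
    start = clock()
    while clock() - start < timeout_s:
        try:
            if probe():
                return clock() - start
        except Exception:
            # Connection errors are expected while the writer endpoint moves.
            pass
        sleep(interval_s)
    return None
```

Start it just before running failover-db-cluster. Probing through the application's own pool, rather than opening a fresh connection each time, is deliberate. A fresh connection resolves DNS anew and can hide exactly the pool or DNS-caching stalls described above.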

How to prevent it from happening again

Manual reminders do not scale. Bake redundancy into the systems that create clusters.

Gate it in CI/CD with policy-as-code

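If you provision Aurora through Terraform, fail the pipeline when a planned cluster has fewer than two instances. The OPA/Conftest policy below does this against the JSON plan. It resolves which cluster each aws_rds_cluster_instance belongs to from the plan's configuration section, because cluster_identifier usually references aws_rds_cluster.<name>.id, which is unknown until apply. As written, it only inspects root-module resources and clusters declared without count or for_each.

package terraform.rds

import future.keywords.contains
import future.keywords.if
import future.keywords.in

# Map an instance resource to its cluster via config references such as
# "aws_rds_cluster.this"; change.after.cluster_identifier is unknown at plan time.
instance_cluster(name) := ref if {
  some res in input.configuration.root_module.resources
  res.type == "aws_rds_cluster_instance"
  res.name == name
  some ref in res.expressions.cluster_identifier.references
  startswith(ref, "aws_rds_cluster.")
  count(split(ref, ".")) == 2
}

# Clusters that will still exist after apply.
clusters contains rc.address if {
  some rc in input.resource_changes
  rc.type == "aws_rds_cluster"
  rc.change.actions != ["delete"]
}

# Count planned, non-deleted instances per cluster address.
cluster_instance_count(cluster) := n if {
  n := count([rc |
    some rc in input.resource_changes
    rc.type == "aws_rds_cluster_instance"
    rc.change.actions != ["delete"]
    instance_cluster(rc.name) == cluster
  ])
}

deny contains msg if {
  some cluster in clusters
  cluster_instance_count(cluster) < 2
  msg := sprintf("Aurora cluster '%s' has fewer than 2 instances and cannot fail over", [cluster])
}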

Wire this into your pipeline so a one-instance cluster never reaches an apply:

terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
conftest test tfplan.json --policy policy/ --namespace terraform.rds

Tip: Pair the pipeline gate with continuous monitoring in Lensix. CI catches clusters at creation time, but Lensix catches the ones that drift, for example when someone deletes a reader during an incident and forgets to recreate it. Defense in depth applies to config checks too.

Enforce it with AWS Config

AWS Config can flag the same condition continuously across every account. The managed rule rds-multi-az-support covers standard RDS instances, and the managed rule rds-cluster-multi-az-enabled checks whether an RDS cluster spans multiple Availability Zones. For an exact instance-count check on Aurora, use a custom rule that asserts DBClusterMembers has a length of at least two. Route non-compliant findings to a Slack channel or ticket queue so they do not sit unnoticed.
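A custom rule is typically a Lambda function that reads the cluster, decides compliance, and reports back through the Config PutEvaluations API. The sketch below covers only the decision step, as a pure function. It assumes you already have the cluster's DBClusterMembers list and the resource ID that Config reports for the cluster. The Lambda wiring, the describe call, and the put_evaluations call, which also needs the ResultToken from the invoking event, are left out.

```python
# Sketch of the decision step for a custom AWS Config rule. Only the
# Evaluation dict is built here; fetching the cluster and calling
# put_evaluations (with the ResultToken from the invoking event) are omitted.
from datetime import datetime, timezone

RESOURCE_TYPE = "AWS::RDS::DBCluster"
MIN_INSTANCES = 2  # one writer plus at least one reader


def evaluate_cluster(resource_id: str, members: list, ordering_timestamp: datetime) -> dict:
    """Build one Evaluation entry from a cluster's DBClusterMembers list."""
    count = len(members)
    return {
        "ComplianceResourceType": RESOURCE_TYPE,
        "ComplianceResourceId": resource_id,
        "ComplianceType": "COMPLIANT" if count >= MIN_INSTANCES else "NON_COMPLIANT",
        "Annotation": f"Cluster has {count} instance(s); {MIN_INSTANCES} or more are needed for failover.",
        "OrderingTimestamp": ordering_timestamp,
    }


evaluation = evaluate_cluster(
    "my-aurora-cluster",  # hypothetical: use the resource ID Config reports for the cluster
    [{"DBInstanceIdentifier": "my-aurora-writer", "IsClusterWriter": True}],
    datetime.now(timezone.utc),
)
print(evaluation["ComplianceType"])  # NON_COMPLIANT
```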

Make the module the safe default

The most durable fix is to publish an internal Aurora module where two instances across two AZs is the default and reducing it requires an explicit, reviewed override. Teams should have to opt out of redundancy, not opt in.

Best practices

Run at least two instances in production, across at least two AZs. One writer plus one reader is the minimum for fast automatic failover.
Use the cluster endpoints, not instance endpoints. The writer endpoint always points at the current writer after a failover, and the reader endpoint load-balances across readers. Hardcoding instance endpoints breaks failover.
Match reader and writer instance classes. A tiny reader can technically be promoted, but it may buckle under full write load the moment it becomes the writer.
Set failover priority tiers deliberately. Aurora promotes the lowest-numbered tier first. Keep your best-sized instances in tier 0 or 1.
Consider Aurora Global Database for stricter requirements. For workloads that must keep running through a regional outage, Aurora Global Database adds secondary clusters in other Regions that can be promoted to take over writes.
Practice failover regularly. A failover path you have never tested is a failover path you do not actually have. Run game days.
Right-size for cost where redundancy is not needed. Dev and ephemeral clusters do not need a reader. Document that exception so the finding is an accepted risk rather than a blind spot.

A single-instance Aurora cluster is one of those misconfigurations that looks fine in every dashboard right up until the moment it does not. The data is safe, the metrics are green, and then a host fails and you discover there was nothing to fail over to. Adding a reader in a second AZ is a small, low-risk change that turns a multi-minute outage into a failover measured in seconds.
