Fix ElastiCache Redis With No Backup | Lensix

TL;DR

This check flags ElastiCache Redis clusters with no snapshot retention, which means you cannot restore data after a failure, accidental flush, or corruption. Set SnapshotRetentionLimit to at least 7 days and configure a snapshot window so that a recent daily snapshot is always available to restore from.

Redis is often treated as a disposable cache, something you can rebuild from the source of truth at any time. That assumption holds right up until it doesn't. Plenty of teams use ElastiCache Redis for session stores, rate limiters, leaderboards, job queues, and feature flags, where the data living in Redis is the source of truth. When a cluster has no backups configured, a single bad FLUSHALL, a node failure, or a corrupted dataset can wipe that state with no way to get it back.

The elasticache_nobackup check looks at each ElastiCache Redis cluster and confirms whether snapshot retention is configured. If retention is set to zero, automatic backups are disabled and the cluster has no recovery point.

What this check detects

ElastiCache for Redis supports automatic daily snapshots controlled by a single setting: SnapshotRetentionLimit. This value is the number of days ElastiCache keeps automatic backups. When it is set to 0, automatic backups are turned off entirely.

This check fails when a Redis cluster (or replication group) has SnapshotRetentionLimit set to 0. In that state:

No automatic daily snapshots are taken.
There is no recovery point to restore from after data loss.
You cannot seed a new cluster from a recent backup.

Note: Backups only apply to Redis. Memcached clusters do not support snapshots at all because Memcached is purely in-memory with no persistence. This check is specific to the Redis engine.
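The failing condition can be sketched in a few lines of Python. This is an illustration of the logic described above, not the actual Lensix implementation. It assumes each record is a dict shaped like an entry in the CacheClusters list returned by the DescribeCacheClusters API (Engine, CacheClusterId, SnapshotRetentionLimit); the function name and sample IDs are hypothetical.

```python
from typing import Iterable


def clusters_without_backup(cache_clusters: Iterable[dict]) -> list[str]:
    """Return the IDs of Redis cache clusters whose automatic backups are off."""
    flagged = []
    for cluster in cache_clusters:
        # Memcached has no snapshot support, so it is out of scope for this check.
        if cluster.get("Engine") != "redis":
            continue
        # Assumption: a missing SnapshotRetentionLimit is treated the same as 0.
        if cluster.get("SnapshotRetentionLimit", 0) == 0:
            flagged.append(cluster["CacheClusterId"])
    return flagged


# Hypothetical sample data for illustration only.
sample = [
    {"CacheClusterId": "sessions-001", "Engine": "redis", "SnapshotRetentionLimit": 0},
    {"CacheClusterId": "queue-001", "Engine": "redis", "SnapshotRetentionLimit": 7},
    {"CacheClusterId": "pages-001", "Engine": "memcached"},
]
print(clusters_without_backup(sample))  # ['sessions-001']
```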

Why it matters

The risk here depends entirely on what role Redis plays in your architecture, and that role tends to drift over time. A cache that started as a pure read-through layer often grows into something that holds data nobody else has.

Real-world failure scenarios

Accidental flush. An engineer connects to the wrong cluster and runs FLUSHALL during debugging. Without a snapshot, every key is gone instantly and permanently.
Application bug. A deploy ships code that overwrites or deletes keys it shouldn't. By the time alerts fire, the damage is done.
Node or AZ failure. Even with replication, some failure modes (for example, losing a primary and all of its replicas together) can leave you needing to rebuild from a known good state. A snapshot gives you that starting point.
Migration and cloning. Snapshots are how you spin up a copy of production data in a staging cluster or move to a different node type. No backups means no clean way to clone.

For session stores, losing Redis means logging out every active user. For a rate limiter, it can mean a thundering herd of requests that were previously throttled. For a job queue backed by Redis, it can mean lost jobs that never get processed. None of these are theoretical, and all of them are avoidable for the cost of keeping a few days of snapshots.

Warning: Snapshots are not free. ElastiCache stores one snapshot per active cluster at no charge and bills any additional backup storage per GB-month. A 7-day retention on a large cluster will add to your backup storage bill. The cost is usually small relative to the data it protects, but budget for it.
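For budgeting, a rough back-of-the-envelope estimate is usually enough. The sketch below assumes one automatic snapshot per day, each roughly the size of the dataset, with one snapshot's worth of storage free. The function name is hypothetical and the per-GB price is a placeholder, so check the current ElastiCache pricing page for your region before relying on the number.

```python
def estimated_monthly_backup_cost(
    snapshot_gb: float,
    retention_days: int,
    price_per_gb_month: float,
    free_snapshots: int = 1,
) -> float:
    """Estimate the monthly backup storage cost for one cluster.

    Assumes one full-size snapshot is retained per day of retention and
    that `free_snapshots` of them are not billed.
    """
    billable_snapshots = max(retention_days - free_snapshots, 0)
    return billable_snapshots * snapshot_gb * price_per_gb_month


# Placeholder inputs: a 10 GB dataset, 7-day retention, 0.085 per GB-month.
cost = estimated_monthly_backup_cost(10, 7, 0.085)
print(round(cost, 2))
```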

How to fix it

You can enable backups on an existing cluster without recreating it. There are three common paths: the console, the AWS CLI, and infrastructure as code.

Option 1: AWS Console

Open the ElastiCache console and select Redis clusters.
Select the cluster or replication group, then choose Modify.
Under Backup, enable automatic backups.
Set Backup retention period to 7 days (or higher for critical data).
Set a Backup window during low-traffic hours.
Apply the change immediately or during the next maintenance window.

Option 2: AWS CLI

For a replication group (cluster mode disabled or enabled), modify the retention limit and snapshot window:

aws elasticache modify-replication-group \
  --replication-group-id my-redis-group \
  --snapshot-retention-limit 7 \
  --snapshot-window 03:00-05:00 \
  --apply-immediately

For a standalone cache cluster (legacy single-node, no replication group):

aws elasticache modify-cache-cluster \
  --cache-cluster-id my-redis-node \
  --snapshot-retention-limit 7 \
  --snapshot-window 03:00-05:00 \
  --apply-immediately

Note: The --snapshot-window is in UTC and must be at least 60 minutes long. Pick a window that does not overlap your maintenance window, and aim for your lowest traffic period since snapshotting adds CPU and memory pressure during the dump.
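If you generate windows programmatically, it can help to validate them before calling the API. The following is a minimal sketch, assuming windows are written as HH:MM-HH:MM in UTC. The helper names are hypothetical, and the overlap check compares times of day only, so pass just the time portion of a maintenance window (which also carries a weekday).

```python
DAY_MINUTES = 24 * 60


def _to_minutes(hhmm: str) -> int:
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def window_length(window: str) -> int:
    """Length of an HH:MM-HH:MM window in minutes, allowing a wrap past midnight."""
    start, end = (_to_minutes(part) for part in window.split("-"))
    return (end - start) % DAY_MINUTES


def is_valid_snapshot_window(window: str, minimum_minutes: int = 60) -> bool:
    """True when the window meets the 60-minute minimum."""
    return window_length(window) >= minimum_minutes


def windows_overlap(first: str, second: str) -> bool:
    """True when two time-of-day windows share at least one minute."""
    def covered(window: str) -> set[int]:
        start = _to_minutes(window.split("-")[0])
        return {(start + i) % DAY_MINUTES for i in range(window_length(window))}

    return bool(covered(first) & covered(second))


print(is_valid_snapshot_window("03:00-05:00"))         # True
print(is_valid_snapshot_window("03:00-03:30"))         # False
print(windows_overlap("03:00-05:00", "04:30-05:30"))   # True
```

Expanding each window into a set of minutes is not the most efficient approach, but it keeps the midnight wrap-around case easy to reason about.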

You can also take an immediate manual snapshot before making other changes, which is good practice before any risky operation:

aws elasticache create-snapshot \
  --replication-group-id my-redis-group \
  --snapshot-name my-redis-group-pre-change

Option 3: Terraform

If you manage ElastiCache through Terraform, set the retention attributes directly on the resource:

resource "aws_elasticache_replication_group" "redis" {
  replication_group_id = "my-redis-group"
  description          = "Primary Redis cluster"
  node_type            = "cache.r6g.large"
  num_cache_clusters   = 2
  engine               = "redis"

  snapshot_retention_limit = 7
  snapshot_window          = "03:00-05:00"

  # Keep a final snapshot if the group is ever destroyed
  final_snapshot_identifier = "my-redis-group-final"
}

Tip: Set final_snapshot_identifier so that a terraform destroy leaves you a recovery point instead of vaporizing the data. It costs nothing while the cluster lives and saves you on the day a destroy happens by mistake.

CloudFormation

RedisGroup:
  Type: AWS::ElastiCache::ReplicationGroup
  Properties:
    ReplicationGroupId: my-redis-group
    ReplicationGroupDescription: Primary Redis cluster
    Engine: redis
    CacheNodeType: cache.r6g.large
    NumCacheClusters: 2
    SnapshotRetentionLimit: 7
    SnapshotWindow: "03:00-05:00"

How to prevent it from happening again

Fixing one cluster is easy. Making sure the next cluster someone spins up is not missing backups is the part that actually keeps you out of trouble. Bake the requirement into the places where clusters get created.

Gate it in CI with policy as code

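If you provision with Terraform, add an OPA/Conftest policy that rejects any replication group with fewer than seven days of retention, including groups that leave the attribute unset. The policy uses import rego.v1, so it needs a reasonably recent OPA or Conftest release:

package main

import rego.v1

# Deny any created or updated replication group that keeps fewer than
# 7 days of automatic snapshots.
deny contains msg if {
  some resource in input.resource_changes
  resource.type == "aws_elasticache_replication_group"
  after := resource.change.after
  is_object(after) # skip destroys, where "after" is null
  retention := retention_days(after)
  retention < 7
  msg := sprintf(
    "ElastiCache group '%s' must set snapshot_retention_limit >= 7 (got %d)",
    [after.replication_group_id, retention],
  )
}

# An unset (null or missing) limit counts as 0 days.
retention_days(after) := after.snapshot_retention_limit if {
  is_number(after.snapshot_retention_limit)
}

retention_days(after) := 0 if {
  not is_number(after.snapshot_retention_limit)
}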

Run it against a plan in your pipeline so a non-compliant change never reaches apply:

terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json
conftest test tfplan.json

Detect drift at runtime

Policy as code catches new infrastructure, but it does not catch someone setting retention to zero through the console after the fact. To catch that drift, run an AWS Config custom rule or a scheduled Lensix scan, either of which will surface any cluster that falls back to zero retention.

Tip: Lensix runs the elasticache_nobackup check continuously across all your accounts and regions, so you find a misconfigured cluster within the scan interval rather than during an incident. Pair it with the policy-as-code gate above to cover both new and existing resources.

Best practices

Match retention to recovery needs. Seven days is a reasonable default. For clusters holding business-critical state, go to 14 or 35 days, the latter being the ElastiCache maximum for automatic snapshots.
Schedule snapshots during low traffic. Snapshotting can fork the Redis process and consumes extra memory and CPU while the dump runs. Put the window in your quietest hours and make sure your node has enough free memory (aim for 25 percent headroom) so the fork does not trigger swapping.
Take manual snapshots before risky changes. Before a node type change, engine upgrade, or large data migration, create a named manual snapshot. Manual snapshots persist until you delete them, unlike automatic ones that age out.
Export critical snapshots to S3. For long-term retention beyond 35 days or for cross-account copies, export snapshots to an S3 bucket. This decouples your recovery point from the cluster lifecycle.
Test your restores. A backup you have never restored is a guess, not a recovery plan. Periodically spin up a cluster from a snapshot and verify the data is intact.
Treat Redis data as real data when it is. If your application cannot rebuild the cache from another source within an acceptable window, then Redis holds durable data and deserves the same backup discipline as a database.
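The 25 percent headroom guideline above is easy to check mechanically. This is a minimal sketch, assuming you already have the node's used and maximum memory in bytes (for example from the used_memory and maxmemory fields of Redis INFO). The function name is hypothetical and the threshold is the guideline from this article, not a hard limit.

```python
def has_snapshot_headroom(
    used_bytes: int,
    max_bytes: int,
    required_free_fraction: float = 0.25,
) -> bool:
    """True when at least `required_free_fraction` of memory is still free."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    free_bytes = max_bytes - used_bytes
    return free_bytes >= required_free_fraction * max_bytes


GIB = 1024 ** 3
print(has_snapshot_headroom(6 * GIB, 8 * GIB))  # True: exactly 25 percent free
print(has_snapshot_headroom(7 * GIB, 8 * GIB))  # False: only 12.5 percent free
```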

Danger: Restoring from a snapshot creates a new cluster from that point in time. It does not roll back your existing cluster in place. Always restore into a new cluster, validate the data, then cut traffic over. Never assume a restore will merge with live data, because it will not.
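One way to make the "always restore into a new cluster" rule hard to violate is to build the restore request through a helper that refuses to reuse the live group's ID. The sketch below only assembles parameters. It assumes they would be passed to the boto3 create_replication_group call, whose ReplicationGroupId, ReplicationGroupDescription, and SnapshotName parameters are used here. The helper name and IDs are hypothetical, and a real restore typically needs further parameters such as node type and subnet group.

```python
def build_restore_request(
    snapshot_name: str,
    source_group_id: str,
    new_group_id: str,
    description: str = "Restored from snapshot for validation",
) -> dict:
    """Build create_replication_group parameters for a restore into a NEW group."""
    if new_group_id == source_group_id:
        # A restore never rolls back or merges into the existing group.
        raise ValueError("restore target must differ from the source replication group")
    return {
        "ReplicationGroupId": new_group_id,
        "ReplicationGroupDescription": description,
        "SnapshotName": snapshot_name,
    }


# Hypothetical identifiers for illustration only.
params = build_restore_request(
    snapshot_name="my-redis-group-pre-change",
    source_group_id="my-redis-group",
    new_group_id="my-redis-group-restore",
)
print(params["ReplicationGroupId"])  # my-redis-group-restore
```

After the new group is available, validate the data there and only then repoint the application.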

Backups on a cache feel like overkill until the one time you need them. The setting is a single number, the cost is small, and the downside of skipping it is irreversible data loss. Set the retention limit, gate it in your pipeline, and move on.
