Fix SQS Queue Missing Dead Letter Queue (DLQ) | AWS

TL;DR

This check flags SQS queues that have no dead letter queue (DLQ) configured. Without a DLQ, messages that repeatedly fail processing are redelivered until their retention period expires and are then silently deleted, hiding bugs and causing data loss. Fix it by creating a DLQ and attaching a redrive policy with a sensible maxReceiveCount.

A dead letter queue is one of those things you do not think about until a poison message takes down a consumer in production. SQS will happily redeliver a message that your code cannot process, over and over, while your function or worker burns CPU, racks up cost, and produces nothing but error logs. This check exists to make sure failed messages have somewhere to go.

Lensix raises sqs_nodeadletter when it finds an SQS queue with no redrive policy pointing at a dead letter queue. It is a low-effort fix that pays off the first time something goes wrong downstream.

What this check detects

The check inspects each SQS queue in your account and looks at its RedrivePolicy attribute. If that attribute is absent or empty, the queue has no dead letter queue and the check fails.

A redrive policy is a small piece of JSON attached to a source queue. It tells SQS two things:

deadLetterTargetArn — the ARN of the queue that should receive failed messages.
maxReceiveCount — how many times a message can be received and returned to the queue before SQS moves it to the DLQ.

When that policy is missing, there is no upper bound on how many times a message can be retried. The queue keeps redelivering the message until it reaches its retention limit, then deletes it.

Note: A dead letter queue is just an ordinary SQS queue. There is no special queue type. It becomes a DLQ purely because another queue's redrive policy points at it. That also means a DLQ must be the same type (standard or FIFO) as its source queue.
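The detection logic described above can be sketched in a few lines of Python. This is an illustration only, not how Lensix implements the check: the function name has_dead_letter_queue is hypothetical, and the input is assumed to be the "Attributes" map that `aws sqs get-queue-attributes --attribute-names All` returns.

```python
import json


def has_dead_letter_queue(attributes: dict) -> bool:
    """Return True if the queue attributes carry a usable redrive policy."""
    raw_policy = attributes.get("RedrivePolicy")
    if not raw_policy:
        # Attribute absent or empty: no DLQ is configured.
        return False
    try:
        policy = json.loads(raw_policy)
    except json.JSONDecodeError:
        # An unparseable policy is treated as missing.
        return False
    if not isinstance(policy, dict):
        return False
    return bool(policy.get("deadLetterTargetArn"))


# Example attribute maps (values are illustrative).
queue_without_dlq = {"VisibilityTimeout": "30"}
queue_with_dlq = {
    "VisibilityTimeout": "30",
    "RedrivePolicy": json.dumps({
        "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:my-app-queue-dlq",
        "maxReceiveCount": 5,
    }),
}

print(has_dead_letter_queue(queue_without_dlq))  # False
print(has_dead_letter_queue(queue_with_dlq))     # True
```

Note that RedrivePolicy is itself a JSON string inside the attribute map, which is why the sketch parses it before looking for deadLetterTargetArn.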

Why it matters

Skipping a DLQ feels harmless on day one. The cost shows up later, usually during an incident.

Poison messages cause infinite retries

Imagine a consumer that parses each message as JSON. One day an upstream service sends a malformed payload. Your consumer throws, never deletes the message, and the visibility timeout expires. SQS redelivers it. Your consumer throws again. This loop continues for the message's full retention period, which can be up to 14 days. With a Lambda trigger, that single bad message can consume invocations continuously and trigger error alarms that never clear.

Silent data loss

Once a message exceeds the retention period, SQS deletes it permanently. If you never set up a DLQ, you have no record that it existed and no way to replay it. For an order-processing or payment-event pipeline, that is lost business data with no audit trail.

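Warning: Continuous redelivery of poison messages is a real cost driver. Each receive is a billable request, and with a Lambda event source mapping a stuck message is retried roughly once per visibility timeout for as long as it is retained. At a 30-second visibility timeout and 14-day retention that is about 40,000 invocations for one message, multiplied by every poison message in the queue. A DLQ caps that blast radius.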
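As a rough check on how many redeliveries one stuck message can produce, the arithmetic can be worked through in Python. This is a back-of-the-envelope sketch with hypothetical function names: it assumes the consumer polls continuously, fails immediately, and never deletes the message, and it ignores any backoff the consumer or Lambda applies.

```python
def deliveries_without_dlq(retention_seconds: int, visibility_timeout_seconds: int) -> int:
    """Approximate deliveries of one never-deleted message before it expires."""
    # The message becomes visible again once per visibility timeout.
    return retention_seconds // visibility_timeout_seconds


def deliveries_with_dlq(retention_seconds: int, visibility_timeout_seconds: int,
                        max_receive_count: int) -> int:
    """Deliveries from the source queue once a redrive policy caps the retries."""
    uncapped = deliveries_without_dlq(retention_seconds, visibility_timeout_seconds)
    return min(max_receive_count, uncapped)


FOURTEEN_DAYS = 14 * 24 * 60 * 60  # 1,209,600 seconds, the maximum retention

print(deliveries_without_dlq(FOURTEEN_DAYS, 30))   # 40320
print(deliveries_with_dlq(FOURTEEN_DAYS, 30, 5))   # 5
```

The gap between the two numbers is the point of the redrive policy: the source queue stops paying for a message after maxReceiveCount attempts instead of after the retention period.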

Lost visibility into failures

A DLQ is also a monitoring surface. When messages land in it, that is a clear signal something downstream is broken. Without one, failures are scattered across consumer logs and easy to miss. With one, you can alarm on ApproximateNumberOfMessagesVisible on the DLQ and get paged the moment processing starts failing.

How to fix it

The fix has three steps: create a dead letter queue, look up its ARN, then attach a redrive policy to the source queue that points at it.

Step 1: Create the dead letter queue (CLI)

aws sqs create-queue \
  --queue-name my-app-queue-dlq \
  --attributes '{"MessageRetentionPeriod":"1209600"}'

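Setting MessageRetentionPeriod to 1209600 seconds (14 days) gives you the longest available window to investigate and replay failed messages. On standard queues the retention clock keeps running from the original enqueue time when a message moves to the DLQ, so the DLQ's retention should be longer than the source queue's.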

Step 2: Get the DLQ ARN

aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-app-queue-dlq \
  --attribute-names QueueArn \
  --query 'Attributes.QueueArn' \
  --output text

Step 3: Attach the redrive policy to the source queue

aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-app-queue \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:my-app-queue-dlq\",\"maxReceiveCount\":\"5\"}"
  }'

A maxReceiveCount of 5 is a reasonable default. It tolerates transient errors like brief network blips while still moving genuinely broken messages to the DLQ before they cause a retry storm.
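The nested quoting in Step 3 is easy to get wrong, because RedrivePolicy is a JSON document stored as a string attribute. A minimal sketch of building that attribute programmatically follows. The helper name build_redrive_attributes is hypothetical, and the 1 to 1000 bound reflects the documented range for maxReceiveCount.

```python
import json


def build_redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the attribute map that attaches a DLQ to a source queue."""
    if not 1 <= max_receive_count <= 1000:
        raise ValueError("maxReceiveCount must be between 1 and 1000")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS expects the policy serialized as a JSON string, not a nested object.
    return {"RedrivePolicy": json.dumps(policy)}


attributes = build_redrive_attributes(
    "arn:aws:sqs:us-east-1:123456789012:my-app-queue-dlq"
)
print(attributes["RedrivePolicy"])
```

With boto3 installed, the returned map could be passed as the Attributes argument of the SQS client's set_queue_attributes call, alongside the source queue's QueueUrl.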

Note: The DLQ and the source queue must be in the same AWS account and region, and they must be the same type. A FIFO queue can only use a FIFO dead letter queue, and a standard queue can only use a standard one.
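These pairing constraints can be checked before calling the API. The sketch below compares two queue ARNs and reports any mismatch. The names QueueArn, parse_queue_arn, and pairing_problems are hypothetical, and the FIFO test relies on the rule that FIFO queue names end in ".fifo".

```python
from typing import NamedTuple


class QueueArn(NamedTuple):
    partition: str
    region: str
    account_id: str
    name: str

    @property
    def is_fifo(self) -> bool:
        # FIFO queue names always carry the ".fifo" suffix.
        return self.name.endswith(".fifo")


def parse_queue_arn(arn: str) -> QueueArn:
    """Split arn:<partition>:sqs:<region>:<account>:<name> into its fields."""
    parts = arn.split(":")
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sqs":
        raise ValueError(f"not an SQS queue ARN: {arn!r}")
    return QueueArn(parts[1], parts[3], parts[4], parts[5])


def pairing_problems(source_arn: str, dlq_arn: str) -> list[str]:
    """Return the reasons a source queue and DLQ cannot be paired."""
    source, dlq = parse_queue_arn(source_arn), parse_queue_arn(dlq_arn)
    problems = []
    if source.account_id != dlq.account_id:
        problems.append("queues are in different accounts")
    if source.region != dlq.region:
        problems.append("queues are in different regions")
    if source.is_fifo != dlq.is_fifo:
        problems.append("queue types differ (standard vs FIFO)")
    return problems


print(pairing_problems(
    "arn:aws:sqs:us-east-1:123456789012:my-app-queue.fifo",
    "arn:aws:sqs:us-east-1:123456789012:my-app-queue-dlq",
))  # ['queue types differ (standard vs FIFO)']
```

An empty list means the pair satisfies the three constraints in the note; it does not prove the DLQ exists or that its redrive allow policy admits the source queue.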

Console steps

Open the Amazon SQS console and create a new queue to act as the DLQ, matching the type of your source queue.
Open your source queue and choose Edit.
Expand Dead-letter queue and toggle it to Enabled.
Select the DLQ you created and set Maximum receives (for example, 5).
Save.

Terraform

resource "aws_sqs_queue" "dlq" {
  name                      = "my-app-queue-dlq"
  message_retention_seconds = 1209600
}

resource "aws_sqs_queue" "main" {
  name = "my-app-queue"

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 5
  })
}

# Restrict which source queues can use the DLQ
resource "aws_sqs_queue_redrive_allow_policy" "dlq_allow" {
  queue_url = aws_sqs_queue.dlq.id

  redrive_allow_policy = jsonencode({
    redrivePermission = "byQueue"
    sourceQueueArns   = [aws_sqs_queue.main.arn]
  })
}

Tip: The redrive_allow_policy on the DLQ side is worth adding. It controls which source queues are allowed to use this DLQ, so a misconfigured queue elsewhere cannot quietly dump messages into it. Set it to byQueue and list the exact source ARNs.
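Outside Terraform, the same restriction is set through the RedriveAllowPolicy queue attribute on the DLQ. A minimal sketch of building it follows. The helper name is hypothetical, and the limit of 10 source ARNs reflects the documented cap for byQueue.

```python
import json


def build_redrive_allow_attributes(source_queue_arns: list[str]) -> dict:
    """Build the DLQ attribute that lists which source queues may target it."""
    if not source_queue_arns:
        raise ValueError("byQueue needs at least one source queue ARN")
    if len(source_queue_arns) > 10:
        raise ValueError("byQueue accepts at most 10 source queue ARNs")
    policy = {
        "redrivePermission": "byQueue",
        "sourceQueueArns": list(source_queue_arns),
    }
    # Like RedrivePolicy, this attribute is a JSON document stored as a string.
    return {"RedriveAllowPolicy": json.dumps(policy)}


allow_attributes = build_redrive_allow_attributes(
    ["arn:aws:sqs:us-east-1:123456789012:my-app-queue"]
)
print(allow_attributes["RedriveAllowPolicy"])
```

This map is applied to the DLQ, not the source queue, which mirrors the Terraform resource above where queue_url points at the DLQ.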

CloudFormation

{
  "Resources": {
    "Dlq": {
      "Type": "AWS::SQS::Queue",
      "Properties": {
        "QueueName": "my-app-queue-dlq",
        "MessageRetentionPeriod": 1209600
      }
    },
    "MainQueue": {
      "Type": "AWS::SQS::Queue",
      "Properties": {
        "QueueName": "my-app-queue",
        "RedrivePolicy": {
          "deadLetterTargetArn": { "Fn::GetAtt": ["Dlq", "Arn"] },
          "maxReceiveCount": 5
        }
      }
    }
  }
}

Do not forget the DLQ itself

A dead letter queue that nobody watches is almost as bad as no DLQ at all. Messages pile up silently and you find out weeks later. Add a CloudWatch alarm on the DLQ depth.

aws cloudwatch put-metric-alarm \
  --alarm-name my-app-dlq-not-empty \
  --namespace AWS/SQS \
  --metric-name ApproximateNumberOfMessagesVisible \
  --dimensions Name=QueueName,Value=my-app-queue-dlq \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts

This pages your team within a few minutes of any message landing in the DLQ (the alarm evaluates 5-minute periods), which is usually the earliest reliable signal that a consumer is broken.
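If you manage many DLQs, generating the alarm definition per queue avoids copy-paste drift. The sketch below mirrors the CLI flags above as the keyword arguments CloudWatch's PutMetricAlarm expects. The function name and the alarm naming scheme are assumptions for illustration.

```python
def dlq_alarm_parameters(dlq_name: str, sns_topic_arn: str) -> dict:
    """Keyword arguments for a 'DLQ is not empty' CloudWatch alarm."""
    return {
        "AlarmName": f"{dlq_name}-not-empty",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # An empty, idle DLQ should not raise the alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


params = dlq_alarm_parameters(
    "my-app-queue-dlq",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
print(params["AlarmName"])  # my-app-queue-dlq-not-empty
```

With boto3 installed, these could be passed as `put_metric_alarm(**params)` on a CloudWatch client, once per DLQ.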

Danger: When you replay messages out of a DLQ, do it deliberately. Use the SQS redrive-to-source feature or a controlled script, and confirm the underlying bug is fixed first. Redriving poison messages back into a still-broken consumer just refills the DLQ and can re-trigger the original retry storm.
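One way to make a replay deliberate is to wrap the redrive call in a guard that refuses to run until someone has confirmed the fix, and that throttles the move. The sketch below is an assumption-laden illustration: start_controlled_redrive and its bug_fix_confirmed flag are hypothetical, and FakeSqsClient stands in for a real boto3 SQS client so the example runs offline. The call it wraps, start_message_move_task, is the SQS API behind redrive-to-source.

```python
def start_controlled_redrive(sqs_client, dlq_arn: str, bug_fix_confirmed: bool,
                             max_messages_per_second: int = 10) -> str:
    """Start moving messages from the DLQ back to their source queue."""
    if not bug_fix_confirmed:
        raise RuntimeError("refusing to redrive: confirm the consumer fix first")
    # Omitting DestinationArn sends messages back to their original source queue.
    response = sqs_client.start_message_move_task(
        SourceArn=dlq_arn,
        MaxNumberOfMessagesPerSecond=max_messages_per_second,
    )
    return response["TaskHandle"]


class FakeSqsClient:
    """Offline stand-in for boto3.client("sqs"); records calls only."""

    def __init__(self):
        self.calls = []

    def start_message_move_task(self, **kwargs):
        self.calls.append(kwargs)
        return {"TaskHandle": "example-task-handle"}


client = FakeSqsClient()
handle = start_controlled_redrive(
    client,
    "arn:aws:sqs:us-east-1:123456789012:my-app-queue-dlq",
    bug_fix_confirmed=True,
)
print(handle)  # example-task-handle
```

Throttling the move gives you time to watch the consumer's error rate and cancel the task if messages start flowing straight back into the DLQ.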

How to prevent it from happening again

Manual remediation does not scale. Bake the requirement into the places where queues are created.

Catch it in CI with policy-as-code

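If you provision queues with Terraform, scan the configuration before it is applied. A Checkov policy or a custom OPA rule can reject any aws_sqs_queue that lacks a redrive_policy (excluding the DLQs themselves).

# Run Checkov against your Terraform.
# Confirm that CKV_AWS_307 maps to an SQS dead letter queue rule in your
# Checkov version; if it does not, pass the ID of your custom policy instead.
checkov -d ./infra --check CKV_AWS_307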

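For a custom Conftest / OPA rule evaluated against the Terraform source files:

package main

import future.keywords.contains
import future.keywords.if

# Deny every aws_sqs_queue without a redrive_policy.
# Assumes DLQ resource names end in "dlq" so they can be exempted.
deny contains msg if {
  resource := input.resource.aws_sqs_queue[name]
  not endswith(name, "dlq")
  not resource.redrive_policy
  msg := sprintf("SQS queue '%s' has no dead letter queue configured", [name])
}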
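The same rule can also be applied to a rendered plan rather than the source files. The sketch below walks the resource_changes array of the JSON produced by `terraform show -json` on a saved plan. The function name and the "dlq" naming convention are assumptions. One subtlety it handles: when redrive_policy references a DLQ that is created in the same plan, its value is not known yet, so it appears under after_unknown instead of after.

```python
def queues_missing_dlq(plan: dict) -> list[str]:
    """Return addresses of planned SQS queues that have no redrive policy."""
    missing = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "aws_sqs_queue":
            continue
        if change.get("name", "").endswith("dlq"):
            continue  # assumed convention: DLQ resource names end in "dlq"
        details = change.get("change", {})
        if details.get("actions") == ["delete"]:
            continue  # queue is being removed, nothing to check
        after = details.get("after") or {}
        after_unknown = details.get("after_unknown") or {}
        # A policy counts if it is set, or set but only known after apply.
        has_policy = bool(after.get("redrive_policy")) or \
            after_unknown.get("redrive_policy") is True
        if not has_policy:
            missing.append(change["address"])
    return missing


example_plan = {"resource_changes": [
    {"address": "aws_sqs_queue.dlq", "type": "aws_sqs_queue", "name": "dlq",
     "change": {"actions": ["create"], "after": {"name": "my-app-queue-dlq"},
                "after_unknown": {"arn": True}}},
    {"address": "aws_sqs_queue.main", "type": "aws_sqs_queue", "name": "main",
     "change": {"actions": ["create"], "after": {"name": "my-app-queue"},
                "after_unknown": {"redrive_policy": True}}},
    {"address": "aws_sqs_queue.orphan", "type": "aws_sqs_queue", "name": "orphan",
     "change": {"actions": ["create"],
                "after": {"name": "orphan-queue", "redrive_policy": None},
                "after_unknown": {}}},
]}

print(queues_missing_dlq(example_plan))  # ['aws_sqs_queue.orphan']
```

In CI, a non-empty result would fail the pipeline step before the plan is applied.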

Standardize with a module

Wrap queue creation in a shared Terraform module that creates the DLQ, wires up the redrive policy, and sets a DLQ alarm by default. Engineers call the module and get a safe configuration without thinking about it.

Tip: Make the DLQ opt-out rather than opt-in. In your shared module, create the DLQ automatically and require an explicit flag to disable it. That way the safe path is the default path, and skipping a DLQ becomes a conscious, reviewable decision.

Continuous detection with Lensix

IaC scanning only covers resources created through IaC. Click-ops queues and drift slip through. Lensix continuously evaluates live SQS configuration across your accounts and re-flags sqs_nodeadletter whenever a queue loses or never had a redrive policy, so you catch the gaps your pipeline never saw.

Best practices

Set maxReceiveCount thoughtfully. Too low and transient errors send healthy messages to the DLQ. Too high and poison messages retry too long. Start at 3 to 5 and tune based on your consumer's retry behavior.
Give DLQs long retention. Use the full 14 days so you have time to investigate before messages expire.
One DLQ per source queue. Avoid sharing a single DLQ across many unrelated queues. It muddies the signal and makes replay harder. A dedicated DLQ keeps failure attribution clean.
Match queue types. FIFO source needs a FIFO DLQ, standard needs standard. Mismatches are rejected.
Alarm on DLQ depth. A DLQ with no alarm is an invisible failure bucket.
Lock down the DLQ with a redrive allow policy. Restrict which source queues may target it so a stray queue cannot pollute your failure data.
Have a replay runbook. Document how to inspect, fix, and redrive messages so an on-call engineer is not improvising under pressure.

A dead letter queue costs almost nothing and takes minutes to configure. The first time it catches a poison message instead of letting it loop endlessly or vanish, it pays for itself many times over.
