AWS Deregistration Delay Set to 0: Risks & Fix

TL;DR

A target group with a deregistration delay of 0 means in-flight requests get dropped the instant a target is removed, causing failed requests during deploys and scale-in events. Set the delay to a sensible value (30 to 120 seconds for most apps) so connections drain cleanly before the target leaves the pool.

Deregistration delay, also called connection draining, is one of those load balancer settings that nobody notices until a deploy starts throwing 502s and 504s. When a target group is set to deregister targets immediately, the load balancer stops routing traffic to a target and then yanks it out of rotation with no grace period. Any request that was mid-flight when the target left gets cut off.

The Lensix check lb_noderegistrationdelay flags any AWS Elastic Load Balancing target group where the deregistration delay is set to 0 seconds. This post explains what that setting does, why a zero value is risky, and how to fix and prevent it.

What this check detects

Application Load Balancers (ALB) and Network Load Balancers (NLB) route traffic to target groups, which contain the registered targets (EC2 instances, IP addresses, or Lambda functions). Lambda targets do not support a deregistration delay, so this setting matters for instance and IP targets. When a target needs to be removed, the load balancer puts that target into a draining state. That removal can come from a deployment, an auto scaling scale-in event, or a manual deregistration.

The deregistration delay controls how long the load balancer waits, after marking a target as draining, before it finishes deregistering that target. During this window, no new requests are routed to the draining target, but requests already in flight are allowed to complete.

This check fires when the delay is configured to 0. With a zero delay there is effectively no draining period, so the load balancer removes the target right away and any open connections are terminated abruptly.

Note: The relevant target group attribute is deregistration_delay.timeout_seconds. It accepts values from 0 to 3600 seconds and defaults to 300 seconds. Because the default is non-zero, a value of 0 is almost always an explicit choice someone made rather than an accident, which is why it is worth reviewing.

Why it matters

The damage shows up at exactly the moments you want things to go smoothly: deployments and scaling events.

Dropped requests during deployments

Rolling deployments rely on draining. A typical blue/green or rolling update pattern deregisters old targets and registers new ones. If the old targets are removed with zero delay, every request currently being processed on those targets is severed. For a web app this means visible 502 and 504 errors. For an API it means failed transactions, broken uploads, and clients forced to retry.

Connection resets during scale-in

Auto Scaling groups remove instances during scale-in. With connection draining disabled, an instance that is processing a long-running request, such as report generation or a file upload, gets cut off mid-stream. The user sees a failure even though nothing was actually wrong with the request.

Hidden data integrity issues

Abruptly terminated connections can leave operations half-finished. Examples include a POST that updates one record but not the related one, a multipart upload that never completes, and a payment webhook that times out and gets retried. All of these become more likely when targets disappear without draining.

Warning: Health checks interact with this setting too. An Auto Scaling group that uses load balancer health checks replaces targets that fail them, and the old target is deregistered. With no delay, any requests that target was still serving are cut off instead of being shed gracefully. The errors get attributed to "the deploy" or "a bad instance" when the real cause is the missing drain window.

How to fix it

The fix is to set a non-zero deregistration delay. The right number depends on your longest expected request duration. For most stateless web and API workloads, 30 to 120 seconds is plenty. For workloads with long-lived connections (large uploads, streaming, long polling), match the delay to your real upper bound.

Find the target group ARN

aws elbv2 describe-target-groups \
  --query 'TargetGroups[].{Name:TargetGroupName,Arn:TargetGroupArn}' \
  --output table

Check the current deregistration delay for a specific target group:

aws elbv2 describe-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/my-app-tg/abc123 \
  --query "Attributes[?Key=='deregistration_delay.timeout_seconds']"
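Checking one ARN at a time gets tedious past a handful of target groups. Below is a minimal bash sketch that loops the same two CLI calls over every target group in the current region. It prints the ones with a zero delay, roughly what the check flags. It assumes bash and an AWS CLI already configured with credentials and a region. get_dereg_delay and find_zero_delay_tgs are hypothetical helper names, not part of any AWS tool.

```shell
#!/usr/bin/env bash
# Sketch: list target groups whose deregistration delay is 0 seconds.
# Assumes bash and a configured AWS CLI (credentials + region).
# get_dereg_delay / find_zero_delay_tgs are hypothetical helper names.

# Print the deregistration delay (seconds) for a single target group ARN.
get_dereg_delay() {
  aws elbv2 describe-target-group-attributes \
    --target-group-arn "$1" \
    --query "Attributes[?Key=='deregistration_delay.timeout_seconds'].Value" \
    --output text
}

# Print "name<TAB>arn" for every target group whose delay is exactly 0.
# Target groups that don't report the attribute return an empty value
# and are skipped.
find_zero_delay_tgs() {
  local name arn delay
  aws elbv2 describe-target-groups \
    --query 'TargetGroups[].[TargetGroupName,TargetGroupArn]' \
    --output text |
    while read -r name arn; do
      # </dev/null keeps the inner CLI call from reading the loop's input.
      delay="$(get_dereg_delay "$arn" </dev/null)"
      if [ "$delay" = "0" ]; then
        printf '%s\t%s\n' "$name" "$arn"
      fi
    done
}
```

Run find_zero_delay_tgs once per region you care about. Every line it prints is a target group to fix.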

Set a sensible delay

aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/my-app-tg/abc123 \
  --attributes Key=deregistration_delay.timeout_seconds,Value=60
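If you script this change, read the attribute back afterward. That way a wrong ARN or a permissions problem doesn't pass silently. The minimal sketch below assumes bash and a configured AWS CLI; set_dereg_delay is a hypothetical helper name. It refuses 0, the value the check flags, and anything above the 3600-second maximum.

```shell
#!/usr/bin/env bash
# Sketch: set the deregistration delay, then read it back to confirm it stuck.
# set_dereg_delay is a hypothetical helper; assumes a configured AWS CLI.
set_dereg_delay() {
  local arn="$1" seconds="$2" actual
  # Reject empty or non-numeric input.
  case "$seconds" in
    '' | *[!0-9]*)
      echo "delay must be a whole number of seconds" >&2
      return 1
      ;;
  esac
  # Reject 0 (what the check flags) and values above the attribute maximum.
  if [ "$seconds" -lt 1 ] || [ "$seconds" -gt 3600 ]; then
    echo "delay must be between 1 and 3600 seconds" >&2
    return 1
  fi
  aws elbv2 modify-target-group-attributes \
    --target-group-arn "$arn" \
    --attributes "Key=deregistration_delay.timeout_seconds,Value=$seconds" \
    >/dev/null || return 1
  # Read the value back; the function's exit status reports whether it matches.
  actual="$(aws elbv2 describe-target-group-attributes \
    --target-group-arn "$arn" \
    --query "Attributes[?Key=='deregistration_delay.timeout_seconds'].Value" \
    --output text)" || return 1
  [ "$actual" = "$seconds" ]
}
```

Call it as set_dereg_delay <target-group-arn> 60. A non-zero exit status means the value was rejected or didn't stick.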

Warning: A longer delay can extend deploys and scale-in events. The load balancer waits until in-flight requests finish or the window expires, whichever comes first. Draining instances also keep running, and keep costing money, until deregistration completes. Pick a value that covers your real request durations without padding it unnecessarily.
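To get a feel for the trade-off, here is a rough worst-case estimate of drain time during a rolling deploy. It assumes a simplified model: targets are replaced in sequential batches, and every batch waits out the full delay because some connection stays open. Real deploy tools may overlap batches or finish draining early, so treat the result as an upper bound, not a prediction. worst_case_drain_seconds is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: worst-case seconds a rolling deploy spends waiting on draining.
# Simplified model (an assumption): sequential batches, each waiting the
# full delay. worst_case_drain_seconds is a hypothetical helper name.
worst_case_drain_seconds() {
  local fleet="$1" batch="$2" delay="$3"
  local batches=$(( (fleet + batch - 1) / batch ))  # ceiling division
  echo $(( batches * delay ))
}

worst_case_drain_seconds 12 3 300   # 4 batches x 300 s: prints 1200
worst_case_drain_seconds 12 3 60    # 4 batches x 60 s: prints 240
```

Under this model, a 12-instance fleet replaced three at a time spends at most 4 minutes draining at a 60-second delay. At the 300-second default, the same fleet can spend up to 20 minutes.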

Console steps

Open the EC2 console and go to Target Groups under Load Balancing.
Select the target group flagged by the check.
Open the Attributes tab and choose Edit.
Set Deregistration delay to your chosen value, for example 60 seconds.
Save changes.

Terraform

If you manage infrastructure as code, set the attribute directly on the target group so it is enforced on every apply:

resource "aws_lb_target_group" "app" {
  name        = "my-app-tg"
  port        = 443
  protocol    = "HTTPS"
  vpc_id      = aws_vpc.main.id
  target_type = "instance"

  deregistration_delay = 60

  health_check {
    # Target group health checks default to HTTP; match the HTTPS targets.
    protocol            = "HTTPS"
    path                = "/healthz"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
  }
}

CloudFormation

AppTargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    Name: my-app-tg
    Port: 443
    Protocol: HTTPS
    VpcId: !Ref Vpc
    TargetGroupAttributes:
      - Key: deregistration_delay.timeout_seconds
        Value: "60"

Tip: Pair the deregistration delay with a matching preStop hook or graceful shutdown in your application. The load balancer stops sending new traffic during draining, but your app should also stop accepting new work and finish what it has. Without that coordination, a long delay just keeps a doomed instance alive without helping it actually drain.

How to prevent it from happening again

One-off fixes drift back to misconfiguration the next time someone copies an old module or clicks through the console. Bake the requirement into your pipeline.

Policy as code with OPA / Conftest

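Add a Rego policy that evaluates the Terraform plan JSON and rejects any target group with a delay below your floor:

package terraform.lb

import rego.v1

# Input is the output of `terraform show -json`. Target groups being destroyed
# have no "after" state and are skipped. If the attribute is missing, fall back
# to the AWS default of 300 seconds.
deny contains msg if {
  some rc in input.resource_changes
  rc.type == "aws_lb_target_group"
  after := rc.change.after
  is_object(after)
  delay := object.get(after, "deregistration_delay", 300)
  to_number(delay) < 30
  msg := sprintf("target group '%s' has deregistration_delay %v, must be >= 30", [rc.address, delay])
}

Catch it in CI before merge

terraform plan -out=tfplan
terraform show -json tfplan > plan.json
# The policy lives in the terraform.lb package; conftest only evaluates "main" by default.
conftest test plan.json --policy policy/ --namespace terraform.lb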

Wire that into the pull request check so a target group with a zero delay never reaches production. For teams using tflint or checkov, you can express the same rule with a custom check.

Tip: Lensix runs lb_noderegistrationdelay continuously across your accounts, so even resources created outside your IaC pipeline (console clicks, scripts, third-party tools) get flagged. Use the CI gate to stop new drift and Lensix to catch the drift that slips past it.

Best practices

Set the delay to your real request duration, not a guess. Look at your p99 request latency and longest-running endpoints, then add headroom. A streaming or upload service may need several minutes; a JSON API rarely needs more than 60 seconds.
Coordinate the delay with application shutdown. The drain window only helps if your app stops accepting new connections and finishes in-flight work within that time. Align the load balancer delay, your container terminationGracePeriodSeconds, and your Auto Scaling lifecycle hook timeout.
Use lifecycle hooks for Auto Scaling. An Auto Scaling lifecycle hook can hold an instance in Terminating:Wait long enough for draining to finish, preventing the ASG from killing the instance before the load balancer is done with it.
Do not set the delay arbitrarily high. A 3600-second delay does not make your app safer. It just lets a single lingering connection stall a deploy for up to an hour, and it keeps unhealthy targets in a draining state longer than needed. Match it to reality.
Standardize through shared modules. If every target group is created from one Terraform module with a sane default, this misconfiguration largely disappears at the source.
Monitor draining behavior. Watch the HTTPCode_ELB_5XX_Count and RequestCountPerTarget metrics around deploys. A spike in 5XX errors during deploys is a strong signal that draining is not configured or not long enough.
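The first practice, deriving the delay from real request durations, can be turned into a quick calculation. The sketch below assumes you have request durations in seconds, one per line; one possible source is the target_processing_time field from ALB access logs. It works in four steps:

1. Take the nearest-rank p99 of the durations.
2. Add a headroom percentage. The default is 50%, an arbitrary starting point rather than an AWS recommendation.
3. Round up to whole seconds.
4. Clamp the result between 1 second (never 0) and the 3600-second maximum.

recommend_delay is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: derive a starting deregistration delay from observed durations.
# Reads durations in seconds, one per line, on stdin. HEADROOM_PCT (first
# argument, default 50) is an assumption, not an AWS recommendation.
# recommend_delay is a hypothetical helper name.
recommend_delay() {
  local headroom="${1:-50}"
  LC_ALL=C sort -n | awk -v headroom="$headroom" '
    NF { d[++n] = $1 }                      # skip blank lines
    END {
      if (n == 0) { print "no samples" > "/dev/stderr"; exit 1 }
      rank = int((99 * n + 99) / 100)       # ceil(0.99 * n), nearest-rank p99
      target = d[rank] * (1 + headroom / 100)
      secs = int(target)
      if (secs < target) secs++             # round up to whole seconds
      if (secs < 1) secs = 1                # never recommend 0
      if (secs > 3600) secs = 3600          # attribute maximum
      print secs
    }'
}
```

For example, seq 1 100 | recommend_delay 50 prints 149. The p99 of the samples 1 through 100 is 99 seconds, and adding 50% headroom rounds up to 149. Treat the output as a starting point, then sanity-check it against your longest-running endpoints.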

Deregistration delay is cheap insurance. It costs nothing to enable, it adds a small amount of time to deploys, and in exchange it removes one of the most common causes of avoidable errors during routine operations. Set it once in your modules, enforce it in CI, and let a continuous check catch anything that slips through.
