Fix AWS Load Balancer With No Healthy Targets

TL;DR

This check fires when every target in an AWS target group is failing health checks, which means your load balancer has nothing to send traffic to and clients see 5xx errors. Find out whether the targets are actually down or the health check config is wrong, fix the underlying cause, and add CloudWatch alarms on HealthyHostCount so you catch it before users do.

A load balancer with no healthy targets is one of those failures that turns a minor problem into a full outage. The load balancer keeps accepting connections, but it has nowhere to route them, so every request comes back as an error. This check surfaces that exact condition: a target group where the count of healthy targets has dropped to zero.

It applies to AWS Application Load Balancers, Network Load Balancers, and Gateway Load Balancers, all of which route traffic based on target group health.

What this check detects

Lensix inspects each target group attached to your load balancers and reads the health status of every registered target. The check (lb_unhealthy, module lb_checks) flags any target group where all targets report a status other than healthy, including unhealthy, draining, unused, or initial with no healthy peer.
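
As a rough illustration of the condition, the sketch below applies the same "no healthy target" test to data shaped like the TargetHealthDescriptions list returned by the ELBv2 describe-target-health call. The function name and sample data are hypothetical, and this is not Lensix's actual implementation. It assumes that a group with no registered targets should also be flagged.

```python
from typing import Iterable, Mapping


def has_no_healthy_targets(descriptions: Iterable[Mapping]) -> bool:
    """Return True when no target in the group reports the 'healthy' state.

    Assumption: a group with no registered targets also counts as having
    no healthy targets.
    """
    states = [d["TargetHealth"]["State"] for d in descriptions]
    return "healthy" not in states


# Hypothetical sample shaped like the TargetHealthDescriptions response field.
sample = [
    {"Target": {"Id": "i-0aaa", "Port": 8080}, "TargetHealth": {"State": "unhealthy"}},
    {"Target": {"Id": "i-0bbb", "Port": 8080}, "TargetHealth": {"State": "draining"}},
]
print(has_no_healthy_targets(sample))  # True
```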

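When this happens, the listener that forwards to that target group has no viable backend. Application and Network Load Balancers fail open in this state and may still forward requests to the unhealthy targets, but those targets usually cannot serve them. What clients see depends on the load balancer type:

Application Load Balancer: clients receive HTTP 5xx errors, typically 503 Service Unavailable when no target is available, or 502 or 504 when a request is forwarded to a target that is down.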
Network Load Balancer: connections are reset or time out, since there is no target to complete the TCP handshake with.
Gateway Load Balancer: traffic destined for the appliance fleet is dropped.

Note: A target group health check is separate from an instance status check. An EC2 instance can pass its EC2 status checks (the hardware and OS are fine) while still failing the target group health check because the application on the port is not responding.

Why it matters

When a target group goes fully unhealthy, you are not degraded, you are down. There is no partial capacity to absorb load. Every user hitting that endpoint gets an error.

A few ways this plays out in production:

Deployment gone wrong. A bad release ships, the app crashes on startup, and every new task fails its health check. The old tasks drain out, and within a couple of minutes the target group is empty of healthy hosts.
Health check misconfiguration. Someone changes the health check path from /health to /healthz, or points it at a port the app does not listen on. The app is perfectly fine, but the load balancer thinks every target is dead.
Security group or NACL change. A network rule update blocks the load balancer from reaching the target port. Health checks start timing out across the board.
Dependency failure. The health endpoint checks a database connection, the database fails over or hits connection limits, and suddenly every instance reports unhealthy at once.

The business impact is straightforward: lost revenue during the outage, broken SLAs, failed downstream integrations, and the on-call pages that follow. Because the failure is often correlated (one shared cause takes out all targets at the same time), it tends to be total rather than partial.

Warning: Auto Scaling groups that use ELB health checks, and ECS services, will keep replacing instances or tasks that fail health checks, which can produce a churn loop. New capacity comes up, fails, gets replaced, and fails again, burning cost and never reaching a healthy state until you fix the root cause.

How to fix it

Start by confirming the scope. Get the load balancer ARN and the target group, then read the actual health status.

1. Identify the affected target group

aws elbv2 describe-target-groups \
  --query 'TargetGroups[].{Name:TargetGroupName,Arn:TargetGroupArn,Port:Port,Protocol:Protocol}' \
  --output table

2. Check the health of every target

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web-tg/abc123 \
  --query 'TargetHealthDescriptions[].{Id:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Desc:TargetHealth.Description}' \
  --output table

The Reason field is your most important clue. Common values and what they mean:

Target.FailedHealthChecks: the load balancer got an error while connecting to the target (for example, connection refused), or the target's response was malformed.
Target.Timeout: health check requests never got a response. Usually a security group or NACL blocking the port, or an app that accepts the connection but does not answer in time.
Target.ResponseCodeMismatch: the app answered, but with a code outside the expected matcher range (for example, returning 302 when only 200 is allowed).
Elb.InternalError: an issue on the load balancer side, often transient.
Target.NotInUse: the target group is not attached to a load balancer listener, or the target is in an Availability Zone that is not enabled for the load balancer.
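
When many targets fail at once, it can help to group them by reason before picking a fix. The following is a minimal sketch that assumes input shaped like the TargetHealthDescriptions list from describe-target-health; the NEXT_STEP wording and the triage helper are illustrative, not part of any AWS SDK.

```python
from collections import Counter

# Suggested first diagnostic step per reason code (wording is illustrative).
NEXT_STEP = {
    "Target.FailedHealthChecks": "Confirm the app is running and accepting connections on the health check port",
    "Target.Timeout": "Check security groups and NACLs, then whether the app answers within the timeout",
    "Target.ResponseCodeMismatch": "Compare the status code the app returns with the matcher",
    "Elb.InternalError": "Often transient; re-check after a short wait",
    "Target.NotInUse": "Confirm the target group is attached to a listener",
}


def triage(descriptions):
    """Group targets by Reason and attach a suggested next step, most common first."""
    counts = Counter(d["TargetHealth"].get("Reason", "unknown") for d in descriptions)
    fallback = "Look up this reason code in the Elastic Load Balancing documentation"
    return [(reason, n, NEXT_STEP.get(reason, fallback)) for reason, n in counts.most_common()]


# Hypothetical sample data.
sample = [
    {"Target": {"Id": "i-0aaa"}, "TargetHealth": {"State": "unhealthy", "Reason": "Target.Timeout"}},
    {"Target": {"Id": "i-0bbb"}, "TargetHealth": {"State": "unhealthy", "Reason": "Target.Timeout"}},
    {"Target": {"Id": "i-0ccc"}, "TargetHealth": {"State": "unhealthy", "Reason": "Target.ResponseCodeMismatch"}},
]
for reason, n, step in triage(sample):
    print(f"{n} target(s) with {reason}: {step}")
```

A single dominant reason across the fleet usually points at one shared cause, such as a network rule or a health check setting, rather than at individual hosts.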

3. Match the fix to the reason

If the reason is a timeout, check that the load balancer can reach the target port. The target's security group must allow inbound traffic from the load balancer's security group on the health check port:

aws ec2 authorize-security-group-ingress \
  --group-id sg-0target1234567890 \
  --protocol tcp \
  --port 8080 \
  --source-group sg-0loadbalancer12345
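
Before adding a rule, you may want to confirm that one is actually missing. The sketch below checks a list shaped like the IpPermissions field of a describe-security-groups response for a rule that admits the load balancer's security group on the health check port. The helper name is hypothetical, and it only considers rules that reference a source security group, not CIDR-based rules.

```python
def allows_from_group(ip_permissions, source_group_id, port, protocol="tcp"):
    """Return True if any rule admits `protocol` traffic on `port` from `source_group_id`.

    Only rules that reference a source security group are considered;
    CIDR-based rules are ignored in this sketch.
    """
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        if proto == "-1":
            port_ok = True  # "-1" means all protocols and all ports
        elif proto == protocol:
            port_ok = perm.get("FromPort", -1) <= port <= perm.get("ToPort", -1)
        else:
            continue
        sources = {pair.get("GroupId") for pair in perm.get("UserIdGroupPairs", [])}
        if port_ok and source_group_id in sources:
            return True
    return False


# Hypothetical rules: only port 443 is open to the load balancer's group.
rules = [
    {
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": "sg-0loadbalancer12345"}],
    }
]
print(allows_from_group(rules, "sg-0loadbalancer12345", 8080))  # False
print(allows_from_group(rules, "sg-0loadbalancer12345", 443))   # True
```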

If the reason is a response code or failed check, verify the health check settings against what the app actually serves:

aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web-tg/abc123 \
  --health-check-path /health \
  --health-check-port traffic-port \
  --matcher HttpCode=200 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --health-check-interval-seconds 15 \
  --health-check-timeout-seconds 5

Then confirm the app responds the way you expect by hitting the path from inside the VPC (for example from a bastion or another instance in the same subnet):

curl -i http://10.0.1.42:8080/health
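
To reason about whether the status code you see from that request would satisfy the configured matcher, here is a small sketch that interprets an HttpCode-style matcher string. It assumes the matcher is a single code, a comma-separated list, or a range such as 200-299; the function name is hypothetical.

```python
def matches(status_code: int, matcher: str) -> bool:
    """Return True if status_code satisfies a matcher like "200", "200,202" or "200-299"."""
    for part in matcher.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            if int(low) <= status_code <= int(high):
                return True
        elif part and int(part) == status_code:
            return True
    return False


print(matches(302, "200"))      # False: a redirect fails a strict 200 matcher
print(matches(204, "200-299"))  # True
```

If the app legitimately answers the health path with a redirect, it is usually cleaner to point the health check at a path that returns 200 than to widen the matcher.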

If the app itself is crashing, that is the real problem. For ECS, check the service events and task logs:

aws ecs describe-services \
  --cluster prod-cluster \
  --services web-service \
  --query 'services[0].events[0:5]'

aws logs tail /ecs/web-service --since 15m --follow

Danger: If you are in the middle of a bad deployment, roll back before you start tweaking health check thresholds. Loosening thresholds to mark a broken app as healthy will route live traffic to a process that cannot serve it. Fix or revert the application first, then restore strict health checks.

4. Roll back a bad deployment

For ECS, redeploy the last known good task definition:

aws ecs update-service \
  --cluster prod-cluster \
  --service web-service \
  --task-definition web-task:42 \
  --force-new-deployment

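For an Auto Scaling group behind the load balancer, point the group back at the previous launch template version, then refresh the instances. The template name web-lt and version 7 below are placeholders:

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --launch-template LaunchTemplateName=web-lt,Version=7

aws autoscaling start-instance-refresh \
  --auto-scaling-group-name web-asg \
  --preferences MinHealthyPercentage=50,InstanceWarmup=120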

How to prevent it from happening again

Most "all targets unhealthy" incidents are preventable with a combination of safer deployments and earlier alerting.

Alarm on healthy host count

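The single most useful guardrail is a CloudWatch alarm on HealthyHostCount per target group. Alert when it drops below your minimum acceptable count, not just when it hits zero, so you have headroom to react. Because the alarm uses LessThanThreshold, a threshold of 1 fires only once the count is already zero; set --threshold to your minimum acceptable count instead.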

aws cloudwatch put-metric-alarm \
  --alarm-name web-tg-low-healthy-hosts \
  --namespace AWS/ApplicationELB \
  --metric-name HealthyHostCount \
  --statistic Minimum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 2 \
  --comparison-operator LessThanThreshold \
  --dimensions Name=TargetGroup,Value=targetgroup/web-tg/abc123 \
               Name=LoadBalancer,Value=app/web-lb/def456 \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:ops-alerts \
  --treat-missing-data breaching
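
To see how --threshold and --evaluation-periods interact with LessThanThreshold, the sketch below replays a series of per-minute minimum values through a deliberately simplified model of the alarm. It ignores missing-data handling and "M out of N" settings, and the function name is hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Simplified LessThanThreshold evaluation over the most recent datapoints.

    Returns "ALARM" only when every one of the last `evaluation_periods`
    values is strictly below `threshold`.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(value < threshold for value in recent) else "OK"


# Per-minute minimum HealthyHostCount: the fleet drops from 3 hosts to 1.
series = [3, 3, 1, 1]
print(alarm_state(series, threshold=1, evaluation_periods=2))  # OK
print(alarm_state(series, threshold=2, evaluation_periods=2))  # ALARM
```

With a threshold of 1 the alarm stays quiet while a single host carries all traffic; a threshold of 2 fires in the same situation, while there is still capacity left to protect.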

Use deployment circuit breakers

ECS rolling deployments support a circuit breaker that automatically rolls back if new tasks fail to stabilize. Turn it on so a failed rollout is reverted automatically instead of leaving the target group empty:

{
  "deploymentConfiguration": {
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  }
}

Setting minimumHealthyPercent to 100 keeps your existing healthy tasks running until the new ones pass health checks. The old capacity stays in place as a safety net.
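
The effect of these two percentages is easier to see as task counts. The sketch below is a worked calculation, assuming ECS rounds the minimum up and the maximum down to whole tasks; the function name is hypothetical.

```python
import math


def deployment_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (min_running, max_running) task counts during a rolling deployment.

    Assumes the lower bound is rounded up and the upper bound is rounded down.
    """
    min_running = math.ceil(desired_count * minimum_healthy_percent / 100)
    max_running = math.floor(desired_count * maximum_percent / 100)
    return min_running, max_running


# With 4 desired tasks, 100/200 keeps all 4 old tasks up while up to 4 new ones start.
print(deployment_bounds(4, 100, 200))  # (4, 8)
# With 3 desired tasks, 50/150 allows dropping to 2 running tasks mid-deploy.
print(deployment_bounds(3, 50, 150))   # (2, 4)
```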

Codify the health check in IaC

Define health check settings in Terraform so that a console change shows up as drift in the next plan and is reverted on the next apply:

resource "aws_lb_target_group" "web" {
  name        = "web-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip"

  health_check {
    enabled             = true
    path                = "/health"
    port                = "traffic-port"
    protocol            = "HTTP"
    matcher             = "200"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }

  deregistration_delay = 30
}

Tip: Keep a lightweight /health endpoint that checks only whether the process can serve requests, and a separate /ready or deep health endpoint for dependency checks. Pointing the load balancer at a shallow check prevents a brief database blip from marking your entire fleet unhealthy at once.
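
One way to structure the two endpoints is sketched below using only the Python standard library. The /ready path, the database_ok probe, and the make_server helper are hypothetical stand-ins; a real service would use its own framework and a real dependency check.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def database_ok() -> bool:
    """Hypothetical dependency probe; replace with a real connectivity check."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    # Swappable so the deep check can be pointed at real dependencies.
    dependency_check = staticmethod(database_ok)

    def do_GET(self):
        if self.path == "/health":
            # Shallow check for the load balancer: the process can answer HTTP.
            self._reply(200, b"ok")
        elif self.path == "/ready":
            # Deep check for dashboards and deploy gates, not for the load balancer.
            if self.dependency_check():
                self._reply(200, b"ready")
            else:
                self._reply(503, b"dependency unavailable")
        else:
            self._reply(404, b"not found")

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep health check traffic out of the access log


def make_server(host="0.0.0.0", port=8080):
    """Build (but do not start) an HTTP server exposing both endpoints."""
    return HTTPServer((host, port), HealthHandler)
```

Calling make_server().serve_forever() starts the listener. During a database outage /ready reports 503 while /health keeps returning 200, so the targets stay in service.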

Gate it in CI/CD with policy as code

Catch missing alarms and risky health check settings before merge. A simple Open Policy Agent rule can require a healthy host alarm on every target group:

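package lensix.lb

deny[msg] {
  rc := input.resource_changes[_]
  rc.type == "aws_lb_target_group"
  not has_health_alarm(rc.change.after.name)
  msg := sprintf("target group %q has no HealthyHostCount alarm", [rc.change.after.name])
}

# Assumed convention: the alarm name contains the target group name.
has_health_alarm(tg_name) {
  alarm := input.resource_changes[_]
  alarm.type == "aws_cloudwatch_metric_alarm"
  alarm.change.after.metric_name == "HealthyHostCount"
  contains(alarm.change.after.alarm_name, tg_name)
}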

Best practices

Spread targets across multiple Availability Zones. A target group with capacity in two or three AZs survives a zonal failure without going fully unhealthy.
Set health check thresholds deliberately. Too aggressive and you flap during normal latency spikes. Too lenient and you keep serving traffic to dead targets. A 15 second interval with an unhealthy threshold of 3 is a reasonable starting point for web apps.
Avoid coupling the health check to fragile dependencies. If your health endpoint fails the moment a downstream service hiccups, one dependency outage cascades into a load balancer outage.
Tune deregistration delay. Draining targets receive no new requests, so a long delay slows deployments and scale-in, while a short one cuts off in-flight requests. Match it to how long your longest in-flight requests actually take.
Test failure in staging. Kill a target group's only healthy host in a non-prod environment and confirm your alarms fire and your runbook works.
Watch the right metrics together. HealthyHostCount, UnHealthyHostCount, HTTPCode_ELB_5XX_Count, and TargetResponseTime tell a coherent story when correlated on one dashboard.
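
For the threshold guidance in the list above, a quick calculation shows what a given setting costs in detection time. This is a rough approximation that multiplies the interval by the threshold and ignores the health check timeout and scheduling jitter; the function name is hypothetical.

```python
def approx_transition_seconds(interval_seconds, threshold_count):
    """Rough time for a target to change state: consecutive checks times interval."""
    return interval_seconds * threshold_count


# A 15 second interval with an unhealthy threshold of 3 and a healthy threshold of 2.
print(approx_transition_seconds(15, 3))  # 45 seconds to mark a target unhealthy
print(approx_transition_seconds(15, 2))  # 30 seconds to mark it healthy again
```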

An empty target group is a clear, binary failure with a clear fix path. The work is in making sure you never get there by accident: safe deployments, conservative health checks decoupled from dependencies, and an alarm that pages you when the healthy count drops, not when it hits zero.
