ASG ELB Health Checks: Fix Auto Scaling Self-Healing

TL;DR

If an Auto Scaling group sits behind a load balancer but only uses EC2 health checks, it cannot detect application-level failures and will never replace instances that the load balancer has marked unhealthy. Switch the ASG health check type to ELB so unhealthy targets get replaced automatically.

An Auto Scaling group (ASG) is supposed to keep your fleet healthy by replacing instances that go bad. But "bad" means different things depending on how you configure health checks. By default, an ASG only knows whether the underlying EC2 instance is running. It has no idea whether your application is actually responding. When that ASG is fronted by an Elastic Load Balancer, relying on EC2 health checks alone leaves a serious gap.

The asg_noelbhealthcheck check flags Auto Scaling groups that are attached to a load balancer or target group but still use the default EC2 health check type instead of ELB health checks.

What this check detects

Every Auto Scaling group has a health check type setting. There are two common values:

EC2 (the default): the ASG considers an instance healthy as long as the EC2 instance is in the running state and passing EC2 status checks (system and instance reachability).
ELB: the ASG additionally honors the health check results reported by the attached load balancer or target group. If the load balancer marks an instance as unhealthy, the ASG terminates and replaces it.

This check fires when an ASG is associated with one or more load balancers or target groups, but its HealthCheckType is still set to EC2. In that scenario the ASG and the load balancer disagree about what "healthy" means, and the ASG wins.

Note: EC2 status checks only verify the hypervisor and network reachability of the instance. They do not open a connection to your application, check an HTTP path, or validate a response code. An instance can pass every EC2 check while your web server returns 503 on every request.
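The condition this check evaluates can be sketched as a small shell function. This is only an illustration of the logic described above, not the check's actual implementation; the function name asg_check_status and its argument order are hypothetical.

```shell
# Hypothetical helper mirroring the detection condition described above.
# Usage: asg_check_status <health_check_type> <target_group_count> <classic_elb_count>
# Prints "flagged" when the group is load-balanced but still on EC2 health checks.
asg_check_status() {
  attached=$(( $2 + $3 ))
  if [ "$1" = "EC2" ] && [ "$attached" -gt 0 ]; then
    echo "flagged"
  else
    echo "ok"
  fi
}

asg_check_status EC2 1 0   # prints "flagged"
asg_check_status ELB 1 0   # prints "ok"
asg_check_status EC2 0 0   # prints "ok" (nothing attached, so EC2 checks are fine)
```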

Why it matters

The whole point of putting an ASG behind a load balancer is to serve traffic reliably. When health checks are misaligned, you lose that reliability in ways that are easy to miss until an incident.

Broken instances stay in the fleet

Imagine your application process crashes, hangs, or runs out of file descriptors, but the EC2 instance keeps running. With EC2 health checks only:

The load balancer notices the instance is failing its target group health check and stops sending it new requests. Good.
But the ASG still thinks the instance is healthy, so it never replaces it.
Your fleet now has fewer working instances than your desired capacity suggests, and nothing self-heals.

If enough instances degrade this way during a traffic spike, the healthy instances absorb all the load and start failing too. A small problem cascades into an outage.

Failed deployments do not roll over

Bad deployments are one of the most common causes of partial outages. If a new AMI or user-data script breaks the app, EC2 health checks will happily report the instance as healthy. The ASG keeps the broken instances and may even scale out more of them. With ELB health checks, the load balancer marks them unhealthy, the ASG replaces them, and the repeated launch-and-terminate activity in the ASG's history gives you a clearer signal that something is wrong.

Silent capacity loss

Because EC2 status checks rarely fail for application problems, this misconfiguration tends to hide. Dashboards show the ASG at desired capacity. Everything looks fine until users complain about intermittent errors, because the load balancer has quietly pulled the zombie instances out of rotation and the remaining healthy instances are overloaded.

Warning: Switching to ELB health checks means instances that fail your load balancer health check will be terminated and replaced. If your health check path is too strict, has a short timeout, or depends on a flaky downstream service, you can trigger unnecessary instance churn. Validate your health check configuration before flipping the switch in production.

How to fix it

The fix is to set the ASG health check type to ELB. You should also set a sensible health check grace period so new instances have time to boot and start your application before the ASG evaluates them.

Step 1: Confirm the ASG is attached to a load balancer

Check which load balancers or target groups the ASG is associated with:

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-app-asg \
  --query 'AutoScalingGroups[0].{Name:AutoScalingGroupName,HealthCheckType:HealthCheckType,Grace:HealthCheckGracePeriod,TargetGroups:TargetGroupARNs,ELBs:LoadBalancerNames}'

If TargetGroups or ELBs is populated and HealthCheckType is EC2, this check applies.
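To apply the same test to every group in a region instead of one named ASG, a JMESPath filter can do the comparison on the client side. The following is a sketch: the function name list_flagged_asgs is hypothetical, and it assumes the API returns TargetGroupARNs and LoadBalancerNames as (possibly empty) lists for every group.

```shell
# Sketch: print the names of ASGs in the current region that have a target
# group or Classic Load Balancer attached but still use EC2 health checks.
# Assumes the AWS CLI is installed and configured with credentials and a region.
list_flagged_asgs() {
  aws autoscaling describe-auto-scaling-groups \
    --query "AutoScalingGroups[?HealthCheckType=='EC2' && (length(TargetGroupARNs) > \`0\` || length(LoadBalancerNames) > \`0\`)].AutoScalingGroupName" \
    --output text
}
```

Running list_flagged_asgs with no arguments prints the matching group names, or nothing when every load-balanced group already uses ELB health checks.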

Step 2: Verify the target group health check is sane

Before switching, make sure the target group health check actually reflects application health and is not overly aggressive. For an Application Load Balancer target group:

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app-tg/abc123

aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app-tg/abc123 \
  --query 'TargetGroups[0].{Path:HealthCheckPath,Interval:HealthCheckIntervalSeconds,Timeout:HealthCheckTimeoutSeconds,Healthy:HealthyThresholdCount,Unhealthy:UnhealthyThresholdCount}'

Confirm the health check path returns a quick, dependency-light response (a /healthz endpoint that returns 200 when the app is ready is ideal). Avoid pointing the health check at a route that hits a database or third-party API, since that turns downstream blips into instance terminations.
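One way to judge whether the settings are overly aggressive is to estimate how long a failing target takes to be marked unhealthy, which is roughly the interval multiplied by the unhealthy threshold. The helper below is a back-of-the-envelope sketch; the function name and the sample values are illustrative assumptions, and the real timing also depends on the timeout and on when the failure starts relative to the check cycle.

```shell
# Rough time (in seconds) before the load balancer marks a failing target
# unhealthy: health check interval multiplied by the unhealthy threshold.
# Usage: seconds_to_unhealthy <interval_seconds> <unhealthy_threshold>
seconds_to_unhealthy() {
  echo $(( $1 * $2 ))
}

seconds_to_unhealthy 30 2   # prints 60
seconds_to_unhealthy 10 3   # prints 30
```

Once the health check type is ELB, this number is also roughly how long a transient failure has to last before it can cost you an instance.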

Step 3: Update the health check type

Danger: This command changes how a production ASG decides to terminate instances. If your load balancer health check is misconfigured, every instance can be marked unhealthy and replaced in a loop, taking down the service. Test in staging and confirm targets are currently healthy before running this against production.

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-app-asg \
  --health-check-type ELB \
  --health-check-grace-period 300

The grace period (300 seconds here) tells the ASG to ignore health check failures for the first 5 minutes after an instance launches. Set this slightly longer than your worst-case boot and application startup time so instances are not killed before they finish initializing.
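A simple way to pick the value is to take the measured worst-case startup time and add a percentage buffer. The sketch below assumes a 25% default buffer, which is an arbitrary illustrative choice rather than an AWS recommendation; the function name is hypothetical.

```shell
# Suggest a grace period from a measured worst-case startup time.
# Usage: suggested_grace_period <worst_case_startup_seconds> [buffer_percent]
# The 25% default buffer is an assumption; tune it to your own startup variance.
suggested_grace_period() {
  startup="$1"
  buffer_pct="${2:-25}"
  echo $(( startup + startup * buffer_pct / 100 ))
}

suggested_grace_period 240      # prints 300
suggested_grace_period 200 50   # prints 300
```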

Console steps

Open the EC2 console and go to Auto Scaling Groups.
Select your group and open the Details tab.
Find the Health checks section and choose Edit.
Set Health check type to ELB (or enable the ELB option alongside EC2).
Set a Health check grace period that comfortably covers your startup time.
Save.

Fix it with Terraform

If you manage the ASG as code, set the attributes explicitly so the value cannot drift back to the default:

resource "aws_autoscaling_group" "app" {
  name                = "my-app-asg"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 4
  vpc_zone_identifier = var.private_subnet_ids

  target_group_arns = [aws_lb_target_group.app.arn]

  health_check_type         = "ELB"
  health_check_grace_period = 300

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}

Fix it with CloudFormation

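AppAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "2"
    MaxSize: "10"
    DesiredCapacity: "4"
    # PrivateSubnetIds is an assumed template parameter of type List<AWS::EC2::Subnet::Id>
    VPCZoneIdentifier: !Ref PrivateSubnetIds
    # Replace instances that the attached target group reports as unhealthy
    HealthCheckType: ELB
    HealthCheckGracePeriod: 300
    TargetGroupARNs:
      - !Ref AppTargetGroup
    LaunchTemplate:
      LaunchTemplateId: !Ref AppLaunchTemplate
      Version: !GetAtt AppLaunchTemplate.LatestVersionNumber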

Tip: Pair ELB health checks with an ASG instance refresh and lifecycle hooks. A termination lifecycle hook, combined with the target group's deregistration delay (connection draining), lets in-flight requests complete before an unhealthy instance is terminated, so failovers stay graceful rather than abrupt.
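As a sketch of what such a hook could look like with the AWS CLI, the function below registers a termination lifecycle hook on a group. The hook name drain-before-terminate, the function name, and the 120-second default timeout are illustrative assumptions; size the timeout to whatever cleanup your instances actually need.

```shell
# Register a termination lifecycle hook so instances pause in
# Terminating:Wait before they are shut down.
# Usage: add_termination_hook <asg_name> [heartbeat_timeout_seconds]
add_termination_hook() {
  asg_name="$1"
  timeout_seconds="${2:-120}"
  aws autoscaling put-lifecycle-hook \
    --lifecycle-hook-name drain-before-terminate \
    --auto-scaling-group-name "$asg_name" \
    --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
    --heartbeat-timeout "$timeout_seconds" \
    --default-result CONTINUE
}
```

With --default-result CONTINUE, termination proceeds once the timeout expires even if nothing completes the lifecycle action, so a stuck cleanup script cannot hold an unhealthy instance indefinitely.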

How to prevent it from coming back

Flipping one ASG is easy. Keeping every ASG correct across accounts and teams takes guardrails.

Gate it in your IaC pipeline

Use policy-as-code to reject any ASG that has load balancer or target group associations but is not using ELB health checks. With OPA / Conftest against a Terraform plan:

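package main

import rego.v1

# Deny ASGs that are attached to a target group but do not use ELB health checks.
# For Classic Load Balancers, add an equivalent rule on values.load_balancers.
deny contains msg if {
  some rc in input.resource_changes
  rc.type == "aws_autoscaling_group"
  values := rc.change.after
  count(values.target_group_arns) > 0
  # An unset health_check_type can be missing from the plan; treat it as the EC2 default.
  object.get(values, "health_check_type", "EC2") != "ELB"
  msg := sprintf("ASG '%s' is attached to a target group but is not using ELB health checks", [rc.address])
}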

Wire this into CI so the build fails before the misconfiguration reaches AWS:

terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
conftest test tfplan.json --policy policy/

Detect drift with AWS Config

For resources created outside your pipeline, AWS Config can flag any group whose health check type is EC2 while it has attached load balancers, either through the managed rule autoscaling-group-elb-healthcheck-required or through a custom rule. Route findings to Security Hub or an SNS topic so they get triaged.
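AWS Config ships a managed rule for this condition, identified as AUTOSCALING_GROUP_ELB_HEALTHCHECK_REQUIRED. A minimal sketch of enabling it from the CLI follows; it assumes a configuration recorder is already running in the account and region, and the wrapper function name is hypothetical.

```shell
# Enable the AWS-managed Config rule that flags load-balanced ASGs
# not using ELB health checks. Requires an active configuration recorder.
enable_elb_healthcheck_rule() {
  aws configservice put-config-rule --config-rule '{
    "ConfigRuleName": "autoscaling-group-elb-healthcheck-required",
    "Source": {
      "Owner": "AWS",
      "SourceIdentifier": "AUTOSCALING_GROUP_ELB_HEALTHCHECK_REQUIRED"
    },
    "Scope": {
      "ComplianceResourceTypes": ["AWS::AutoScaling::AutoScalingGroup"]
    }
  }'
}
```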

Continuously monitor with Lensix

The asg_noelbhealthcheck check runs as part of the autoscaling_checks module and surfaces every offending ASG across your AWS accounts on a schedule. That catches the groups that slip past CI, were created manually during an incident, or were imported from another team.

Best practices

Always use ELB health checks for load-balanced ASGs. If traffic flows through a load balancer, the ASG should trust the load balancer's view of health.
Keep health check endpoints lightweight. A dedicated /healthz route that returns quickly without calling databases or external services prevents downstream issues from cascading into instance terminations.
Distinguish readiness from liveness. With the ELB health check type, the ASG acts on the same target group health check that the load balancer uses for routing, so a deep check (one that validates dependencies) can turn a dependency outage into mass instance replacement. Use a deep check only if you handle it carefully. When in doubt, keep the check simple.
Tune the grace period. Too short and instances get killed mid-boot, too long and broken instances linger. Measure real startup time and add a buffer.
Set sensible health check thresholds. An unhealthy threshold of 2 to 3 consecutive failures avoids reacting to a single transient blip while still replacing genuinely broken instances quickly.
Test failover regularly. Kill the app process on one instance in a non-prod environment and confirm the ASG replaces it. If it does not, your health check type or path is wrong.
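The failover test in the last item can be partly automated. After stopping the app process on a non-production instance, a polling loop like the sketch below reports whether the ASG ever marks that instance unhealthy. The function name, the default attempt count, and the delay are assumptions; the text output "None" is treated as the instance having already left the group.

```shell
# Poll the ASG's view of one instance until it is marked unhealthy or removed.
# Usage: wait_for_unhealthy <instance-id> [max_attempts] [sleep_seconds]
# Prints "unhealthy", "gone", or "still-healthy" (and returns 1 in the last case).
wait_for_unhealthy() {
  instance_id="$1"
  attempts="${2:-20}"
  delay="${3:-15}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(aws autoscaling describe-auto-scaling-instances \
      --instance-ids "$instance_id" \
      --query 'AutoScalingInstances[0].HealthStatus' \
      --output text)
    case "$status" in
      UNHEALTHY|Unhealthy) echo "unhealthy"; return 0 ;;
      None) echo "gone"; return 0 ;;
    esac
    i=$(( i + 1 ))
    sleep "$delay"
  done
  echo "still-healthy"
  return 1
}
```

If the loop ends with "still-healthy" well after the grace period and the load balancer's detection window have passed, the group is most likely still on EC2 health checks or the health check path is not exercising the application.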

EC2 health checks tell you whether a box is on. ELB health checks tell you whether your application works. For anything sitting behind a load balancer, only the second answer matters.
