AWS ASG Suspended Processes: Risks & Fix | Lensix

TL;DR

This check flags Auto Scaling groups that have one or more suspended processes, which can silently break scaling, health replacement, and load balancing. Resume the processes with aws autoscaling resume-processes once you confirm no maintenance is in progress.

An Auto Scaling group (ASG) is supposed to be your safety net. When a host falls over, the ASG replaces it. When traffic spikes, the ASG adds capacity. When you roll out a new launch template, an instance refresh replaces the fleet in controlled batches. All of that depends on the ASG's internal processes actually running. When those processes are suspended, the group keeps reporting healthy in the console while quietly doing nothing.

This Lensix check, asg_suspendedprocesses, looks at every Auto Scaling group in your AWS account and reports any group that has suspended scaling processes.

What this check detects

Each Auto Scaling group runs a set of background processes that handle different parts of its behavior. You can suspend any of them individually. The most common ones are:

Launch — adds instances to the group during scale-out or replacement.
Terminate — removes instances during scale-in or replacement.
HealthCheck — marks instances unhealthy based on EC2 or ELB health checks.
ReplaceUnhealthy — terminates unhealthy instances and launches replacements.
AlarmNotification — receives scaling notifications from CloudWatch alarms.
ScheduledActions — runs scheduled scaling actions.
AddToLoadBalancer — registers new instances with a load balancer or target group.
AZRebalance — balances instances across Availability Zones.
InstanceRefresh — runs rolling instance refreshes.

The check fires when any of these are suspended. The presence of suspended processes does not always mean something is broken, but it always means the group is not behaving the way most engineers assume it does.

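Note: Suspended processes are sometimes set automatically by AWS. For example, if a group keeps failing to launch instances for an extended period (AWS documents this as an administrative suspension, most commonly after more than 24 hours without a successful launch), Amazon EC2 Auto Scaling may suspend processes on the group to avoid thrashing. So a suspended process can be a symptom of a deeper problem, not just a manual setting someone forgot to undo.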

Why it matters

The danger here is that suspended processes are invisible in the places people normally look. An ASG with HealthCheck and ReplaceUnhealthy suspended shows the same green status as a fully working one. Capacity numbers look fine. Nothing alarms. Then a node dies and never gets replaced, and you find out during an incident instead of before one.

A few concrete failure modes:

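Dead instances stay in service. With HealthCheck or ReplaceUnhealthy suspended, a crashed instance stays in the group and is never replaced. Behind a load balancer it drops out of rotation and you run below your intended capacity; without one, it can keep receiving traffic and users see errors. Either way, the ASG never reacts.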
No scaling under load. With Launch or AlarmNotification suspended, CloudWatch alarms fire but no new capacity comes online. Latency climbs, then requests start timing out.
New instances never take traffic. With AddToLoadBalancer suspended, instances launch and pass health checks but are never registered with the target group, so you pay for capacity that does nothing.
Deploys silently break. A suspended InstanceRefresh or Terminate process can leave a rolling deploy half-finished, with old and new versions running side by side.

There is also a security angle. If Terminate is suspended, an instance that you want to cycle out, perhaps because it was compromised or holds stale credentials, will not be replaced automatically. Your blast-radius controls assume the ASG is doing its job.

Warning: Suspended processes commonly get left behind after a manual maintenance window. Someone suspends Terminate to debug a node, gets paged onto something else, and never resumes it. Weeks later the group quietly stops self-healing. Always treat a suspension as temporary and tracked.

How to fix it

First, find out which processes are suspended and, ideally, why. Do not blindly resume everything before you understand the context, because some suspensions are intentional and active.

1. Inspect the group

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-app-asg \
  --query 'AutoScalingGroups[0].SuspendedProcesses'

The output lists each suspended process and a reason:

[
  {
    "ProcessName": "Terminate",
    "SuspensionReason": "User suspended at 2024-03-02T14:11:09Z"
  },
  {
    "ProcessName": "AZRebalance",
    "SuspensionReason": "AZRebalance was suspended because launches were failing."
  }
]

The SuspensionReason tells you a lot. "User suspended" means a human or a tool did it on purpose. An AWS-generated reason like "launches were failing" means you have an underlying issue (a bad launch template, an exhausted subnet, a service quota you have hit) that you need to fix before resuming.
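
As a rough sketch of that triage, the shell functions below label each suspension as manual or AWS-initiated. The function names are hypothetical, and the only rule applied is the "User suspended" prefix seen in the sample output; any other reason string is assumed to be AWS-generated.

```shell
# Classify one SuspensionReason string.
# Assumption: manual suspensions start with "User suspended".
classify_suspension() {
  case "$1" in
    "User suspended"*) echo "manual" ;;
    *)                 echo "aws-initiated" ;;
  esac
}

# Read tab-separated "ProcessName<TAB>SuspensionReason" lines on stdin
# and print "ProcessName: classification" for each one.
triage_suspensions() {
  tab="$(printf '\t')"
  while IFS="$tab" read -r name reason; do
    printf '%s: %s\n' "$name" "$(classify_suspension "$reason")"
  done
}

# Example usage (requires AWS credentials):
#   aws autoscaling describe-auto-scaling-groups \
#     --auto-scaling-group-names my-app-asg \
#     --query 'AutoScalingGroups[0].SuspendedProcesses[].[ProcessName,SuspensionReason]' \
#     --output text | triage_suspensions
```

Anything labeled aws-initiated belongs in step 2 before you consider resuming it.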

2. Fix any underlying cause first

If the suspension was triggered by failing launches, resuming the process will just trigger more failures. Check the activity history to find the real error:

aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-app-asg \
  --max-records 10 \
  --query 'Activities[].{Status:StatusCode,Reason:StatusMessage,Time:StartTime}'

Common culprits: an AMI that no longer exists, an IAM instance profile that was deleted, no free IPs left in the subnet, or an EC2 quota you have hit.

3. Resume the processes

Danger: Resuming Terminate or ReplaceUnhealthy can immediately start terminating instances the ASG considers unhealthy or excess. If a maintenance task is mid-flight, this can pull running production nodes out from under it. Confirm there is no active maintenance and that your health checks are reporting accurately before you resume.

Resume specific processes:

aws autoscaling resume-processes \
  --auto-scaling-group-name my-app-asg \
  --scaling-processes Terminate HealthCheck ReplaceUnhealthy

Or, if you have confirmed all suspensions are unintentional, resume everything:

aws autoscaling resume-processes \
  --auto-scaling-group-name my-app-asg

4. Verify

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-app-asg \
  --query 'AutoScalingGroups[0].SuspendedProcesses'

An empty array means all processes are running.
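
If you want the verification to gate a script rather than rely on someone reading the output, one option is to ask the CLI for the number of suspended processes and compare it to zero. This is a minimal sketch; asg_has_no_suspensions is a hypothetical helper name, not an AWS command.

```shell
# Succeeds (exit 0) only when the named group reports zero suspended processes.
# Fails if the count is non-zero or if the AWS CLI call itself fails.
asg_has_no_suspensions() {
  count="$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query 'length(AutoScalingGroups[0].SuspendedProcesses)' \
    --output text)" || return 1
  [ "$count" = "0" ]
}

# Example usage:
#   asg_has_no_suspensions my-app-asg || echo "my-app-asg still has suspended processes"
```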

Tip: If you regularly suspend processes during deploys, script the resume step as a guaranteed cleanup that runs even when the deploy fails. In a shell pipeline, wrap it in a trap; in a CI job, run it as an always() step. The most common cause of this finding is a resume that never happened because something else errored first.
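
A minimal sketch of the trap approach, assuming you only need Terminate suspended while a deploy command runs. The wrapper name with_terminate_suspended and the deploy.sh script in the usage comment are hypothetical.

```shell
# Suspend Terminate on a group, run a command, and resume Terminate afterwards
# regardless of whether the command succeeded.
with_terminate_suspended() {
  asg="$1"
  shift
  (
    # The EXIT trap is registered before suspending, so the resume runs when
    # this subshell ends, whatever the exit status of the wrapped command.
    trap 'aws autoscaling resume-processes --auto-scaling-group-name "$asg" --scaling-processes Terminate' EXIT
    aws autoscaling suspend-processes \
      --auto-scaling-group-name "$asg" \
      --scaling-processes Terminate || exit 1
    "$@"
  )
}

# Example usage:
#   with_terminate_suspended my-app-asg ./deploy.sh
```

The wrapper returns the wrapped command's exit status, so a failed deploy still fails the pipeline after the resume has run.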

How to prevent it from happening again

The goal is to make sure no group is left with suspended processes outside of a deliberate, short-lived window.

Catch drift in Terraform

Terraform's aws_autoscaling_group resource has a suspended_processes argument. Leave it unset, or set it to an empty list, and any out-of-band suspension shows up as drift on the next terraform plan.

resource "aws_autoscaling_group" "app" {
  name                = "my-app-asg"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 4
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }

  # Explicitly suspend nothing. Drift surfaces in plan.
  suspended_processes = []

  health_check_type         = "ELB"
  health_check_grace_period = 120
}

Running terraform plan on a schedule (for example a nightly CI job that fails on any non-empty diff) turns drift detection into an early warning system.
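
One way to sketch that scheduled job is with terraform plan's -detailed-exitcode flag, which exits 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The function name check_drift is hypothetical, and the sketch assumes it runs in an already initialized working directory.

```shell
# Report drift based on the exit code of `terraform plan -detailed-exitcode`.
# Returns 0 for no drift, 2 for drift, 1 for any plan error.
check_drift() {
  rc=0
  terraform plan -detailed-exitcode -input=false >/dev/null || rc=$?
  case "$rc" in
    0) echo "no drift"; return 0 ;;
    2) echo "drift detected"; return 2 ;;
    *) echo "plan failed"; return 1 ;;
  esac
}

# Example usage in a nightly job:
#   check_drift || exit 1
```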

Gate with policy-as-code

If you generate Terraform plan JSON in CI, you can fail the build when a plan introduces suspended processes. Here is an OPA/Rego rule:

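package autoscaling

import rego.v1

# Deny any planned Auto Scaling group that ends up with suspended processes.
deny contains msg if {
  some rc in input.resource_changes
  rc.type == "aws_autoscaling_group"
  # Undefined (so no violation) when the resource is being deleted and `after` is null.
  count(rc.change.after.suspended_processes) > 0
  msg := sprintf("ASG %q must not define suspended_processes", [rc.address])
}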
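
To wire a rule like this into CI, one possible approach is to count the entries in the deny set with opa eval and fail when the count is not zero. This is a sketch under several assumptions: the rule is saved as policy.rego, the plan JSON comes from terraform show -json, and opa eval's raw output format prints the bare number. The function name gate_plan is hypothetical.

```shell
# Fail when the Terraform plan JSON violates the autoscaling deny rule.
# Assumed inputs, produced beforehand with:
#   terraform plan -out=tfplan && terraform show -json tfplan > plan.json
gate_plan() {
  plan_json="$1"
  policy="$2"
  # Query the size of the deny set; treat any opa failure as a gate failure.
  violations="$(opa eval --format raw --input "$plan_json" --data "$policy" \
    'count(data.autoscaling.deny)')" || return 1
  [ "$violations" = "0" ]
}

# Example usage:
#   gate_plan plan.json policy.rego || { echo "plan suspends ASG processes"; exit 1; }
```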

Alert on live state

Drift detection only covers groups managed by IaC and only at plan time. To catch runtime suspensions, including the ones AWS sets automatically, schedule a check against the live API. A minimal example with the AWS CLI and a JMESPath --query filter:

aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups[?length(SuspendedProcesses)>`0`].[AutoScalingGroupName]' \
  --output text

Any output here is a group that needs attention. Wire it into your alerting, or let Lensix run the check continuously so you do not have to maintain the plumbing yourself.
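
That command only looks at the CLI's current region. A sketch of a sweep across every region enabled for the account could look like the following; the function name is hypothetical, and it assumes the caller has permission to call ec2 describe-regions.

```shell
# Print "region<TAB>group-name" for every ASG with at least one suspended process.
suspended_asgs_all_regions() {
  for region in $(aws ec2 describe-regions \
      --query 'Regions[].RegionName' --output text); do
    aws autoscaling describe-auto-scaling-groups \
      --region "$region" \
      --query 'AutoScalingGroups[?length(SuspendedProcesses)>`0`].[AutoScalingGroupName]' \
      --output text |
    while read -r name; do
      # Skip blank lines so regions with no findings print nothing.
      if [ -n "$name" ]; then
        printf '%s\t%s\n' "$region" "$name"
      fi
    done
  done
}

# Example usage:
#   suspended_asgs_all_regions
```

For multiple accounts you would repeat the sweep per profile or assumed role, which is outside this sketch.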

Best practices

Treat every suspension as temporary. If you suspend a process, set yourself a hard reminder to resume it, and prefer automation that resumes on completion or failure.
Suspend the narrowest set possible. If you only need to stop scale-in during a deploy, suspend Terminate alone, not the whole group. The fewer processes you touch, the less self-healing you give up.
Use the right health check type. For groups behind a load balancer, set health_check_type = "ELB" so application-level failures count, not just EC2 status. A suspended HealthCheck on top of a misconfigured check type leaves you doubly exposed.
Document intentional suspensions. If a group genuinely needs a process suspended long term, record why in the IaC, next to the suspended_processes setting. The SuspensionReason field is generated by AWS and cannot carry your explanation. Future you, and the on-call engineer at 3am, will thank you.
Monitor scaling activity, not just capacity. A group can sit at the right instance count while quietly failing to replace dead nodes. Watch describe-scaling-activities for repeated failures.

An Auto Scaling group that cannot scale, heal, or terminate is just an expensive static fleet wearing a costume. The whole point is the automation, so keep the processes running.

Run this check across all regions and accounts, fix the underlying causes before resuming, and add a plan-time gate plus a runtime alert so the finding does not come back.
