Fix EC2 System Status Check Failing | Lensix

TL;DR

This check flags EC2 instances whose system status check is failing, which usually points to a problem on the underlying AWS hardware or hypervisor rather than your OS. The quickest fix is to stop and start the instance so AWS migrates it to healthy hardware.

EC2 instances run on physical servers managed by AWS. When the hardware or the hypervisor underneath your instance starts to fail, your workload can become unreachable even though nothing is wrong with your application or operating system. AWS surfaces these problems through status checks, and the Instance Health Check Failing check in Lensix watches specifically for a failing system status check.

This post explains what the check looks at, why a failing system status check is worth acting on quickly, how to recover an affected instance, and how to build resilience so a single bad host never takes down your service.

What this check detects

AWS runs two independent status checks on every running EC2 instance:

System status checks monitor the AWS infrastructure your instance depends on: the physical host, network connectivity to the host, the hypervisor, and the underlying power and hardware. A failure here is on the AWS side.
Instance status checks monitor the software and network configuration of your individual instance: things like a corrupted file system, exhausted memory, a kernel panic, or bad networking config inside the guest.

The Lensix ec2_health check fires when the system status check is reporting a status other than passed. That means AWS has detected a problem with the hardware or hypervisor hosting your instance, and the issue is almost always something only a host migration can resolve.

Note: System and instance status checks are different. A failing instance status check usually requires a fix inside the guest OS (rebooting, fixing fstab, freeing disk space). A failing system status check, which is what this check tracks, points at the physical host. Knowing which one failed tells you where to look.
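To make the distinction concrete, here is a minimal sketch of a triage helper. The function name `triage_status_checks` and the runbook labels it prints are hypothetical, chosen only for illustration. The status strings (`ok`, `impaired`, and so on) are the values `describe-instance-status` reports.

```shell
# Hypothetical helper: map the two status check results to a runbook label.
#   $1 = SystemStatus.Status, $2 = InstanceStatus.Status
#
# One possible way to feed it (assumes the AWS CLI is configured):
#   read -r sys inst < <(aws ec2 describe-instance-status \
#     --instance-ids i-0123456789abcdef0 \
#     --query 'InstanceStatuses[0].[SystemStatus.Status,InstanceStatus.Status]' \
#     --output text)
#   triage_status_checks "$sys" "$inst"
triage_status_checks() {
  local system="$1" instance="$2"
  if [ "$system" = "impaired" ]; then
    # Host-side problem: the case this check tracks.
    echo "host-runbook"
  elif [ "$instance" = "impaired" ]; then
    # Guest-side problem: fix it inside the OS.
    echo "guest-runbook"
  elif [ "$system" = "ok" ] && [ "$instance" = "ok" ]; then
    echo "healthy"
  else
    # initializing, insufficient-data, not-applicable
    echo "wait-and-recheck"
  fi
}
```

The system check is evaluated first because a failing host can also make the instance check fail, and a guest-side fix will not help in that case.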

Why it matters

A failing system status check is not a warning you can sit on. The instance is either already impaired or about to become unreachable, and the cause is outside your control. Here is what that means in practice.

Your workload can go dark without warning

If the host is degrading, the instance may stop responding to traffic, lose network reachability, or freeze entirely. For a single-instance service with no redundancy, that is a full outage. Customers see errors, health checks on your load balancer start failing, and your on-call gets paged.

Data on instance store is at risk

Instances that use instance store (ephemeral) volumes lose that data if the host fails or if you stop and start the instance. If you have been treating instance store as durable storage, a hardware failure can mean permanent data loss.

Danger: Stopping and starting an instance to recover from a system status check failure moves it to new hardware, which permanently discards any data on instance store volumes. Confirm what is on local storage before you act, and back it up first if it matters.
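One way to find out whether local storage is in play is to ask EC2 whether the instance's type comes with instance store volumes. The sketch below assumes a configured AWS CLI, and `has_instance_store` is a hypothetical name. A "yes" only means the instance type offers instance store; whether anything important is written there is something you still have to check on the instance itself.

```shell
# Hypothetical helper: report whether an instance's type includes
# instance store volumes. Prints "yes" or "no".
has_instance_store() {
  local instance_id="$1" instance_type supported

  # Look up the instance type of the affected instance.
  instance_type=$(aws ec2 describe-instances \
    --instance-ids "$instance_id" \
    --query 'Reservations[0].Instances[0].InstanceType' \
    --output text) || return 1

  # Ask whether that type ships with instance store volumes.
  supported=$(aws ec2 describe-instance-types \
    --instance-types "$instance_type" \
    --query 'InstanceTypes[0].InstanceStorageSupported' \
    --output text) || return 1

  case "$supported" in
    [Tt]rue) echo "yes" ;;
    *)       echo "no" ;;
  esac
}
```

Calling `has_instance_store i-0123456789abcdef0` before a stop/start gives you a quick prompt to go and look at what lives on the local disks.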

Stateful single points of failure hurt the most

A self-managed database, a message broker, or a stateful cache running on one EC2 instance with no replica turns a routine host failure into a recovery incident. The check matters more the more critical and less redundant the workload is.

How to fix it

The goal of remediation is to get the instance onto healthy hardware. For EBS-backed instances, the standard move is a stop and start, which causes AWS to provision the instance on a different physical host.

Step 1: Confirm the failing check and instance type

First, verify what is actually failing and whether the instance is EBS-backed or instance-store-backed.

aws ec2 describe-instance-status \
  --instance-ids i-0123456789abcdef0 \
  --query 'InstanceStatuses[0].{System:SystemStatus.Status,Instance:InstanceStatus.Status}' \
  --output table

Check the root device and storage layout so you know what survives a stop/start:

aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].{RootDeviceType:RootDeviceType,Volumes:BlockDeviceMappings}' \
  --output json

If RootDeviceType is ebs, a stop/start is safe for the root volume and any attached EBS volumes. If it is instance-store, the instance cannot be stopped at all, only rebooted or terminated, so you will need to launch a replacement instead.

Step 2: Stop and start the instance (EBS-backed)

Warning: A stop/start causes downtime while the instance moves hosts, and the public IPv4 address changes unless you use an Elastic IP. Schedule this during a maintenance window or shift traffic away first if the workload is in the path of live requests.

# Stop the instance and wait for it to fully stop
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0

# Start it again on healthy hardware
aws ec2 start-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-running --instance-ids i-0123456789abcdef0

Once it is running, confirm the system status check is passing again. The CLI reports a passing check as ok:

aws ec2 wait system-status-ok --instance-ids i-0123456789abcdef0

aws ec2 describe-instance-status \
  --instance-ids i-0123456789abcdef0 \
  --query 'InstanceStatuses[0].SystemStatus.Status' \
  --output text
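Steps 1 and 2 can be combined into a single guarded routine. The following is a sketch under the assumption that the AWS CLI is configured for the right account and Region. `migrate_to_new_host` is a hypothetical name, and the function refuses to act unless the root device is EBS.

```shell
# Hypothetical wrapper: stop/start an instance only when its root device is EBS.
migrate_to_new_host() {
  local instance_id="$1" root_type

  root_type=$(aws ec2 describe-instances \
    --instance-ids "$instance_id" \
    --query 'Reservations[0].Instances[0].RootDeviceType' \
    --output text) || return 1

  # Instance-store-backed instances cannot be stopped, so bail out early.
  if [ "$root_type" != "ebs" ]; then
    echo "refusing: root device is ${root_type}, launch a replacement instead" >&2
    return 1
  fi

  # Stop, wait, start, wait, then wait for the system status check to pass.
  aws ec2 stop-instances --instance-ids "$instance_id" >/dev/null || return 1
  aws ec2 wait instance-stopped --instance-ids "$instance_id" || return 1
  aws ec2 start-instances --instance-ids "$instance_id" >/dev/null || return 1
  aws ec2 wait instance-running --instance-ids "$instance_id" || return 1
  aws ec2 wait system-status-ok --instance-ids "$instance_id" || return 1
  echo "migrated"
}
```

The guard does not look at instance store data volumes attached to an EBS-backed instance, so the earlier warning about local storage still applies before you run it.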

Step 3: Handle instance-store-backed instances

If the root device is instance store, stopping is not an option. Launch a replacement instance from your AMI, restore data from backups, and terminate the impaired instance once the replacement is serving traffic.

aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type m5.large \
  --subnet-id subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --key-name my-keypair

Step 4: Associate an Elastic IP if needed

If the instance served traffic on its auto-assigned public IP rather than an Elastic IP, the address changed during the stop/start. Update anything that pointed at the old address, then associate an Elastic IP so future host migrations do not change your endpoint.

aws ec2 associate-address \
  --instance-id i-0123456789abcdef0 \
  --allocation-id eipalloc-0123456789abcdef0
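The command above assumes you already have an allocation ID. If you do not, a minimal sketch of allocating an address and associating it in one go could look like this. `attach_new_elastic_ip` is a hypothetical name, and the sketch assumes a VPC instance and a configured AWS CLI. Elastic IPs count against a per-Region quota, so allocate deliberately.

```shell
# Hypothetical helper: allocate a new Elastic IP and associate it with an
# instance. Prints the allocation ID so you can record it.
attach_new_elastic_ip() {
  local instance_id="$1" allocation_id

  # Allocate a new address for use in a VPC and capture its allocation ID.
  allocation_id=$(aws ec2 allocate-address \
    --domain vpc \
    --query 'AllocationId' \
    --output text) || return 1

  # Associate the freshly allocated address with the instance.
  aws ec2 associate-address \
    --instance-id "$instance_id" \
    --allocation-id "$allocation_id" >/dev/null || return 1

  echo "$allocation_id"
}
```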

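Tip: AWS can recover impaired instances automatically. Create a CloudWatch alarm on the StatusCheckFailed_System metric with a recover action, and AWS will migrate the instance to new hardware for you without manual intervention. The recovered instance keeps its instance ID, private IP addresses, Elastic IP addresses, and attached EBS volumes, but anything held only in memory is lost.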

aws cloudwatch put-metric-alarm \
  --alarm-name "ec2-system-recover-i-0123456789abcdef0" \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --alarm-actions "arn:aws:automate:us-east-1:ec2:recover"

How to prevent it from happening again

You cannot stop AWS hardware from occasionally failing. What you can do is make sure a single host failure never becomes an outage, and that recovery happens automatically.

Enable automatic recovery on every instance

The CloudWatch recover alarm above should be the default for any instance you care about. With Terraform, bake it into your instance module so no one forgets it:

resource "aws_cloudwatch_metric_alarm" "system_recover" {
  alarm_name          = "ec2-system-recover-${aws_instance.app.id}"
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed_System"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"

  dimensions = {
    InstanceId = aws_instance.app.id
  }

  # Assumes the module declares: data "aws_region" "current" {}
  alarm_actions = ["arn:aws:automate:${data.aws_region.current.name}:ec2:recover"]
}

Run workloads in Auto Scaling Groups

An Auto Scaling Group with health checks will terminate an impaired instance and launch a fresh one automatically. Pair it with an EC2 health check type so status check failures trigger replacement:

resource "aws_autoscaling_group" "app" {
  desired_capacity          = 3
  min_size                  = 2
  max_size                  = 6
  vpc_zone_identifier       = var.private_subnet_ids
  health_check_type         = "EC2" # replace instances that fail EC2 status checks
  health_check_grace_period = 120

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}

Spread instances across Availability Zones

Most hardware failures are local to a single host, but broader infrastructure events can affect an entire AZ. Distribute instances across at least two, and ideally three, AZs so no single zone failure takes the whole service offline.

Note: Auto Scaling Groups balance instances across the subnets you give them. List subnets in different AZs in vpc_zone_identifier and the group spreads capacity for you.
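To check that capacity really is spread out, you can count running instances per Availability Zone. This is a small sketch: `count_by_az` is a hypothetical helper, and the group name `app` in the usage comment is an assumption, so substitute your own.

```shell
# Hypothetical helper: count instances per Availability Zone.
# Reads whitespace-separated AZ names on stdin and prints "<count> <az>" lines.
#
# One possible way to feed it (assumes the AWS CLI is configured and the
# Auto Scaling Group is named "app"):
#   aws ec2 describe-instances \
#     --filters "Name=tag:aws:autoscaling:groupName,Values=app" \
#               "Name=instance-state-name,Values=running" \
#     --query 'Reservations[].Instances[].Placement.AvailabilityZone' \
#     --output text | count_by_az
count_by_az() {
  tr -s '[:space:]' '\n' | sed '/^$/d' | sort | uniq -c | awk '{print $1, $2}'
}
```

If one zone holds all or nearly all of the instances, the group is not giving you the zone-level resilience described above.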

Treat instance store as throwaway storage

Anything that needs to survive a host migration belongs on EBS, S3, RDS, or another durable store. Reserve instance store for caches, scratch space, and data you can rebuild.

Best practices

Design for replacement, not repair. Treat individual instances as disposable. If losing one instance is a problem, you have a single point of failure to fix, not a hardware problem to mourn.
Use Elastic IPs or load balancers for stable endpoints. Never depend on the auto-assigned public IP for anything, since it changes on every stop/start.
Back up stateful instances regularly. Use Data Lifecycle Manager or AWS Backup to snapshot EBS volumes on a schedule so recovery is fast.
Alarm on both status check metrics. Watch StatusCheckFailed_System for host issues and StatusCheckFailed_Instance for guest issues, and route them to different runbooks since the fixes differ.
Monitor scheduled events. AWS sometimes schedules retirement or maintenance for degrading hosts before they fail outright. Watch for these events and migrate proactively during a window you choose.

aws ec2 describe-instance-status \
  --query 'InstanceStatuses[?Events].{Instance:InstanceId,Events:Events}' \
  --output json
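For the practice of alarming on both status check metrics, a guest-side counterpart to the recover alarm might look like the sketch below. `create_instance_check_alarm` is a hypothetical name, and the choice of three evaluation periods is an assumption rather than a recommendation. It uses the EC2 reboot alarm action; whether an automatic reboot is appropriate depends on the workload, and you can point `--alarm-actions` at an SNS topic ARN instead if you would rather be notified.

```shell
# Hypothetical helper: alarm on the instance (guest) status check and reboot.
#   $1 = instance ID, $2 = Region
create_instance_check_alarm() {
  local instance_id="$1" region="$2"
  aws cloudwatch put-metric-alarm \
    --alarm-name "ec2-instance-reboot-${instance_id}" \
    --namespace AWS/EC2 \
    --metric-name StatusCheckFailed_Instance \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 3 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --alarm-actions "arn:aws:automate:${region}:ec2:reboot"
}
```

Keeping this alarm separate from the system-check alarm keeps host issues and guest issues on separate runbooks.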

A failing system status check is one of the clearer signals in EC2: the host is the problem, and the answer is almost always to move the instance somewhere healthy. Build that recovery in once with auto recovery and Auto Scaling, and these alerts turn from incidents into non-events.

Instance Health Check Failing: Recovering EC2 from Hardware and Hypervisor Issues