Fix Undersized EC2 Instances (High CPU) | Lensix

TL;DR

Lensix flags EC2 instances whose CPU has been pinned high for an extended period, a sign the instance is too small for its workload. Confirm the trend in CloudWatch, then resize to a larger instance type or adopt the type Compute Optimizer recommends before performance degrades or your users notice.

An instance running flat out might look like good value. You are paying for capacity and you are using all of it. In practice, sustained high CPU is rarely a sign of efficiency. It usually means requests are queuing, latency is creeping up, and you have no headroom left for traffic spikes, deployments, or background jobs. This check exists to catch that situation before it turns into an incident.

What this check detects

The ec2_undersized check looks at CloudWatch CPU utilisation metrics for your running EC2 instances over a recent window and flags any instance where utilisation has stayed consistently high. "Consistently" is the important word here. A brief spike during a batch job or a deploy is normal. What Lensix cares about is the instance that sits at 85 to 100 percent CPU for hours or days at a time with little relief.

High sustained CPU is a strong indicator that the instance type you picked no longer matches the work it is being asked to do. That can happen gradually as traffic grows, or suddenly after a new feature ships that is more compute heavy than expected.
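
The difference between a spike and sustained pressure can be sketched in a few lines of Python. This is only an illustration of the idea, not Lensix's actual detection logic: the 85 percent threshold and the six-sample minimum run are assumptions chosen for the example, and `longest_high_run` and `is_sustained_high` are hypothetical helper names.

```python
from typing import Sequence


def longest_high_run(samples: Sequence[float], threshold: float = 85.0) -> int:
    """Return the length of the longest consecutive run of samples at or above threshold."""
    longest = current = 0
    for value in samples:
        if value >= threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # any dip below the threshold breaks the run
    return longest


def is_sustained_high(samples: Sequence[float], threshold: float = 85.0, min_run: int = 6) -> bool:
    """Treat CPU as sustained-high only if it stays hot for min_run samples in a row."""
    return longest_high_run(samples, threshold) >= min_run


# Hourly averages: a short batch-job spike versus an instance that never gets relief.
spike = [30, 35, 98, 97, 40, 32, 31, 33]
pinned = [88, 91, 95, 97, 93, 90, 89, 96]

print(is_sustained_high(spike))   # False
print(is_sustained_high(pinned))  # True
```

Requiring a consecutive run, rather than counting hot samples anywhere in the window, is what keeps a two-hour batch job from being flagged.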

Note: CPU is only one dimension. An instance can also be undersized on memory, network throughput, or EBS bandwidth without CPU ever looking high. This check focuses on CPU because it is the most reliable signal AWS exposes by default. Memory utilisation requires the CloudWatch agent, which many accounts do not have installed.

Why it matters

Running an instance at the ceiling of its CPU capacity has knock-on effects that reach well beyond a busy-looking graph.

Latency and timeouts. When CPU is saturated, work queues. Request latency climbs, health checks start failing intermittently, and load balancers may pull the instance out of rotation, pushing even more load onto the remaining instances.
No headroom for spikes. An instance at 95 percent has nothing left for a traffic surge, a Black Friday event, or a heavy cron job. The first unexpected spike tips it over.
Cascading failures. In an autoscaling group or a cluster, one overloaded node failing health checks can trigger replacements, redistribute load, and overload the next node. This is how a slow burn becomes a full outage.
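Burstable instance throttling. If the instance is a T-family type (t3, t4g, and so on) running in standard mode, sustained high CPU burns through CPU credits. Once credits run out, the instance is throttled hard to its baseline, which is as low as 5 to 20 percent per vCPU on the smallest sizes. Performance falls off a cliff and stays there. In unlimited mode, which T3 and T4g instances use by default, the instance keeps bursting instead, and you are billed for the surplus credits.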

Warning: Burstable instances are the most common cause of mysterious, intermittent slowdowns. The instance runs fine for a while, exhausts its credits, then crawls. Check the CPUCreditBalance metric. If it trends toward zero, you are not undersized in the usual sense; you are on the wrong instance family.
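
To get a feel for how quickly credits disappear, here is a rough estimate in Python. It relies on the definition that one CPU credit equals one vCPU at 100 percent for one minute, and it assumes a standard-mode instance with steady utilisation. The function name is hypothetical, and the t3.micro figures in the example (2 vCPUs, 10 percent baseline, 288 maximum credits) should be checked against the current AWS documentation for your instance size.

```python
from typing import Optional


def hours_until_credits_exhausted(
    balance: float, vcpus: int, baseline_pct: float, avg_cpu_pct: float
) -> Optional[float]:
    """Estimate hours until a standard-mode burstable instance runs out of credits.

    Per hour the instance spends vcpus * 60 * utilisation credits and earns
    vcpus * 60 * baseline credits. Returns None when utilisation is at or below
    baseline, because the balance is not shrinking in that case.
    """
    net_burn_per_hour = vcpus * 60 * (avg_cpu_pct - baseline_pct) / 100.0
    if net_burn_per_hour <= 0:
        return None
    return balance / net_burn_per_hour


# Assumed t3.micro figures: full balance of 288 credits, running at 90 percent CPU.
print(hours_until_credits_exhausted(balance=288, vcpus=2, baseline_pct=10, avg_cpu_pct=90))  # 3.0
```

Three hours from a full balance to throttling is why a burstable instance can look healthy in the morning and crawl by lunchtime.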

There is also a reliability and reputation cost. Slow responses during peak hours are exactly when customers are paying attention. An undersized instance turns your busiest, most valuable traffic into your worst experience.

How to fix it

Start by confirming the pattern, then choose the right remediation. Do not blindly scale up, because the cause matters.

1. Confirm the trend in CloudWatch

Pull the CPU utilisation for the flagged instance over the last week so you can see whether this is sustained or a one-off.

# date -d is GNU syntax; on macOS/BSD use: date -u -v-7d +%Y-%m-%dT%H:%M:%SZ
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 3600 \
  --statistics Average Maximum \
  --output table

If both Average and Maximum sit high across most hours, the instance is genuinely under pressure rather than reacting to a single event.
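
If you would rather have a number than eyeball a table, rerun the command with `--output json` and summarise the result. The sketch below assumes the usual response shape (a `Datapoints` array with `Timestamp`, `Average` and `Maximum` fields); the 85 percent threshold, the `summarise_cpu` name and the sample data are made up for illustration.

```python
import json


def summarise_cpu(raw_json: str, threshold: float = 85.0) -> dict:
    """Summarise the Datapoints array from get-metric-statistics JSON output."""
    # Datapoints are not returned in chronological order, so sort them first.
    # ISO 8601 timestamps in one consistent format sort correctly as strings.
    datapoints = sorted(json.loads(raw_json)["Datapoints"], key=lambda p: p["Timestamp"])
    hot = [p for p in datapoints if p["Average"] >= threshold]
    return {
        "hours": len(datapoints),
        "hot_hours": len(hot),
        "hot_share": len(hot) / len(datapoints) if datapoints else 0.0,
        "peak": max((p["Maximum"] for p in datapoints), default=None),
        "first": datapoints[0]["Timestamp"] if datapoints else None,
    }


# Made-up sample in the shape the CLI emits with --output json.
sample = json.dumps({
    "Label": "CPUUtilization",
    "Datapoints": [
        {"Timestamp": "2024-05-01T02:00:00Z", "Average": 88.0, "Maximum": 97.0, "Unit": "Percent"},
        {"Timestamp": "2024-05-01T00:00:00Z", "Average": 92.0, "Maximum": 100.0, "Unit": "Percent"},
        {"Timestamp": "2024-05-01T03:00:00Z", "Average": 95.0, "Maximum": 99.0, "Unit": "Percent"},
        {"Timestamp": "2024-05-01T01:00:00Z", "Average": 40.0, "Maximum": 55.0, "Unit": "Percent"},
    ],
})

summary = summarise_cpu(sample)
print(summary["hot_hours"], summary["hot_share"], summary["peak"])  # 3 0.75 100.0
```

A `hot_share` close to 1.0 across a full week points to genuine undersizing. A low share with a high `peak` points to an occasional spike.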

2. Check whether it is a burstable instance running out of credits

# date -d is GNU syntax; on macOS/BSD use: date -u -v-24H +%Y-%m-%dT%H:%M:%SZ
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 3600 \
  --statistics Average \
  --output table

If the balance is heading toward zero, you have two choices: move to a non-burstable instance type (the m, c, or r families), or enable unlimited mode on the burstable instance so it can sustain high CPU for an additional charge.

3. Ask AWS Compute Optimizer what it recommends

Compute Optimizer analyses your historical metrics and suggests right-sized instance types. It removes a lot of the guesswork.

aws compute-optimizer get-ec2-instance-recommendations \
  --instance-arns arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0 \
  --query 'instanceRecommendations[0].recommendationOptions' \
  --output json

Tip: Compute Optimizer needs at least 30 hours of metric data and is free to enable. If you have not turned it on, run aws compute-optimizer update-enrollment-status --status Active once per account and let it gather data. It will catch undersized and oversized instances across your whole fleet.
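
The recommendation options come back as a JSON array, and picking one can be scripted. The sketch below assumes each option carries `instanceType`, `rank` and `performanceRisk` fields, as the API response does at the time of writing. The risk cut-off of 2.0 is an arbitrary illustration, `pick_recommendation` is a hypothetical helper, and the sample values are invented.

```python
from typing import Optional


def pick_recommendation(options: list, max_risk: float = 2.0) -> Optional[str]:
    """Return the best-ranked instance type whose performance risk is acceptable.

    Options with no performanceRisk field are treated as unacceptable rather
    than silently trusted. Returns None if nothing passes the filter.
    """
    acceptable = [o for o in options if o.get("performanceRisk", float("inf")) <= max_risk]
    if not acceptable:
        return None
    # Rank 1 is the top recommendation, so the lowest rank wins.
    return min(acceptable, key=lambda o: o["rank"])["instanceType"]


# Invented sample data for illustration only.
options = [
    {"instanceType": "c6i.xlarge", "rank": 1, "performanceRisk": 3.0},
    {"instanceType": "m6i.xlarge", "rank": 2, "performanceRisk": 1.0},
    {"instanceType": "m6i.2xlarge", "rank": 3, "performanceRisk": 0.0},
]

print(pick_recommendation(options))  # m6i.xlarge
```

For an instance that is already struggling, filtering on risk before rank is a deliberately conservative choice: the top-ranked option is not always the one with the most headroom.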

4. Resize the instance

Changing instance type requires a stop and start, which means downtime for that single instance. Plan for it, or do it on a node that can drain behind a load balancer or autoscaling group.

Warning: Stopping an instance changes its public IP unless you use an Elastic IP, and any data on instance store volumes is lost. Confirm your data lives on EBS and that nothing depends on the current public IP before you stop it.

# Stop the instance
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0

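# Change to a larger type. The new type must be compatible with the
# instance's AMI (same CPU architecture and virtualization type).
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --instance-type '{"Value": "m6i.xlarge"}'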

# Start it back up
aws ec2 start-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-running --instance-ids i-0123456789abcdef0

5. Consider scaling out instead of up

For stateless web and API workloads, adding more instances behind a load balancer is often better than buying a bigger one. It gives you redundancy and removes the single point of failure. If the instance lives in an autoscaling group, tune the desired capacity and scaling policy rather than the instance size.

# Scale out by raising desired capacity
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-app-asg \
  --desired-capacity 4 \
  --honor-cooldown

# Or add a target-tracking policy to keep average CPU near 50 percent
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-app-asg \
  --policy-name cpu-target-50 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
    "TargetValue": 50.0
  }'
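
As a back-of-the-envelope check on what desired capacity to expect, you can estimate how many instances bring average CPU down to a target. This sketch assumes total load stays constant and spreads evenly across instances, which is a simplification. `instances_for_target` is a hypothetical helper, not part of any AWS tooling.

```python
import math


def instances_for_target(current_instances: int, current_avg_cpu: float, target_cpu: float = 50.0) -> int:
    """Estimate the instance count needed to bring average CPU down to target_cpu.

    Assumes total CPU demand is fixed and divides evenly across instances.
    """
    if current_instances < 1 or target_cpu <= 0:
        raise ValueError("need at least one instance and a positive target")
    needed = math.ceil(current_instances * current_avg_cpu / target_cpu)
    return max(needed, 1)


# Two instances averaging 90 percent, aiming for 50 percent.
print(instances_for_target(2, 90.0))  # 4
```

Two instances at 90 percent carry 180 percent of one instance's capacity in total, so holding the average near 50 percent takes four of them.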

How to prevent it from happening again

One-off resizing is firefighting. The goal is to make undersizing visible early and to bake sane defaults into how you provision compute.

Alarm on sustained high CPU

Set a CloudWatch alarm that fires when CPU stays high across multiple evaluation periods, not on a single spike. This gives you a heads up well before saturation becomes an outage.

aws cloudwatch put-metric-alarm \
  --alarm-name "ec2-sustained-high-cpu-i-0123456789abcdef0" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 12 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:ops-alerts

Twelve evaluation periods of five minutes means the alarm only triggers after a full hour above 80 percent, filtering out routine spikes.
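
The arithmetic and the "every period must breach" behaviour can be checked with a small simulation. This is a simplified model of the alarm above: it ignores missing-data handling and assumes every evaluation period has to breach, which is the behaviour when no separate datapoints-to-alarm value is set. Both function names are hypothetical.

```python
from typing import Sequence


def alarm_window_seconds(period: int, evaluation_periods: int) -> int:
    """Total time CPU must stay above the threshold before the alarm fires."""
    return period * evaluation_periods


def would_alarm(datapoints: Sequence[float], threshold: float = 80.0, evaluation_periods: int = 12) -> bool:
    """True only if the most recent evaluation_periods datapoints all exceed threshold."""
    if len(datapoints) < evaluation_periods:
        return False
    # GreaterThanThreshold is a strict comparison, so exactly 80 does not breach.
    return all(value > threshold for value in datapoints[-evaluation_periods:])


print(alarm_window_seconds(300, 12))   # 3600
print(would_alarm([95] * 11 + [60]))   # False: one cool period resets the window
print(would_alarm([85] * 12))          # True: a full hour above 80 percent
```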

Use autoscaling for elastic workloads

If a workload can run on more than one instance, put it in an autoscaling group with a target-tracking policy. You stop guessing at instance size and let demand drive capacity. Keep the target CPU around 50 to 60 percent so there is room to absorb a spike while new instances spin up.

Codify instance sizing in IaC

Define instance types in Terraform or CloudFormation so changes are reviewed, not made by hand in the console at 2am during an incident. A variable makes resizing a one-line, peer-reviewed change.

variable "app_instance_type" {
  type    = string
  default = "m6i.large"
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.app.id
  instance_type = var.app_instance_type

  tags = {
    Name = "app-server"
    Team = "platform"
  }
}

Tip: Add a scheduled job that runs Compute Optimizer queries weekly and posts undersized and oversized findings to a Slack channel. Right-sizing is not a one-time project, it is a habit. Workloads drift, and so should your instance choices.
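
The formatting half of such a job might look like the sketch below. It assumes each recommendation has `instanceArn`, `currentInstanceType` and `finding` fields, and it normalises the finding string because the spelling can differ between tools (for example `Underprovisioned` versus `UNDER_PROVISIONED`). The helper names and sample data are hypothetical, and the step that posts to Slack is deliberately left out.

```python
def normalise_finding(finding: str) -> str:
    """Collapse spelling variants such as UNDER_PROVISIONED and Underprovisioned."""
    return finding.replace("_", "").lower()


def build_summary(recommendations: list) -> str:
    """Build a plain-text digest of undersized and oversized instances."""
    lines = []
    for rec in recommendations:
        finding = normalise_finding(rec["finding"])
        if finding not in ("underprovisioned", "overprovisioned"):
            continue  # skip instances that are already right-sized
        instance_id = rec["instanceArn"].rsplit("/", 1)[-1]
        lines.append(f"{finding}: {instance_id} ({rec['currentInstanceType']})")
    if not lines:
        return "No right-sizing findings this week."
    return "\n".join(sorted(lines))


# Invented sample data for illustration only.
recs = [
    {"instanceArn": "arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0",
     "currentInstanceType": "t3.medium", "finding": "UNDER_PROVISIONED"},
    {"instanceArn": "arn:aws:ec2:us-east-1:111122223333:instance/i-0aaaabbbbccccdddd",
     "currentInstanceType": "m5.4xlarge", "finding": "Overprovisioned"},
    {"instanceArn": "arn:aws:ec2:us-east-1:111122223333:instance/i-0eeeeffff00001111",
     "currentInstanceType": "m6i.large", "finding": "Optimized"},
]

print(build_summary(recs))
```

The resulting string can be sent as the `text` field of a Slack incoming-webhook payload, or to whatever channel your team already watches.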

Gate risky sizes in CI/CD

Policy-as-code can stop obvious mistakes before they merge, such as deploying a burstable instance for a workload that you know runs hot. A simple Conftest or OPA policy on your Terraform plan can flag a t3 instance in a production module.

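package main

# Evaluated against `terraform show -json` plan output.
# Written in Rego v0 syntax; on OPA 1.0 and later the rule head
# is written as `deny contains msg if {`.
deny[msg] {
  # Look at every planned resource change.
  resource := input.resource_changes[_]
  resource.type == "aws_instance"
  # Matches t3 and t3a only; widen the prefix check to cover t2 or t4g.
  startswith(resource.change.after.instance_type, "t3")
  # Relies on an Environment tag being set on the instance.
  resource.change.after.tags.Environment == "production"
  msg := sprintf("Burstable instance %s not allowed in production", [resource.address])
}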
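
For teams that do not run OPA, roughly the same gate can be sketched in Python against the JSON produced by `terraform show -json`. This is an illustrative equivalent, not a replacement for the policy above. It assumes the plan's `resource_changes` layout and an `Environment` tag, and unlike the Rego rule it matches every T family (t2, t3, t3a, t4g) through a regular expression. `burstable_in_production` is a hypothetical name.

```python
import re

# Burstable families all start with "t" followed by a generation digit.
BURSTABLE = re.compile(r"^t\d")


def burstable_in_production(plan: dict) -> list:
    """Return addresses of production aws_instance resources using a burstable type."""
    violations = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "aws_instance":
            continue
        # "after" is null when a resource is being destroyed, so guard against it.
        after = (change.get("change") or {}).get("after") or {}
        tags = after.get("tags") or {}
        instance_type = after.get("instance_type") or ""
        if BURSTABLE.match(instance_type) and tags.get("Environment") == "production":
            violations.append(change["address"])
    return violations


# Invented plan fragment for illustration only.
plan = {
    "resource_changes": [
        {"address": "aws_instance.app", "type": "aws_instance",
         "change": {"after": {"instance_type": "t3.medium", "tags": {"Environment": "production"}}}},
        {"address": "aws_instance.worker", "type": "aws_instance",
         "change": {"after": {"instance_type": "m6i.large", "tags": {"Environment": "production"}}}},
        {"address": "aws_instance.old", "type": "aws_instance", "change": {"after": None}},
    ]
}

print(burstable_in_production(plan))  # ['aws_instance.app']
```

A CI step can fail the build whenever the returned list is not empty.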

Best practices

Target 50 to 70 percent average CPU, not 90. The gap is your safety margin for spikes, deploys, and failover. Running hot leaves you nothing.
Match the instance family to the workload. Compute-heavy services belong on the c family, memory-heavy ones on r, and steady general workloads on m. Burstable types are for genuinely spiky, low-average workloads only.
Install the CloudWatch agent for memory and disk metrics. CPU alone hides memory-bound undersizing. Without the agent you are flying half blind.
Prefer scaling out for stateless workloads. More small instances beat one big one for both resilience and graceful scaling.
Review right-sizing on a schedule. Both undersizing and oversizing cost you, one in performance and one in money. Make Compute Optimizer part of a regular review, not an emergency.
Load test before you ship compute-heavy features. Knowing a new feature triples CPU per request before launch is far cheaper than learning it from a paging alert.

Undersizing is one of those problems that is invisible right up until it is an outage. A consistently high CPU graph is the cloud equivalent of an engine running in the red. Treat this check as the early warning it is, confirm the trend, and give the workload the headroom it needs before your customers feel the strain.
