Fix Oversized EC2 Instances and Cut AWS Costs

TL;DR

This check flags EC2 instances whose CPU utilisation has stayed low for an extended period, which usually means you are paying for capacity you never use. Right-size the instance to a smaller type (or a different family) after confirming memory, network, and disk are not the real bottleneck.

Idle compute is one of the quietest line items in a cloud bill. Nobody gets paged when an m5.2xlarge sits at 4% CPU for a month, but the invoice shows up regardless. The Instance May Be Oversized check (ec2_oversized) looks at your EC2 fleet and surfaces the instances that are almost certainly larger than they need to be, so you can claw back spend without touching anything that matters.

This post walks through what the check measures, why oversized instances are worth fixing, how to safely downsize them, and how to stop the problem from creeping back.

What this check detects

The ec2_oversized check inspects CloudWatch metrics for each running EC2 instance and looks for a sustained pattern of low CPU utilisation. When average and peak CPU stay well below the capacity of the instance type over the observation window, the instance is flagged as a downsizing candidate.

In practice the signal looks something like this:

Average CPU utilisation under ~10% across the measurement period
Peak (maximum) CPU that never approaches the ceiling of the current instance type
No obvious bursty workload that would justify the headroom
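
As a rough illustration of that signal, here is a minimal Python sketch. The function name and the 40% peak threshold are assumptions made for this example; the check itself only says that peak CPU "never approaches the ceiling", so treat the numbers as placeholders rather than the check's real cut-offs.

```python
from statistics import mean


def is_downsizing_candidate(hourly_avg_cpu, hourly_max_cpu,
                            avg_threshold=10.0, peak_threshold=40.0):
    """Return True when CPU stayed low on both average and peak.

    hourly_avg_cpu / hourly_max_cpu are lists of CPUUtilization values
    (percent), one per period. Both thresholds are illustrative.
    """
    if not hourly_avg_cpu or not hourly_max_cpu:
        return False  # no data, so do not flag anything
    low_average = mean(hourly_avg_cpu) < avg_threshold
    low_peak = max(hourly_max_cpu) < peak_threshold
    return low_average and low_peak


print(is_downsizing_candidate([3, 5, 4], [12, 20, 15]))
print(is_downsizing_candidate([3, 5, 4], [12, 95, 15]))
```

The second call has the same low average but one hour that peaks at 95%, which is the kind of bursty workload that can justify the headroom.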

Note: CPU is the easiest dimension to measure because the EC2 hypervisor reports it for free. Memory utilisation is not visible to AWS unless you install the CloudWatch agent, so this check is intentionally conservative. A flag means "investigate," not "blindly shrink."

The check does not assume your workload is CPU-bound. It simply points out that the CPU you are paying for is going unused, which is a strong hint that a smaller or differently shaped instance would do the same job for less.

Why it matters

Oversized instances are not a security vulnerability, but they are a real and recurring cost problem. Here is what is actually at stake.

You pay for the whole instance, idle or not

EC2 on-demand and reserved pricing is based on the instance type, not on how hard you push it. An m5.2xlarge running at 5% CPU costs exactly the same as one running at 95%. Drop two sizes to an m5.large and you cut that line item by roughly 75% for the same workload.
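
The arithmetic behind that 75% figure can be sketched in a few lines. The hourly prices below are assumed, illustrative on-demand Linux rates; check current pricing for your Region before relying on them. The helper names are made up for this example.

```python
HOURS_PER_MONTH = 730  # common approximation for a billing month


def monthly_cost(hourly_price):
    """Cost of running one instance around the clock for a month."""
    return hourly_price * HOURS_PER_MONTH


def savings_fraction(current_hourly, target_hourly):
    """Share of the line item removed by moving to the cheaper type."""
    return 1 - target_hourly / current_hourly


current = 0.384  # assumed m5.2xlarge price, USD per hour
target = 0.096   # assumed m5.large price, USD per hour

saved = monthly_cost(current) - monthly_cost(target)
print(f"{savings_fraction(current, target):.0%} saved, ${saved:.2f} per month")
```

Because each size step within a family roughly halves the price, two steps leave you paying about a quarter of the original rate.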

Waste compounds across the fleet

A single oversized box is a rounding error. A few hundred of them, each provisioned "just to be safe" during a launch that never materialised, turn into a five- or six-figure annual leak. Oversizing also tends to be copied: someone picks a generous instance type for a service, an autoscaling launch template inherits it, and now every replica is oversized.

It distorts capacity planning

When everything is overprovisioned, you lose the ability to see genuine demand. Utilisation dashboards full of single-digit percentages make it impossible to spot the workload that is actually starting to strain, because all your headroom looks the same.

Warning: Low CPU does not automatically mean an instance is safe to shrink. Databases, in-memory caches, and JVM-heavy services are often memory-bound or IO-bound while sitting near-idle on CPU. Downsizing one of these to a smaller type with less RAM can crash it. Always confirm the constraining resource before acting.

How to fix it

Resizing an EC2 instance follows a stop, modify, start sequence. The instance must be in the stopped state to change its type, which means a short period of downtime for that single instance.

Step 1: Confirm the real bottleneck

Before picking a smaller type, pull the metrics that matter. Start with CPU from CloudWatch:

# The date commands use GNU syntax; on macOS/BSD use: date -u -v-14d +%Y-%m-%dT%H:%M:%SZ
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistics Average Maximum \
  --start-time "$(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 3600

If you have the CloudWatch agent installed, also check memory (mem_used_percent) and disk. Convert the peak memory percentage into absolute usage first: if that figure fits comfortably within the RAM of the next size down and network throughput is modest, you have a genuine downsizing candidate.
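
The memory comparison is easy to get wrong because mem_used_percent is relative to the current instance, not the target. A minimal sketch of the conversion, assuming a 25% headroom margin (an arbitrary choice for this example) and a hypothetical helper name:

```python
def fits_in_target(peak_mem_used_percent, current_ram_gib, target_ram_gib,
                   headroom=0.25):
    """True if peak memory in use fits in the target with `headroom` spare.

    peak_mem_used_percent is the highest mem_used_percent seen on the
    current instance; the headroom fraction is an assumed safety margin.
    """
    peak_used_gib = current_ram_gib * peak_mem_used_percent / 100
    return peak_used_gib <= target_ram_gib * (1 - headroom)


# m5.2xlarge has 32 GiB, m5.xlarge 16 GiB, m5.large 8 GiB
print(fits_in_target(30, 32, 16))  # 9.6 GiB used against a 12 GiB budget
print(fits_in_target(30, 32, 8))   # 9.6 GiB used against a 6 GiB budget
```

A 30% memory reading looks harmless on a 32 GiB box, but it already rules out a two-size drop to 8 GiB.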

Tip: Let AWS do the math for you. AWS Compute Optimizer analyses the last 14 days of metrics by default and recommends specific target instance types, including projected savings and a performance risk rating. Enable it once per account and it will surface the same oversized instances with a concrete recommendation attached.

aws compute-optimizer get-ec2-instance-recommendations \
  --instance-arns arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0

Step 2: Pick the target type

Drop one or two sizes within the same family first (2xlarge to large), since that keeps the same CPU-to-memory ratio and minimises surprises. If the workload is consistently CPU-light but memory-heavy, consider a memory-optimised family (r series) instead of just shrinking the general-purpose box.
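
Stepping down within a family is mechanical enough to sketch in code. The size ladder below is an assumption that covers the common doubling sizes only; not every family offers every size, so confirm the result exists in your Region before using it. The function name is hypothetical.

```python
# Assumed ladder of sizes where each step doubles vCPU and memory.
SIZE_LADDER = ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge", "16xlarge"]


def step_down(instance_type, steps=1):
    """Return the type `steps` sizes smaller in the same family.

    Stops at the bottom of the ladder instead of raising.
    """
    family, size = instance_type.split(".", 1)
    if size not in SIZE_LADDER:
        raise ValueError(f"size {size!r} is not in the assumed ladder")
    index = max(SIZE_LADDER.index(size) - steps, 0)
    return f"{family}.{SIZE_LADDER[index]}"


print(step_down("m5.2xlarge"))           # one size down
print(step_down("m5.2xlarge", steps=2))  # two sizes down
```

Keeping the family fixed means the CPU-to-memory ratio stays the same, so only the absolute capacity changes.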

Step 3: Resize the instance

Danger: Stopping an instance causes downtime and, for instances using instance store volumes, permanently destroys any data on those ephemeral disks. Confirm the instance is EBS-backed and snapshot anything critical before you stop it. Never run this on a production instance without a maintenance window or a load-balanced replica to absorb traffic.

# Stop the instance
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Wait until it is fully stopped
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0

# Change the instance type
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --instance-type "{\"Value\": \"m5.large\"}"

# Start it back up
aws ec2 start-instances --instance-ids i-0123456789abcdef0
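
The same stop, modify, start sequence can be scripted. This is a minimal sketch that assumes `ec2` behaves like a boto3 EC2 client (for example `boto3.client("ec2")`); the function name is made up for this example. The guard only looks at the root device, so instance store volumes attached to an EBS-rooted instance still need a manual check.

```python
def resize_instance(ec2, instance_id, new_type):
    """Stop, retype, and restart one instance. Returns False if no change.

    `ec2` is assumed to expose the boto3 EC2 client methods used below.
    """
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]

    # Refuse to stop instance-store-rooted instances: they cannot be
    # stopped and restarted without losing their root volume.
    if instance.get("RootDeviceType") != "ebs":
        raise RuntimeError(f"{instance_id} is not EBS-backed; not resizing")
    if instance["InstanceType"] == new_type:
        return False  # already the requested type, nothing to do

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    return True
```

Passing the client in as a parameter keeps the function easy to exercise against a stub before pointing it at a real account.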

Step 4: Verify after the change

Watch the instance for a few days after resizing. CPU should now sit in a healthier band (roughly 20 to 60% average with room for spikes). If CPU pins at 100% or the application starts throttling, you cut too deep, so step back up one size.
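
One way to make that follow-up check repeatable is a small classifier over the post-resize metrics. This is a sketch under assumptions: the default band mirrors the rough 20 to 60% range above, while the definition of "pinned" (more than 5% of hours peaking at 99% or higher) is invented for this example.

```python
from statistics import mean


def post_resize_verdict(hourly_avg, hourly_max, low=20.0, high=60.0,
                        pinned_at=99.0, pinned_share=0.05):
    """Classify CPU after a resize as 'step-up', 'ok', or 'still-oversized'."""
    if not hourly_avg or not hourly_max:
        raise ValueError("need at least one datapoint per series")
    # Count the hours where peak CPU effectively hit the ceiling.
    pinned_hours = sum(1 for value in hourly_max if value >= pinned_at)
    if pinned_hours / len(hourly_max) > pinned_share or mean(hourly_avg) > high:
        return "step-up"
    if mean(hourly_avg) < low:
        return "still-oversized"
    return "ok"


print(post_resize_verdict([35, 40, 45], [60, 70, 80]))
print(post_resize_verdict([70, 80, 90], [100, 100, 100]))
```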

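Note: Not every type change is valid. The target instance type must support the AMI's virtualisation type (HVM vs PV) and CPU architecture (x86_64 vs arm64), and the AMI needs the drivers the target platform expects: Nitro-based types rely on ENA and NVMe drivers that older Xen-era images may lack. You cannot jump from an Intel m5 to a Graviton m6g with a stop and start alone, because the AMI architecture differs. AWS rejects an unsupported architecture or virtualisation pairing with a clear error, but a missing NVMe driver may only show up as an instance that fails to boot.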
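
A pre-flight compatibility check can catch most of these before you stop anything. The sketch below assumes `ami` is shaped like an entry from the EC2 DescribeImages response and `target_type` like an entry from DescribeInstanceTypes; the field names reflect those responses as I understand them, and the function name is hypothetical. It does not detect missing NVMe drivers, which are not visible through the API.

```python
def resize_blockers(ami, target_type):
    """Return a list of reasons the AMI cannot run on the target type."""
    blockers = []
    architectures = target_type["ProcessorInfo"]["SupportedArchitectures"]
    if ami["Architecture"] not in architectures:
        blockers.append(f"architecture {ami['Architecture']} not supported")
    if ami["VirtualizationType"] not in target_type["SupportedVirtualizationTypes"]:
        blockers.append(f"virtualisation {ami['VirtualizationType']} not supported")
    # Types that require ENA need an image registered with ENA support.
    ena_required = target_type["NetworkInfo"]["EnaSupport"] == "required"
    if ena_required and not ami.get("EnaSupport", False):
        blockers.append("target requires ENA but the AMI lacks ENA support")
    return blockers


x86_ami = {"Architecture": "x86_64", "VirtualizationType": "hvm", "EnaSupport": True}
m6g_large = {
    "ProcessorInfo": {"SupportedArchitectures": ["arm64"]},
    "SupportedVirtualizationTypes": ["hvm"],
    "NetworkInfo": {"EnaSupport": "required"},
}
print(resize_blockers(x86_ami, m6g_large))
```

An empty list means none of the checked properties block the resize; it is not a guarantee that the instance will boot.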

Doing it with Infrastructure as Code

If the instance is managed by Terraform, do not resize it by hand. Update the source and let the pipeline apply it, otherwise the next terraform apply will revert your change.

resource "aws_instance" "api" {
  ami           = "ami-0abc123def4567890"
  instance_type = "m5.large"  # was m5.2xlarge

  tags = {
    Name = "api-server"
  }
}

For autoscaling groups, the type lives in the launch template. Bump the version and let an instance refresh roll the change out gradually:

resource "aws_launch_template" "api" {
  name_prefix   = "api-"
  image_id      = "ami-0abc123def4567890"
  instance_type = "m5.large"  # was m5.2xlarge
}

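resource "aws_autoscaling_group" "api" {
  # min_size, max_size, and subnet settings omitted for brevity

  launch_template {
    id      = aws_launch_template.api.id
    # Use latest_version, not "$Latest": Terraform only starts an instance
    # refresh when it sees the version value change.
    version = aws_launch_template.api.latest_version
  }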

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90
    }
  }
}

The rolling instance refresh replaces nodes a few at a time, so the service stays up while the fleet shrinks to the new size.

How to prevent it from happening again

Right-sizing once is satisfying. Keeping the fleet right-sized takes a process. A few things that work well:

Enable Compute Optimizer org-wide. Turn it on in your management account and have it analyse every member account. Review its findings on a monthly cadence as part of a cost review.
Gate instance types in CI. Use a policy-as-code tool to reject oversized defaults in pull requests before they ever launch.
Tag instances with an owner and a justification. If someone provisions a large box, the tag forces them to record why. Untagged large instances become easy review targets.
Schedule periodic re-evaluation. Workloads change. An instance that was correctly sized at launch can become oversized after a refactor or a traffic shift.
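
The tagging rule is straightforward to audit. Here is a minimal sketch that assumes each instance is a dict shaped like an entry in the EC2 DescribeInstances response; the tag keys `Owner` and `Justification`, the list of "large" sizes, and the function name are all placeholders for whatever your team standardises on.

```python
LARGE_SIZES = {"2xlarge", "4xlarge", "8xlarge", "16xlarge"}  # assumed cut-off
REQUIRED_TAGS = ("Owner", "Justification")                   # hypothetical keys


def untagged_large_instances(instances):
    """Return (instance_id, missing_tags) for large instances lacking tags."""
    flagged = []
    for instance in instances:
        size = instance["InstanceType"].split(".", 1)[1]
        if size not in LARGE_SIZES:
            continue
        tags = {tag["Key"]: tag["Value"] for tag in instance.get("Tags", [])}
        # Treat empty tag values the same as missing tags.
        missing = [key for key in REQUIRED_TAGS if not tags.get(key)]
        if missing:
            flagged.append((instance["InstanceId"], missing))
    return flagged


fleet = [
    {"InstanceId": "i-aaa", "InstanceType": "m5.2xlarge",
     "Tags": [{"Key": "Owner", "Value": "payments"}]},
    {"InstanceId": "i-bbb", "InstanceType": "m5.large"},
]
print(untagged_large_instances(fleet))
```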

Here is a simple Open Policy Agent / Conftest rule that blocks oversized instance types in Terraform plans unless they carry an explicit exception tag:

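package terraform.ec2

# rego.v1 keeps the rule valid on OPA 1.x, where "contains" and "if" are required
import rego.v1

# Conftest evaluates only the "main" package by default, so run this with:
#   conftest test --namespace terraform.ec2 <plan.json>

# Instance types we consider "large" and want to justify
oversized := {"m5.2xlarge", "m5.4xlarge", "c5.4xlarge", "r5.4xlarge"}

deny contains msg if {
  resource := input.resource_changes[_]
  resource.type == "aws_instance"
  itype := resource.change.after.instance_type
  oversized[itype]
  not resource.change.after.tags["RightsizeException"]

  msg := sprintf(
    "Instance type %q requires a 'RightsizeException' tag with justification",
    [itype]
  )
}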

Tip: Wire the same check into Lensix so the ec2_oversized finding shows up in your health dashboard continuously, not just at apply time. Drift happens when someone resizes an instance manually in the console, and a scheduled scan catches what a one-time CI gate cannot.

Best practices

Target a utilisation band, not a number. Aim for average CPU in the 30 to 60% range with enough headroom for peaks. Chasing 90% leaves no slack for traffic spikes or instance failover.
Measure memory before you shrink. CPU is only half the picture. Install the CloudWatch agent on memory-sensitive workloads so right-sizing decisions are based on the constraining resource.
Prefer Compute Optimizer recommendations over gut feel. Its performance risk rating estimates how likely a recommended type is to fall short of the workload's needs, which saves you from cutting too aggressively.
Consider Graviton for steady-state workloads. Moving CPU-light, steady services to ARM-based instances often delivers both a size reduction and a per-hour discount, provided your stack supports the architecture.
Combine right-sizing with Savings Plans. Right-size first, then commit. Buying a Savings Plan against an oversized baseline locks in the waste for one to three years.
Treat downsizing as reversible. Resizing is a stop and start away. Cut a size, watch for a week, and step back up if the workload complains. The reversibility makes it low risk to be slightly aggressive.
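
To put a number on the Savings Plans point above, here is a simplified worked example. It assumes a flat 30% discount, the same illustrative m5 hourly prices as a two-size downsize, and that no other usage absorbs the spare commitment; real Savings Plan rates vary by term, payment option, and Region.

```python
HOURS_PER_YEAR = 8760


def locked_in_waste(oversized_hourly, rightsized_hourly, discount, years):
    """Commitment paid for capacity that right-sizing would have removed.

    Simplification: the commitment exactly covers the oversized instance
    and nothing else in the account can use the leftover.
    """
    excess_hourly = (oversized_hourly - rightsized_hourly) * (1 - discount)
    return excess_hourly * HOURS_PER_YEAR * years


# Assumed prices: m5.2xlarge at 0.384 and m5.large at 0.096 USD per hour
waste = locked_in_waste(0.384, 0.096, discount=0.30, years=3)
print(f"${waste:,.2f} committed to unused capacity over the term")
```

Even with the discount applied, the commitment on the unused three quarters of the instance dwarfs what the discount saves on the quarter you actually need.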

Oversized instances are some of the easiest cloud savings you will ever find. The metrics are already there, the fix is a single attribute change, and the payoff is recurring every month. Run the ec2_oversized check, work through the candidates a few at a time, and put a CI gate in place so the savings stick.
