EC2 Spot Pricing: Fix On-Demand Waste on AWS

TL;DR

This check flags EC2 instances that have been running for less than 6 hours at on-demand pricing, which often points to short-lived or batch workloads that could run on Spot at up to 90% lower cost. Move qualifying workloads to Spot by launching them with a Spot market option (through a launch template or an Auto Scaling group), or set InstanceMarketOptions in your IaC.

On-demand EC2 pricing is the safe default, and that is exactly why it tends to stick around long after it stops being the right choice. Lensix raises the New Instance Not Using Spot Pricing check when it finds a freshly launched instance, one that has been alive for less than 6 hours, paying the full on-demand rate. Short runtime is a strong hint that the workload is transient: a CI runner, a batch job, a render task, a data transform, or a scale-out worker. Those are the workloads Spot was built for.

This is not a security finding in the traditional sense. It is a cost and efficiency signal. But cloud waste is its own kind of risk, and a sprawl of expensive short-lived instances usually means nobody is watching how compute gets provisioned.

What this check detects

The check (ec2_not_spot, in the ec2_checks module) looks at running EC2 instances and evaluates two conditions:

The instance has been running for fewer than 6 hours, based on its launch time.
Its lifecycle is not spot. In AWS terms, the InstanceLifecycle field is either empty (on-demand) or scheduled.

When both are true, Lensix flags it. The reasoning is simple: a brand new instance is more likely to be doing short, interruptible work, and short interruptible work is the textbook case for Spot.
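The two conditions can be sketched as a small shell function. This is an illustration of the rule as described here, not the Lensix implementation: the function name is_spot_candidate and its arguments are hypothetical, and it assumes the caller has already converted the instance's LaunchTime to epoch seconds.

```shell
#!/bin/sh
# Hypothetical helper: is_spot_candidate LAUNCH_EPOCH NOW_EPOCH LIFECYCLE
# Succeeds (exit 0) when the instance is younger than 6 hours AND its
# InstanceLifecycle is anything other than "spot" (empty means on-demand).
is_spot_candidate() {
  launch_epoch=$1
  now_epoch=$2
  lifecycle=$3
  max_age=$((6 * 3600))              # the 6-hour window, in seconds
  age=$((now_epoch - launch_epoch))
  [ "$age" -lt "$max_age" ] && [ "$lifecycle" != "spot" ]
}

# Example: an on-demand instance launched one hour ago is flagged.
now=$(date +%s)
if is_spot_candidate "$((now - 3600))" "$now" ""; then
  echo "flagged: young instance not on Spot"
fi
```

Keeping the rule as a pure function of launch time and lifecycle makes the heuristic easy to reason about: an instance stops being flagged as soon as it crosses the 6-hour mark, whatever it is running.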

Note: Spot Instances use AWS spare capacity and can be reclaimed with a two-minute warning when AWS needs the capacity back. In exchange you pay the Spot price, which is frequently 60 to 90 percent below on-demand for the same instance type.

The 6-hour window is a heuristic, not a guarantee. A long-running database that happened to launch 20 minutes ago will also trip this check. That is fine. The check is meant to prompt a question, not to make the decision for you: is this workload a good Spot candidate?

Why it matters

The impact here is almost entirely financial, but the numbers add up faster than people expect.

Take a fleet of CI build agents on c6i.4xlarge. On-demand that is roughly $0.68 per hour in us-east-1. Spot for the same type often sits around $0.20 to $0.27 depending on the moment. Run 20 of those agents for a few hours a day across a month and the difference is in the thousands of dollars, for identical work.
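To make that arithmetic concrete, here is a rough calculation using the rates quoted above. The 8 hours per day and 30 days per month are assumptions chosen for illustration, and monthly_cost is a hypothetical helper, not part of any AWS tooling.

```shell
#!/bin/sh
# Hypothetical helper: monthly_cost INSTANCES HOURS_PER_DAY DAYS HOURLY_RATE
# Prints the total cost in dollars with two decimals.
monthly_cost() {
  LC_ALL=C awk -v n="$1" -v h="$2" -v d="$3" -v r="$4" \
    'BEGIN { printf "%.2f\n", n * h * d * r }'
}

# Assumed usage: 20 agents, 8 hours a day, 30 days.
on_demand=$(monthly_cost 20 8 30 0.68)   # on-demand rate from the text
spot_high=$(monthly_cost 20 8 30 0.27)   # upper end of the quoted Spot range
spot_low=$(monthly_cost 20 8 30 0.20)    # lower end of the quoted Spot range

echo "on-demand:  \$$on_demand"
echo "spot range: \$$spot_low to \$$spot_high"
```

Under those assumptions the fleet costs about $3,264 a month on-demand against $960 to $1,296 on Spot, a gap of roughly $2,000 to $2,300. Actual Spot prices move over time, so treat the result as an order-of-magnitude estimate.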

The workloads that show up under this check tend to fall into a few buckets:

CI/CD runners that spin up per job and tear down after.
Batch and ETL jobs that process a queue and exit.
Render farms and media transcoding with deadline-flexible work.
Stateless web or worker tier scale-out behind an autoscaler.
Ephemeral test environments stood up for a PR or a demo.

All of these tolerate interruption well, either because the work is idempotent and re-runnable or because the system already handles instances coming and going. Paying on-demand for them is leaving money on the table.

Warning: Spot is not free of trade-offs. Instances can be reclaimed at any time with a 2-minute notice. Do not move stateful workloads, long-running databases, or anything that cannot checkpoint and resume to Spot without a recovery strategy in place.

How to fix it

First, confirm the instance is actually a Spot candidate. If it is a database, a stateful broker, or a single point of failure, the right fix is to acknowledge the finding and move on. If it is interruptible, migrate it.

1. Identify the instance and its lifecycle

aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].{ID:InstanceId,Lifecycle:InstanceLifecycle,Launch:LaunchTime,Type:InstanceType}' \
  --output table

If Lifecycle shows None in the table (null in JSON output), the instance is on-demand.

2. Launch new capacity as Spot

You cannot convert a running on-demand instance into a Spot instance in place. You launch new Spot capacity and retire the on-demand instance once the workload has drained.

For a one-off Spot instance:

aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type c6i.4xlarge \
  --instance-market-options '{"MarketType":"spot","SpotOptions":{"SpotInstanceType":"one-time","InstanceInterruptionBehavior":"terminate"}}' \
  --key-name my-key \
  --subnet-id subnet-0abc123 \
  --security-group-ids sg-0abc123

3. Prefer an Auto Scaling group with a mixed instances policy

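For anything beyond a one-off, do not launch bare Spot instances. Use an Auto Scaling group with a mixed instances policy so AWS can diversify across instance types and pools, which dramatically reduces interruption pain. The JSON below is a request body you can save to a file and pass to aws autoscaling create-auto-scaling-group with --cli-input-json: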

{
  "AutoScalingGroupName": "ci-runners",
  "MinSize": 0,
  "MaxSize": 30,
  "DesiredCapacity": 10,
  "VPCZoneIdentifier": "subnet-0abc123,subnet-0def456,subnet-0ghi789",
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateId": "lt-0abc123",
        "Version": "$Latest"
      },
      "Overrides": [
        { "InstanceType": "c6i.4xlarge" },
        { "InstanceType": "c6a.4xlarge" },
        { "InstanceType": "c5.4xlarge" },
        { "InstanceType": "m6i.4xlarge" }
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 0,
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "price-capacity-optimized"
    }
  }
}

The price-capacity-optimized strategy picks pools that are both cheap and deep, which gives you the best balance of cost and stability. Listing several interchangeable instance types in Overrides is the single most effective thing you can do to reduce how often interruptions hit you.

Note: Setting OnDemandBaseCapacity to a small non-zero number is a common pattern. It keeps a baseline of guaranteed capacity on-demand while everything above the baseline runs on Spot, giving you both reliability and savings.

4. Handle interruptions gracefully

Your workload should listen for the Spot interruption notice and drain cleanly. The notice is available via instance metadata:

# From inside the instance, using IMDSv2
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action

A 200 response with a JSON body means termination is imminent; a 404 means no interruption is currently scheduled. Use the notice window to deregister from load balancers, finish or requeue in-flight work, and exit cleanly.
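A minimal polling loop built on that endpoint might look like the sketch below. The function names and the drain command are hypothetical, and the 5-second default interval is a common polling cadence, not a requirement; adapt the drain step to whatever your workload needs.

```shell
#!/bin/sh
IMDS="http://169.254.169.254/latest"

# Succeeds when the given HTTP status code means an interruption is scheduled.
interruption_pending() {
  [ "$1" = "200" ]
}

# Prints the HTTP status code of the instance-action endpoint, using IMDSv2.
# Prints 000 if the metadata service cannot be reached.
instance_action_status() {
  token=$(curl -s --max-time 2 -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  curl -s --max-time 2 -o /dev/null -w '%{http_code}' \
    -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS/meta-data/spot/instance-action"
}

# watch_for_interruption DRAIN_COMMAND [INTERVAL_SECONDS]
# Polls until an interruption notice appears, then runs DRAIN_COMMAND once.
watch_for_interruption() {
  drain_cmd=$1
  interval=${2:-5}
  while :; do
    if interruption_pending "$(instance_action_status)"; then
      "$drain_cmd"
      return 0
    fi
    sleep "$interval"
  done
}

# Usage (the drain script path is a placeholder):
#   watch_for_interruption /usr/local/bin/drain.sh
```

Run it as a background process or a small service on each Spot instance. Separating the status check from the loop keeps the drain logic easy to exercise without a real interruption.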

5. Retire the on-demand instance

Danger: Terminating an instance is irreversible. Confirm the workload has fully drained to your new Spot capacity and that no local-only state remains before you run this.

aws ec2 terminate-instances --instance-ids i-0123456789abcdef0

Defining it in IaC

The durable fix is in your infrastructure code, not the console. Here is the mixed instances pattern in Terraform:

resource "aws_autoscaling_group" "ci_runners" {
  name                = "ci-runners"
  min_size            = 0
  max_size            = 30
  desired_capacity    = 10
  vpc_zone_identifier = var.private_subnet_ids

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.runner.id
        version            = "$Latest"
      }

      override { instance_type = "c6i.4xlarge" }
      override { instance_type = "c6a.4xlarge" }
      override { instance_type = "c5.4xlarge" }
      override { instance_type = "m6i.4xlarge" }
    }

    instances_distribution {
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 0
      spot_allocation_strategy                 = "price-capacity-optimized"
    }
  }
}

For a single instance defined directly, set the market options on the resource:

resource "aws_instance" "batch_worker" {
  ami           = var.ami_id
  instance_type = "c6i.4xlarge"

  instance_market_options {
    market_type = "spot"
    spot_options {
      spot_instance_type             = "one-time"
      instance_interruption_behavior = "terminate"
    }
  }

  tags = {
    Workload = "etl-batch"
    Spot     = "true"
  }
}

How to prevent it from happening again

One-time fixes get undone. The goal is to make Spot the default for the workloads that suit it and to catch on-demand creep before it lands.

Tag workloads by interruption tolerance

Add a tag like interruptible=true to workloads that can run on Spot. This gives policy tools and humans a clear signal, and it lets you write rules against intent rather than guessing from runtime.

Gate provisioning with policy-as-code

Use OPA or Checkov in CI to flag on-demand instances that carry an interruptible tag. A simple Conftest policy:

package main

deny[msg] {
  resource := input.resource.aws_instance[name]
  resource.tags.interruptible == "true"
  not resource.instance_market_options
  msg := sprintf("Instance '%s' is tagged interruptible but is not using Spot pricing", [name])
}

Tip: Run this in a pull request check, not just at apply time. Catching an on-demand worker in code review is far cheaper than catching it on next month's bill.

Make Spot the path of least resistance

If your platform team ships golden launch templates and ASG modules, set Spot as the default and require an explicit override to opt out. Engineers reach for the default, so make the default the cheap one for interruptible work.

Watch for drift continuously

Lensix re-runs this check on a schedule, so newly launched on-demand instances surface automatically. Wire those findings into your alerting so a stray batch job on on-demand pricing gets noticed the same day, not at the end of the billing cycle.

Best practices

Diversify instance types. A Spot request constrained to a single type is fragile. Three to six interchangeable types across multiple Availability Zones gives AWS room to fulfil and keep your capacity.
Use price-capacity-optimized (or capacity-optimized) allocation instead of lowest-price. Chasing the absolute lowest price often lands you in shallow pools that get reclaimed quickly.
Keep a small on-demand baseline for workloads that need a guaranteed floor of capacity, and let Spot handle the burst above it.
Always handle the 2-minute interruption notice. Drain connections, checkpoint state, and requeue work. Treat interruption as normal, not exceptional.
Combine with Savings Plans for the steady-state floor. Spot for spiky and interruptible work, Savings Plans or Reserved Instances for the predictable baseline. The two are complementary, not competing.
Do not put stateful workloads on Spot without a tested recovery story. Primary databases, single-node brokers, and anything holding unrecoverable local state belong on on-demand or reserved capacity.

The check is a nudge, not a mandate. Most short-lived instances should be on Spot, and the ones that should not are easy to identify once you ask the question. Answer it deliberately for each workload, encode the answer in your IaC, and let the check keep you honest as your fleet grows.
