Fix Stale AWS Security Group References | Lensix

TL;DR

This check flags security group rules that reference a security group that has since been deleted. These stale rules are dead weight: they obscure your real network policy and break automation. Fix it by removing the orphaned rule, then manage references through IaC resource attributes instead of hardcoded group IDs.

Security groups are the workhorse of network access control in AWS. One of their most useful features is the ability to reference another security group as the source or destination of a rule, instead of hardcoding IP ranges. That keeps your rules dynamic: any instance in the referenced group automatically gets access, no matter how its IP changes.

The problem starts when the referenced security group gets deleted but the rule pointing to it sticks around. AWS does not always clean these up, and the result is a rule that references a security group ID that no longer exists. The sg_stale_rule check catches exactly this situation.

What this check detects

The sg_stale_rule check scans every security group in your AWS account and inspects each ingress and egress rule. When a rule uses a UserIdGroupPairs reference (a source or destination that points to another security group), the check verifies that the referenced group still exists.

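If the target security group has been deleted, the rule is considered stale. AWS reports some of these through the EC2 DescribeStaleSecurityGroups API (aws ec2 describe-stale-security-groups), which covers rules that reference a deleted group in the same VPC or in a peered VPC. In the rule itself, the dead group ID simply stays in place with no matching resource. Either way, you end up with a rule that can never match live traffic because its counterpart is gone.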

Note: A security group reference works within a VPC (or across peered VPCs in the same region). When you delete the referenced group, AWS usually blocks the deletion if a rule still points to it, but cross-account references, peering teardowns, and Terraform state drift can all leave orphaned references behind.
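
For the peering and same-VPC cases, the AWS-side lookup can be scripted directly. The following is a minimal sketch, assuming the AWS CLI is installed and authenticated; the function name stale_groups_in_vpc and the VPC ID in the usage comment are illustrative placeholders, not part of the sg_stale_rule check.

```shell
# Hypothetical helper: print the ID of every security group in VPC $1
# that AWS reports as holding at least one stale rule, one ID per line.
stale_groups_in_vpc() {
  aws ec2 describe-stale-security-groups \
    --vpc-id "$1" \
    --query 'StaleSecurityGroupSet[].GroupId' \
    --output text | tr '\t' '\n' | sed -n '/^sg-/p'
}

# Usage (placeholder VPC ID):
#   stale_groups_in_vpc vpc-0123456789abcdef0
```

This only surfaces what AWS itself classifies as stale, so it is best treated as a complement to a full scan rather than a replacement for one.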

Why it matters

A stale rule will not directly open a hole in your network, since it references nothing. So why care? Because the real cost lands on operations and security hygiene, and both costs compound over time.

It obscures your actual network policy

When an engineer audits a security group, every rule has to be understood. A reference to a deleted group is noise. It forces people to ask "what was this for, and is it safe to remove?" That uncertainty slows down reviews and makes it more likely that a genuinely risky rule gets overlooked in the clutter.

It breaks automation and IaC drift detection

Tools that reconcile desired state against live state choke on orphaned references. Terraform plans can fail or show permanent drift. Custom scripts that resolve group references hit null lookups. The stale rule becomes a recurring source of false alarms that trains your team to ignore drift, which is the opposite of what you want.

It signals deeper process problems

A stale reference almost always means a security group was deleted without cleaning up dependents, or an IaC teardown ran out of order. That same gap can leave behind things that are not harmless, like a permissive ingress rule that was meant to be temporary.

Warning: Do not assume a stale rule is always safe to delete blindly. In rare cases the referenced group was recreated under a new ID and someone forgot to update the rule. Confirm intent before removing rules in production VPCs.

How to fix it

Remediation has two steps: find the orphaned rule, then remove it. The referenced group is gone, so there is nothing left to repair; the only action available is to delete the stale rule from the security group.

Step 1: Identify the stale rule

List the security group and inspect its rules. You are looking for a UserIdGroupPairs entry whose GroupId does not resolve to a live group.

# Show both ingress and egress rules for the group
aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[0].{Ingress: IpPermissions, Egress: IpPermissionsEgress}'

Take any referenced group IDs and check whether they still exist:

aws ec2 describe-security-groups \
  --group-ids sg-0referencedgroup00 2>&1

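If you get an InvalidGroup.NotFound error and the rule's UserId is your own account, the reference is stale and the rule should be removed. A group owned by another account can also look missing from your side, so check the UserId in the UserIdGroupPairs entry before concluding that a cross-account reference is dead.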
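
The two lookups above can be combined into a pair of small shell functions. This is a sketch under stated assumptions: the names referenced_group_ids and group_is_missing are hypothetical, the group ID in the usage comment is a placeholder, and the code assumes every referenced group lives in your own account.

```shell
# Hypothetical helper: print each group ID referenced by the ingress or
# egress rules of security group $1, one per line, without duplicates.
referenced_group_ids() {
  for _field in IpPermissions IpPermissionsEgress; do
    aws ec2 describe-security-groups \
      --group-ids "$1" \
      --query "SecurityGroups[0].${_field}[].UserIdGroupPairs[].GroupId" \
      --output text
  done | tr '\t' '\n' | sed -n '/^sg-/p' | sort -u
}

# Hypothetical helper: return 0 if describing group $1 fails with
# InvalidGroup.NotFound, 1 if the group exists, 2 on any other error.
group_is_missing() {
  if _out=$(aws ec2 describe-security-groups --group-ids "$1" 2>&1); then
    return 1   # the group exists
  fi
  case "$_out" in
    *InvalidGroup.NotFound*) return 0 ;;
    *) printf 'unexpected error for %s: %s\n' "$1" "$_out" >&2; return 2 ;;
  esac
}

# Usage (placeholder group ID):
#   for id in $(referenced_group_ids sg-0123456789abcdef0); do
#     group_is_missing "$id" && echo "stale reference: $id"
#   done
```

Separating "not found" from other failures matters here: an expired credential or a throttled call should not be mistaken for a deleted group.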

Step 2: Remove the orphaned rule

Danger: The commands below modify a live security group and immediately affect the network policy applied to attached resources. Run them against the right group ID, in the right account, and confirm the rule is genuinely orphaned first.

For an ingress rule, use revoke-security-group-ingress and pass the same protocol, port range, and referenced group ID that appear in the stale rule:

aws ec2 revoke-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --ip-permissions '[
    {
      "IpProtocol": "tcp",
      "FromPort": 443,
      "ToPort": 443,
      "UserIdGroupPairs": [
        { "GroupId": "sg-0referencedgroup00" }
      ]
    }
  ]'

For an egress rule, use revoke-security-group-egress with the same shape:

aws ec2 revoke-security-group-egress \
  --group-id sg-0123456789abcdef0 \
  --ip-permissions '[
    {
      "IpProtocol": "-1",
      "UserIdGroupPairs": [
        { "GroupId": "sg-0referencedgroup00" }
      ]
    }
  ]'

If you manage the group with the AWS Console, open the security group, go to the relevant inbound or outbound rules tab, edit rules, and delete the row that references the missing group, then save.
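
An alternative to restating the protocol and ports is to revoke by rule ID. The sketch below assumes an AWS CLI version that supports describe-security-group-rules and the --security-group-rule-ids option; the function name print_revoke_commands is hypothetical. It only prints the revoke commands so they can be reviewed before anything is changed.

```shell
# Hypothetical helper: for security group $1, print a revoke command for
# every rule that references group $2. Nothing is executed; the output is
# meant to be reviewed and then run by hand.
print_revoke_commands() {
  aws ec2 describe-security-group-rules \
    --filters "Name=group-id,Values=$1" \
    --query "SecurityGroupRules[?ReferencedGroupInfo.GroupId=='$2'].[SecurityGroupRuleId,IsEgress]" \
    --output text |
  while read -r _rule_id _is_egress; do
    # Skip anything that is not a rule ID (for example an empty result)
    case "$_rule_id" in sgr-*) ;; *) continue ;; esac
    case "$_is_egress" in
      [Tt]rue) _direction=egress ;;
      *)       _direction=ingress ;;
    esac
    printf 'aws ec2 revoke-security-group-%s --group-id %s --security-group-rule-ids %s\n' \
      "$_direction" "$1" "$_rule_id"
  done
}

# Usage (placeholder IDs):
#   print_revoke_commands sg-0123456789abcdef0 sg-0referencedgroup00
```

Printing rather than executing keeps a human in the loop, which fits the caution above about confirming that a rule is genuinely orphaned.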

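Tip: If you have dozens of stale references across an account, script the discovery. Loop over every security group, pull each UserIdGroupPairs reference, and batch-check existence with a single describe-security-groups call that passes the referenced IDs through a group-id filter (--filters Name=group-id,Values=...). Use the filter rather than --group-ids: with --group-ids, one missing ID fails the entire call with InvalidGroup.NotFound. Any ID missing from the filtered response goes on your orphan list.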
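
A batch existence check along these lines might look as follows. It is a sketch, not a hardened tool: the function name missing_group_ids is hypothetical, and very long ID lists may need to be split into chunks. It uses the group-id filter because describe-security-groups rejects the whole request when --group-ids contains an unknown ID.

```shell
# Hypothetical helper: print each ID passed as an argument that is not
# returned by describe-security-groups, one per line.
missing_group_ids() {
  _values=$(IFS=,; printf '%s' "$*")   # join the arguments with commas
  _existing=$(aws ec2 describe-security-groups \
    --filters "Name=group-id,Values=$_values" \
    --query 'SecurityGroups[].GroupId' \
    --output text | tr '\t' '\n')
  for _id in "$@"; do
    # Keep the ID only if no line of the response matches it exactly
    printf '%s\n' "$_existing" | grep -qxF -- "$_id" || printf '%s\n' "$_id"
  done
}

# Usage (placeholder IDs):
#   missing_group_ids sg-0123456789abcdef0 sg-0fedcba9876543210
```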

Fixing it in Terraform

If the rule lives in Terraform, the right move is usually to remove the hardcoded reference and replace it with a resource attribute so the dependency graph stays correct. Avoid pinning a literal sg-... string in a rule.

resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = aws_vpc.main.id
}

resource "aws_security_group" "db" {
  name   = "db-sg"
  vpc_id = aws_vpc.main.id
}

# Reference by attribute, not by literal ID
resource "aws_security_group_rule" "db_from_app" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.db.id
  source_security_group_id = aws_security_group.app.id
}

When the source group is referenced as aws_security_group.app.id, Terraform understands the dependency. On destroy it removes the db_from_app rule before it deletes app, which prevents the orphan from ever forming.

How to prevent it from happening again

Stale references are a symptom of deletion order problems. The fix is to make dependencies explicit and to catch orphans automatically before they pile up.

Let IaC own the dependency graph

The single most effective prevention is to never reference security groups by literal ID across modules or scripts. Use resource attributes (Terraform), Ref / !GetAtt (CloudFormation), or equivalent so the tool tracks the dependency and orders, or blocks, the deletion of a group that something still points to.
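
One way to enforce this rule is a lint step that rejects literal IDs in Terraform sources. The sketch below is an assumption-laden example: the function name check_no_literal_sg_ids is hypothetical, it only looks at *.tf files, and the pattern (8 to 17 hex characters after sg-) is deliberately loose.

```shell
# Hypothetical lint: fail if any *.tf file under directory $1 (default: .)
# contains a quoted literal security group ID such as "sg-0123456789abcdef0".
check_no_literal_sg_ids() {
  # grep exits non-zero when nothing matches, so ignore its status and
  # decide based on whether any matching lines were captured.
  _hits=$(find "${1:-.}" -type f -name '*.tf' \
    -exec grep -HnE '"sg-[0-9a-f]{8,17}"' {} + || true)
  if [ -n "$_hits" ]; then
    printf 'Literal security group IDs found:\n%s\n' "$_hits" >&2
    return 1
  fi
  return 0
}

# Usage: run from the repository root in CI
#   check_no_literal_sg_ids . || exit 1
```

Legitimate exceptions, such as a group owned by another team, can be handled by passing the ID in through a variable or data source rather than by weakening the lint.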

Add a CI/CD gate

Run a stale-reference scan in your pipeline before deploys land. A small script can fail the build if any security group references a group ID that is not present in the plan or in the live account.

#!/usr/bin/env bash
set -euo pipefail

# Collect every group ID referenced by an ingress or egress rule
referenced=$(
  for field in IpPermissions IpPermissionsEgress; do
    aws ec2 describe-security-groups \
      --query "SecurityGroups[].${field}[].UserIdGroupPairs[].GroupId" \
      --output text
  done | tr '\t' '\n' | sed -n '/^sg-/p' | sort -u)

# Collect every existing group ID
existing=$(aws ec2 describe-security-groups \
  --query 'SecurityGroups[].GroupId' \
  --output text | tr '\t' '\n' | sort -u)

# Anything referenced but not existing is stale. Groups owned by another
# account are not listed above, so allowlist legitimate cross-account
# references before trusting this list.
stale=$(comm -23 <(echo "$referenced") <(echo "$existing"))

if [ -n "$stale" ]; then
  echo "Stale security group references found:"
  echo "$stale"
  exit 1
fi
echo "No stale security group references."

Tip: Wire this into a scheduled job as well as a pre-deploy gate. References can go stale through console changes or cross-account teardowns that never touch your pipeline, so a nightly sweep catches what CI misses.

Enforce deletion order with policy-as-code

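Policy-as-code tools such as Open Policy Agent can evaluate a Terraform plan and flag one that deletes a security group still referenced by another rule, while AWS Config rules can flag stale references in the live account after the fact. Catching the bad teardown at plan time is far cheaper than cleaning up orphans later.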

Best practices

Reference groups, not CIDRs, where it makes sense. Group references stay correct as instances scale and IPs change, which is exactly why this feature exists. Keep using it, just manage the lifecycle properly.
Name and tag every security group clearly. When a reference does break, a descriptive name and owner tag turn a five-minute investigation into a five-second one.
Avoid sharing one security group across unrelated workloads. Tightly scoped groups have fewer references, which means fewer ways to leave an orphan behind.
Run continuous scans, not one-off audits. Network configuration drifts constantly. A single cleanup today does nothing for the stale rule someone creates next week.
Treat a stale reference as a process signal. When you find one, ask how it got there. The answer usually points to an out-of-order teardown that will produce more debris if left unaddressed.

Stale security group references are not the most dramatic finding in your environment, but they are a reliable indicator of how disciplined your network change management really is. Clean them up, then close the gap that created them so they do not come back.
