
MSK Cluster Not Encrypted at Rest: Enforcing Customer-Managed KMS Keys on Amazon MSK

Learn why Amazon MSK clusters need a customer-managed KMS key for broker storage encryption, the risks of the default key, and how to remediate with CLI and Terraform.

TL;DR

This check flags Amazon MSK clusters whose broker storage is encrypted at rest with the default AWS-managed key instead of a customer-managed KMS key. Data on long-lived broker volumes that is protected by a key you cannot govern is a compliance and breach risk, so set encryptionInfo.encryptionAtRest.dataVolumeKMSKeyId to a CMK you control. The key can only be set at cluster creation, so fixing an existing cluster requires a rebuild.

Amazon Managed Streaming for Apache Kafka (MSK) carries a lot of sensitive data: payment events, user activity streams, audit logs, IoT telemetry. All of it lands on broker storage volumes and stays there until the topic retention window expires. If those volumes are not encrypted with a key you control, you lose a meaningful layer of defense and, often, a compliance checkbox.

The msk_unencrypted check looks at each MSK cluster and verifies that broker storage is encrypted at rest using a customer-managed AWS KMS key (CMK), rather than relying on a default configuration you cannot fully govern.


What this check detects

Lensix inspects the encryptionInfo block on every MSK cluster in your account and flags clusters where broker storage encryption is not backed by a customer-managed KMS key.

Each MSK cluster has an encryption configuration that looks like this:

{
  "EncryptionInfo": {
    "EncryptionAtRest": {
      "DataVolumeKMSKeyId": "arn:aws:kms:us-east-1:111122223333:key/abcd1234-..."
    },
    "EncryptionInTransit": {
      "ClientBroker": "TLS",
      "InCluster": true
    }
  }
}

The piece this check cares about is DataVolumeKMSKeyId under EncryptionAtRest. This controls the KMS key used to encrypt the EBS volumes attached to your Kafka brokers, where partition data physically lives.

Note: MSK always encrypts broker storage at rest. If you do not specify a key, AWS uses an AWS-managed KMS key for the service. The distinction this check enforces is between an AWS-managed key (aws/kafka) and a customer-managed key (CMK) that you create, control access to, and can audit and rotate on your own terms.
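One way to make that distinction programmatically is to look at the KeyManager field that KMS returns for a key ("AWS" or "CUSTOMER"). The sketch below is illustrative only and is not how Lensix implements the check. It assumes you have already fetched the cluster description (for example with aws kafka describe-cluster or describe-cluster-v2) and the key metadata (aws kms describe-key) as JSON; the function names are hypothetical.

```python
from typing import Optional


def data_volume_key_id(cluster_info: dict) -> Optional[str]:
    """Return DataVolumeKMSKeyId from a describe-cluster style ClusterInfo dict."""
    # describe-cluster-v2 nests provisioned settings under "Provisioned";
    # the original describe-cluster response keeps them at the top level.
    settings = cluster_info.get("Provisioned", cluster_info)
    at_rest = settings.get("EncryptionInfo", {}).get("EncryptionAtRest", {})
    return at_rest.get("DataVolumeKMSKeyId")


def is_customer_managed(key_metadata: dict) -> bool:
    """KMS reports KeyManager as "CUSTOMER" for CMKs and "AWS" for AWS-managed keys."""
    return key_metadata.get("KeyManager") == "CUSTOMER"


def cluster_is_compliant(cluster_info: dict, key_metadata_by_arn: dict) -> bool:
    """True only when the broker storage key is known and customer managed."""
    key_id = data_volume_key_id(cluster_info)
    metadata = key_metadata_by_arn.get(key_id)
    return metadata is not None and is_customer_managed(metadata)
```

Here key_metadata_by_arn is assumed to map each key ARN to the KeyMetadata object from describe-key. A cluster whose key cannot be resolved is treated as non-compliant, which errs on the side of raising a finding.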


Why it matters

"It's already encrypted by default" is the usual response here, and it is technically true. But the default AWS-managed key gives you almost none of the control that auditors and incident responders actually need.

You cannot scope access to an AWS-managed key

AWS-managed keys come with a service-linked key policy you cannot edit. You cannot deny specific principals, you cannot require encryption context conditions, and you cannot revoke access in an emergency. With a CMK, the key policy is yours. If a broker role or an account is compromised, revoking key access on a CMK is a fast way to cut off decryption.

Auditing and forensics

CMK usage shows up in CloudTrail as Decrypt and GenerateDataKey events tied to your key. Calls against the AWS-managed key are logged too, but that key is shared by the service across your account and region, so its events are harder to attribute to one cluster. With a dedicated CMK you can answer "what decrypted this data and when" during an incident.
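As a rough illustration of that kind of forensic query, the sketch below filters CloudTrail records for KMS calls that touched one key. It assumes the records have already been loaded as a list of dictionaries (for example from the Records array of a CloudTrail log file); the function name key_usage is hypothetical.

```python
KMS_EVENTS = {"Decrypt", "GenerateDataKey"}


def key_usage(records: list, key_arn: str) -> list:
    """Return sorted (eventTime, caller ARN, eventName) tuples for calls on key_arn."""
    hits = []
    for record in records:
        # Only KMS data-plane calls are of interest here.
        if record.get("eventSource") != "kms.amazonaws.com":
            continue
        if record.get("eventName") not in KMS_EVENTS:
            continue
        # CloudTrail lists the affected key under "resources".
        arns = {res.get("ARN") for res in record.get("resources", [])}
        if key_arn in arns:
            caller = record.get("userIdentity", {}).get("arn", "unknown")
            hits.append((record.get("eventTime", ""), caller, record["eventName"]))
    return sorted(hits)
```

Sorting by the ISO 8601 eventTime string gives a chronological timeline, which is usually what an incident responder wants first.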

Compliance frameworks expect customer-managed keys

PCI DSS, HIPAA, SOC 2, and FedRAMP controls increasingly distinguish between provider-managed and customer-managed key material. Many audit templates explicitly ask whether encryption keys are under customer control with documented rotation and access policies. An AWS-managed key fails that question.

Warning: Snapshots, replicas, and decommissioned broker volumes all inherit the cluster's encryption configuration. A cluster that has been running for a year on a default key has a year of stale data you cannot retroactively re-key without rebuilding.

Blast radius of a leaked data volume

Imagine a broker EBS volume gets exposed through a misconfigured backup pipeline or a snapshot shared too broadly. With a CMK whose policy denies the offending principal, that data stays unreadable. With a default key and broad IAM, the data is decryptable by anyone the service trusts.


How to fix it

Here is the hard truth about MSK encryption at rest: the KMS key cannot be changed after the cluster is created. There is no update-encryption API for the data volume key. Remediation means standing up a new cluster with the correct CMK and migrating your workloads to it.

Danger: There is no in-place fix. Migrating an MSK cluster means moving producers and consumers to new bootstrap brokers and replicating topic data. Plan this as a controlled migration with a rollback path, not a quick patch. Deleting the old cluster destroys all retained messages still in its topics.

Step 1: Create a customer-managed KMS key

aws kms create-key \
  --description "MSK broker storage encryption key" \
  --tags TagKey=Purpose,TagValue=msk-encryption \
  --query 'KeyMetadata.Arn' \
  --output text

Give it a friendly alias so it is easy to reference:

aws kms create-alias \
  --alias-name alias/msk-broker-storage \
  --target-key-id arn:aws:kms:us-east-1:111122223333:key/abcd1234-...

Step 2: Scope the key policy

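Grant the MSK service and your cluster operators only what they need. The statement below is a minimal example for the MSK service; it belongs inside the Statement array of the full key policy, alongside the statement that lets your key administrators manage the key. In a key policy, "Resource": "*" refers to the key the policy is attached to, not to every key in the account. Depending on how the cluster is created, the IAM principal that calls create-cluster may also need kms:CreateGrant and kms:DescribeKey on the key, so verify the final policy against the current MSK documentation: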

{
  "Sid": "AllowMSKUseOfKey",
  "Effect": "Allow",
  "Principal": { "Service": "kafka.amazonaws.com" },
  "Action": [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:GenerateDataKey",
    "kms:GenerateDataKeyWithoutPlaintext",
    "kms:DescribeKey",
    "kms:CreateGrant"
  ],
  "Resource": "*"
}
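Wildcard principals are the easiest way for a key policy to drift away from "only what they need". The sketch below is a deliberately narrow heuristic, not a full policy analyzer: it assumes the key policy has been fetched and parsed into a dictionary (for example from aws kms get-key-policy) and reports Allow statements that are open to any principal with no Condition block. The function names are hypothetical.

```python
def _as_list(value):
    """IAM allows a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy: dict) -> list:
    """Return Sids of Allow statements open to any principal with no Condition."""
    findings = []
    for index, stmt in enumerate(_as_list(policy.get("Statement", []))):
        # Deny statements and conditioned statements are out of scope for this heuristic.
        if stmt.get("Effect") != "Allow" or "Condition" in stmt:
            continue
        principal = stmt.get("Principal")
        if principal == "*":
            values = ["*"]
        elif isinstance(principal, dict):
            values = _as_list(principal.get("AWS", []))
        else:
            values = []
        if "*" in values:
            findings.append(stmt.get("Sid", f"statement[{index}]"))
    return findings
```

A statement with a Condition is skipped here because conditions such as kms:CallerAccount can legitimately narrow a wildcard principal; those statements still deserve a manual review.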

Step 3: Create the new cluster with the CMK

Build an encryption config file:

{
  "EncryptionAtRest": {
    "DataVolumeKMSKeyId": "arn:aws:kms:us-east-1:111122223333:key/abcd1234-..."
  },
  "EncryptionInTransit": {
    "ClientBroker": "TLS",
    "InCluster": true
  }
}

Then create the cluster, referencing that config:

aws kafka create-cluster \
  --cluster-name orders-stream-prod-v2 \
  --kafka-version "3.6.0" \
  --number-of-broker-nodes 3 \
  --broker-node-group-info file://broker-node-group.json \
  --encryption-info file://encryption-info.json

Step 4: Migrate traffic

Use MirrorMaker 2 or MSK Replicator to copy topic data and offsets from the old cluster to the new one, then cut producers and consumers over to the new bootstrap brokers. With MSK Replicator, both clusters are listed in a single --kafka-clusters document, and the source and target are selected inside the replication info. A typical replicator setup:

aws kafka create-replicator \
  --replicator-name orders-migration \
  --kafka-clusters file://kafka-clusters.json \
  --replication-info-list file://replication-info.json \
  --service-execution-role-arn arn:aws:iam::111122223333:role/MSKReplicatorRole

Tip: Run both clusters in parallel during the cutover. Point consumers at the new cluster first and confirm they catch up to live offsets before redirecting producers. That ordering minimizes the window where messages could be dropped.
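The "caught up" condition in that tip can be expressed as a simple lag check. The sketch below assumes you have already collected, for one consumer group on the new cluster, the log end offset and the committed offset per (topic, partition), for example from the output of kafka-consumer-groups.sh --describe. The function names and the max_lag threshold are hypothetical.

```python
def partition_lag(end_offsets: dict, committed: dict) -> dict:
    """Lag per (topic, partition).

    Simplification: a partition with no committed offset is treated as
    lagging by its full end offset.
    """
    return {tp: max(end - committed.get(tp, 0), 0) for tp, end in end_offsets.items()}


def ready_for_producer_cutover(end_offsets: dict, committed: dict, max_lag: int = 0) -> bool:
    """True when every partition's lag is at or below max_lag."""
    return all(lag <= max_lag for lag in partition_lag(end_offsets, committed).values())
```

In practice you would run this for every consumer group that matters and require several consecutive passes before redirecting producers, since a single snapshot of offsets can be misleading while replication is still in flight.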

Step 5: Decommission the old cluster

Once the new cluster is serving all traffic and you have confirmed no consumers depend on the old one:

aws kafka delete-cluster \
  --cluster-arn arn:aws:kafka:us-east-1:111122223333:cluster/orders-stream-prod/abcd-1234

Fixing it the right way: Terraform

Because remediation requires a rebuild, doing this in infrastructure as code from the start saves you the migration pain entirely. Define the CMK and the encryption block together:

resource "aws_kms_key" "msk" {
  description             = "MSK broker storage encryption key"
  enable_key_rotation     = true
  deletion_window_in_days = 30
}

resource "aws_msk_cluster" "orders" {
  cluster_name           = "orders-stream-prod"
  kafka_version          = "3.6.0"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m5.large"
    client_subnets  = var.private_subnet_ids
    security_groups = [aws_security_group.msk.id]

    storage_info {
      ebs_storage_info {
        volume_size = 1000
      }
    }
  }

  encryption_info {
    encryption_at_rest_kms_key_arn = aws_kms_key.msk.arn

    encryption_in_transit {
      client_broker = "TLS"
      in_cluster    = true
    }
  }
}

If encryption_at_rest_kms_key_arn is omitted, Terraform lets MSK fall back to the AWS-managed key, which is exactly what this check flags. Make it a required variable in your module so nobody can skip it.


How to prevent it from happening again

Since you cannot re-key a running cluster, prevention is the whole game. Catch missing CMKs before the cluster ever exists.

Block it in CI with policy as code

An OPA/Conftest rule against your Terraform plan:

package msk.encryption

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_msk_cluster"
  # Skip pure deletes, where change.after is null and the key check would always fire.
  resource.change.actions[_] != "delete"
  not resource.change.after.encryption_info[_].encryption_at_rest_kms_key_arn
  msg := sprintf("MSK cluster '%s' must set encryption_at_rest_kms_key_arn to a customer-managed KMS key", [resource.address])
}

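Wire it into your pipeline so a plan without a CMK fails the build. One caveat: when the key is created in the same plan as the cluster, its ARN is unknown until apply and is omitted from change.after, so this rule can also fire on that first plan. Creating the key in a separate, earlier apply avoids the false positive.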

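terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json
# Conftest evaluates only the "main" package unless a namespace is given
conftest test tfplan.json --policy ./policies --namespace msk.encryption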

Enforce with an SCP guardrail

In principle, a service control policy could deny MSK cluster creation unless a CMK is specified, but MSK does not reliably expose the key ARN as a condition key, which makes such a policy hard to write. A more reliable approach is an AWS Config rule that alerts when a non-compliant cluster appears, paired with a CI gate that prevents the bad code from merging in the first place. Because a cluster cannot be re-keyed in place, any automated remediation is limited to notifying or opening a ticket.

Tip: Run the msk_unencrypted check on a schedule in Lensix and route findings to Slack or a ticket. Because the fix is a migration, you want to know about a non-compliant cluster on day one, not at audit time when you have a year of retained data sitting on the wrong key.


Best practices

  • Always specify a CMK at creation. The cost of a dedicated KMS key is trivial next to the cost of a migration. Treat it as non-negotiable for every cluster.
  • Enable key rotation. Set enable_key_rotation = true so KMS rotates the backing key material annually with no action from you.
  • Encrypt in transit too. Set client_broker = "TLS" and in_cluster = true. At-rest encryption protects stored data, but plaintext between clients and brokers is just as exposed.
  • Use separate keys per environment. A distinct CMK for prod versus staging means a key policy or revocation in one environment never touches another.
  • Tighten the key policy. Grant only the MSK service and the specific roles that administer the cluster. Avoid wildcard principals on the key policy.
  • Document key ownership. Auditors will ask who owns the key, how it rotates, and who can use it. Tag your keys and keep that mapping current.

The pain of an MSK migration is real, which is exactly why this is worth getting right the first time. A five-minute decision at cluster creation saves a weekend of MirrorMaker babysitting later.

If you are running Lensix, the msk_unencrypted check gives you the inventory of clusters that need attention. Sort them by data sensitivity, schedule the migrations, and lock the new clusters down with a CMK and a CI gate so the finding never comes back.