Enforce TLS on Amazon MSK Clusters | Lensix

TL;DR

This check flags MSK clusters that accept plaintext (or mixed plaintext and TLS) client-to-broker traffic, exposing Kafka data to network sniffing and man-in-the-middle attacks. Fix it by setting the cluster encryption mode to TLS only and reconnecting clients over port 9094.

Amazon Managed Streaming for Apache Kafka (MSK) sits in the middle of a lot of sensitive data flows: order events, clickstreams, change-data-capture from your databases, payment notifications. MSK can be configured to accept client connections over plaintext as well as TLS. When a cluster allows plaintext, every byte a producer sends or a consumer reads can travel across your VPC unencrypted.

The msk_notls check looks at the cluster's client broker encryption setting and fails when it is set to anything other than TLS. That means it catches both fully plaintext clusters and the "mixed" mode that allows both at once.

What this check detects

MSK exposes a setting called ClientBroker under the cluster's encryption-in-transit configuration. It has three possible values:

TLS — clients must connect over TLS on port 9094 (or 9098 for IAM auth). This is the secure setting.
TLS_PLAINTEXT — the broker accepts both TLS and plaintext connections. This is the dangerous "mixed" mode.
PLAINTEXT — clients connect unencrypted on port 9092.

The check fails if ClientBroker is PLAINTEXT or TLS_PLAINTEXT. You can see the current value yourself:

aws kafka describe-cluster \
  --cluster-arn "arn:aws:kafka:us-east-1:111122223333:cluster/my-cluster/abc123" \
  --query 'ClusterInfo.EncryptionInfo.EncryptionInTransit'

{
    "ClientBroker": "TLS_PLAINTEXT",
    "InCluster": true
}

A result of TLS_PLAINTEXT or PLAINTEXT here is what trips the check. The InCluster field controls broker-to-broker encryption, which is a separate concern but worth keeping set to true.
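The pass/fail rule described above can be sketched as a small shell helper. This is only an illustration of the logic, not how Lensix implements the check; the function names client_broker_verdict and check_cluster are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map a ClientBroker value to the outcome described above.
client_broker_verdict() {
  case "$1" in
    TLS) echo "PASS" ;;
    TLS_PLAINTEXT|PLAINTEXT) echo "FAIL" ;;
    *) echo "UNKNOWN" ;;   # anything unexpected is surfaced rather than passed
  esac
}

# Hypothetical wrapper: look up one cluster by ARN and print "<verdict> <mode> <arn>".
# Returns 0 only when the cluster is TLS-only, 2 when the AWS call itself fails.
check_cluster() {
  mode=$(aws kafka describe-cluster \
    --cluster-arn "$1" \
    --query 'ClusterInfo.EncryptionInfo.EncryptionInTransit.ClientBroker' \
    --output text) || return 2
  verdict=$(client_broker_verdict "$mode")
  echo "$verdict $mode $1"
  [ "$verdict" = "PASS" ]
}
```

Calling check_cluster with the cluster ARN from the example above would print a single line and exit non-zero for a mixed-mode or plaintext cluster, which makes it easy to drop into a script.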

Note: Mixed mode often shows up because someone enabled TLS during a migration but never removed the plaintext listener once clients were cut over. The intention was good. The leftover plaintext port is the problem.

Why it matters

People sometimes wave this off with "it's inside the VPC, so it's fine." That assumption breaks down quickly.

Network traffic is not private just because it is internal

Plaintext Kafka traffic can be captured by anything with visibility into the network path: a compromised EC2 instance in the same subnet, a misconfigured VPC mirror, a malicious container sharing a host, or an attacker who has already gained a foothold and is moving laterally. Kafka payloads are just length-prefixed bytes. If your producers are writing JSON or Avro records with PII, credentials, or financial data, that data is readable on the wire.

Mixed mode is worse than it looks

With TLS_PLAINTEXT, you might believe you are protected because your main applications connect over TLS. But the plaintext listener is still open. An attacker who can reach the brokers can simply choose port 9092 and bypass encryption entirely. A locked door next to an open window is still an open window.

Danger: If your MSK cluster carries regulated data (PCI DSS, HIPAA, GDPR-scope personal data), a plaintext or mixed configuration is very likely to be raised as an audit finding. All three frameworks expect data in transit to be protected, although they differ in how prescriptively they say so.

Credentials can leak too

If you use SASL/SCRAM authentication over a plaintext connection, the authentication handshake is exposed. SCRAM is designed to resist some attacks, but running it over an unencrypted channel removes a layer of defense you should not be giving up.

How to fix it

The fix is to set ClientBroker to TLS. The catch is that this is not a free, instant flip: changing encryption-in-transit triggers a rolling update of the cluster, and your clients must already be capable of connecting over TLS before you cut off plaintext.

Warning: Switching to TLS-only is a rolling configuration change. Brokers restart one at a time. Plan it during a maintenance window, and make sure every producer and consumer is configured for TLS first, or they will lose connectivity the moment the plaintext listener closes.

Step 1: Get clients ready for TLS

Update your Kafka client configuration to use the TLS bootstrap endpoint (port 9094) and the appropriate security protocol. For a Java client using TLS without client auth:

security.protocol=SSL
ssl.truststore.location=/etc/kafka/secrets/kafka.client.truststore.jks
ssl.truststore.password=changeit

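MSK brokers use certificates signed by the Amazon Trust Services CA, which is already in the default JVM truststore. The two ssl.truststore.* lines above are therefore usually optional: keep them only if your clients run with a custom or trimmed truststore, and do not leave a default password such as changeit in place. Fetch the TLS bootstrap brokers with: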

aws kafka get-bootstrap-brokers \
  --cluster-arn "arn:aws:kafka:us-east-1:111122223333:cluster/my-cluster/abc123" \
  --query 'BootstrapBrokerStringTls'

Deploy the updated client config and confirm everything is producing and consuming over 9094 while plaintext is still allowed. This is the safe overlap window.
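Before relying on the overlap window, it can help to confirm that each broker actually completes a TLS handshake from the network your clients run in. The sketch below assumes openssl is installed on that host; the function names are hypothetical, and it only proves that the handshake completes, not that your application is configured correctly.

```shell
#!/bin/sh
# Split a comma-separated bootstrap string into one host:port per line.
split_brokers() {
  printf '%s\n' "$1" | tr ',' '\n'
}

# Attempt a TLS handshake against a single host:port.
# stdin comes from /dev/null so s_client exits once the handshake is done.
tls_handshake_ok() {
  host=${1%:*}
  openssl s_client -connect "$1" -servername "$host" </dev/null >/dev/null 2>&1
}

# Print one "ok" or "FAIL" line per broker in the bootstrap string.
check_tls_brokers() {
  split_brokers "$1" | while read -r broker; do
    if tls_handshake_ok "$broker"; then
      echo "ok   $broker"
    else
      echo "FAIL $broker"
    fi
  done
}
```

A typical call would pass the output of the get-bootstrap-brokers command above (with --output text added) as the single argument to check_tls_brokers.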

Step 2: Update the cluster encryption setting

You need the current cluster version string for the update. Pull it first:

aws kafka describe-cluster \
  --cluster-arn "arn:aws:kafka:us-east-1:111122223333:cluster/my-cluster/abc123" \
  --query 'ClusterInfo.CurrentVersion' --output text

Then apply the TLS-only setting:

aws kafka update-security \
  --cluster-arn "arn:aws:kafka:us-east-1:111122223333:cluster/my-cluster/abc123" \
  --current-version "K3AEGXETSR30VB" \
  --encryption-info '{
    "EncryptionInTransit": {
      "ClientBroker": "TLS",
      "InCluster": true
    }
  }'

This returns a ClusterOperationArn. Track the operation until it completes:

aws kafka describe-cluster-operation \
  --cluster-operation-arn "arn:aws:kafka:us-east-1:111122223333:cluster-operation/..." \
  --query 'ClusterOperationInfo.OperationState'

Wait for the state to read UPDATE_COMPLETE. The plaintext listener on port 9092 is closed once the rolling update finishes.
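Rather than re-running that command by hand, the wait can be scripted. The following is a minimal polling sketch: POLL_SECONDS and MAX_POLLS are hypothetical knobs introduced here, and treating UPDATE_FAILED as the failure state is an assumption you should check against the states your cluster actually reports.

```shell
#!/bin/sh
# Poll a cluster operation until it finishes.
# Returns 0 on UPDATE_COMPLETE, 1 on UPDATE_FAILED (assumed failure state),
# 2 if the AWS call fails, 3 if MAX_POLLS is exhausted.
wait_for_operation() {
  op_arn=$1
  polls=0
  while [ "$polls" -lt "${MAX_POLLS:-120}" ]; do
    state=$(aws kafka describe-cluster-operation \
      --cluster-operation-arn "$op_arn" \
      --query 'ClusterOperationInfo.OperationState' \
      --output text) || return 2
    echo "operation state: $state"
    case "$state" in
      UPDATE_COMPLETE) return 0 ;;
      UPDATE_FAILED) return 1 ;;
    esac
    polls=$((polls + 1))
    sleep "${POLL_SECONDS:-30}"
  done
  echo "gave up after $polls polls" >&2
  return 3
}
```

Pass the ClusterOperationArn returned by update-security as the only argument. With the defaults shown, the function gives up after roughly an hour, which you may want to raise for large clusters.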

Step 3: Verify

aws kafka describe-cluster \
  --cluster-arn "arn:aws:kafka:us-east-1:111122223333:cluster/my-cluster/abc123" \
  --query 'ClusterInfo.EncryptionInfo.EncryptionInTransit.ClientBroker' --output text

The command should print TLS. Confirm your applications are still healthy and no consumer lag has spiked.
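As an extra sanity check, you can probe the old plaintext port from a client host. This sketch assumes an nc (netcat) build that supports the -z and -w flags, and the function names are hypothetical. A closed result can also come from a security group that blocks 9092, so treat it as supporting evidence alongside the API check, not a replacement for it.

```shell
#!/bin/sh
# Rewrite TLS bootstrap endpoints (port 9094) to their plaintext counterparts (9092),
# one host:port per line.
to_plaintext_endpoints() {
  printf '%s\n' "$1" | tr ',' '\n' | sed 's/:9094$/:9092/'
}

# Succeeds when nothing accepts a TCP connection on host:port within 5 seconds.
port_closed() {
  ! nc -z -w 5 "${1%:*}" "${1##*:}" >/dev/null 2>&1
}

# Print the status of the plaintext port on every broker in a TLS bootstrap string.
check_plaintext_closed() {
  to_plaintext_endpoints "$1" | while read -r endpoint; do
    if port_closed "$endpoint"; then
      echo "closed $endpoint"
    else
      echo "OPEN   $endpoint"
    fi
  done
}
```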

Fixing it in infrastructure as code

If you manage MSK with Terraform, set the encryption block explicitly rather than relying on the default. The provider defaults to TLS for client_broker, but being explicit prevents drift and makes the intent obvious in review:

resource "aws_msk_cluster" "main" {
  cluster_name           = "my-cluster"
  kafka_version          = "3.6.0"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m5.large"
    client_subnets  = var.private_subnet_ids
    security_groups = [aws_security_group.msk.id]

    storage_info {
      ebs_storage_info {
        volume_size = 100
      }
    }
  }

  encryption_info {
    encryption_in_transit {
      client_broker = "TLS"
      in_cluster    = true
    }
  }
}

For CloudFormation, the equivalent lives under the cluster's EncryptionInfo:

{
  "EncryptionInfo": {
    "EncryptionInTransit": {
      "ClientBroker": "TLS",
      "InCluster": true
    }
  }
}

Tip: Changing client_broker in Terraform on an existing cluster triggers an in-place update, not a replacement, but it still rolls the brokers. Run terraform plan and confirm it shows an update rather than a destroy/create before you apply.
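The "update rather than destroy/create" check can be automated by scanning the human-readable plan. The sketch below relies on the summary lines Terraform prints for each resource ("will be updated in-place", "must be replaced", "will be destroyed"); the function name msk_plan_action is hypothetical, and the wording of those lines could differ in Terraform versions other than the one you run.

```shell
#!/bin/sh
# Read `terraform plan -no-color` output on stdin and report what it does to MSK clusters:
#   replace - at least one aws_msk_cluster would be destroyed or recreated
#   update  - only in-place updates were found
#   none    - no aws_msk_cluster changes were found
msk_plan_action() {
  plan=$(cat)
  if printf '%s\n' "$plan" |
    grep -Eq '# ([^ ]+\.)?aws_msk_cluster\.[^ ]+ (must be replaced|will be destroyed)'; then
    echo "replace"
  elif printf '%s\n' "$plan" |
    grep -Eq '# ([^ ]+\.)?aws_msk_cluster\.[^ ]+ will be updated in-place'; then
    echo "update"
  else
    echo "none"
  fi
}
```

Piping terraform plan -no-color into msk_plan_action and refusing to apply on "replace" gives a cheap guard in a deployment script.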

How to prevent it from happening again

One-time fixes drift back. The goal is to make plaintext MSK impossible to ship in the first place.

Block it at the IaC stage

Use a policy-as-code tool to reject any MSK definition that does not enforce TLS. With Checkov, the built-in rule CKV_AWS_81 already covers this. To enforce it in CI, fail the pipeline on that check:

checkov -d ./infra --check CKV_AWS_81 --compact

If you prefer OPA/Conftest, a Rego rule against your Terraform plan JSON does the job:

package msk

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_msk_cluster"
  cb := resource.change.after.encryption_info[_].encryption_in_transit[_].client_broker
  cb != "TLS"
  msg := sprintf("MSK cluster %s must set client_broker to TLS, found %s", [resource.address, cb])
}
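To run that rule, Conftest needs the plan in JSON form. A minimal sketch of the wiring follows, assuming the Rego file lives in a directory you pass as the first argument (default policy) and that tfplan.binary and tfplan.json are acceptable scratch file names; both names and the function name are hypothetical. The --namespace flag is passed because the rule above is in package msk, while Conftest evaluates the main namespace by default.

```shell
#!/bin/sh
# Produce a plan, convert it to JSON, and evaluate the msk package against it.
# Returns 2 if Terraform fails, otherwise Conftest's own exit status.
check_plan_with_conftest() {
  policy_dir=${1:-policy}
  terraform plan -input=false -out=tfplan.binary >/dev/null || return 2
  terraform show -json tfplan.binary > tfplan.json || return 2
  conftest test tfplan.json --policy "$policy_dir" --namespace msk
}
```

Conftest exits non-zero when any deny rule fires, so calling this function as a CI step is enough to fail the pipeline.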

Detect drift at runtime

IaC gates only catch what goes through IaC. Someone can still flip the setting in the console or via the API. AWS Config has a managed rule, msk-in-cluster-node-require-tls, but it covers broker-to-broker (InCluster) encryption only, so client-broker TLS needs a custom rule. Pairing a runtime scanner like Lensix with your CI gate gives you both sides: prevention before deploy and detection after.

Tip: Wire the Lensix msk_notls finding into a Slack or PagerDuty alert so a regression surfaces within minutes rather than at the next audit.

Best practices

Always set ClientBroker to TLS and InCluster to true. Encrypt both client-to-broker and broker-to-broker traffic.
Never use TLS_PLAINTEXT in production. If you need it temporarily during a migration, treat it as a tracked, time-boxed exception with a deadline to remove the plaintext listener.
Combine TLS with authentication. TLS encrypts the channel but does not, on its own, control who connects. Use IAM access control or SASL/SCRAM so only authorized clients reach the brokers.
Lock down the security group. Restrict ingress to the broker ports to only the subnets and security groups that run your Kafka clients. Defense in depth means an attacker should not be able to reach the brokers at all.
Pin TLS settings in your IaC modules. If you maintain a shared MSK module, hard-code the encryption block so every team that consumes the module inherits the secure default.
Audit existing clusters now. Clusters created years ago, before TLS was a default expectation, are the usual offenders. Sweep them all rather than just checking new deployments.
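
The sweep suggested in the last item can be sketched with the AWS CLI. This is a rough starting point, not a replacement for a scanner: the function names are hypothetical, it covers one region per call, and it assumes list-clusters returns the provisioned clusters you care about.

```shell
#!/bin/sh
# Read "name<TAB>mode" rows on stdin, print every cluster that is not TLS-only,
# and exit non-zero if at least one offender was found.
report_offenders() {
  awk -F '\t' '$2 != "TLS" { print "FAIL " $1 " (" $2 ")"; bad = 1 } END { exit bad + 0 }'
}

# List every cluster in one region and report the ones that allow plaintext.
# Returns 2 if the AWS call fails, so an API error is not mistaken for a clean sweep.
sweep_region() {
  rows=$(aws kafka list-clusters --region "$1" \
    --query 'ClusterInfoList[].[ClusterName, EncryptionInfo.EncryptionInTransit.ClientBroker]' \
    --output text) || return 2
  if [ -z "$rows" ]; then
    return 0   # no clusters in this region
  fi
  printf '%s\n' "$rows" | report_offenders
}
```

Looping sweep_region over the regions you operate in, for example from a scheduled job, gives a quick inventory of clusters that still need the fix.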

Enforcing TLS on MSK is a small, well-bounded change with a large payoff: it removes an entire category of network-level data exposure and clears one of the most common encryption-in-transit audit findings. Get your clients onto port 9094, flip the setting to TLS, and gate it in CI so it stays that way.
