Fix MSK Broker Logging Not Enabled on AWS | Lensix

TL;DR

This check flags Amazon MSK clusters that have no broker log delivery configured. Without broker logs you lose visibility into Kafka internals during incidents and are likely to draw audit findings on regulated workloads. Fix it by enabling log delivery to CloudWatch Logs, S3, or Firehose with a single update-monitoring call or a few lines of Terraform or CloudFormation.

Amazon MSK takes a lot of the operational pain out of running Apache Kafka, but it does not turn on broker logging for you. Clusters spin up, produce and consume happily, and the Kafka broker logs that you would normally tail on a self-managed cluster just sit inside the managed nodes where you cannot reach them. When a consumer group rebalances endlessly at 3am, or a partition leader election goes sideways, those logs are exactly what you need, and exactly what you will not have.

The msk_nologging check detects MSK clusters where none of the three broker log delivery destinations (CloudWatch Logs, Kinesis Data Firehose, or S3) are enabled.

What this check detects

Every MSK cluster has a LoggingInfo configuration block that controls where Kafka broker logs are shipped. The block supports three independent destinations:

CloudWatch Logs for near real-time searching and metric filters
Kinesis Data Firehose for streaming logs into S3, OpenSearch, or a third-party SIEM
Amazon S3 for cheap long-term retention and compliance archives

This check fails when all three are disabled, meaning broker logs are being generated inside the cluster but discarded. You can confirm the state of a cluster with the AWS CLI:

aws kafka describe-cluster \
  --cluster-arn arn:aws:kafka:us-east-1:123456789012:cluster/prod-events/abc-123 \
  --query 'ClusterInfo.LoggingInfo'

A misconfigured cluster returns something like this, with every destination set to false:

{
  "BrokerLogs": {
    "CloudWatchLogs": { "Enabled": false },
    "Firehose": { "Enabled": false },
    "S3": { "Enabled": false }
  }
}

Note: Broker logs are not the same as MSK metrics. CloudWatch metrics (enabled by default at the DEFAULT level) tell you throughput and partition counts. Broker logs contain the actual server.log, controller.log, and state-change output from Kafka itself. You need both.
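
To check every cluster in a region rather than one ARN at a time, the same describe-cluster call can be wrapped in a loop. This is a minimal sketch, not the implementation of the msk_nologging check: the function names has_broker_logging and audit_region are hypothetical, and the grep match is a shortcut that assumes the LoggingInfo JSON shape shown above. Swap in a real JSON parser if you need something stricter.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not part of any AWS tooling.

# Reads a LoggingInfo JSON document on stdin. Exits 0 if any destination
# has "Enabled": true, non-zero otherwise (including a null LoggingInfo).
has_broker_logging() {
  grep -Eq '"Enabled"[[:space:]]*:[[:space:]]*true'
}

# Prints one line for every MSK cluster in the current region that has
# no broker log destination enabled.
audit_region() {
  local arn
  for arn in $(aws kafka list-clusters \
      --query 'ClusterInfoList[].ClusterArn' --output text); do
    if ! aws kafka describe-cluster --cluster-arn "$arn" \
        --query 'ClusterInfo.LoggingInfo' --output json | has_broker_logging; then
      echo "NO BROKER LOGGING: $arn"
    fi
  done
}
```

Source the file and call audit_region with credentials for the target account and region; an empty result means every listed cluster has at least one destination enabled.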

Why it matters

Kafka is usually sitting right in the middle of your most important data flows: order events, audit trails, change data capture, payment pipelines. When it misbehaves, the blast radius is wide and the clock is ticking. Broker logs are the difference between a fast root cause and a long, blind guessing game.

Incident response goes dark

Without broker logs you cannot see why a broker dropped out of the ISR (in-sync replica) set, why a controller election happened, or why a client keeps getting NOT_LEADER_FOR_PARTITION. Those answers live in the broker logs. With logging off, your only options during an outage are AWS Support tickets and trial-and-error restarts, neither of which is fast.

Security and audit gaps

Broker logs capture authentication failures, TLS handshake errors, and ACL authorization denials. If an attacker or a misconfigured client is repeatedly failing to authenticate, that signal is in the logs. Drop the logs and you drop the evidence. For SOC 2, PCI DSS, HIPAA, and ISO 27001, auditors expect log delivery for systems handling regulated data, and an MSK cluster carrying payment events with no logging is an easy finding against you.

Warning: Broker logs cannot be backfilled. Once an incident has passed, there is no way to retroactively generate logs for a cluster that had logging disabled. Every hour a production cluster runs without logging is a permanent blind spot.

Capacity and reliability tuning

Garbage collection pauses, log segment flush warnings, and request handler thread starvation all show up in broker logs before they show up as customer-facing errors. Teams that ship these logs to CloudWatch or S3 can build metric filters and alarms that catch degradation early, instead of finding out from a downstream consumer that lag has spiked.

How to fix it

You can enable broker logging on a running cluster without recreating it. Pick at least one destination. For most teams CloudWatch Logs plus S3 is a good combination: CloudWatch for searchable short-term troubleshooting, S3 for cheap long-term retention.

Option 1: AWS CLI

First create a CloudWatch log group (skip this if you already have one):

aws logs create-log-group --log-group-name /aws/msk/prod-events

# Optional but recommended: set a retention period
aws logs put-retention-policy \
  --log-group-name /aws/msk/prod-events \
  --retention-in-days 90

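Then update the cluster monitoring configuration. You need the current cluster version, which acts as an optimistic lock. Note that the --enhanced-monitoring flag in this example also raises the metrics level to PER_TOPIC_PER_BROKER, which adds CloudWatch cost and is unrelated to logging; leave it out if you only want to change log delivery: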

CURRENT_VERSION=$(aws kafka describe-cluster \
  --cluster-arn arn:aws:kafka:us-east-1:123456789012:cluster/prod-events/abc-123 \
  --query 'ClusterInfo.CurrentVersion' --output text)

aws kafka update-monitoring \
  --cluster-arn arn:aws:kafka:us-east-1:123456789012:cluster/prod-events/abc-123 \
  --current-version "$CURRENT_VERSION" \
  --enhanced-monitoring PER_TOPIC_PER_BROKER \
  --logging-info '{
    "BrokerLogs": {
      "CloudWatchLogs": {
        "Enabled": true,
        "LogGroup": "/aws/msk/prod-events"
      },
      "S3": {
        "Enabled": true,
        "Bucket": "my-org-msk-logs",
        "Prefix": "prod-events/"
      }
    }
  }'

Warning: update-monitoring triggers a rolling update of the cluster configuration. The operation is non-destructive and brokers stay available, but the cluster enters an UPDATING state for several minutes and you cannot issue another update until it returns to ACTIVE. Run it during a normal change window rather than mid-incident.
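
Because no further update can be issued until the cluster returns to ACTIVE, scripted rollouts usually need to poll for that state. The following is a minimal sketch under stated assumptions: wait_for_active is a hypothetical helper name, and the 30-second interval and 60-attempt limit are arbitrary defaults, not AWS recommendations.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_active is an illustrative helper, not an AWS CLI command.
# Usage: wait_for_active <cluster-arn> [poll-seconds] [max-attempts]
wait_for_active() {
  local arn="$1" interval="${2:-30}" attempts="${3:-60}"
  local state i=1
  while [ "$i" -le "$attempts" ]; do
    state=$(aws kafka describe-cluster --cluster-arn "$arn" \
      --query 'ClusterInfo.State' --output text)
    echo "attempt $i: cluster state is $state"
    case "$state" in
      ACTIVE) return 0 ;;
      FAILED) return 1 ;;   # terminal state, stop polling
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  return 1                   # gave up before the cluster became ACTIVE
}
```

Call it right after update-monitoring, for example wait_for_active "$CLUSTER_ARN" && echo "logging update finished", where CLUSTER_ARN is a variable you set yourself.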

Option 2: Terraform

If your clusters are managed in Terraform, add a logging_info block to the aws_msk_cluster resource:

resource "aws_cloudwatch_log_group" "msk" {
  name              = "/aws/msk/prod-events"
  retention_in_days = 90
}

resource "aws_msk_cluster" "prod_events" {
  cluster_name           = "prod-events"
  kafka_version          = "3.6.0"
  number_of_broker_nodes = 3

  # ... broker_node_group_info, encryption_info, etc ...

  logging_info {
    broker_logs {
      cloudwatch_logs {
        enabled   = true
        log_group = aws_cloudwatch_log_group.msk.name
      }

      s3 {
        enabled = true
        bucket  = aws_s3_bucket.msk_logs.id # assumes an aws_s3_bucket "msk_logs" defined elsewhere
        prefix  = "prod-events/"
      }
    }
  }
}

Option 3: CloudFormation

In CloudFormation, set the LoggingInfo property on the AWS::MSK::Cluster resource:

MSKCluster:
  Type: AWS::MSK::Cluster
  Properties:
    ClusterName: prod-events
    KafkaVersion: 3.6.0
    NumberOfBrokerNodes: 3
    # ... BrokerNodeGroupInfo, EncryptionInfo ...
    LoggingInfo:
      BrokerLogs:
        CloudWatchLogs:
          Enabled: true
          LogGroup: /aws/msk/prod-events
        S3:
          Enabled: true
          Bucket: my-org-msk-logs
          Prefix: prod-events/

Note: If you choose S3 delivery, MSK needs permission to write to the bucket. Add a bucket policy that allows the delivery.logs.amazonaws.com service principal to s3:PutObject under your prefix, otherwise the cluster update succeeds but no logs ever land.
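
A bucket policy along those lines might look like the sketch below. Treat it as a starting point rather than the definitive policy: msk_log_bucket_policy is a hypothetical helper, the s3:GetBucketAcl statement is an assumption modeled on the general pattern AWS documents for log delivery to S3, and you may want to narrow the resource path or add conditions for your account.

```shell
#!/usr/bin/env bash
# Sketch only: msk_log_bucket_policy is an illustrative helper.
# Prints a bucket policy that lets the log delivery service write under a prefix.
# Usage: msk_log_bucket_policy <bucket> <prefix>
msk_log_bucket_policy() {
  local bucket="$1" prefix="$2"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowLogDeliveryWrite",
      "Effect": "Allow",
      "Principal": { "Service": "delivery.logs.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::${bucket}/${prefix}*"
    },
    {
      "Sid": "AllowLogDeliveryAclCheck",
      "Effect": "Allow",
      "Principal": { "Service": "delivery.logs.amazonaws.com" },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::${bucket}"
    }
  ]
}
EOF
}

# Caution: put-bucket-policy replaces the whole existing policy, so merge
# these statements into any policy the bucket already has before applying.
#   msk_log_bucket_policy my-org-msk-logs prod-events/ > policy.json
#   aws s3api put-bucket-policy --bucket my-org-msk-logs --policy file://policy.json
```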

Verify the change took effect once the cluster is back to ACTIVE:

aws kafka describe-cluster \
  --cluster-arn arn:aws:kafka:us-east-1:123456789012:cluster/prod-events/abc-123 \
  --query 'ClusterInfo.LoggingInfo.BrokerLogs'

How to prevent it from happening again

Fixing one cluster by hand is fine. The goal is to make a cluster without logging impossible to ship in the first place.

Block it at the IaC layer

Catch missing logging before it ever reaches AWS by scanning Terraform code in CI. Checkov's built-in CKV_AWS_80 check (MSK cluster logging enabled) does this out of the box:

checkov -d . --check CKV_AWS_80

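For custom rules, OPA / Conftest gives you full control over what "good" looks like. This policy fails any MSK cluster where none of the three destinations (CloudWatch Logs, Firehose, or S3) is enabled, including clusters that omit the logging_info block entirely:

package terraform.msk

# True when at least one broker log destination is enabled in the planned state.
has_logging(r) { r.change.after.logging_info[_].broker_logs[_].cloudwatch_logs[_].enabled }
has_logging(r) { r.change.after.logging_info[_].broker_logs[_].firehose[_].enabled }
has_logging(r) { r.change.after.logging_info[_].broker_logs[_].s3[_].enabled }

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_msk_cluster"
  resource.change.after != null  # skip clusters that the plan deletes
  not has_logging(resource)      # also true when logging_info is missing
  msg := sprintf("MSK cluster '%s' has no broker logging enabled", [resource.address])
}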

Wire that into your pipeline so a pull request that adds an unlogged cluster fails before merge:

terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json
conftest test tfplan.json --policy policies/

Tip: Pair the CI gate with an AWS Config managed rule or a scheduled Lensix scan so you also catch clusters created outside your pipeline, through the console or by another team. Policy-as-code stops new mistakes; continuous scanning catches the ones that slipped in before you added the gate.

Enforce it organization-wide

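If you use Service Control Policies, you cannot directly require logging on creation, but you can deny the kafka:UpdateMonitoring action that would later turn logging off, locking in a known-good configuration once it is set. The same action is also what enables logging and changes the monitoring level, so exempt a platform or admin role from the deny; otherwise nobody in the affected accounts can fix an unlogged cluster.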
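
One way to express that deny is sketched below, with a carve-out for a single role so that someone can still enable or adjust logging. This is an illustration under assumptions, not a vetted policy: msk_logging_scp is a hypothetical helper, and the role name you pass in is whatever your organization uses for platform administration.

```shell
#!/usr/bin/env bash
# Sketch only: msk_logging_scp is an illustrative helper.
# Prints an SCP that denies kafka:UpdateMonitoring to everyone except one role.
# Usage: msk_logging_scp <exempt-role-name>
msk_logging_scp() {
  local role="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyMskMonitoringChanges",
      "Effect": "Deny",
      "Action": "kafka:UpdateMonitoring",
      "Resource": "*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/${role}"
        }
      }
    }
  ]
}
EOF
}

# Example (role and policy names are placeholders):
#   msk_logging_scp platform-admin > scp.json
#   aws organizations create-policy --name deny-msk-monitoring-changes \
#     --type SERVICE_CONTROL_POLICY \
#     --description "Lock MSK monitoring and logging settings" \
#     --content file://scp.json
```

Attach the policy only after every cluster in scope already has logging enabled, since the deny freezes whatever configuration is in place.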

Best practices

Enable two destinations where you can. One is enough to pass this check, but CloudWatch Logs for live troubleshooting plus S3 for long-term, low-cost retention is the standard pairing. Firehose is the right choice when you forward to a SIEM or OpenSearch.
Set retention deliberately. CloudWatch log groups default to never expiring, which gets expensive on a chatty cluster. Set retention_in_days to match your operational needs (30 to 90 days is common) and lean on S3 lifecycle rules for anything longer.
Add metric filters and alarms. Logging only helps if someone looks. Create CloudWatch metric filters for patterns like "ERROR", "Shrinking ISR", or authentication failures, and alarm on them so problems surface before customers notice.
Match logging to data sensitivity. Clusters carrying regulated or financial data should ship logs to immutable, encrypted S3 storage with Object Lock to satisfy audit and tamper-evidence requirements.
Treat logging as a launch requirement. Bundle broker logging into your golden MSK module or Terraform template so every new cluster inherits it. The cheapest fix is the one you never have to remediate.
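
The metric filter and alarm practice above can be sketched with two CLI calls. The helper name alarm_on_log_pattern, the Custom/MSK namespace, and the alarm thresholds are all assumptions chosen for illustration; tune the period and threshold to how noisy your cluster is.

```shell
#!/usr/bin/env bash
# Sketch only: alarm_on_log_pattern is an illustrative helper.
# Creates a metric filter that counts matching log lines, then an alarm
# that fires when any match appears within a 5-minute window.
# Usage: alarm_on_log_pattern <log-group> <metric-name> <filter-pattern> <sns-topic-arn>
alarm_on_log_pattern() {
  local group="$1" metric="$2" pattern="$3" topic="$4"
  local namespace="Custom/MSK"   # hypothetical namespace, pick your own

  aws logs put-metric-filter \
    --log-group-name "$group" \
    --filter-name "$metric" \
    --filter-pattern "$pattern" \
    --metric-transformations "metricName=$metric,metricNamespace=$namespace,metricValue=1"

  aws cloudwatch put-metric-alarm \
    --alarm-name "$metric" \
    --namespace "$namespace" \
    --metric-name "$metric" \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$topic"
}

# Example (the SNS topic ARN is a placeholder):
#   alarm_on_log_pattern /aws/msk/prod-events msk-isr-shrink '"Shrinking ISR"' \
#     arn:aws:sns:us-east-1:123456789012:oncall
```

Treating missing data as notBreaching keeps the alarm quiet during periods when the pattern never appears, which is the normal state for a healthy cluster.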

Broker logging on MSK is a small, non-disruptive change with an outsized payoff. Turn it on once per cluster, gate it in CI so it cannot regress, and you trade a permanent blind spot for the visibility you will be very glad to have the next time Kafka surprises you.
