Fix GKE Cluster Monitoring Disabled on GCP | Lensix

TL;DR

This check flags GKE clusters running without Cloud Monitoring, which leaves you blind to node failures, resource exhaustion, and suspicious activity. Re-enable it with gcloud container clusters update CLUSTER_NAME --location=COMPUTE_LOCATION --monitoring=SYSTEM, or set the monitoring config in your Terraform.

A Kubernetes cluster without monitoring is a black box. Pods churn, nodes go unhealthy, and you only find out when a customer files a ticket. The gke_nomonitoring check catches GKE clusters where Cloud Monitoring has been turned off, and it is one of those misconfigurations that stays invisible right up until it costs you an outage or an incident you cannot reconstruct.

This post walks through what the check looks at, why a missing monitoring config is more than an operational annoyance, and how to fix it cleanly across the CLI, the console, and Terraform.

What this check detects

The check inspects each GKE cluster's monitoringConfig and flags any cluster where Cloud Monitoring is disabled or scoped down to nothing. In GKE terms, that means the cluster has no enabled monitoring components, so system metrics for the control plane, nodes, and workloads are not being shipped to Cloud Monitoring.

GKE clusters created today default to monitoring the SYSTEM_COMPONENTS set. The clusters that trip this check are usually one of three cases:

An older cluster created before monitoring defaults changed, never updated.
A cluster where someone explicitly ran --monitoring=NONE to cut noise or cost.
An IaC template that hardcodes monitoring to disabled and got copy-pasted across environments.

Note: GKE separates monitoring from logging. You can have logging on and monitoring off, or vice versa. This check is specifically about Cloud Monitoring (metrics), not Cloud Logging. A cluster can fail this check while still shipping logs.
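The condition described in this section can be approximated from the command line. The following is a minimal sketch, not the check's actual implementation: flag_unmonitored is a hypothetical helper name, and it assumes that gcloud's value() format prints one cluster per line with tab-separated fields and leaves the third field empty when no monitoring components are enabled.

```shell
# Sketch (hypothetical helper): list clusters with no enabled monitoring components.
# Input on stdin: one cluster per line as "NAME<TAB>LOCATION<TAB>COMPONENTS".
# A cluster is flagged when the COMPONENTS field is empty or missing.
flag_unmonitored() {
  # Print "NAME<TAB>LOCATION" for every line whose third field is empty.
  awk -F'\t' '$3 == "" { printf "%s\t%s\n", $1, $2 }'
}

# Usage (assumes an authenticated gcloud with a default project set):
#   gcloud container clusters list \
#     --format="value(name,location,monitoringConfig.componentConfig.enableComponents)" \
#     | flag_unmonitored
```

Keeping the filter separate from the gcloud call makes it easy to try against saved output before pointing it at a live project.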

Why it matters

Monitoring is not a nice-to-have layer you bolt on later. It is the data feed that powers everything else: alerting, autoscaling decisions you can validate, capacity planning, and incident forensics. Turn it off and several things break at once.

You lose visibility into cluster health

Without system metrics, you cannot see CPU and memory pressure on nodes, pod restart counts, or control plane health. A node slowly running out of memory will start OOM-killing pods, and the first signal you get is degraded service rather than a metric crossing a threshold an hour earlier.

Alerting goes dark

Any Cloud Monitoring alert policy that depends on GKE metrics silently stops firing when the metrics stop flowing. This is worse than having no alerts at all, because your team believes coverage exists. During an incident, the dashboards everyone trusts show flat lines, and people waste time arguing about whether the graph is broken or the service is.

Incident response has no history to work from

When something does go wrong, the first question is always "what changed and when." If metrics were not being collected, there is no timeline to reconstruct. You cannot correlate a latency spike with a node that got recycled, or tie a memory leak to a specific deployment. The investigation starts from zero.

Warning: Disabling monitoring to save money is usually a false economy. The cost of GKE system metrics is modest compared to the cost of a single prolonged outage you cannot diagnose. If cost is the real concern, scope monitoring down rather than off (covered below).

It can mask malicious activity

From a security standpoint, missing metrics make anomaly detection harder. A cryptomining workload pinning CPU on your nodes, or a sudden spike in pod creation from a compromised service account, shows up clearly in monitoring data. With monitoring off, that signal never reaches anyone.

How to fix it

The fix is to re-enable Cloud Monitoring on the cluster. You have three practical paths depending on how you manage infrastructure.

Option 1: gcloud CLI

To enable monitoring of system components, which is the recommended baseline, run:

gcloud container clusters update CLUSTER_NAME \
  --location=COMPUTE_LOCATION \
  --monitoring=SYSTEM

If you also want application-level metrics, enable managed Prometheus collection alongside system metrics. The older WORKLOAD component is deprecated and was removed in GKE 1.24, so avoid it on current clusters:

gcloud container clusters update CLUSTER_NAME \
  --location=COMPUTE_LOCATION \
  --monitoring=SYSTEM \
  --enable-managed-prometheus

Confirm the change took effect:

gcloud container clusters describe CLUSTER_NAME \
  --location=COMPUTE_LOCATION \
  --format="value(monitoringConfig.componentConfig.enableComponents)"

Warning: A cluster update that changes monitoring config does not recreate nodes, but it is a control plane operation that can take a few minutes to complete, and other cluster update operations may be blocked while it runs. Run it during a low-traffic window if your cluster is sensitive to control plane operations.
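When several clusters need the same fix, one approach is to generate the update commands for review before anything runs. This is a minimal sketch under stated assumptions: print_enable_commands is a hypothetical helper, it only prints commands and never calls gcloud itself, and it expects input lines of the form NAME, a tab, then LOCATION.

```shell
# Sketch (hypothetical helper): turn "NAME<TAB>LOCATION" lines into gcloud
# update commands that enable system monitoring. Nothing is executed here;
# the output is meant to be reviewed first.
print_enable_commands() {
  # Split each input line on tabs; blank lines are skipped.
  while IFS="$(printf '\t')" read -r name location; do
    [ -n "$name" ] || continue
    # GKE cluster names and locations contain only lowercase letters,
    # digits, and hyphens, so no extra shell quoting is applied.
    printf 'gcloud container clusters update %s --location=%s --monitoring=SYSTEM\n' \
      "$name" "$location"
  done
}

# Usage: save the output, review it, then run it deliberately, for example
#   print_enable_commands < flagged-clusters.tsv > enable-monitoring.sh
#   sh enable-monitoring.sh
```

Printing rather than executing keeps a human in the loop, which matters because each update is a control plane operation on a live cluster.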

Option 2: Google Cloud Console

Open Kubernetes Engine and select your cluster.
Click Edit, then scroll to the Features section.
Find Cloud Monitoring and set it to enabled.
Choose the component scope, system metrics at minimum, and add workload metrics if you need them.
Save. The change applies to the cluster without rebuilding node pools.

Option 3: Terraform

If your clusters are managed with the google_container_cluster resource, set the monitoring config explicitly so the desired state is enforced on every apply:

resource "google_container_cluster" "primary" {
  name     = "primary"
  location = "us-central1"

  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS"]

    managed_prometheus {
      enabled = true
    }
  }
}

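Older examples add "WORKLOADS" to enable_components for workload metrics, but that value is deprecated and was removed in GKE 1.24; the managed_prometheus block shown above is its replacement. Run a plan first to confirm Terraform shows the monitoring config changing and nothing else unexpected: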

terraform plan -target=google_container_cluster.primary

Tip: If you manage many clusters, do not patch them one by one. Make the monitoring block a required input in your shared GKE module so every cluster inherits it, then roll the module version out across environments.

How to prevent it from happening again

Fixing one cluster is easy. Stopping the next person from disabling monitoring, or a stale template from reintroducing the problem, takes a guardrail.

Enforce it in code review with policy-as-code

Add an OPA or Conftest policy that rejects any GKE cluster definition with monitoring disabled. A simple Rego rule against a Terraform plan:

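package gke.monitoring

import rego.v1

# Deny any cluster whose planned state does not enable SYSTEM_COMPONENTS,
# including clusters that omit the monitoring_config block entirely.
deny contains msg if {
  some resource in input.resource_changes
  resource.type == "google_container_cluster"
  resource.change.actions[_] != "delete" # skip clusters that are only being destroyed
  not enables_system(resource)
  msg := sprintf("Cluster %q must enable SYSTEM_COMPONENTS monitoring", [resource.name])
}

enables_system(resource) if {
  resource.change.after.monitoring_config[_].enable_components[_] == "SYSTEM_COMPONENTS"
}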

Wire that into CI so a pull request that removes monitoring fails before merge.

Gate it in the pipeline

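Run conftest test against the generated plan as a required check. Conftest evaluates only the main package by default, so pass the policy's package as the namespace:

terraform plan -out=tfplan
terraform show -json tfplan > plan.json
conftest test plan.json --policy ./policies --namespace gke.monitoring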

Use Organization Policy where you can

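For broader enforcement, look at GCP-native controls. Security Health Analytics includes a finding for GKE clusters with monitoring disabled (CLUSTER_MONITORING_DISABLED), and Organization Policy custom constraints may let you block non-compliant cluster configurations where the relevant fields are supported. Pair that with continuous scanning so a manually-disabled cluster gets caught within minutes, not at the next audit.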

Tip: Lensix runs the gke_nomonitoring check continuously across all your projects, so a cluster that gets changed out of band, outside Terraform, still surfaces in your dashboard rather than slipping through.

Best practices

Treat SYSTEM_COMPONENTS monitoring as non-negotiable. It is cheap, low-noise, and the foundation for every alert you will ever write.
Add workload metrics where the workloads matter. On production clusters running customer-facing services, enable managed Prometheus for application metrics rather than the legacy WORKLOADS component, which is deprecated and was removed in GKE 1.24.
Scope down before you turn off. If monitoring volume is driving cost, trim the components you collect rather than disabling everything. NONE should almost never be a deliberate choice.
Build alerts on top of the metrics. Monitoring with no alert policies is data nobody looks at. At minimum, alert on node memory pressure, persistent pod restarts, and control plane errors.
Keep monitoring config in your IaC. If it is not in code, it will drift. Explicit config also documents intent for the next engineer.
Review monitoring as part of cluster onboarding. Make it a checklist item when any new cluster goes live, so coverage is verified rather than assumed.
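As an illustration of the "build alerts on top of the metrics" practice, the sketch below generates a Cloud Monitoring alert policy for node memory pressure. Treat it as a starting point under stated assumptions: node_memory_policy_json is a hypothetical helper, the 0.9 threshold and 300s duration are illustrative values rather than recommendations, and the filter assumes the GKE system metric kubernetes.io/node/memory/allocatable_utilization on the k8s_node resource type.

```shell
# Sketch (hypothetical helper): print an alert policy as JSON on stdout.
# $1: utilization threshold as a fraction (defaults to 0.9, an illustrative value).
node_memory_policy_json() {
  threshold="${1:-0.9}"
  cat <<EOF
{
  "displayName": "GKE node memory utilization high",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Node allocatable memory utilization above ${threshold}",
      "conditionThreshold": {
        "filter": "resource.type = \"k8s_node\" AND metric.type = \"kubernetes.io/node/memory/allocatable_utilization\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": ${threshold},
        "duration": "300s",
        "aggregations": [
          { "alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN" }
        ]
      }
    }
  ]
}
EOF
}

# Usage (assumes the alpha monitoring command group is available in your gcloud):
#   node_memory_policy_json 0.85 > policy.json
#   gcloud alpha monitoring policies create --policy-from-file=policy.json
```

The policy has no notification channels attached, so it would open incidents without paging anyone until channels are added.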

Monitoring is the difference between knowing your cluster is healthy and hoping it is. Re-enable it, enforce it in code, and let your scanner catch the day someone turns it off again.
