Fix GKE Legacy Metadata Endpoints Enabled | Lensix

TL;DR

This check flags GKE node pools that still expose the legacy (v0.1 and v1beta1) metadata server endpoints, which let pods read instance metadata without the Metadata-Flavor request header and open the door to SSRF-based credential theft. Fix it by enabling the GKE Metadata Server (Workload Identity) or recreating node pools with legacy endpoints disabled.

The Compute Engine metadata server is one of the most useful and most dangerous pieces of GCP infrastructure. It hands out instance attributes, SSH keys, and short-lived service account tokens to anything running on the VM that can reach 169.254.169.254. On a GKE node, that includes every pod scheduled onto it. Google has spent years tightening how that metadata is served, and the legacy endpoints are the part they want everyone to stop using.

This Lensix check, gke_metadataendpoints in the gke_checks module, looks at your GKE node pools and reports any that still allow the deprecated legacy metadata API paths. If a node pool has them enabled, a compromised or misbehaving pod has a far easier time pulling the node's identity credentials.

What this check detects

Compute Engine exposes its metadata server over a few different API versions. The two legacy ones are:

http://metadata.google.internal/0.1/meta-data/
http://metadata.google.internal/computeMetadata/v1beta1/

The defining trait of these legacy endpoints is that they do not require the Metadata-Flavor: Google request header. The modern v1 endpoint does. That header requirement sounds trivial, but it is a real security control: it blocks a whole class of server-side request forgery (SSRF) attacks where an attacker tricks an application into fetching a URL but cannot control the request headers.
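A quick way to see the header requirement in action is to probe both API versions from a pod on the node. The following is a minimal sketch, assuming curl is available in the container image; probe_metadata is a hypothetical helper name, not part of any Google tooling.

```shell
# Hypothetical helper: print the HTTP status code the metadata server returns
# for the v1beta1 and v1 API roots, first without and then with the
# Metadata-Flavor header.
probe_metadata() {
  local base="http://metadata.google.internal/computeMetadata"
  local version bare with_header
  for version in v1beta1 v1; do
    # -o /dev/null discards the body; -w prints only the status code.
    bare=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 \
      "$base/$version/")
    with_header=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 \
      -H 'Metadata-Flavor: Google' "$base/$version/")
    echo "$version no-header=$bare with-header=$with_header"
  done
}
```

Run probe_metadata from a shell inside a pod. On a pool with legacy endpoints disabled you would typically expect the no-header requests to be rejected for both versions; a 200 for v1beta1 without the header suggests the legacy endpoint is still live.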

The check inspects the node pool configuration field that controls this behavior. In the GKE API it surfaces as the disable-legacy-endpoints key in the node pool's node config metadata. A pool where that key is false (or unset on older clusters) means the legacy endpoints are reachable from workloads on those nodes.
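To inspect the same setting by hand, you can read each node pool's description with gcloud. This is a rough sketch, not how Lensix itself implements the check: the two function names are hypothetical, and it assumes the disable-legacy-endpoints key appears under the node config metadata in the JSON output.

```shell
# Hypothetical helper: read a node pool description (JSON) on stdin and report
# whether the disable-legacy-endpoints metadata key is set to "true".
legacy_status_from_json() {
  if grep -Eq '"disable-legacy-endpoints":[[:space:]]*"true"'; then
    echo "disabled"
  else
    echo "enabled-or-unset"
  fi
}

# Hypothetical helper: print "<pool> <status>" for every pool in a cluster.
# Args: cluster name, location.
audit_cluster_pools() {
  local cluster="$1" location="$2" pool
  gcloud container node-pools list \
    --cluster="$cluster" --location="$location" --format='value(name)' |
  while read -r pool; do
    printf '%s ' "$pool"
    # </dev/null keeps the inner command from consuming the pool list.
    gcloud container node-pools describe "$pool" \
      --cluster="$cluster" --location="$location" --format=json </dev/null |
      legacy_status_from_json
  done
}
```

Call it as audit_cluster_pools CLUSTER_NAME COMPUTE_LOCATION. Any pool reported as enabled-or-unset is worth a closer look.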

Note: There are two related but separate settings. disable-legacy-endpoints turns off the v0.1 and v1beta1 paths. GKE_METADATA (the GKE Metadata Server, part of Workload Identity) goes further by intercepting metadata requests entirely and only serving the workload's own identity. Enabling Workload Identity sets both correctly.

Why it matters

The risk here is credential theft via the metadata server, and the legacy endpoints make that attack meaningfully easier.

Every GKE node runs as a Compute Engine service account. By default that is the project's default compute service account, which historically carried the broad roles/editor role. Any pod on the node can request a token for that service account from the metadata server. If a pod is compromised, or an application running in it has an SSRF flaw, the attacker can use that token to act as the node's identity across the project.

Here is the concrete escalation path:

1. An attacker finds an SSRF vulnerability in a web app running in a pod (a feature that fetches a user-supplied URL, an image proxy, a webhook tester, and so on).
2. They point it at the metadata server. With the modern v1 endpoint this often fails, because the app's HTTP client will not send the required Metadata-Flavor: Google header.
3. With the legacy endpoints enabled, no header is needed. The request succeeds.
4. They retrieve a service account access token from /computeMetadata/v1beta1/instance/service-accounts/default/token and start calling GCP APIs as the node.

Warning: This is not theoretical. A 2018 Shopify bug bounty report and the 2019 Capital One breach both involved metadata server SSRF on cloud platforms. AWS responded with IMDSv2; GCP's answer is requiring the metadata header and steering everyone toward the GKE Metadata Server. Leaving legacy endpoints on removes that protection.

For regulated environments, an exposed legacy metadata endpoint is also a straightforward audit finding. CIS Google Kubernetes Engine Benchmark recommendations explicitly call for legacy metadata endpoints to be disabled, so this check failing can put you out of compliance.

How to fix it

You cannot toggle disable-legacy-endpoints on a running node pool. Node metadata is fixed at creation time, so remediation means creating a new node pool with the right setting and migrating workloads off the old one.

Option 1: Enable Workload Identity (recommended)

The best fix is to adopt Workload Identity, which enables the GKE Metadata Server. This both disables the legacy endpoints and stops pods from reading the node's service account credentials at all. Pods get their own scoped identity instead.

Enable it at the cluster level first:

gcloud container clusters update CLUSTER_NAME \
  --location=COMPUTE_LOCATION \
  --workload-pool=PROJECT_ID.svc.id.goog

Then create node pools with the GKE Metadata Server:

gcloud container node-pools create secure-pool \
  --cluster=CLUSTER_NAME \
  --location=COMPUTE_LOCATION \
  --workload-metadata=GKE_METADATA

Note: Workload Identity changes how pods authenticate to GCP. Workloads that relied on the node's service account will need a Kubernetes service account bound to a GCP service account via an IAM policy binding. Plan for this so you do not break existing integrations during migration.
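The binding described in the note can be sketched as follows. Treat it as an illustration under assumptions: bind_workload_identity is a hypothetical helper, and the Kubernetes service account and Google service account are assumed to exist already.

```shell
# Hypothetical helper: bind a Kubernetes service account (KSA) to a Google
# service account (GSA) for Workload Identity.
# Args: project ID, namespace, KSA name, GSA email.
bind_workload_identity() {
  local project="$1" namespace="$2" ksa="$3" gsa="$4"
  local member="serviceAccount:${project}.svc.id.goog[${namespace}/${ksa}]"

  # Allow the KSA to impersonate the GSA.
  gcloud iam service-accounts add-iam-policy-binding "$gsa" \
    --role=roles/iam.workloadIdentityUser \
    --member="$member" &&
  # Record on the KSA which GSA it maps to.
  kubectl annotate serviceaccount "$ksa" \
    --namespace="$namespace" \
    "iam.gke.io/gcp-service-account=${gsa}"
}
```

For example: bind_workload_identity PROJECT_ID default app-ksa app-gsa@PROJECT_ID.iam.gserviceaccount.com. Pods then need to run under that Kubernetes service account to pick up the scoped identity.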

Option 2: Disable legacy endpoints only

If you are not ready for full Workload Identity, you can at least disable the legacy endpoints by leaving the workload metadata mode at GCE_METADATA (called EXPOSE in older API versions) and setting disable-legacy-endpoints=true. On current clusters this is the default, but older or imported pools may need it set explicitly. Create a replacement node pool:

gcloud container node-pools create no-legacy-pool \
  --cluster=CLUSTER_NAME \
  --location=COMPUTE_LOCATION \
  --metadata=disable-legacy-endpoints=true

Migrate workloads and delete the old pool

Cordon and drain the old nodes so the scheduler moves pods to the new pool:

# Cordon every node in the old pool
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=OLD_POOL -o name); do
  kubectl cordon "$node"
done

# Drain them one at a time, respecting PodDisruptionBudgets
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=OLD_POOL -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --grace-period=120
done
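Before deleting anything, it helps to confirm that only DaemonSet pods remain on the old nodes. The helper below is a sketch under assumptions: remaining_pods_on_pool is a hypothetical name, and it relies on each pod's first owner reference to recognise DaemonSet pods, which kubectl drain leaves in place.

```shell
# Hypothetical helper: list pods still scheduled on a node pool's nodes,
# skipping pods owned by a DaemonSet.
# Args: node pool name.
remaining_pods_on_pool() {
  local pool="$1" node
  for node in $(kubectl get nodes -l "cloud.google.com/gke-nodepool=${pool}" \
      -o jsonpath='{.items[*].metadata.name}'); do
    kubectl get pods --all-namespaces --no-headers \
      --field-selector "spec.nodeName=${node}" \
      -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'
  done | awk '$3 != "DaemonSet"'   # keep everything except DaemonSet pods
}
```

An empty result from remaining_pods_on_pool OLD_POOL suggests the drain finished; anything listed deserves a look before the pool goes away.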

Danger: The command below permanently deletes a node pool. Confirm all workloads have rescheduled onto the new pool and are healthy before running it. Draining without correct PodDisruptionBudgets can cause an outage for stateful or single-replica services.

gcloud container node-pools delete OLD_POOL \
  --cluster=CLUSTER_NAME \
  --location=COMPUTE_LOCATION

Terraform

If you manage GKE with Terraform, set workload_metadata_config and the metadata map inside the pool's node_config. Use GKE_METADATA for Workload Identity:

resource "google_container_node_pool" "secure_pool" {
  name     = "secure-pool"
  cluster  = google_container_cluster.primary.id
  location = var.location

  node_config {
    workload_metadata_config {
      mode = "GKE_METADATA"
    }

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}

Tip: Changing workload_metadata_config or metadata on an existing pool forces Terraform to recreate it. Add a new pool resource, apply, migrate, then remove the old resource. That gives you a controlled rolling migration instead of an abrupt replacement.

How to prevent it from happening again

Recreating node pools is tedious, so the goal is to never ship one with legacy endpoints in the first place.

Lock it in Terraform modules

If your teams provision GKE through a shared module, hardcode mode = "GKE_METADATA" and disable-legacy-endpoints = "true" in the module rather than exposing them as overridable variables. Make the secure value the only value.

Gate it in CI/CD with policy-as-code

Use Conftest with an OPA/Rego policy to reject plans that create node pools without the GKE Metadata Server. Run it against the JSON plan produced by terraform plan -out=tfplan && terraform show -json tfplan:

package gke.metadata

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "google_container_node_pool"
  # In the plan JSON, workload_metadata_config is nested inside node_config.
  mode := resource.change.after.node_config[_].workload_metadata_config[_].mode
  mode != "GKE_METADATA"
  msg := sprintf("Node pool '%s' must use GKE_METADATA, found '%s'", [resource.address, mode])
}
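Wiring the policy into a pipeline might look like the sketch below. The function name check_plan and the default policy directory are assumptions for illustration. Because the policy above lives in package gke.metadata rather than main, the namespace has to be passed to Conftest explicitly.

```shell
# Hypothetical wrapper: render a Terraform plan as JSON and test it with
# Conftest. Args: directory holding the .rego files (defaults to ./policy).
check_plan() {
  local policy_dir="${1:-policy}"
  # Conftest evaluates only package main unless told otherwise, so name the
  # gke.metadata namespace on the command line.
  terraform plan -out=tfplan &&
    terraform show -json tfplan > tfplan.json &&
    conftest test tfplan.json --policy "$policy_dir" --namespace gke.metadata
}
```

A non-zero exit status from check_plan can then fail the CI job before the plan is applied.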

Enforce at the org level with Policy Controller

For runtime enforcement inside clusters, GKE Policy Controller (the managed Gatekeeper) can block Kubernetes resources that fall outside policy. Node pools themselves are created through the GKE API rather than the Kubernetes API, so pair that with org-level constraints so a noncompliant pool cannot be created even by someone bypassing CI.

Tip: Keep Lensix running the gke_metadataendpoints check on a schedule so any pool created out-of-band, through the console, a script, or a forgotten pipeline, gets flagged before it lingers. Detection and prevention together close the gap that policy-as-code alone leaves open.

Best practices

Adopt Workload Identity everywhere. It is the single highest-value change you can make here. It removes node service account credentials from the pod's reach entirely, so even a working SSRF gets nothing useful.
Stop using the default compute service account. Create dedicated, least-privilege service accounts for node pools and workloads. A stolen token is far less dangerous when it only carries the permissions it actually needs.
Restrict metadata access at the network layer. A NetworkPolicy or the GKE Metadata Server itself can block pod traffic to 169.254.169.254 for workloads that have no business reading metadata.
Treat node pool creation as a security-reviewed action. Because metadata settings cannot be changed in place, the moment of creation is your only easy chance to get them right.
Audit regularly. Clusters drift. Pools get added during incidents, migrations, and experiments. Continuous checks catch what one-time hardening misses.
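For the dedicated service account practice above, a replacement pool could be created along these lines. This is a sketch, not a prescribed setup: the function name is hypothetical, and it assumes roles/container.defaultNodeServiceAccount covers what your nodes need; add or narrow roles to match your workloads.

```shell
# Hypothetical helper: create a dedicated node service account, grant it the
# node role, and create a pool that uses it with the GKE Metadata Server.
# Args: project ID, cluster, location, pool name, service account name.
create_node_pool_with_dedicated_sa() {
  local project="$1" cluster="$2" location="$3" pool="$4" sa_name="$5"
  local email="${sa_name}@${project}.iam.gserviceaccount.com"

  gcloud iam service-accounts create "$sa_name" --project="$project" \
    --display-name="GKE node pool service account" &&
  gcloud projects add-iam-policy-binding "$project" \
    --member="serviceAccount:${email}" \
    --role=roles/container.defaultNodeServiceAccount &&
  gcloud container node-pools create "$pool" \
    --project="$project" --cluster="$cluster" --location="$location" \
    --service-account="$email" \
    --workload-metadata=GKE_METADATA
}
```

With the pool running as its own identity, a token taken from the node carries only the roles granted here rather than whatever the default compute service account holds.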

Disabling legacy metadata endpoints is a small configuration change with an outsized payoff. It closes one of the most reliable cloud privilege escalation paths attackers reach for, and on a modern GKE cluster it is mostly a matter of confirming the secure defaults are actually in place and rebuilding the few pools that slipped through.
