Fix GKE Node Pools Not Using COS | Lensix

TL;DR

This check flags GKE node pools running an OS other than Container-Optimized OS (COS). COS ships locked down by default with automatic security updates and a read-only root filesystem, so switch your node pools to the cos_containerd image type to shrink your attack surface.

Container-Optimized OS, usually shortened to COS, is Google's minimal, hardened operating system built specifically for running containers on Google Kubernetes Engine. When a node pool runs something else, like Ubuntu, you trade away a lot of the security posture Google bakes in by default. This check catches node pools that have drifted from COS, whether through a deliberate choice that was never revisited or a copy-pasted Terraform module that hardcoded the wrong image type.

What this check detects

The gke_nocos check inspects the imageType field on every node pool in your GKE clusters. If the value is anything other than a COS variant (COS_CONTAINERD or the legacy COS with Docker), the node pool is flagged.

In practice the common offenders are:

UBUNTU_CONTAINERD — Ubuntu with containerd
UBUNTU — legacy Ubuntu with Docker
WINDOWS_LTSC or WINDOWS_SAC — Windows Server node pools

Some of these are legitimate. Windows containers obviously need a Windows node pool, and a handful of workloads genuinely require Ubuntu (more on that below). But the default and recommended choice for the vast majority of Linux workloads is COS, and a node pool that is not using it should be a deliberate, documented decision rather than an accident.

Note: COS is built on the open-source Chromium OS project. It uses a minimal package set, mounts the root filesystem read-only, and enables features like verified boot, a locked-down kernel, and automatic security patching out of the box. There is no package manager and no easy way to install arbitrary software on a node, which is exactly the point.

Why it matters

The node OS is the layer directly beneath your containers. If an attacker breaks out of a container or compromises a node, the operating system is what stands between them and the rest of your cluster. COS is designed to make that breakout as painful as possible.

Smaller attack surface

A general-purpose distribution like Ubuntu ships with a large set of packages, a full package manager, and a writable root filesystem. Every one of those is a potential foothold. COS strips this down to the bare minimum needed to run containers. Fewer binaries on disk means fewer vulnerabilities to patch and fewer tools an attacker can use to pivot once they are on the node.

Automatic, kernel-level hardening

COS enables a locked-down kernel configuration and a read-only root mount, and it keeps writable paths such as /etc and /tmp stateless so their contents reset on reboot. Most of the persistent stateful partition is mounted noexec. An attacker who manages to write a malicious binary to the node often cannot execute it or persist it across a reboot, and many of the usual privilege escalation paths are simply not present.

Patch cadence you do not have to manage

With Ubuntu node pools, keeping up with kernel and userland CVEs is on you. COS receives automatic security updates from Google on a regular cadence, and the images are validated against each GKE release. That means less time chasing patch SLAs and a much shorter window between a CVE being published and your fleet being fixed.

Warning: Running an unsupported or stale image type can also trip up compliance frameworks. CIS GKE Benchmark recommendation 6.5.1 explicitly calls for using Container-Optimized OS. If you are audited against CIS, PCI DSS, or a similar standard, non-COS node pools will show up as findings.

A concrete scenario

Imagine a pod running a vulnerable web app gets popped through a known RCE. On an Ubuntu node, the attacker can break out, use apt to pull down their tooling, write a persistent cron job, and start scanning the internal network. On a COS node, there is no apt, the root filesystem rejects writes, and the cron persistence trick fails on reboot. Same initial vulnerability, dramatically different blast radius.

How to fix it

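GKE can switch an existing node pool's image type in place with gcloud container clusters upgrade and the --image-type flag, but doing so recreates every node in the pool. Creating a new node pool on COS, migrating your workloads, and deleting the old pool gives you more control over the cutover and a simple rollback path, so that is the approach described here. This is a standard, well-trodden GKE operation.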

Step 1: Inspect the current node pools

gcloud container node-pools list \
  --cluster=my-cluster \
  --zone=us-central1-a \
  --format="table(name, config.imageType)"

Any row where imageType is not COS_CONTAINERD is a target for migration.
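On a cluster with many pools it can help to print only the offenders. The sketch below is one way to do that, assuming the tab-separated output of gcloud's value() format. The helper name non_cos_pools is made up for illustration and is not part of gcloud.

```shell
#!/usr/bin/env bash
# non_cos_pools is a hypothetical helper name, not a gcloud feature.
# Input:  one "name<TAB>imageType" row per line on stdin.
# Output: only the rows whose image type does not start with COS.
non_cos_pools() {
  awk -F'\t' 'NF >= 2 && toupper($2) !~ /^COS/ { print $1 "\t" $2 }'
}

# Intended use (needs gcloud credentials, so it is left commented out):
# gcloud container node-pools list \
#   --cluster=my-cluster \
#   --zone=us-central1-a \
#   --format="value(name,config.imageType)" | non_cos_pools
```

An empty result means every pool in the cluster already reports a COS image type.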

Step 2: Create a replacement node pool on COS

Match the machine type, size, and labels of the pool you are replacing so workloads schedule cleanly.

gcloud container node-pools create cos-pool \
  --cluster=my-cluster \
  --zone=us-central1-a \
  --image-type=COS_CONTAINERD \
  --machine-type=e2-standard-4 \
  --num-nodes=3 \
  --enable-autoupgrade \
  --enable-autorepair
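To find the values worth matching, you can read them off the old pool first. This is a minimal sketch: describe_pool_shape is a hypothetical wrapper name, ubuntu-pool stands in for whatever your old pool is called, and the projected fields are the ones this guide cares about rather than an exhaustive list.

```shell
#!/usr/bin/env bash
# describe_pool_shape is a hypothetical helper name.
# Arguments: pool name, cluster name, zone.
# Prints the settings worth copying onto the replacement pool as YAML.
describe_pool_shape() {
  gcloud container node-pools describe "$1" \
    --cluster="$2" \
    --zone="$3" \
    --format="yaml(config.machineType,config.diskSizeGb,config.labels,config.taints,initialNodeCount,autoscaling)"
}

# Example call (requires gcloud credentials):
# describe_pool_shape ubuntu-pool my-cluster us-central1-a
```

Carry any labels and taints it reports over to the create command, since workloads with node selectors or tolerations depend on them to schedule.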

Step 3: Cordon and drain the old node pool

Cordoning stops new pods from scheduling onto the old nodes. Draining evicts the running pods so the scheduler moves them onto your new COS pool.

for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=ubuntu-pool -o name); do
  kubectl cordon "$node"
done

for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=ubuntu-pool -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --grace-period=120
done

Warning: Draining evicts pods and can cause brief disruption if your workloads do not have adequate replica counts or a PodDisruptionBudget. Do this during a maintenance window for stateful or single-replica services, and verify your PDBs before you start.
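One way to verify PDBs before draining is to list those that currently allow zero disruptions, since they will stall kubectl drain. The sketch below assumes the three-column kubectl output shown in the comment. blocking_pdbs is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# blocking_pdbs is a hypothetical helper name.
# Input:  "namespace name disruptionsAllowed" rows on stdin.
# Output: namespace/name of each PDB that allows zero voluntary disruptions.
blocking_pdbs() {
  # Compare as a string so "<none>" and blank rows are never treated as 0.
  awk 'NF >= 3 && $3 == "0" { print $1 "/" $2 }'
}

# Intended use (needs cluster access, so it is left commented out):
# kubectl get pdb --all-namespaces --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#   | blocking_pdbs
```

Any PDB it prints needs more healthy replicas, or a temporary adjustment, before the drain can evict its pods.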

Step 4: Delete the old node pool

Danger: Deleting a node pool destroys its nodes. Confirm every pod has been rescheduled onto the COS pool and is healthy before running this. There is no undo.
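A pre-delete check along these lines can confirm that nothing important is still running on the old nodes. It is a sketch under assumptions: leftover_pods and the old-nodes.txt file are hypothetical names, and the column layout is the one produced by the kubectl commands in the comments.

```shell
#!/usr/bin/env bash
# leftover_pods is a hypothetical helper name.
# $1:    file with one old-pool node name per line.
# stdin: "namespace pod ownerKind node" rows, one per pod.
# Prints pods still on an old node, skipping DaemonSet pods and
# node-owned static (mirror) pods, which are expected to stay behind.
leftover_pods() {
  awk '
    FILENAME == ARGV[1] { if (NF) old[$1] = 1; next }
    NF >= 4 && ($4 in old) && $3 != "DaemonSet" && $3 != "Node" {
      print $1 "/" $2 " on " $4
    }
  ' "$1" -
}

# Intended use (needs cluster access, so it is left commented out):
# kubectl get nodes -l cloud.google.com/gke-nodepool=ubuntu-pool \
#   --no-headers -o custom-columns=NAME:.metadata.name > old-nodes.txt
# kubectl get pods --all-namespaces --no-headers \
#   -o 'custom-columns=NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind,NODE:.spec.nodeName' \
#   | leftover_pods old-nodes.txt
```

If it prints nothing, only DaemonSet and static pods remain on the old pool.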

gcloud container node-pools delete ubuntu-pool \
  --cluster=my-cluster \
  --zone=us-central1-a

Fixing it in Terraform

If your clusters are managed with Terraform, set the image type explicitly on the node pool resource. Changing image_type recreates every node in the pool (and, depending on your provider version, may replace the node pool resource itself), so review the plan output and apply the change during a maintenance window.

resource "google_container_node_pool" "primary" {
  name     = "primary-pool"
  cluster  = google_container_cluster.primary.id
  location = "us-central1-a"

  node_config {
    machine_type = "e2-standard-4"
    image_type   = "COS_CONTAINERD"
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }
}

Tip: If you want Terraform to stand up the new COS pool before tearing down the old one, add lifecycle { create_before_destroy = true } to the node pool block and switch from a fixed name to name_prefix, because two node pools in the same cluster cannot share a name. Combined with auto-repair and auto-upgrade, this keeps your fleet on a supported, patched image without manual effort.

How to prevent it from happening again

Migrating once is fine, but you want the next node pool someone creates to be COS by default and a non-COS pool to be rejected unless explicitly justified.

Gate it in CI/CD with policy-as-code

If you provision GKE with Terraform, run a policy check in your pipeline before apply. Here is a Conftest/OPA Rego rule that fails any plan introducing a non-COS image type:

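package gke.imagetype

import rego.v1

# Deny any planned node pool whose image type is not a COS variant.
deny contains msg if {
  resource := input.resource_changes[_]
  resource.type == "google_container_node_pool"
  image := resource.change.after.node_config[_].image_type
  # Upper-case first so "cos_containerd" and "COS_CONTAINERD" both pass.
  not startswith(upper(image), "COS")
  msg := sprintf("Node pool %q uses non-COS image type %q", [resource.address, image])
}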

Run it against a plan output:

terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json
conftest test tfplan.json --policy policy/ --namespace gke.imagetype

Enforce at runtime with admission control

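For a defense-in-depth approach, add a runtime check inside the cluster. Node pools are GCP resources rather than Kubernetes objects, so GKE's built-in Policy Controller or a Gatekeeper constraint cannot reject the node pool API call itself, but it can audit the Node objects a pool creates and flag any that are not on COS. This catches changes made outside your pipeline, like a manual gcloud command or a console click.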
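As a quick manual spot-check of live nodes, you can read the OS distribution label GKE puts on each node. The sketch below assumes the cloud.google.com/gke-os-distribution label is present and that kubectl prints its usual five columns followed by the label column. non_cos_nodes is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# non_cos_nodes is a hypothetical helper name.
# Input:  rows from `kubectl get nodes --no-headers -L <label>`, that is
#         NAME STATUS ROLES AGE VERSION [LABEL-VALUE].
# Output: "node<TAB>distribution" for every node not labelled cos.
non_cos_nodes() {
  awk 'NF >= 5 {
    # A missing label leaves the sixth column empty, so report it explicitly.
    os = (NF >= 6) ? $6 : "unlabeled"
    if (tolower(os) != "cos") print $1 "\t" os
  }'
}

# Intended use (needs cluster access, so it is left commented out):
# kubectl get nodes --no-headers -L cloud.google.com/gke-os-distribution | non_cos_nodes
```

This is a point-in-time check rather than enforcement, so treat it as a complement to the policy controls rather than a replacement.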

Tip: Lensix runs the gke_nocos check continuously across your GCP projects, so a node pool that drifts to Ubuntu via an out-of-band change surfaces as a finding without you having to remember to re-scan. Pair the continuous check with a CI gate and you cover both new infrastructure and live drift.

Best practices

Default to COS_CONTAINERD. Use the containerd variant rather than the older Docker-based COS, which GKE no longer supports from version 1.24 onward. New clusters use COS with containerd by default, so the safest move is to leave it alone.
Keep auto-upgrade and auto-repair on. COS only delivers its full security value when it is current. Auto-upgrade keeps nodes on patched images, and auto-repair replaces unhealthy nodes automatically.
Document any non-COS exception. If you genuinely need Ubuntu, for example for certain GPU driver requirements or kernel modules COS does not support, capture the reason in code comments or a ticket so the choice is auditable and revisited.
Standardize node pools through a module. A shared Terraform module that hardcodes image_type = "COS_CONTAINERD" means individual teams cannot accidentally provision an Ubuntu pool.
Treat node OS as part of your supply chain. The OS beneath your containers is as much a security boundary as the container images themselves. Scan it, patch it, and keep it minimal.

Note: A small number of workloads do legitimately need Ubuntu node pools, particularly some older GPU setups or workloads that depend on kernel modules or system packages not available on COS. These are the exception, not the rule. When you hit one, isolate it to its own dedicated pool rather than switching an entire cluster off COS.

COS is one of those rare security wins that is also the path of least resistance. It is the default, it is supported, and it asks nothing of you beyond not changing it. Getting your node pools onto it removes a meaningful chunk of attack surface for very little ongoing effort.
