Fix Cloud SQL PostgreSQL Checkpoint Logging | Lensix

TL;DR

This check flags Cloud SQL PostgreSQL instances where the log_checkpoints flag is off, leaving you blind to checkpoint activity that signals I/O pressure, write spikes, and recovery behavior. Turn it on by setting the database flag log_checkpoints=on via gcloud, Terraform, or the console.

Checkpoints are one of the most important background operations in PostgreSQL, and also one of the easiest to ignore until they start hurting you. When checkpoint logging is disabled on a Cloud SQL PostgreSQL instance, you lose visibility into how often checkpoints run, how long they take, and how much data they flush. That blind spot makes performance problems harder to diagnose and slows down post-incident analysis.

This Lensix check, sql_pgcheckpoints in the sql_checks module, looks at your GCP Cloud SQL PostgreSQL instances and reports any that do not have the log_checkpoints database flag enabled.

What this check detects

PostgreSQL exposes a configuration parameter called log_checkpoints. When it is on, the database writes a log line every time a checkpoint completes, including statistics like the number of buffers written, the time spent on the sync phase, and the write throughput.

The check inspects the settings.databaseFlags array on each Cloud SQL PostgreSQL instance. If log_checkpoints is missing or set to off, the instance is flagged.

Note: A checkpoint is the point at which PostgreSQL flushes all dirty pages from shared buffers to disk and records that position in the write-ahead log (WAL). After a crash, recovery only has to replay WAL from the last checkpoint forward, so checkpoint frequency directly affects both write amplification and recovery time.

It is worth noting that in PostgreSQL 15 and later, log_checkpoints defaults to on in upstream Postgres. On Cloud SQL the default still depends on the instance configuration and version, so this check is the reliable way to confirm the flag is actually set rather than assumed.

Why it matters

Checkpoint logging is not a direct security control, but it sits squarely in the operational visibility category that incident response and performance tuning both depend on. Here is where the missing data bites you.

You cannot see write pressure building

If checkpoints start firing more frequently than expected, that usually means your workload is generating WAL faster than max_wal_size can absorb between timed checkpoints. Without the log lines, the first symptom you notice is often degraded write latency or I/O saturation, with no breadcrumb trail explaining why.

A healthy log line looks like this:

LOG:  checkpoint complete: wrote 4523 buffers (3.4%);
      0 WAL file(s) added, 0 removed, 12 recycled;
      write=89.123 s, sync=0.034 s, total=89.421 s;
      sync files=58, longest=0.012 s, average=0.001 s;
      distance=196608 kB, estimate=196608 kB

The write and sync timings tell you whether checkpoints are spread out smoothly or hammering the disk in bursts. The distance value tells you how much WAL is being produced. None of this is recoverable after the fact if logging was off.

Slower incident response

When a database has an availability or latency incident, checkpoint logs are part of the standard timeline you reconstruct. Were checkpoints overlapping? Did a checkpoint coincide with the latency spike? Was the instance under sustained write load before the failover? Without these logs you are guessing.

Compliance and audit gaps

Many database logging benchmarks, including the CIS Google Cloud Platform Foundations Benchmark, call out checkpoint and connection logging as expected configuration for production database instances. A disabled flag is an easy finding for an auditor and an easy fix for you.

Warning: Checkpoint log lines are written to PostgreSQL logs, which flow into Cloud Logging. On a very busy instance this adds a steady trickle of log volume. It is small compared to query logging, but if you have aggressive log retention or export to BigQuery, factor in the marginal cost before enabling it fleet-wide.

How to fix it

Enabling checkpoint logging means setting the log_checkpoints database flag to on. This flag does not require a restart on Cloud SQL PostgreSQL, so the change applies cleanly with no downtime.

Option 1: gcloud CLI

First, check the current flags so you do not accidentally overwrite existing ones:

gcloud sql instances describe INSTANCE_NAME \
  --project=PROJECT_ID \
  --format="value(settings.databaseFlags)"

Warning: The --database-flags argument replaces the entire flag set. If your instance already has flags like log_connections or max_connections, list them all in a single command or you will silently remove them.

Now apply log_checkpoints alongside any existing flags:

gcloud sql instances patch INSTANCE_NAME \
  --project=PROJECT_ID \
  --database-flags=log_checkpoints=on,log_connections=on

Confirm the change took effect:

gcloud sql instances describe INSTANCE_NAME \
  --project=PROJECT_ID \
  --format="json(settings.databaseFlags)"

Option 2: Google Cloud Console

Open SQL in the Google Cloud Console and select your PostgreSQL instance.
Click Edit.
Expand the Flags section.
Click Add a database flag.
Select log_checkpoints and set the value to on.
Click Save. The flag applies without a restart.

Option 3: Terraform

If you manage Cloud SQL with Terraform, add the flag to the settings block. Define every flag you want, since Terraform also manages the full set:

resource "google_sql_database_instance" "postgres" {
  name             = "my-postgres-instance"
  database_version = "POSTGRES_15"
  region           = "us-central1"

  settings {
    tier = "db-custom-2-7680"

    database_flags {
      name  = "log_checkpoints"
      value = "on"
    }

    database_flags {
      name  = "log_connections"
      value = "on"
    }
  }
}

terraform plan
terraform apply

Tip: If you run many instances, define your baseline database flags once in a shared Terraform module or a locals block and reuse it. That way log_checkpoints and friends ship with every new instance by default, and the fix becomes a one-time change instead of a recurring chore.

How to prevent it from happening again

One-off remediation closes the finding for a single instance. Stopping it from reappearing on the next instance someone spins up requires guardrails in your provisioning pipeline.

Gate it in CI with policy-as-code

If you use Terraform, an OPA/Conftest policy can reject any Cloud SQL PostgreSQL instance plan that does not include log_checkpoints=on. Here is a Rego rule that fails the plan when the flag is missing or off:

package cloudsql.logging

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "google_sql_database_instance"
  flags := resource.change.after.settings[_].database_flags
  not has_checkpoint_logging(flags)
  msg := sprintf("Cloud SQL instance '%s' must set log_checkpoints=on", [resource.name])
}

has_checkpoint_logging(flags) {
  flag := flags[_]
  flag.name == "log_checkpoints"
  flag.value == "on"
}

Wire it into your pipeline against a saved plan:

terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
conftest test tfplan.json --policy ./policies

Enforce with Organization Policy and continuous scanning

Policy-as-code in CI catches Terraform changes, but it does not catch a flag flipped by hand in the console. Pair the CI gate with continuous scanning so drift gets caught wherever it originates. Lensix runs the sql_pgcheckpoints check on a schedule across all your projects, so an instance created outside your IaC pipeline still gets flagged.

Tip: Treat database flags as a single managed baseline. Bundle log_checkpoints, log_connections, log_disconnections, and log_lock_waits together. They are all cheap to enable, they all improve observability, and managing them as one block keeps your fleet consistent.

Best practices

Enable checkpoint logging everywhere, not just production. Staging instances are where you reproduce performance issues, and they need the same visibility.
Tune checkpoints, do not just log them. Once you can see frequency, adjust max_wal_size and checkpoint_completion_target so checkpoints spread their writes smoothly instead of spiking I/O.
Watch for "checkpoints occurring too frequently" warnings. If your logs show checkpoints triggered by WAL volume rather than the timed interval, your max_wal_size is too small for the workload.
Pair logging flags as a set. log_checkpoints, log_connections, and log_lock_waits together give you a solid baseline of operational visibility with negligible overhead.
Route logs somewhere you will actually read them. Checkpoint lines flow into Cloud Logging. Build a log-based metric or alert on checkpoint frequency so the data drives action instead of sitting unread.

Checkpoint logging is a small flag with an outsized payoff. It costs almost nothing, it applies without downtime, and it gives you the timeline you will desperately want the next time write latency climbs at 3 a.m. Flip it on, bake it into your IaC baseline, and let continuous scanning keep it that way.

PostgreSQL Checkpoint Logging Disabled on Cloud SQL: Why It Matters and How to Fix It