Cloud SQL No Automatic Failover: Why Zonal Databases Are a Risk

Learn why GCP Cloud SQL instances without automatic failover risk outages, and how to enable regional high availability with gcloud, Terraform, and policy-as-code.

TL;DR

This check flags Cloud SQL instances running in a single zone with no automatic failover, which means a zonal outage takes your database down with no automatic recovery. Switch the instance to regional (HA) availability with gcloud sql instances patch INSTANCE --availability-type=REGIONAL.

Databases are usually the one thing in your stack you cannot afford to lose. A stateless web tier can be rebuilt in seconds, but the moment your primary database goes dark, every service that depends on it starts failing. Cloud SQL makes high availability easy to turn on, which is exactly why running a production instance without it is such an avoidable mistake.

The Cloud SQL No Automatic Failover check looks for instances configured with zonal availability. These instances live in a single Google Cloud zone and have no standby to fail over to. If that zone has a problem, your database is unavailable until the zone recovers or you manually rebuild.


What this check detects

Lensix flags any Cloud SQL instance (MySQL, PostgreSQL, or SQL Server) where the availabilityType is set to ZONAL instead of REGIONAL.

Cloud SQL offers two availability configurations:

  • Zonal: A single instance in a single zone. No standby. If the zone or the instance fails, the database is down until it comes back or you intervene.
  • Regional (HA): A primary instance plus a synchronously replicated standby in a separate zone within the same region. Cloud SQL automatically promotes the standby if the primary becomes unhealthy.

Note: Regional availability uses synchronous replication to a standby that you cannot read from or connect to directly. It exists solely to take over during a failover. This is different from read replicas, which serve read traffic but do not provide automatic failover on their own.

When the check fires, the instance has no second zone to fall back on, so recovery from a zonal incident is a manual, time-consuming process rather than an automatic failover.
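
To make the condition concrete, here is a minimal Python sketch of the same test applied to the JSON that gcloud sql instances list --format=json prints. It is an illustration only, not how Lensix implements the check: the function name and sample data are hypothetical, and treating a missing availabilityType as zonal is an assumption made here to err on the safe side.

```python
import json


def find_zonal_instances(instances_json: str) -> list[str]:
    """Return the names of instances that are not configured as REGIONAL."""
    flagged = []
    for inst in json.loads(instances_json):
        settings = inst.get("settings", {})
        # Assumption: a missing availabilityType is treated as zonal.
        if settings.get("availabilityType", "ZONAL") != "REGIONAL":
            flagged.append(inst["name"])
    return flagged


# Hypothetical sample shaped like `gcloud sql instances list --format=json`.
sample = json.dumps([
    {"name": "prod-db", "settings": {"availabilityType": "REGIONAL"}},
    {"name": "orders-db", "settings": {"availabilityType": "ZONAL"}},
])
print(find_zonal_instances(sample))  # prints ['orders-db']
```

In practice you would pipe real gcloud output into the function instead of the inline sample.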


Why it matters

Google Cloud zones do fail. Hardware faults, power events, network partitions, and maintenance can all take a single zone offline. When that happens to a zonal Cloud SQL instance, here is what your operations team faces:

  • Full outage of the database tier. Every application connected to the instance loses its data layer at once.
  • No automatic recovery. Someone has to notice, diagnose, and start a recovery, often a restore from backup into a new instance.
  • Potential data loss. If you are recovering from a backup or a point-in-time restore, you can lose any writes that happened after the last recovery point.
  • Long RTO. Restoring a large database can take minutes to hours, all of which is downtime your users feel.

Compare that to a regional instance, where failover is automatic and typically completes in under a minute with no data loss, since replication is synchronous. The standby is always current.

Warning: Routine maintenance also matters here. Google occasionally restarts instances to apply patches, and on a zonal instance that restart is a hard downtime window. Do not assume a standby removes it: how much a regional configuration shortens maintenance downtime depends on the Cloud SQL edition, so check the maintenance documentation for yours and schedule maintenance windows either way.

There is also a compliance angle. Frameworks like SOC 2, ISO 27001, and PCI DSS expect you to demonstrate resilience and recovery controls for systems holding sensitive data. A production database with no failover and an undocumented manual recovery path is a finding waiting to happen during an audit.


How to fix it

The fix is to switch the instance from zonal to regional availability. This adds a synchronously replicated standby and enables automatic failover.

Before you start

Regional availability requires automated backups to be enabled, along with binary logging on MySQL or point-in-time recovery on PostgreSQL. Confirm these settings before you patch, and enable any that are missing first rather than relying on the patch command to turn them on for you.

Warning: Converting to regional availability roughly doubles the compute cost of the instance, since you are now running a standby in a second zone. The standby does not serve traffic, so you pay for capacity you cannot read from. Budget for this before flipping production instances.

Option 1: gcloud CLI

First, confirm the current availability type:

gcloud sql instances describe INSTANCE_NAME \
  --format="value(settings.availabilityType)"

If it returns ZONAL, patch it to regional:

Danger: Changing the availability type triggers an operation that may briefly restart the instance and cause a short connection drop. Run this during a maintenance window for production databases, and make sure your applications retry transient connection failures.

gcloud sql instances patch INSTANCE_NAME \
  --availability-type=REGIONAL

Verify the change took effect:

gcloud sql instances describe INSTANCE_NAME \
  --format="value(settings.availabilityType, settings.backupConfiguration.enabled)"

You should see REGIONAL and True.
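
If you would rather script the describe, patch, and verify steps than type them, a minimal Python sketch could look like the following. It uses only the gcloud commands shown above plus --format=json so the output is easy to parse. The helper names are hypothetical, error handling is left out, and ensure_regional assumes gcloud is installed and authenticated.

```python
import json
import subprocess


def describe_command(instance: str) -> list[str]:
    """Build the describe call, asking for JSON so the result can be parsed."""
    return ["gcloud", "sql", "instances", "describe", instance, "--format=json"]


def patch_command(instance: str) -> list[str]:
    """Build the patch call that switches the instance to regional (HA)."""
    return ["gcloud", "sql", "instances", "patch", instance,
            "--availability-type=REGIONAL"]


def is_regional_with_backups(describe_json: str) -> bool:
    """True when the instance is REGIONAL and automated backups are enabled."""
    settings = json.loads(describe_json).get("settings", {})
    backups = settings.get("backupConfiguration", {}).get("enabled", False)
    return settings.get("availabilityType") == "REGIONAL" and bool(backups)


def ensure_regional(instance: str) -> bool:
    """Patch the instance if it is not REGIONAL, then verify the result."""
    def describe() -> str:
        result = subprocess.run(describe_command(instance), check=True,
                                capture_output=True, text=True)
        return result.stdout

    settings = json.loads(describe()).get("settings", {})
    if settings.get("availabilityType") != "REGIONAL":
        # gcloud may ask for confirmation here; the prompt is left visible
        # on purpose so a human approves the restart.
        subprocess.run(patch_command(instance), check=True)
    return is_regional_with_backups(describe())
```

Calling ensure_regional("INSTANCE_NAME") returns True once the instance reports REGIONAL with backups enabled. The command builders are kept separate from the subprocess calls so they can be reviewed or logged before anything runs.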

Option 2: Google Cloud Console

  1. Open SQL in the Cloud Console and select the instance.
  2. Click Edit.
  3. Expand the Availability section.
  4. Select Multiple zones (Highly available).
  5. Click Save and confirm.

Option 3: Terraform

If you manage Cloud SQL with Terraform, set availability_type to REGIONAL and make sure backups and point-in-time recovery are on:

resource "google_sql_database_instance" "main" {
  name             = "prod-db"
  database_version = "POSTGRES_15"
  region           = "us-central1"

  settings {
    tier              = "db-custom-2-7680"
    availability_type = "REGIONAL"

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      transaction_log_retention_days = 7
    }
  }

  deletion_protection = true
}

Apply with a plan first so you can see the change before it runs:

terraform plan -out=tfplan
terraform apply tfplan

Tip: After enabling HA, test failover deliberately rather than waiting for a real incident. You can trigger a controlled failover with gcloud sql instances failover INSTANCE_NAME in a non-production environment to confirm your application reconnects cleanly and your monitoring alerts fire.
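
One way to put a number on a controlled failover is to poll the database while it happens and time the gap. The sketch below is a hedged example of that idea: probe is a hypothetical callable you supply (for instance, one that opens a connection and runs SELECT 1, returning False on any error), and the measured outage is only as precise as the polling interval.

```python
import time
from typing import Callable, Optional


def measure_outage(probe: Callable[[], bool],
                   timeout_s: float = 300.0,
                   interval_s: float = 1.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Poll `probe` and time the gap between first failure and next success.

    Returns 0.0 if no failure is observed before the timeout, and None if
    the probe is still failing when the timeout expires.
    """
    start = clock()
    down_since = None
    while clock() - start < timeout_s:
        ok = probe()
        now = clock()
        if not ok and down_since is None:
            down_since = now            # first failed probe: outage begins
        elif ok and down_since is not None:
            return now - down_since     # first success after the outage
        sleep(interval_s)
    return 0.0 if down_since is None else None
```

Start the poller, trigger the failover from another terminal, and compare the returned duration with what your alerts and application logs reported.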


How to prevent it from happening again

Fixing one instance is easy. Making sure no one ships a zonal production database next quarter is the real win. A few layers help here.

Enforce it with Terraform policy-as-code

If you use Terraform, gate plans in CI with a tool like Conftest (built on OPA) or Sentinel. Here is a Rego policy that fails any plan that would create or update a Cloud SQL instance with zonal availability:

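package cloudsql.availability

# Opt in to the Rego v1 keywords (contains, if) so the rule parses on
# recent OPA and Conftest releases as well as older ones that support this import.
import rego.v1

# Deny any planned Cloud SQL instance whose resulting state is ZONAL.
# On destroy, change.after is null, so the rule does not fire.
deny contains msg if {
  resource := input.resource_changes[_]
  resource.type == "google_sql_database_instance"
  resource.change.after.settings[_].availability_type == "ZONAL"
  msg := sprintf("Cloud SQL instance '%s' must use REGIONAL availability", [resource.change.after.name])
}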

Wire it into your pipeline so the check runs against the plan JSON:

terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json
conftest test tfplan.json --policy policy/
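
To sanity-check what the policy matches before wiring it into CI, the same rule can be mirrored in a few lines of Python run against tfplan.json. This is a sketch for understanding and local testing, not a replacement for Conftest; the function name and the sample plan are hypothetical, and only the plan fields the Rego rule already reads are used.

```python
import json


def zonal_violations(plan_json: str) -> list[str]:
    """Return one message per planned Cloud SQL instance that ends up ZONAL."""
    plan = json.loads(plan_json)
    messages = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "google_sql_database_instance":
            continue
        # `after` is null when the resource is being destroyed.
        after = (rc.get("change") or {}).get("after") or {}
        for settings in after.get("settings") or []:
            if settings.get("availability_type") == "ZONAL":
                messages.append(
                    f"Cloud SQL instance '{after.get('name')}' "
                    "must use REGIONAL availability"
                )
    return messages


# Hypothetical, heavily trimmed plan JSON for illustration.
sample_plan = json.dumps({"resource_changes": [
    {"type": "google_sql_database_instance",
     "change": {"after": {"name": "dev-db",
                          "settings": [{"availability_type": "ZONAL"}]}}},
    {"type": "google_sql_database_instance",
     "change": {"after": {"name": "prod-db",
                          "settings": [{"availability_type": "REGIONAL"}]}}},
    {"type": "google_sql_database_instance", "change": {"after": None}},
]})
print(zonal_violations(sample_plan))
```

Only dev-db is reported here: prod-db is regional, and the third entry represents a destroy, which has no resulting state to check.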

Catch drift continuously with Lensix

Policy-as-code only covers resources created through your pipeline. Anything created manually, by another team, or through a forgotten script slips past it. Lensix scans your live GCP environment and re-runs the sql_nofailover check on a schedule, so a zonal instance created out-of-band gets flagged whether or not it went through Terraform.

Tip: Pair preventive controls (policy-as-code in CI) with detective controls (continuous scanning). The first stops the obvious mistakes at merge time, the second catches the things that never touched your pipeline.

Set a sane default in your modules

If you maintain a shared Cloud SQL Terraform module, default availability_type to REGIONAL and require teams to explicitly opt out for non-production. Making the safe choice the default removes the most common path to a misconfiguration.


Best practices

Automatic failover is one piece of a resilient database setup. Pair it with the rest:

  • Use regional availability for every production instance. Zonal is fine for dev and ephemeral environments where downtime is acceptable.
  • Enable automated backups and point-in-time recovery. Failover protects against zone loss, but backups protect against data corruption, accidental deletes, and bad migrations. They solve different problems.
  • Add read replicas for read scaling and cross-region DR. The HA standby lives in the same region as the primary, so it does not help if the whole region has a problem. A cross-region read replica gives you a recovery option in that case.
  • Test failover regularly. An untested failover is an assumption, not a control. Run controlled failovers and confirm your application reconnects and your alerts fire.
  • Make applications resilient to short connection drops. Even an automatic failover involves a brief reconnection. Use connection pooling and retry logic so a sub-minute failover does not surface as a user-facing error.
  • Enable deletion protection. A highly available database is no good if someone deletes it by accident. Set deletion_protection = true on production instances.
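
For the retry advice above, a minimal sketch of exponential backoff with jitter is shown below. The helper name and defaults are hypothetical, and real database drivers raise their own exception classes, so pass those in retry_on rather than relying on the built-in ConnectionError and TimeoutError used here. Tune attempts and delays so the total retry budget comfortably covers a failover.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(operation: Callable[[], T],
                 attempts: int = 8,
                 base_delay_s: float = 0.5,
                 max_delay_s: float = 10.0,
                 retry_on: tuple = (ConnectionError, TimeoutError),
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying transient failures with capped backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Double the delay ceiling each attempt, capped at max_delay_s,
            # and sleep a random amount below it to avoid reconnect storms.
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise ValueError("attempts must be at least 1")
```

Wrap only idempotent operations, such as opening a connection or running a read, so a retry cannot apply the same write twice.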

Availability is cheap to configure and expensive to retrofit during an incident. Turn it on before you need it, then test it so you know it works.

Flipping an instance from zonal to regional is a few minutes of work and roughly doubles the compute cost. Weigh that against the cost of an unplanned database outage, the manual recovery, and the lost data, and the math is straightforward for anything you would call production.