Failure Modes

Note

In previous versions of CloudNativePG, this page included specific failure scenarios. Since these largely follow standard Kubernetes behavior, we have streamlined the content to avoid duplication of information that belongs to the underlying Kubernetes stack and is not specific to CloudNativePG.

CloudNativePG adheres to standard Kubernetes principles for self-healing and high availability. We assume familiarity with core Kubernetes concepts such as storage classes, PVCs, nodes, and Pods. For CloudNativePG-specific details, refer to the "Postgres Instance Manager" section, which covers startup, liveness, and readiness probes, as well as the self-healing section below.

Important

If you are running CloudNativePG in production, we strongly recommend seeking professional support.

Self-Healing

Primary Failure

If the primary Pod fails:

  • The operator promotes the most up-to-date standby with the lowest replication lag.
  • The -rw service is updated to point to the new primary.
  • The failed Pod is removed from the -r and -rw services.
  • Standby Pods begin replicating from the new primary.
  • The former primary uses pg_rewind to re-synchronize if its PVC is available; otherwise, a new standby is created from a backup of the new primary.

Standby Failure

If a standby Pod fails:

  • It is removed from the -r and -ro services.
  • The Pod is restarted using its PVC if available; otherwise, a new Pod is created from a backup of the current primary.
  • Once ready, the Pod is re-added to the -r and -ro services.

Manual Intervention

For failure scenarios not covered by automated recovery, manual intervention may be required.

Important

Do not perform manual operations without professional support.

Disabling Reconciliation

To temporarily disable the reconciliation loop for a PostgreSQL cluster, use the cnpg.io/reconciliationLoop annotation:

metadata:
  name: cluster-example-no-reconcile
  annotations:
    cnpg.io/reconciliationLoop: "disabled"
spec:
  # ...

Use this annotation with extreme caution and only during emergency operations.

Warning

This annotation should be removed as soon as the issue is resolved. Leaving it in place prevents the operator from executing self-healing actions, including failover.