Version: Devel

Failure Modes

Note

In previous versions of CloudNativePG, this page described specific failure scenarios. Since these largely follow standard Kubernetes behavior, we have streamlined the content to avoid duplicating information that belongs to the underlying Kubernetes stack rather than to CloudNativePG.

CloudNativePG adheres to standard Kubernetes principles for self-healing and high availability. We assume familiarity with core Kubernetes concepts such as storage classes, PVCs, nodes, and Pods. For CloudNativePG-specific details, refer to the "Postgres Instance Manager" section, which covers the startup, liveness, and readiness probes, and to the "Self-Healing" section below.

Important

If you are running CloudNativePG in production, we strongly recommend seeking professional support.

Self-Healing

Primary Failure

If the primary Pod fails:

The operator promotes the most up-to-date standby, that is, the one with the lowest replication lag.
The -rw service is updated to point to the new primary.
The failed Pod is removed from the -r and -rw services.
Standby Pods begin replicating from the new primary.
The former primary uses pg_rewind to re-synchronize if its PVC is available; otherwise, a new standby is created from a backup of the new primary.
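
As a minimal sketch of a setup in which this failover path can apply, the manifest below declares a Cluster with three instances, so that two standbys exist as promotion candidates. The name cluster-example and the storage size are illustrative assumptions, not values prescribed by this page.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example   # illustrative name (assumption)
spec:
  instances: 3            # one primary plus two standbys
  storage:
    size: 1Gi             # illustrative size (assumption)
```

With this name, the services mentioned above would be cluster-example-rw, cluster-example-ro, and cluster-example-r. Applications that connect through cluster-example-rw reach the new primary once they reconnect after a promotion.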

Standby Failure

If a standby Pod fails:

It is removed from the -r and -ro services.
The Pod is restarted using its PVC if available; otherwise, a new Pod is created from a backup of the current primary.
Once ready, the Pod is re-added to the -r and -ro services.
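
For read-only workloads, the following fragment sketches one way a client could be pointed at the -ro service, so that its connections only reach standbys that are currently part of that service. The container name, image, and cluster name (cluster-example) are assumptions made for illustration.

```yaml
# Fragment of a Pod template for a hypothetical read-only client
containers:
  - name: reporting-app                      # hypothetical name
    image: example.com/reporting-app:latest  # hypothetical image
    env:
      - name: PGHOST
        value: cluster-example-ro            # -ro service of the assumed cluster
      - name: PGPORT
        value: "5432"
```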

Manual Intervention

For failure scenarios not covered by automated recovery, manual intervention may be required.

Important

Do not perform manual operations without professional support.

Disabling Reconciliation

The cnpg.io/reconciliationLoop annotation allows you to temporarily disable the reconciliation loop for CloudNativePG resources. When it is set to "disabled", the operator stops processing updates for the annotated resource, which prevents any automated changes or self-healing actions.

Use this annotation with extreme caution and only during emergency operations.

Warning

This annotation should be removed as soon as the issue is resolved. Leaving it in place prevents the operator from managing the annotated resource. On a Cluster, this includes self-healing actions and failover.

The following resources support this annotation:

Cluster: Disables reconciliation of the PostgreSQL cluster
Backup: Disables reconciliation of backup operations

Example usage:

metadata:
  name: cluster-example-no-reconcile
  annotations:
    cnpg.io/reconciliationLoop: "disabled"
spec:
  # ...
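
The same annotation can be sketched on a Backup resource. The resource name and the referenced cluster name below are illustrative assumptions.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: backup-example-no-reconcile   # illustrative name (assumption)
  annotations:
    cnpg.io/reconciliationLoop: "disabled"
spec:
  cluster:
    name: cluster-example             # illustrative cluster name (assumption)
```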
