Failure Modes
In previous versions of CloudNativePG, this page included specific failure scenarios. Since these largely follow standard Kubernetes behavior, we have streamlined the content to avoid duplication of information that belongs to the underlying Kubernetes stack and is not specific to CloudNativePG.
CloudNativePG adheres to standard Kubernetes principles for self-healing and high availability. We assume familiarity with core Kubernetes concepts such as storage classes, PVCs, nodes, and Pods. For CloudNativePG-specific details, refer to the "Postgres Instance Manager" section, which covers startup, liveness, and readiness probes, as well as the self-healing section below.
If you are running CloudNativePG in production, we strongly recommend seeking professional support.
Self-Healing
Primary Failure
If the primary Pod fails:
- The operator promotes the most up-to-date standby with the lowest replication lag.
- The -rw service is updated to point to the new primary.
- The failed Pod is removed from the -r and -rw services.
- Standby Pods begin replicating from the new primary.
- The former primary uses pg_rewind to re-synchronize if its PVC is available; otherwise, a new standby is created from a backup of the new primary.
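Automated promotion only helps when at least one standby exists. As a minimal sketch, a Cluster with three instances (one primary and two standbys) could be declared as follows. The name cluster-example and the storage size are illustrative assumptions, not values prescribed by this page.

```yaml
# Illustrative sketch: name and storage size are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  # One primary plus two standbys, so a standby is available for promotion.
  instances: 3
  storage:
    size: 1Gi
```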
Standby Failure
If a standby Pod fails:
- It is removed from the -r and -ro services.
- The Pod is restarted using its PVC if available; otherwise, a new Pod is created from a backup of the current primary.
- Once ready, the Pod is re-added to the -r and -ro services.
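Because the operator maintains the service endpoints, applications that connect through the services rather than to individual Pods should not need reconfiguration after a failover. The sketch below assumes a cluster named cluster-example and that each service is named by appending the suffix to the cluster name. The ConfigMap name and its keys are hypothetical and exist only for illustration.

```yaml
# Hypothetical ConfigMap; only the service suffixes come from this page.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-db-endpoints
data:
  # Read-write traffic: always routed to the current primary.
  DB_HOST_RW: cluster-example-rw
  # Read-only traffic: routed to standbys only.
  DB_HOST_RO: cluster-example-ro
  # Read traffic: routed to any instance.
  DB_HOST_R: cluster-example-r
```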
Manual Intervention
For failure scenarios not covered by automated recovery, manual intervention may be required.
Do not perform manual operations without professional support.
Disabling Reconciliation
The cnpg.io/reconciliationLoop annotation allows you to temporarily disable
the reconciliation loop for CloudNativePG resources. When set to "disabled",
the operator will stop processing updates for the annotated resource, preventing
any automated changes or self-healing actions.
Use this annotation with extreme caution and only during emergency operations.
This annotation should be removed as soon as the issue is resolved. Leaving it in place prevents the operator from managing the annotated resource. On a Cluster, this includes self-healing actions and failover.
The following resources support this annotation:
- Cluster: Disables reconciliation of the PostgreSQL cluster
- Backup: Disables reconciliation of backup operations
Example usage:
metadata:
  name: cluster-example-no-reconcile
  annotations:
    cnpg.io/reconciliationLoop: "disabled"
spec:
  # ...
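The same annotation applies to a Backup resource. The following is a sketch only: the Backup name and the referenced cluster name are assumptions chosen for illustration.

```yaml
# Illustrative sketch: resource names are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: backup-example-no-reconcile
  annotations:
    # Remove this annotation as soon as the emergency is resolved.
    cnpg.io/reconciliationLoop: "disabled"
spec:
  cluster:
    name: cluster-example
```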