Failure Modes

This section provides an overview of the major failure scenarios that PostgreSQL can face on a Kubernetes cluster during its lifetime.

Important

In case the failure scenario you are experiencing is not covered by this section, please immediately seek for professional support.

Postgres instance manager

Please refer to the "Postgres instance manager" section for more information the liveness and readiness probes implemented by CloudNativePG.

Storage space usage

The operator will instantiate one PVC for every PostgreSQL instance to store the PGDATA content. A second PVC dedicated to the WAL storage will be provisioned in case .spec.walStorage is specified during cluster initialization.

Such storage space is set for reuse in two cases:

  • when the corresponding Pod is deleted by the user (and a new Pod will be recreated)
  • when the corresponding Pod is evicted and scheduled on another node

If you want to prevent the operator from reusing a certain PVC you need to remove the PVC before deleting the Pod. For this purpose, you can use the following command:

kubectl delete -n [namespace] pvc/[cluster-name]-[serial] pod/[cluster-name]-[serial]

Note

If you specified a dedicated WAL volume, it will also have to be deleted during this process.

kubectl delete -n [namespace] pvc/[cluster-name]-[serial] pvc/[cluster-name]-[serial]-wal pod/[cluster-name]-[serial]

For example:

$ kubectl delete -n default pvc/cluster-example-1 pvc/cluster-example-1-wal pod/cluster-example-1
persistentvolumeclaim "cluster-example-1" deleted
persistentvolumeclaim "cluster-example-1-wal" deleted
pod "cluster-example-1" deleted

Failure modes

A pod belonging to a Cluster can fail in the following ways:

  • the pod is explicitly deleted by the user;
  • the readiness probe on its postgres container fails;
  • the liveness probe on its postgres container fails;
  • the Kubernetes worker node is drained;
  • the Kubernetes worker node where the pod is scheduled fails.

Each one of these failures has different effects on the Cluster and the services managed by the operator.

Pod deleted by the user

The operator is notified of the deletion. A new pod belonging to the Cluster will be automatically created reusing the existing PVC, if available, or starting from a physical backup of the primary otherwise.

Important

In case of deliberate deletion of a pod, PodDisruptionBudget policies will not be enforced.

Self-healing will happen as soon as the apiserver is notified.

You can trigger a sudden failure on a given pod of the cluster using the following generic command:

kubectl delete -n [namespace] \
  pod/[cluster-name]-[serial] --grace-period=1

For example, if you want to simulate a real failure on the primary and trigger the failover process, you can run:

kubectl delete pod [primary pod] --grace-period=1

Warning

Never use --grace-period=0 in your failover simulation tests, as this might produce misleading results with your PostgreSQL cluster. A grace period of 0 guarantees that the pod is immediately removed from the Kubernetes API server, without first ensuring that the PID 1 process of the postgres container (the instance manager) is shut down - contrary to what would happen in case of a real failure (e.g. unplug the power cord cable or network partitioning). As a result, the operator doesn't see the pod of the primary anymore, and triggers a failover promoting the most aligned standby, without the guarantee that the primary had been shut down.

Readiness probe failure

After 3 failures, the pod will be considered not ready. The pod will still be part of the Cluster, no new pod will be created.

If the cause of the failure can't be fixed, it is possible to delete the pod manually. Otherwise, the pod will resume the previous role when the failure is solved.

Self-healing will happen after three failures of the probe.

Liveness probe failure

After 3 failures, the postgres container will be considered failed. The pod will still be part of the Cluster, and the kubelet will try to restart the container. If the cause of the failure can't be fixed, it is possible to delete the pod manually.

Self-healing will happen after three failures of the probe.

Worker node drained

The pod will be evicted from the worker node and removed from the service. A new pod will be created on a different worker node from a physical backup of the primary if the reusePVC option of the nodeMaintenanceWindow parameter is set to off (default: on during maintenance windows, off otherwise).

The PodDisruptionBudget may prevent the pod from being evicted if there is at least another pod that is not ready.

Note

Single instance clusters prevent node drain when reusePVC is set to false. Refer to the Kubernetes Upgrade section.

Self-healing will happen as soon as the apiserver is notified.

Worker node failure

Since the node is failed, the kubelet won't execute the liveness and the readiness probes. The pod will be marked for deletion after the toleration seconds configured by the Kubernetes cluster administrator for that specific failure cause. Based on how the Kubernetes cluster is configured, the pod might be removed from the service earlier.

A new pod will be created on a different worker node from a physical backup of the primary. The default value for that parameter in a Kubernetes cluster is 5 minutes.

Self-healing will happen after tolerationSeconds.

Self-healing

If the failed pod is a standby, the pod is removed from the -r service and from the -ro service. The pod is then restarted using its PVC if available; otherwise, a new pod will be created from a backup of the current primary. The pod will be added again to the -r service and to the -ro service when ready.

If the failed pod is the primary, the operator will promote the active pod with status ready and the lowest replication lag, then point the -rw service to it. The failed pod will be removed from the -r service and from the -rw service. Other standbys will start replicating from the new primary. The former primary will use pg_rewind to synchronize itself with the new one if its PVC is available; otherwise, a new standby will be created from a backup of the current primary.

Manual intervention

In the case of undocumented failure, it might be necessary to intervene to solve the problem manually.

Important

In such cases, please do not perform any manual operation without professional support.

From version 1.11.0 of the operator, you can use the cnpg.io/reconciliationLoop annotation to temporarily disable the reconciliation loop on a selected PostgreSQL cluster, as follows:

metadata:
  name: cluster-example-no-reconcile
  annotations:
    cnpg.io/reconciliationLoop: "disabled"
spec:
  # ...

The cnpg.io/reconciliationLoop must be used with extreme care and for the sole duration of the extraordinary/emergency operation.

Warning

Please make sure that you use this annotation only for a limited period of time and you remove it when the emergency has finished. Leaving this annotation in a cluster will prevent the operator from issuing any self-healing operation, such as a failover.