Postgres instance manager
CloudNativePG does not rely on an external tool for failover management. Instead, it relies on the Kubernetes API server and a key native component: the Postgres instance manager.
The instance manager takes care of the entire lifecycle of the PostgreSQL server process (also known as postmaster).
When you create a new cluster, the operator creates one Pod per instance. The .spec.instances field specifies how many instances to create.
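For instance, a minimal sketch of a three-instance cluster looks like the following (the cluster name and storage size are placeholders for this example):
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 3   # the operator creates three Pods: one primary and two standbys
  storage:
    size: 1Gi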
Each Pod will start the instance manager as the parent process (PID 1) for the main container, which in turn runs the PostgreSQL instance. During the lifetime of the Pod, the instance manager acts as a backend to handle the startup, liveness and readiness probes.
Startup, liveness and readiness probes
The startup and liveness probes rely on pg_isready, while the readiness probe checks if the database is up and able to accept connections.
Startup Probe
The .spec.startDelay parameter specifies the delay (in seconds) before the liveness probe activates after a PostgreSQL Pod starts. By default, this is set to 3600 seconds. You should adjust this value based on the time PostgreSQL requires to fully initialize in your environment.
Warning
Setting .spec.startDelay too low can cause the liveness probe to activate prematurely, potentially resulting in unnecessary Pod restarts if PostgreSQL hasn't fully initialized.
CloudNativePG configures the startup probe with the following default parameters:
failureThreshold: FAILURE_THRESHOLD
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
Here, FAILURE_THRESHOLD is calculated as startDelay divided by periodSeconds.
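With the defaults, this yields a failureThreshold of 360 (3600 / 10), giving PostgreSQL up to one hour to complete startup. As a minimal sketch, a cluster that initializes quickly could simply lower the delay instead of redefining the probe (the value 300 below is an illustrative assumption):
# ... snip
spec:
  startDelay: 300   # startup probe failureThreshold becomes 300 / 10 = 30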
If the default behavior based on startDelay is not suitable for your use case, you can take full control of the startup probe by specifying custom parameters in the .spec.probes.startup stanza. Note that defining this stanza will override the default behavior, including the use of startDelay.
Warning
Ensure that any custom probe settings are aligned with your cluster’s operational requirements to prevent unintended disruptions.
Info
For detailed information about probe configuration, refer to the probe API.
For example, the following configuration bypasses startDelay entirely:
# ... snip
spec:
  probes:
    startup:
      periodSeconds: 3
      timeoutSeconds: 3
      failureThreshold: 10
Liveness Probe
The liveness probe begins after the startup probe succeeds and is responsible for detecting whether the PostgreSQL instance has entered a broken state that requires a restart of the Pod.
The amount of time before a Pod is classified as not alive is configurable via the .spec.livenessProbeTimeout parameter.
CloudNativePG configures the liveness probe with the following default parameters:
failureThreshold: FAILURE_THRESHOLD
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
Here, FAILURE_THRESHOLD is calculated as livenessProbeTimeout divided by periodSeconds.
By default, .spec.livenessProbeTimeout is set to 30 seconds. This means the liveness probe will report a failure if it detects three consecutive probe failures, with a 10-second interval between each check.
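As a minimal sketch, you can make the probe more tolerant by raising the timeout rather than redefining the whole probe (the value 60 below is an illustrative assumption):
# ... snip
spec:
  livenessProbeTimeout: 60   # liveness probe failureThreshold becomes 60 / 10 = 6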
If the default behavior using livenessProbeTimeout does not meet your needs, you can fully customize the liveness probe by defining parameters in the .spec.probes.liveness stanza. Keep in mind that specifying this stanza will override the default behavior, including the use of livenessProbeTimeout.
Warning
Ensure that any custom probe settings are aligned with your cluster’s operational requirements to prevent unintended disruptions.
Info
For more details on probe configuration, refer to the probe API.
For example, the following configuration overrides the default behavior and bypasses livenessProbeTimeout:
# ... snip
spec:
  probes:
    liveness:
      periodSeconds: 3
      timeoutSeconds: 3
      failureThreshold: 10
Readiness Probe
The readiness probe determines when a Pod running a PostgreSQL instance is ready to accept traffic and serve requests.
CloudNativePG uses the following default configuration for the readiness probe:
failureThreshold: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
If the default settings do not suit your requirements, you can fully customize the readiness probe by specifying parameters in the .spec.probes.readiness stanza. For example:
# ... snip
spec:
  probes:
    readiness:
      periodSeconds: 3
      timeoutSeconds: 3
      failureThreshold: 10
Warning
Ensure that any custom probe settings are aligned with your cluster’s operational requirements to prevent unintended disruptions.
Info
For more information on configuring probes, see the probe API.
Shutdown control
When a Pod running Postgres is deleted, either manually or by Kubernetes
following a node drain operation, the kubelet will send a termination signal to the
instance manager, and the instance manager will take care of shutting down
PostgreSQL in an appropriate way.
The .spec.smartShutdownTimeout and .spec.stopDelay options, expressed in seconds, control the amount of time given to PostgreSQL to shut down. The values default to 180 and 1800 seconds, respectively.
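As a minimal sketch, both options can be set directly in the cluster specification (the values below are illustrative assumptions, not recommendations):
# ... snip
spec:
  smartShutdownTimeout: 60   # up to 60 seconds for the smart shutdown phase
  stopDelay: 600             # overall time given to PostgreSQL to shut down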
The shutdown procedure is composed of two steps:
- The instance manager requests a smart shut down, disallowing any new connection to PostgreSQL. This step will last for up to .spec.smartShutdownTimeout seconds.
- If PostgreSQL is still up, the instance manager requests a fast shut down, terminating any existing connection and exiting promptly. If the instance is archiving and/or streaming WAL files, the process will wait for up to the remaining time set in .spec.stopDelay to complete the operation and then forcibly shut down. Such a timeout needs to be at least 15 seconds.
Important
To avoid any data loss in the Postgres cluster, which would impact the database RPO, don't delete the Pod where the primary instance is running. If you need to remove it, perform a switchover to another instance first.
Shutdown of the primary during a switchover
During a switchover, the shutdown procedure is slightly different from the general case: the operator requires the former primary to issue a fast shut down before the selected new primary can be promoted, in order to ensure that all the data is available on the new primary.
For this reason, the .spec.switchoverDelay option, expressed in seconds, controls the time given to the former primary to shut down gracefully and archive all the WAL files. By default it is set to 3600 (1 hour).
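As a minimal sketch, a shorter delay can be set directly in the cluster specification (the value 300 below is an illustrative assumption; weigh it against the warning that follows):
# ... snip
spec:
  switchoverDelay: 300   # give the former primary up to 5 minutes to shut down and archive WALs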
Warning
The .spec.switchoverDelay option affects the RPO and RTO of your PostgreSQL database. Setting it to a low value might favor RTO over RPO, but can lead to data loss at cluster level and/or backup level. Conversely, setting it to a high value might remove the risk of data loss while leaving the cluster without an active primary for a longer time during the switchover.
Failover
If the primary Pod fails, the cluster goes into failover mode. Please refer to the "Failover" section for details.
Disk Full Failure
Storage exhaustion is a well-known issue for PostgreSQL clusters. The PostgreSQL documentation highlights the possible failure scenarios and the importance of monitoring disk usage to prevent the disk from becoming full.
The same applies to CloudNativePG and Kubernetes: the "Monitoring" section provides details on checking the disk space used by WAL segments, as well as standard metrics on disk usage exported to Prometheus.
Important
In a production system, it is critical to monitor the database continuously. Exhausted disk storage can lead to a database server shutdown.
Note
The detection of exhausted storage relies on a storage class that accurately reports disk size and usage. This may not be the case in simulated Kubernetes environments like Kind or with test storage class implementations such as csi-driver-host-path.
If the disk containing the WALs becomes full and no more WAL segments can be stored, PostgreSQL will stop working. CloudNativePG correctly detects this issue by verifying that there is enough space to store the next WAL segment, and avoids triggering a failover, which could complicate recovery.
This allows a human administrator to address the root cause.
In such a case, if supported by the storage class, the quickest course of action
is currently to:
1. Expand the storage size of the full PVC.
2. Increase the size in the Cluster resource to the same value.
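As a minimal sketch, assuming a storage class that supports volume expansion, step 1 means raising spec.resources.requests.storage on the affected PVC, and step 2 means aligning the Cluster resource with the new size (the 10Gi value below is a placeholder):
# ... snip
spec:
  storage:
    size: 10Gi   # must match the size of the expanded PVC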
Once the issue is resolved and there is sufficient free space for WAL segments, the Pod will restart and the cluster will become healthy.
See also the "Volume expansion" section of the documentation.