feat(majorupgrade): Allow image rollbacks on failed major upgrades by redbaron · Pull Request #9344 · cloudnative-pg/cloudnative-pg

redbaron · 2025-12-01T22:28:13Z

Unless PGData version was updated (as recorded by the status), allow image rollback to the previous version.
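
The rule in the description can be sketched as a small predicate. This is only an illustration under stated assumptions, not the operator's code: `imageChangeAllowed` and its parameters are hypothetical names, and treating same-major or forward changes as always acceptable is an assumption made to keep the example short.

```go
package main

import "fmt"

// imageChangeAllowed is a hypothetical helper, not part of CloudNativePG.
//   dataMajor         - major version of the data directory, as recorded in the status
//   currentImageMajor - major version of the image currently configured
//   requestedMajor    - major version of the image the user now asks for
func imageChangeAllowed(dataMajor, currentImageMajor, requestedMajor int) bool {
	if requestedMajor >= currentImageMajor {
		// Assumption: same-major changes and forward upgrades are handled elsewhere.
		return true
	}
	// Going backwards is only safe while the data directory is still on the
	// version being rolled back to, i.e. pg_upgrade never completed.
	return requestedMajor == dataMajor
}

func main() {
	// Upgrade 16 -> 17 failed: the data directory is still on 16.
	fmt.Println(imageChangeAllowed(16, 17, 16)) // true
	// Upgrade 16 -> 17 succeeded: the data directory is on 17.
	fmt.Println(imageChangeAllowed(17, 17, 16)) // false
}
```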

Fixes #9128

github-actions · 2025-12-01T22:28:25Z

❗ By default, the pull request is configured to backport to all release branches.

To stop backporting this PR, remove the label: backport-requested ◀️ or add the label 'do not backport'
To stop backporting this PR to a certain release branch, remove the specific branch label: release-x.y

NiccoloFei · 2026-02-27T19:13:59Z

/test

github-actions · 2026-02-27T19:14:12Z

@NiccoloFei, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22500207879

Unless PGData version was updated (as recorded by the status), allow image rollback to the previous version.

Signed-off-by: Maxim Ivanov <[email protected]>

Signed-off-by: Niccolò Fei <[email protected]>

NiccoloFei · 2026-03-02T13:15:42Z

/test

github-actions · 2026-03-02T13:15:53Z

@NiccoloFei, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22577660156

NiccoloFei · 2026-03-02T17:36:02Z

/test ft=postgres-major-upgrade

github-actions · 2026-03-02T17:36:52Z

@NiccoloFei, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22587955136

Set BackoffLimit=0 on the major upgrade job so Kubernetes does not retry a failed pg_upgrade (retries won't produce a different result and the retry pods hold the primary PVC, blocking recovery).

When the upgrade job exists but has not completed, the reconciler now checks whether the user rolled back the image to the previous major version. If so, it deletes the job with foreground propagation and requeues, allowing the cluster to restart on the original version without manual intervention.

Move the majorupgrade.Reconcile call above the running-jobs guard in reconcileResources so the reconciler can act on failed upgrade jobs that would otherwise block the controller indefinitely.

Inspired by Maxim Ivanov's initial rollback approach in #9344.

Signed-off-by: Marco Nenciarini <[email protected]>
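
The decision the reconciler makes for an unfinished upgrade job can be sketched as below. This is a simplified model under stated assumptions, not the code from the commit: `upgradeState`, `action`, and `decide` are hypothetical names, the Kubernetes API calls (job deletion with foreground propagation, requeue) are reduced to return values, and what happens once the job has completed is outside the scope of the sketch.

```go
package main

import "fmt"

// action is a hypothetical label for what the reconciler does next.
type action string

const (
	// actionContinue: no unfinished upgrade job is in the way (not detailed here).
	actionContinue action = "continue"
	// actionWait: the job is unfinished and the user still requests the new major.
	actionWait action = "wait"
	// actionDeleteJobAndRequeue: the user reverted the image; the real operator
	// deletes the job with foreground propagation and requeues.
	actionDeleteJobAndRequeue action = "delete-job-and-requeue"
)

// upgradeState is a hypothetical, flattened view of what the reconciler inspects.
type upgradeState struct {
	JobExists      bool
	JobCompleted   bool
	PreviousMajor  int // major version the cluster ran before the upgrade attempt
	RequestedMajor int // major version of the image currently requested by the user
}

// decide returns the next step for a given state. Because the job is created
// with BackoffLimit=0, a failed job stays failed and is never retried, so
// "exists but not completed" covers both running and failed jobs.
func decide(s upgradeState) action {
	if !s.JobExists || s.JobCompleted {
		return actionContinue
	}
	if s.RequestedMajor == s.PreviousMajor {
		return actionDeleteJobAndRequeue
	}
	return actionWait
}

func main() {
	failed := upgradeState{JobExists: true, PreviousMajor: 16, RequestedMajor: 17}
	fmt.Println(decide(failed)) // wait

	failed.RequestedMajor = 16 // the user reverts the image
	fmt.Println(decide(failed)) // delete-job-and-requeue
}
```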

Add a rollback scenario to the major upgrade E2E suite. The test creates a cluster with the pgvector extension, attempts an upgrade to a minimal image that lacks the extension (causing pg_upgrade to fail), then reverts the image and verifies the operator automatically cleans up the failed job and the cluster recovers on the original version with its timeline unchanged.

Based on Niccolò Fei's E2E test work in #9344.

Co-authored-by: Niccolò Fei <[email protected]>
Signed-off-by: Marco Nenciarini <[email protected]>

mnencia · 2026-03-03T11:04:56Z

The E2E tests added by @NiccoloFei showed that the initial approach did not work because of a race between the job recreating its pod and the rollback. I experimented with a more radical approach: make the job run only once (BackoffLimit=0) and automatically delete the job when the user reverts the image. See #10104.