feat(majorupgrade): auto-cleanup failed upgrade job on image rollback #10104

Merged
mnencia merged 4 commits into main from fix/major-upgrade-rollback-cleanup
Mar 9, 2026

@mnencia
Member

@mnencia mnencia commented Mar 2, 2026

Set BackoffLimit=0 on the major upgrade job so Kubernetes does not retry a failed pg_upgrade. When the user reverts the image after a failed upgrade, the operator automatically deletes the failed job and lets the cluster restart on the original version.

Inspired by Maxim Ivanov's initial rollback approach in #9344.

Closes #9128

@cnpg-bot cnpg-bot added the backport-requested ◀️ (This pull request should be backported to all supported releases), release-1.25, release-1.27, and release-1.28 labels Mar 2, 2026
@github-actions
Contributor

github-actions bot commented Mar 2, 2026

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this PR, remove the label 'backport-requested ◀️' or add the label 'do not backport'
  • To stop backporting this PR to a certain release branch, remove the specific branch label: release-x.y

@mnencia mnencia force-pushed the fix/major-upgrade-rollback-cleanup branch from 7163f55 to a91e27c Compare March 2, 2026 22:03
@mnencia
Member Author

mnencia commented Mar 2, 2026

/test

@github-actions
Contributor

github-actions bot commented Mar 2, 2026

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22597783147

@mnencia mnencia force-pushed the fix/major-upgrade-rollback-cleanup branch 3 times, most recently from fce1be5 to 2d29c47 Compare March 3, 2026 10:25
@mnencia
Member Author

mnencia commented Mar 3, 2026

/test ft=postgres-major-upgrade

@github-actions
Contributor

github-actions bot commented Mar 3, 2026

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22619106064

@cnpg-bot cnpg-bot added the ok to merge 👌 This PR can be merged label Mar 3, 2026
@mnencia
Member Author

mnencia commented Mar 3, 2026

/test

@mnencia mnencia marked this pull request as ready for review March 3, 2026 10:56
@mnencia mnencia requested review from a team, NiccoloFei, jsilvela and litaocdl as code owners March 3, 2026 10:56
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Mar 3, 2026
@github-actions
Contributor

github-actions bot commented Mar 3, 2026

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22619835859

@dosubot dosubot bot added the enhancement 🪄 New feature or request label Mar 3, 2026
@NiccoloFei NiccoloFei force-pushed the fix/major-upgrade-rollback-cleanup branch 3 times, most recently from 97882f5 to ba49825 Compare March 6, 2026 11:50
@NiccoloFei
Collaborator

/test ft=postgres-major-upgrade

@github-actions
Contributor

github-actions bot commented Mar 6, 2026

@NiccoloFei, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22762269483

@NiccoloFei
Collaborator

/test

@github-actions
Contributor

github-actions bot commented Mar 6, 2026

@NiccoloFei, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22764323799

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 6, 2026
@armru armru force-pushed the fix/major-upgrade-rollback-cleanup branch 2 times, most recently from d94d0d3 to 69879b5 Compare March 6, 2026 13:58
@armru armru force-pushed the fix/major-upgrade-rollback-cleanup branch 2 times, most recently from 27df5ec to 3b7d9ff Compare March 6, 2026 14:02
mnencia and others added 4 commits March 9, 2026 09:04
Set BackoffLimit=0 on the major upgrade job so Kubernetes does not
retry a failed pg_upgrade (retries won't produce a different result
and the retry pods hold the primary PVC, blocking recovery).
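To make the effect of BackoffLimit=0 concrete, here is a minimal sketch in Go. It does not use the operator's code or the real Kubernetes client types: `JobSpec`, `newMajorUpgradeJobSpec`, and `allowedRetries` are hypothetical stand-ins that model only the one field discussed above. The default of 6 retries when the field is unset is the documented Kubernetes default.

```go
package main

import "fmt"

// JobSpec is a hypothetical, trimmed-down stand-in for the Kubernetes
// batch/v1 JobSpec. Only the field discussed in this commit is modeled.
type JobSpec struct {
	// BackoffLimit is the number of retries allowed before the Job is
	// marked as failed. A nil pointer means "use the Kubernetes default".
	BackoffLimit *int32
}

// kubernetesDefaultBackoffLimit is the value Kubernetes applies when
// backoffLimit is not set on a Job.
const kubernetesDefaultBackoffLimit int32 = 6

// newMajorUpgradeJobSpec is an illustrative helper (an assumption, not the
// operator's function): it builds a spec that forbids any retry.
func newMajorUpgradeJobSpec() JobSpec {
	noRetries := int32(0)
	return JobSpec{BackoffLimit: &noRetries}
}

// allowedRetries reports how many retries a spec permits, applying the
// default when the field is unset.
func allowedRetries(spec JobSpec) int32 {
	if spec.BackoffLimit == nil {
		return kubernetesDefaultBackoffLimit
	}
	return *spec.BackoffLimit
}

func main() {
	fmt.Println("unset backoff limit, retries allowed:", allowedRetries(JobSpec{}))
	fmt.Println("major upgrade job, retries allowed:", allowedRetries(newMajorUpgradeJobSpec()))
}
```

The pointer matters here: an explicit zero has to be distinguishable from "unset", otherwise the default would silently re-enable retries.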

When the upgrade job exists but has not completed, the reconciler now
checks whether the user rolled back the image to the previous major
version. If so it deletes the job with foreground propagation and
requeues, allowing the cluster to restart on the original version
without manual intervention.
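The decision described in the paragraph above can be sketched as a small pure function. This is a simplified model under stated assumptions, not the reconciler's actual code: `UpgradeState`, `Action`, and `decide` are hypothetical names, and comparing two major version numbers is only one plausible way to express "the user rolled back the image to the previous major version".

```go
package main

import "fmt"

// Action is a hypothetical label for what the reconciler does next.
type Action string

const (
	// ActionNone means this sketch has nothing to say (paths not modeled).
	ActionNone Action = "none"
	// ActionWait means the upgrade job is still relevant and is left alone.
	ActionWait Action = "wait-for-job"
	// ActionDeleteJobAndRequeue means: delete the job with foreground
	// propagation, then requeue the reconciliation.
	ActionDeleteJobAndRequeue Action = "delete-job-foreground-and-requeue"
)

// UpgradeState is an assumed snapshot of what the reconciler observes.
type UpgradeState struct {
	JobExists      bool
	JobCompleted   bool
	OriginalMajor  int // major version the cluster ran before the upgrade
	RequestedMajor int // major version of the image currently requested
}

// decide mirrors the rule in the commit message: only an existing,
// not-yet-completed job is eligible for cleanup, and only when the
// requested image is back on the original major version.
func decide(s UpgradeState) Action {
	switch {
	case !s.JobExists || s.JobCompleted:
		return ActionNone
	case s.RequestedMajor == s.OriginalMajor:
		return ActionDeleteJobAndRequeue
	default:
		return ActionWait
	}
}

func main() {
	rolledBack := UpgradeState{JobExists: true, OriginalMajor: 16, RequestedMajor: 16}
	stillUpgrading := UpgradeState{JobExists: true, OriginalMajor: 16, RequestedMajor: 17}
	fmt.Println(decide(rolledBack))
	fmt.Println(decide(stillUpgrading))
}
```

Foreground propagation is the relevant choice for the delete because the job's pods should be gone before the cluster restarts; as the first commit notes, those pods hold the primary PVC.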

Move the majorupgrade.Reconcile call above the running-jobs guard in
reconcileResources so the reconciler can act on failed upgrade jobs
that would otherwise block the controller indefinitely.
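Why the ordering matters can be shown with a toy reconcile loop. This is a sketch under assumptions, not the real reconcileResources: the `cluster` and `step` types, the job name "major-upgrade", and both step functions are hypothetical and exist only to show how a guard placed first can starve a later step.

```go
package main

import "fmt"

// cluster is a hypothetical, minimal view of the reconciled state.
type cluster struct {
	unfinishedJobs  map[string]bool // job name -> exists and has not completed
	imageRolledBack bool
}

// step is one stage of the toy loop; run returns true when the loop
// must stop and requeue.
type step struct {
	name string
	run  func(c *cluster) bool
}

// majorUpgradeStep deletes the failed upgrade job after a rollback.
func majorUpgradeStep(c *cluster) bool {
	if c.unfinishedJobs["major-upgrade"] && c.imageRolledBack {
		delete(c.unfinishedJobs, "major-upgrade")
		return true // requeue once the job is gone
	}
	return false
}

// runningJobsGuard stops the loop while any job is unfinished.
func runningJobsGuard(c *cluster) bool {
	return len(c.unfinishedJobs) > 0
}

// reconcile runs the steps in order and returns the name of the step
// that stopped the loop, or "" if every step let it continue.
func reconcile(c *cluster, steps []step) string {
	for _, s := range steps {
		if s.run(c) {
			return s.name
		}
	}
	return ""
}

func main() {
	guard := step{"running-jobs-guard", runningJobsGuard}
	upgrade := step{"major-upgrade", majorUpgradeStep}

	// Guard first: the failed job blocks every pass.
	blocked := &cluster{unfinishedJobs: map[string]bool{"major-upgrade": true}, imageRolledBack: true}
	fmt.Println("guard first, stopped by:", reconcile(blocked, []step{guard, upgrade}))

	// Upgrade step first: the job is cleaned up, and the next pass proceeds.
	fixed := &cluster{unfinishedJobs: map[string]bool{"major-upgrade": true}, imageRolledBack: true}
	fmt.Println("upgrade first, stopped by:", reconcile(fixed, []step{upgrade, guard}))
	fmt.Println("jobs left after cleanup:", len(fixed.unfinishedJobs))
}
```

With the guard first, every pass stops at the guard and the cleanup step never runs. With the upgrade step first, one pass deletes the job and the following pass gets past the guard.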

Inspired by Maxim Ivanov's initial rollback approach in #9344.

Signed-off-by: Marco Nenciarini <[email protected]>

Add a rollback scenario to the major upgrade E2E suite. The test
creates a cluster with the pgvector extension, attempts an upgrade
to a minimal image that lacks the extension (causing pg_upgrade to
fail), then reverts the image and verifies the operator automatically
cleans up the failed job and the cluster recovers on the original
version with its timeline unchanged.

Based on Niccolò Fei's E2E test work in #9344.

Co-authored-by: Niccolò Fei <[email protected]>
Signed-off-by: Marco Nenciarini <[email protected]>

Reflect that the operator now automatically detects an image rollback
and deletes the failed upgrade job. Users only need to revert the
image — manual job deletion is no longer required.

Signed-off-by: Marco Nenciarini <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
@mnencia mnencia force-pushed the fix/major-upgrade-rollback-cleanup branch from 3b7d9ff to aafd3dc Compare March 9, 2026 08:04
@mnencia mnencia merged commit 5b7b799 into main Mar 9, 2026
39 of 42 checks passed
@mnencia mnencia deleted the fix/major-upgrade-rollback-cleanup branch March 9, 2026 08:17
mnencia added a commit that referenced this pull request Mar 9, 2026
…#10104)

Set BackoffLimit=0 on the major upgrade job so Kubernetes does not retry
a failed pg_upgrade. When the user reverts the image after a failed
upgrade, the operator automatically deletes the failed job and lets the
cluster restart on the original version.

Inspired by Maxim Ivanov's initial rollback approach in #9344.

Closes #9128

Signed-off-by: Marco Nenciarini <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
Co-authored-by: Niccolò Fei <[email protected]>
Co-authored-by: Armando Ruocco <[email protected]>
(cherry picked from commit 5b7b799)

Labels

  • backport-requested ◀️: This pull request should be backported to all supported releases
  • enhancement 🪄: New feature or request
  • lgtm: This PR has been approved by a maintainer
  • ok to merge 👌: This PR can be merged
  • release-1.27
  • release-1.28
  • size:L: This PR changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[Bug]: Major upgrade rollback doesn't work as expected

5 participants