feat(majorupgrade): Allow image rollbacks on failed major upgrades #9344

Closed
redbaron wants to merge 3 commits into cloudnative-pg:main from redbaron:major-upgrade-failure-rollback

Conversation

redbaron (Contributor) commented Dec 1, 2025

Unless PGData version was updated (as recorded by the status), allow image rollback to the previous version.

Fixes #9128
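
The rule in the description above can be sketched as a small validation helper. This is a minimal illustration and not the code from this PR: `imageChangeAllowed`, its parameters, and the premise that a lower major version is otherwise rejected are hypothetical simplifications.

```go
package main

import (
	"errors"
	"fmt"
)

// errDowngrade is returned when a lower major version is requested and it
// does not match the PGData major version recorded in the status.
var errDowngrade = errors.New("major version downgrade not allowed: PGData already upgraded")

// imageChangeAllowed is a hypothetical helper (not the operator's API).
//
//	currentImageMajor: major version of the image currently in the spec
//	requestedMajor:    major version of the image the user wants to set
//	pgDataMajor:       major version of PGData as recorded in the status
func imageChangeAllowed(currentImageMajor, requestedMajor, pgDataMajor int) error {
	// Same major or an upgrade: nothing special to check in this sketch.
	if requestedMajor >= currentImageMajor {
		return nil
	}
	// Going back is accepted only while PGData is still on the requested
	// (previous) major, i.e. the upgrade never rewrote the data directory.
	if requestedMajor == pgDataMajor {
		return nil
	}
	return errDowngrade
}

func main() {
	// Upgrade 16 -> 17 failed, PGData still on 16: reverting to 16 is accepted.
	fmt.Println(imageChangeAllowed(17, 16, 16))
	// Upgrade completed, PGData now on 17: reverting to 16 is rejected.
	fmt.Println(imageChangeAllowed(17, 16, 17))
}
```

The version numbers 16 and 17 are only example values; the point is that the status, not the spec, decides whether going back is still safe.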

redbaron requested a review from a team as a code owner December 1, 2025 22:28
dosubot bot added the size:S label (this PR changes 10-29 lines, ignoring generated files) Dec 1, 2025
cnpg-bot added the backport-requested ◀️ (this pull request should be backported to all supported releases), release-1.25, release-1.26 and release-1.27 labels Dec 1, 2025
github-actions bot (Contributor) commented Dec 1, 2025

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this PR, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this PR to a certain release branch, remove the specific branch label: release-x.y

dosubot bot added the enhancement 🪄 label (new feature or request) Dec 1, 2025
redbaron force-pushed the major-upgrade-failure-rollback branch 2 times, most recently from a2bdc0c to 625135e, December 1, 2025 22:30
NiccoloFei force-pushed the major-upgrade-failure-rollback branch 2 times, most recently from 8891664 to 0abcc65, February 25, 2026 16:08
NiccoloFei force-pushed the major-upgrade-failure-rollback branch 2 times, most recently from a01219d to ea3e0a6, February 27, 2026 14:53
dosubot bot added the size:L label (this PR changes 100-499 lines, ignoring generated files) and removed the size:S label Feb 27, 2026
NiccoloFei force-pushed the major-upgrade-failure-rollback branch from 275df96 to ee198a9, February 27, 2026 19:10
NiccoloFei (Collaborator) commented

/test

github-actions bot (Contributor) commented

@NiccoloFei, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22500207879

redbaron and others added 2 commits March 2, 2026 14:14
Unless PGData version was updated (as recorded by the status), allow image rollback to the previous version.

Signed-off-by: Maxim Ivanov <[email protected]>
NiccoloFei force-pushed the major-upgrade-failure-rollback branch from ee198a9 to fc490ee, March 2, 2026 13:15
NiccoloFei (Collaborator) commented

/test

github-actions bot (Contributor) commented Mar 2, 2026

@NiccoloFei, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22577660156

NiccoloFei (Collaborator) commented

/test ft=postgres-major-upgrade

github-actions bot (Contributor) commented Mar 2, 2026

@NiccoloFei, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22587955136

cnpg-bot added the ok to merge 👌 label (this PR can be merged) Mar 2, 2026
mnencia added a commit that referenced this pull request Mar 2, 2026
Set BackoffLimit=0 on the major upgrade job so Kubernetes does not
retry a failed pg_upgrade (retries won't produce a different result
and the retry pods hold the primary PVC, blocking recovery).

When the upgrade job exists but has not completed, the reconciler now
checks whether the user rolled back the image to the previous major
version. If so it deletes the job with foreground propagation and
requeues, allowing the cluster to restart on the original version
without manual intervention.

Move the majorupgrade.Reconcile call above the running-jobs guard in
reconcileResources so the reconciler can act on failed upgrade jobs
that would otherwise block the controller indefinitely.

Inspired by Maxim Ivanov's initial rollback approach in #9344.

Signed-off-by: Marco Nenciarini <[email protected]>
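
The control flow described in the commit message above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the operator's code: the `State` fields, the `Action` values, and the `hasUnfinishedJobs` helper are hypothetical, and the real reconciler works on Kubernetes objects rather than on plain booleans.

```go
package main

import "fmt"

// Action is the hypothetical outcome of one reconciliation pass.
type Action string

const (
	Proceed          Action = "proceed"            // nothing blocks normal reconciliation
	WaitForJob       Action = "wait-for-job"       // a job is still pending, check again later
	DeleteJobRequeue Action = "delete-job-requeue" // user rolled back: clean up and requeue
)

// State is a simplified, hypothetical snapshot of what the reconciler observes.
type State struct {
	UpgradeJobExists    bool
	UpgradeJobCompleted bool
	RequestedMajor      int  // major version of the image in the cluster spec
	PGDataMajor         int  // major version recorded in the cluster status
	OtherJobsRunning    bool // any unrelated job that has not finished
}

// reconcileMajorUpgrade models the rollback check from the commit message.
func reconcileMajorUpgrade(s State) Action {
	if !s.UpgradeJobExists || s.UpgradeJobCompleted {
		return Proceed
	}
	// The job exists but has not completed: did the user revert the image?
	if s.RequestedMajor == s.PGDataMajor {
		// The commit message says the job is deleted with foreground
		// propagation at this point; the sketch only reports the decision.
		return DeleteJobRequeue
	}
	return WaitForJob
}

// hasUnfinishedJobs models the running-jobs guard. A failed upgrade job
// counts as unfinished, which is why the guard alone would block forever.
func hasUnfinishedJobs(s State) bool {
	return s.OtherJobsRunning || (s.UpgradeJobExists && !s.UpgradeJobCompleted)
}

// reconcileResources runs the major upgrade check before the guard, so the
// rollback branch is reachable even while the failed job still exists.
func reconcileResources(s State) Action {
	if a := reconcileMajorUpgrade(s); a != Proceed {
		return a
	}
	if hasUnfinishedJobs(s) {
		return WaitForJob
	}
	return Proceed
}

func main() {
	// Failed upgrade 16 -> 17, image reverted to 16: the job is cleaned up.
	s := State{UpgradeJobExists: true, RequestedMajor: 16, PGDataMajor: 16}
	fmt.Println(reconcileResources(s))
}
```

If the guard ran first in this model, the failed upgrade job would keep returning `WaitForJob` and the rollback branch would never be reached, which is the blocking behavior the commit message describes.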
mnencia added a commit that referenced this pull request Mar 2, 2026
Add a rollback scenario to the major upgrade E2E suite. The test
creates a cluster with the pgvector extension, attempts an upgrade
to a minimal image that lacks the extension (causing pg_upgrade to
fail), then reverts the image and verifies the operator automatically
cleans up the failed job and the cluster recovers on the original
version with its timeline unchanged.

Based on Niccolò Fei's E2E test work in #9344.

Co-authored-by: Niccolò Fei <[email protected]>
Signed-off-by: Marco Nenciarini <[email protected]>
mnencia (Member) commented Mar 3, 2026

The E2E tests added by @NiccoloFei highlighted that the initial approach was not working due to the race between the job recreating the pod and the rollback. I experimented with a more radical approach by making the job run only once (BackoffLimit=0) and automatically handling the job deletion when the user reverts the image: #10104
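
To make the retry behavior behind that race concrete, here is a small sketch of how many pods a Job may create before Kubernetes marks it failed. It assumes `restartPolicy: Never` and relies on the documented Kubernetes default of 6 for `backoffLimit`; the function name `maxPodAttempts` is hypothetical.

```go
package main

import "fmt"

// defaultBackoffLimit is the Kubernetes default for a Job's spec.backoffLimit.
const defaultBackoffLimit int32 = 6

// maxPodAttempts returns how many pods a Job may create before it is marked
// failed, assuming restartPolicy: Never. A nil limit means "use the default".
func maxPodAttempts(backoffLimit *int32) int32 {
	limit := defaultBackoffLimit
	if backoffLimit != nil {
		limit = *backoffLimit
	}
	return limit + 1 // the first attempt plus up to `limit` retries
}

func main() {
	zero := int32(0)
	fmt.Println("default:", maxPodAttempts(nil))
	fmt.Println("BackoffLimit=0:", maxPodAttempts(&zero))
}
```

With the default limit, a rollback has to compete with a series of retry pods; with `BackoffLimit=0` there is a single attempt, so the failed job is the only thing left to clean up.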

mnencia (Member) commented Mar 9, 2026

Hi @redbaron, thank you for your work on this — it was instrumental in shaping the final solution. As mentioned above, the E2E tests revealed a race condition with the default BackoffLimit that led us to take a different approach in #10104: disabling job retries entirely (BackoffLimit=0) and having the operator handle the cleanup automatically. Since that PR supersedes this one, I'm going to close it. Thanks again for the contribution!

mnencia closed this Mar 9, 2026
mnencia added a commit that referenced this pull request Mar 9, 2026
…#10104)

Set BackoffLimit=0 on the major upgrade job so Kubernetes does not retry
a failed pg_upgrade. When the user reverts the image after a failed
upgrade, the operator automatically deletes the failed job and lets the
cluster restart on the original version.

Inspired by Maxim Ivanov's initial rollback approach in #9344.

Closes #9128

Signed-off-by: Marco Nenciarini <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
Co-authored-by: Niccolò Fei <[email protected]>
Co-authored-by: Armando Ruocco <[email protected]>
mnencia added a commit that referenced this pull request Mar 9, 2026
…#10104)

Set BackoffLimit=0 on the major upgrade job so Kubernetes does not retry
a failed pg_upgrade. When the user reverts the image after a failed
upgrade, the operator automatically deletes the failed job and lets the
cluster restart on the original version.

Inspired by Maxim Ivanov's initial rollback approach in #9344.

Closes #9128

Signed-off-by: Marco Nenciarini <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
Co-authored-by: Niccolò Fei <[email protected]>
Co-authored-by: Armando Ruocco <[email protected]>
(cherry picked from commit 5b7b799)
redbaron deleted the major-upgrade-failure-rollback branch March 9, 2026 11:57
Labels

backport-requested ◀️, enhancement 🪄, ok to merge 👌, release-1.25, release-1.27, release-1.28, size:L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Major upgrade rollback doesn't work as expected

4 participants