feat(majorupgrade): Allow image rollbacks on failed major upgrades #9344
redbaron wants to merge 3 commits into cloudnative-pg:main
❗ By default, the pull request is configured to backport to all release branches.
a2bdc0c to 625135e
8891664 to 0abcc65
a01219d to ea3e0a6
275df96 to ee198a9
/test
@NiccoloFei, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22500207879
Unless PGData version was updated (as recorded by the status), allow image rollback to the previous version.

Signed-off-by: Maxim Ivanov <[email protected]>
Signed-off-by: Niccolò Fei <[email protected]>
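The rule in this commit message can be read as a small validation check. The following is a minimal sketch under stated assumptions, not the operator's actual code: `clusterStatus`, `PGDataMajor`, and `validateImageChange` are hypothetical names, and major versions are modeled as plain integers.

```go
package main

import "fmt"

// clusterStatus is a hypothetical stand-in for the part of the cluster
// status that records which major version the data directory is on.
type clusterStatus struct {
	PGDataMajor int
}

// validateImageChange is a hypothetical check: a downgrade of the image's
// major version is accepted only when the status shows that PGDATA is still
// on the requested (older) major version, i.e. the upgrade never completed.
func validateImageChange(status clusterStatus, currentMajor, requestedMajor int) error {
	if requestedMajor >= currentMajor {
		// Same major version or an upgrade: outside the scope of this rule.
		return nil
	}
	if status.PGDataMajor == requestedMajor {
		// PGDATA was never upgraded, so going back to the previous image is safe.
		return nil
	}
	return fmt.Errorf(
		"cannot roll back image from major %d to %d: PGDATA is on major %d",
		currentMajor, requestedMajor, status.PGDataMajor,
	)
}

func main() {
	// Upgrade 16 -> 17 failed, PGDATA is still on 16: rollback is accepted.
	fmt.Println(validateImageChange(clusterStatus{PGDataMajor: 16}, 17, 16))
	// Upgrade 16 -> 17 succeeded, PGDATA is on 17: rollback is rejected.
	fmt.Println(validateImageChange(clusterStatus{PGDataMajor: 17}, 17, 16))
}
```

The key design point is that the decision depends on what the status records about PGDATA, not on what the image in the spec currently says.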
ee198a9 to fc490ee
/test
@NiccoloFei, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22577660156
/test ft=postgres-major-upgrade
@NiccoloFei, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22587955136
Set BackoffLimit=0 on the major upgrade job so Kubernetes does not retry a failed pg_upgrade (retries won't produce a different result and the retry pods hold the primary PVC, blocking recovery).

When the upgrade job exists but has not completed, the reconciler now checks whether the user rolled back the image to the previous major version. If so, it deletes the job with foreground propagation and requeues, allowing the cluster to restart on the original version without manual intervention.

Move the majorupgrade.Reconcile call above the running-jobs guard in reconcileResources so the reconciler can act on failed upgrade jobs that would otherwise block the controller indefinitely.

Inspired by Maxim Ivanov's initial rollback approach in #9344.

Signed-off-by: Marco Nenciarini <[email protected]>
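The decision described in this commit message can be modeled as a small pure function. This is a simplified sketch, not the operator's reconciler: the `upgradeJob` type, the `action` values, and `decide` are hypothetical, and the real code would create and delete the job through the Kubernetes API rather than return a string. The "wait" branch for a job that has not completed and has not been rolled back is an assumption; the commit message only describes the rollback branch.

```go
package main

import "fmt"

// action is a hypothetical label for what the reconciler does next.
type action string

const (
	actionContinue            action = "continue"               // no pending upgrade job to handle
	actionWait                action = "wait"                   // assumption: leave the job alone
	actionDeleteJobAndRequeue action = "delete-job-and-requeue" // rollback detected
)

// upgradeJob is a hypothetical summary of the major upgrade job. Since the
// job is created with BackoffLimit=0, a failed run is never retried and the
// job simply stays in the "not completed" state.
type upgradeJob struct {
	Completed bool // true once pg_upgrade finished successfully
	FromMajor int  // major version the cluster was on before the upgrade
	ToMajor   int  // major version the job tried to upgrade to
}

// decide models the check: when the job exists but has not completed and the
// user has set the image back to the previous major version, the job is
// deleted and reconciliation is requeued.
func decide(job *upgradeJob, requestedMajor int) action {
	switch {
	case job == nil:
		return actionContinue
	case job.Completed:
		return actionContinue
	case requestedMajor == job.FromMajor:
		return actionDeleteJobAndRequeue
	default:
		return actionWait
	}
}

func main() {
	failed := &upgradeJob{Completed: false, FromMajor: 16, ToMajor: 17}
	fmt.Println(decide(failed, 17)) // image still points at 17
	fmt.Println(decide(failed, 16)) // user reverted the image to 16
}
```

Because this check has to run even while a job is still present, the commit also moves the majorupgrade.Reconcile call above the running-jobs guard; otherwise the branch that deletes the job would never be reached.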
Add a rollback scenario to the major upgrade E2E suite.

The test creates a cluster with the pgvector extension, attempts an upgrade to a minimal image that lacks the extension (causing pg_upgrade to fail), then reverts the image and verifies the operator automatically cleans up the failed job and the cluster recovers on the original version with its timeline unchanged.

Based on Niccolò Fei's E2E test work in #9344.

Co-authored-by: Niccolò Fei <[email protected]>
Signed-off-by: Marco Nenciarini <[email protected]>
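The final assertions of such a scenario could look roughly like the check below. It is only an illustrative sketch: `snapshot` and `checkRecovered` are hypothetical and do not come from the E2E suite, which would read these values from a live cluster.

```go
package main

import "fmt"

// snapshot is a hypothetical record of the facts the rollback scenario
// compares before the failed upgrade and after the image is reverted.
type snapshot struct {
	Major            int  // PostgreSQL major version the cluster is running
	Timeline         int  // current timeline ID of the primary
	FailedJobPresent bool // whether the failed upgrade job still exists
}

// checkRecovered returns an error describing the first expectation that does
// not hold: the failed job is gone, the major version is the original one,
// and the timeline did not change.
func checkRecovered(before, after snapshot) error {
	if after.FailedJobPresent {
		return fmt.Errorf("failed upgrade job was not cleaned up")
	}
	if after.Major != before.Major {
		return fmt.Errorf("major version changed: %d -> %d", before.Major, after.Major)
	}
	if after.Timeline != before.Timeline {
		return fmt.Errorf("timeline changed: %d -> %d", before.Timeline, after.Timeline)
	}
	return nil
}

func main() {
	before := snapshot{Major: 16, Timeline: 1}
	after := snapshot{Major: 16, Timeline: 1, FailedJobPresent: false}
	fmt.Println(checkRecovered(before, after))
}
```

An unchanged timeline is a useful signal here because it indicates the original data directory was restarted as-is rather than promoted or re-initialized.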
The E2E tests added by @NiccoloFei highlighted that the initial approach was not working due to the race between the job recreating the pod and the rollback. I experimented with a more radical approach by making the job run only once (BackoffLimit=0) and automatically handling the job deletion when the user reverts the image: #10104
Hi @redbaron, thank you for your work on this; it was instrumental in shaping the final solution. As mentioned above, the E2E tests revealed a race condition with the default BackoffLimit that led us to take a different approach in #10104: disabling job retries entirely (BackoffLimit=0) and having the operator handle the cleanup automatically. Since that PR supersedes this one, I'm going to close it. Thanks again for the contribution!
…#10104) Set BackoffLimit=0 on the major upgrade job so Kubernetes does not retry a failed pg_upgrade. When the user reverts the image after a failed upgrade, the operator automatically deletes the failed job and lets the cluster restart on the original version. Inspired by Maxim Ivanov's initial rollback approach in #9344. Closes #9128 Signed-off-by: Marco Nenciarini <[email protected]> Signed-off-by: Armando Ruocco <[email protected]> Co-authored-by: Niccolò Fei <[email protected]> Co-authored-by: Armando Ruocco <[email protected]>
…#10104) Set BackoffLimit=0 on the major upgrade job so Kubernetes does not retry a failed pg_upgrade. When the user reverts the image after a failed upgrade, the operator automatically deletes the failed job and lets the cluster restart on the original version. Inspired by Maxim Ivanov's initial rollback approach in #9344. Closes #9128 Signed-off-by: Marco Nenciarini <[email protected]> Signed-off-by: Armando Ruocco <[email protected]> Co-authored-by: Niccolò Fei <[email protected]> Co-authored-by: Armando Ruocco <[email protected]> (cherry picked from commit 5b7b799)
Unless PGData version was updated (as recorded by the status), allow image rollback to the previous version.
Fixes #9128