feat(majorupgrade): auto-cleanup failed upgrade job on image rollback#10104
❗ By default, the pull request is configured to backport to all release branches.
7163f55 to a91e27c
/test
@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22597783147
fce1be5 to 2d29c47
/test ft=postgres-major-upgrade
@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22619106064
/test
@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22619835859
97882f5 to ba49825
/test ft=postgres-major-upgrade
@NiccoloFei, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22762269483
/test
@NiccoloFei, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22764323799
d94d0d3 to 69879b5
27df5ec to 3b7d9ff
Set BackoffLimit=0 on the major upgrade job so Kubernetes does not retry a failed pg_upgrade: a retry cannot produce a different result, and the retry pods hold the primary PVC, blocking recovery.
When the upgrade job exists but has not completed, the reconciler now checks whether the user rolled the image back to the previous major version. If so, it deletes the job with foreground propagation and requeues, allowing the cluster to restart on the original version without manual intervention.
Move the majorupgrade.Reconcile call above the running-jobs guard in reconcileResources so the reconciler can act on failed upgrade jobs that would otherwise block the controller indefinitely.
Inspired by Maxim Ivanov's initial rollback approach in #9344.
Signed-off-by: Marco Nenciarini <[email protected]>
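The decision described in this commit message can be modeled with a small, self-contained Go sketch. It is not the operator's code: the types `JobState` and `Action` and the functions `decideUpgradeJob` and `reconcileStep` are hypothetical names introduced for illustration, and the sketch assumes a rollback can be detected by comparing the major version of the requested image with the major version of the existing data directory. The actual job deletion (with foreground propagation) is represented only by the returned action.

```go
package main

import "fmt"

// JobState is a simplified, hypothetical stand-in for the upgrade Job status.
type JobState int

const (
	JobAbsent     JobState = iota // no upgrade job exists
	JobIncomplete                 // job exists but has not completed (running or failed)
	JobComplete                   // job finished successfully
)

// Action is what this illustrative reconcile step decides to do.
type Action string

const (
	ActionContinue         Action = "continue"
	ActionWait             Action = "wait"
	ActionDeleteAndRequeue Action = "delete-job-and-requeue"
)

// decideUpgradeJob models the rollback check. dataMajor is the major version
// of the existing data directory; imageMajor is the major version of the image
// currently requested by the user. Both parameters are assumptions of this sketch.
func decideUpgradeJob(state JobState, dataMajor, imageMajor int) Action {
	if state != JobIncomplete {
		// Nothing to clean up: proceed with the rest of the reconciliation.
		return ActionContinue
	}
	if imageMajor == dataMajor {
		// The image was rolled back to the previous major version:
		// the incomplete upgrade job is obsolete and can be deleted.
		return ActionDeleteAndRequeue
	}
	// An upgrade is still requested; leave the job alone.
	return ActionWait
}

// reconcileStep shows the ordering: the upgrade check runs before the
// running-jobs guard, so an incomplete upgrade job cannot block it.
func reconcileStep(state JobState, dataMajor, imageMajor int, jobsPending bool) Action {
	if a := decideUpgradeJob(state, dataMajor, imageMajor); a != ActionContinue {
		return a
	}
	if jobsPending {
		return ActionWait
	}
	return ActionContinue
}

func main() {
	// Failed upgrade from 16 to 17, then the user reverts the image to 16.
	fmt.Println(reconcileStep(JobIncomplete, 16, 16, true))
	// Upgrade to 17 still requested while the job is incomplete.
	fmt.Println(reconcileStep(JobIncomplete, 16, 17, true))
}
```

If the guard ran first in this model, `jobsPending` would return `wait` on every pass and the rollback branch would never be reached, which is the blocking behavior the reordering is meant to avoid.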
Add a rollback scenario to the major upgrade E2E suite. The test creates a cluster with the pgvector extension, attempts an upgrade to a minimal image that lacks the extension (causing pg_upgrade to fail), then reverts the image and verifies that the operator automatically cleans up the failed job and that the cluster recovers on the original version with its timeline unchanged.
Based on Niccolò Fei's E2E test work in #9344.
Co-authored-by: Niccolò Fei <[email protected]>
Signed-off-by: Marco Nenciarini <[email protected]>
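The final verification step of the scenario above can be sketched as a plain Go check. This is an illustration only, not the E2E suite's code: `ClusterSnapshot` and `verifyRollback` are hypothetical names, and the sketch assumes the test can capture the major version, the timeline ID, and the presence of the upgrade job before the upgrade attempt and after the image is reverted.

```go
package main

import (
	"errors"
	"fmt"
)

// ClusterSnapshot is a hypothetical record of the state the test observes.
type ClusterSnapshot struct {
	MajorVersion      int  // PostgreSQL major version the cluster runs on
	TimelineID        int  // current timeline of the primary
	UpgradeJobPresent bool // whether the major upgrade job still exists
}

// verifyRollback returns nil when the cluster recovered on the original
// version: the failed job is gone and neither version nor timeline changed.
func verifyRollback(before, after ClusterSnapshot) error {
	if after.UpgradeJobPresent {
		return errors.New("failed upgrade job was not cleaned up")
	}
	if after.MajorVersion != before.MajorVersion {
		return fmt.Errorf("major version changed: %d -> %d",
			before.MajorVersion, after.MajorVersion)
	}
	if after.TimelineID != before.TimelineID {
		return fmt.Errorf("timeline changed: %d -> %d",
			before.TimelineID, after.TimelineID)
	}
	return nil
}

func main() {
	before := ClusterSnapshot{MajorVersion: 16, TimelineID: 1}
	after := ClusterSnapshot{MajorVersion: 16, TimelineID: 1}
	fmt.Println(verifyRollback(before, after) == nil)
}
```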
Update the documentation to reflect that the operator now automatically detects an image rollback and deletes the failed upgrade job. Users only need to revert the image; manual job deletion is no longer required.
Signed-off-by: Marco Nenciarini <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
3b7d9ff to aafd3dc
…#10104)
Set BackoffLimit=0 on the major upgrade job so Kubernetes does not retry a failed pg_upgrade. When the user reverts the image after a failed upgrade, the operator automatically deletes the failed job and lets the cluster restart on the original version.
Inspired by Maxim Ivanov's initial rollback approach in #9344.
Closes #9128
Signed-off-by: Marco Nenciarini <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
Co-authored-by: Niccolò Fei <[email protected]>
Co-authored-by: Armando Ruocco <[email protected]>
(cherry picked from commit 5b7b799)
Set BackoffLimit=0 on the major upgrade job so Kubernetes does not retry a failed pg_upgrade. When the user reverts the image after a failed upgrade, the operator automatically deletes the failed job and lets the cluster restart on the original version.
Inspired by Maxim Ivanov's initial rollback approach in #9344.
Closes #9128