fix(rollout): prevent supervised primary strategy from starving rollout slots #9977

Merged
leonardoce merged 3 commits into cloudnative-pg:main from ermakov-oleg:fix/supervised-rollout-starvation on Feb 18, 2026

@ermakov-oleg
Contributor

Summary

When a cluster with primaryUpdateStrategy: supervised has its primary pod
pending rollout, the operator calls CoordinateRollout(), which resets the
global rollout delay timer, and then does nothing except log "Waiting for the
user to request a switchover". Because the timer keeps being reset, all other
clusters are blocked from getting rollout slots indefinitely, causing complete
rollout starvation.

Changes

Move the supervised strategy check in rolloutRequiredInstances() to execute
before CoordinateRollout(), so supervised clusters set
PhaseWaitingForUser and return early without consuming the global rollout
slot. The now-unreachable duplicate check in updatePrimaryPod() is removed.

Unit tests are added to verify that supervised clusters don't consume rollout
slots and that unsupervised clusters still work as before.

@ermakov-oleg ermakov-oleg requested a review from a team as a code owner February 13, 2026 17:16
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Feb 13, 2026
@cnpg-bot cnpg-bot added backport-requested ◀️ This pull request should be backported to all supported releases release-1.25 release-1.27 release-1.28 labels Feb 13, 2026
@github-actions
Contributor

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this pr, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

@dosubot dosubot bot added the bug 🐛 Something isn't working label Feb 13, 2026
@armru
Member

armru commented Feb 16, 2026

/test

@github-actions
Contributor

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22059518565

@armru armru force-pushed the fix/supervised-rollout-starvation branch from 0a86760 to b63dc92 Compare February 16, 2026 10:42
@cnpg-bot cnpg-bot added the ok to merge 👌 This PR can be merged label Feb 16, 2026
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 16, 2026
@ermakov-oleg ermakov-oleg force-pushed the fix/supervised-rollout-starvation branch from b63dc92 to cd2e442 Compare February 17, 2026 15:45
@armru armru force-pushed the fix/supervised-rollout-starvation branch from cd2e442 to 626c2b1 Compare February 18, 2026 09:32
@gbartolini gbartolini moved this to Waiting for Second Review in CloudNativePG operator development Feb 18, 2026
@gbartolini gbartolini added this to the 1.29.0 milestone Feb 18, 2026
ermakov-oleg and others added 3 commits February 18, 2026 14:17
Move the supervised primary update strategy check before
CoordinateRollout() in rolloutRequiredInstances(), so supervised
clusters don't consume the global rollout delay slot when they
will only wait for user action. Remove the now-unreachable
duplicate check from updatePrimaryPod().

Signed-off-by: ermakov-oleg <[email protected]>

Replace hardcoded pod name and namespace in
buildPodListWithPrimaryNeedingRollout with cluster.Status.CurrentPrimary
and cluster.Namespace to fix unused-parameter lint error.

Signed-off-by: Armando Ruocco <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
@leonardoce leonardoce force-pushed the fix/supervised-rollout-starvation branch from 626c2b1 to 62628bc Compare February 18, 2026 13:17
@leonardoce leonardoce merged commit 11b46c2 into cloudnative-pg:main Feb 18, 2026
64 of 76 checks passed
@github-project-automation github-project-automation bot moved this from Waiting for Second Review to Done in CloudNativePG operator development Feb 18, 2026
cnpg-bot pushed a commit that referenced this pull request Feb 18, 2026
…ut slots (#9977)

The rollout manager uses a single global slot to coordinate Pod rollouts
across all clusters. When a rollout is initiated, the slot is claimed
and held for a configurable delay before another rollout can proceed.

Previously, clusters using the supervised primary update strategy would
claim the slot even though they only wait for user action (a manual
switchover). This blocked the other clusters from performing their
rollouts until the user intervened.

This fix moves the supervised strategy check before the slot is claimed,
ensuring that supervised primaries never claim the slot. The slot is now
only occupied when an actual rollout will proceed.

Signed-off-by: ermakov-oleg <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
Co-authored-by: Armando Ruocco <[email protected]>
(cherry picked from commit 11b46c2)

Labels

  • backport-requested ◀️: This pull request should be backported to all supported releases
  • bug 🐛: Something isn't working
  • lgtm: This PR has been approved by a maintainer
  • no-issue
  • ok to merge 👌: This PR can be merged
  • release-1.25
  • release-1.27
  • release-1.28
  • size:L: This PR changes 100-499 lines, ignoring generated files.

5 participants