fix(rollout): prevent supervised primary strategy from starving rollout slots #9977
Merged
leonardoce merged 3 commits into cloudnative-pg:main on Feb 18, 2026
Contributor: ❗ By default, the pull request is configured to backport to all release branches.
Member: /test
Contributor: @armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22059518565
0a86760 to b63dc92
armru approved these changes on Feb 16, 2026
b63dc92 to cd2e442
cd2e442 to 626c2b1
Move the supervised primary update strategy check before CoordinateRollout() in rolloutRequiredInstances(), so supervised clusters don't consume the global rollout delay slot when they will only wait for user action. Remove the now-unreachable duplicate check from updatePrimaryPod().
Signed-off-by: ermakov-oleg <[email protected]>
Replace hardcoded pod name and namespace in buildPodListWithPrimaryNeedingRollout with cluster.Status.CurrentPrimary and cluster.Namespace to fix unused-parameter lint error.
Signed-off-by: Armando Ruocco <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
626c2b1 to 62628bc
leonardoce approved these changes on Feb 18, 2026
cnpg-bot pushed a commit that referenced this pull request on Feb 18, 2026
…ut slots (#9977)

The rollout manager uses a single global slot to coordinate Pod rollouts across all clusters. When a rollout is initiated, the slot is claimed and held for a configurable delay before another rollout can proceed.

Previously, clusters using the supervised primary update strategy would claim the slot even though they only wait for user action (a manual switchover). This blocked the other clusters from performing their rollouts until the user intervened.

This fix moves the supervised strategy check ensuring that supervised primaries never claim the slot. The slot is now only occupied when an actual rollout will proceed.

Signed-off-by: ermakov-oleg <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
Co-authored-by: Armando Ruocco <[email protected]>
(cherry picked from commit 11b46c2)
Summary
When a cluster with `primaryUpdateStrategy: supervised` has its primary pod pending rollout, it calls `CoordinateRollout()`, which resets the global rollout delay timer, but then does nothing (it just logs "Waiting for the user to request a switchover"). This blocks all other clusters from getting rollout slots indefinitely, causing complete rollout starvation.
Changes
Move the supervised strategy check in `rolloutRequiredInstances()` to execute before `CoordinateRollout()`, so supervised clusters set `PhaseWaitingForUser` and return early without consuming the global rollout slot. The now-unreachable duplicate check in `updatePrimaryPod()` is removed.
Unit tests were added to verify that supervised clusters don't consume rollout slots and that unsupervised clusters still work as before.
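The reordering described above can be sketched roughly as follows. This is an illustrative reduction rather than the actual diff: the `cluster` and `coordinator` types, the `rolloutPrimary` function, and the phase strings are hypothetical stand-ins, and the real `CoordinateRollout()` signature is not reproduced here.

```go
package main

import "fmt"

// Hypothetical stand-ins for the operator's strategy and phase values.
const (
	strategySupervised   = "supervised"
	strategyUnsupervised = "unsupervised"
	phaseWaitingForUser  = "waiting-for-user"
	phaseUpgrading       = "upgrading"
)

// cluster is a hypothetical, trimmed-down view of a cluster resource.
type cluster struct {
	name                  string
	primaryUpdateStrategy string
	phase                 string
}

// coordinator counts slot claims so the effect of the ordering is visible.
type coordinator struct {
	claims int
}

// coordinateRollout always grants the slot in this sketch and records the claim.
func (c *coordinator) coordinateRollout(clusterName string) bool {
	c.claims++
	return true
}

// rolloutPrimary checks the supervised strategy BEFORE touching the global
// slot, so a cluster that will only wait for the user never claims it.
func rolloutPrimary(cl *cluster, co *coordinator) bool {
	if cl.primaryUpdateStrategy == strategySupervised {
		cl.phase = phaseWaitingForUser
		return false // early return: no slot consumed
	}
	if !co.coordinateRollout(cl.name) {
		return false // slot busy: retry on a later reconcile
	}
	cl.phase = phaseUpgrading
	return true
}

func main() {
	co := &coordinator{}
	sup := &cluster{name: "a", primaryUpdateStrategy: strategySupervised}
	auto := &cluster{name: "b", primaryUpdateStrategy: strategyUnsupervised}

	fmt.Println(rolloutPrimary(sup, co), sup.phase, co.claims)   // prints "false waiting-for-user 0"
	fmt.Println(rolloutPrimary(auto, co), auto.phase, co.claims) // prints "true upgrading 1"
}
```

A test in the same spirit as the ones mentioned above would assert that the claim counter stays at zero for a supervised cluster and increments for an unsupervised one.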