Skip to content

fix: validate VolumeSnapshot existence before using as storage source#10192

Merged
gbartolini merged 1 commit intomainfrom
dev/10029
Mar 11, 2026
Merged

fix: validate VolumeSnapshot existence before using as storage source#10192
gbartolini merged 1 commit intomainfrom
dev/10029

Conversation

@armru
Copy link
Member

@armru armru commented Mar 6, 2026

When a cluster is bootstrapped from a VolumeSnapshot that is later deleted, adding replicas would fail because the operator referenced the deleted snapshot as data source for new PVCs, leaving them stuck in Pending state indefinitely.

Add VolumeSnapshot existence validation in GetCandidateStorageSourceForReplica and getCandidateSourceFromBackupList. When a referenced snapshot no longer exists, the operator now skips it and tries the next candidate or falls back to pg_basebackup for replica creation.

Closes #10029

@armru armru requested a review from a team as a code owner March 6, 2026 11:03
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Mar 6, 2026
@cnpg-bot cnpg-bot added backport-requested ◀️ This pull request should be backported to all supported releases release-1.25 release-1.27 release-1.28 labels Mar 6, 2026
@github-actions
Copy link
Contributor

github-actions bot commented Mar 6, 2026

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this pr, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

@dosubot dosubot bot added the bug 🐛 Something isn't working label Mar 6, 2026
@armru
Copy link
Member Author

armru commented Mar 6, 2026

Hello @simonapencea, crediting you as co-author for your initial analysis, feel free to add suggestions or feedbacks and so on to the patch!

@armru
Copy link
Member Author

armru commented Mar 6, 2026

/test

@github-actions
Copy link
Contributor

github-actions bot commented Mar 6, 2026

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22760727137

@armru armru changed the title fix(replica): validate VolumeSnapshot existence before using as storage source fix: validate VolumeSnapshot existence before using as storage source Mar 6, 2026
@simonapencea
Copy link
Contributor

This is very close to what I had in mind as well — thanks for taking this on!

One thing that might still be worth noting is that the check and the creation of the replica are not a single atomic operation. Because of that, there is technically a small window where the volume could exist when the check happens but be deleted before it is actually used as the data source.

That said, this change strengthens the behavior quite a bit and should cover the vast majority of cases. It also makes recovery from that edge case much easier if it ever happens.
Right now, recovering from that situation can be fairly involved, whereas with this change I would expect that deleting the pending PVC together with the snapshot-recovery job should allow the next reconciliation loop to create the correct type of job again.

Overall this looks like a solid improvement to me 👍

@cnpg-bot cnpg-bot added the ok to merge 👌 This PR can be merged label Mar 6, 2026
@armru
Copy link
Member Author

armru commented Mar 6, 2026

Hello @simonapencea! I want to make sure I'm not missing something. Are you referring to:

  • The time window between the snapshot existence check and the actual PVC creation within the same reconciliation loop? Would you suggest a different approach to address this? CNPG in general (not just the snapshots pkg) follows the eventual consistency pattern, like many Kubernetes operators where multiple components interact and report state concurrently. That said, if we want to handle recovery for the case where a snapshot is deleted during that narrow execution window, I'm personally open to an eventual contribution on that matter (would love to hear other maintainers opinion on this) but I think it belongs in a separate PR as an incremental improvement.

  • Or some other behavior I might be overlooking? If so, I'd appreciate the additional context so we can evaluate whether it warrants a change.

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Mar 6, 2026
@armru
Copy link
Member Author

armru commented Mar 6, 2026

I've added a commit that handles the case where a VolumeSnapshot is deleted between the operator's existence check and the PVC creation, leaving the PVC stuck in Pending and the restore Job blocked indefinitely. The new code detects these orphaned PVCs and cleans them up along with their associated Jobs, allowing the operator to retry via pg_basebackup on the next reconciliation. I'll let the other maintainers decide whether this warrants a separate PR or can be merged with this one.

@armru
Copy link
Member Author

armru commented Mar 6, 2026

/test

@github-actions
Copy link
Contributor

github-actions bot commented Mar 6, 2026

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22770102467

@simonapencea
Copy link
Contributor

Sorry i missed this, yes, that was exactly the case i was referring to.

As to including it here or in a separate PR, i have no opinion.

The PR was super valuable even without considering this edge case.
So i am happy either way :)

@leonardoce leonardoce force-pushed the dev/10029 branch 2 times, most recently from aa615f9 to 691fab7 Compare March 9, 2026 10:59
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Mar 9, 2026
@leonardoce
Copy link
Contributor

Hi! Following a discussion with @armru, we've decided to move the second commit into a separate PR (#10223). This will allow us to discuss and review each change independently.

@leonardoce
Copy link
Contributor

/test

@github-actions
Copy link
Contributor

github-actions bot commented Mar 9, 2026

@leonardoce, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22850513034

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 9, 2026
…ge source (#10029)

When a cluster is bootstrapped from a VolumeSnapshot that is later
deleted, adding replicas would fail because the operator referenced
the deleted snapshot as data source for new PVCs, leaving them stuck
in Pending state indefinitely.

Add VolumeSnapshot existence validation in GetCandidateStorageSourceForReplica
and getCandidateSourceFromBackupList. When a referenced snapshot no longer
exists, the operator now skips it and tries the next candidate or falls
back to pg_basebackup for replica creation.

Co-Authored-By: simonapencea <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
@gbartolini
Copy link
Contributor

I approve this, but the current process lacks transparency and control. In the long term, we need to address #9176 to provide clarity to users.

@gbartolini gbartolini merged commit 77e460b into main Mar 11, 2026
40 checks passed
@gbartolini gbartolini deleted the dev/10029 branch March 11, 2026 01:07
cnpg-bot pushed a commit that referenced this pull request Mar 11, 2026
…ce (#10192)

When a cluster is bootstrapped from a `VolumeSnapshot` that is later
deleted, adding replicas would fail because the operator referenced the
deleted snapshot as data source for new PVCs, leaving them stuck in
Pending state indefinitely.

Add VolumeSnapshot existence validation in `GetCandidateStorageSourceForReplica`
and `getCandidateSourceFromBackupList`. When a referenced snapshot no longer
exists, the operator now skips it and tries the next candidate or falls
back to `pg_basebackup` for replica creation.

Closes #10029

Signed-off-by: Armando Ruocco <[email protected]>
Co-authored-by: simonapencea <[email protected]>
(cherry picked from commit 77e460b)
mnencia pushed a commit that referenced this pull request Mar 11, 2026
…ce (#10192)

When a cluster is bootstrapped from a `VolumeSnapshot` that is later
deleted, adding replicas would fail because the operator referenced the
deleted snapshot as data source for new PVCs, leaving them stuck in
Pending state indefinitely.

Add VolumeSnapshot existence validation in `GetCandidateStorageSourceForReplica`
and `getCandidateSourceFromBackupList`. When a referenced snapshot no longer
exists, the operator now skips it and tries the next candidate or falls
back to `pg_basebackup` for replica creation.

Closes #10029

Signed-off-by: Armando Ruocco <[email protected]>
Co-authored-by: simonapencea <[email protected]>
(cherry picked from commit 77e460b)
mnencia pushed a commit that referenced this pull request Mar 11, 2026
…ce (#10192)

When a cluster is bootstrapped from a `VolumeSnapshot` that is later
deleted, adding replicas would fail because the operator referenced the
deleted snapshot as data source for new PVCs, leaving them stuck in
Pending state indefinitely.

Add VolumeSnapshot existence validation in `GetCandidateStorageSourceForReplica`
and `getCandidateSourceFromBackupList`. When a referenced snapshot no longer
exists, the operator now skips it and tries the next candidate or falls
back to `pg_basebackup` for replica creation.

Closes #10029

Signed-off-by: Armando Ruocco <[email protected]>
Co-authored-by: simonapencea <[email protected]>
(cherry picked from commit 77e460b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-requested ◀️ This pull request should be backported to all supported releases bug 🐛 Something isn't working lgtm This PR has been approved by a maintainer ok to merge 👌 This PR can be merged release-1.25 release-1.27 release-1.28 size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Replica creation fails when bootstrap volume no longer exists

6 participants