Skip to content

feat(replica): clean up Pending PVCs with deleted VolumeSnapshot dataSource#10223

Open
leonardoce wants to merge 2 commits intomainfrom
dev/10029-bis
Open

feat(replica): clean up Pending PVCs with deleted VolumeSnapshot dataSource#10223
leonardoce wants to merge 2 commits intomainfrom
dev/10029-bis

Conversation

@leonardoce
Copy link
Contributor

When a VolumeSnapshot is deleted after a PVC has already been created with it as dataSource, the PVC stays in Pending state indefinitely and the associated restore Job blocks the reconciliation loop.

Add DeletePVCsWithMissingVolumeSnapshots to detect Pending PVCs whose VolumeSnapshot dataSource no longer exists and delete them along with any associated Job, allowing the operator to retry replica creation via pg_basebackup on the next reconciliation.

armru and others added 2 commits March 9, 2026 11:40
…ge source (#10029)

When a cluster is bootstrapped from a VolumeSnapshot that is later
deleted, adding replicas would fail because the operator referenced
the deleted snapshot as data source for new PVCs, leaving them stuck
in Pending state indefinitely.

Add VolumeSnapshot existence validation in GetCandidateStorageSourceForReplica
and getCandidateSourceFromBackupList. When a referenced snapshot no longer
exists, the operator now skips it and tries the next candidate or falls
back to pg_basebackup for replica creation.

Co-Authored-By: simonapencea <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
…ource

When a VolumeSnapshot is deleted after a PVC has already been created
with it as dataSource, the PVC stays in Pending state indefinitely
and the associated restore Job blocks the reconciliation loop.

Add DeletePVCsWithMissingVolumeSnapshots to detect Pending PVCs whose
VolumeSnapshot dataSource no longer exists and delete them along with
any associated Job, allowing the operator to retry replica creation
via pg_basebackup on the next reconciliation.

Signed-off-by: Armando Ruocco <[email protected]>
@leonardoce leonardoce requested a review from a team as a code owner March 9, 2026 11:00
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug 🐛 Something isn't working labels Mar 9, 2026
@leonardoce leonardoce changed the title fix(replica): clean up Pending PVCs with deleted VolumeSnapshot dataSource feat(replica): clean up Pending PVCs with deleted VolumeSnapshot dataSource Mar 9, 2026
@leonardoce leonardoce added the do not backport This PR must not be backported - it will be in the next minor release label Mar 9, 2026
@leonardoce
Copy link
Contributor Author

This PR has been split from #10192

@leonardoce
Copy link
Contributor Author

Thanks for working on this, @armru and @simonapencea!

I've been thinking about this approach and wanted to share some concerns.
My main hesitation is that checking for the existence of the VolumeSnapshot object alone doesn't give us a complete picture. A PVC can stay in Pending state for several reasons beyond a missing VolumeSnapshot:

  • The VolumeSnapshotContent object may not exist
  • The VolumeSnapshotContent may reference a snapshot that was deleted at the storage layer
  • The CSI driver may be experiencing issues
  • The snapshot may have readyToUse: false

Since #10029 already addresses the root cause by preventing the operator from selecting missing snapshots as candidates for new replicas, I'm wondering if this additional cleanup logic is necessary.
The preventive approach should stop this situation from occurring in the first place.

I'm also a bit cautious about the operator automatically deleting PVCs based on this heuristic.
If we can't be certain why a PVC is stuck in Pending, it feels safer to leave it for an administrator to investigate rather than risk deleting something based on an incomplete check.

Would you be open to dropping this part and relying on the preventive fix from #10029?

@armru
Copy link
Member

armru commented Mar 9, 2026

I agree with @leonardoce's concerns. The heuristic of checking only for the VolumeSnapshot object is indeed incomplete, and there are several other reasons a PVC can remain in Pending state that we wouldn't catch.

@armru armru removed the bug 🐛 Something isn't working label Mar 9, 2026
@simonapencea
Copy link
Contributor

As i mentioned in the original PR #10029, there is a small window of time where a volume snapshot that passed the existence check can be deleted before the job starts and the pvc gets created - and as far as i understood, this PR intended to cover for that situation.
However, i believe that this window is relatively small so it is essentially a corner case from where we can recover easily now - just by deleting the pending PVC and job, whereas before it was pretty involved.
So PR #10029 already improves the situation by a lot.
What this PR does is, as far as i understand, provide the basis for a sort of self-healing, which is always complicated to get right. It would be a nice feature to have, but atm it may be a bit biased toward the use-case we discussed and maybe misses some other use cases.
I think this one can wait until we can gain more context.

That being said, another path may be to allow users to essentially opt-out of using volumesnapshots for replicas. This will empower the users to select what best fits their use-case. Just a thought.

Base automatically changed from dev/10029 to main March 11, 2026 01:07
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Mar 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do not backport This PR must not be backported - it will be in the next minor release size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants