feat(replica): clean up Pending PVCs with deleted VolumeSnapshot dataSource #10223
leonardoce wants to merge 2 commits into main
…ge source (#10029)

When a cluster is bootstrapped from a VolumeSnapshot that is later deleted, adding replicas would fail because the operator referenced the deleted snapshot as data source for new PVCs, leaving them stuck in Pending state indefinitely.

Add VolumeSnapshot existence validation in GetCandidateStorageSourceForReplica and getCandidateSourceFromBackupList. When a referenced snapshot no longer exists, the operator now skips it and tries the next candidate or falls back to pg_basebackup for replica creation.

Co-Authored-By: simonapencea <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
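The skip-and-fall-back behaviour described in this commit message could be sketched roughly as follows. This is a minimal illustration, not the operator's code: `snapshotLookup` and `pickReplicaSource` are hypothetical names, and the lookup function stands in for a real Kubernetes API query.

```go
package main

// snapshotLookup is a hypothetical stand-in for a Kubernetes API query
// that reports whether a VolumeSnapshot with the given name still exists.
type snapshotLookup func(name string) bool

// pickReplicaSource walks the candidate snapshots in order and returns the
// first one that still exists. When none exists it returns ("", false),
// which the caller would treat as "fall back to pg_basebackup".
func pickReplicaSource(candidates []string, exists snapshotLookup) (string, bool) {
	for _, name := range candidates {
		if exists(name) {
			return name, true
		}
	}
	return "", false
}
```

With candidates `snap-a` (deleted) and `snap-b` (present), the sketch returns `snap-b`; with only deleted candidates it reports that no snapshot is usable.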
…ource

When a VolumeSnapshot is deleted after a PVC has already been created with it as dataSource, the PVC stays in Pending state indefinitely and the associated restore Job blocks the reconciliation loop.

Add DeletePVCsWithMissingVolumeSnapshots to detect Pending PVCs whose VolumeSnapshot dataSource no longer exists and delete them along with any associated Job, allowing the operator to retry replica creation via pg_basebackup on the next reconciliation.

Signed-off-by: Armando Ruocco <[email protected]>
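The detection step could look something like the sketch below. It is an assumption-laden illustration rather than the actual DeletePVCsWithMissingVolumeSnapshots implementation: `pvcInfo` and `findOrphanedPendingPVCs` are hypothetical names, the set of existing snapshots is passed in as a plain map instead of being read from the API server, and deletion of the PVC and its associated Job is left out.

```go
package main

// pvcInfo is a hypothetical, trimmed-down view of a PersistentVolumeClaim.
type pvcInfo struct {
	Name           string
	Phase          string // for example "Pending" or "Bound"
	SnapshotSource string // name of the VolumeSnapshot dataSource, "" if none
}

// findOrphanedPendingPVCs returns the names of Pending PVCs whose
// VolumeSnapshot dataSource is not in the set of existing snapshots.
// Bound PVCs and PVCs without a snapshot dataSource are never reported.
func findOrphanedPendingPVCs(pvcs []pvcInfo, existingSnapshots map[string]bool) []string {
	var orphaned []string
	for _, p := range pvcs {
		if p.Phase != "Pending" || p.SnapshotSource == "" {
			continue
		}
		if !existingSnapshots[p.SnapshotSource] {
			orphaned = append(orphaned, p.Name)
		}
	}
	return orphaned
}
```

A check of this shape only looks at whether the VolumeSnapshot object exists, so it does not distinguish other reasons a PVC may stay Pending.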
This PR has been split from #10192.
Thanks for working on this, @armru and @simonapencea! I've been thinking about this approach and wanted to share some concerns.
Since #10029 already addresses the root cause by preventing the operator from selecting missing snapshots as candidates for new replicas, I'm wondering if this additional cleanup logic is necessary. I'm also a bit cautious about the operator automatically deleting PVCs based on this heuristic. Would you be open to dropping this part and relying on the preventive fix from #10029?
I agree with @leonardoce's concerns. The heuristic of checking only for the VolumeSnapshot object is indeed incomplete, and there are several other reasons a PVC can remain in Pending state that we wouldn't catch.
As I mentioned in the original PR #10029, there is a small window in which a VolumeSnapshot that passed the existence check can be deleted before the Job starts and the PVC is created. As far as I understand, this PR is intended to cover that situation. That said, another option might be to let users opt out of using VolumeSnapshots for replicas entirely, so they can choose what best fits their use case. Just a thought.