fix: validate VolumeSnapshot existence before using as storage source#10192
fix: validate VolumeSnapshot existence before using as storage source#10192gbartolini merged 1 commit intomainfrom
Conversation
|
❗ By default, the pull request is configured to backport to all release branches.
|
|
Hello @simonapencea, crediting you as co-author for your initial analysis, feel free to add suggestions or feedbacks and so on to the patch! |
|
/test |
|
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22760727137 |
|
This is very close to what I had in mind as well — thanks for taking this on! One thing that might still be worth noting is that the check and the creation of the replica are not a single atomic operation. Because of that, there is technically a small window where the volume could exist when the check happens but be deleted before it is actually used as the data source. That said, this change strengthens the behavior quite a bit and should cover the vast majority of cases. It also makes recovery from that edge case much easier if it ever happens. Overall this looks like a solid improvement to me 👍 |
|
Hello @simonapencea! I want to make sure I'm not missing something. Are you referring to:
|
|
I've added a commit that handles the case where a VolumeSnapshot is deleted between the operator's existence check and the PVC creation, leaving the PVC stuck in Pending and the restore Job blocked indefinitely. The new code detects these orphaned PVCs and cleans them up along with their associated Jobs, allowing the operator to retry via pg_basebackup on the next reconciliation. I'll let the other maintainers decide whether this warrants a separate PR or can be merged with this one. |
|
/test |
|
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22770102467 |
|
Sorry i missed this, yes, that was exactly the case i was referring to. As to including it here or in a separate PR, i have no opinion. The PR was super valuable even without considering this edge case. |
aa615f9 to
691fab7
Compare
|
/test |
|
@leonardoce, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/22850513034 |
…ge source (#10029) When a cluster is bootstrapped from a VolumeSnapshot that is later deleted, adding replicas would fail because the operator referenced the deleted snapshot as data source for new PVCs, leaving them stuck in Pending state indefinitely. Add VolumeSnapshot existence validation in GetCandidateStorageSourceForReplica and getCandidateSourceFromBackupList. When a referenced snapshot no longer exists, the operator now skips it and tries the next candidate or falls back to pg_basebackup for replica creation. Co-Authored-By: simonapencea <[email protected]> Signed-off-by: Armando Ruocco <[email protected]>
|
I approve this, but the current process lacks transparency and control. In the long term, we need to address #9176 to provide clarity to users. |
…ce (#10192) When a cluster is bootstrapped from a `VolumeSnapshot` that is later deleted, adding replicas would fail because the operator referenced the deleted snapshot as data source for new PVCs, leaving them stuck in Pending state indefinitely. Add VolumeSnapshot existence validation in `GetCandidateStorageSourceForReplica` and `getCandidateSourceFromBackupList`. When a referenced snapshot no longer exists, the operator now skips it and tries the next candidate or falls back to `pg_basebackup` for replica creation. Closes #10029 Signed-off-by: Armando Ruocco <[email protected]> Co-authored-by: simonapencea <[email protected]> (cherry picked from commit 77e460b)
…ce (#10192) When a cluster is bootstrapped from a `VolumeSnapshot` that is later deleted, adding replicas would fail because the operator referenced the deleted snapshot as data source for new PVCs, leaving them stuck in Pending state indefinitely. Add VolumeSnapshot existence validation in `GetCandidateStorageSourceForReplica` and `getCandidateSourceFromBackupList`. When a referenced snapshot no longer exists, the operator now skips it and tries the next candidate or falls back to `pg_basebackup` for replica creation. Closes #10029 Signed-off-by: Armando Ruocco <[email protected]> Co-authored-by: simonapencea <[email protected]> (cherry picked from commit 77e460b)
…ce (#10192) When a cluster is bootstrapped from a `VolumeSnapshot` that is later deleted, adding replicas would fail because the operator referenced the deleted snapshot as data source for new PVCs, leaving them stuck in Pending state indefinitely. Add VolumeSnapshot existence validation in `GetCandidateStorageSourceForReplica` and `getCandidateSourceFromBackupList`. When a referenced snapshot no longer exists, the operator now skips it and tries the next candidate or falls back to `pg_basebackup` for replica creation. Closes #10029 Signed-off-by: Armando Ruocco <[email protected]> Co-authored-by: simonapencea <[email protected]> (cherry picked from commit 77e460b)
When a cluster is bootstrapped from a VolumeSnapshot that is later deleted, adding replicas would fail because the operator referenced the deleted snapshot as data source for new PVCs, leaving them stuck in Pending state indefinitely.
Add VolumeSnapshot existence validation in GetCandidateStorageSourceForReplica and getCandidateSourceFromBackupList. When a referenced snapshot no longer exists, the operator now skips it and tries the next candidate or falls back to pg_basebackup for replica creation.
Closes #10029