This article is for DevOps engineers, SREs, platform teams, and infrastructure leads who run Kubernetes in real environments and need a backup strategy that holds up during an outage, migration, ransomware event, cluster failure, or operator mistake. If you own recovery time objectives, recovery point objectives, compliance requirements, or day-two reliability, this is worth your time.
By the end, you’ll have a clear model for how Kubernetes backup works, what needs protection, where etcd fits, how PVCs and CSI snapshots affect recoverability, and what features matter when choosing a backup solution. You’ll also see why CloudCasa is a strong fit for production Kubernetes environments.
Kubernetes changes the shape of infrastructure operations. Applications are assembled from declarative resources and scheduled dynamically across nodes. Storage is abstracted through persistent volumes and claims. Controllers create and reconcile resources continuously. Operators extend the API surface. Managed services push part of the stack outside the cluster boundary.
That architecture brings flexibility, though it also changes what “backup” means.
In a Kubernetes environment, there is no single artifact that captures the full state of an application in a form that is always ready for recovery. A control plane snapshot helps recover cluster metadata and API objects. A volume snapshot helps recover persistent data. A namespace export captures part of the desired state. A database dump protects a specific data service. Real protection comes from understanding how these pieces fit together.
That is why a Kubernetes backup plan should answer a few practical questions: What exactly is protected, and what is not? Where do the backup copies live, and would they survive loss of the source storage? How long does a restore take, and when was it last tested?
If the answer is uncertain, the backup strategy needs work.
A reliable backup design starts with the right inventory. In Kubernetes, four major categories matter.
This includes the Kubernetes objects that define and control the application and the cluster environment. Examples include Deployments, Services, ConfigMaps, Secrets, RBAC policies, network policies, StorageClasses, and custom resources.
These resources live in the Kubernetes API and are stored in etcd. They represent the structure and configuration of the environment. They do not contain the application’s actual file or block data stored in persistent volumes.
For stateful applications, the business value usually sits in the PVC-backed data. This includes database files, uploaded content, repositories, indexes, logs, queue data, machine learning artifacts, and internal application state.
A deployment manifest can recreate a pod. It cannot recreate the bytes inside a missing volume. That data needs its own protection path.
In self-managed environments, control plane recovery details matter. In managed Kubernetes environments such as EKS, AKS, and GKE, cloud-level settings also matter. Those settings can include networking configuration, node pool details, region and zone settings, IAM-related integrations, and cluster service parameters that are useful during a rebuild.
This matters in disaster recovery scenarios where the target cluster no longer exists and the recovery plan includes creating or recreating the cluster.
A Kubernetes application often depends on components outside the cluster boundary, including managed databases such as Amazon RDS, object storage, and other external data services.
These dependencies need protection and recovery planning too. For some workloads, the external data service is the primary system of record.
etcd is the key-value store used by Kubernetes to hold cluster state. It stores the objects that represent the current state of the Kubernetes API. That makes etcd central to disaster recovery for the control plane.
For self-managed clusters, periodic etcd backup is a core best practice. An etcd snapshot can help restore cluster state after control-plane corruption, deletion, or severe misconfiguration. It is especially useful when you need to recover the cluster’s own API objects as they existed at a known point in time.
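As a rough sketch of that practice, the snapshot-and-verify flow looks like the following. The certificate paths and endpoint shown are typical kubeadm defaults and are assumptions; adjust them for your control-plane hosts.

```shell
# Sketch: take an etcd snapshot on a self-managed control-plane node.
# Paths and endpoint are kubeadm-style defaults (assumptions).
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Confirm the snapshot is structurally valid before archiving it off-host
etcdutl snapshot status /var/backups/etcd-$(date +%F).db --write-out=table
```

Schedule this regularly and copy the snapshot off the control-plane host; a snapshot that only lives on the node it protects is not a recovery plan.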
That said, etcd is only one layer of protection.
An etcd snapshot does not protect the file contents of a database volume attached to a StatefulSet. It does not capture the bytes stored in a PVC. It does not automatically protect cloud databases that live outside the cluster. It gives you Kubernetes state, resource definitions, and metadata. That is valuable, though it is not the whole recovery picture.
A sound mental model is simple: etcd backup protects cluster state and resource definitions; volume-level backup protects the application data inside PVCs.
That distinction helps teams avoid a common mistake, which is assuming that control-plane protection covers application recovery end to end.
Persistent volumes and persistent volume claims are how Kubernetes manages durable storage for workloads. The claim defines what the workload requests. The volume represents the underlying storage resource. The storage class and CSI driver determine how that storage is provisioned and managed.
For backup planning, this means the application data lifecycle is separate from the pod lifecycle. Pods can be rescheduled. Nodes can change. The PVC remains the anchor for stateful data consumption. That design is useful operationally and important for recovery planning.
When a workload depends on a PVC, backup needs to cover the PVC and PV definitions, the volume contents themselves, and the binding between the workload and its claim so the restored application reattaches to the right data.
Teams often discover this the hard way during restore tests. The YAML comes back. The pods start. The application fails because the volume contents are stale, missing, inconsistent, or attached to the wrong recovery flow.
Container Storage Interface, or CSI, is the standard Kubernetes uses to interact with storage systems. CSI snapshots provide a point-in-time snapshot mechanism for supported CSI volumes through Kubernetes APIs.
This is an important piece of the backup stack because snapshots are fast, storage-aware, and useful for recovery workflows. They work well for many production scenarios, especially when the CSI driver supports them cleanly and the storage platform is stable.
CSI snapshots help with fast point-in-time recovery, efficient backup reads from a stable source rather than the live mounted volume, and quick rollback within the same storage environment.
CSI snapshots do require the right conditions in the cluster: a CSI driver that supports snapshots, the volume snapshot CRDs and snapshot controller installed, a VolumeSnapshotClass configured, and a storage backend that actually implements the snapshot behavior correctly.
That last point matters. Kubernetes exposes the interface. The storage behavior still depends on the driver and backend.
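A minimal sketch of the objects involved is below. The driver name, namespace, and PVC name are assumptions for illustration; substitute your own.

```yaml
# Minimal sketch: a VolumeSnapshotClass for a hypothetical CSI driver,
# plus a VolumeSnapshot of an existing PVC named "db-data".
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: fast-snapshots
driver: csi.example.com        # assumption: replace with your CSI driver
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-data-snap
  namespace: prod               # assumption: your workload namespace
spec:
  volumeSnapshotClassName: fast-snapshots
  source:
    persistentVolumeClaimName: db-data
```

If applying a manifest like this never produces a ready snapshot, the problem is in the driver or backend, not in your backup tooling.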
CSI snapshots are extremely useful, though they should be placed in context.
A snapshot is a point-in-time recovery primitive. It helps you roll back or read from a consistent storage state. It improves the efficiency of backup operations because the backup tool can read from the snapshot instead of the live mounted volume. It can also shorten restore operations inside the same storage environment.
What it does not guarantee on its own is off-cluster durability.
A local volume snapshot stored within the same storage domain helps with quick operational recovery. It does not automatically give you a separate backup copy that survives loss of the source storage system, a broader infrastructure event, or a malicious deletion scenario.
That is why mature Kubernetes backup platforms distinguish between snapshots kept within the source storage environment for fast operational recovery, and backup copies written to separate storage for durable retention.
That second path matters for serious disaster recovery and long-term retention.
Stateful workloads need more than “something got copied.” They need a recovery point that makes sense for the application.
Many systems can tolerate crash-consistent snapshots. Some databases and transactional services need more controlled handling so data is flushed, paused, frozen, or otherwise prepared before backup begins. Without that step, recovery can still succeed, though the restore quality may depend heavily on the application’s own integrity and journal replay behavior.
In Kubernetes backup, consistency is often improved with application hooks. These hooks let the backup platform run commands before backup, after backup, and after restore. For example, a pre-backup hook can flush application state or trigger database-specific coordination. A post-restore hook can perform bootstrap or validation steps after recovery.
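As an illustration only (the namespace, deployment, and credentials here are hypothetical, and the actual hook mechanism depends on your backup platform), a pre-backup hook for a PostgreSQL workload typically executes something like:

```shell
# Illustration: the kind of command a pre-backup hook runs inside a
# database container to reach a clean on-disk state before a snapshot.
# Namespace, deployment name, and user are hypothetical.
kubectl exec -n prod deploy/postgres -- \
  psql -U postgres -c "CHECKPOINT;"   # flush dirty pages to disk
```

The exact command varies per database; the point is that the hook runs inside the workload before the snapshot is taken, not after.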
For virtualized workloads running through KubeVirt or related platforms, consistency can also be improved through guest agent integration, including freeze and unfreeze operations for VM filesystems.
This is a major buying criterion for backup products. A checkbox for “supports backups” tells you very little. Hook support, application awareness, and testable restore workflows tell you much more.
There are several valid ways to protect Kubernetes. The right one depends on workload type, data criticality, recovery targets, and infrastructure design.
This strategy captures Kubernetes resources and configuration, often including cluster-scoped objects. It is useful for recovering deleted or misconfigured resources, rolling back namespace-level changes, and rebuilding stateless applications.
It is not enough for stateful applications that rely on PVC data.
This adds control-plane recovery for self-managed clusters and gives stronger cluster-state recovery coverage. It is useful when recovering the cluster itself is part of the DR plan.
It still needs separate PVC protection for stateful workloads.
This strategy uses CSI snapshots or other storage-level snapshot methods to protect persistent volumes. It supports fast recovery and efficient backup orchestration. It works well when the storage stack is snapshot-capable and well-integrated.
Teams should still evaluate whether they also need backup copies outside the source storage environment.
This is one of the strongest general-purpose strategies for production Kubernetes. The snapshot provides a stable source and fast local recovery option. The backup copy provides durable retention and better survival against storage failure or broader infrastructure incidents.
This model supports many real-world objectives: fast local restore, durable long-term retention, survival of storage-level failures, ransomware resilience, and disaster recovery outside the source failure domain.
Some environments need low-RTO recovery paths that combine backup with storage replication and cross-cluster failover. In these designs, the backup system restores Kubernetes resources while storage replication provides the volume-level data path.
This is useful for organizations with stricter service continuity requirements and storage platforms that expose remote replication capabilities.
A serious backup product for Kubernetes should support the operational shape of production clusters. That means more than capturing YAML and taking snapshots.
Here’s what to look for.
The platform should protect Kubernetes objects and PVC data in one coordinated workflow. If the product only handles resources or only handles data, you inherit more manual recovery work.
Recovery should support multiple scopes, including full cluster, namespace, individual resource, and volume-level restore.
Granularity matters because many recovery events are small and targeted.
A good product should restore to a different cluster, including migration and DR use cases. This is crucial when the source cluster is unavailable or when workloads need to be moved during upgrades, consolidation, or platform transitions.
The platform should support CSI snapshot workflows, understand the difference between snapshot and backup copy, and document supported PV types clearly. Storage behavior is too important for hand-wavy language.
Look for application hooks, database-aware workflows, and VM guest coordination where relevant. Restore quality matters as much as backup completion.
Backup should be schedulable, policy-driven, and manageable through API or infrastructure-as-code workflows. DevOps teams need repeatable operations, not purely manual console steps.
Important features include immutable backups, RBAC, encryption support, access controls, and deployment models that fit regulated or sovereign environments.
Some organizations want SaaS management. Some need self-hosted control. Some need air-gapped deployments. A Kubernetes backup solution should fit the operating model of the organization.
CloudCasa aligns well with the way Kubernetes backup works in production because it addresses the real layers of recovery instead of narrowing the problem to a single mechanism.
At the Kubernetes layer, CloudCasa protects cluster resources, namespaces, and individual resources. It also includes etcd backup support as part of the broader recovery picture. For persistent data, it supports snapshot-based protection for supported volumes and can copy volume data to backup storage for durable retention. That combination matters because production recovery usually needs both orchestration state and application data.
For storage workflows, CloudCasa supports CSI snapshot-based backups and clearly distinguishes snapshot-only operations from copy-to-backup-storage workflows. That is exactly how engineers should think about protection design. Fast local recovery and durable off-cluster retention serve different purposes.
For consistency and stateful workloads, CloudCasa supports application hooks for pre-backup, post-backup, and post-restore actions. It also supports guest-aware consistency workflows for KubeVirt environments through QEMU guest agent integration. That makes the platform relevant for both containerized stateful apps and virtualized workloads running on Kubernetes.
For restore flexibility, CloudCasa supports restore at the cluster, namespace, resource, and volume level. It also supports migration and replication workflows, which is useful for platform teams handling cross-cluster movement, DR exercises, and environment transitions.
For more advanced DR scenarios, CloudCasa also introduced DR for Storage support, enabling recovery workflows that integrate storage replication with Kubernetes resource restoration. For teams that need faster service continuity paths, this is a meaningful capability.
CloudCasa’s current feature set is also strong in adjacent areas that matter to platform teams:
CloudCasa supports a wide range of Kubernetes distributions and managed services, including major environments such as EKS, AKS, GKE, OpenShift, Rancher, and VMware Tanzu. That matters for organizations with mixed environments or platform transitions in flight.
CloudCasa supports backup and restore for KubeVirt, OpenShift Virtualization, and SUSE Virtualization workloads. It also supports VM file-level restore, which is useful when the goal is to recover specific files without restoring a full virtual machine.
CloudCasa supports object storage targets and has added support for NFS backup targets and SMB backup targets. It also introduced backup compression, which helps with transfer and storage efficiency.
Immutable backup support is important for ransomware resilience and retention governance. Backup immutability strengthens the recovery posture by protecting stored backup copies from tampering during the retention period.
When connected to cloud accounts, CloudCasa can auto-discover managed Kubernetes clusters and preserve cloud-related cluster parameters to support restore workflows. That reduces the manual burden during rebuild scenarios.
This is a big one for DevOps teams. CloudCasa supports automation through API and CLI workflows, fine-grained RBAC, and Terraform integration. Backup should live comfortably inside a modern platform engineering workflow, and these features help get it there.
CloudCasa is available as a SaaS platform and as a self-hosted solution. That helps organizations with different operational, compliance, and sovereignty requirements. Self-hosted deployment is especially relevant in regulated, controlled, or air-gapped environments.
For workloads that rely on cloud databases, CloudCasa also supports backup workflows for services such as Amazon RDS. That is useful because many Kubernetes applications depend on state that exists outside the cluster boundary.
CloudCasa is a strong fit for DevOps and SRE teams running stateful workloads in production, platform teams managing multiple clusters or mixed distributions, and organizations with compliance, ransomware-resilience, or disaster recovery requirements.
In practice, the product fits especially well when the backup conversation includes PVC data, recovery granularity, automation, and real DR planning. That is where lightweight approaches start to fray.
Kubernetes backup is a layered discipline. etcd matters because it protects cluster state. PVC protection matters because application data lives there. CSI snapshots matter because they provide fast, storage-aware recovery primitives. Backup copies matter because durable retention and disaster recovery require data outside the source failure domain. Consistency matters because successful restore is the actual goal.
A good backup strategy reflects that reality.
A good backup product does too.
CloudCasa stands out because it supports the full shape of Kubernetes recovery: resources, etcd, PVCs, snapshot workflows, backup copies, migration, replication, hooks, VM support, automation, and deployment flexibility. For teams searching for a Kubernetes backup solution that fits production operations, it is a strong choice.
Back up cluster resources, persistent volumes, KubeVirt VMs, and cloud-native workloads from a single platform built for modern Kubernetes operations.
Try CloudCasa with a 60-day free trial and validate backup and restore workflows in your own Kubernetes environment.
Most of the tooling that exists today wasn’t built with Kubernetes in mind. It was built for something else and extended toward it. That gap shows up in real ways: in how recovery works, in how pricing scales, and in how much operational overhead teams end up carrying just to keep things protected.
This is the problem space that Paweł Staniec, Head of Technology and Alliances for the EMEA region at CloudCasa, walks through in the latest episode of Partner Power 5, the Red Hat partner podcast.
OpenShift Virtualization is changing how teams think about their workload estate. Virtual machines are moving onto the same platform as containers. That consolidation makes operational sense, but it creates a protection challenge that’s genuinely different from either traditional VM backup or pure Kubernetes backup.
You’re no longer dealing with one workload type in one environment. You’re dealing with VMs, containers, and stateful applications, potentially spread across on-prem and managed cloud environments like ROSA, ARO, or OpenShift Dedicated. Recovery needs to be granular enough to handle all of it, and consistent enough that engineers aren’t learning a different workflow for every scenario.
File-level restore, namespace-level restore, cross-cluster migration: for teams running production workloads, these aren’t advanced features. They’re baseline expectations.
CloudCasa is Red Hat certified, which matters beyond the badge. Certification means the integration has been validated against OpenShift’s specific architecture, including storage, RBAC, and multi-tenancy. Role-based access control and multi-tenancy support are built in. Recovery operations follow a consistent model whether you’re restoring a single file or an entire cluster.
CloudCasa is available as both SaaS and self-hosted, so teams can choose the deployment model that fits their security posture. That flexibility matters particularly in regulated industries or air-gapped environments where SaaS isn’t an option.
Node-based pricing is straightforward to model and budget. It doesn’t penalize teams for storing more data or running more workloads per cluster. Traditional backup licensing models made sense when VM counts were the primary variable. In a Kubernetes environment where workload density shifts quickly, those models tend to produce cost surprises.
Paweł goes deeper on the specific scenarios where a Kubernetes-native approach changes the recovery experience, how teams use CloudCasa to support workload migration during modernization, and what the Red Hat certification process actually validated.
If you’re evaluating data protection options for OpenShift, or already running something and finding gaps, the conversation is practical and grounded. Find it now on the Partner Power 5 feed.
VMware exits rarely fail because engineers cannot move bits. They fail because the organization discovers, mid-flight, that it cannot reliably recover those bits once they land somewhere new.
If you are migrating from vSphere to Red Hat OpenShift Virtualization using Migration Toolkit for Virtualization (MTV), there is a high-leverage move that often gets postponed until “after the first wave”: set up VM-aware data protection on the OpenShift side first, then migrate.
This is not about buying insurance for hypothetical disasters. It is about turning migration into a controlled, testable delivery pipeline where rollback is a practiced operation, not a prayer.
This guide reflects the current state of MTV 2.10, OpenShift Virtualization 4.21, and CloudCasa’s VM and cluster backup capabilities as of February 2026.
Before we get tactical, it helps to name the three systems you are implicitly trusting during a vSphere-to-OpenShift Virtualization migration.
MTV is delivered as an OpenShift Operator and drives migrations through custom resources and a UI workflow. It supports cold and warm migrations. MTV 2.10 builds on the storage offloading capabilities introduced in 2.9, delegating disk copy to the underlying storage system for dramatically faster migrations with compatible storage partners. Raw copy mode handles VMs with unsupported guest operating systems.
OpenShift Virtualization is KubeVirt-based. VM disks are typically backed by PVCs, and snapshot integrity depends on CSI snapshot support. For running VMs, the QEMU guest agent coordinates filesystem quiescing during snapshot operations to deliver application-consistent recovery points.
A Kubernetes-native backup tool that only captures manifests is not enough for VM recovery. You need VM-aware selection and restore semantics that pull in the VM object plus the associated DataVolumes, PVCs, secrets, and supporting resources as a coherent unit.
CloudCasa provides backup, restore, and migration services for VMs running on KubeVirt-based platforms. Compatibility has been verified with KubeVirt v1.0.0 and above, CDI v1.57.0 and above, and Red Hat OpenShift Virtualization. VM detection is automatic, and CloudCasa uses the QEMU guest agent to execute freeze/unfreeze hooks for application-consistent online backups.
Migration plans tend to validate that VMs can be moved and booted. That is necessary, not sufficient.
Standing up protection early forces you to validate the parts that fail in production: whether your storage class actually supports snapshots consistently under load, whether restores recreate the right disk objects and bindings, whether you have a clean path for restoring a VM into an isolated namespace for testing, and whether the guest agent setup produces consistent online snapshots.
Red Hat’s guidance is clear: for the highest integrity snapshots of running VMs, install the QEMU guest agent. Without it, you get crash-consistent snapshots at best.
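One way to audit guest agent coverage before cutover is to list VM instances along with their agent status. The namespace here is an example; the `AgentConnected` condition is what KubeVirt reports when the agent is running in the guest.

```shell
# List VMIs and whether the QEMU guest agent is connected in each guest.
# Namespace "finance" is an example.
oc get vmi -n finance -o 'custom-columns=NAME:.metadata.name,AGENT:.status.conditions[?(@.type=="AgentConnected")].status'
```

Any VM showing an empty or `False` agent column will fall back to crash-consistent snapshots.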
MTV handles orchestration well, but you will still hit edge cases: driver conversion behaviors, mapping mismatches, and workload-specific surprises. If protection is already installed and tested on the target cluster, you can enforce a simple policy: every VM that lands and passes validation gets a recovery point on the new platform immediately. That turns “we can re-run the migration” into “we can restore on the target now.”
The best migrations feel boring because they are rehearsed. With a VM-aware backup and restore system in place before the wave, you can migrate a representative set of VMs, take a recovery point, restore into a separate namespace, and validate boot, data, and networking without touching the production landing zone.
Most VMware exits are phased. You will run workloads in both places for a while. That creates a vulnerable window if OpenShift Virtualization is receiving workloads faster than your protection program is being built. With protect-first, you close that gap and maintain immutable recovery points from day one, which is critical for ransomware resilience.
Backups stress storage and APIs. The right time to discover that snapshot operations trigger latency spikes, or that your object storage throttles hard, is during a controlled test, not during wave two when leadership is watching a Gantt chart bleed.
VM-level protection is essential, but it is not the whole story. Your OpenShift cluster stores every API object, every configuration, every secret, and every VM definition in etcd. If etcd is corrupted or lost, your cluster cannot run. Period.
This is the difference between recovering individual workloads and recovering the ability to run workloads at all. A migration project that protects VMs but ignores the control plane is building on sand.
etcd is the key-value datastore that holds the entire state of your Kubernetes cluster: namespaces, deployments, services, secrets, ConfigMaps, RBAC policies, custom resources including VirtualMachine definitions, network policies, storage classes, and PersistentVolumeClaims. Lose etcd, and you lose the cluster’s memory of what should be running and how.
During a VMware exit, your OpenShift cluster is under construction. You are adding namespaces, network mappings, storage configurations, and migrated VM definitions at a rapid pace. A control plane failure mid-migration without a valid etcd backup means rebuilding not just the cluster, but all of the migration work you have already completed.
CloudCasa backs up Kubernetes cluster resources including etcd as part of its standard protection workflow. This gives you a single pane of glass for VM protection, persistent volume backup, and cluster state recovery.
An etcd backup that cannot be restored is not a backup. It is a comfort object. Before your first migration wave, verify that backups are produced on schedule, that you can download a snapshot file, and that the file is structurally valid.
Quick validation commands after downloading an etcd backup:
```shell
# Verify the backup file decompresses cleanly
gunzip -t etcd-snapshot.db.gz && echo "Archive OK"

# Decompress and check snapshot status
gunzip etcd-snapshot.db.gz
etcdutl snapshot status etcd-snapshot.db --write-out=table
```
If the snapshot status shows a valid hash, revision count, and key count, your backup is structurally sound. This ten-minute check can save you days of rebuilding a corrupted cluster.
With CloudCasa, you get a unified approach: etcd and cluster resource backups protect the control plane, VM-aware backups protect individual workloads, and persistent volume backups protect application data. This means you can recover from a single VM failure, a namespace deletion, or a complete cluster loss from the same management interface.
For a VMware migration, this layered protection is not optional. You are building a new platform while running production workloads on it. Protect the platform, not just the workloads.
A protect-first approach lives or dies on a short list of prerequisites.
For VM snapshots and snapshot-based backup flows, your storage provider needs CSI snapshot support. Red Hat documents snapshots as relying on the Kubernetes Volume Snapshot API through CSI.
Quick checks:
```shell
oc get volumesnapshotclass
oc get sc
```
If volumesnapshotclass is empty or your VM disk storage class lacks snapshot support, fix that before pretending you have a recoverable virtualization platform.
For running VMs, snapshot integrity improves when the QEMU guest agent can freeze and thaw filesystems during backup or snapshot operations. Red Hat describes the freeze process and application notification behavior, including Windows VSS integration.
Practical policy that works: Tier 1 workloads require guest agent before cutover. Tier 2 workloads require guest agent before they are considered stable. Tier 3 workloads can accept crash-consistent recovery points during early waves.
Most teams forget this until they need it. Decide early which team roles can restore VMs, where restores are allowed to land (production namespace vs restore-lab namespace), and how secrets and config are handled during restore. This is less about security theater and more about avoiding a restore that accidentally stomps on an active workload.
Run your first etcd backup, download it, decompress it, and verify it is valid before you start migrating production workloads. This is a ten-minute task that validates your ability to recover the entire cluster.
MTV encourages wave-based migrations. Your protection strategy should follow the same shape.
The pattern: MTV migrates VMs into OpenShift Virtualization. You validate the workload on the target. You label the VM as migrated and validated. Backup selection uses labels, so new arrivals automatically get protected.
Example labeling approach:
```shell
# List VMs in the target namespace
oc get vm -n finance -o name

# Label each VM that passed validation
for vm in $(oc get vm -n finance -o name); do
  oc label "$vm" -n finance migration.wave=wave1 \
    app.tier=tier1 validated=true
done
```
Now your backup tool can select validated=true and migration.wave=wave1 and stay aligned with how the migration is actually managed. CloudCasa supports selecting KubeVirt VMs by namespace, labels, or individually through its VMs/PVCs tab, with all associated resources automatically included.
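Before wiring those labels into a backup policy, it is worth dry-running the selector to confirm it matches exactly the VMs you expect (labels as in the example above):

```shell
# Confirm which VMs the label selector will pick up, across all namespaces
oc get vm -A -l 'validated=true,migration.wave=wave1'
```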
Even if you plan to rely on a managed protection layer, you should run a native VM snapshot once. It validates the basics and gives you an early warning if storage or guest agent behavior is off.
Example VirtualMachineSnapshot manifest (for OpenShift 4.21, using the v1beta1 API version):
```yaml
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineSnapshot
metadata:
  name: demo-snap-01
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: demo-vm
```
Create and verify:
```shell
oc create -f demo-snap-01.yaml
oc get virtualmachinesnapshot demo-snap-01 -o yaml
```
Look for status.readyToUse: true. If snapshots do not reach this state, that is not a “backup tool problem.” It is your platform telling you the foundations are shaky.
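Rather than polling the YAML by hand, a recent `oc`/`kubectl` (v1.23+) can block on that field directly:

```shell
# Block until the snapshot reports readyToUse, or fail after 5 minutes
oc wait virtualmachinesnapshot/demo-snap-01 \
  --for=jsonpath='{.status.readyToUse}'=true --timeout=300s
```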
Backups without restores are expensive comforting stories. Before the first serious MTV wave, run this drill and document the results: back up a representative VM, restore it into an isolated namespace, and validate boot, data, and networking.
CloudCasa’s restore wizard supports VM-oriented selection and multiple restore transforms including clearing MAC addresses, generating new firmware UUIDs (to avoid licensing conflicts if the original VM is still active), and controlling the VM run strategy on restore.
Storage Class Surprises
“Snapshots are supported” is not a feeling. It is a property of your CSI driver and how it behaves under your workloads. Red Hat explicitly ties snapshot support to CSI drivers and the Volume Snapshot API. Test under realistic load before migration day.
Guest Agent Drift
You start with good templates and then custom images show up. Put guest agent checks into your VM standards before wave three turns into image chaos. Consider adding guest agent validation to your MTV post-migration hooks.
Restore Collisions
If restores are allowed to land in the same namespaces as production VMs without guardrails, you will eventually restore into an occupied space. Prevent this with namespace targeting rules and process. CloudCasa offers options to clear MAC addresses and regenerate firmware UUIDs specifically to avoid these conflicts.
etcd Backup Neglect
Teams focus on VM protection and forget the control plane. An etcd corruption during a migration wave means rebuilding the cluster and re-running every completed migration. Include etcd backup verification in your day-zero checklist and your ongoing operational runbook.
Engineers hate buying tools on vibes. They want proof. CloudCasa offers a free trial (no payment required), which fits neatly into a migration runway.
A pragmatic 60-day plan: deploy protection on the target cluster, validate snapshot and guest agent behavior with a pilot workload, run a restore drill into an isolated namespace, then execute your first MTV wave with backup policies already in place.
MTV moves workloads. OpenShift Virtualization runs them. Protection makes them survivable.
Setting up VM-aware data protection on OpenShift Virtualization before you run MTV at scale is a disciplined way to reduce migration risk. It validates snapshot integrity, forces restore practice early, maintains RPO and RTO coverage from day one, and turns each migration wave into something you can repeat, measure, and recover from.
But do not stop at VM protection. CloudCasa’s ability to back up etcd and cluster resources means you are protecting the platform itself, not just the workloads running on it. A corrupted etcd with no valid backup turns a VM migration into a full cluster rebuild. That is a risk no migration plan should accept.
A VMware exit is a logistics project disguised as a technical one. Protect-first is how you keep the logistics honest.
Learn more: cloudcasa.io/backup-recovery-kubevirt-red-hat-openshift-suse-harvester/