This article is for DevOps engineers, SREs, platform teams, and infrastructure leads who run Kubernetes in real environments and need a backup strategy that holds up during an outage, migration, ransomware event, cluster failure, or operator mistake. If you own recovery time objectives, recovery point objectives, compliance requirements, or day-two reliability, this is worth your time.
By the end, you’ll have a clear model for how Kubernetes backup works, what needs protection, where etcd fits, how PVCs and CSI snapshots affect recoverability, and what features matter when choosing a backup solution. You’ll also see why CloudCasa is a strong fit for production Kubernetes environments.
Kubernetes changes the shape of infrastructure operations. Applications are assembled from declarative resources and scheduled dynamically across nodes. Storage is abstracted through persistent volumes and claims. Controllers create and reconcile resources continuously. Operators extend the API surface. Managed services push part of the stack outside the cluster boundary.
That architecture brings flexibility, though it also changes what “backup” means.
In a Kubernetes environment, there is no single artifact that captures the full state of an application in a form that is always ready for recovery. A control plane snapshot helps recover cluster metadata and API objects. A volume snapshot helps recover persistent data. A namespace export captures part of the desired state. A database dump protects a specific data service. Real protection comes from understanding how these pieces fit together.
That is why a Kubernetes backup plan should answer a few practical questions: What exactly is protected, and what is not? Where do the backup copies live, and would they survive loss of the source storage? How long does a restore take, and when was it last tested?
If the answer is uncertain, the backup strategy needs work.
A reliable backup design starts with the right inventory. In Kubernetes, four major categories matter.
This includes the Kubernetes objects that define and control the application and the cluster environment. Examples include Deployments, Services, ConfigMaps, Secrets, RBAC policies, network policies, StorageClasses, and custom resources.
These resources live in the Kubernetes API and are stored in etcd. They represent the structure and configuration of the environment. They do not contain the application’s actual file or block data stored in persistent volumes.
For stateful applications, the business value usually sits in the PVC-backed data. This includes database files, uploaded content, repositories, indexes, logs, queue data, machine learning artifacts, and internal application state.
A deployment manifest can recreate a pod. It cannot recreate the bytes inside a missing volume. That data needs its own protection path.
In self-managed environments, control plane recovery details matter. In managed Kubernetes environments such as EKS, AKS, and GKE, cloud-level settings also matter. Those settings can include networking configuration, node pool details, region and zone settings, IAM-related integrations, and cluster service parameters that are useful during a rebuild.
This matters in disaster recovery scenarios where the target cluster no longer exists and the recovery plan includes creating or recreating the cluster.
A Kubernetes application often depends on components outside the cluster boundary, including managed databases such as Amazon RDS, object storage, and other external data services.
These dependencies need protection and recovery planning too. For some workloads, the external data service is the primary system of record.
etcd is the key-value store used by Kubernetes to hold cluster state. It stores the objects that represent the current state of the Kubernetes API. That makes etcd central to disaster recovery for the control plane.
For self-managed clusters, periodic etcd backup is a core best practice. An etcd snapshot can help restore cluster state after control-plane corruption, deletion, or severe misconfiguration. It is especially useful when you need to recover the cluster’s own API objects as they existed at a known point in time.
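As a rough sketch of that practice, the snapshot-and-verify flow looks like the following. The certificate paths and endpoint shown are typical kubeadm defaults and are assumptions; adjust them for your control-plane hosts.

```shell
# Sketch: take an etcd snapshot on a self-managed control-plane node.
# Paths and endpoint are kubeadm-style defaults (assumptions).
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Confirm the snapshot is structurally valid before archiving it off-host
etcdutl snapshot status /var/backups/etcd-$(date +%F).db --write-out=table
```

Schedule this regularly and copy the snapshot off the control-plane host; a snapshot that only lives on the node it protects is not a recovery plan.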
That said, etcd is only one layer of protection.
An etcd snapshot does not protect the file contents of a database volume attached to a StatefulSet. It does not capture the bytes stored in a PVC. It does not automatically protect cloud databases that live outside the cluster. It gives you Kubernetes state, resource definitions, and metadata. That is valuable, though it is not the whole recovery picture.
A sound mental model is simple: etcd backup protects cluster state and resource definitions; volume-level backup protects the application data inside PVCs.
That distinction helps teams avoid a common mistake, which is assuming that control-plane protection covers application recovery end to end.
Persistent volumes and persistent volume claims are how Kubernetes manages durable storage for workloads. The claim defines what the workload requests. The volume represents the underlying storage resource. The storage class and CSI driver determine how that storage is provisioned and managed.
For backup planning, this means the application data lifecycle is separate from the pod lifecycle. Pods can be rescheduled. Nodes can change. The PVC remains the anchor for stateful data consumption. That design is useful operationally and important for recovery planning.
When a workload depends on a PVC, backup needs to cover the PVC and PV definitions, the volume contents themselves, and the binding between the workload and its claim so the restored application reattaches to the right data.
Teams often discover this the hard way during restore tests. The YAML comes back. The pods start. The application fails because the volume contents are stale, missing, inconsistent, or attached to the wrong recovery flow.
Container Storage Interface, or CSI, is the standard Kubernetes uses to interact with storage systems. CSI snapshots provide a point-in-time snapshot mechanism for supported CSI volumes through Kubernetes APIs.
This is an important piece of the backup stack because snapshots are fast, storage-aware, and useful for recovery workflows. They work well for many production scenarios, especially when the CSI driver supports them cleanly and the storage platform is stable.
CSI snapshots help with fast point-in-time recovery, efficient backup reads from a stable source rather than the live mounted volume, and quick rollback within the same storage environment.
CSI snapshots do require the right conditions in the cluster: a CSI driver that supports snapshots, the volume snapshot CRDs and snapshot controller installed, a VolumeSnapshotClass configured, and a storage backend that actually implements the snapshot behavior correctly.
That last point matters. Kubernetes exposes the interface. The storage behavior still depends on the driver and backend.
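A minimal sketch of the objects involved is below. The driver name, namespace, and PVC name are assumptions for illustration; substitute your own.

```yaml
# Minimal sketch: a VolumeSnapshotClass for a hypothetical CSI driver,
# plus a VolumeSnapshot of an existing PVC named "db-data".
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: fast-snapshots
driver: csi.example.com        # assumption: replace with your CSI driver
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-data-snap
  namespace: prod               # assumption: your workload namespace
spec:
  volumeSnapshotClassName: fast-snapshots
  source:
    persistentVolumeClaimName: db-data
```

If applying a manifest like this never produces a ready snapshot, the problem is in the driver or backend, not in your backup tooling.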
CSI snapshots are extremely useful, though they should be placed in context.
A snapshot is a point-in-time recovery primitive. It helps you roll back or read from a consistent storage state. It improves the efficiency of backup operations because the backup tool can read from the snapshot instead of the live mounted volume. It can also shorten restore operations inside the same storage environment.
What it does not guarantee on its own is off-cluster durability.
A local volume snapshot stored within the same storage domain helps with quick operational recovery. It does not automatically give you a separate backup copy that survives loss of the source storage system, a broader infrastructure event, or a malicious deletion scenario.
That is why mature Kubernetes backup platforms distinguish between snapshots kept within the source storage environment for fast operational recovery, and backup copies written to separate storage for durable retention.
That second path matters for serious disaster recovery and long-term retention.
Stateful workloads need more than “something got copied.” They need a recovery point that makes sense for the application.
Many systems can tolerate crash-consistent snapshots. Some databases and transactional services need more controlled handling so data is flushed, paused, frozen, or otherwise prepared before backup begins. Without that step, recovery can still succeed, though the restore quality may depend heavily on the application’s own integrity and journal replay behavior.
In Kubernetes backup, consistency is often improved with application hooks. These hooks let the backup platform run commands before backup, after backup, and after restore. For example, a pre-backup hook can flush application state or trigger database-specific coordination. A post-restore hook can perform bootstrap or validation steps after recovery.
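As an illustration only (the namespace, deployment, and credentials here are hypothetical, and the actual hook mechanism depends on your backup platform), a pre-backup hook for a PostgreSQL workload typically executes something like:

```shell
# Illustration: the kind of command a pre-backup hook runs inside a
# database container to reach a clean on-disk state before a snapshot.
# Namespace, deployment name, and user are hypothetical.
kubectl exec -n prod deploy/postgres -- \
  psql -U postgres -c "CHECKPOINT;"   # flush dirty pages to disk
```

The exact command varies per database; the point is that the hook runs inside the workload before the snapshot is taken, not after.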
For virtualized workloads running through KubeVirt or related platforms, consistency can also be improved through guest agent integration, including freeze and unfreeze operations for VM filesystems.
This is a major buying criterion for backup products. A checkbox for “supports backups” tells you very little. Hook support, application awareness, and testable restore workflows tell you much more.
There are several valid ways to protect Kubernetes. The right one depends on workload type, data criticality, recovery targets, and infrastructure design.
This strategy captures Kubernetes resources and configuration, often including cluster-scoped objects. It is useful for recovering deleted or misconfigured resources, rolling back namespace-level changes, and rebuilding stateless applications.
It is not enough for stateful applications that rely on PVC data.
This adds control-plane recovery for self-managed clusters and gives stronger cluster-state recovery coverage. It is useful when recovering the cluster itself is part of the DR plan.
It still needs separate PVC protection for stateful workloads.
This strategy uses CSI snapshots or other storage-level snapshot methods to protect persistent volumes. It supports fast recovery and efficient backup orchestration. It works well when the storage stack is snapshot-capable and well-integrated.
Teams should still evaluate whether they also need backup copies outside the source storage environment.
This is one of the strongest general-purpose strategies for production Kubernetes. The snapshot provides a stable source and fast local recovery option. The backup copy provides durable retention and better survival against storage failure or broader infrastructure incidents.
This model supports many real-world objectives: fast local restore, durable long-term retention, survival of storage-level failures, ransomware resilience, and disaster recovery outside the source failure domain.
Some environments need low-RTO recovery paths that combine backup with storage replication and cross-cluster failover. In these designs, the backup system restores Kubernetes resources while storage replication provides the volume-level data path.
This is useful for organizations with stricter service continuity requirements and storage platforms that expose remote replication capabilities.
A serious backup product for Kubernetes should support the operational shape of production clusters. That means more than capturing YAML and taking snapshots.
Here’s what to look for.
The platform should protect Kubernetes objects and PVC data in one coordinated workflow. If the product only handles resources or only handles data, you inherit more manual recovery work.
Recovery should support multiple scopes, including full cluster, namespace, individual resource, and volume-level restore.
Granularity matters because many recovery events are small and targeted.
A good product should restore to a different cluster, including migration and DR use cases. This is crucial when the source cluster is unavailable or when workloads need to be moved during upgrades, consolidation, or platform transitions.
The platform should support CSI snapshot workflows, understand the difference between snapshot and backup copy, and document supported PV types clearly. Storage behavior is too important for hand-wavy language.
Look for application hooks, database-aware workflows, and VM guest coordination where relevant. Restore quality matters as much as backup completion.
Backup should be schedulable, policy-driven, and manageable through API or infrastructure-as-code workflows. DevOps teams need repeatable operations, not purely manual console steps.
Important features include immutable backups, RBAC, encryption support, access controls, and deployment models that fit regulated or sovereign environments.
Some organizations want SaaS management. Some need self-hosted control. Some need air-gapped deployments. A Kubernetes backup solution should fit the operating model of the organization.
CloudCasa aligns well with the way Kubernetes backup works in production because it addresses the real layers of recovery instead of narrowing the problem to a single mechanism.
At the Kubernetes layer, CloudCasa protects cluster resources, namespaces, and individual resources. It also includes etcd backup support as part of the broader recovery picture. For persistent data, it supports snapshot-based protection for supported volumes and can copy volume data to backup storage for durable retention. That combination matters because production recovery usually needs both orchestration state and application data.
For storage workflows, CloudCasa supports CSI snapshot-based backups and clearly distinguishes snapshot-only operations from copy-to-backup-storage workflows. That is exactly how engineers should think about protection design. Fast local recovery and durable off-cluster retention serve different purposes.
For consistency and stateful workloads, CloudCasa supports application hooks for pre-backup, post-backup, and post-restore actions. It also supports guest-aware consistency workflows for KubeVirt environments through QEMU guest agent integration. That makes the platform relevant for both containerized stateful apps and virtualized workloads running on Kubernetes.
For restore flexibility, CloudCasa supports restore at the cluster, namespace, resource, and volume level. It also supports migration and replication workflows, which is useful for platform teams handling cross-cluster movement, DR exercises, and environment transitions.
For more advanced DR scenarios, CloudCasa also introduced DR for Storage support, enabling recovery workflows that integrate storage replication with Kubernetes resource restoration. For teams that need faster service continuity paths, this is a meaningful capability.
CloudCasa’s current feature set is also strong in adjacent areas that matter to platform teams:
CloudCasa supports a wide range of Kubernetes distributions and managed services, including major environments such as EKS, AKS, GKE, OpenShift, Rancher, and VMware Tanzu. That matters for organizations with mixed environments or platform transitions in flight.
CloudCasa supports backup and restore for KubeVirt, OpenShift Virtualization, and SUSE Virtualization workloads. It also supports VM file-level restore, which is useful when the goal is to recover specific files without restoring a full virtual machine.
CloudCasa supports object storage targets and has added support for NFS backup targets and SMB backup targets. It also introduced backup compression, which helps with transfer and storage efficiency.
Immutable backup support is important for ransomware resilience and retention governance. Backup immutability strengthens the recovery posture by protecting stored backup copies from tampering during the retention period.
When connected to cloud accounts, CloudCasa can auto-discover managed Kubernetes clusters and preserve cloud-related cluster parameters to support restore workflows. That reduces the manual burden during rebuild scenarios.
This is a big one for DevOps teams. CloudCasa supports automation through API and CLI workflows, fine-grained RBAC, and Terraform integration. Backup should live comfortably inside a modern platform engineering workflow, and these features help get it there.
CloudCasa is available as a SaaS platform and as a self-hosted solution. That helps organizations with different operational, compliance, and sovereignty requirements. Self-hosted deployment is especially relevant in regulated, controlled, or air-gapped environments.
For workloads that rely on cloud databases, CloudCasa also supports backup workflows for services such as Amazon RDS. That is useful because many Kubernetes applications depend on state that exists outside the cluster boundary.
CloudCasa is a strong fit for DevOps and SRE teams running stateful workloads in production, platform teams managing multiple clusters or mixed distributions, and organizations with compliance, ransomware-resilience, or disaster recovery requirements.
In practice, the product fits especially well when the backup conversation includes PVC data, recovery granularity, automation, and real DR planning. That is where lightweight approaches start to fray.
Kubernetes backup is a layered discipline. etcd matters because it protects cluster state. PVC protection matters because application data lives there. CSI snapshots matter because they provide fast, storage-aware recovery primitives. Backup copies matter because durable retention and disaster recovery require data outside the source failure domain. Consistency matters because successful restore is the actual goal.
A good backup strategy reflects that reality.
A good backup product does too.
CloudCasa stands out because it supports the full shape of Kubernetes recovery: resources, etcd, PVCs, snapshot workflows, backup copies, migration, replication, hooks, VM support, automation, and deployment flexibility. For teams searching for a Kubernetes backup solution that fits production operations, it is a strong choice.
Back up cluster resources, persistent volumes, KubeVirt VMs, and cloud-native workloads from a single platform built for modern Kubernetes operations.
Try CloudCasa with a 60-day free trial and validate backup and restore workflows in your own Kubernetes environment.
Most of the tooling that exists today wasn’t built with Kubernetes in mind. It was built for something else and extended toward it. That gap shows up in real ways: in how recovery works, in how pricing scales, and in how much operational overhead teams end up carrying just to keep things protected.
This is the problem space that Paweł Staniec, Head of Technology and Alliances for the EMEA region at CloudCasa, walks through in the latest episode of Partner Power 5, the Red Hat partner podcast.
OpenShift Virtualization is changing how teams think about their workload estate. Virtual machines are moving onto the same platform as containers. That consolidation makes operational sense, but it creates a protection challenge that’s genuinely different from either traditional VM backup or pure Kubernetes backup.
You’re no longer dealing with one workload type in one environment. You’re dealing with VMs, containers, and stateful applications, potentially spread across on-prem and managed cloud environments like ROSA, ARO, or OpenShift Dedicated. Recovery needs to be granular enough to handle all of it, and consistent enough that engineers aren’t learning a different workflow for every scenario.
File-level restore, namespace-level restore, cross-cluster migration: for teams running production workloads, these aren’t advanced features. They’re baseline expectations.
CloudCasa is Red Hat certified, which matters beyond the badge. Certification means the integration has been validated against OpenShift’s specific architecture, including storage, RBAC, and multi-tenancy. Role-based access control and multi-tenancy support are built in. Recovery operations follow a consistent model whether you’re restoring a single file or an entire cluster.
CloudCasa is available as both SaaS and self-hosted, so teams can choose the deployment model that fits their security posture. That flexibility matters particularly in regulated industries or air-gapped environments where SaaS isn’t an option.
Node-based pricing is straightforward to model and budget. It doesn’t penalize teams for storing more data or running more workloads per cluster. Traditional backup licensing models made sense when VM counts were the primary variable. In a Kubernetes environment where workload density shifts quickly, those models tend to produce cost surprises.
Paweł goes deeper on the specific scenarios where a Kubernetes-native approach changes the recovery experience, how teams use CloudCasa to support workload migration during modernization, and what the Red Hat certification process actually validated.
If you’re evaluating data protection options for OpenShift, or already running something and finding gaps, the conversation is practical and grounded. Find it now on the Partner Power 5 feed.
VMware exits rarely fail because engineers cannot move bits. They fail because the organization discovers, mid-flight, that it cannot reliably recover those bits once they land somewhere new.
If you are migrating from vSphere to Red Hat OpenShift Virtualization using Migration Toolkit for Virtualization (MTV), there is a high-leverage move that often gets postponed until “after the first wave”: set up VM-aware data protection on the OpenShift side first, then migrate.
This is not about buying insurance for hypothetical disasters. It is about turning migration into a controlled, testable delivery pipeline where rollback is a practiced operation, not a prayer.
This guide reflects the current state of MTV 2.10, OpenShift Virtualization 4.21, and CloudCasa’s VM and cluster backup capabilities as of February 2026.
Before we get tactical, it helps to name the three systems you are implicitly trusting during a vSphere-to-OpenShift Virtualization migration.
MTV is delivered as an OpenShift Operator and drives migrations through custom resources and a UI workflow. It supports cold and warm migrations. MTV 2.10 builds on the storage offloading capabilities introduced in 2.9, delegating disk copy to the underlying storage system for dramatically faster migrations with compatible storage partners. Raw copy mode handles VMs with unsupported guest operating systems.
OpenShift Virtualization is KubeVirt-based. VM disks are typically backed by PVCs, and snapshot integrity depends on CSI snapshot support. For running VMs, the QEMU guest agent coordinates filesystem quiescing during snapshot operations to deliver application-consistent recovery points.
A Kubernetes-native backup tool that only captures manifests is not enough for VM recovery. You need VM-aware selection and restore semantics that pull in the VM object plus the associated DataVolumes, PVCs, secrets, and supporting resources as a coherent unit.
CloudCasa provides backup, restore, and migration services for VMs running on KubeVirt-based platforms. Compatibility has been verified with KubeVirt v1.0.0 and above, CDI v1.57.0 and above, and Red Hat OpenShift Virtualization. VM detection is automatic, and CloudCasa uses the QEMU guest agent to execute freeze/unfreeze hooks for application-consistent online backups.
Migration plans tend to validate that VMs can be moved and booted. That is necessary, not sufficient.
Standing up protection early forces you to validate the parts that fail in production: whether your storage class actually supports snapshots consistently under load, whether restores recreate the right disk objects and bindings, whether you have a clean path for restoring a VM into an isolated namespace for testing, and whether the guest agent setup produces consistent online snapshots.
Red Hat’s guidance is clear: for the highest integrity snapshots of running VMs, install the QEMU guest agent. Without it, you get crash-consistent snapshots at best.
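One way to audit guest agent coverage before cutover is to list VM instances along with their agent status. The namespace here is an example; the `AgentConnected` condition is what KubeVirt reports when the agent is running in the guest.

```shell
# List VMIs and whether the QEMU guest agent is connected in each guest.
# Namespace "finance" is an example.
oc get vmi -n finance -o 'custom-columns=NAME:.metadata.name,AGENT:.status.conditions[?(@.type=="AgentConnected")].status'
```

Any VM showing an empty or `False` agent column will fall back to crash-consistent snapshots.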
MTV handles orchestration well, but you will still hit edge cases: driver conversion behaviors, mapping mismatches, and workload-specific surprises. If protection is already installed and tested on the target cluster, you can enforce a simple policy: every VM that lands and passes validation gets a recovery point on the new platform immediately. That turns “we can re-run the migration” into “we can restore on the target now.”
The best migrations feel boring because they are rehearsed. With a VM-aware backup and restore system in place before the wave, you can migrate a representative set of VMs, take a recovery point, restore into a separate namespace, and validate boot, data, and networking without touching the production landing zone.
Most VMware exits are phased. You will run workloads in both places for a while. That creates a vulnerable window if OpenShift Virtualization is receiving workloads faster than your protection program is being built. With protect-first, you close that gap and maintain immutable recovery points from day one, which is critical for ransomware resilience.
Backups stress storage and APIs. The right time to discover that snapshot operations trigger latency spikes, or that your object storage throttles hard, is during a controlled test, not during wave two when leadership is watching a Gantt chart bleed.
VM-level protection is essential, but it is not the whole story. Your OpenShift cluster stores every API object, every configuration, every secret, and every VM definition in etcd. If etcd is corrupted or lost, your cluster cannot run. Period.
This is the difference between recovering individual workloads and recovering the ability to run workloads at all. A migration project that protects VMs but ignores the control plane is building on sand.
etcd is the key-value datastore that holds the entire state of your Kubernetes cluster: namespaces, deployments, services, secrets, ConfigMaps, RBAC policies, custom resources including VirtualMachine definitions, network policies, storage classes, and PersistentVolumeClaims. Lose etcd, and you lose the cluster’s memory of what should be running and how.
During a VMware exit, your OpenShift cluster is under construction. You are adding namespaces, network mappings, storage configurations, and migrated VM definitions at a rapid pace. A control plane failure mid-migration without a valid etcd backup means rebuilding not just the cluster, but all of the migration work you have already completed.
CloudCasa backs up Kubernetes cluster resources including etcd as part of its standard protection workflow. This gives you a single pane of glass for VM protection, persistent volume backup, and cluster state recovery.
An etcd backup that cannot be restored is not a backup. It is a comfort object. Before your first migration wave, verify that backups are produced on schedule, that you can download a snapshot file, and that the file is structurally valid.
Quick validation commands after downloading an etcd backup:
```shell
# Verify the backup file decompresses cleanly
gunzip -t etcd-snapshot.db.gz && echo "Archive OK"

# Decompress and check snapshot status
gunzip etcd-snapshot.db.gz
etcdutl snapshot status etcd-snapshot.db --write-out=table
```
If the snapshot status shows a valid hash, revision count, and key count, your backup is structurally sound. This ten-minute check can save you days of rebuilding a corrupted cluster.
With CloudCasa, you get a unified approach: etcd and cluster resource backups protect the control plane, VM-aware backups protect individual workloads, and persistent volume backups protect application data. This means you can recover from a single VM failure, a namespace deletion, or a complete cluster loss from the same management interface.
For a VMware migration, this layered protection is not optional. You are building a new platform while running production workloads on it. Protect the platform, not just the workloads.
A protect-first approach lives or dies on a short list of prerequisites.
For VM snapshots and snapshot-based backup flows, your storage provider needs CSI snapshot support. Red Hat documents snapshots as relying on the Kubernetes Volume Snapshot API through CSI.
Quick checks:
```shell
oc get volumesnapshotclass
oc get sc
```
If volumesnapshotclass is empty or your VM disk storage class lacks snapshot support, fix that before pretending you have a recoverable virtualization platform.
For running VMs, snapshot integrity improves when the QEMU guest agent can freeze and thaw filesystems during backup or snapshot operations. Red Hat describes the freeze process and application notification behavior, including Windows VSS integration.
Practical policy that works: Tier 1 workloads require guest agent before cutover. Tier 2 workloads require guest agent before they are considered stable. Tier 3 workloads can accept crash-consistent recovery points during early waves.
Most teams forget this until they need it. Decide early which team roles can restore VMs, where restores are allowed to land (production namespace vs restore-lab namespace), and how secrets and config are handled during restore. This is less about security theater and more about avoiding a restore that accidentally stomps on an active workload.
Run your first etcd backup, download it, decompress it, and verify it is valid before you start migrating production workloads. This is a ten-minute task that validates your ability to recover the entire cluster.
MTV encourages wave-based migrations. Your protection strategy should follow the same shape.
The pattern: MTV migrates VMs into OpenShift Virtualization. You validate the workload on the target. You label the VM as migrated and validated. Backup selection uses labels, so new arrivals automatically get protected.
Example labeling approach:
```shell
# List VMs in the target namespace
oc get vm -n finance -o name

# Label each VM that passed validation
for vm in $(oc get vm -n finance -o name); do
  oc label "$vm" -n finance migration.wave=wave1 \
    app.tier=tier1 validated=true
done
```
Now your backup tool can select validated=true and migration.wave=wave1 and stay aligned with how the migration is actually managed. CloudCasa supports selecting KubeVirt VMs by namespace, labels, or individually through its VMs/PVCs tab, with all associated resources automatically included.
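Before wiring those labels into a backup policy, it is worth dry-running the selector to confirm it matches exactly the VMs you expect (labels as in the example above):

```shell
# Confirm which VMs the label selector will pick up, across all namespaces
oc get vm -A -l 'validated=true,migration.wave=wave1'
```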
Even if you plan to rely on a managed protection layer, you should run a native VM snapshot once. It validates the basics and gives you an early warning if storage or guest agent behavior is off.
Example VirtualMachineSnapshot manifest (for OpenShift 4.21, using the v1beta1 API version):
```yaml
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineSnapshot
metadata:
  name: demo-snap-01
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: demo-vm
```
Create and verify:
```shell
oc create -f demo-snap-01.yaml
oc get virtualmachinesnapshot demo-snap-01 -o yaml
```
Look for status.readyToUse: true. If snapshots do not reach this state, that is not a “backup tool problem.” It is your platform telling you the foundations are shaky.
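Rather than polling the YAML by hand, a recent `oc`/`kubectl` (v1.23+) can block on that field directly:

```shell
# Block until the snapshot reports readyToUse, or fail after 5 minutes
oc wait virtualmachinesnapshot/demo-snap-01 \
  --for=jsonpath='{.status.readyToUse}'=true --timeout=300s
```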
Backups without restores are expensive comforting stories. Before the first serious MTV wave, run this drill and document the results: back up a representative VM, restore it into an isolated namespace, and validate boot, data, and networking.
CloudCasa’s restore wizard supports VM-oriented selection and multiple restore transforms including clearing MAC addresses, generating new firmware UUIDs (to avoid licensing conflicts if the original VM is still active), and controlling the VM run strategy on restore.
Storage Class Surprises
“Snapshots are supported” is not a feeling. It is a property of your CSI driver and how it behaves under your workloads. Red Hat explicitly ties snapshot support to CSI drivers and the Volume Snapshot API. Test under realistic load before migration day.
Guest Agent Drift
You start with good templates and then custom images show up. Put guest agent checks into your VM standards before wave three turns into image chaos. Consider adding guest agent validation to your MTV post-migration hooks.
Restore Collisions
If restores are allowed to land in the same namespaces as production VMs without guardrails, you will eventually restore into an occupied space. Prevent this with namespace targeting rules and process. CloudCasa offers options to clear MAC addresses and regenerate firmware UUIDs specifically to avoid these conflicts.
etcd Backup Neglect
Teams focus on VM protection and forget the control plane. An etcd corruption during a migration wave means rebuilding the cluster and re-running every completed migration. Include etcd backup verification in your day-zero checklist and your ongoing operational runbook.
Engineers hate buying tools on vibes. They want proof. CloudCasa offers a free trial (no payment required), which fits neatly into a migration runway.
A pragmatic 60-day plan: deploy protection on the target cluster, validate snapshot and guest agent behavior with a pilot workload, run a restore drill into an isolated namespace, then execute your first MTV wave with backup policies already in place.
MTV moves workloads. OpenShift Virtualization runs them. Protection makes them survivable.
Setting up VM-aware data protection on OpenShift Virtualization before you run MTV at scale is a disciplined way to reduce migration risk. It validates snapshot integrity, forces restore practice early, maintains RPO and RTO coverage from day one, and turns each migration wave into something you can repeat, measure, and recover from.
But do not stop at VM protection. CloudCasa’s ability to back up etcd and cluster resources means you are protecting the platform itself, not just the workloads running on it. A corrupted etcd with no valid backup turns a VM migration into a full cluster rebuild. That is a risk no migration plan should accept.
A VMware exit is a logistics project disguised as a technical one. Protect-first is how you keep the logistics honest.
Learn more: cloudcasa.io/backup-recovery-kubevirt-red-hat-openshift-suse-harvester/