Cloudbase Solutions

Webinar – Virtualization revitalized: How Oracle and Cloudbase Solutions empower you to do more for less

Chris Valean — Thu, 15 Jan 2026 12:50:26 +0000

Modernizing your infrastructure shouldn’t feel like a leap into the unknown. With the right combination of tools, cloud migration becomes not only manageable but genuinely strategic. That’s exactly what happens when you bring Cloudbase Coriolis together with Oracle Linux Virtualization Manager (OLVM).

Coriolis streamlines and automates the migration and replication of virtual machines across heterogeneous environments. Whether you’re moving workloads from VMware, Hyper‑V, AWS, or other platforms, Coriolis handles the heavy lifting with precision and reliability.

What makes this even more compelling is its integration with Oracle OLVM, Oracle’s powerful KVM‑based virtualization platform. OLVM offers enterprise‑grade performance, strong security, and a modern management interface—making it an ideal destination for organizations looking to optimize costs, improve performance, or consolidate infrastructure. When paired with Coriolis, migrating into OLVM becomes a smooth, predictable, and fully automated experience.

If you want to see this in action, our colleague Adrian Dumitrache, together with John Priest from Oracle, has presented a great walkthrough that demonstrates how Coriolis enables seamless migrations into Oracle OLVM and other Oracle Cloud virtualization technologies.

Watch the on-demand webinar here: https://go.oracle.com/LP=150124

If you’d like more information or want to explore a demo, feel free to reach out anytime in the Contact us section. Every migration project has its own specifics, and our team is here to discuss the best approach for your cloud.

The post Webinar – Virtualization revitalized: How Oracle and Cloudbase Solutions empower you to do more for less appeared first on Cloudbase Solutions.

Hassle-free Migration and Disaster Recovery from VMware vSphere to Proxmox VE with Coriolis

Nashwan Azhari — Mon, 08 Apr 2024 14:08:31 +0000

Following Broadcom’s recent acquisition of VMware, the level of uncertainty within the private virtualized infrastructure space has reached an all-time high. With many companies now put on the spot by unexpected increases to platform procurement and upgrade costs, the need for an Open Source and feature-equivalent alternative to VMware has never been greater.

In this post, we’ll be giving a cursory introduction to Proxmox VE’s features and characteristics, and later showing how easy it is to migrate your VMware virtual infrastructure to Proxmox VE using Cloudbase’s in-house-developed Cloud Migration and Disaster Recovery as a Service solution: Coriolis.

Proxmox VE: the Open Source KVM node manager

The Proxmox Virtual Environment (or PVE for short) is a fully Open Source Hyper-Converged Infrastructure solution for on-premises infrastructure virtualization developed by Proxmox Server Solutions GmbH.

While not as feature-rich as more complete Infrastructure-as-a-Service (IaaS) type solutions like OpenStack, PVE brings to the table a very grounded approach to easily installing, managing, and upgrading numerous independent Open Source solutions packaged together under a single umbrella platform.

Proxmox VE Manager

Key Open Source components of PVE include:

Linux Kernel Virtual Machine (KVM) type one hypervisor: the industry standard for Linux-based hardware virtualization which, with the aid of QEMU, can run a wide array of virtual guest configurations
Linux Containers (LXC): a containerization solution which brings together the numerous native process/resource isolation mechanisms offered by the Linux kernel into a unified container runtime
OpenZFS: the premiere Linux implementation of the highly reliable, scalable, and feature-rich ZFS filesystem, providing flexible local storage to PVE nodes
Ceph software defined storage: a robust storage platform designed to provide block, object, and file storage features, even on commodity hardware and disks

Coriolis: CMaaS and DRaaS made easy

Coriolis is the Cloud Migration and Disaster Recovery as a Service (CMaaS/DRaaS) solution developed in-house by Cloudbase Solutions.

While our main goal with Coriolis during its original conception was to prevent vendor lock-in for our customers by offering Lift-and-Shift-type migration capabilities, it has quickly evolved into a fully-fledged Disaster Recovery solution capable of seamlessly replicating virtual infrastructure between most of every Public/Private IaaS platform you can think of.

Coriolis Architectural Diagram

Components and feature set

Coriolis lends much of its architectural inspiration to the OpenStack ecosystem, with its design intrinsically offering the following notable characteristics:

highly scalable: all of Coriolis’ components are independently horizontally-scalable, which allows Coriolis to easily accommodate usecases both small and large
fault tolerant: Coriolis’ components offer both high redundancy at every level, as well as fault tolerance when it comes to interacting with the source/target platforms every step of the way, thus preventing any transient infrastructure-level issues from tripping up business critical transfers
easy to use: Coriolis offers both a user-friendly Web-based Graphical User Interface and bundled command line client for easy management
straightforward integration: Coriolis’ easy-to-consume HTTP-based REST API makes integration with existing monitoring infrastructure and tooling a breeze
data security and integrity: all datapaths between the source and target platforms go through tightly secured error-checked connections, and any sensitive information such as platform credentials are securely stored within OpenStack’s Barbican secret manager

Coriolis vSphere VM Selection Screen

Agent-less, non-invasive, and non-disruptive

Coriolis is designed to be usable by any end-user of the source/target platforms.

This implies that Coriolis does not demand any access beyond what a normal end-user might have, such as:

no need to install an agent in the guest you’re migrating:
Coriolis performs Lift-and-Shift type transfers of your virtual infrastructure entirely from the platform level, not the guest level
no need for admin accounts or access to underlying infrastructure:
Coriolis strictly leverages normal user accounts and platform-level API features to perform its transfer operations, so handing Coriolis invasive admin access to your control plane is not required
no downtime for your virtual infrastructure:
Coriolis is designed to perform all its transfer operations with zero downtime to your existing virtual infrastructure on the source platform, you can sync and deploy the new infrastructure on the target and perform the cutover whenever you feel comfortable

Coriolis’ edge over similar solutions

Coriolis offers notable advantages over most similar cross-platform guest replication solutions to Proxmox currently on the market, including the recently announced additions to the Proxmox Import Wizard.

Coriolis Platform Selection Screen

Of special note are Coriolis’ abilities to:

support a wide array of sources: apart from VMware vSphere, Coriolis allows migrating guests to Proxmox from a large selection of source platforms, from standalone ESXi hosts with vSAN, all the way to public clouds such as AWS or Azure.
Have a look at the current list of platforms supported by Coriolis
adapt migrated guest operating systems for their new home: Coriolis automatically takes steps to ensure the migrated virtual infrastructure is perfectly suited for its new environment
Everything from installing appropriate drivers, adapting the guest’s networking configuration, and injecting any add-ons required for full integration with Proxmox — like the VirtIO drivers for Windows guests, or the QEMU guest agent for management, for example — is transparently handled for you!
replicate guest data with zero adverse effects to business continuity: Coriolis’ agent-less data replication approach and careful design considerations to be as least disruptive as possible enables it move data with no required downtime to your virtualized applications

Coriolis in Action

Here’s a showcase of how to set up full Disaster Recovery between VMware vSphere and Proxmox VE in mere minutes using Coriolis:

Try Coriolis out!

If you’d like to get hands-on with Coriolis’ features for yourself, please contact us for a demo and trial appliance!

While we get back to you, feel free to have a look over:

Coriolis’ overview page
VMware plugin‘s documentation
Proxmox plugin‘s documentation

The post Hassle-free Migration and Disaster Recovery from VMware vSphere to Proxmox VE with Coriolis appeared first on Cloudbase Solutions.

K8S Bare Metal deployment Part 3 – Workload Cluster

Adrian Vladu — Mon, 02 Oct 2023 10:00:00 +0000

Hello and welcome back to the third part of our series on Kubernetes bare metal deployment – deploying the Kubernetes Workload Cluster.

In Part 2, we already prepared the environment for the deployment, we just need now to start the deployment of our Workload Kubernetes cluster on two bare metal servers (two ARM64 Ampere Computing ALTRA Mt. Collins servers).

Prerequisites

Before starting the deployment, we need to take some time discussing the status quo of ARM64 support on the different open source projects we use for the automation and the necessary changes that are in the process of being upstreamed.

While k3d, ArgoCD, helm, clusterctl, Cilium worked out of the box on both ARM64 and AMD64, Bird, Tinkerbell’s Hook and Boots, Ceph, Kubevirt, virtctl and virt-vnc, Flatcar, Cluster API image builder — all required some code changes or building the missing ARM64 Docker image:

Bird – missing Docker image for ARM64
Tinkerbell Hook – missing RTC, SAS and XHCI in its Linux Kernel Configuration
Tinkerbell Boots – improve iPXE boot times to not wait for all NICs to be tried
Ceph on Flatcar requires mon_osd_crush_smoke_test=false, otherwise mons enter an infinite loop
Kubevirt docker images for ARM64 were broken since March 2023 (were in fact, for AMD64)
virtctl binary is not released for ARM64 (manual building is required)
virt-vnc – missing Docker image for ARM64
Flatcar – missing VirtIO GPU driver in its Linux Kernel Configuration
Cluster API image builder – no support to build ARM64 images

All the above issues have either been already solved upstream or we have patches that have been sent upstream and are in review process.

With the issues above solved, we can start the preparation for deployment.

Hardware definitions

First, we need to define Tinkerbell Hardwares and Machines.

Hardware is the CRD that has the information about the bare metal server (architecture, storage, networking), and Machine is the CRD that has the information about the BMC (IP, username, password).

argocd app sync hardware
argocd app sync machine

Deploying the workload cluster

Now we are ready to initialize the Cluster API workflows that will end up creating the K8S Workload Cluster:

until argocd app sync workload-cluster;  do sleep 1; done
clusterctl get kubeconfig kub-poc -n tink-system > ~/kub-poc.kubeconfig

until kubectl --kubeconfig ~/kub-poc.kubeconfig get node -A; do sleep 1; done
until kubectl --kubeconfig ~/kub-poc.kubeconfig get node sut01-altra; do sleep 1; done
until kubectl --kubeconfig ~/kub-poc.kubeconfig get node sut02-altra; do sleep 1; done

As this stage will take a while (around 10 minutes), ArgoCD Web UI can be used to visualize the status of the operations (see Part 2 for more details)

Adding the workload cluster in ArgoCD

Once our 2 node Workload cluster has been created, we can add it to ArgoCD for further automation and synchronize the Workload Cluster Applications:

argocd cluster add kub-poc-admin@kub-poc \
   --kubeconfig ~/kub-poc.kubeconfig \
   --server argo-cd.mgmt.kub-poc.local \
   --insecure --yes

argocd app create workload-cluster-apps \
    --repo [email protected]:cloudbase/k8sbm.git \
    --path applications/workload --dest-namespace argo-cd \
    --dest-server https://kubernetes.default.svc \
    --revision "main" --sync-policy automated

Configuring the CNI

At this moment, our K8S Workload cluster is the most basic K8S cluster there is, it has no networking or storage services. As Cilium is yet to be installed, the coredns pods are still in Pending status.

The next step is to install the CNI (Container Network Interface) using Cilium with BGP external connectivity. At this moment, we need to install on the K8S Management Cluster, the Bird host network container, that will allow us to connect to the External IPs from the K8S Workload Cluster.

argocd app sync bird
until kubectl get CiliumLoadBalancerIPPool --kubeconfig ~/kub-poc.kubeconfig || (argocd app sync cilium-manifests && argocd app sync cilium-kub-poc); do sleep 1; done

Storage Configuration

Once we have the CNI up and running, we can move to the CSI (Container Storage Interface) installation, leveraging Rook and Ceph. Ceph OSDs are configured to use the second NVME disk, an Intel SSD, on both Altra nodes. For this to happen, we need to untaint the Control Plane first and then clean up the secondary NVME disks.

kubectl --kubeconfig ~/kub-poc.kubeconfig patch node sut01-altra -p '{"spec":{"taints":[]}}' || true

argocd app sync rook-ceph-operator
until kubectl --kubeconfig ~/kub-poc.kubeconfig wait deployment -n rook-ceph rook-ceph-operator --for condition=Available=True --timeout=90s; do sleep 1; done

KUBECONFIG=~/kub-poc.kubeconfig kubectl node-shell sut01-altra -- sh -c 'export DISK="/dev/nvme1n1" && echo "w" | fdisk $DISK && sgdisk --zap-all $DISK && blkdiscard $DISK || sudo dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync && partprobe $DISK && rm -rf /var/lib/rook'

KUBECONFIG=~/kub-poc.kubeconfig kubectl node-shell sut02-altra -- sh -c 'export DISK="/dev/nvme1n1" && echo "w" | fdisk $DISK && sgdisk --zap-all $DISK && blkdiscard $DISK || sudo dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync && partprobe $DISK && rm -rf /var/lib/rook'

argocd app sync rook-ceph-cluster

until kubectl  --kubeconfig ~/kub-poc.kubeconfig -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status; do sleep 1; done

The output of the rook-ceph namespaced pods and ceph status command should look like this:

After around 10 minutes, all the rook-ceph pods are nicely running / completed succesfully and we have two functional managers, 3 monitors (for quorum) and two OSDs ready to be used, amounting to 1.8TiB of free space.

These were the steps to automate the deployment of the K8S Workload cluster. The entire process takes around half an hour using the hardware of choice.

Next up in the series, we will proceed to validate the K8S Workload Cluster in Part 4.

The post K8S Bare Metal deployment Part 3 – Workload Cluster appeared first on Cloudbase Solutions.

Kubernetes Bare Metal Deployment Part 2: Setting Up the Kubernetes Management Cluster

Adrian Vladu — Tue, 08 Aug 2023 08:45:00 +0000

Hello again! Welcome back to our continuing journey on Kubernetes bare metal deployment. Today, we’ll delve deeper into establishing the Kubernetes Management Cluster.

Understanding the Role of the K8S Management Cluster

Before we dive deep into the Kubernetes bare metal deployment intricacies, it’s essential to understand the heart of our operations: the K8S Management Cluster. Think of this cluster as the control tower in an airport. Just as the tower oversees and manages every plane taking off, landing, or merely taxiing around, the K8S Management Cluster orchestrates the creation, maintenance, and monitoring of other Kubernetes clusters.

Now, why do we need it? There are a few pivotal reasons:

Centralized Control: With a management cluster, you can create, update, or delete multiple Kubernetes workload clusters from a single point of command.
Isolation: It keeps the administrative tasks separate from the applications and workloads. This separation ensures that any issues in the workload clusters don’t affect the management functionalities.
Scalability: As your infrastructure grows, managing each cluster individually can become a daunting task. A management cluster simplifies this by scaling operations across numerous clusters seamlessly.
Uniformity: Ensuring every cluster is set up and maintained using consistent configurations and policies becomes a breeze with a central management cluster.

Hardware

Let’s talk about hardware. Here’s what we suggest as minimum requirements:

Storage: At least 50GB for k3d.
RAM: 16GB or more (Primarily for services like Prometheus).
CPU Cores: 4-8 should suffice.

The foundation? Ubuntu 22.04 Server ARM64, get it here.

Preparation: Gathering the Essentials

Before diving into Kubernetes deployment, let’s stock up on the needed binaries. We’re looking at k3d, helm, argocd, and clusterctl.

Here’s the command line magic to get these:

# Determining the architecture
ARCH=$(dpkg-architecture -q DEB_BUILD_ARCH)

# Grabbing k3d
wget https://github.com/k3d-io/k3d/releases/download/v5.5.1/k3d-linux-${ARCH} -O k3d
chmod a+x k3d
mv k3d /usr/local/bin/

# Helm, the package manager for K8s
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

# ArgoCD for Continuous Delivery
wget https://github.com/argoproj/argo-cd/releases/download/v2.6.8/argocd-linux-${ARCH} -O argocd
chmod a+x argocd
mv argocd /usr/local/bin/

# Clusterctl for Cluster API
wget https://github.com/kubernetes-sigs/cluster-api/releases/download/v1.4.2/clusterctl-linux-${ARCH} -O clusterctl
chmod a+x clusterctl
mv clusterctl /usr/local/bin/

Unfolding the Kubernetes Magic with k3d

With k3d, we’ll set up our K8S Management Cluster. A few pointers:

Tinkerbell’s Boots needs the load balancer off.
We’ll need host networking (again, credit to Boots) and host pid mode.

k3d cluster create --network host --no-lb --k3s-arg "--disable=traefik,servicelb" \
--k3s-arg "--kube-apiserver-arg=feature-gates=MixedProtocolLBService=true" \
--host-pid-mode

mkdir -p ~/.kube/
k3d kubeconfig get -a >~/.kube/config
until kubectl wait --for=condition=Ready nodes --all --timeout=600s; do sleep 1; done

Automation with ArgoCD

ArgoCD, our preferred automation tool, comes next. For this, an unused IP from our static range will be used for ArgoCD services. We’ve chosen 10.8.10.133 and our NIC named “enp1s0f0np0”.

All necessary configurations are available in the cloudbase/k8sbm repository.

# Getting the repository
git clone https://github.com/cloudbase/k8sbm
cd k8sbm

Friendly Reminder: Ensure every repository code change gets pushed since ArgoCD relies on the remote repository, not the local one.

We’ll also utilize existing helm repositories and charts for automation:

# Helm charts for various services
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add argo-cd https://argoproj.github.io/argo-helm
helm repo add kube-vip https://kube-vip.github.io/helm-charts/
helm repo update

# Additional helm commands for setting up services
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --version 4.5.2 --namespace ingress-nginx \
  --create-namespace \
  -f config/management/ingress-nginx/values.yaml -v 6
until kubectl wait deployment -n ingress-nginx ingress-nginx-controller --for condition=Available=True --timeout=90s; do sleep 1; done

helm upgrade --install kube-vip kube-vip/kube-vip \
  --namespace kube-vip --create-namespace \
  -f config/management/ingress-nginx/kube-vip-values.yaml -v 6

helm upgrade --install argo-cd \
  --create-namespace --namespace argo-cd \
  -f config/management/argocd/values.yaml argo-cd/argo-cd
until kubectl wait deployment -n argo-cd argo-cd-argocd-server --for condition=Available=True --timeout=90s; do sleep 1; done
until kubectl wait deployment -n argo-cd argo-cd-argocd-applicationset-controller --for condition=Available=True --timeout=90s; do sleep 1; done
until kubectl wait deployment -n argo-cd argo-cd-argocd-repo-server --for condition=Available=True --timeout=90s; do sleep 1; done

Post-deployment, ArgoCD’s dashboard is accessible by updating host mappings. Here’s how:

echo "10.8.10.133 argo-cd.mgmt.kub-poc.local" | sudo tee -a /etc/hosts

For ArgoCD access, the default username is admin. Retrieve the password via CLI:

pass=$(kubectl -n argo-cd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
echo $pass

To wrap up, configure ArgoCD with our GitHub repository and introduce the management applications for our K8S Management Cluster.

argocd login argo-cd.mgmt.kub-poc.local --username admin --password $pass --insecure

argocd repo add [email protected]:cloudbase/k8sbm.git \
    --ssh-private-key-path ~/.ssh/id_rsa
argocd app create management-apps \
    --repo [email protected]:cloudbase/k8sbm.git \
    --path applications/management --dest-namespace argo-cd \
    --dest-server https://kubernetes.default.svc \
    --revision "main" --sync-policy automated

argocd app sync management-apps
argocd app get management-apps --hard-refresh

Let’s check the status of ArgoCD from the WEB UI, it should look similar to this

Taking our journey to the next level, we’ll be diving into the installation of the Tinkerbell stack. This stack consist of an array of services: from the Tink server and controller to Boots, Hegel, and Rufio. The cherry on top? We’ll leverage ArgoCD for the deployment, ensuring a streamlined process. Giving a nod to Tinkerbell’s HTTP services, we’ve earmarked the IP 10.8.10.130 from our stash, diligently setting it in applications/management/values.yaml.

Pop open the terminal and roll with:

argocd app sync tink-stack
until kubectl wait deployment -n tink-system tink-stack --for condition=Available=True --timeout=90s; do sleep 1; done

With that wrapped up, it’s a go-ahead for the installation of Cluster API services. Finally, we’re prepping for the rollout of the K8S Workload Cluster. Key in:

export TINKERBELL_IP="10.8.10.130"

mkdir -p ~/.cluster-api
cat > ~/.cluster-api/clusterctl.yaml <



A quick aside: given our affinity with Flatcar in the K8S Workload Cluster nodes, there was a need to activate a novel feature in the Cluster API: Ignition userdata format support. So, we set EXP_KUBEADM_BOOTSTRAP_FORMAT_IGNITION=true.



Here we go! Check your K8S Management Cluster. It’s ready and set to bring in our K8S Workload Cluster!







The Grand Result 



By now, you should have a fully equipped K8S Management Cluster, ready to manage workloads!



Stay tuned, as in Part 3, we’ll launch the K8S Workload Cluster!
The post Kubernetes Bare Metal Deployment Part 2: Setting Up the Kubernetes Management Cluster appeared first on Cloudbase Solutions.

Bare metal Kubernetes on mixed x64 and ARM64

Alessandro Pilotti — Mon, 31 Jul 2023 10:00:00 +0000

This is the first blog post in a series about running mixed x86 and ARM64 Kubernetes clusters, starting with a general architectural overview and then moving to detailed step by step instructions.

Kubernetes doesn’t need any introduction at this point, as it became the de facto standard container orchestration system. If you, dear reader, developed or deployed any workload in the last few years, there’s a very high probability that you had to interact with a Kubernetes cluster.

Most users deploy Kubernetes via services provided by their hyperscaler of choice, typically employing virtual machines as the underlying isolation mechanism. This is a well proven solution, at the expense of an inefficient resource usage, causing many organizations to look for alternatives in order to reduce costs.

One solution consists in deploying Kubernetes on top of an existing on-premise infrastructure running a full scale IaaS solution like OpenStack, or a traditional virtualization solution like VMware, using VMs underneath. This is similar to what happens on public clouds, with the advantage of allowing users to mix legacy virtualized workloads with modern container based applications on top of the same infrastructure. It is a very popular option, as we see a rise in OpenStack deployments for this specific purpose.

But, as more and more companies are interested in dedicated infrastructure for their Kubernetes clusters, especially for Edge use cases, there’s no need for an underlying IaaS or virtualization technology that adds unnecessary complexity and performance limitations.

This is where deploying Kubernetes on bare metal servers really shines, as the clusters can take full advantage of the whole infrastructure, often with significant TCO benefits. Running on bare metal allows us to freely choose between the x64 and ARM64 architectures, or a combination of both, taking advantage of the lower energy footprint provided by ARM servers with the compatibility offered by the more common x64 architecture.

A Kubernetes infrastructure comes with non-trivial complexity, which requires a fully automated solution for deployment, management, upgrades, observability and monitoring. Here’s a brief list the key components in the solution that we are about to present.

Host operating system

When it comes to Linux there’s definitely no lack of options. We needed a Linux distro aimed at lean container infrastructure workloads, with a large deployment base on many different physical servers, avoiding a full fledged traditional Linux server footprint. We decided to use Flatcar, for a series of reasons:

Longstanding (in cloud years) proven success, being the continuation of CoreOS
CNCF incubating project
Active community, expert in container scenarios, including Cloudbase engineers
Support for both ARM64 and x64
Commercial support, provided by Cloudbase, as a result of the partnership with Microsoft / Kinvolk

The host OS options are not limited to Flatcar, we tested successfully many other alternatives, including Mariner and Ubuntu. This is not trivial, as packaging and optimizing images for this sort of infrastructure requires significant domain expertize.

Bare metal host provisioning

This component allows to boot every host via IPMI (or other API provided by the BMC), install the operating system via PXE and in general configure every aspect of the host OS and Kubernetes in an automated way. Over the years we worked with many open source host provisioning solutions (MaaS, Ironic, Crowbar), but we opted for Thinkerbell in this case due to its integration in the Kubernetes ecosystem and support for Cluster API (CAPI).

Distributed Storage

Although the solution presented here can support traditional SAN storage, the storage model in our scenario will be distributed and hyperconverged, with every server providing compute, storage and networking roles. We chose Ceph (deployed via Rook), being the leading open source distributed storage and given our involvement in the community. When properly configured, it can deliver outstanding performance even on small clusters.

Networking and load balancing

While traditionally Kubernetes clusters employ Flannel or Calico for networking, a bare metal scenario can take advantage of a more modern technology like Cilium. Additionally, Cilium can provide load balancing via BGP out of the box, without the need for additional components like MetalLB.

High availability

All components in the deployment are designed with high availability in mind, including storage, networking, compute nodes and API.

Declarative GitOps

Argo CD offers a way to manage in a declarative way the whole CI/CD deployment pipeline. Other open source alternatives like Tekton or FluxCD can also be employed.

Observability and Monitoring

Last but not least, observability is a key area beyond simple logs, metrics and traces, to ensure that the whole infrastructure performs as expected, for which we employ Prometheus and Grafana. For monitoring, and ensuring that prompt actions can be taken in case of issues, we use Sentry.

Coming next

The next blog posts in this series will explain in detail how to deploy the whole architecture presented here, thanks for reading!

The post Bare metal Kubernetes on mixed x64 and ARM64 appeared first on Cloudbase Solutions.

Ampere ALTRA – Industry leading ARM64 Server

Adrian Vladu — Mon, 10 Oct 2022 14:53:38 +0000

AmpereComputing were kind enough to send us at our Cloudbase Solutions office the new version of their top-tier server product, which incorporates their latest Ampere ALTRA ARM64 Processor. The server version has a beefy dual-socket AMPERE ALTRA setup, with 24 NVME slots and up to 8 TB installable RAM, spread over 8 channels.

Let the unboxing begin!

Unboxing Started

First, we can see the beautiful dual-socket setup, each ALTRA CPU being nicely placed between the multiple fans in a row and the back end. The server height format is 2U, which was required for the dual-socket and multi-NVME placement.

The AMPERE ALTRA CPU in the box has a whopping 80 cores which boast 3.0 Ghz clock speed in this variant (Altra AC-108021002P), built on 7nm architecture. This server setup can take advantage of two such CPUs, with 160 cores spread over 2 NUMA nodes. Also, the maximum amount of RAM can go to 8TB (4TB / socket), if using 32 DIMMs of 256GB each. In our setup, we have 2 x 32GB ECC in dual-channel setup for each NUMA node, totaling 128GB.

AMPERE ALTRA in Dual Socket Setup

$> numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0..79
node 0 size: 64931 MB
node 1 cpus: 80..159
node 1 size: 64325 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

From the networking side perspective, we have a BMC network port, 2 x 1Gb Intel i350 and an Open Compute Rack port (OCP 3.0) for one PCIe Gen4 network card, in our case a Broadcom BCM957504-N425G, that supports 100Gbps over the 4 ports (25Gbps/each). Furthermore, if you are a Tensorflow or AI workload aficionado, there is plenty of room for 3x Double Width GPU or 8x Single Width GPU (example: 3x Nvidia A100 or 8x Nvidia T4).

Intel 350 Dual Port + IPMI port

OCP 3 Broadcom Network Card with 4x25Gpbs ports

The storage setup comes with 24 NVME PCIe ports, where we have one 1TB SSD, plus another M2 NVME SSD that has been placed inside the server, southwards to the first socket.

24 NVME slots, one from the left being active

After the back cover has been put back, the server placed in the rack and connected to a redundant power source (the two power sources have 2000 Watts each), let’s go and power on the system!

Before the OS installation, let’s make a small detour and check out the IPMI web interface, which is based on the MEGARAC SP-X WEB UI, a very fast UI. The board is an Ast2600, firmware version 0.40.1. The H5Viewer console is snappy, and a complete power cycle takes approximately 2 minutes (from a powered off system state to the Operating System bootloader start).

As operating system of choice, we have chosen the latest developer build of Windows 11 ARM64, available for download as VHDX at https://www.microsoft.com/en-us/software-download/windowsinsiderpreviewARM64 (Microsoft account required).

This version of Windows 11 has been installed as a trial only for demo purposes. At this moment, Windows 11 ARM64 build 25201 was available for download. The installation procedure is not yet so straightforward, as there is no ISO currently provided by Microsoft. The following steps were required in order to install the above Windows version on the Altra Server:

Download the Windows 11 ARM64 Build 225201 VHDX from https://www.microsoft.com/en-us/software-download/windowsinsiderpreviewARM64
Use qemu-img to convert the VHDX to a RAW file and copy the RAW file to USB stick (USB Stick 1)
- qemu-img convert -O raw Windows11_InsiderPreview_Client_ARM64_en-us_25201.VHDX Windows11_InsiderPreview_Client_ARM64_en-us_25201.raw
Download an Ubuntu 22.04 ARM64 Desktop and burn it to an USB stick (USB Stick 2)
Power on the server with the two USB sticks attached
Boot into the Ubuntu 22.04 Live from USB Stick 2
Use “dd” to copy the RAW file directly to NVME device of choice
- dd if=/mnt/Windows11_InsiderPreview_Client_ARM64_en-us_25201.raw of=/dev/nvme0n1
Reboot the server

After the server has been rebooted and completing the Out Of The Box Experience (OOBE) steps, Windows 11 ARM64 is ready to be used.

As you can see, Windows properly recognises the CPU and NUMA architecture, with 2 NUMA nodes with 80 cores each. We tried to see if a simple tool like cpuconsume will properly work while running on a certain NUMA Node, using the command “start /node /affinity” and we observed that the CPU load was also spread on the other node, so we decided to investigate this initially weird behaviour.

In Windows, there are two concepts that apply from the perspective of NUMA topology: Processor Groups and NUMA Support, both being implemented without any caveats for processors with a maximum of 64 cores. Here comes the interesting part, as in our case, one processor has 80 cores, which is by default split in 3 Processor Groups in the default Windows installation. The first group aggregates 64 cores from NUMA Node 0, group 3 aggregates 64 cores from NUMA Node 1, while Group 2 aggregates the remaining 16 + 16 cores from NUMA Node 0 and NUMA Node 1, for a total of 32 cores. If cpuconsume.exe is started with this command: “cmd /c start /node 0 /affinity 0xffffffffff cpuconsume.exe -cpu-time 10000”, there is 50% chance that it is started on Processor Group 0 or Processor Group 1. Also, changing the affinity from Task Manager, while it was working to change the affinity in the same group, it did not work to change it to processors from multiple Processor Groups.

Fortunately, AMPERE ALTRA’s firmware configuration offers a great trade-off solution, which makes running Windows application simpler on NUMA Nodes – it has a configurable topology: Monolithic, Hemisphere and Quadrant.

The Monolithic configuration is the default one, where there is 1:1 NUMA Node to Socket (CPU). In this case, there are 3 Processor Groups on Windows, the ones explained above. Setting the configuration to “Quadrant” did not seem to produce any changes on how the Windows Kernel sees it vs the Monolithic one. The remaining and best option was “Hemisphere”, where the cores are split into 4 NUMA Nodes, and Windows Kernel maps those NUMA Nodes to 4 Processor Groups, with 40 cores each. In this case, Windows applications can have the affinity correctly set using the “start /node /affinity ” command.

CPU Affinity

In the above screenshot, you can see that consume.exe has been instantiated on each NUMA Node and configured to use only 39 cores (mask 0xfffffffffe).

On the Windows Driver support, the Broadcom OCP3 card is not currently supported on this Windows version. There is no support either for the Intel I350 network cards. Fortunately, all the Mellanox Series are supported and we could add one such card to the system from the Ampere EMAG series we had received before (see https://cloudbase.it/cloudbase-init-on-windows-arm64/). For the management network, we used a classic approach – an USB 2.0 to 100Mb Ethernet adapter, that suited perfectly our installation use case.

From the virtualization perspective, we could easily turn on Hyper-V and try out booting a Windows and a Linux Hyper-V Virtual Machines. Windows works perfectly out of the box as a Virtual Machine, while Linux VMs require kernel >=5.15.rc1, which includes this patch that adds Hyper-V ARM64 boot support. For example, Ubuntu 22.04 or higher will boot on Hyper-V. Some issues were noticed on the virtualization part, nested virtualization is currently not supported (failure to start the Virtual Machine if processor virtualization extensions are enabled) and the Linux Kernel needs to have the kernel stall issues fixed, as seen in the screenshot below.

Windows Hyper-V

The performance of the hardware is well managed by the latest Windows version, the experience was flawless during the short period. We will come back with an in-depth review of performance and stability, AmpereComputing did a great job with their latest server products, kudos to them!

The post Ampere ALTRA – Industry leading ARM64 Server appeared first on Cloudbase Solutions.

Manage your own GitHub runners using garm

Gabriel Samfira — Wed, 11 May 2022 15:02:28 +0000

When GitHub Actions was introduced, it gave private repositories the ability to easily create a CI/CD pipeline by simply creating a yaml file within their repository. No special software was needed, no external access had to be granted to third party CI/CD systems. It just worked. One year later, this feature was made available, for free, to public repositories as well. Now, any project hosted on GitHub can enable its own CI/CD pipeline, by creating a workflow. By default, the jobs run on a GitHub hosted runner, which is a virtual machine spun up in Azure using the Standard_DS2_v2 size. Also, you have a choice of images for various operating systems and versions, which are bundled with a lot of the common libraries and SDKs that are used in projects throughout GitHub.

If you want to test your project, you have a fleet of GitHub hosted runners primed and ready to execute a request from your repository to run a workflow. The virtual machines the tests run on are replaced after each run. This means you always get a clean machine whenever you want to run a test. Which is great, and in most cases is more than enough to run a series of unit tests or even integration tests. The workflows have a maximum duration of 6 hours, after which the job is automatically canceled and the runner is cleaned up.

But what happens if you want to run your tests on an operating system that is not in the list of supported images? Or what if you need more disk space? More CPU/Memory? What if you’re testing a huge project like Flatcar which needs to build many packages as part of their pipeline? What if you need access to a GPU, or some other specialized hardware?

Well, in that case GitHub recommends you set up your own self-hosted runners.

But can you do this easily? Does it require a huge learning curve? Complicated setups? I mean, I want my own runners, but not if I have to go through piles of howtos to get them.

The answer is: yes, it can be done easily. We’ll get to that soon. But first, we need to understand the problem that’s being solved.

About self hosted runners

Self hosted runners are compute resources (virtual machines, containers or bare metal servers), on which you install and run the GitHub Actions Runner. This runner will then connect to GitHub, and become available within your repository, organization or enterprise. You can then target that particular runner in your workflow, using labels.

There are two ways a runner can be registered:

Persistent
Ephemeral

Persistent runners are set up and manually maintained by you. You install them, add them to your repository and use them however many times you wish. Persistent runners are capable of running as many jobs as you throw at them. However, it falls onto you, to make sure that the machine is cleaned up and in working order after each job run. Otherwise, new jobs that get scheduled to it will most likely fail, or at the very best they will give you unreliable results.

Ephemeral runners accept only one job, after which they are automatically removed by GitHub from the list of available runners. This ensures that you always get a fresh machine to run your tests, but in this case, you need to implement some sort of auto-scaling that will tear down the runner that completed a job, and replace it with a new one. These runners give you the best experience, as they are fresh and untouched by previous tests.

We’ll be focusing on ephemeral runners in this article, and a way to automatically scale and maintain a pool of those.

The challenges of auto scaling

Auto-scaling of runners is done using GitHub web hooks. Whenever a new workflow job is triggered, GitHub will push an event via web hooks that will let you know that a job has be queued and a new worker is needed. If a worker is already online and idle, that worker is selected, and another web hook is triggered that lets you know a job is now in_progress. Finally, when a job finishes, a final web hook is triggered with a completed message. As part of the queued web hook, we also get a list of labels that the job is targeting.

We can use this information to implement our auto-scaling solution. Now here comes the tricky part. We need some sort of automation that will spin up runners that match the requested label. A label describes the needs of the workflow job. So we need a way to model that request into an operation that will set up exactly the kind of runner that is suited for that job. If your workflow requests a label called gpu, you need to spin up a runner with access to a GPU. If your workflow requests a label called hpc you may need to set up a runner with access to lots of CPU and memory. The idea is to be able to define multiple types of runners and make them available to your workflows. After all, this is the reason you might decide to use self-hosted runners instead of the default ones provided by GitHub.

You may have your own specialized hardware that you want to make available to a workflow, or you may have some spare hardware gathering dust and want to give it new life. Or you may have access to multiple cloud accounts that you could leverage to spin up compute resources of various types.

Introducing: GitHub Actions Runners Manager (garm)

Garm is a self-hosted, automated system that maintains pools of GitHub runners on potentially any IaaS which has an API that will allow us to create compute resources. Garm is a single binary written in Go that you can run on any machine, within your private network. It requires no central management system, it doesn’t need to call home and is fully open source under the Apache 2.0 license.

Garm is meant to be easy to set up, easy to configure and hopefully, something you can forget about once it’s up and running. There are no complicated concepts to understand, no lengthy setup guide, no administrator guide that could rival the New York phone book in thickness. Garm is a simple app that aims to stay out of your way.

The only API endpoint that needs to be public, is the web hook endpoint, which GitHub calls into. It’s how GitHub lets garm know that a new runner is needed and that old runners need to be cleaned up.

Everything else can be hidden away behind a reverse proxy.

Where can garm create compute resources?

Right now garm has native support for LXD and an external provider for OpenStack and Azure. The current external OpenStack and Azure Providers are just a sample at the moment, but it can be used for testing and as an example for creating new external providers that can enable garm to leverage other clouds. External providers are executables that garm calls into, to manage the lifecycle of instances that end up running the GitHub Action Runner. They are similar to what containerd does when it comes co CNIs. As long as those binaries adhere to the required interface, garm can use them.

Sounds like garm spins up virtual machines?

In short: yes, but it doesn’t have to use VMs exclusively. I’ll explain.

We focused on virtual machines for the initial release because of their isolation from the host. Running workflows for public repositories is not without risks, so we need to be mindful of where we run jobs. The isolation offered by a virtual machine is desirable in favor of that of a container. That being said, there is no reason why a provider can’t be written for any system that can spin up compute resources, including containers.

In fact, writing a provider is easy, and you already have two examples of how to do it. With a little over 400 lines of bash, you could write a provider for virtually anything that has an API. And it doesn’t have to be bash. You could use anything you prefer, as long as the API for external providers is implemented by your executable.

In any case, I think it’s time to have a look at what garm can do.

Defining repositories/organizations

This article won’t go into details about how to set up garm. Those details are laid out in the project home page on GitHub. Instead, I’ll show you how to use it to manage your runners.

Garm has three layers:

Repositories or organizations
Pools of runners
The runners

Repositories and organizations can have multiple pools. Each pool can have different settings, can each use a different provider and will spin up multiple runners of the same type. When defining a new repository or organization, we need a Personal Access Token (PAT) to be configured in garm. Repositories use PATs to request registration tokens for runners, list existing runners and potentially forcefully remove them if the compute instance becomes defunct (on the roadmap). You can define multiple PATs and configure each repository or organization to use a different one.

Here is an example of defining a repository:

Creating pools

A pool of runners will create a number of runners of the same type inside a single provider. You can have multiple pools defined for your repository and each pool may have different settings with access to different providers. You can create one pool on LXD, another pool on OpenStack, each maintaining runners for different operating systems and with different sizes.

Let’s define a pool for the previously created repository:

We created a pool on LXD using default as a flavor and ubuntu:20.04 as an image. For LXD, garm maps flavors to profiles. The image names are the same images you would use to spin up virtual machines using the lxc command. So this pool will spin up an Ubuntu 20.04 image from the usual ubuntu: remote and will apply the default profile.

You can create new LXD profiles with whatever resources your runners need. Need multiple disks, more CPU or access to a number of different networks? Add them to the profile. The VMs that will be created with that profile will automatically have the desired specifications.

Let’s enable the pool and have it spin up the runners:

By default, when you create a new pool, the maximum number of runners will be set to 5 and the minimum idle runners will be set to 1 (configurable during create). This means that this particular pool will create a maximum number of 5 runners. The minimum idle runner option, attempts to maintain at least 1 runner in idle state, ready to be used by a GitHub workflow.

If you want more total runners or more idle runners, you can update the pool:

Now let’s add a new pool for the same repository, but this time we’ll add it on the external OpenStack provider:

On OpenStack flavor maps to the OpenStack flavor and image maps to the glance image. The flavor in OpenStack, aside from the basic resources it can configure, has the ability to target specific hosts via host aggregates and grant access to specialized hardware like GPUs, FPGAs, etc. If you need runners with access to special hardware, have a look at host aggregates.

Now that we have our pools and a few runners up and running, let’s try them out in a workflow:

We can see that we have five runners in total. Three on the LXD pool and another 2 on the OpenStack pool. As we trigger workflows in github, garm spins up new runners to replace the ones that are currently being used, maintaining that minimum idle runner count.

That’s all. If you want to try it out, head over to the project home page on GitHub and take it for a spin. Fair warning, this is an initial release, so if you run into any trouble, drop us a line.

The post Manage your own GitHub runners using garm appeared first on Cloudbase Solutions.

OpenStack on Azure

Adrian Vladu — Tue, 27 Jul 2021 18:48:22 +0000

One might argue what is the point in running a cloud infrastructure software like OpenStack on top of another one, namely the Azure public cloud as in this blog post. The main use cases are typically testing and API compatibility, but as Azure nested virtualization and pass-through features came a long way recently in terms of performance, other more advanced use cases are viable, especially in areas where OpenStack has a strong user base (e.g. Telcos).

There are many ways to deploy OpenStack, in this post we will use Kolla Ansible for a containerized OpenStack with Ubuntu 20.04 Server as the host OS.

Preparing the infrastructure

In our scenario, we need at least one beefy virtual machine, that supports nested virtualization and can handle all the CPU/RAM/storage requirements for a full fledged All-In-One OpenStack. For this purpose, we chose a Standard_D8s_v3 size for the OpenStack controller virtual machine (8 vCPU, 32 GB RAM) and 512 GB of storage. For a multinode deployment, subject of a future post, more virtual machines can be added, depending on how many virtual machines are to be supported by the deployment.

To be able to use the Azure CLI from PowerShell, it can be installed following the instructions here https://docs.microsoft.com/en-us/cli/azure/install-azure-cli.

# connect to Azure
az login

# create an ssh key for authentication
ssh-keygen

# create the OpenStack controller VM

az vm create `
     --name openstack-controller `
     --resource-group "openstack-rg" `
     --subscription "openstack-subscription" `
     --image Canonical:0001-com-ubuntu-server-focal:20_04-lts-gen2:latest `
     --location westeurope `
     --admin-username openstackuser `
     --ssh-key-values ~/.ssh/id_rsa.pub `
     --nsg-rule SSH `
     --os-disk-size-gb 512 `
     --size Standard_D8s_v3

# az vm create will output the public IP of the instance
$openStackControllerIP = ""

# create the static private IP used by Kolla as VIP
az network nic ip-config create --name MyIpConfig `
    --nic-name openstack-controllerVMNic `
    --private-ip-address 10.0.0.10 `
    --resource-group "openstack-rg" `
    --subscription "openstack-subscription"

# connect via SSH to the VM
ssh openstackuser@$openStackControllerIP

# fix the fqdn
# Kolla/Ansible does not work with *.cloudapp FQDNs, so we need to fix it
sudo tee /etc/hosts << EOT
$(hostname -i) $(hostname)
EOT

# create a dummy interface that will be used by OpenVswitch as the external bridge port
# Azure Public Cloud does not allow spoofed traffic, so we need to rely on NAT for VMs to
# have internal connectivity.
sudo ip tuntap add mode tap br_ex_port
sudo ip link set dev br_ex_port up

OpenStack deployment

For the deployment, we will use the Kolla Ansible containerized approach.

Firstly, installation of the base packages for Ansible/Kolla/Cinder is required.

# from the Azure OpenStack Controller VM

# install ansible/kolla requirements
sudo apt install -y python3-dev libffi-dev gcc libssl-dev python3-venv net-tools

# install Cinder NFS backend requirements
sudo apt install -y nfs-kernel-server

# Cinder NFS setup
CINDER_NFS_HOST=$openStackControllerIP
# Replace with your local network CIDR if you plan to add more nodes
CINDER_NFS_ACCESS=$CINDER_NFS_HOST
sudo mkdir /kolla_nfs
echo "/kolla_nfs $CINDER_NFS_ACCESS(rw,sync,no_root_squash)" | sudo tee -a /etc/exports
echo "$CINDER_NFS_HOST:/kolla_nfs" | sudo tee -a /etc/kolla/config/nfs_shares
sudo systemctl restart nfs-kernel-server

Afterwards, let’s install Ansible/Kolla in a Python virtualenv.

mkdir kolla
cd kolla
 
python3 -m venv venv
source venv/bin/activate
 
pip install -U pip
pip install wheel
pip install 'ansible<2.10'
pip install 'kolla-ansible>=11,<12'

Then, prepare Kolla configuration files and passwords.

sudo mkdir -p /etc/kolla/config
sudo cp -r venv/share/kolla-ansible/etc_examples/kolla/* /etc/kolla
sudo chown -R $USER:$USER /etc/kolla
cp venv/share/kolla-ansible/ansible/inventory/* .
kolla-genpwd

Now, let’s check Ansible works.

ansible -i all-in-one all -m ping

As a next step, we need to configure the OpenStack settings

# This is the static IP we created initially
VIP_ADDR=10.0.0.10
# Azure VM interface is eth0
MGMT_IFACE=eth0
# This is the dummy interface used for OpenVswitch
EXT_IFACE=br_ex_port
# OpenStack Train version
OPENSTACK_TAG=11.0.0

# now use the information above to write it to Kolla configuration file
sudo tee -a /etc/kolla/globals.yml << EOT
kolla_base_distro: "ubuntu"
openstack_tag: "$OPENSTACK_TAG"
kolla_internal_vip_address: "$VIP_ADDR"
network_interface: "$MGMT_IFACE"
neutron_external_interface: "$EXT_IFACE"
enable_cinder: "yes"
enable_cinder_backend_nfs: "yes"
enable_neutron_provider_networks: "yes"
EOT

Now it is time to deploy OpenStack.

kolla-ansible -i ./all-in-one prechecks
kolla-ansible -i ./all-in-one bootstrap-servers
kolla-ansible -i ./all-in-one deploy

After the deployment, we need to create the admin environment variable script.

pip3 install python-openstackclient python-barbicanclient python-heatclient python-octaviaclient
kolla-ansible post-deploy
# Load the vars to access the OpenStack environment
. /etc/kolla/admin-openrc.sh

Let’s make the finishing touches and create an OpenStack instance.

# Set you external network CIDR, range and gateway, matching your environment, e.g.:
export EXT_NET_CIDR='10.0.2.0/24'
export EXT_NET_RANGE='start=10.0.2.150,end=10.0.2.199'
export EXT_NET_GATEWAY='10.0.2.1'
./venv/share/kolla-ansible/init-runonce

# Enable NAT so that VMs can have Internet access and be able to
# reach their floating IP from the controller node.
sudo ifconfig br-ex $EXT_NET_GATEWAY netmask 255.255.255.0 up
sudo iptables -t nat -A POSTROUTING -s $EXT_NET_CIDR -o eth0 -j MASQUERADE

# Create a demo VM
openstack server create --image cirros --flavor m1.tiny --key-name mykey --network demo-net demo1

Conclusions

Deploying OpenStack on Azure is fairly straightforward, with the caveat that the OpenStack instances cannot be accessed from the Internet without further changes (this affects only inbound traffic, the OpenStack instances can access the Internet). Here are the main changes that we introduced to be able to perform the deployment in this scenario:

Add a static IP on the first interface that will be used as the OpenStack API IP
Set the OpenStack Controller FQDN to be the same as the hostname
Create a dummy interface which will be used as the br-ex external port (there is no need for a secondary NIC, as Azure drops any spoofed packets)
Add iptables NAT rules to allow OpenStack VM outbound (Internet) connectivity

The post OpenStack on Azure appeared first on Cloudbase Solutions.

Ceph on Windows – Performance

Lucian Petruț — Wed, 19 May 2021 13:28:41 +0000

Make sure to check out the previous blog post introducing Ceph on Windows, in case you’ve missed it. We are now going to look at how the performance looks like.

Before this Ceph Windows porting, the only way to access Ceph storage from Windows was by using the Ceph iSCSI gateway, which can easily become a performance bottleneck. Our goal was to outperform the iSCSI gateway and to get as close to the native Linux RBD throughput as possible.

Spoiler alert: we managed to surpass both!

Test environment

Before showing some actual results, let’s talk a bit about the test environment. We used 4 identical baremetal servers, having the following specs:

CPU
- Intel(R) Xeon(R) E5-2650 @ 2.00GHz
- 2 sockets
- 8 cores per socket
- 32 vcpus
memory
- 128GB
- 1333Mhz
network adapters
- Chelsio T420-C
- 2 x 10Gb/s
- LACP bond
- 9000 MTU

You may have noticed that we are not mentioning storage disks, that’s because the Ceph OSDs were configured to use memory backed pools. Keep in mind that we’re putting the emphasis on Ceph client performance, not the storage iops on the Linux OSD side.

We used an all-in-one Ceph 16 cluster running on top of Ubuntu 20.04. On the client side, we covered Windows Server 2016, Windows Server 2019 as well as Ubuntu 20.04.

The benchmarks have been performed using the fio tool. It’s highly configurable, commonly used for Ceph benchmarks and most importantly, it’s cross platform.

Below is a sample FIO command line invocation for anyone interested in repeating the tests. We are using direct IO, 2MB blocks, 64 concurrent asynchronous IO operations, testing various IO operations over 8GB chunks. When testing block devices, we are using the raw disk device as opposed to having an NTFS or ReFS partition. Note that this may require fio>=3.20 on Windows.

fio --randrepeat=1 --direct=1 --gtod_reduce=1 --name=test --bs=2M --iodepth=64 --size=8G --readwrite=randwrite --numjobs=1 --filename=\\.\PhysicalDrive2

Windows tuning

The following settings can improve IO throughput:

Windows power plan – few people expect this to be a concern for severs, but by default the “high performance” power plan is not enabled by default, which can lead to CPU throttling
Adding the Ceph and FIO binaries to the Windows Defender whitelist
Using Jumbo frames – we’ve noticed a 15% performance improvement
Enabling the CUBIC TCP congestion algorithm on Windows Server 2016

Test results

Baremetal RBD and CephFS IO

Let’s get straight to the RBD and CephFS benchmark results. Note that we’re using MB/s for measuring speed (higher is better).

+-----------+------------+--------+-------+--------+-------+
|    OS     |    tool    | rand_r | seq_r | rand_w | seq_w |
+-----------+------------+--------+-------+--------+-------+
| WS2016    | rbd-wnbd   |    854 |   925 |    807 |   802 |
| WS2019    | rbd-wnbd   |   1317 |  1320 |   1512 |  1549 |
| WS2019    | iscsi-gw   |    324 |   319 |    624 |   635 |
| Ubuntu 20 | krbd       |    696 |   669 |    659 |   668 |
| Ubuntu 20 | rbd-nbd    |    559 |   567 |    372 |   407 |
|           |            |        |       |        |       |
| WS2016    | ceph-dokan |    642 |   605 |    690 |   676 |
| WS2019    | ceph-dokan |    988 |   948 |    938 |   935 |
| Ubuntu 20 | ceph-fuse  |    162 |   181 |    121 |   138 |
| Ubuntu 20 | kern ceph  |    687 |   663 |    677 |   674 |
+-----------+------------+--------+-------+--------+-------+

It is worth mentioning that RBD caching has been disabled. As we can see, Windows Server 2019 manages to deliver impressive IO throughput. Windows Server 2016 isn’t as fast but it still manages to outperform the Linux RBD clients, including krbd. We are seeing the same pattern on the CephFS side.

We’ve tested the iSCSI gateway with an all-in-one Ceph cluster. The performance bottleneck is likely to become more severe with larger Ceph clusters, considering that the iSCSI gateway doesn’t scale very well.

Virtual Machines

Providing virtual machine block storage is also one of the main Ceph use cases. Here are the test results for Hyper-V Server 2019 and KVM on Ubuntu 20.04, both running Ubuntu 20.04 and Windows Server 2019 VMs booted from RBD images.

+-----------------------+--------------+--------------+
| Hypervisor \ Guest OS |   WS 2019    | Ubuntu 20.04 |
+-----------------------+------+-------+------+-------+
|                       | read | write | read | write |
+-----------------------+------+-------+------+-------+
| Hyper-V               | 1242 | 1697  | 382  | 291   |
| KVM                   | 947  | 558   | 539  | 321   |
+-----------------------+------+-------+------+-------+

The WS 2019 Hyper-V VM managed to get almost native IO speed. What’s interesting is that it managed to fare better than the Ubuntu guest even on KVM, which is probably worth investigating.

WNBD

As stated in the previous post, our initial approach was to attach RBD images using the NBD protocol. That didn’t deliver the performance that we were hoping for, mostly due to the Winsock Kernel (WSK) framework, which is why we implemented from scratch a more efficient IO channel. For convenience, you can still use WNBD as a standalone NBD client for other purposes, in which case you may be interested to knowing how well it performs. It manages to deliver 933MB/s on WS 2019 and 280MB/s on WS 2016 in this test environment.

At the moment, rbd-wnbd uses DeviceIoControl to retrieve IO requests and send IO replies back to the WNBD driver, which is also known as inverted calls. Unlike the RBD NBD server, libwnbd allows adjusting the number of IO dispatch workers. The following table shows how the number of workers impacts performance. Keep in mind that in this specific case, we are benchmarking the driver connection, so no data gets transmitted from / to the Ceph cluster. This gives us a glimpse of the maximum theoretical bandwidth that WNBD could provide, fully saturating the available CPUs:

+---------+------------------+
| Workers | Bandwidth (MB/s) |
+---------+------------------+
|       1 |             1518 |
|       2 |             2917 |
|       3 |             4240 |
|       4 |             5496 |
|       8 |            11059 |
|      16 |            13209 |
|      32 |            12390 |
+---------+------------------+

RBD commands

Apart from IO performance, we were also interested in making sure that a large amount of disks can be managed at the same time. For this reason, we wrote a simple Python script that creates a temporary image, attaches it to the host, performs various IO operations and then cleans it up. Here are the test results for 1000 iterations, 50 at a time. This test was essential in improving RBD performance and stability.

+---------------------------------------------------------------------------------+
|                                   Duration (s)                                  |
+--------------------------+----------+----------+-----------+----------+---------+
|         function         |   min    |   max    |   total   |   mean   | std_dev |
+--------------------------+----------+----------+-----------+----------+---------+
|      TestRunner.run      | 283.5339 | 283.5339 |  283.5339 | 283.5339 |  0.0000 |
|    RbdTest.initialize    |  0.3906  | 10.4063  | 3483.5180 |  3.4835  |  1.5909 |
|     RbdImage.create      |  0.0938  |  5.5157  |  662.8653 |  0.6629  |  0.6021 |
|       RbdImage.map       |  0.2656  |  9.3126  | 2820.6527 |  2.8207  |  1.5056 |
| RbdImage.get_disk_number |  0.0625  |  8.3751  | 1888.0343 |  1.8880  |  1.5171 |
| RbdImage._wait_for_disk  |  0.0000  |  2.0156  |  39.1411  |  0.0391  |  0.1209 |
|      RbdFioTest.run      |  2.3125  | 13.4532  | 8450.7165 |  8.4507  |  2.2068 |
|      RbdImage.unmap      |  0.0781  |  6.6719  | 1031.3077 |  1.0313  |  0.7988 |
|     RbdImage.remove      |  0.1406  |  8.6563  |  977.1185 |  0.9771  |  0.8341 |
+--------------------------+----------+----------+-----------+----------+---------+

I hope you enjoyed this post. Don’t take those results for granted, feel free to run your own tests and let us know what you think!

Coming next

Stay tuned for the next part of this series, if you want to learn more about how Ceph integrates with OpenStack and Hyper-V.

The post Ceph on Windows – Performance appeared first on Cloudbase Solutions.

Natively using Ceph on Windows

Lucian Petruț — Mon, 22 Mar 2021 13:00:00 +0000

The wait is over, Ceph 16 (Pacific) provides Windows native support as a result of the Cloudbase Solutions and Suse partnership.

Ceph is probably the most commonly used software defined storage solution out there. For example, according to surveys, more than 70% of the OpenStack deployments are powered by Ceph. That’s no wonder, considering that it can run on commodity hardware, yet manage to scale up to hundreds of storage nodes and deliver impressive performance.

Using Ceph on Windows had been a pain point, requiring proxies such as iSCSI gateways or re-exporting CephFS using Samba. Those approaches deliver suboptimal performance and overcomplicate the deployment architecture. All that hassle is now gone as RBD and CephFS can be used on Windows natively.

For best performance and features, Windows Server 2019 is recommended. Windows Server 2016 is supported as well but having a few known limitations. Older Windows Server versions as well as client versions such as Windows 10 might work as well but are not currently supported.

Installing

This MSI installer is the recommended way of installing Ceph on Windows. Along with the Ceph binaries, it also bundles the WNBD driver, which is used for mapping RBD images.

Please refer to those guides if you prefer manually building and installing Ceph and WNBD.

Configuring

Minimal configuration is needed in order to use Ceph on Windows. The default config file location is C:\ProgramData\ceph\ceph.conf.

Here’s a config sample. Don’t forget to fill in the right Ceph Monitor addresses and to provide a Ceph keyring file at the specified location. For the time being, slashes “/” must be used as path separators instead of backslashes “\”.

[global]
    log to stderr = true
    ; Uncomment the following to use Windows Event Log
    ; log to syslog = true

    run dir = C:/ProgramData/ceph/out
    crash dir = C:/ProgramData/ceph/out

    ; Use the following to change the cephfs client log level
    ; debug client = 2
[client]
    keyring = C:/ProgramData/ceph/keyring
    ; log file = C:/ProgramData/ceph/out/$name.$pid.log
    admin socket = C:/ProgramData/ceph/out/$name.$pid.asok

    ; client_permissions = true
    ; client_mount_uid = 1000
    ; client_mount_gid = 1000
[global]
    mon host =

RBD

Rados Block Device (RBD) has been the primary focus of this effort. The same CLI that you’re probably already familiar with can be used to create RBD images and attach them to the host as well as Hyper-V virtual machines.

The following PowerShell sample creates an RBD image, attaches it to the host and adds an NTFS partition on top.

rbd create blank_image --size=1G
rbd device map blank_image

$mappingJson = rbd-wnbd show blank_image --format=json
$mappingJson = $mappingJson | ConvertFrom-Json
$diskNumber = $mappingJson.disk_number

# The disk must be online before creating or accessing partitions.
Set-Disk -Number $diskNumber -IsOffline $false

# Initialize the disk, partition it and create a fileystem.
Get-Disk -Number $diskNumber | `
    Initialize-Disk -PassThru | `
    New-Partition -AssignDriveLetter -UseMaximumSize | `
    Format-Volume -Force -Confirm:$false

By default, all RBD mappings are persistent. The “ceph-rbd” service, which can be deployed using the above mentioned MSI installer, takes care of reattaching the RBD images after host reboots. This also allows adjusting the Windows service start order so that RBD images can be mapped before starting services that may depend on it.

The following screenshot shows an RBD image attached to a Hyper-V VM along with benchmark results. We will do a deep dive on the benchmarks in a future post (disclaimer: they look great!).

WNBD

RBD images are exposed as SCSI disks by using the WNBD Storport Miniport driver, written as part of this porting effort.

One interesting feature provided by the WNBD driver is the NBD client capability, allowing it to be used for attaching arbitrary NBD exports. This has been the default way of attaching RBD images before a more efficient mechanism was implemented. Due to its usefulness, this feature has been left in place, although it’s likely to be moved outside the driver at some point, leveraging the same API as rbd-wnbd does.

To mount a standalone NBD image, use the following:

wnbd-client map export_name $nbdAddress --port $nbdPort

CephFS

CephFS support on Windows was our second main goal and is currently experimental. We’re using Dokany, which resembles FUSE, and a revamped version of the seemingly abandoned ceph-dokan project.

The following simple command mounts CephFS using the “X:” drive letter:

ceph-dokan.exe -l x

Current limitations

While the current state of the porting covers already the majority of the yes cases, there are a few limitations that you should be aware of. Those are missing features that are likely to be implemented in a subsequent version.

RBD images cannot be used yet for backing Cluster Shared Volumes (CSV) in Windows Server Failover Clustering (WSFC), which requires SCSI Persistent Reservations support
WNBD disks cannot be live resized
The Python bindings are unavailable
The ceph CLI tool cannot be used natively yet. However, it may be used through Windows Subsystem for Linux to contact running services.

Coming next

The next blog post will be focusing on RBD performance. Stick around for some more benchmark results and a performance tuning guide.

The post Natively using Ceph on Windows appeared first on Cloudbase Solutions.