<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-03-24T07:22:47+00:00</updated><id>/feed.xml</id><title type="html">amedeos home</title><subtitle>I&apos;m a Linux nerd, Open{Stack,Shift} nerd, DevOps nerd, music nerd, Gentoo nerd and runner nerd in my life.</subtitle><entry><title type="html">Inspecting OpenShift container images for cgroups v2 compatibility</title><link href="/openshift/2026/03/09/Inspecting-OpenShift-container-images-for-cgroups-v2-compatibility.html" rel="alternate" type="text/html" title="Inspecting OpenShift container images for cgroups v2 compatibility" /><published>2026-03-09T07:00:00+00:00</published><updated>2026-03-09T07:00:00+00:00</updated><id>/openshift/2026/03/09/Inspecting-OpenShift-container-images-for-cgroups-v2-compatibility</id><content type="html" xml:base="/openshift/2026/03/09/Inspecting-OpenShift-container-images-for-cgroups-v2-compatibility.html"><![CDATA[<p>In my daily work I regularly help customers plan their OpenShift upgrades, and one of the most common concerns when moving to cgroups v2 is understanding which workloads are affected. I needed a quick and reliable way to scan an entire cluster and identify incompatible Java, Node.js, and .NET runtimes, so I built <strong>image-cgroupsv2-inspector</strong> and released it as open source, so that anyone can use it, modify it, and contribute to it.</p>

<p>In this article, I’ll walk you through the tool: what it does, how it works, and how you can use it on your own clusters. If you’re planning to upgrade your OpenShift cluster to a version that uses cgroups v2 by default (OpenShift 4.14+), or if you’re migrating from cgroups v1, this tool helps you identify which workloads might break after the transition. Starting with <strong>OpenShift 4.19</strong>, the migration to cgroups v2 is <strong>mandatory</strong> if it has not already been performed, so it is important to assess your workloads sooner rather than later.</p>

<p>For a deep dive into how cgroups v2 impacts these runtimes, I recommend reading the Red Hat Developers article: <a href="https://developers.redhat.com/articles/2025/11/27/how-does-cgroups-v2-impact-java-net-and-nodejs-openshift-4">How does cgroups v2 impact Java, .NET, and Node.js on OpenShift 4?</a>.</p>

<h2 id="why-cgroups-v2-compatibility-matters">Why cgroups v2 compatibility matters</h2>

<p>Linux cgroups (control groups) are used by container runtimes to enforce resource limits (CPU, memory, etc.). Cgroups v2 is the successor to cgroups v1 and brings a unified hierarchy, improved resource management, and better support for rootless containers.</p>
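<p>Before assessing workloads, it helps to confirm which cgroup version a node is actually running. One quick check, which you can run from a node debug shell (<code class="language-plaintext highlighter-rouge">oc debug node/NAME</code>), is to look at the filesystem type mounted at <code class="language-plaintext highlighter-rouge">/sys/fs/cgroup</code>:</p>

```shell
# Print the filesystem type of the unified cgroup mount point:
# "cgroup2fs" means cgroups v2 is in use, "tmpfs" means the legacy v1 hierarchy.
stat -fc %T /sys/fs/cgroup
```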

<p>However, older versions of popular runtimes (Java, Node.js, .NET) read cgroup information from the filesystem to determine available resources. When a cluster switches from cgroups v1 to v2, the filesystem layout changes, and older runtimes may fail to detect resource limits correctly, leading to:</p>

<ul>
  <li><strong>Java</strong>: The JVM may see the host’s total memory instead of the container’s memory limit, potentially causing OOM kills</li>
  <li><strong>Node.js</strong>: Incorrect memory and CPU detection, leading to performance issues</li>
  <li><strong>.NET</strong>: Similar resource detection failures</li>
</ul>

<p>The minimum runtime versions that properly support cgroups v2 are:</p>

<table>
  <thead>
    <tr>
      <th>Runtime</th>
      <th>Minimum cgroups v2 compatible version</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>OpenJDK / HotSpot</td>
      <td>8u372, 11.0.16, 15+</td>
    </tr>
    <tr>
      <td>IBM Semeru</td>
      <td>8u345-b01, 11.0.16.0, 17.0.4.0, 18.0.2.0+</td>
    </tr>
    <tr>
      <td>IBM Java</td>
      <td>8.0.7.15+</td>
    </tr>
    <tr>
      <td>Node.js</td>
      <td>20.3.0+</td>
    </tr>
    <tr>
      <td>.NET</td>
      <td>5.0+</td>
    </tr>
  </tbody>
</table>

<p>For more details on the compatibility matrix, refer to the <a href="https://developers.redhat.com/articles/2025/11/27/how-does-cgroups-v2-impact-java-net-and-nodejs-openshift-4">Red Hat Developers article</a>.</p>
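<p>For simple dotted version schemes (Node.js, .NET), a check against these minimums can be sketched in shell with <code class="language-plaintext highlighter-rouge">sort -V</code>. This is only an illustration, not the tool’s actual implementation, and Java 8’s <code class="language-plaintext highlighter-rouge">1.8.0_NNN</code> scheme would need normalizing first:</p>

```shell
# A detected version is compatible if the minimum sorts first (or is equal)
# under version sort. Minimum versions taken from the table above.
is_v2_compatible() {
  local detected="$1" minimum="$2"
  [ "$(printf '%s\n' "$minimum" "$detected" | sort -V | head -n1)" = "$minimum" ]
}

is_v2_compatible "20.19.0" "20.3.0" && echo "Node.js 20.19.0: compatible"
is_v2_compatible "3.0.3"   "5.0"    || echo ".NET 3.0.3: incompatible"
```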

<h2 id="what-does-image-cgroupsv2-inspector-do">What does image-cgroupsv2-inspector do?</h2>

<p>The tool connects to your OpenShift cluster, collects all container images from running workloads, and optionally analyzes each image. In detail, it works by:</p>

<ol>
  <li><strong>Collecting images</strong>: Scans Deployments, DeploymentConfigs, StatefulSets, DaemonSets, CronJobs, ReplicaSets, Jobs, and standalone Pods to find all container images in use</li>
  <li><strong>Smart deduplication</strong>: Only reports top-level controllers (e.g., a Deployment but not its child ReplicaSets or Pods), avoiding duplicate entries</li>
  <li><strong>Image analysis</strong>: For each unique image, pulls it locally, extracts the filesystem, and searches for Java, Node.js, and .NET binaries</li>
  <li><strong>Version detection</strong>: Executes each binary to determine its exact version</li>
  <li><strong>Compatibility check</strong>: Compares detected versions against the known cgroups v2 minimum compatibility versions</li>
  <li><strong>CSV report</strong>: Generates a detailed CSV report with all findings</li>
</ol>

<h2 id="requirements">Requirements</h2>

<p>Before starting, you’ll need:</p>

<ul>
  <li>Access to an <strong>OpenShift 4.x</strong> cluster with a valid token</li>
  <li><strong>podman</strong> installed on the machine where you run the tool</li>
  <li><strong>Python 3.12+</strong></li>
  <li>At least <strong>20GB of free disk space</strong> for image extraction</li>
  <li>The <strong>acl</strong> package installed (for extended ACL support on the rootfs directory)</li>
  <li>Network access to all container registries used by your cluster</li>
</ul>

<h2 id="installation">Installation</h2>

<p>Clone the repository and set up a Python virtual environment:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git clone https://github.com/amedeos/image-cgroupsv2-inspector.git
<span class="nv">$ </span><span class="nb">cd </span>image-cgroupsv2-inspector
<span class="nv">$ </span>python3 <span class="nt">-m</span> venv venv
<span class="nv">$ </span><span class="nb">source </span>venv/bin/activate
<span class="nv">$ </span>pip <span class="nb">install</span> <span class="nt">-r</span> requirements.txt
</code></pre></div></div>

<h2 id="usage">Usage</h2>

<h3 id="connecting-to-the-cluster">Connecting to the cluster</h3>

<p>The first step is to connect to your OpenShift cluster. You can provide the API URL and token directly:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./image-cgroupsv2-inspector <span class="nt">--api-url</span> https://api.mycluster.example.com:6443 <span class="nt">--token</span> sha256~xxxxx
</code></pre></div></div>

<p>After the first successful connection, credentials are saved to a <code class="language-plaintext highlighter-rouge">.env</code> file so you don’t need to provide them again.</p>

<h3 id="collecting-images-without-analysis">Collecting images (without analysis)</h3>

<p>To simply list all container images in your cluster without analyzing them:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./image-cgroupsv2-inspector <span class="nt">--rootfs-path</span> /tmp/images
</code></pre></div></div>

<p>This produces output like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>╔══════════════════════════════════════════════════════════════╗
║           image-cgroupsv2-inspector v1.0.0                   ║
║     OpenShift Container Image Inspector for cgroups v2       ║
╚══════════════════════════════════════════════════════════════╝

🔍 Running system checks...
✓ podman is installed: podman version 5.7.1
✓ podman is functional (OS: linux)

🔧 Setting up rootfs directory at: /tmp/images
✓ Write permission verified on /tmp/images
✓ Sufficient disk space: 32.0GB free (required: 20GB, total: 32.0GB)
✓ Filesystem supports extended ACLs
✓ Created directory: /tmp/images/rootfs
...

🔌 Connecting to OpenShift cluster...
✓ Connected to OpenShift cluster
  Kubernetes version: v1.30.14
  Cluster name: mycluster.example.com
✓ Credentials saved to .env
✓ Pull secret already exists at .pull-secret, skipping download

📋 Namespace exclusion patterns: openshift-*, kube-*

📦 Collecting container images from cluster...
  (Only top-level controllers are reported, child objects are skipped)
  (Excluding namespaces matching: openshift-*, kube-*)
  Collecting images from Deployments...
    Found 10 containers in Deployments
  Collecting images from DeploymentConfigs...
    Found 2 containers in DeploymentConfigs
  Collecting images from StatefulSets...
    Found 0 containers in StatefulSets
  Collecting images from DaemonSets...
    Found 0 containers in DaemonSets
  Collecting images from CronJobs...
    Found 0 containers in CronJobs
  Collecting images from standalone ReplicaSets...
    Found 0 containers in standalone ReplicaSets (skipped 10 managed/empty)
  Collecting images from standalone Jobs...
    Found 0 containers in standalone Jobs (skipped 0 CronJob-managed)
  Collecting images from standalone Pods...
    Found 0 containers in standalone Pods (skipped 307 managed/static pods)

✓ Total containers found: 12
  (Excluded 51 namespaces: openshift-apiserver, openshift-apiserver-operator, ...)

📊 Summary:
   Total containers: 12
   Unique images: 10
   Namespaces: 6
   Object types: {'Deployment': 10, 'DeploymentConfig': 2}

   Output file: output/mycluster.example.com-20260309-085433.csv

📋 Top 10 most used images:
      2 × registry.access.redhat.com/ubi8/openjdk-17:latest
      2 × registry.access.redhat.com/ubi8/openjdk-8:1.14
      1 × registry.redhat.io/ubi8/dotnet-30:latest
      1 × registry.redhat.io/dotnet/sdk:8.0.122
      1 × image-registry.openshift-image-registry.svc:5000/test-java-internalreg/openjdk-17:latest
      1 × image-registry.openshift-image-registry.svc:5000/test-java-internalreg/openjdk-8:1.14
      1 × docker.io/library/eclipse-temurin:8u302-b08-jdk-centos7
      1 × docker.io/library/eclipse-temurin:17
      1 × registry.access.redhat.com/ubi8/nodejs-20:latest
      1 × registry.access.redhat.com/ubi8/nodejs-18:latest
✓ Disconnected from OpenShift cluster

✓ Done!
</code></pre></div></div>

<p>By default, the tool excludes OpenShift internal namespaces (<code class="language-plaintext highlighter-rouge">openshift-*</code>, <code class="language-plaintext highlighter-rouge">kube-*</code>) since those are managed by the platform itself.</p>

<h3 id="full-analysis-with-cgroups-v2-compatibility-check">Full analysis with cgroups v2 compatibility check</h3>

<p>To analyze all images and determine cgroups v2 compatibility, add the <code class="language-plaintext highlighter-rouge">--analyze</code> flag:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./image-cgroupsv2-inspector <span class="nt">--rootfs-path</span> /tmp/images <span class="nt">--analyze</span>
</code></pre></div></div>

<p>The tool will pull each unique image, extract its filesystem, search for Java/Node.js/.NET binaries, run version checks, and report compatibility:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>🔬 Analyzing images for Java, NodeJS, and .NET binaries...
  (Each image will be pulled, analyzed, and cleaned up)
  (CSV will be saved after each image for resumability)
  Found 10 unique images to analyze

  [1/10] Analyzing: image-registry.openshift-image-registry.svc:5000/test-java-internalreg...
    Pulling image: image-registry.openshift-image-registry.svc:5000/test-java-internalreg/openjdk-17:latest...
    Exporting container filesystem...
    Extracting filesystem...
    Searching for Java binaries...
    Searching for Node.js binaries...
    Searching for .NET binaries...
      ✓ Java (OpenJDK): 17.0.18 at /usr/lib/jvm/jre-17-openjdk-17.0.18.0.8-1.el8.x86_64/bin/java
    💾 Progress saved: 1 rows (1/10 images)
...
  [5/10] Analyzing: registry.redhat.io/ubi8/dotnet-30:latest...
    Pulling image: registry.redhat.io/ubi8/dotnet-30:latest...
    Exporting container filesystem...
    Extracting filesystem...
    Searching for Java binaries...
    Searching for Node.js binaries...
    Searching for .NET binaries...
      ✗ Node.js: 10.19.0 at /usr/bin/node
      ✗ .NET: 3.0.3 at /usr/lib64/dotnet/dotnet
    💾 Progress saved: 6 rows (5/10 images)
...
</code></pre></div></div>

<p>Notice the symbols: a <strong>checkmark</strong> (✓) means the runtime is compatible with cgroups v2, while a <strong>cross</strong> (✗) means it is <strong>not</strong> compatible and needs to be upgraded.</p>

<p>At the end of the analysis, a summary is printed:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>📊 Summary:
   Total containers: 12
   Unique images: 10
   Namespaces: 6
   Object types: {'Deployment': 10, 'DeploymentConfig': 2}

   🔬 Analysis Results:
      Java found in: 8 containers
        ✓ cgroup v2 compatible: 4
        ✗ cgroup v2 incompatible: 4
      Node.js found in: 3 containers
        ✓ cgroup v2 compatible: 1
        ✗ cgroup v2 incompatible: 2
      .NET found in: 2 containers
        ✓ cgroup v2 compatible: 1
        ✗ cgroup v2 incompatible: 1

   Output file: output/mycluster.example.com-20260309-085455.csv
</code></pre></div></div>

<h3 id="analyzing-a-single-namespace">Analyzing a single namespace</h3>

<p>If you want to focus on a specific namespace, use the <code class="language-plaintext highlighter-rouge">-n</code> flag:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./image-cgroupsv2-inspector <span class="nt">--rootfs-path</span> /tmp/images <span class="nt">--analyze</span> <span class="nt">-n</span> test-java
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>📋 Inspecting single namespace: test-java

📦 Collecting container images from namespace: test-java
  (Only top-level controllers are reported, child objects are skipped)
  Collecting images from Deployments...
    Found 2 containers in Deployments
...

🔬 Analyzing images for Java, NodeJS, and .NET binaries...
  Found 2 unique images to analyze

  [1/2] Analyzing: registry.access.redhat.com/ubi8/openjdk-8:1.14...
    Pulling image: registry.access.redhat.com/ubi8/openjdk-8:1.14...
    Exporting container filesystem...
    Extracting filesystem...
    Searching for Java binaries...
    Searching for Node.js binaries...
    Searching for .NET binaries...
      ✗ Java (OpenJDK): 1.8.0_362 at /usr/lib/jvm/jre-1.8.0-openjdk-1.8.0.362.b09-2.el8_7.x86_64/bin/java
      ✗ Java (OpenJDK): 1.8.0_362 at /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.362.b09-2.el8_7.x86_64/bin/java
    💾 Progress saved: 1 rows (1/2 images)

  [2/2] Analyzing: registry.access.redhat.com/ubi8/openjdk-17:latest...
    Pulling image: registry.access.redhat.com/ubi8/openjdk-17:latest...
    Exporting container filesystem...
    Extracting filesystem...
    Searching for Java binaries...
    Searching for Node.js binaries...
    Searching for .NET binaries...
      ✓ Java (OpenJDK): 17.0.18 at /usr/lib/jvm/jre-17-openjdk-17.0.18.0.8-1.el8.x86_64/bin/java
    💾 Progress saved: 2 rows (2/2 images)

✓ Analyzed 2 unique images

📊 Summary:
   Total containers: 2
   Unique images: 2
   Namespaces: 1
   Object types: {'Deployment': 2}

   🔬 Analysis Results:
      Java found in: 2 containers
        ✓ cgroup v2 compatible: 1
        ✗ cgroup v2 incompatible: 1
      Node.js found in: 0 containers
      .NET found in: 0 containers
</code></pre></div></div>

<p>In this example, one container uses <strong>OpenJDK 17.0.18</strong> (compatible) while the other uses <strong>OpenJDK 1.8.0_362</strong> (incompatible, since the minimum required for Java 8 is <strong>8u372</strong>).</p>
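<p>For Java 8 specifically, the <code class="language-plaintext highlighter-rouge">1.8.0_NNN</code> version string corresponds to <code class="language-plaintext highlighter-rouge">8uNNN</code>, so the check boils down to comparing the update number against 372. A minimal sketch (the helper name is mine, not part of the tool):</p>

```shell
# Extract the update number from a Java 8 version string ("1.8.0_362" -> 362)
# and compare it against the 8u372 minimum required for cgroups v2 support.
java8_v2_compatible() {
  local update="${1##*_}"   # strip everything up to the last underscore
  [ "$update" -ge 372 ]
}

java8_v2_compatible "1.8.0_362" || echo "1.8.0_362: incompatible (needs 8u372+)"
java8_v2_compatible "1.8.0_372" && echo "1.8.0_372: compatible"
```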

<h2 id="understanding-the-csv-output">Understanding the CSV output</h2>

<p>The tool generates a CSV file in the <code class="language-plaintext highlighter-rouge">output/</code> directory, named after the cluster and timestamp. Here’s what the columns mean:</p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">container_name</code></td>
      <td>Name of the container</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">namespace</code></td>
      <td>Kubernetes namespace</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">object_type</code></td>
      <td>Resource type (Deployment, StatefulSet, etc.)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">object_name</code></td>
      <td>Name of the parent resource</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">image_name</code></td>
      <td>Full image name with tag</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">java_binary</code></td>
      <td>Path to Java binary found (“None” if not found)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">java_version</code></td>
      <td>Detected Java version</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">java_cgroup_v2_compatible</code></td>
      <td>“Yes”, “No”, or “N/A”</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">node_binary</code></td>
      <td>Path to Node.js binary found</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">node_version</code></td>
      <td>Detected Node.js version</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">node_cgroup_v2_compatible</code></td>
      <td>“Yes”, “No”, or “N/A”</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dotnet_binary</code></td>
      <td>Path to .NET binary found</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dotnet_version</code></td>
      <td>Detected .NET version</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dotnet_cgroup_v2_compatible</code></td>
      <td>“Yes”, “No”, or “N/A”</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">analysis_error</code></td>
      <td>Error message if analysis failed</td>
    </tr>
  </tbody>
</table>

<p>The CSV can be easily imported into a spreadsheet or processed with standard tools to generate reports for your team.</p>
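<p>For example, you can pull out only the rows with at least one incompatible runtime using awk, assuming the column order shown above (columns 8, 11, and 14 are the three compatibility flags) and no embedded commas in the fields. The sample file below is fabricated for illustration:</p>

```shell
# Build a small sample in the same column layout as the generated CSV.
cat > /tmp/report.csv <<'EOF'
container_name,namespace,object_type,object_name,image_name,java_binary,java_version,java_cgroup_v2_compatible,node_binary,node_version,node_cgroup_v2_compatible,dotnet_binary,dotnet_version,dotnet_cgroup_v2_compatible,analysis_error
app,test-java,Deployment,app,openjdk-8:1.14,/usr/bin/java,1.8.0_362,No,None,None,N/A,None,None,N/A,
web,test-node,Deployment,web,nodejs-20:latest,None,None,N/A,/usr/bin/node,20.19.0,Yes,None,None,N/A,
EOF

# Print namespace/object and image for every row where any of the three
# *_cgroup_v2_compatible columns is "No".
awk -F, 'NR > 1 && ($8 == "No" || $11 == "No" || $14 == "No") {
  print $2 "/" $4 ": " $5
}' /tmp/report.csv
```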

<h2 id="how-the-analysis-works-under-the-hood">How the analysis works under the hood</h2>

<p>For each unique container image, the tool performs the following steps:</p>

<ol>
  <li><strong>Pull the image</strong> using <code class="language-plaintext highlighter-rouge">podman pull</code> with the cluster’s pull-secret for authentication</li>
  <li><strong>Create a temporary container</strong> and export its filesystem as a tar archive</li>
  <li><strong>Extract the tar</strong> to a local rootfs directory</li>
  <li><strong>Search for binaries</strong> by walking the extracted filesystem looking for <code class="language-plaintext highlighter-rouge">java</code>, <code class="language-plaintext highlighter-rouge">node</code>, and <code class="language-plaintext highlighter-rouge">dotnet</code> executables (skipping symlinks that point to already-found binaries)</li>
  <li><strong>Execute version checks</strong> by running <code class="language-plaintext highlighter-rouge">podman run</code> with the original image, overriding the entrypoint to call the binary with its version flag (<code class="language-plaintext highlighter-rouge">java -version</code>, <code class="language-plaintext highlighter-rouge">node --version</code>, <code class="language-plaintext highlighter-rouge">dotnet --list-runtimes</code>)</li>
  <li><strong>Compare versions</strong> against the known minimum compatible versions</li>
  <li><strong>Clean up</strong> by removing the extracted files, tar archive, and the pulled image</li>
</ol>
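<p>Step 4, the binary search, essentially amounts to walking the extracted rootfs for regular files named <code class="language-plaintext highlighter-rouge">java</code>, <code class="language-plaintext highlighter-rouge">node</code>, or <code class="language-plaintext highlighter-rouge">dotnet</code>. A stand-alone sketch against a fabricated rootfs (note that <code class="language-plaintext highlighter-rouge">find -type f</code> already skips symlinks):</p>

```shell
# Build a tiny fake rootfs containing two real binaries and one symlink.
rootfs=$(mktemp -d)
mkdir -p "$rootfs/usr/bin" "$rootfs/opt/java/bin"
touch "$rootfs/opt/java/bin/java" "$rootfs/usr/bin/node"
ln -s /opt/java/bin/java "$rootfs/usr/bin/java"   # symlink: should be skipped

# -type f matches only regular files, so the /usr/bin/java symlink is ignored.
find "$rootfs" -type f \( -name java -o -name node -o -name dotnet \)
```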

<p>The tool handles several edge cases:</p>

<ul>
  <li><strong>OpenShift internal registry images</strong>: Automatically detects the internal registry route and rewrites URLs for external access</li>
  <li><strong>Short-name image resolution</strong>: Resolves short image names (e.g., <code class="language-plaintext highlighter-rouge">eclipse-temurin:17</code>) to fully qualified registry references by querying pod status</li>
  <li><strong>Resumability</strong>: The CSV is saved after each image is analyzed, so if the process is interrupted, already-analyzed images are preserved</li>
</ul>

<h2 id="additional-options">Additional options</h2>

<h3 id="custom-namespace-exclusions">Custom namespace exclusions</h3>

<p>By default, <code class="language-plaintext highlighter-rouge">openshift-*</code> and <code class="language-plaintext highlighter-rouge">kube-*</code> namespaces are excluded. You can customize this:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./image-cgroupsv2-inspector <span class="nt">--rootfs-path</span> /tmp/images <span class="nt">--analyze</span> <span class="se">\</span>
    <span class="nt">--exclude-namespaces</span> <span class="s2">"openshift-*,kube-*,staging-*"</span>
</code></pre></div></div>
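<p>These patterns behave as shell-style globs. A minimal bash sketch of the matching logic (the <code class="language-plaintext highlighter-rouge">is_excluded</code> helper is mine, not part of the tool):</p>

```shell
# Return success if the namespace matches any of the comma-separated glob
# patterns, mirroring the --exclude-namespaces filtering behaviour.
is_excluded() {
  local ns="$1" p
  IFS=',' read -ra patterns <<< "$2"
  for p in "${patterns[@]}"; do
    case "$ns" in $p) return 0 ;; esac   # unquoted $p enables glob matching
  done
  return 1
}

is_excluded "openshift-apiserver" "openshift-*,kube-*,staging-*" && echo "excluded"
is_excluded "my-app"              "openshift-*,kube-*,staging-*" || echo "kept"
```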

<h3 id="verbose-mode">Verbose mode</h3>

<p>For detailed debugging output, add the <code class="language-plaintext highlighter-rouge">-v</code> flag:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./image-cgroupsv2-inspector <span class="nt">--rootfs-path</span> /tmp/images <span class="nt">--analyze</span> <span class="nt">-v</span>
</code></pre></div></div>

<h3 id="log-to-file">Log to file</h3>

<p>To save all output to a log file for later review:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./image-cgroupsv2-inspector <span class="nt">--rootfs-path</span> /tmp/images <span class="nt">--analyze</span> <span class="se">\</span>
    <span class="nt">--log-to-file</span> <span class="nt">--log-file</span> my-analysis.log
</code></pre></div></div>

<h3 id="skip-disk-check">Skip disk check</h3>

<p>If you know your disk has enough space and want to skip the 20GB check:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./image-cgroupsv2-inspector <span class="nt">--rootfs-path</span> /tmp/images <span class="nt">--analyze</span> <span class="nt">--skip-disk-check</span>
</code></pre></div></div>

<h3 id="custom-internal-registry-route">Custom internal registry route</h3>

<p>If your OpenShift cluster exposes the internal registry with a custom route:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./image-cgroupsv2-inspector <span class="nt">--rootfs-path</span> /tmp/images <span class="nt">--analyze</span> <span class="se">\</span>
    <span class="nt">--internal-registry-route</span> my-registry.apps.mycluster.example.com
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>

<p>Before upgrading your OpenShift cluster to a version that defaults to cgroups v2, it is critical to assess your workloads for compatibility. The <strong>image-cgroupsv2-inspector</strong> tool automates this process by scanning all your container images, detecting Java, Node.js, and .NET runtimes, and checking whether they meet the minimum version requirements for cgroups v2 support.</p>

<p>The source code is available on GitHub: <a href="https://github.com/amedeos/image-cgroupsv2-inspector">image-cgroupsv2-inspector</a>.</p>

<p>For a comprehensive explanation of how cgroups v2 impacts these runtimes, check out the Red Hat Developers article: <a href="https://developers.redhat.com/articles/2025/11/27/how-does-cgroups-v2-impact-java-net-and-nodejs-openshift-4">How does cgroups v2 impact Java, .NET, and Node.js on OpenShift 4?</a>.</p>]]></content><author><name></name></author><category term="OpenShift" /><category term="OpenShift" /><category term="cgroupsv2" /><category term="java" /><category term="nodejs" /><category term="dotnet" /><summary type="html"><![CDATA[In my daily work I regularly help customers plan their OpenShift upgrades, and one of the most common concerns when moving to cgroups v2 is understanding which workloads are affected. I needed a quick and reliable way to scan an entire cluster and identify incompatible Java, Node.js, and .NET runtimes, so I built image-cgroupsv2-inspector and released it as open source, so that anyone can use it, modify it, and contribute to it.]]></summary></entry><entry><title type="html">Change OpenShift Data Foundation OSDs disk flavor (dimension)</title><link href="/openshift/2023/02/26/Change-ODF-osd-disk-flavor.html" rel="alternate" type="text/html" title="Change OpenShift Data Foundation OSDs disk flavor (dimension)" /><published>2023-02-26T09:00:00+00:00</published><updated>2023-02-26T09:00:00+00:00</updated><id>/openshift/2023/02/26/Change-ODF-osd-disk-flavor</id><content type="html" xml:base="/openshift/2023/02/26/Change-ODF-osd-disk-flavor.html"><![CDATA[<p>In this article, I’ll show you how to migrate your <strong>OpenShift Data Foundation OSDs</strong> (disks) from one flavor to another; in my case, I’ll migrate OSDs and data from 0.5TiB disks to 2TiB disks; this will be a “rolling” migration with no service or data disruption.</p>

<p><strong>Warning:</strong> If you are a Red Hat customer, open a support case before going forward, otherwise do the following steps at your own risk!</p>

<h2 id="requirements">Requirements</h2>
<p>Before starting, you’ll need:</p>

<ul>
  <li>an installed and working OpenShift Data Foundation;</li>
  <li>this article assumes ODF is configured with the <strong>replica</strong> parameter set to three [0], which is usually the default on hyperscalers; otherwise, you’ll need to adapt the procedure if you want to do this migration, for example, on bare metal (perhaps you’re using the LocalStorage operator on bare metal);</li>
  <li>in this guide, I’ll move data and disks from three OSD disks to three other OSD disks; if you have more than three OSDs, you must either repeat this procedure from beginning to end or verify that the three destination disks can hold the data from more than three source disks.</li>
</ul>

<p>[0] In this guide, I’ll assume your OpenShift Data Foundation is installed on a hyperscale cloud provider, such as Azure or AWS, with three availability zones, and that you have replica set to three:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc get storagecluster <span class="nt">-n</span> openshift-storage ocs-storagecluster <span class="nt">-ojson</span> | jq .spec.storageDeviceSets
<span class="o">[</span>
  <span class="o">{</span>
    <span class="s2">"count"</span>: 1,
...
    <span class="s2">"replica"</span>: 3,
...
  <span class="o">}</span>
<span class="o">]</span>
</code></pre></div></div>

<h2 id="run-must-gather">Run must-gather</h2>
<p>Before applying any change, run an OpenShift must-gather:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc adm must-gather
</code></pre></div></div>

<p>Then create an ODF-specific must-gather; in this example, I’m using ODF version 4.10:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">mkdir</span> ~/odf-must-gather
<span class="nv">$ </span>oc adm must-gather <span class="nt">--image</span><span class="o">=</span>registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.10 <span class="nt">--dest-dir</span><span class="o">=</span>~/odf-must-gather
</code></pre></div></div>

<h2 id="check-cluster-health">Check cluster health</h2>
<p>Check if your cluster is healthy:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ NAMESPACE</span><span class="o">=</span>openshift-storage<span class="p">;</span><span class="nv">ROOK_POD</span><span class="o">=</span><span class="si">$(</span>oc <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> get pod <span class="nt">-l</span> <span class="nv">app</span><span class="o">=</span>rook-ceph-operator <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.items[0].metadata.name}'</span><span class="si">)</span><span class="p">;</span>oc <span class="nb">exec</span> <span class="nt">-it</span> <span class="k">${</span><span class="nv">ROOK_POD</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--</span> ceph status <span class="nt">--cluster</span><span class="o">=</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--conf</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>.config <span class="nt">--keyring</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/client.admin.keyring | <span class="nb">grep </span>HEALTH
    health: HEALTH_OK
</code></pre></div></div>

<p><strong>WARNING:</strong> if your cluster is not in <strong>HEALTH_OK</strong>, stop any activities and check the ODF state!</p>

<h2 id="add-capacity">Add Capacity</h2>
<p>Add new capacity to your cluster using the new OSD flavor; in my case, the original storageDeviceSets is using 0.5TiB disks:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc get storagecluster ocs-storagecluster <span class="nt">-n</span> openshift-storage <span class="nt">-oyaml</span>
...
  storageDeviceSets:
  - config: <span class="o">{}</span>
    count: 1
    dataPVCTemplate:
      metadata: <span class="o">{}</span>
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 500Gi
        storageClassName: gp3-csi
        volumeMode: Block
      status: <span class="o">{}</span>
    name: ocs-deviceset
    placement: <span class="o">{}</span>
    preparePlacement: <span class="o">{}</span>
    replica: 3
    resources:
      limits:
        cpu: <span class="s2">"2"</span>
        memory: 5Gi
      requests:
        cpu: <span class="s2">"2"</span>
        memory: 5Gi
</code></pre></div></div>

<p>Switch to the openshift-storage project and back up the storagecluster resource:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc project openshift-storage

<span class="nv">$ </span>oc get storagecluster ocs-storagecluster <span class="nt">-oyaml</span> | <span class="nb">tee </span>backup-storagecluster-ocs-storagecluster.yaml
</code></pre></div></div>

<p>Add the new OSDs with the desired flavor; in my case, I’m adding a new storageDeviceSets entry with 2TiB disks:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc edit storagecluster ocs-storagecluster
...
  storageDeviceSets:
  - config: <span class="o">{}</span>
    count: 1
    dataPVCTemplate:
      metadata: <span class="o">{}</span>
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 500Gi
        storageClassName: gp3-csi
        volumeMode: Block
      status: <span class="o">{}</span>
    name: ocs-deviceset
    placement: <span class="o">{}</span>
    preparePlacement: <span class="o">{}</span>
    replica: 3
    resources:
      limits:
        cpu: <span class="s2">"2"</span>
        memory: 5Gi
      requests:
        cpu: <span class="s2">"2"</span>
        memory: 5Gi
  - config: <span class="o">{}</span>
    count: 1
    dataPVCTemplate:
      metadata: <span class="o">{}</span>
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2000Gi
        storageClassName: gp3-csi
        volumeMode: Block
      status: <span class="o">{}</span>
    name: ocs-deviceset-2t
    placement: <span class="o">{}</span>
    preparePlacement: <span class="o">{}</span>
    replica: 3
    resources:
      limits:
        cpu: <span class="s2">"2"</span>
        memory: 5Gi
      requests:
        cpu: <span class="s2">"2"</span>
        memory: 5Gi
  version: 4.10.0
</code></pre></div></div>
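
<p>If you want to script this step instead of using <strong>oc edit</strong>, the same addition can be expressed as a JSON patch that appends a new entry to <strong>storageDeviceSets</strong>. This is only a sketch based on my values (the deviceset name, size, and storage class are examples from this article; adjust them to your cluster and verify the resulting spec before relying on it):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># append a new deviceset entry; values below mirror the 2TiB example above
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json -p '[
  {"op": "add", "path": "/spec/storageDeviceSets/-", "value": {
    "name": "ocs-deviceset-2t",
    "count": 1,
    "replica": 3,
    "dataPVCTemplate": {
      "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "2000Gi"}},
        "storageClassName": "gp3-csi",
        "volumeMode": "Block"
      }
    },
    "resources": {
      "limits": {"cpu": "2", "memory": "5Gi"},
      "requests": {"cpu": "2", "memory": "5Gi"}
    }
  }}
]'
</code></pre></div></div>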

<p>Wait until ODF has rebalanced all data, which means the cluster is in <strong>HEALTH_OK</strong> status and all placement groups (<strong>pgs</strong>) are in <strong>active+clean</strong> state. To monitor the rebalance, you can use an infinite while loop:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="k">while </span><span class="nb">true</span><span class="p">;</span> <span class="k">do </span><span class="nv">NAMESPACE</span><span class="o">=</span>openshift-storage<span class="p">;</span><span class="nv">ROOK_POD</span><span class="o">=</span><span class="si">$(</span>oc <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> get pod <span class="nt">-l</span> <span class="nv">app</span><span class="o">=</span>rook-ceph-operator <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.items[0].metadata.name}'</span><span class="si">)</span><span class="p">;</span>oc <span class="nb">exec</span> <span class="nt">-it</span> <span class="k">${</span><span class="nv">ROOK_POD</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--</span> ceph status <span class="nt">--cluster</span><span class="o">=</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--conf</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>.config <span class="nt">--keyring</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/client.admin.keyring | egrep <span class="s1">'HEALTH_OK|HEALTH_WARN|[0-9]+\s+remapped|[0-9]+\/[0-9]+[ a-z]+misplaced[ ().%a-z0-9]+|'</span> <span class="p">;</span> <span class="nb">sleep </span>10 <span class="p">;</span> <span class="k">done
  </span>cluster:                  
    <span class="nb">id</span>:     .....
    health: HEALTH_OK                                              
                                                  
  services:                                         
    mon: 3 daemons, quorum a,b,c <span class="o">(</span>age 2h<span class="o">)</span>     
    mgr: a<span class="o">(</span>active, since 2w<span class="o">)</span> 
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up <span class="o">(</span>since 34s<span class="o">)</span>, 6 <span class="k">in</span> <span class="o">(</span>since 51s<span class="o">)</span><span class="p">;</span> 54 remapped pgs
                                                    
  data:                                                        
    volumes: 1/1 healthy            
    pools:   4 pools, 97 pgs                
    objects: 17.77k objects, 54 GiB
    usage:   157 GiB used, 7.2 TiB / 7.3 TiB avail
    pgs:     45509/53298 objects misplaced <span class="o">(</span>85.386%<span class="o">)</span>        
             54 active+remapped+backfill_wait
             37 active+clean
             5  active+remapped
             1  active+remapped+backfilling

  io:
    client:   1023 B/s rd, 217 KiB/s wr, 1 op/s rd, 6 op/s wr
    recovery: 86 MiB/s, 1 keys/s, 30 objects/s
</code></pre></div></div>

<p>In the above example, you can see that Ceph is rebalancing and remapping PGs. Wait until all PGs are in <strong>active+clean</strong> state:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ NAMESPACE</span><span class="o">=</span>openshift-storage<span class="p">;</span><span class="nv">ROOK_POD</span><span class="o">=</span><span class="si">$(</span>oc <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> get pod <span class="nt">-l</span> <span class="nv">app</span><span class="o">=</span>rook-ceph-operator <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.items[0].metadata.name}'</span><span class="si">)</span><span class="p">;</span>oc <span class="nb">exec</span> <span class="nt">-it</span> <span class="k">${</span><span class="nv">ROOK_POD</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--</span> ceph status <span class="nt">--cluster</span><span class="o">=</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--conf</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>.config <span class="nt">--keyring</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/client.admin.keyring
  cluster:
    <span class="nb">id</span>:     .....
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c <span class="o">(</span>age 2h<span class="o">)</span>
    mgr: a<span class="o">(</span>active, since 2w<span class="o">)</span>
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up <span class="o">(</span>since 17m<span class="o">)</span>, 6 <span class="k">in</span> <span class="o">(</span>since 18m<span class="o">)</span>
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 17.02k objects, 51 GiB
    usage:   146 GiB used, 7.2 TiB / 7.3 TiB avail
    pgs:     193 active+clean
 
  io:
    client:   853 B/s rd, 76 KiB/s wr, 1 op/s rd, 7 op/s wr
</code></pre></div></div>

<p><strong>WARNING:</strong> wait until your cluster returns all PGs in <strong>active+clean</strong> state!</p>
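
<p>The wait condition can also be expressed as a predicate over the captured status output, which is handy in scripts. This is a minimal sketch (the <strong>all_pgs_active_clean</strong> name is my own); it succeeds only when no PG is in a transitional state and at least one active+clean line is present:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>all_pgs_active_clean() {
  # $1: captured output of "ceph status"
  # fail while any PG is still remapped/backfilling/misplaced/degraded
  if printf '%s\n' "$1" | grep -Eq 'remapped|backfill|misplaced|degraded|peering'; then
    return 1
  fi
  # succeed only if an "active+clean" line remains
  printf '%s\n' "$1" | grep -q 'active+clean'
}
</code></pre></div></div>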

<p>Also check the CephFS status:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ NAMESPACE</span><span class="o">=</span>openshift-storage<span class="p">;</span><span class="nv">ROOK_POD</span><span class="o">=</span><span class="si">$(</span>oc <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> get pod <span class="nt">-l</span> <span class="nv">app</span><span class="o">=</span>rook-ceph-operator <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.items[0].metadata.name}'</span><span class="si">)</span><span class="p">;</span>oc <span class="nb">exec</span> <span class="nt">-it</span> <span class="k">${</span><span class="nv">ROOK_POD</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--</span> ceph fs status <span class="nt">--cluster</span><span class="o">=</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--conf</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>.config <span class="nt">--keyring</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/client.admin.keyring
ocs-storagecluster-cephfilesystem - 12 clients
<span class="o">=================================</span>
RANK      STATE                       MDS                     ACTIVITY     DNS    INOS   DIRS   CAPS  
 0        active      ocs-storagecluster-cephfilesystem-b  Reqs:   37 /s  34.8k  27.2k  8369   27.1k  
0-s   standby-replay  ocs-storagecluster-cephfilesystem-a  Evts:   47 /s  82.3k  26.8k  8298      0   
</code></pre></div></div>

<p><strong>WARNING</strong>: one of the two MDS daemons must be in the active state!</p>

<h2 id="identify-old-osds--disks-to-remove">Identify old OSDs / disks to remove</h2>
<p>Take note of the three OSD IDs to remove; they belong to your old flavor (weight 0.48830). To see the ODF OSD topology, run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ NAMESPACE</span><span class="o">=</span>openshift-storage<span class="p">;</span><span class="nv">ROOK_POD</span><span class="o">=</span><span class="si">$(</span>oc <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> get pod <span class="nt">-l</span> <span class="nv">app</span><span class="o">=</span>rook-ceph-operator <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.items[0].metadata.name}'</span><span class="si">)</span><span class="p">;</span>oc <span class="nb">exec</span> <span class="nt">-it</span> <span class="k">${</span><span class="nv">ROOK_POD</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--</span> ceph osd tree <span class="nt">--cluster</span><span class="o">=</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--conf</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>.config <span class="nt">--keyring</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/client.admin.keyring
ID   CLASS  WEIGHT   TYPE NAME                                       STATUS  REWEIGHT  PRI-AFF
 <span class="nt">-1</span>         7.32417  root default
 <span class="nt">-5</span>         7.32417      region eu-central-1
<span class="nt">-14</span>         2.44139          zone eu-central-1a
<span class="nt">-13</span>         2.44139              host ip-XX-XX-XX-4-rete
  2    ssd  0.48830                  osd.2                               up   1.00000  1.00000
  5    ssd  1.95309                  osd.5                               up   1.00000  1.00000
<span class="nt">-10</span>         2.44139          zone eu-central-1b
 <span class="nt">-9</span>         2.44139              host ip-XX-XX-XX-46-rete
  1    ssd  0.48830                  osd.1                               up   1.00000  1.00000
  4    ssd  1.95309                  osd.4                               up   1.00000  1.00000
 <span class="nt">-4</span>         2.44139          zone eu-central-1c
 <span class="nt">-3</span>         2.44139              host ip-XX-XX-XX-80-rete
  0    ssd  0.48830                  osd.0                               up   1.00000  1.00000
  3    ssd  1.95309                  osd.3                               up   1.00000  1.00000
</code></pre></div></div>

<p>In my case, the old OSDs are osd.0, osd.1, and osd.2. Those OSDs need to be removed one by one, waiting for <strong>HEALTH_OK</strong> after every removal.</p>
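
<p>If you prefer to extract the old IDs programmatically, a small awk filter over the tree output works. This is a sketch (the <strong>old_osd_ids</strong> helper is my own naming); it matches OSD rows by the CRUSH weight column:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>old_osd_ids() {
  # $1: captured output of "ceph osd tree"
  # $2: weight of the old flavor as printed in the tree, e.g. "0.48830"
  printf '%s\n' "$1" | awk -v w="$2" '$2 == "ssd" { if ($3 == w) print $1 }'
}
</code></pre></div></div>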

<h2 id="remove-the-osd-from-the-old-storage-flavor">Remove the OSD from the old Storage flavor</h2>
<h3 id="switch-to-the-openshift-storage-project">Switch to the openshift-storage project</h3>
<p>First switch to the openshift-storage project:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc project openshift-storage
</code></pre></div></div>

<h3 id="copy-the-ceph-config-and-keyring-files">Copy the Ceph config and keyring files</h3>
<p>Copy the Ceph config and keyring files from the rook operator pod to your Linux box; they will then be transferred to one mon container, so that you can still run Ceph commands after scaling down the rook operator.</p>

<p>Copy files from rook container to your Linux box:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ ROOK</span><span class="o">=</span><span class="si">$(</span>oc get pod | <span class="nb">grep </span>rook-ceph-operator | <span class="nb">awk</span> <span class="s1">'{print $1}'</span><span class="si">)</span>
<span class="nv">$ </span><span class="nb">echo</span> <span class="k">${</span><span class="nv">ROOK</span><span class="k">}</span>
rook-ceph-operator-5767bbc7b9-w8swd

<span class="nv">$ </span>oc rsync <span class="k">${</span><span class="nv">ROOK</span><span class="k">}</span>:/var/lib/rook/openshift-storage/openshift-storage.config <span class="nb">.</span>
WARNING: cannot use rsync: rsync not available <span class="k">in </span>container
openshift-storage.config
<span class="nv">$ </span>oc rsync <span class="k">${</span><span class="nv">ROOK</span><span class="k">}</span>:/var/lib/rook/openshift-storage/client.admin.keyring <span class="nb">.</span>
WARNING: cannot use rsync: rsync not available <span class="k">in </span>container
client.admin.keyring
</code></pre></div></div>

<p>Copy the openshift-storage.config and client.admin.keyring files from your Linux box to one mon container:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ MONA</span><span class="o">=</span><span class="si">$(</span>oc get pod | <span class="nb">grep </span>rook-ceph-mon | egrep <span class="s1">'2\/2\s+Running'</span> | <span class="nb">head</span> <span class="nt">-n1</span> | <span class="nb">awk</span> <span class="s1">'{print $1}'</span><span class="si">)</span>
<span class="nv">$ </span><span class="nb">echo</span> <span class="k">${</span><span class="nv">MONA</span><span class="k">}</span>
rook-ceph-mon-a-769fc864f-btmmr

<span class="nv">$ </span>oc <span class="nb">cp </span>openshift-storage.config <span class="k">${</span><span class="nv">MONA</span><span class="k">}</span>:/tmp/openshift-storage.config
Defaulted container <span class="s2">"mon"</span> out of: mon, log-collector, chown-container-data-dir <span class="o">(</span>init<span class="o">)</span>, init-mon-fs <span class="o">(</span>init<span class="o">)</span>
<span class="nv">$ </span>oc <span class="nb">cp </span>client.admin.keyring <span class="k">${</span><span class="nv">MONA</span><span class="k">}</span>:/tmp/client.admin.keyring
Defaulted container <span class="s2">"mon"</span> out of: mon, log-collector, chown-container-data-dir <span class="o">(</span>init<span class="o">)</span>, init-mon-fs <span class="o">(</span>init<span class="o">)</span>
</code></pre></div></div>

<p><strong>NOTE</strong>: MONA, in one Italian regional language, means a stupid person :smile:</p>

<p>Check the Ceph command on MONA container:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc rsh <span class="k">${</span><span class="nv">MONA</span><span class="k">}</span>
Defaulted container <span class="s2">"mon"</span> out of: mon, log-collector, chown-container-data-dir <span class="o">(</span>init<span class="o">)</span>, init-mon-fs <span class="o">(</span>init<span class="o">)</span>
sh-4.4# ceph health <span class="nt">--cluster</span><span class="o">=</span>openshift-storage <span class="nt">--conf</span><span class="o">=</span>/tmp/openshift-storage.config <span class="nt">--keyring</span><span class="o">=</span>/tmp/client.admin.keyring
2023-XX <span class="nt">-1</span> auth: unable to find a keyring on /var/lib/rook/openshift-storage/client.admin.keyring: <span class="o">(</span>2<span class="o">)</span> No such file or directory
2023-XX <span class="nt">-1</span> AuthRegistry<span class="o">(</span>0x7fbbb805bb68<span class="o">)</span> no keyring found at /var/lib/rook/openshift-storage/client.admin.keyring, disabling cephx
HEALTH_OK
sh-4.4# <span class="nb">exit</span>
</code></pre></div></div>

<h3 id="scale-down-openshift-data-foundation-operators">Scale down OpenShift Data Foundation operators</h3>
<p>We can now scale to zero rook and ocs operators:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc scale deploy ocs-operator <span class="nt">--replicas</span><span class="o">=</span>0
deployment.apps/ocs-operator scaled
<span class="nv">$ </span>oc scale deploy rook-ceph-operator <span class="nt">--replicas</span><span class="o">=</span>0
deployment.apps/rook-ceph-operator scaled
</code></pre></div></div>

<h3 id="remove-one-osd">Remove one OSD</h3>
<p>Now you can remove one OSD; in my case, I’ll remove osd.0 (zero), but in your case, it could be a different ID.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ failed_osd_id</span><span class="o">=</span>0
<span class="nv">$ </span><span class="nb">export </span><span class="nv">PS1</span><span class="o">=</span><span class="s2">"[</span><span class="se">\u</span><span class="s2">@</span><span class="se">\h</span><span class="s2"> </span><span class="se">\W</span><span class="s2">]</span><span class="se">\ </span><span class="s2">OSD=</span><span class="nv">$failed_osd_id</span><span class="s2"> </span><span class="nv">$ </span><span class="s2">"</span>

<span class="nv">$ </span>oc scale deploy rook-ceph-osd-<span class="k">${</span><span class="nv">failed_osd_id</span><span class="k">}</span> <span class="nt">--replicas</span><span class="o">=</span>0
deployment.apps/rook-ceph-osd-0 scaled

<span class="nv">$ </span>oc process <span class="nt">-n</span> openshift-storage ocs-osd-removal <span class="nt">-p</span> <span class="nv">FAILED_OSD_IDS</span><span class="o">=</span><span class="k">${</span><span class="nv">failed_osd_id</span><span class="k">}</span> <span class="nv">FORCE_OSD_REMOVAL</span><span class="o">=</span><span class="nb">true</span> |oc create <span class="nt">-n</span> openshift-storage <span class="nt">-f</span> -
job.batch/ocs-osd-removal-job created

<span class="nv">$ JOBREMOVAL</span><span class="o">=</span><span class="si">$(</span>oc get pod | <span class="nb">grep </span>ocs-osd-removal-job- | <span class="nb">awk</span> <span class="s1">'{print $1}'</span><span class="si">)</span>

<span class="nv">$ </span>oc logs <span class="k">${</span><span class="nv">JOBREMOVAL</span><span class="k">}</span> | egrep <span class="s2">"cephosd: completed removal of OSD </span><span class="k">${</span><span class="nv">failed_osd_id</span><span class="k">}</span><span class="s2">"</span>
2023-XX I | cephosd: completed removal of OSD 0
</code></pre></div></div>

<p><strong>NOTE</strong>: on the last command you must see <strong>cephosd: completed removal of OSD X</strong>, where X is your osd id (in my case zero).</p>

<p>Check the Ceph health status, where you may see a temporarily degraded state due to the OSD removal:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc rsh <span class="k">${</span><span class="nv">MONA</span><span class="k">}</span>
Defaulted container <span class="s2">"mon"</span> out of: mon, log-collector, chown-container-data-dir <span class="o">(</span>init<span class="o">)</span>, init-mon-fs <span class="o">(</span>init<span class="o">)</span>
sh-4.4# 
sh-4.4# ceph status <span class="nt">--cluster</span><span class="o">=</span>openshift-storage <span class="nt">--conf</span><span class="o">=</span>/tmp/openshift-storage.config <span class="nt">--keyring</span><span class="o">=</span>/tmp/client.admin.keyring
2023-XX <span class="nt">-1</span> auth: unable to find a keyring on /var/lib/rook/openshift-storage/client.admin.keyring: <span class="o">(</span>2<span class="o">)</span> No such file or directory
2023-XX <span class="nt">-1</span> AuthRegistry<span class="o">(</span>0x7f207005bb68<span class="o">)</span> no keyring found at /var/lib/rook/openshift-storage/client.admin.keyring, disabling cephx
  cluster:
    <span class="nb">id</span>:     .....
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c <span class="o">(</span>age 2h<span class="o">)</span>
    mgr: a<span class="o">(</span>active, since 2w<span class="o">)</span>
    mds: 1/1 daemons up, 1 hot standby
    osd: 5 osds: 5 up <span class="o">(</span>since 19m<span class="o">)</span>, 5 <span class="k">in</span> <span class="o">(</span>since 9m<span class="o">)</span>
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 17.10k objects, 52 GiB
    usage:   146 GiB used, 6.7 TiB / 6.8 TiB avail
    pgs:     193 active+clean
 
  io:
    client:   1.2 KiB/s rd, 460 KiB/s wr, 2 op/s rd, 7 op/s wr
 
sh-4.4#
</code></pre></div></div>

<p>Wait until Ceph returns <strong>HEALTH_OK</strong> and all PGs are in <strong>active+clean</strong> state:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sh-4.4# <span class="k">while </span><span class="nb">true</span><span class="p">;</span> <span class="k">do </span>ceph status <span class="nt">--cluster</span><span class="o">=</span>openshift-storage <span class="nt">--conf</span><span class="o">=</span>/tmp/openshift-storage.config <span class="nt">--keyring</span><span class="o">=</span>/tmp/client.admin.keyring | egrep <span class="nt">--color</span><span class="o">=</span>always <span class="s1">'[0-9]+\/[0-9]+.*(degraded|misplaced)|'</span> <span class="p">;</span> <span class="nb">sleep </span>10 <span class="p">;</span> <span class="k">done</span>
</code></pre></div></div>

<p><strong>WARNING</strong>: before proceeding, you must wait for Ceph <strong>HEALTH_OK</strong> and all PGs in <strong>active+clean</strong> state!</p>

<p>Delete removal job:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc delete job ocs-osd-removal-job
job.batch <span class="s2">"ocs-osd-removal-job"</span> deleted
</code></pre></div></div>

<p>Repeat these steps for each OSD you need to remove (in my case, for osd.1 and osd.2).</p>
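
<p>The per-OSD removal steps above can be sketched as a loop. This is only a hypothetical outline (the OSD IDs and the manual pause are placeholders); do not run it unattended, because Ceph must return to <strong>HEALTH_OK</strong> with all PGs <strong>active+clean</strong> between removals:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># hypothetical sketch: repeat the removal for each old OSD ID
for failed_osd_id in 0 1 2; do
  oc scale deploy rook-ceph-osd-${failed_osd_id} --replicas=0
  oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=${failed_osd_id} FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
  # check the job log for "completed removal of OSD ${failed_osd_id}",
  # then wait for HEALTH_OK and all PGs active+clean before continuing
  read -p "Press Enter once Ceph is HEALTH_OK and all PGs are active+clean... " _
  oc delete job ocs-osd-removal-job -n openshift-storage
done
</code></pre></div></div>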

<h2 id="remove-your-old-storagedevicesets-pointing-to-old-osd-disks-flavor">Remove your old storageDeviceSets pointing to old OSD disks flavor</h2>
<p>After removing all OSDs from your old storageDeviceSets (in my case, the 0.5TiB disk flavor), you can remove the entry from your <strong>storagecluster</strong> object.</p>

<p>Make a backup before editing your storagecluster:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc get storagecluster ocs-storagecluster <span class="nt">-oyaml</span> | <span class="nb">tee </span>storagecluster-ocs-storagecluster-before-remove-500g.yaml
</code></pre></div></div>

<p>Edit your storagecluster so that only the newly created storageDeviceSets entry remains; in my case, the one with the 2TiB disk flavor:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc edit storagecluster ocs-storagecluster
...
  storageDeviceSets:
  - config: <span class="o">{}</span>
    count: 1
    dataPVCTemplate:
      metadata: <span class="o">{}</span>
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2000Gi
        storageClassName: gp3-csi
        volumeMode: Block
      status: <span class="o">{}</span>
    name: ocs-deviceset-2t
    placement: <span class="o">{}</span>
    preparePlacement: <span class="o">{}</span>
    replica: 3
    resources:
      limits:
        cpu: <span class="s2">"2"</span>
        memory: 5Gi
      requests:
        cpu: <span class="s2">"2"</span>
        memory: 5Gi
  version: 4.10.0
</code></pre></div></div>
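
<p>As with adding capacity, this removal can be scripted with a JSON patch instead of <strong>oc edit</strong>. This sketch assumes the old deviceset sits at index 0 of <strong>storageDeviceSets</strong> — verify the index on your own cluster first:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># list the devicesets to find the index of the old entry
oc get storagecluster ocs-storagecluster -n openshift-storage \
  -o jsonpath='{.spec.storageDeviceSets[*].name}'

# remove the old entry (index 0 is an assumption; adjust it)
oc patch storagecluster ocs-storagecluster -n openshift-storage \
  --type json -p '[{"op": "remove", "path": "/spec/storageDeviceSets/0"}]'
</code></pre></div></div>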

<h2 id="scale-up-openshift-data-foundation-operators">Scale up OpenShift Data Foundation operators</h2>
<p>At this point, you can scale up ocs-operator:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc scale deploy ocs-operator <span class="nt">--replicas</span><span class="o">=</span>1
deployment.apps/ocs-operator scaled
</code></pre></div></div>

<p>and then re-check Ceph health status:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ NAMESPACE</span><span class="o">=</span>openshift-storage<span class="p">;</span><span class="nv">ROOK_POD</span><span class="o">=</span><span class="si">$(</span>oc <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> get pod <span class="nt">-l</span> <span class="nv">app</span><span class="o">=</span>rook-ceph-operator <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.items[0].metadata.name}'</span><span class="si">)</span><span class="p">;</span>oc <span class="nb">exec</span> <span class="nt">-it</span> <span class="k">${</span><span class="nv">ROOK_POD</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--</span> ceph status <span class="nt">--cluster</span><span class="o">=</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--conf</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>.config <span class="nt">--keyring</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/client.admin.keyring | egrep <span class="nt">-i</span> <span class="s1">'remapped|misplaced|active\+clean|HEALTH_OK|'</span>
  cluster:
    <span class="nb">id</span>:     .....
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c <span class="o">(</span>age 3h<span class="o">)</span>
    mgr: a<span class="o">(</span>active, since 2w<span class="o">)</span>
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up <span class="o">(</span>since 7m<span class="o">)</span>, 3 <span class="k">in</span> <span class="o">(</span>since 6m<span class="o">)</span>
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 17.15k objects, 52 GiB
    usage:   145 GiB used, 5.7 TiB / 5.9 TiB avail
    pgs:     193 active+clean
 
  io:
    client:   853 B/s rd, 246 KiB/s wr, 1 op/s rd, 4 op/s wr
</code></pre></div></div>]]></content><author><name></name></author><category term="OpenShift" /><category term="OpenShift" /><category term="odf" /><summary type="html"><![CDATA[In this article, I’ll show you how to migrate your OpenShift Data Foundation OSDs (disks) from one flavor to another; in my case, I’ll migrate OSDs and data from 0.5TiB disks to 2TiB disks; this will be a “rolling” migration with no service or data disruption.]]></summary></entry><entry><title type="html">Using eBPF on OpenShift nodes (the quick and dirty way)</title><link href="/openshift/2023/01/22/Using-ebpf-on-OpenShift-the-quick-and-dirty-way.html" rel="alternate" type="text/html" title="Using eBPF on OpenShift nodes (the quick and dirty way)" /><published>2023-01-22T15:00:00+00:00</published><updated>2023-01-22T15:00:00+00:00</updated><id>/openshift/2023/01/22/Using-ebpf-on-OpenShift-the-quick-and-dirty-way</id><content type="html" xml:base="/openshift/2023/01/22/Using-ebpf-on-OpenShift-the-quick-and-dirty-way.html"><![CDATA[<p>In this <strong>OpenShift</strong> article, I’ll show you how to run <a href="https://iovisor.github.io/bcc/">bcc</a> tools, <a href="https://github.com/iovisor/bpftrace">bpftrace</a>, and the kernel tool <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/bpftool">bpftool</a>.</p>

<p><strong>Warning:</strong> If you are a Red Hat customer and you are in trouble, open a support case before going forward; otherwise, do the following steps at your own risk!</p>

<h2 id="requirements">Requirements</h2>
<p>Before starting, you’ll need:</p>

<ul>
  <li>A working OpenShift 4.10+ cluster; I’ve only tested this procedure on OpenShift 4.10, 4.11, and 4.12, but it may work on all supported 4.x versions (please let me know);</li>
  <li>cluster-admin privileges on the OpenShift cluster;</li>
  <li>one subscribed RHEL host with <strong>podman</strong> and <strong>buildah</strong> installed or installable. You could use a non-subscribed host, but to use the baseos and appstream EUS repositories, you’d need to temporarily attach a subscription inside the ubi8 container;</li>
  <li>SSH enabled on your OpenShift node(s).</li>
</ul>

<h2 id="run-must-gather">Run must-gather</h2>
<p>Before applying any change, run an OpenShift must-gather:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc adm must-gather
</code></pre></div></div>

<h2 id="a-special-mention-to-openshift-412">A Special mention to OpenShift 4.12+</h2>
<p>Starting with OpenShift 4.12, the bpftool package is included in the toolbox and can be used without following this procedure:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh core@worker
<span class="o">[</span>core@worker-0 ~]<span class="nv">$ </span><span class="nb">sudo</span> <span class="nt">-i</span>

<span class="o">[</span>root@worker-0 ~]# toolbox 
Trying to pull registry.redhat.io/rhel8/support-tools:latest...
...
toolbox-root
Container started successfully. To <span class="nb">exit</span>, <span class="nb">type</span> <span class="s1">'exit'</span><span class="nb">.</span>
<span class="o">[</span>root@worker-0 /]#

<span class="o">[</span>root@worker-0 /]# bpftool <span class="nt">-h</span>
Usage: bpftool <span class="o">[</span>OPTIONS] OBJECT <span class="o">{</span> COMMAND | <span class="nb">help</span> <span class="o">}</span>
       bpftool batch file FILE
       bpftool version

       OBJECT :<span class="o">=</span> <span class="o">{</span> prog | map | <span class="nb">link</span> | cgroup | perf | net | feature | btf | gen | struct_ops | iter <span class="o">}</span>
       OPTIONS :<span class="o">=</span> <span class="o">{</span> <span class="o">{</span><span class="nt">-j</span>|--json<span class="o">}</span> <span class="o">[{</span><span class="nt">-p</span>|--pretty<span class="o">}]</span> | <span class="o">{</span><span class="nt">-d</span>|--debug<span class="o">}</span> |
                    <span class="o">{</span><span class="nt">-V</span>|--version<span class="o">}</span> <span class="o">}</span>
<span class="o">[</span>root@worker-0 /]#
</code></pre></div></div>

<p>However, if you also want to run the bcc and bpftrace tools, continue following this guide.</p>

<h2 id="build-a-bpf-ocp-image-for-your-openshift-cluster">Build a bpf-ocp image for your OpenShift cluster</h2>
<p>Before running eBPF on your OpenShift nodes, you need to build a tailored image for your cluster.</p>

<p>This image will include:</p>

<ul>
  <li>kernel-core and kernel-headers matching your OpenShift nodes’ kernel version;</li>
  <li>bpftrace, bcc, and bpftool packages;</li>
  <li>some other performance troubleshooting packages.</li>
</ul>

<h3 id="install-tools-on-rhel">Install tools on RHEL</h3>
<p>You need one host, usually a RHEL host, where you can build your new image with the eBPF tools installed. In this example, I’ll show you how to install those tools on a RHEL host.</p>

<p>Install buildah and podman:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>dnf <span class="nb">install </span>buildah podman <span class="nt">-y</span>
</code></pre></div></div>

<h3 id="retrieve-openshift-node-information--version">Retrieve OpenShift node information / version</h3>
<p>Remember to log in to your cluster, then pick the name of one of your Ready nodes:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ OCPNODE</span><span class="o">=</span><span class="si">$(</span>oc get node | egrep <span class="s1">'\s+Ready\s+'</span> | <span class="nb">head</span> <span class="nt">-n1</span> | <span class="nb">awk</span> <span class="s1">'{print $1}'</span><span class="si">)</span>
<span class="nv">$ </span><span class="nb">echo</span> <span class="k">${</span><span class="nv">OCPNODE</span><span class="k">}</span>
master-0
</code></pre></div></div>
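<p>You can sanity-check that filter pipeline offline against canned oc get node output (the node names below are made up); the egrep/awk chain picks the first node whose STATUS column is exactly Ready:</p>

```shell
# Sample "oc get node" output: one Ready node and one NotReady node
printf 'NAME       STATUS     ROLES    AGE   VERSION\nmaster-0   Ready      master   10d   v1.23.5\nworker-9   NotReady   worker   10d   v1.23.5\n' \
    | egrep '\s+Ready\s+' | head -n1 | awk '{print $1}'
# → master-0
```

<p>Note that the NotReady line does not match, because the pattern requires whitespace immediately before “Ready”.</p>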

<p>obtain the node kernel version:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ KERNELVERSION</span><span class="o">=</span><span class="si">$(</span>oc debug node/<span class="k">${</span><span class="nv">OCPNODE</span><span class="k">}</span> <span class="nt">--</span> <span class="nb">chroot</span> /host <span class="nb">uname</span> <span class="nt">-r</span> 2&gt; /dev/null <span class="si">)</span>
<span class="nv">$ </span><span class="nb">echo</span> <span class="k">${</span><span class="nv">KERNELVERSION</span><span class="k">}</span> 
4.18.0-372.26.1.el8_6.x86_64
</code></pre></div></div>

<p>obtain the node RHEL minor version:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ RHEL8MINOR</span><span class="o">=</span><span class="si">$(</span>oc debug node/<span class="k">${</span><span class="nv">OCPNODE</span><span class="k">}</span> <span class="nt">--</span> <span class="nb">chroot</span> /host sh <span class="nt">-c</span> <span class="s2">"uname -r | sed -E 's/.+</span><span class="se">\.</span><span class="s2">el8_([0-9])</span><span class="se">\.</span><span class="s2">.*/</span><span class="se">\1</span><span class="s2">/g'"</span> 2&gt; /dev/null <span class="si">)</span>
<span class="nv">$ </span><span class="nb">echo</span> <span class="k">${</span><span class="nv">RHEL8MINOR</span><span class="k">}</span> 
6
</code></pre></div></div>
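<p>The sed expression can also be verified offline against a sample kernel release string before running it through oc debug (the kernel version below is just an example):</p>

```shell
# Sample kernel release, as returned by "uname -r" on an OpenShift node
KERNEL="4.18.0-372.26.1.el8_6.x86_64"
# Drop everything up to ".el8_" and keep only the captured minor-version digit
echo "${KERNEL}" | sed -E 's/.+\.el8_([0-9])\..*/\1/g'
# → 6
```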

<h3 id="create-dockerfile">Create Dockerfile</h3>
<p>Create a Dockerfile for your image:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">mkdir </span>buildbpf
<span class="nv">$ </span><span class="nb">cd </span>buildbpf

<span class="nv">$ </span><span class="nb">tee</span> <span class="s2">"Dockerfile"</span> <span class="o">&gt;</span> /dev/null <span class="o">&lt;&lt;</span><span class="sh">'</span><span class="no">EOF</span><span class="sh">'
# Start from ubi8 with minor version RHEL8MINOR
FROM registry.access.redhat.com/ubi8:8.RHEL8MINOR
# Install some useful tools
RUN dnf install --disablerepo='*' --enablerepo=rhel-8-for-x86_64-baseos-eus-rpms --enablerepo=rhel-8-for-x86_64-appstream-eus-rpms --releasever=8.RHEL8MINOR </span><span class="se">\</span><span class="sh">
        redhat-lsb-core curl wget tcpdump vim iproute </span><span class="se">\</span><span class="sh">
        bind-utils sysstat procps-ng -y
# Install OCP node version of kernel-core and kernel-headers      
RUN dnf install --disablerepo='*' --enablerepo=rhel-8-for-x86_64-baseos-eus-rpms --enablerepo=rhel-8-for-x86_64-appstream-eus-rpms --releasever=8.RHEL8MINOR </span><span class="se">\</span><span class="sh">
        kernel-core-KERNELVERSION kernel-headers-KERNELVERSION -y
# Install bpftrace and bpftool, and their dependencies (bcc, python-bcc)
RUN dnf install --disablerepo='*' --enablerepo=rhel-8-for-x86_64-baseos-eus-rpms --enablerepo=rhel-8-for-x86_64-appstream-eus-rpms --releasever=8.RHEL8MINOR </span><span class="se">\</span><span class="sh">
        bpftrace bpftool -y
# Clean dnf cache
RUN dnf clean all
</span><span class="no">EOF
</span></code></pre></div></div>

<p>replace the RHEL 8 minor version and kernel version with your cluster versions:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sed</span> <span class="nt">-i</span> <span class="s2">"s/RHEL8MINOR/</span><span class="k">${</span><span class="nv">RHEL8MINOR</span><span class="k">}</span><span class="s2">/g"</span> Dockerfile
<span class="nv">$ </span><span class="nb">sed</span> <span class="nt">-i</span> <span class="s2">"s/KERNELVERSION/</span><span class="k">${</span><span class="nv">KERNELVERSION</span><span class="k">}</span><span class="s2">/g"</span> Dockerfile
</code></pre></div></div>
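<p>A quick grep confirms that both placeholders were fully substituted; the check below should print the success message:</p>

```shell
# No template marker should survive the two sed runs; grep exits non-zero
# when neither RHEL8MINOR nor KERNELVERSION is found in the Dockerfile
if grep -E 'RHEL8MINOR|KERNELVERSION' Dockerfile; then
    echo "ERROR: unreplaced placeholders found" >&2
else
    echo "Dockerfile placeholders fully substituted"
fi
```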

<p>review the content:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>Dockerfile 
<span class="c"># Start from ubi8 with minor version 6</span>
FROM registry.access.redhat.com/ubi8:8.6
<span class="c"># Install some useful tools</span>
RUN dnf <span class="nb">install</span> <span class="nt">--disablerepo</span><span class="o">=</span><span class="s1">'*'</span> <span class="nt">--enablerepo</span><span class="o">=</span>rhel-8-for-x86_64-baseos-eus-rpms <span class="nt">--enablerepo</span><span class="o">=</span>rhel-8-for-x86_64-appstream-eus-rpms <span class="nt">--releasever</span><span class="o">=</span>8.6 <span class="se">\</span>
        redhat-lsb-core curl wget tcpdump vim iproute <span class="se">\</span>
        bind-utils sysstat procps-ng <span class="nt">-y</span>
<span class="c"># Install OCP node version of kernel-core and kernel-headers      </span>
RUN dnf <span class="nb">install</span> <span class="nt">--disablerepo</span><span class="o">=</span><span class="s1">'*'</span> <span class="nt">--enablerepo</span><span class="o">=</span>rhel-8-for-x86_64-baseos-eus-rpms <span class="nt">--enablerepo</span><span class="o">=</span>rhel-8-for-x86_64-appstream-eus-rpms <span class="nt">--releasever</span><span class="o">=</span>8.6 <span class="se">\</span>
        kernel-core-4.18.0-372.26.1.el8_6.x86_64 kernel-headers-4.18.0-372.26.1.el8_6.x86_64 <span class="nt">-y</span>
<span class="c"># Install bpftrace and bpftool, and their dependencies (bcc, python-bcc)</span>
RUN dnf <span class="nb">install</span> <span class="nt">--disablerepo</span><span class="o">=</span><span class="s1">'*'</span> <span class="nt">--enablerepo</span><span class="o">=</span>rhel-8-for-x86_64-baseos-eus-rpms <span class="nt">--enablerepo</span><span class="o">=</span>rhel-8-for-x86_64-appstream-eus-rpms <span class="nt">--releasever</span><span class="o">=</span>8.6 <span class="se">\</span>
        bpftrace bpftool <span class="nt">-y</span>
<span class="c"># Clean dnf cache</span>
RUN dnf clean all

</code></pre></div></div>

<p><strong>WARNING</strong>: the above content is an example! Your Dockerfile may contain different versions!</p>

<h3 id="build-the-bpf-ocp-image">Build the bpf-ocp image</h3>
<p>Run buildah:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">export </span><span class="nv">BUILDAH_FORMAT</span><span class="o">=</span>docker

<span class="nv">$ </span>buildah bud <span class="nt">-t</span> bpf-ocp:8.<span class="k">${</span><span class="nv">RHEL8MINOR</span><span class="k">}</span>-<span class="k">${</span><span class="nv">KERNELVERSION</span><span class="k">}</span>
...
COMMIT bpf-ocp:8.6-4.18.0-372.26.1.el8_6.x86_64
Getting image <span class="nb">source </span>signatures
Copying blob b4e347eee7c8 skipped: already exists  
Copying blob 724516754461 <span class="k">done  
</span>Copying config 594a7e339c <span class="k">done  
</span>Writing manifest to image destination
Storing signatures
<span class="nt">--</span><span class="o">&gt;</span> 594a7e339cb
Successfully tagged localhost/bpf-ocp:8.6-4.18.0-372.26.1.el8_6.x86_64
594a7e339cb7ee321ad126c776012b29f4c5da2c8d302331906260d67e394ea2
<span class="err">$</span>
</code></pre></div></div>

<h3 id="special-case-for-not-rhel-subscribed-buildah-host">Special case for NOT RHEL subscribed buildah host</h3>
<p><strong>WARNING:</strong> Run these commands only if buildah bud has failed!</p>

<p>If your host is not a subscribed RHEL system, you can run the following commands instead:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">rm </span>Dockerfile
<span class="nv">$ </span>buildah from registry.access.redhat.com/ubi8:8.<span class="k">${</span><span class="nv">RHEL8MINOR</span><span class="k">}</span>
ubi8-working-container
</code></pre></div></div>

<p>subscribe your container:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>buildah run ubi8-working-container  subscription-manager register
Registering to: subscription.rhsm.redhat.com:443/subscription
Username: &lt;YOUR RH ACCOUNT&gt;
Password: &lt;YOUR RH PASSWORD&gt;
The system has been registered with ID: 86326611-9b37-4888-b6bb-850007165594
The registered system name is: 84ff9ac8faa2
</code></pre></div></div>

<p>navigate to the <a href="https://access.redhat.com/management/">Red Hat Customer Portal</a>, click on <strong>Systems</strong>, then click on your container hostname (in my case, 84ff9ac8faa2), select <strong>Subscriptions</strong>, click on the <strong>Attach Subscriptions</strong> button, Select the subscription you want in the left check box, then click <strong>Attach Subscriptions</strong>.</p>

<p>go back to the terminal and run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>buildah run ubi8-working-container  subscription-manager repos <span class="nt">--list</span> | <span class="nb">tee</span> <span class="nt">-a</span> /tmp/repos.txt

<span class="nv">$ </span>buildah run ubi8-working-container dnf <span class="nb">install</span> <span class="se">\</span>
        <span class="nt">--disablerepo</span><span class="o">=</span><span class="s1">'*'</span> <span class="nt">--enablerepo</span><span class="o">=</span>rhel-8-for-x86_64-baseos-eus-rpms <span class="se">\</span>
        <span class="nt">--enablerepo</span><span class="o">=</span>rhel-8-for-x86_64-appstream-eus-rpms <span class="nt">--releasever</span><span class="o">=</span>8.<span class="k">${</span><span class="nv">RHEL8MINOR</span><span class="k">}</span> <span class="se">\</span>
        redhat-lsb-core curl wget tcpdump vim iproute <span class="se">\</span>
        bind-utils sysstat procps-ng <span class="nt">-y</span>

<span class="nv">$ </span>buildah run ubi8-working-container dnf <span class="nb">install</span> <span class="nt">--disablerepo</span><span class="o">=</span><span class="s1">'*'</span> <span class="se">\</span>
        <span class="nt">--enablerepo</span><span class="o">=</span>rhel-8-for-x86_64-baseos-eus-rpms <span class="se">\</span>
        <span class="nt">--enablerepo</span><span class="o">=</span>rhel-8-for-x86_64-appstream-eus-rpms <span class="nt">--releasever</span><span class="o">=</span>8.<span class="k">${</span><span class="nv">RHEL8MINOR</span><span class="k">}</span> <span class="se">\</span>
        kernel-core-<span class="k">${</span><span class="nv">KERNELVERSION</span><span class="k">}</span> kernel-headers-<span class="k">${</span><span class="nv">KERNELVERSION</span><span class="k">}</span> <span class="nt">-y</span>

<span class="nv">$ </span>buildah run ubi8-working-container dnf <span class="nb">install</span> <span class="nt">--disablerepo</span><span class="o">=</span><span class="s1">'*'</span> <span class="se">\</span>
        <span class="nt">--enablerepo</span><span class="o">=</span>rhel-8-for-x86_64-baseos-eus-rpms <span class="se">\</span>
        <span class="nt">--enablerepo</span><span class="o">=</span>rhel-8-for-x86_64-appstream-eus-rpms <span class="nt">--releasever</span><span class="o">=</span>8.<span class="k">${</span><span class="nv">RHEL8MINOR</span><span class="k">}</span> <span class="se">\</span>
        bpftrace bpftool <span class="nt">-y</span>

<span class="nv">$ </span>buildah run ubi8-working-container dnf clean all

<span class="nv">$ </span>buildah run ubi8-working-container subscription-manager unregister

<span class="nv">$ </span>buildah run ubi8-working-container subscription-manager clean
</code></pre></div></div>

<p>execute buildah commit:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>buildah commit ubi8-working-container bpf-ocp:8.<span class="k">${</span><span class="nv">RHEL8MINOR</span><span class="k">}</span>-<span class="k">${</span><span class="nv">KERNELVERSION</span><span class="k">}</span>
</code></pre></div></div>

<p>remove the buildah working container:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>buildah <span class="nb">rm </span>ubi8-working-container
</code></pre></div></div>

<p>check the image:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>podman image <span class="nb">ls</span> | egrep <span class="s1">'^REPOSI|bpf-ocp'</span>
REPOSITORY                                                               TAG                               IMAGE ID      CREATED             SIZE
localhost/bpf-ocp                                                        8.6-4.18.0-372.26.1.el8_6.x86_64  30712fe599dd  About a minute ago  1.54 GB
</code></pre></div></div>

<h3 id="save-the-image-as-a-tar-file">Save the image as a tar file</h3>
<p>Save the just created bpf-ocp image as a tar file:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>podman save <span class="nt">--quiet</span> <span class="nt">--format</span> docker-archive <span class="se">\</span>
    <span class="nt">-o</span> bpf-ocp-8.<span class="k">${</span><span class="nv">RHEL8MINOR</span><span class="k">}</span>-<span class="k">${</span><span class="nv">KERNELVERSION</span><span class="k">}</span>.tar <span class="se">\</span>
    localhost/bpf-ocp:8.<span class="k">${</span><span class="nv">RHEL8MINOR</span><span class="k">}</span>-<span class="k">${</span><span class="nv">KERNELVERSION</span><span class="k">}</span>
</code></pre></div></div>

<h3 id="transfer-bpf-ocp-image-to-desired-openshift-node">Transfer bpf-ocp image to desired OpenShift node</h3>
<p>In this example, I want to run the bpf tools on the <strong>worker-1</strong> node; change this to your actual OpenShift node.</p>

<p>First, obtain the IP address of the node:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ IPNODE</span><span class="o">=</span><span class="si">$(</span>oc get node <span class="nt">-owide</span> | <span class="nb">grep </span>worker-1 | <span class="nb">awk</span> <span class="s1">'{print $6}'</span><span class="si">)</span>
<span class="nv">$ </span><span class="nb">echo</span> <span class="k">${</span><span class="nv">IPNODE</span><span class="k">}</span> 
192.168.203.57
</code></pre></div></div>

<p>transfer image using scp:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>scp bpf-ocp-8.<span class="k">${</span><span class="nv">RHEL8MINOR</span><span class="k">}</span>-<span class="k">${</span><span class="nv">KERNELVERSION</span><span class="k">}</span>.tar <span class="se">\</span>
      core@<span class="k">${</span><span class="nv">IPNODE</span><span class="k">}</span>:/tmp/bpf-ocp-8.<span class="k">${</span><span class="nv">RHEL8MINOR</span><span class="k">}</span>-<span class="k">${</span><span class="nv">KERNELVERSION</span><span class="k">}</span>.tar
bpf-ocp-8.6-4.18.0-372.26.1.el8_6.x86_64.tar 100% 1464MB 109.9MB/s   00:13
</code></pre></div></div>
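<p>Optionally, verify that the transfer was not corrupted by comparing checksums on both sides; the two hashes must be identical (the file names follow the example above):</p>

```shell
# Checksum of the local tar file...
sha256sum bpf-ocp-8.${RHEL8MINOR}-${KERNELVERSION}.tar
# ...and of the copy on the node; compare the two 64-character hex digests
ssh core@${IPNODE} sha256sum /tmp/bpf-ocp-8.${RHEL8MINOR}-${KERNELVERSION}.tar
```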

<h3 id="load-image-from-tar-file">Load image from tar file</h3>
<p>If you’re running OpenShift 4.11+, you can simply run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh core@<span class="k">${</span><span class="nv">IPNODE</span><span class="k">}</span>
<span class="o">[</span>core@worker-1 ~]<span class="nv">$ </span><span class="nb">sudo</span> <span class="nt">-i</span>
<span class="o">[</span>root@worker-1 ~]# 

<span class="o">[</span>root@worker-1 ~]# podman load <span class="nt">--input</span> /tmp/bpf-ocp-8.6-4.18.0-372.26.1.el8_6.x86_64.tar
Getting image <span class="nb">source </span>signatures
Copying blob 724516754461 <span class="k">done
</span>Copying blob b4e347eee7c8 <span class="k">done
</span>Copying config 594a7e339c <span class="k">done
</span>Writing manifest to image destination
Storing signatures
Loaded image<span class="o">(</span>s<span class="o">)</span>: localhost/bpf-ocp:8.6-4.18.0-372.26.1.el8_6.x86_64
<span class="o">[</span>root@worker-1 ~]#
</code></pre></div></div>

<p>If instead you’re trying to load the image on OpenShift 4.10 with podman 3.x / CoreOS 8.4, you may get this error (log level debug):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@worker-1 ~]# podman load <span class="nt">--log-level</span><span class="o">=</span>debug <span class="nt">-i</span> /tmp/bpf-ocp-8.4-4.18.0-305.65.1.el8_4.x86_64.tar
...
DEBU[0001] Error loading /tmp/bpf-ocp-8.4-4.18.0-305.65.1.el8_4.x86_64.tar: Source image rejected: Running image docker-archive:/tmp/bpf-ocp-8.4-4.18.0-305.65.1.el8_4.x86_64.tar:localhost/bpf-ocp:8.4-4.18.0-305.65.1.el8_4.x86_64 is rejected by policy.
</code></pre></div></div>

<p>in this case, create a permissive policy file:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@worker-1 ~]# <span class="nb">echo</span> <span class="s1">'{ "default": [{"type": "insecureAcceptAnything"}]}'</span> <span class="o">&gt;</span> /tmp/policy-permissive.json
</code></pre></div></div>
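<p>It’s worth sanity-checking that the policy file is valid JSON before using it, since a malformed policy would make podman reject every image load (assuming python3 is available on the node; otherwise run the check from your workstation):</p>

```shell
# json.tool pretty-prints the file and exits non-zero on invalid JSON
python3 -m json.tool /tmp/policy-permissive.json
```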

<p>then use this permissive signature policy to load the image:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@worker-1 ~]# podman load <span class="nt">--signature-policy</span> /tmp/policy-permissive.json <span class="nt">-i</span> /tmp/bpf-ocp-8.4-4.18.0-305.65.1.el8_4.x86_64.tar 
Getting image <span class="nb">source </span>signatures
Copying blob d46291327397 <span class="k">done  
</span>Copying blob 5bc03dec6239 <span class="k">done  
</span>Copying blob 525ed45dbdb1 <span class="k">done  
</span>Copying config 17ae9469bd <span class="k">done  
</span>Writing manifest to image destination
Storing signatures
Loaded image<span class="o">(</span>s<span class="o">)</span>: localhost/bpf-ocp:8.4-4.18.0-305.65.1.el8_4.x86_64
</code></pre></div></div>

<h3 id="run-the-bpf-ocp-container">Run the bpf-ocp container</h3>
<p>Finally, you can spin up a new bpf-ocp container:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@worker-1 ~]# podman run <span class="nt">--privileged</span> <span class="nt">--name</span> bpf-ocp <span class="se">\</span>
    <span class="nt">--mount</span> <span class="nb">type</span><span class="o">=</span><span class="nb">bind</span>,source<span class="o">=</span>/sys/kernel/debug,target<span class="o">=</span>/sys/kernel/debug <span class="se">\</span>
    <span class="nt">-it</span> localhost/bpf-ocp:8.6-4.18.0-372.26.1.el8_6.x86_64 
<span class="o">[</span>root@d18d4f16d28a /]#
</code></pre></div></div>

<p>and test the bcc tool biolatency to verify that it is working properly (press Ctrl-C to end tracing):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@d18d4f16d28a /]# /usr/share/bcc/tools/biolatency
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -&gt; 1          : 0        |                                        |
         2 -&gt; 3          : 0        |                                        |
         4 -&gt; 7          : 0        |                                        |
         8 -&gt; 15         : 0        |                                        |
        16 -&gt; 31         : 7        |                                        |
        32 -&gt; 63         : 386      |<span class="k">**************************</span>              |
        64 -&gt; 127        : 452      |<span class="k">******************************</span>          |
       128 -&gt; 255        : 585      |<span class="k">****************************************</span>|
       256 -&gt; 511        : 501      |<span class="k">**********************************</span>      |
       512 -&gt; 1023       : 161      |<span class="k">***********</span>                             |
      1024 -&gt; 2047       : 18       |<span class="k">*</span>                                       |
      2048 -&gt; 4095       : 58       |<span class="k">***</span>                                     |
      4096 -&gt; 8191       : 56       |<span class="k">***</span>                                     |
      8192 -&gt; 16383      : 55       |<span class="k">***</span>                                     |
     16384 -&gt; 32767      : 14       |                                        |
     32768 -&gt; 65535      : 3        |                                        |
     65536 -&gt; 131071     : 5        |                                        |
<span class="o">[</span>root@d18d4f16d28a /]#
</code></pre></div></div>

<h3 id="conclusion">Conclusion</h3>
<p>This is a quick and dirty way to run eBPF on your OpenShift cluster; you can also build bpf-ocp images for your cluster(s), publish them to your registry, and, for example, deploy them as a DaemonSet on all your cluster nodes.</p>]]></content><author><name></name></author><category term="OpenShift" /><category term="OpenShift" /><category term="ebpf" /><summary type="html"><![CDATA[In this OpenShift article, I’ll show you how to run bcc tools, bpftrace, and the kernel tool bpftool.]]></summary></entry><entry><title type="html">Migrate Elasticsearch shards to a new StorageClass</title><link href="/openshift/2023/01/08/Migrate-elastic-data-to-a-new-storageclass.html" rel="alternate" type="text/html" title="Migrate Elasticsearch shards to a new StorageClass" /><published>2023-01-08T09:00:00+00:00</published><updated>2023-01-08T09:00:00+00:00</updated><id>/openshift/2023/01/08/Migrate-elastic-data-to-a-new-storageclass</id><content type="html" xml:base="/openshift/2023/01/08/Migrate-elastic-data-to-a-new-storageclass.html"><![CDATA[<p>In this <strong>OpenShift</strong> / <strong>Kubernetes</strong> article, I’ll show you how to migrate your <strong>Elasticsearch</strong> data (shards) from one cloud StorageClass, such as Azure managed-premium, to another, such as Azure <strong>managed-csi</strong>. This will be a “rolling” migration, with no service and data disruption.</p>

<p><strong>Warning:</strong> If you are a Red Hat customer, open a support case before going forward; otherwise do the following steps at your own risk!</p>

<h2 id="requirements">Requirements</h2>
<p>Before starting, you’ll need:</p>

<ul>
  <li>Elasticsearch installed and operational [0];</li>
  <li>a configured new / destination StorageClass; in this article I’ll use <strong>managed-csi</strong>;</li>
  <li>Elasticsearch running on three different worker nodes (typically from three different availability zones);</li>
  <li>Elasticsearch currently running on three different CDMs, each with its own PVC on the old StorageClass (in my case, Azure managed-premium) [1];</li>
  <li>Elasticsearch configured with minimum_master_nodes set to 2.</li>
</ul>

<p>[0] In this guide, I’ll assume you’ve installed Elasticsearch on a hyperscale cloud provider, such as Azure or AWS, with three availability zones.</p>

<p>[1] Three Elasticsearch CDM pods running on OpenShift, as an example:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc get pods <span class="nt">-l</span> <span class="nv">component</span><span class="o">=</span>elasticsearch <span class="nt">-o</span> wide <span class="nt">-n</span> openshift-logging
NAME                                            READY   STATUS    RESTARTS   AGE   IP            NODE                                          NOMINATED NODE   READINESS GATES
elasticsearch-cdm-19ibb0br-1-f58b8f764-6dnvg    2/2     Running   0          42d   100.65.8.8    nodename-xxxx-elastic-northeurope3-vk7cf   &lt;none&gt;           &lt;none&gt;
elasticsearch-cdm-19ibb0br-2-787fd9c4c5-r88lc   2/2     Running   0          41d   100.65.6.8    nodename-xxxx-elastic-northeurope1-rd545   &lt;none&gt;           &lt;none&gt;
elasticsearch-cdm-19ibb0br-3-6bc8c8f98-w6jh9    2/2     Running   0          42d   100.65.10.8   nodename-xxxx-elastic-northeurope2-mbtc9   &lt;none&gt;           &lt;none&gt;
</code></pre></div></div>

<h2 id="run-must-gather">Run must-gather</h2>
<p>Before applying any change, run an OpenShift must-gather:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc adm must-gather
</code></pre></div></div>

<p>then, create a specific cluster-logging must-gather:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc adm must-gather <span class="nt">--image</span><span class="o">=</span><span class="si">$(</span>oc <span class="nt">-n</span> openshift-logging get deployment.apps/cluster-logging-operator <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.spec.template.spec.containers[?(@.name == "cluster-logging-operator")].image}'</span><span class="si">)</span>
</code></pre></div></div>

<h2 id="check-cluster-health--green-state">Check cluster health / green state</h2>
<p>Check if your cluster is in green state:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc project openshift-logging
<span class="nv">$ </span><span class="nb">export </span><span class="nv">ELKCDM</span><span class="o">=</span><span class="si">$(</span>oc get pods <span class="nt">-l</span> <span class="nv">component</span><span class="o">=</span>elasticsearch <span class="nt">-o</span> wide | egrep <span class="s1">'2\/2\s+Running'</span> | <span class="nb">head</span> <span class="nt">-n1</span> | <span class="nb">awk</span> <span class="s1">'{print $1}'</span><span class="si">)</span>

<span class="nv">$ </span>oc <span class="nb">exec</span> <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">-c</span> elasticsearch <span class="nt">--</span> health | egrep <span class="s1">'green|'</span>
epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1661176875 14:01:15  elasticsearch green           3         3    428 214    0    0        0             0                  -                100.0%
</code></pre></div></div>

<p><strong>WARNING:</strong> if your cluster is not in <strong>green</strong>, and <strong>active_shards_percent</strong> is not equal to 100%, stop any activities and check the elasticsearch state first!</p>
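<p>If you prefer a machine-readable check, the status field can be extracted from Elasticsearch’s JSON _cluster/health endpoint; the parsing is shown here against an abbreviated sample response (in practice, pipe the es_util --query=_cluster/health output):</p>

```shell
# Abbreviated sample of an Elasticsearch _cluster/health response
HEALTH='{"cluster_name":"elasticsearch","status":"green","active_shards_percent_as_number":100.0}'
# Extract just the status field with python3 (jq would work equally well)
echo "${HEALTH}" | python3 -c 'import json,sys; print(json.load(sys.stdin)["status"])'
# → green
```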

<h2 id="check-cluster-routing-allocation-parameter">Check cluster routing allocation parameter</h2>
<p>The cluster.routing.allocation.enable parameter must be set to “all”; if, for example, it is set to “primaries”, you need to change it to “all” and wait for shard creation / relocation.</p>

<p>This is the correct value:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc <span class="nb">exec</span> <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">-c</span> elasticsearch <span class="nt">--</span> es_util <span class="nt">--query</span><span class="o">=</span>_cluster/settings?pretty
<span class="o">{</span>
  <span class="s2">"persistent"</span> : <span class="o">{</span>
    <span class="s2">"cluster"</span> : <span class="o">{</span>
      <span class="s2">"routing"</span> : <span class="o">{</span>
        <span class="s2">"allocation"</span> : <span class="o">{</span>
          <span class="s2">"enable"</span> : <span class="s2">"all"</span>
        <span class="o">}</span>
      <span class="o">}</span>
    <span class="o">}</span>,
    <span class="s2">"discovery"</span> : <span class="o">{</span>
      <span class="s2">"zen"</span> : <span class="o">{</span>
        <span class="s2">"minimum_master_nodes"</span> : <span class="s2">"2"</span>
      <span class="o">}</span>
    <span class="o">}</span>
  <span class="o">}</span>,
  <span class="s2">"transient"</span> : <span class="o">{</span> <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
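<p>Rather than eyeballing the JSON, you can extract the allocation setting directly; the parsing below is shown against a sample settings document (python3 on the workstation is an assumption, jq works just as well):</p>

```shell
# Sample output of es_util --query=_cluster/settings, compacted
SETTINGS='{"persistent":{"cluster":{"routing":{"allocation":{"enable":"all"}}}},"transient":{}}'
# Walk down to cluster.routing.allocation.enable
echo "${SETTINGS}" | python3 -c 'import json,sys; print(json.load(sys.stdin)["persistent"]["cluster"]["routing"]["allocation"]["enable"])'
# → all
```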

<p>instead, if you have primaries:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc <span class="nb">exec</span> <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">-c</span> elasticsearch <span class="nt">--</span> es_util <span class="nt">--query</span><span class="o">=</span>_cluster/settings?pretty
<span class="o">{</span>
  <span class="s2">"persistent"</span> : <span class="o">{</span>
    <span class="s2">"cluster"</span> : <span class="o">{</span>
      <span class="s2">"routing"</span> : <span class="o">{</span>
        <span class="s2">"allocation"</span> : <span class="o">{</span>
          <span class="s2">"enable"</span> : <span class="s2">"primaries"</span>
        <span class="o">}</span>
      <span class="o">}</span>
    <span class="o">}</span>,
    <span class="s2">"discovery"</span> : <span class="o">{</span>
      <span class="s2">"zen"</span> : <span class="o">{</span>
        <span class="s2">"minimum_master_nodes"</span> : <span class="s2">"2"</span>
      <span class="o">}</span>
    <span class="o">}</span>
  <span class="o">}</span>,
  <span class="s2">"transient"</span> : <span class="o">{</span> <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>In this case, you have to overwrite it to “all”:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc <span class="nb">exec</span> <span class="nt">-c</span> elasticsearch <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">--</span> curl <span class="nt">-s</span> <span class="nt">--key</span> /etc/elasticsearch/secret/admin-key <span class="nt">--cert</span> /etc/elasticsearch/secret/admin-cert <span class="nt">--cacert</span> /etc/elasticsearch/secret/admin-ca <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="nt">-XPUT</span> <span class="s2">"https://localhost:9200/_cluster/settings"</span> <span class="nt">-d</span> <span class="s1">'{ "persistent":{ "cluster.routing.allocation.enable" : "all" }}'</span>
<span class="o">{</span><span class="s2">"acknowledged"</span>:true,<span class="s2">"persistent"</span>:<span class="o">{</span><span class="s2">"cluster"</span>:<span class="o">{</span><span class="s2">"routing"</span>:<span class="o">{</span><span class="s2">"allocation"</span>:<span class="o">{</span><span class="s2">"enable"</span>:<span class="s2">"all"</span><span class="o">}}}}</span>,<span class="s2">"transient"</span>:<span class="o">{}}</span>

<span class="nv">$ </span>oc <span class="nb">exec</span> <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">-c</span> elasticsearch <span class="nt">--</span> es_util <span class="nt">--query</span><span class="o">=</span>_cluster/settings?pretty
<span class="o">{</span>
  <span class="s2">"persistent"</span> : <span class="o">{</span>
    <span class="s2">"cluster"</span> : <span class="o">{</span>
      <span class="s2">"routing"</span> : <span class="o">{</span>
        <span class="s2">"allocation"</span> : <span class="o">{</span>
          <span class="s2">"enable"</span> : <span class="s2">"all"</span>
        <span class="o">}</span>
      <span class="o">}</span>
    <span class="o">}</span>,
    <span class="s2">"discovery"</span> : <span class="o">{</span>
      <span class="s2">"zen"</span> : <span class="o">{</span>
        <span class="s2">"minimum_master_nodes"</span> : <span class="s2">"2"</span>
      <span class="o">}</span>
    <span class="o">}</span>
  <span class="o">}</span>,
  <span class="s2">"transient"</span> : <span class="o">{</span> <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Then wait until the relocation (relo) column reaches zero (0):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="k">while </span><span class="nb">true</span> <span class="p">;</span> <span class="k">do </span>oc <span class="nb">exec</span> <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">-c</span> elasticsearch <span class="nt">--</span> health | egrep <span class="s1">'green\s+3\s+3|'</span> <span class="p">;</span> <span class="nb">sleep </span>10 <span class="p">;</span> <span class="k">done
</span>epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1661259701 13:01:41  elasticsearch green           3         3    428 214    2    0        0             0                  -                100.0%
...........
</code></pre></div></div>
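<p>If you prefer a scriptable check over eyeballing the table, the relo value can be pulled out with a small awk helper. This is only a sketch: it assumes relo is the 9th column of the health output, and the relo_count name is mine, not part of es_util.</p>

```shell
# Hypothetical helper: print the "relo" value (9th column) from the
# health table on stdin (header on line 1, data row on line 2).
relo_count() {
  awk 'NR == 2 { print $9 }'
}

# In practice you would pipe the real output through it:
#   oc exec ${ELKCDM} -c elasticsearch -- health | relo_count
# Here a captured sample row stands in for it:
printf '%s\n%s\n' \
  'epoch timestamp cluster status node.total node.data shards pri relo init unassign' \
  '1661259701 13:01:41 elasticsearch green 3 3 428 214 2 0 0' \
  | relo_count    # prints 2
```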

<h2 id="check-for-shards-not-started">Check for shards not started</h2>
<p>Check that all your shards are in STARTED state:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span> oc <span class="nb">exec</span> <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">-c</span> elasticsearch <span class="nt">--</span> es_util <span class="nt">--query</span><span class="o">=</span>_cat/shards?v | <span class="nb">grep</span> <span class="nt">-v</span> STARTED
index                          shard prirep state      docs   store ip          node
</code></pre></div></div>

<p><strong>Warning:</strong> the above command must return only one line, the column header; if any shards are NOT in the STARTED state, stop all activities and check the Elasticsearch state first!</p>
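<p>The same verification can be made scriptable by counting the non-STARTED data rows. A sketch only: it assumes state is the 4th column of the _cat/shards output, and the not_started_shards helper name is hypothetical.</p>

```shell
# Hypothetical helper: count shards whose state column (4th field) is
# not STARTED, skipping the header line; 0 means it is safe to proceed.
not_started_shards() {
  awk 'NR > 1 && $4 != "STARTED"' | wc -l
}

# In practice:
#   oc exec ${ELKCDM} -c elasticsearch -- es_util --query=_cat/shards?v | not_started_shards
# Here sample rows stand in for the real output:
printf '%s\n%s\n%s\n' \
  'index shard prirep state docs store ip node' \
  'app-0001 0 p STARTED 100 1mb 100.65.8.8 elasticsearch-cdm-19ibb0br-1' \
  'app-0001 0 r RELOCATING 100 1mb 100.65.6.8 elasticsearch-cdm-19ibb0br-2' \
  | not_started_shards    # prints 1
```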

<h2 id="replace-the-storageclass-in-the-clusterlogging-instance">Replace the StorageClass in the clusterlogging instance</h2>
<p>Edit your clusterlogging instance with your new StorageClass name (in my case, managed-csi).</p>

<p>Back it up before editing:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc get clusterlogging instance <span class="nt">-oyaml</span> | <span class="nb">tee</span> <span class="nt">-a</span> clusterlogging-instance-before-storageclass-change.yaml
</code></pre></div></div>

<p>Then edit it, changing only the storageClassName parameter:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc edit clusterlogging instance
clusterlogging.logging.openshift.io/instance edited
</code></pre></div></div>

<p>Check that the storageClassName value is correct:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc get clusterlogging instance <span class="nt">-ojson</span> | jq <span class="nt">-r</span> <span class="s1">'.spec.logStore.elasticsearch.storage.storageClassName'</span>
managed-csi
</code></pre></div></div>

<h2 id="remove-shards-from-one-elastic-cdm-pod">Remove shards from one elastic CDM pod</h2>
<p>Now identify one Elasticsearch CDM and its overlay IP in order to relocate all shards away from it:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span> oc <span class="nb">exec</span> <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">-c</span> elasticsearch <span class="nt">--</span> es_util <span class="nt">--query</span><span class="o">=</span>_cat/nodes?v
ip          heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
100.65.8.8            48          99  15    2.36    2.21     2.40 mdi       -      elasticsearch-cdm-19ibb0br-1
100.65.6.8            26          99  10    0.90    1.32     1.72 mdi       <span class="k">*</span>      elasticsearch-cdm-19ibb0br-2
100.65.10.8           24          99  18    2.21    2.63     3.30 mdi       -      elasticsearch-cdm-19ibb0br-3
</code></pre></div></div>

<p>In this example, I’ll move shards from CDM-1 <strong>elasticsearch-cdm-19ibb0br-1</strong>, which has the IP address <strong>100.65.8.8</strong>.</p>
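<p>To grab the IP programmatically instead of copying it from the table, a small awk filter over the _cat/nodes output works. A sketch under the assumption that the node name is the last column and the IP the first; the node_ip helper name is mine.</p>

```shell
# Hypothetical helper: print the IP (1st column) of the row whose last
# column matches the given node name in the _cat/nodes table on stdin.
node_ip() {
  awk -v node="$1" '$NF == node { print $1 }'
}

# In practice:
#   oc exec ${ELKCDM} -c elasticsearch -- es_util --query=_cat/nodes?v \
#     | node_ip elasticsearch-cdm-19ibb0br-1
# Here captured sample rows stand in for the real output:
printf '%s\n%s\n%s\n' \
  'ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name' \
  '100.65.8.8 48 99 15 2.36 2.21 2.40 mdi - elasticsearch-cdm-19ibb0br-1' \
  '100.65.6.8 26 99 10 0.90 1.32 1.72 mdi * elasticsearch-cdm-19ibb0br-2' \
  | node_ip elasticsearch-cdm-19ibb0br-1    # prints 100.65.8.8
```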

<h3 id="exclude-cdm-ip">Exclude CDM IP</h3>
<p>To move shards off the identified Elasticsearch CDM, you must exclude its IP (100.65.8.8 in my case; be sure to substitute the IP of your own CDM):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span> oc <span class="nb">exec</span> <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">-c</span> elasticsearch <span class="nt">--</span> es_util <span class="nt">--query</span><span class="o">=</span>_cluster/settings?pretty <span class="nt">-X</span> PUT <span class="nt">-d</span> <span class="s1">'{"transient":{"cluster.routing.allocation.exclude._ip": "100.65.8.8"}}'</span>
<span class="o">{</span>
  <span class="s2">"acknowledged"</span> : <span class="nb">true</span>,
  <span class="s2">"persistent"</span> : <span class="o">{</span> <span class="o">}</span>,
  <span class="s2">"transient"</span> : <span class="o">{</span>
    <span class="s2">"cluster"</span> : <span class="o">{</span>
      <span class="s2">"routing"</span> : <span class="o">{</span>
        <span class="s2">"allocation"</span> : <span class="o">{</span>
          <span class="s2">"exclude"</span> : <span class="o">{</span>
            <span class="s2">"_ip"</span> : <span class="s2">"100.65.8.8"</span>
          <span class="o">}</span>
        <span class="o">}</span>
      <span class="o">}</span>
    <span class="o">}</span>
  <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Wait until all shards have been relocated to the other two Elasticsearch nodes:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc rsh <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> 
Defaulted container <span class="s2">"elasticsearch"</span> out of: elasticsearch, proxy
sh-4.4<span class="nv">$ </span><span class="k">while </span><span class="nb">true</span> <span class="p">;</span> <span class="k">do </span>es_util <span class="nt">--query</span><span class="o">=</span>_cat/shards?v | <span class="nb">grep </span>100.65.8.8 | <span class="nb">wc</span> <span class="nt">-l</span> <span class="p">;</span> <span class="nb">sleep </span>10 <span class="p">;</span> <span class="k">done
</span>142
140
... <span class="o">(</span><span class="nb">cut</span><span class="o">)</span>
0
<span class="o">(</span>Ctrl-c<span class="o">)</span>
sh-4.4<span class="nv">$ </span><span class="nb">exit</span>
</code></pre></div></div>
<h3 id="scale-elasticsearch-cdm-deploy-to-zero">Scale elasticsearch CDM deploy to zero</h3>
<p>Scale the Elasticsearch CDM deployment to zero (0); in my case, elasticsearch-cdm-19ibb0br-1:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc get deploy
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator       1/1     1            1           137d
elasticsearch-cdm-19ibb0br-1   1/1     1            1           137d
elasticsearch-cdm-19ibb0br-2   1/1     1            1           137d
elasticsearch-cdm-19ibb0br-3   1/1     1            1           137d
kibana                         1/1     1            1           89d

<span class="nv">$ </span>oc scale deploy elasticsearch-cdm-19ibb0br-1 <span class="nt">--replicas</span><span class="o">=</span>0
deployment.apps/elasticsearch-cdm-19ibb0br-1 scaled
</code></pre></div></div>

<h3 id="delete-elasticsearch-cdm-pvc">Delete elasticsearch CDM PVC</h3>
<p>Delete the PVC corresponding to your Elasticsearch CDM, in my case <strong>elasticsearch-elasticsearch-cdm-19ibb0br-1</strong>; be sure to substitute your own PVC name:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc get pvc
NAME                                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
elasticsearch-elasticsearch-cdm-19ibb0br-1   Bound    pvc-3d378adc-901a-4198-9ff0-f720d32eaa4d   500Gi      RWO            managed-premium    137d
elasticsearch-elasticsearch-cdm-19ibb0br-2   Bound    pvc-318f2d40-2580-486d-a6ac-cb1822427fd3   500Gi      RWO            managed-premium    137d
elasticsearch-elasticsearch-cdm-19ibb0br-3   Bound    pvc-7052bed6-09aa-4789-b86e-9d68616b6401   500Gi      RWO            managed-premium    137d

<span class="nv">$ </span>oc delete pvc elasticsearch-elasticsearch-cdm-19ibb0br-1
persistentvolumeclaim <span class="s2">"elasticsearch-elasticsearch-cdm-19ibb0br-1"</span> deleted

<span class="nv">$ </span>oc get pvc
NAME                                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
elasticsearch-elasticsearch-cdm-19ibb0br-2   Bound    pvc-318f2d40-2580-486d-a6ac-cb1822427fd3   500Gi      RWO            managed-premium    137d
elasticsearch-elasticsearch-cdm-19ibb0br-3   Bound    pvc-7052bed6-09aa-4789-b86e-9d68616b6401   500Gi      RWO            managed-premium    137d
</code></pre></div></div>

<h3 id="check-cluster-health--green-state-1">Check cluster health / green state</h3>
<p>Check if your cluster is still in green state:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">export </span><span class="nv">ELKCDM</span><span class="o">=</span><span class="si">$(</span>oc get pods <span class="nt">-l</span> <span class="nv">component</span><span class="o">=</span>elasticsearch <span class="nt">-o</span> wide | egrep <span class="s1">'2\/2\s+Running'</span> | <span class="nb">head</span> <span class="nt">-n1</span> | <span class="nb">awk</span> <span class="s1">'{print $1}'</span><span class="si">)</span>
<span class="nv">$ </span>oc <span class="nb">exec</span> <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">-c</span> elasticsearch <span class="nt">--</span> health | egrep <span class="s1">'green|'</span>
epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1661258957 12:49:17  elasticsearch green           2         2    416 208    0    0        0             0                  -                100.0%
</code></pre></div></div>

<p><strong>WARNING:</strong> if your cluster is not <strong>green</strong>, or <strong>active_shards_percent</strong> is not equal to 100%, stop all activities and check the Elasticsearch state first!</p>
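<p>This check too can be scripted so that it fails loudly instead of relying on visual inspection. A sketch assuming status is the 4th column and active_shards_percent the last one; the health_green name is not part of the tooling.</p>

```shell
# Hypothetical guard: succeed (exit 0) only when the data row of the
# health table on stdin reports status "green" and 100.0% active shards.
health_green() {
  awk 'NR == 2 { ok = ($4 == "green" && $NF == "100.0%") } END { exit !ok }'
}

# In practice:
#   oc exec ${ELKCDM} -c elasticsearch -- health | health_green || echo "stop and investigate"
# Here a captured sample row stands in for it:
{ echo 'epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent'
  echo '1661258957 12:49:17 elasticsearch green 2 2 416 208 0 0 0 0 - 100.0%' ; } | health_green && echo OK    # prints OK
```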

<h3 id="remove-exclude-ip">Remove exclude IP</h3>
<p>Remove the previously set exclude IP parameter:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc <span class="nb">exec</span> <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">-c</span> elasticsearch <span class="nt">--</span> es_util <span class="nt">--query</span><span class="o">=</span>_cluster/settings?pretty <span class="nt">-X</span> PUT <span class="nt">-d</span> <span class="s1">'{"transient":{"cluster.routing.allocation.exclude._ip" : null}}'</span>
<span class="o">{</span>
  <span class="s2">"acknowledged"</span> : <span class="nb">true</span>,
  <span class="s2">"persistent"</span> : <span class="o">{</span> <span class="o">}</span>,
  <span class="s2">"transient"</span> : <span class="o">{</span> <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<h3 id="scale-elasticsearch-cdm-deploy-to-one">Scale elasticsearch CDM deploy to one</h3>
<p>Scale the Elasticsearch CDM deployment back to one (1); in my case, elasticsearch-cdm-19ibb0br-1:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc get deploy
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator       1/1     1            1           137d
elasticsearch-cdm-19ibb0br-1   0/0     0            0           137d
elasticsearch-cdm-19ibb0br-2   1/1     1            1           137d
elasticsearch-cdm-19ibb0br-3   1/1     1            1           137d
kibana                         1/1     1            1           89d

<span class="nv">$ </span>oc scale deploy elasticsearch-cdm-19ibb0br-1 <span class="nt">--replicas</span><span class="o">=</span>1
deployment.apps/elasticsearch-cdm-19ibb0br-1 scaled
</code></pre></div></div>
<p>Check that the cluster is back to 3 nodes:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc <span class="nb">exec</span> <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">-c</span> elasticsearch <span class="nt">--</span> health | egrep <span class="s1">'green\s+3\s+3|'</span>
Tue Aug 23 12:58:52 UTC 2022
epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1661259532 12:58:52  elasticsearch green           3         3    416 208    2    0        0             0                  -                100.0%
</code></pre></div></div>

<h3 id="re-set-cluster-routing-to-all">Re-set cluster routing to all</h3>
<p>Re-set the cluster.routing.allocation.enable parameter to “all”:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc <span class="nb">exec</span> <span class="nt">-c</span> elasticsearch <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">--</span> curl <span class="nt">-s</span> <span class="nt">--key</span> /etc/elasticsearch/secret/admin-key <span class="nt">--cert</span> /etc/elasticsearch/secret/admin-cert <span class="nt">--cacert</span> /etc/elasticsearch/secret/admin-ca <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="nt">-XPUT</span> <span class="s2">"https://localhost:9200/_cluster/settings"</span> <span class="nt">-d</span> <span class="s1">'{ "persistent":{ "cluster.routing.allocation.enable" : "all" }}'</span>
<span class="o">{</span><span class="s2">"acknowledged"</span>:true,<span class="s2">"persistent"</span>:<span class="o">{</span><span class="s2">"cluster"</span>:<span class="o">{</span><span class="s2">"routing"</span>:<span class="o">{</span><span class="s2">"allocation"</span>:<span class="o">{</span><span class="s2">"enable"</span>:<span class="s2">"all"</span><span class="o">}}}}</span>,<span class="s2">"transient"</span>:<span class="o">{}}</span>

<span class="nv">$ </span>oc <span class="nb">exec</span> <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">-c</span> elasticsearch <span class="nt">--</span> es_util <span class="nt">--query</span><span class="o">=</span>_cluster/settings?pretty
<span class="o">{</span>
  <span class="s2">"persistent"</span> : <span class="o">{</span>
    <span class="s2">"cluster"</span> : <span class="o">{</span>
      <span class="s2">"routing"</span> : <span class="o">{</span>
        <span class="s2">"allocation"</span> : <span class="o">{</span>
          <span class="s2">"enable"</span> : <span class="s2">"all"</span>
        <span class="o">}</span>
      <span class="o">}</span>
    <span class="o">}</span>,
    <span class="s2">"discovery"</span> : <span class="o">{</span>
      <span class="s2">"zen"</span> : <span class="o">{</span>
        <span class="s2">"minimum_master_nodes"</span> : <span class="s2">"2"</span>
      <span class="o">}</span>
    <span class="o">}</span>
  <span class="o">}</span>,
  <span class="s2">"transient"</span> : <span class="o">{</span> <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Then wait for the relocation to complete:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="k">while </span><span class="nb">true</span> <span class="p">;</span> <span class="k">do </span>oc <span class="nb">exec</span> <span class="k">${</span><span class="nv">ELKCDM</span><span class="k">}</span> <span class="nt">-c</span> elasticsearch <span class="nt">--</span> health | egrep <span class="s1">'green\s+3\s+3|'</span> <span class="p">;</span> <span class="nb">sleep </span>10 <span class="p">;</span> <span class="k">done
</span>epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1661259701 13:01:41  elasticsearch green           3         3    428 214    2    0        0             0                  -                100.0%
</code></pre></div></div>

<h2 id="repeat-the-previous-steps-for-the-remaining-elasticsearch-cdm">Repeat the previous steps for the remaining elasticsearch CDM</h2>
<p>Repeat all the steps from “Remove shards from one elastic CDM pod” to “Re-set cluster routing to all” for all the remaining Elasticsearch CDMs.</p>]]></content><author><name></name></author><category term="OpenShift" /><category term="OpenShift" /><category term="elasticsearch" /><category term="elk" /><summary type="html"><![CDATA[In this OpenShift / Kubernetes article, I’ll show you how to migrate your Elasticsearch data (shards) from one cloud StorageClass, such as Azure managed-premium, to another cloud StorageClass, such as Azure managed-csi. This will be a “rolling” migration, with no service or data disruption.]]></summary></entry><entry><title type="html">Migrate OSDs from one storage class to another</title><link href="/openshift/2022/11/28/Migrate-ODF-osd-storageclass.html" rel="alternate" type="text/html" title="Migrate OSDs from one storage class to another" /><published>2022-11-28T20:00:00+00:00</published><updated>2022-11-28T20:00:00+00:00</updated><id>/openshift/2022/11/28/Migrate-ODF-osd-storageclass</id><content type="html" xml:base="/openshift/2022/11/28/Migrate-ODF-osd-storageclass.html"><![CDATA[<p>In this article, I’ll explain how to migrate your <strong>OpenShift Data Foundation</strong> OSDs (disks), residing on one cloud storage class, for example Azure managed-premium, to another storage class, for example Azure <strong>managed-csi</strong>. This will be a “rolling” migration, with no service or data interruption.</p>

<p><strong>Warning:</strong> If you are a Red Hat customer, open a support case before going forward; otherwise, perform the following steps at your own risk!</p>

<h2 id="requirements">Requirements</h2>
<p>Before starting you’ll need:</p>

<ul>
  <li>an installed and working OpenShift Data Foundation;</li>
  <li>a configured new / destination StorageClass =&gt; in this article I’ll use <strong>managed-csi</strong>;</li>
  <li>ODF configured with the <strong>replica</strong> parameter set to 3[0];</li>
  <li>in this guide I’ll move data / disks only from 3 OSD disks to 3 other OSD disks in a different storage class; if you have more than 3 OSDs, you have to repeat this procedure from start to finish.</li>
</ul>

<p>[0] In this guide I assume your OpenShift Data Foundation is installed on a hyperscaler cloud provider, for example Azure or AWS, with 3 availability zones; with this configuration you should have replica set to 3:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc get storagecluster <span class="nt">-n</span> openshift-storage ocs-storagecluster <span class="nt">-ojson</span> | jq .spec.storageDeviceSets
<span class="o">[</span>
  <span class="o">{</span>
    <span class="s2">"count"</span>: 1,
...
    <span class="s2">"replica"</span>: 3,
...
  <span class="o">}</span>
<span class="o">]</span>
</code></pre></div></div>

<h2 id="run-must-gather">Run must-gather</h2>
<p>Before applying any change, run an OpenShift must-gather:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc adm must-gather
</code></pre></div></div>

<p>Then create a specific ODF must-gather; in this example I use ODF version 4.10:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">mkdir</span> ~/odf-must-gather
<span class="nv">$ </span>oc adm must-gather <span class="nt">--image</span><span class="o">=</span>registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.10 <span class="nt">--dest-dir</span><span class="o">=</span>~/odf-must-gather
</code></pre></div></div>

<h2 id="check-cluster-health">Check cluster health</h2>
<p>Check if your cluster is healthy:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ NAMESPACE</span><span class="o">=</span>openshift-storage<span class="p">;</span><span class="nv">ROOK_POD</span><span class="o">=</span><span class="si">$(</span>oc <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> get pod <span class="nt">-l</span> <span class="nv">app</span><span class="o">=</span>rook-ceph-operator <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.items[0].metadata.name}'</span><span class="si">)</span><span class="p">;</span>oc <span class="nb">exec</span> <span class="nt">-it</span> <span class="k">${</span><span class="nv">ROOK_POD</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--</span> ceph status <span class="nt">--cluster</span><span class="o">=</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--conf</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>.config <span class="nt">--keyring</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/client.admin.keyring | <span class="nb">grep </span>HEALTH
    health: HEALTH_OK
</code></pre></div></div>

<p><strong>WARNING:</strong> if your cluster is not in <strong>HEALTH_OK</strong>, stop all activities and check the ODF state first!</p>
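<p>If you run this check repeatedly during the migration, it can be reduced to a tiny guard. A sketch; the ceph_health_ok name is mine, not part of rook or ceph.</p>

```shell
# Hypothetical guard: succeed only when the ceph status output on stdin
# contains the HEALTH_OK marker.
ceph_health_ok() {
  grep -q 'health: HEALTH_OK'
}

# In practice, pipe the ceph status command from above into it and abort
# the change window on failure; here a sample line stands in for it:
echo '    health: HEALTH_OK' | ceph_health_ok && echo OK    # prints OK
```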

<h2 id="add-capacity">Add Capacity</h2>
<p>Add new capacity to your cluster using the new StorageClass (in my case managed-csi) by navigating to:</p>

<ul>
  <li>from the left menu select Operators =&gt; Installed Operators =&gt; OpenShift Data Foundation (selecting the openshift-storage project in the left corner);</li>
  <li>select “Storage System” tab;</li>
  <li>click on the three dots on the right and then select Add Capacity</li>
</ul>

<p><img src="/images/openshift/odf-move/01-add-capacity.png" alt="01-add-capacity.png" /></p>

<ul>
  <li>select your desired storage class and then click Add</li>
</ul>

<p><img src="/images/openshift/odf-move/02-add-capacity.png" alt="02-add-capacity.png" /></p>

<p>Wait until ODF has rebalanced all data, which means the cluster is in <strong>HEALTH_OK</strong> status and all placement groups (<strong>pgs</strong>) are in the <strong>active+clean</strong> state; to monitor the rebalance you can use a while true infinite loop:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="k">while </span><span class="nb">true</span><span class="p">;</span> <span class="k">do </span><span class="nv">NAMESPACE</span><span class="o">=</span>openshift-storage<span class="p">;</span><span class="nv">ROOK_POD</span><span class="o">=</span><span class="si">$(</span>oc <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> get pod <span class="nt">-l</span> <span class="nv">app</span><span class="o">=</span>rook-ceph-operator <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.items[0].metadata.name}'</span><span class="si">)</span><span class="p">;</span>oc <span class="nb">exec</span> <span class="nt">-it</span> <span class="k">${</span><span class="nv">ROOK_POD</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--</span> ceph status <span class="nt">--cluster</span><span class="o">=</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--conf</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>.config <span class="nt">--keyring</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/client.admin.keyring | egrep <span class="s1">'HEALTH_OK|HEALTH_WARN|[0-9]+\s+remapped|[0-9]+\/[0-9]+[ a-z]+misplaced[ ().%a-z0-9]+|'</span> <span class="p">;</span> <span class="nb">sleep </span>10 <span class="p">;</span> <span class="k">done
  </span>cluster:
    <span class="nb">id</span>:  ....
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c <span class="o">(</span>age 4d<span class="o">)</span>
    mgr: a<span class="o">(</span>active, since 4d<span class="o">)</span>
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up <span class="o">(</span>since 6m<span class="o">)</span>, 6 <span class="k">in</span> <span class="o">(</span>since 6m<span class="o">)</span><span class="p">;</span> 135 remapped pgs
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 13.34k objects, 49 GiB
    usage:   147 GiB used, 2.8 TiB / 2.9 TiB avail
    pgs:     17928/40008 objects misplaced <span class="o">(</span>44.811%<span class="o">)</span>
             134 active+remapped+backfill_wait
             58  active+clean
             1   active+remapped+backfilling
 
  io:
    client:   4.8 KiB/s rd, 328 KiB/s wr, 2 op/s rd, 5 op/s wr
    recovery: 13 MiB/s, 3 objects/s
 
  progress:
    Global Recovery Event <span class="o">(</span>6m<span class="o">)</span>
      <span class="o">[========</span>....................] <span class="o">(</span>remaining: 14m<span class="o">)</span>
</code></pre></div></div>

<p>In the above example, you can see that Ceph is rebalancing / remapping PGs; wait until all PGs are in the <strong>active+clean</strong> state:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ NAMESPACE</span><span class="o">=</span>openshift-storage<span class="p">;</span><span class="nv">ROOK_POD</span><span class="o">=</span><span class="si">$(</span>oc <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> get pod <span class="nt">-l</span> <span class="nv">app</span><span class="o">=</span>rook-ceph-operator <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.items[0].metadata.name}'</span><span class="si">)</span><span class="p">;</span>oc <span class="nb">exec</span> <span class="nt">-it</span> <span class="k">${</span><span class="nv">ROOK_POD</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--</span> ceph status <span class="nt">--cluster</span><span class="o">=</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--conf</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>.config <span class="nt">--keyring</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/client.admin.keyring
  cluster:
    <span class="nb">id</span>:     ....
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c <span class="o">(</span>age 2d<span class="o">)</span>
    mgr: a<span class="o">(</span>active, since 2d<span class="o">)</span>
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up <span class="o">(</span>since 105m<span class="o">)</span>, 6 <span class="k">in</span> <span class="o">(</span>since 106m<span class="o">)</span>
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 45.33k objects, 73 GiB
    usage:   223 GiB used, 2.7 TiB / 2.9 TiB avail
    pgs:     193 active+clean
 
  io:
    client:   82 KiB/s rd, 12 MiB/s wr, 3 op/s rd, 113 op/s wr
</code></pre></div></div>

<p><strong>WARNING:</strong> wait until your cluster returns all PGs in <strong>active+clean</strong> state!</p>

<p>Check also the CephFS status:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ NAMESPACE</span><span class="o">=</span>openshift-storage<span class="p">;</span><span class="nv">ROOK_POD</span><span class="o">=</span><span class="si">$(</span>oc <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> get pod <span class="nt">-l</span> <span class="nv">app</span><span class="o">=</span>rook-ceph-operator <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.items[0].metadata.name}'</span><span class="si">)</span><span class="p">;</span>oc <span class="nb">exec</span> <span class="nt">-it</span> <span class="k">${</span><span class="nv">ROOK_POD</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--</span> ceph fs status <span class="nt">--cluster</span><span class="o">=</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--conf</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>.config <span class="nt">--keyring</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/client.admin.keyring
ocs-storagecluster-cephfilesystem - 12 clients
<span class="o">=================================</span>
RANK      STATE                       MDS                     ACTIVITY     DNS    INOS   DIRS   CAPS  
 0        active      ocs-storagecluster-cephfilesystem-b  Reqs:   37 /s  34.8k  27.2k  8369   27.1k  
0-s   standby-replay  ocs-storagecluster-cephfilesystem-a  Evts:   47 /s  82.3k  26.8k  8298      0   
                   POOL                       TYPE     USED  AVAIL  
ocs-storagecluster-cephfilesystem-metadata  metadata   758M   697G  
 ocs-storagecluster-cephfilesystem-data0      data     212G   697G  
MDS version: ceph version 16.2.7-126.el8cp <span class="o">(</span>fe0af61d104d48cb9d116cde6e593b5fc8c197e4<span class="o">)</span> pacific <span class="o">(</span>stable<span class="o">)</span>
</code></pre></div></div>

<p><strong>WARNING</strong>: one of the two MDS daemons must be in the active state!</p>

<h2 id="identify-old-osds--disks-to-remove">Identify old OSDs / disks to remove</h2>
<p>Take note of the three OSD ids to remove: they are the ones backed by your old StorageClass. To see the ODF OSD topology, run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ NAMESPACE</span><span class="o">=</span>openshift-storage<span class="p">;</span><span class="nv">ROOK_POD</span><span class="o">=</span><span class="si">$(</span>oc <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> get pod <span class="nt">-l</span> <span class="nv">app</span><span class="o">=</span>rook-ceph-operator <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.items[0].metadata.name}'</span><span class="si">)</span><span class="p">;</span>oc <span class="nb">exec</span> <span class="nt">-it</span> <span class="k">${</span><span class="nv">ROOK_POD</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--</span> ceph osd tree <span class="nt">--cluster</span><span class="o">=</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--conf</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>.config <span class="nt">--keyring</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/client.admin.keyring
ID   CLASS  WEIGHT   TYPE NAME                                                      STATUS  REWEIGHT  PRI-AFF
 <span class="nt">-1</span>         2.92978  root default                                                                            
 <span class="nt">-5</span>         2.92978      region westeurope                                                                   
<span class="nt">-14</span>         0.97659          zone westeurope-1                                                               
<span class="nt">-13</span>         0.48830              host clustername-ocs-westeurope1-qpdqh                               
  2    hdd  0.48830                  osd.2                                              up   1.00000  1.00000
<span class="nt">-17</span>         0.48830              host ocs-deviceset-managed-csi-1-data-0tfdrp                           
  3    hdd  0.48830                  osd.3                                              up   1.00000  1.00000
<span class="nt">-10</span>         0.97659          zone westeurope-2                                                               
 <span class="nt">-9</span>         0.48830              host clustername-ocs-westeurope2-46789                               
  1    hdd  0.48830                  osd.1                                              up   1.00000  1.00000
<span class="nt">-19</span>         0.48830              host ocs-deviceset-managed-csi-0-data-0zzxzr                           
  4    hdd  0.48830                  osd.4                                              up   1.00000  1.00000
 <span class="nt">-4</span>         0.97659          zone westeurope-3                                                               
 <span class="nt">-3</span>         0.48830              host clustername-ocs-westeurope3-9wsjs                               
  0    hdd  0.48830                  osd.0                                              up   1.00000  1.00000
<span class="nt">-21</span>         0.48830              host ocs-deviceset-managed-csi-2-data-0bc889                           
  5    hdd  0.48830                  osd.5                                              up   1.00000  1.00000
</code></pre></div></div>

<p>In my case the old OSDs are osd.0, osd.1 and osd.2. These OSDs need to be removed / deleted one by one, waiting for <strong>HEALTH_OK</strong> after each removal / deletion.</p>
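<p>As a rough sketch, the candidate OSD ids can also be extracted automatically from the <code class="language-plaintext highlighter-rouge">ceph osd tree</code> output by filtering out hosts that belong to the new device sets. The <code class="language-plaintext highlighter-rouge">old_osd_ids</code> helper and the <code class="language-plaintext highlighter-rouge">ocs-deviceset-managed-csi</code> pattern below are my own illustration, not part of any tool; adjust the pattern to your new StorageClass naming and always double-check the result by eye:</p>

```shell
# Sketch: print the ids of OSDs whose host does NOT match the new
# device-set naming (here: ocs-deviceset-managed-csi). The pattern is an
# assumption from this environment - adjust it to yours.
old_osd_ids() {
  awk -v new='ocs-deviceset-managed-csi' '
    $3 == "host" { on_old = (index($4, new) == 0) }  # remember host type
    $4 ~ /^osd\./ && on_old { print $1 }             # id column of old OSDs
  ' "$@"
}

# Example: save the tree once, then filter it:
#   oc exec ... -- ceph osd tree ... > osd-tree.txt
#   old_osd_ids osd-tree.txt
```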

<h2 id="remove-osd-in-old-storageclass">Remove OSD in old StorageClass</h2>
<h3 id="switch-to-openshift-storage-project">Switch to openshift-storage project</h3>
<p>First, switch to the openshift-storage project:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc project openshift-storage
</code></pre></div></div>

<h3 id="copy-ceph-config-and-keyring-files">Copy Ceph config and keyring files</h3>
<p>Copy your Ceph config file and keyring file from the rook operator pod to your Linux box; these files will then be transferred to one mon container, so that ceph commands can still be run after scaling down the rook operator.</p>

<p>Copy files from rook container to your Linux box:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ ROOK</span><span class="o">=</span><span class="si">$(</span>oc get pod | <span class="nb">grep </span>rook-ceph-operator | <span class="nb">awk</span> <span class="s1">'{print $1}'</span><span class="si">)</span>

<span class="nv">$ </span>oc rsync <span class="k">${</span><span class="nv">ROOK</span><span class="k">}</span>:/var/lib/rook/openshift-storage/openshift-storage.config <span class="nb">.</span>
WARNING: cannot use rsync: rsync not available <span class="k">in </span>container
openshift-storage.config
<span class="nv">$ </span>oc rsync <span class="k">${</span><span class="nv">ROOK</span><span class="k">}</span>:/var/lib/rook/openshift-storage/client.admin.keyring <span class="nb">.</span>
WARNING: cannot use rsync: rsync not available <span class="k">in </span>container
client.admin.keyring
</code></pre></div></div>

<p>Copy the openshift-storage.config and client.admin.keyring files from your Linux box to one mon container:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ MONA</span><span class="o">=</span><span class="si">$(</span>oc get pod | <span class="nb">grep </span>rook-ceph-mon | egrep <span class="s1">'2\/2\s+Running'</span> | <span class="nb">head</span> <span class="nt">-n1</span> | <span class="nb">awk</span> <span class="s1">'{print $1}'</span><span class="si">)</span>

<span class="nv">$ </span>oc <span class="nb">cp </span>openshift-storage.config <span class="k">${</span><span class="nv">MONA</span><span class="k">}</span>:/tmp/openshift-storage.config
Defaulted container <span class="s2">"mon"</span> out of: mon, log-collector, chown-container-data-dir <span class="o">(</span>init<span class="o">)</span>, init-mon-fs <span class="o">(</span>init<span class="o">)</span>
<span class="nv">$ </span>oc <span class="nb">cp </span>client.admin.keyring <span class="k">${</span><span class="nv">MONA</span><span class="k">}</span>:/tmp/client.admin.keyring
Defaulted container <span class="s2">"mon"</span> out of: mon, log-collector, chown-container-data-dir <span class="o">(</span>init<span class="o">)</span>, init-mon-fs <span class="o">(</span>init<span class="o">)</span>
</code></pre></div></div>

<p><strong>NOTE</strong>: MONA, in one of the Italian regional languages, means a stupid person :smile:</p>

<p>Check that the ceph command works in the MONA container:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc rsh <span class="k">${</span><span class="nv">MONA</span><span class="k">}</span>
Defaulted container <span class="s2">"mon"</span> out of: mon, log-collector, chown-container-data-dir <span class="o">(</span>init<span class="o">)</span>, init-mon-fs <span class="o">(</span>init<span class="o">)</span>

sh-4.4# ceph health <span class="nt">--cluster</span><span class="o">=</span>openshift-storage <span class="nt">--conf</span><span class="o">=</span>/tmp/openshift-storage.config <span class="nt">--keyring</span><span class="o">=</span>/tmp/client.admin.keyring
2022-XX <span class="nt">-1</span> auth: unable to find a keyring on /var/lib/rook/openshift-storage/client.admin.keyring: <span class="o">(</span>2<span class="o">)</span> No such file or directory
2022-XX <span class="nt">-1</span> AuthRegistry<span class="o">(</span>0x7fa63805bb68<span class="o">)</span> no keyring found at /var/lib/rook/openshift-storage/client.admin.keyring, disabling cephx
HEALTH_OK
sh-4.4#
</code></pre></div></div>

<h3 id="scale-down-openshift-data-foundation-operators">Scale down OpenShift Data Foundation operators</h3>
<p>Now we can scale the rook and ocs operators down to zero:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc scale deploy ocs-operator <span class="nt">--replicas</span><span class="o">=</span>0
deployment.apps/ocs-operator scaled
<span class="nv">$ </span>oc scale deploy rook-ceph-operator <span class="nt">--replicas</span><span class="o">=</span>0
deployment.apps/rook-ceph-operator scaled
</code></pre></div></div>

<h3 id="remove-one-osd">Remove one OSD</h3>
<p>Now you can remove one OSD; in my case I’ll remove osd.0 (zero), but in your case it could be a different ID.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ failed_osd_id</span><span class="o">=</span>0
<span class="nv">$ </span><span class="nb">export </span><span class="nv">PS1</span><span class="o">=</span><span class="s2">"[</span><span class="se">\u</span><span class="s2">@</span><span class="se">\h</span><span class="s2"> </span><span class="se">\W</span><span class="s2">]</span><span class="se">\ </span><span class="s2">OSD=</span><span class="nv">$failed_osd_id</span><span class="s2"> </span><span class="nv">$ </span><span class="s2">"</span>

<span class="nv">$ </span>oc scale deploy rook-ceph-osd-<span class="k">${</span><span class="nv">failed_osd_id</span><span class="k">}</span> <span class="nt">--replicas</span><span class="o">=</span>0
deployment.apps/rook-ceph-osd-0 scaled

<span class="nv">$ </span>oc process <span class="nt">-n</span> openshift-storage ocs-osd-removal  <span class="nt">-p</span> <span class="nv">FAILED_OSD_IDS</span><span class="o">=</span><span class="k">${</span><span class="nv">failed_osd_id</span><span class="k">}</span> | oc create <span class="nt">-f</span> -
job.batch/ocs-osd-removal-job created

<span class="nv">$ JOBREMOVAL</span><span class="o">=</span><span class="si">$(</span>oc get pod | <span class="nb">grep </span>ocs-osd-removal-job- | <span class="nb">awk</span> <span class="s1">'{print $1}'</span><span class="si">)</span>

<span class="nv">$ </span>oc logs <span class="k">${</span><span class="nv">JOBREMOVAL</span><span class="k">}</span> | egrep <span class="s2">"cephosd: completed removal of OSD </span><span class="k">${</span><span class="nv">failed_osd_id</span><span class="k">}</span><span class="s2">"</span>
2022-XX I | cephosd: completed removal of OSD 0
</code></pre></div></div>

<p><strong>NOTE</strong>: in the output of the last command you must see <strong>cephosd: completed removal of OSD X</strong>, where X is your OSD id (in my case, zero).</p>

<p>Check the Ceph health status; you will see a degraded state due to the OSD removal:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc rsh <span class="k">${</span><span class="nv">MONA</span><span class="k">}</span>
Defaulted container <span class="s2">"mon"</span> out of: mon, log-collector, chown-container-data-dir <span class="o">(</span>init<span class="o">)</span>, init-mon-fs <span class="o">(</span>init<span class="o">)</span>
sh-4.4# ceph status <span class="nt">--cluster</span><span class="o">=</span>openshift-storage <span class="nt">--conf</span><span class="o">=</span>/tmp/openshift-storage.config <span class="nt">--keyring</span><span class="o">=</span>/tmp/client.admin.keyring
  cluster:
    <span class="nb">id</span>:     ....
    health: HEALTH_WARN
            Degraded data redundancy: 19562/138537 objects degraded <span class="o">(</span>14.120%<span class="o">)</span>, 96 pgs degraded, 96 pgs undersized
 
  services:
    mon: 3 daemons, quorum a,b,c <span class="o">(</span>age 2d<span class="o">)</span>
    mgr: a<span class="o">(</span>active, since 2d<span class="o">)</span>
    mds: 1/1 daemons up, 1 hot standby
    osd: 5 osds: 5 up <span class="o">(</span>since 6m<span class="o">)</span>, 5 <span class="k">in</span> <span class="o">(</span>since 3m<span class="o">)</span><span class="p">;</span> 110 remapped pgs
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 46.18k objects, 73 GiB
    usage:   192 GiB used, 2.3 TiB / 2.4 TiB avail
    pgs:     19562/138537 objects degraded <span class="o">(</span>14.120%<span class="o">)</span>
             6098/138537 objects misplaced <span class="o">(</span>4.402%<span class="o">)</span>
             95 active+undersized+degraded+remapped+backfill_wait
             83 active+clean
             14 active+remapped+backfill_wait
             1  active+undersized+degraded+remapped+backfilling
 
  io:
    client:   131 KiB/s rd, 14 MiB/s wr, 4 op/s rd, 151 op/s wr
    recovery: 1023 KiB/s, 14 keys/s, 7 objects/s
 
sh-4.4#
</code></pre></div></div>

<p>Wait until Ceph returns HEALTH_OK and all PGs are <strong>active+clean</strong>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sh-4.4# <span class="k">while </span><span class="nb">true</span><span class="p">;</span> <span class="k">do </span>ceph status <span class="nt">--cluster</span><span class="o">=</span>openshift-storage <span class="nt">--conf</span><span class="o">=</span>/tmp/openshift-storage.config <span class="nt">--keyring</span><span class="o">=</span>/tmp/client.admin.keyring | egrep <span class="nt">--color</span><span class="o">=</span>always <span class="s1">'[0-9]+\/[0-9]+.*(degraded|misplaced)|'</span> <span class="p">;</span> <span class="nb">sleep </span>10 <span class="p">;</span> <span class="k">done
  </span>cluster:
    <span class="nb">id</span>:     ....
    health: HEALTH_WARN
            Degraded data redundancy: 17957/136521 objects degraded <span class="o">(</span>13.153%<span class="o">)</span>, 91 pgs degraded, 91 pgs undersized
 
  services:
    mon: 3 daemons, quorum a,b,c <span class="o">(</span>age 2d<span class="o">)</span>
    mgr: a<span class="o">(</span>active, since 2d<span class="o">)</span>
    mds: 1/1 daemons up, 1 hot standby
    osd: 5 osds: 5 up <span class="o">(</span>since 8m<span class="o">)</span>, 5 <span class="k">in</span> <span class="o">(</span>since 6m<span class="o">)</span><span class="p">;</span> 105 remapped pgs
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 45.51k objects, 73 GiB
    usage:   194 GiB used, 2.3 TiB / 2.4 TiB avail
    pgs:     17957/136521 objects degraded <span class="o">(</span>13.153%<span class="o">)</span>
             5767/136521 objects misplaced <span class="o">(</span>4.224%<span class="o">)</span>
             90 active+undersized+degraded+remapped+backfill_wait
             88 active+clean
             14 active+remapped+backfill_wait
             1  active+undersized+degraded+remapped+backfilling
 
  io:
    client:   90 KiB/s rd, 14 MiB/s wr, 2 op/s rd, 145 op/s wr
    recovery: 1023 KiB/s, 20 keys/s, 10 objects/s


...... <span class="o">(</span><span class="nb">cut</span><span class="o">)</span>

  cluster:
    <span class="nb">id</span>:     ....
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c <span class="o">(</span>age 2d<span class="o">)</span>
    mgr: a<span class="o">(</span>active, since 2d<span class="o">)</span>
    mds: 1/1 daemons up, 1 hot standby
    osd: 5 osds: 5 up <span class="o">(</span>since 61m<span class="o">)</span>, 5 <span class="k">in</span> <span class="o">(</span>since 58m<span class="o">)</span>
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 45.63k objects, 74 GiB
    usage:   226 GiB used, 2.2 TiB / 2.4 TiB avail
    pgs:     193 active+clean
 
  io:
    client:   115 KiB/s rd, 14 MiB/s wr, 4 op/s rd, 139 op/s wr

</code></pre></div></div>

<p><strong>WARNING</strong>: before going forward you must wait for ceph <strong>HEALTH_OK</strong> and all PGs in <strong>active+clean</strong> state!</p>
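<p>The polling loop above can be packaged as a small helper. In the sketch below, <code class="language-plaintext highlighter-rouge">CEPH_CMD</code> and <code class="language-plaintext highlighter-rouge">wait_for_clean</code> are hypothetical names of my own (not ceph or oc features), made injectable so the logic can be exercised against canned output; on a real cluster, set <code class="language-plaintext highlighter-rouge">CEPH_CMD</code> to the full <code class="language-plaintext highlighter-rouge">ceph status</code> command line used earlier:</p>

```shell
# Sketch: poll `ceph status` until the number of active+clean PGs equals
# the expected total. CEPH_CMD is an injectable command (an assumption of
# this sketch), so it can be tested with a stub instead of a live cluster.
wait_for_clean() {
  local total="$1"
  while true; do
    # first line ending in "active+clean" carries the clean PG count
    clean=$(${CEPH_CMD:-ceph status} \
      | sed -n 's/.* \([0-9][0-9]*\) active+clean$/\1/p' | head -n1)
    [ "$clean" = "$total" ] && return 0
    sleep "${SLEEP_SECS:-10}"
  done
}

# Real-cluster usage (hypothetical wiring):
#   CEPH_CMD="ceph status --cluster=openshift-storage --conf=... --keyring=..."
#   wait_for_clean 193
```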

<p>Delete removal job:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc delete job ocs-osd-removal-job
job.batch <span class="s2">"ocs-osd-removal-job"</span> deleted
</code></pre></div></div>

<p>Repeat these steps for each OSD you need to remove (in my case, osd.1 and osd.2).</p>
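<p>The per-OSD procedure can be sketched as a loop. The helper below is dry-run only (it prints the commands instead of executing them), because between iterations you must still manually wait for <strong>HEALTH_OK</strong> and all PGs <strong>active+clean</strong> as described above; treat it as a checklist generator, not an automation script:</p>

```shell
# DRY-RUN sketch: print the removal commands for each OSD id given.
# You must wait for HEALTH_OK / active+clean between real iterations.
remove_osds_dry_run() {
  for failed_osd_id in "$@"; do
    echo "oc scale deploy rook-ceph-osd-${failed_osd_id} --replicas=0"
    echo "oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${failed_osd_id} | oc create -f -"
    echo "# ... wait for HEALTH_OK and all PGs active+clean ..."
    echo "oc delete job ocs-osd-removal-job"
  done
}

remove_osds_dry_run 0 1 2
```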

<h2 id="remove-your-old-storage-class">Remove your old storage class</h2>
<p>After removing all OSDs that belong to your old StorageClass (in my case Azure managed-premium), you can edit your <strong>storagecluster</strong> object to remove any reference to the old storage class.</p>

<p>Make a backup before editing your storagecluster:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc get storagecluster ocs-storagecluster <span class="nt">-oyaml</span> | <span class="nb">tee </span>storagecluster-ocs-storagecluster-before-remove-managed-premium.yaml
</code></pre></div></div>

<p>Change / edit your storagecluster’s storageDeviceSets, from having OSDs on both the “old” StorageClass (in my case managed-premium) and the “new” StorageClass (in my case managed-csi):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc get storagecluster ocs-storagecluster <span class="nt">-oyaml</span>
....
  storageDeviceSets:
  - config: <span class="o">{}</span>
    count: 1
    dataPVCTemplate:
      metadata: <span class="o">{}</span>
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 500Gi
        storageClassName: managed-premium
        volumeMode: Block
      status: <span class="o">{}</span>
    name: ocs-deviceset
    placement: <span class="o">{}</span>
    preparePlacement: <span class="o">{}</span>
    replica: 3
    resources:
      limits:
        cpu: <span class="s2">"2"</span>
        memory: 5Gi
      requests:
        cpu: <span class="s2">"2"</span>
        memory: 5Gi
  - count: 1
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 500Gi
        storageClassName: managed-csi
        volumeMode: Block
    name: ocs-deviceset-managed-csi
    placement: <span class="o">{}</span>
    portable: <span class="nb">true
    </span>replica: 3
    resources: <span class="o">{}</span>
</code></pre></div></div>

<p>to having only the “new” StorageClass (in my case managed-csi):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc edit storagecluster ocs-storagecluster
....        
  storageDeviceSets:
  - count: 1
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 500Gi
        storageClassName: managed-csi
        volumeMode: Block
    name: ocs-deviceset-managed-csi
    placement: <span class="o">{}</span>
    portable: <span class="nb">true
    </span>replica: 3
    resources:
      limits:
        cpu: <span class="s2">"2"</span>
        memory: 5Gi
      requests:
        cpu: <span class="s2">"2"</span>
        memory: 5Gi
</code></pre></div></div>

<h2 id="scale-up-openshift-data-foundation-operators">Scale up OpenShift Data Foundation operators</h2>
<p>At this point you can scale the ocs-operator back up:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc scale deploy ocs-operator <span class="nt">--replicas</span><span class="o">=</span>1
deployment.apps/ocs-operator scaled
</code></pre></div></div>

<p>and then re-check the Ceph health status:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ NAMESPACE</span><span class="o">=</span>openshift-storage<span class="p">;</span><span class="nv">ROOK_POD</span><span class="o">=</span><span class="si">$(</span>oc <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> get pod <span class="nt">-l</span> <span class="nv">app</span><span class="o">=</span>rook-ceph-operator <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.items[0].metadata.name}'</span><span class="si">)</span><span class="p">;</span>oc <span class="nb">exec</span> <span class="nt">-it</span> <span class="k">${</span><span class="nv">ROOK_POD</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--</span> ceph status <span class="nt">--cluster</span><span class="o">=</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--conf</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>.config <span class="nt">--keyring</span><span class="o">=</span>/var/lib/rook/<span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>/client.admin.keyring | egrep <span class="nt">-i</span> <span class="s1">'remapped|misplaced|active\+clean|HEALTH_OK|'</span>
  cluster:
    <span class="nb">id</span>:     ....
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c <span class="o">(</span>age 3d<span class="o">)</span>
    mgr: a<span class="o">(</span>active, since 3d<span class="o">)</span>
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up <span class="o">(</span>since 2h<span class="o">)</span>, 3 <span class="k">in</span> <span class="o">(</span>since 2h<span class="o">)</span>
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 45.87k objects, 74 GiB
    usage:   226 GiB used, 1.2 TiB / 1.5 TiB avail
    pgs:     193 active+clean
 
  io:
    client:   134 KiB/s rd, 60 MiB/s wr, 4 op/s rd, 322 op/s wr
</code></pre></div></div>]]></content><author><name></name></author><category term="OpenShift" /><category term="OpenShift" /><category term="odf" /><summary type="html"><![CDATA[In this article, I’ll explain how to migrate your OpenShift Data Foundation OSDs (disks), residing on one cloud storage class, for example Azure managed-premium, to another storage class, for example Azure managed-csi, this will be a “rolling” migration, with no service and data interruption.]]></summary></entry><entry><title type="html">Install OpenShift IPI for homelab on Hetzner Root servers</title><link href="/openshift/2022/10/02/Install-OpenShift-IPI-lab-on-Hetzner.html" rel="alternate" type="text/html" title="Install OpenShift IPI for homelab on Hetzner Root servers" /><published>2022-10-02T08:00:00+00:00</published><updated>2022-10-02T08:00:00+00:00</updated><id>/openshift/2022/10/02/Install-OpenShift-IPI-lab-on-Hetzner</id><content type="html" xml:base="/openshift/2022/10/02/Install-OpenShift-IPI-lab-on-Hetzner.html"><![CDATA[<p><strong>Warning:</strong> This document / project / repository / playbooks should be used <strong>only for testing</strong> OpenShift Container Platform 4.x and <strong>NOT for production environments</strong>.</p>

<p>In this article, I’ll explain how to deploy Red Hat OpenShift Container Platform using Hetzner Root Server(s). This guide is similar to my previous <a href="https://amedeos.github.io/openshift/2022/08/20/Install-OpenShift-IPI-bm-on-lab.html">Install OpenShift baremetal IPI on homelab, using nested virtualization</a>, but in this case the KVM hosts will be Hetzner Root Servers, on top of which you will have OpenShift baremetal IPI installed using nested virtualization.</p>

<h2 id="install-centos-stream-8">Install CentOS Stream 8</h2>
<p>On all your Hetzner Root Servers, install the CentOS Stream 8 operating system.</p>

<p>Before going further, I’d suggest uploading your SSH public key to the Hetzner robot: navigate to Server =&gt; click on <strong>Key management</strong> =&gt; click on <strong>New key</strong>; in the new window, set a name and paste your public key content into <strong>Key data</strong>.</p>

<ul>
  <li>activate the Rescue system by logging in to your robot account, then navigate to the <strong>Server</strong> tab =&gt; select your Root Server =&gt; select the <strong>Rescue</strong> sub tab =&gt; click the <strong>Activate rescue system</strong> button:</li>
</ul>

<p><img src="/images/openshift/hetzner/01-activate-rescue.png" alt="01-activate-rescue.png" /></p>

<ul>
  <li>Reset your server in order to reboot into rescue mode; for this you can simply run <code class="language-plaintext highlighter-rouge">reboot</code> over SSH, or you can send a reset using the Hetzner robot: select the <strong>Reset</strong> sub menu =&gt; set Reset type to “Execute an automatic hardware reset” =&gt; click on the <strong>Send</strong> button:</li>
</ul>

<p><img src="/images/openshift/hetzner/02-send-reset.png" alt="02-send-reset.png" /></p>

<ul>
  <li>
    <p><strong>wait</strong> until your root server is rebooted in rescue mode;</p>
  </li>
  <li>
    <p>connect via SSH to your Root Server and wipe all of its disks:</p>
  </li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@rescue ~ <span class="c"># dd if=/dev/zero of=/dev/nvme0n1 bs=1M count=10000 oflag=direct status=progress</span>
root@rescue ~ <span class="c"># dd if=/dev/zero of=/dev/nvme1n1 bs=1M count=10000 oflag=direct status=progress</span>

root@rescue ~ <span class="c"># dd if=/dev/zero of=/dev/sda bs=1M count=10000 oflag=direct status=progress</span>
root@rescue ~ <span class="c"># dd if=/dev/zero of=/dev/sdb bs=1M count=10000 oflag=direct status=progress</span>
</code></pre></div></div>
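<p>If your server has several disks, the commands above can be generated with a small loop. The sketch below is deliberately dry-run (it only prints the dd commands), because the device names vary per server and dd is destructive; verify the list with <code class="language-plaintext highlighter-rouge">lsblk</code> before running anything for real:</p>

```shell
# DRY-RUN sketch: print a wipe command per disk. The device names are
# examples from this guide - check yours with `lsblk` first, since dd
# destroys data irrecoverably.
wipe_disks_dry_run() {
  for disk in "$@"; do
    echo "dd if=/dev/zero of=${disk} bs=1M count=10000 oflag=direct status=progress"
  done
}

wipe_disks_dry_run /dev/nvme0n1 /dev/nvme1n1 /dev/sda /dev/sdb
```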

<ul>
  <li>run <code class="language-plaintext highlighter-rouge">installimage</code>:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@rescue ~ <span class="c"># installimage</span>
</code></pre></div></div>

<ul>
  <li>select CentOS Stream 8:</li>
</ul>

<p><img src="/images/openshift/hetzner/03-centos-01.png" alt="03-centos-01.png" /></p>

<p><img src="/images/openshift/hetzner/03-centos-02.png" alt="03-centos-02.png" /></p>

<ul>
  <li>press enter:</li>
</ul>

<p><img src="/images/openshift/hetzner/03-centos-03.png" alt="03-centos-03.png" /></p>

<ul>
  <li>disable software RAID:</li>
</ul>

<p><strong>REMEMBER: this guide is for automating your OpenShift lab and NOT for production environment(s)</strong></p>

<p><img src="/images/openshift/hetzner/03-centos-04.png" alt="03-centos-04.png" /></p>

<ul>
  <li>create two partitions: one for <strong>boot</strong> of 2G, and a second one with <strong>lvm</strong> using all the remaining space; then create four logical volumes, remembering to use <strong>xfs</strong> for the file system and to give at least 20G to the /tmp file system (sushy-emulator will use /tmp to build ISO files); finally, comment out the three default Hetzner partitions:</li>
</ul>

<p><img src="/images/openshift/hetzner/03-centos-05.png" alt="03-centos-05.png" /></p>

<ul>
  <li>press ESC and then, using the arrow keys, select Save before closing:</li>
</ul>

<p><img src="/images/openshift/hetzner/03-centos-06.png" alt="03-centos-06.png" /></p>

<ul>
  <li>press enter to confirm:</li>
</ul>

<p><img src="/images/openshift/hetzner/03-centos-07.png" alt="03-centos-07.png" /></p>

<p><img src="/images/openshift/hetzner/03-centos-08.png" alt="03-centos-08.png" /></p>

<ul>
  <li>wait until the installation is completed and then <code class="language-plaintext highlighter-rouge">reboot</code>:</li>
</ul>

<p><img src="/images/openshift/hetzner/03-centos-09.png" alt="03-centos-09.png" /></p>

<ul>
  <li>repeat the above installation steps for all your Hetzner Root Servers.</li>
</ul>

<h2 id="create-baremetal-vlan">Create baremetal VLAN</h2>
<p>On Hetzner Root Servers you can use a vSwitch to connect multiple servers to the same VLAN.</p>

<ul>
  <li>select Server =&gt; vSwitches:</li>
</ul>

<p><img src="/images/openshift/hetzner/04-vswitch-01.png" alt="04-vswitch-01.png" /></p>

<ul>
  <li>create a <strong>baremetal</strong> vSwitch with <strong>4000</strong> VLAN ID, then click on <strong>Create vSwitch</strong>:</li>
</ul>

<p><img src="/images/openshift/hetzner/04-vswitch-02.png" alt="04-vswitch-02.png" /></p>

<ul>
  <li>add your servers to baremetal vSwitch:</li>
</ul>

<p><img src="/images/openshift/hetzner/04-vswitch-03.png" alt="04-vswitch-03.png" /></p>

<p><img src="/images/openshift/hetzner/04-vswitch-04.png" alt="04-vswitch-04.png" /></p>

<ul>
  <li>wait until all servers are added to baremetal vSwitch:</li>
</ul>

<p><img src="/images/openshift/hetzner/04-vswitch-05.png" alt="04-vswitch-05.png" /></p>
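<p>For reference, the playbooks will configure a tagged VLAN interface on each Root Server for the vSwitch traffic. A manual sketch of the equivalent commands is shown below; the physical NIC name <code class="language-plaintext highlighter-rouge">enp0s31f6</code> is an assumption (check yours with <code class="language-plaintext highlighter-rouge">ip link</code>), while VLAN ID 4000 and the MTU of 1400 match the vSwitch created above (Hetzner limits vSwitch traffic to an MTU of 1400):</p>

```shell
# Hypothetical manual setup of the vSwitch VLAN interface on a Root Server.
# The NIC name enp0s31f6 is an assumption; VLAN 4000 matches the vSwitch above.
ip link add link enp0s31f6 name enp0s31f6.4000 type vlan id 4000
ip link set enp0s31f6.4000 mtu 1400   # Hetzner vSwitch requires MTU <= 1400
ip link set enp0s31f6.4000 up
```

The playbooks handle this for you; the sketch is only useful for troubleshooting connectivity on the baremetal VLAN by hand.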

<h2 id="create-your-custom-variables">Create your custom variables</h2>
<p>These steps must not be run directly on your Root Server(s), because they will be rebooted during configuration.</p>

<ul>
  <li>clone the <a href="https://github.com/amedeos/ocp4-in-the-jars">ocp4-in-the-jars</a> project on your workstation:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git clone https://github.com/amedeos/ocp4-in-the-jars
</code></pre></div></div>

<ul>
  <li>cd into ocp4-in-the-jars directory:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cd </span>ocp4-in-the-jars
</code></pre></div></div>

<ul>
  <li>create your <strong>hosts-kvmhost</strong> file; the most important setting is a single, free baremetal_ip for each host. An example could be:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>hosts-kvmhost 
<span class="o">[</span>kvmhost]
hetlab01 <span class="nv">baremetal_ip</span><span class="o">=</span>192.168.203.3 <span class="nv">ansible_ssh_user</span><span class="o">=</span>root <span class="nv">ansible_ssh_common_args</span><span class="o">=</span><span class="s1">'-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'</span>
hetlab02 <span class="nv">baremetal_ip</span><span class="o">=</span>192.168.203.4 <span class="nv">ansible_ssh_user</span><span class="o">=</span>root <span class="nv">ansible_ssh_common_args</span><span class="o">=</span><span class="s1">'-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'</span>
</code></pre></div></div>

<p>In my case, I use an SSH config file to resolve hetlab01 and hetlab02, but you can set the ansible_ssh_host variable to the main IP of your Root Server:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>hosts-kvmhost 
<span class="o">[</span>kvmhost]
hetlab01 <span class="nv">baremetal_ip</span><span class="o">=</span>192.168.203.3 <span class="nv">ansible_ssh_host</span><span class="o">=</span>x.x.x.x <span class="nv">ansible_ssh_user</span><span class="o">=</span>root <span class="nv">ansible_ssh_common_args</span><span class="o">=</span><span class="s1">'-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'</span>
hetlab02 <span class="nv">baremetal_ip</span><span class="o">=</span>192.168.203.4 <span class="nv">ansible_ssh_host</span><span class="o">=</span>y.y.y.y <span class="nv">ansible_ssh_user</span><span class="o">=</span>root <span class="nv">ansible_ssh_common_args</span><span class="o">=</span><span class="s1">'-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'</span>
</code></pre></div></div>

<ul>
  <li>test Ansible connection:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible <span class="nt">-m</span> ping all <span class="nt">-i</span> hosts-kvmhost <span class="nt">-o</span> <span class="nt">-b</span>
hetlab01 | SUCCESS <span class="o">=&gt;</span> <span class="o">{</span><span class="s2">"ansible_facts"</span>: <span class="o">{</span><span class="s2">"discovered_interpreter_python"</span>: <span class="s2">"/usr/libexec/platform-python"</span><span class="o">}</span>,<span class="s2">"changed"</span>: <span class="nb">false</span>,<span class="s2">"ping"</span>: <span class="s2">"pong"</span><span class="o">}</span>
hetlab02 | SUCCESS <span class="o">=&gt;</span> <span class="o">{</span><span class="s2">"ansible_facts"</span>: <span class="o">{</span><span class="s2">"discovered_interpreter_python"</span>: <span class="s2">"/usr/libexec/platform-python"</span><span class="o">}</span>,<span class="s2">"changed"</span>: <span class="nb">false</span>,<span class="s2">"ping"</span>: <span class="s2">"pong"</span><span class="o">}</span>
</code></pre></div></div>

<ul>
  <li>create <strong>custom-variables.yaml</strong> file:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">tee</span> <span class="s2">"custom-variables.yaml"</span> <span class="o">&gt;</span> /dev/null <span class="o">&lt;&lt;</span><span class="sh">'</span><span class="no">EOF</span><span class="sh">'
baremetal_net:
  net: 192.168.203.0
  netmask: 255.255.255.0
  prefix: 24
  reverse: 203.168.192
  gateway: 192.168.203.1
  ntp: "103.16.182.23,103.16.182.214"
  dhcp_start: 192.168.203.90
  dhcp_end: 192.168.203.110
  mtu: 1400
  vlan: 4000
kvmhost:
  enable_selinux: True
  reboot_timeout: 1200
  enable_portfw: True
  replace_ddns_duckdns: False
  provisioning_bridge_create: True
  provisioning_bridge_isolated: False
  baremetal_bridge_create: True
  baremetal_bridge_isolated: False
  enable_baremetal_gw: True
  set_hostname: True
  set_hosts: True
  additional_hosts: personal_hosts.j2
  create_ssh_key: True
secure_password: XXXXXXX
rh_subcription_user: XXXXXXX
rh_subcription_password: XXXXXXX
rh_subcription_pool: XXXXXXX
</span><span class="no">EOF
</span></code></pre></div></div>

<p>Replace the last four variables <strong>secure_password, rh_subcription_user, rh_subcription_password and rh_subcription_pool</strong> with your own data; if you have any doubt, have a look at <a href="https://amedeos.github.io/openshift/2022/08/20/Install-OpenShift-IPI-bm-on-lab.html">Install OpenShift baremetal IPI on homelab, using nested virtualization</a></p>
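<p>Since a leftover placeholder would only fail much later in the playbook run, it can be worth checking up front that none remain; a minimal sketch (it writes a stand-in file to /tmp; point the <code class="language-plaintext highlighter-rouge">grep</code> at your real <strong>custom-variables.yaml</strong> instead):</p>

```shell
# Fail fast if any "XXXXXXX" placeholder is still present in the variables file.
# Stand-in file for illustration; use your real custom-variables.yaml.
cat > /tmp/custom-variables.yaml <<'EOF'
secure_password: s3cret
rh_subcription_user: me@example.com
EOF
if grep -q 'XXXXXXX' /tmp/custom-variables.yaml; then
  echo "placeholders still present"
else
  echo "no placeholders left"
fi
```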

<ul>
  <li>create the <strong>custom-bm-ansible-nodes.json</strong> file, in which you choose on which Root Server each VM will be created;</li>
</ul>

<p>For example, master-0 will be created on the hetlab01.example.com host; note that redfish_ip matches the baremetal_ip in the <strong>hosts-kvmhost</strong> file:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">        </span><span class="p">{</span><span class="w">
            </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"master-0"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"state"</span><span class="p">:</span><span class="w"> </span><span class="s2">"present"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"hypervisor_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"hetlab01.example.com"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"hypervisor_user"</span><span class="p">:</span><span class="w"> </span><span class="s2">"root"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"hypervisor_ssh_key"</span><span class="p">:</span><span class="w"> </span><span class="s2">"~/.ssh/id_rsa"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"hypervisor_image_dir"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/lib/libvirt/images"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"provisioning_mac"</span><span class="p">:</span><span class="w"> </span><span class="s2">"52:54:00:00:32:00"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"baremetal_mac"</span><span class="p">:</span><span class="w"> </span><span class="s2">"52:54:00:00:33:00"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"vbmc_pre_cmd"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
            </span><span class="nl">"vbmc_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"192.168.201.102"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"vbmc_port"</span><span class="p">:</span><span class="w"> </span><span class="s2">"623"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"redfish_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"192.168.203.3"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"redfish_port"</span><span class="p">:</span><span class="w"> </span><span class="s2">"8000"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"baremetal_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"192.168.203.53"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"baremetal_last"</span><span class="p">:</span><span class="w"> </span><span class="s2">"53"</span><span class="w">
        </span><span class="p">}</span><span class="err">,</span><span class="w">
</span></code></pre></div></div>

<p>Worker-0, instead, will be created on the hetlab02.example.com host:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">        </span><span class="p">{</span><span class="w">
            </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker-0"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"state"</span><span class="p">:</span><span class="w"> </span><span class="s2">"present"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"is_odf"</span><span class="p">:</span><span class="w"> </span><span class="s2">"true"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"hypervisor_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"hetlab02.example.com"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"hypervisor_user"</span><span class="p">:</span><span class="w"> </span><span class="s2">"root"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"hypervisor_ssh_key"</span><span class="p">:</span><span class="w"> </span><span class="s2">"~/.ssh/id_rsa"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"hypervisor_image_dir"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/lib/libvirt/images"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"provisioning_mac"</span><span class="p">:</span><span class="w"> </span><span class="s2">"52:54:00:00:32:03"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"baremetal_mac"</span><span class="p">:</span><span class="w"> </span><span class="s2">"52:54:00:00:33:03"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"vbmc_pre_cmd"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
            </span><span class="nl">"vbmc_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"192.168.201.13"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"vbmc_port"</span><span class="p">:</span><span class="w"> </span><span class="s2">"623"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"redfish_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"192.168.203.4"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"redfish_port"</span><span class="p">:</span><span class="w"> </span><span class="s2">"8000"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"baremetal_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"192.168.203.56"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"baremetal_last"</span><span class="p">:</span><span class="w"> </span><span class="s2">"56"</span><span class="w">
        </span><span class="p">}</span><span class="err">,</span><span class="w">
</span></code></pre></div></div>

<p>Have a look at the <a href="https://github.com/amedeos/ocp4-in-the-jars/blob/main/custom-bm-ansible-nodes-hetzner-example.json">custom-bm-ansible-nodes.json example</a> file.</p>
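<p>Before launching the installation, you may want to verify that each sushy-emulator Redfish endpoint answers, for example with <code class="language-plaintext highlighter-rouge">curl http://192.168.203.3:8000/redfish/v1/Systems</code> (IP and port taken from <strong>redfish_ip</strong> / <strong>redfish_port</strong> above). The sketch below parses an abridged sample of such a response; the member names are illustrative, since sushy-emulator typically lists systems by libvirt domain UUID:</p>

```shell
# Abridged sample of a Redfish Systems collection; on a live hypervisor fetch
# it with: curl http://<redfish_ip>:<redfish_port>/redfish/v1/Systems
cat > /tmp/systems.json <<'EOF'
{"Members": [{"@odata.id": "/redfish/v1/Systems/master-0"},
             {"@odata.id": "/redfish/v1/Systems/worker-0"}]}
EOF
# Print each system's Redfish resource path
python3 - <<'PY'
import json
for member in json.load(open("/tmp/systems.json"))["Members"]:
    print(member["@odata.id"])
PY
```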

<ul>
  <li>
    <p>download <strong>pull-secret.txt</strong> from <a href="https://console.redhat.com/">Red Hat Console</a> and place it as pull-secret.txt; if you have any doubt have a look at <a href="https://amedeos.github.io/openshift/2022/08/20/Install-OpenShift-IPI-bm-on-lab.html">Install OpenShift baremetal IPI on homelab, using nested virtualization</a></p>
  </li>
  <li>
    <p>download RHEL 8.6 qcow2 file from <a href="https://access.redhat.com/downloads/">Red Hat Downloads</a>; if you have any doubt have a look at <a href="https://amedeos.github.io/openshift/2022/08/20/Install-OpenShift-IPI-bm-on-lab.html">Install OpenShift baremetal IPI on homelab, using nested virtualization</a></p>
  </li>
</ul>

<h2 id="run-ansible-playbook-prepare-hypervisoryaml">Run Ansible playbook prepare-hypervisor.yaml</h2>
<p>Now you can run the ansible playbook prepare-hypervisor.yaml:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ansible-playbook <span class="nt">-i</span> hosts-kvmhost <span class="nt">--extra-vars</span> <span class="s2">"@custom-variables.yaml"</span> prepare-hypervisor.yaml
</code></pre></div></div>

<p>This playbook will configure your Root Server(s).</p>

<h2 id="copy-custom-variables">Copy Custom variables</h2>

<ul>
  <li>copy RHEL 8.6 qcow2 file to all your Root Servers:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>scp /tmp/rhel-8.6-x86_64-kvm.qcow2 hetlab01:/root/images/
<span class="nv">$ </span>scp /tmp/rhel-8.6-x86_64-kvm.qcow2 hetlab02:/root/images/
</code></pre></div></div>
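<p>A large qcow2 transfer can occasionally be corrupted, so comparing checksums on both ends is a cheap safety net. This is a sketch using a local stand-in file; on real hosts run the commented <code class="language-plaintext highlighter-rouge">ssh</code> line instead:</p>

```shell
# Verify the copied image by comparing SHA-256 checksums on both ends.
echo "fake-qcow2-content" > /tmp/rhel-8.6-x86_64-kvm.qcow2   # stand-in file
local_sum=$(sha256sum /tmp/rhel-8.6-x86_64-kvm.qcow2 | awk '{print $1}')
# On a real host, compute the remote checksum over ssh instead:
# remote_sum=$(ssh hetlab01 sha256sum /root/images/rhel-8.6-x86_64-kvm.qcow2 | awk '{print $1}')
remote_sum=$local_sum   # stand-in for the ssh call above
if [ "$local_sum" = "$remote_sum" ]; then
  echo "checksum match"
else
  echo "checksum MISMATCH - copy the file again"
fi
```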

<ul>
  <li>clone repository on one Root Server:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh hetlab01
<span class="o">[</span>root@hetlab01 ~]# git clone https://github.com/amedeos/ocp4-in-the-jars.git
</code></pre></div></div>

<ul>
  <li>copy <strong>custom-variables.yaml</strong> file:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>scp custom-variables.yaml hetlab01:/root/ocp4-in-the-jars/
</code></pre></div></div>

<ul>
  <li>copy <strong>custom-bm-ansible-nodes.json</strong> file:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>scp custom-bm-ansible-nodes.json hetlab01:/root/ocp4-in-the-jars/
</code></pre></div></div>

<ul>
  <li>copy <strong>pull-secret.txt</strong> file:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scp pull-secret.txt hetlab01:/root/ocp4-in-the-jars/
</code></pre></div></div>

<h2 id="run-the-installation">Run the installation</h2>
<p>Finally, you can install OpenShift by running the Ansible playbook main.yaml in a <strong>tmux</strong> session:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh hetlab01
<span class="o">[</span>root@hetlab01 ~]# tmux

<span class="o">[</span>root@hetlab01 ~]# <span class="nb">cd</span> /root/ocp4-in-the-jars
<span class="o">[</span>root@hetlab01 ocp4-in-the-jars]# ansible-playbook <span class="nt">--extra-vars</span> <span class="s2">"@custom-variables.yaml"</span> <span class="nt">--extra-vars</span> <span class="s2">"@custom-bm-ansible-nodes.json"</span> main.yaml
</code></pre></div></div>

<p>wait 1-3 hours until the installation completes.</p>]]></content><author><name></name></author><category term="OpenShift" /><category term="OpenShift" /><category term="ipi" /><summary type="html"><![CDATA[Warning: This document / project / repository / playbooks should be used only for testing OpenShift Container Platform 4.x and NOT for production environments.]]></summary></entry><entry><title type="html">Install OpenShift baremetal IPI on homelab, using nested virtualization</title><link href="/openshift/2022/08/20/Install-OpenShift-IPI-bm-on-lab.html" rel="alternate" type="text/html" title="Install OpenShift baremetal IPI on homelab, using nested virtualization" /><published>2022-08-20T08:00:00+00:00</published><updated>2022-08-20T08:00:00+00:00</updated><id>/openshift/2022/08/20/Install-OpenShift-IPI-bm-on-lab</id><content type="html" xml:base="/openshift/2022/08/20/Install-OpenShift-IPI-bm-on-lab.html"><![CDATA[<p><strong>Warning:</strong> This document / project / repository / playbooks should be used <strong>only for testing</strong> OpenShift Container Platform 4.x and <strong>NOT for production environments</strong>.</p>

<p>In this article, I’ll explain how to deploy Red Hat OpenShift Container Platform using the <a href="https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-overview.html">installer-provisioned cluster on bare metal</a> (IPI), but instead of using bare metal nodes, for my homelab I use nested virtualization simulating bare metal nodes.</p>

<h2 id="update-18-09-2022">Update 18-09-2022</h2>
<p>Now, with the use of <strong>redfish</strong> emulator sushy-tools, by default, only one <strong>baremetal</strong> network should be used, with the advantage of removing the <strong>provisioning</strong> network.</p>

<h2 id="introduction">Introduction</h2>
<p>I use <a href="https://github.com/amedeos/ocp4-in-the-jars">Ansible playbooks</a> to install <strong>OpenShift Container Platform 4.x</strong> on a couple of (similar) Intel NUC, to test IPI bare metal installation; but instead of using bare metal nodes, I use virtual machines on NUC hosts.</p>

<p>The advantage of this approach is that it spreads the resource requirements across multiple, small and usually cheaper hosts, instead of a single, bigger host with an embedded BMC; the playbook is flexible enough to be used against one bigger host as well: for example, on Hetzner I used a “bigger” host to deploy all-in-one OpenShift master and worker nodes.</p>

<p>All OpenShift hosts will be created as virtual machines, using nested virtualization, on your NUCs.</p>

<h2 id="architecture-using-multiple-hosts">Architecture using multiple hosts</h2>

<p>In the following example, multiple hosts are used and could be added in the future, for example to add more worker nodes.</p>

<p><img src="/images/openshift/ocp4-in-the-jars-multiple-nuc.png" alt="architecture-multiple-nuc" /></p>

<h2 id="architecture-using-only-one-host">Architecture using only one host</h2>

<p>In the following example, only one host is used. For instance, you can rent a dedicated server on Hetzner with CentOS Stream 8 and, by running the playbook <a href="https://github.com/amedeos/ocp4-in-the-jars/blob/main/prepare-hypervisor.yaml">prepare-hypervisor.yaml</a> against it, you will get a single KVM hypervisor, reachable from the Internet, with iptables rules routing <code class="language-plaintext highlighter-rouge">api</code> and <code class="language-plaintext highlighter-rouge">apps</code> to OpenShift and NAT rules allowing master and worker nodes to reach the Internet.</p>

<p><img src="/images/openshift/ocp4-in-the-jars-single-nuc.png" alt="architecture-one-host" /></p>

<h2 id="requirements">Requirements</h2>
<h3 id="networks">Networks</h3>
<p>If you plan to run all virtual machines on a single host, you can skip this task. Otherwise, if you want to use multiple NUC hosts, you need to configure your switch with one baremetal network, which can be either a native VLAN or tagged by your NUC Linux bridge. This is required if you use a trunked network/cable.</p>

<p>The default configuration will use these L2 and L3 settings:</p>

<table>
  <thead>
    <tr>
      <th>VLAN</th>
      <th>Name</th>
      <th>Subnet</th>
      <th>Native</th>
      <th>Bridge</th>
      <th>Gateway</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2003</td>
      <td>Baremetal</td>
      <td>192.168.203.0/24</td>
      <td> </td>
      <td>bm</td>
      <td>192.168.203.1</td>
    </tr>
  </tbody>
</table>

<h3 id="operating-system-and-packages">Operating System and packages</h3>
<p>Your Linux NUC hosts require the following <strong>packages</strong> installed and working:</p>

<ul>
  <li>libvirt</li>
  <li>qemu</li>
  <li>nested virtualization</li>
  <li>libguestfs</li>
  <li>sushy-tools</li>
  <li>ssh</li>
</ul>
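<p>You can quickly verify the nested virtualization prerequisite from the list above; the sysfs path below is for Intel CPUs (on AMD hosts use <code class="language-plaintext highlighter-rouge">/sys/module/kvm_amd/parameters/nested</code>):</p>

```shell
# Report whether nested virtualization is enabled on an Intel host.
nested_file=/sys/module/kvm_intel/parameters/nested
if [ -r "$nested_file" ]; then
  cat "$nested_file"    # "1" or "Y" means nested virtualization is enabled
else
  echo "kvm_intel module not loaded"
fi
```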

<p>There is no constraint on which Linux distribution to use: for example, I use Gentoo, but you can use RHEL 8, CentOS Stream 8, Ubuntu, Arch…</p>

<p>If you’re using CentOS Stream 8 on your NUCs, you can use the Ansible playbook <code class="language-plaintext highlighter-rouge">prepare-hypervisor.yaml</code> to properly set up your NUC(s):</p>

<ul>
  <li>Clone <a href="https://github.com/amedeos/ocp4-in-the-jars">ocp4-in-the-jars</a> repository:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git clone https://github.com/amedeos/ocp4-in-the-jars
<span class="nv">$ </span><span class="nb">cd </span>ocp4-in-the-jars
</code></pre></div>    </div>
  </li>
  <li>Create an Ansible inventory file for the <strong>kvmhost</strong> group, where for each host you have to specify a single, free <strong>baremetal_ip</strong>; the content could look like this:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>hosts-kvmhost 
<span class="o">[</span>kvmhost]
centos01 <span class="nv">baremetal_ip</span><span class="o">=</span>192.168.203.3 <span class="nv">ansible_ssh_user</span><span class="o">=</span>root <span class="nv">ansible_ssh_common_args</span><span class="o">=</span><span class="s1">'-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'</span>
centos02 <span class="nv">baremetal_ip</span><span class="o">=</span>192.168.203.4 <span class="nv">ansible_ssh_user</span><span class="o">=</span>root <span class="nv">ansible_ssh_host</span><span class="o">=</span>192.168.201.10 <span class="nv">ansible_ssh_common_args</span><span class="o">=</span><span class="s1">'-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'</span>
</code></pre></div>    </div>
  </li>
  <li>create a <strong>custom-variables.yaml</strong> file:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">touch </span>custom-variables.yaml
</code></pre></div>    </div>
  </li>
  <li>review the baremetal network in the <code class="language-plaintext highlighter-rouge">variables.yaml</code> file: if you’re running all VMs on only one host you can leave it as is; otherwise, adapt it to your trunked network and set the new values in <strong>custom-variables.yaml</strong>. In the following example, I’ve changed the bridge names, the network CIDR and the MTU:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vi custom-variables.yaml
bridge_prov: br0
bridge_bm: baremetal
baremetal_net:
  net: 192.168.243.0
  netmask: 255.255.255.0
  prefix: 24
  reverse: 243.168.192
  gateway: 192.168.243.1
  ntp: <span class="s2">"103.16.182.23,103.16.182.214"</span>
  dhcp_start: 192.168.243.90
  dhcp_end: 192.168.243.110
  mtu: 3400
  vlan: 2003
</code></pre></div>    </div>
  </li>
  <li>review the <strong>kvmhost</strong> variables: if you’re running all VMs on only one host you can leave them as is; otherwise, adapt them to your needs by setting new values in the <strong>custom-variables.yaml</strong> file. For example, if you need to configure your multiple NUC bridges with the correct L2+L3 settings, change the <strong>provisioning_bridge_isolated</strong> and <strong>baremetal_bridge_isolated</strong> variables from True to <strong>False</strong>; likewise, if your NUC should not act as the baremetal network default gateway, change <strong>enable_baremetal_gw</strong> from True to <strong>False</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vi custom-variables.yaml
kvmhost:
  enable_selinux: True
  reboot_timeout: 1200
  enable_portfw: True
  replace_ddns_duckdns: False
  provisioning_bridge_create: True
  provisioning_bridge_isolated: False
  baremetal_bridge_create: True
  baremetal_bridge_isolated: False
  enable_baremetal_gw: False
  set_hostname: True
  set_hosts: True
  additional_hosts: personal_hosts.j2
  create_ssh_key: True
</code></pre></div>    </div>
  </li>
  <li>run <strong>prepare-hypervisor.yaml</strong> playbook:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook <span class="nt">-i</span> hosts-kvmhost <span class="nt">--extra-vars</span> <span class="s2">"@custom-variables.yaml"</span>  prepare-hypervisor.yaml
</code></pre></div>    </div>
  </li>
</ul>
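<p>Once the playbook completes, you can confirm on each host that the expected bridges were created (the baremetal bridge is named <code class="language-plaintext highlighter-rouge">bm</code> by default, per the Networks table above); a minimal sysfs-based sketch:</p>

```shell
# List all Linux bridge devices via sysfs; equivalent to
# "ip -o link show type bridge" without needing iproute2.
for d in /sys/class/net/*/bridge; do
  [ -d "$d" ] && basename "$(dirname "$d")" || true
done
```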

<h3 id="red-hat-login-subscription-and-rhel-qcow2">Red Hat Login, Subscription and RHEL qcow2</h3>
<p>In order to run the Ansible playbooks, you need to pass in your Red Hat login/password; if you don’t have one, sign up for the <a href="https://developers.redhat.com">Red Hat Developer Program</a>.</p>

<h4 id="create-a-red-hat-developer-program-membership">Create a Red Hat Developer Program membership</h4>
<ul>
  <li>go to <a href="https://developers.redhat.com">Red Hat Developer Program</a> and create a new user. Once done, click on the <a href="https://access.redhat.com/management/subscriptions">Subscriptions</a> tab and then click on <strong>Red Hat Developer Subscription for Individuals</strong></li>
</ul>

<p><img src="/images/openshift/rhdeveloper/07-subscription.png" alt="Red Hat Subscriptions" /></p>

<ul>
  <li>click on sub tab <strong>Subscriptions</strong>:</li>
</ul>

<p><img src="/images/openshift/rhdeveloper/08-subscription.png" alt="Red Hat Subscriptions" /></p>

<ul>
  <li>
    <p>click on the <strong>Subscription number</strong> and copy the Pool ID;</p>
  </li>
  <li>
    <p>now you can fill in your <strong>custom-variables.yaml</strong> file your <strong>rh_subcription_user, rh_subcription_password</strong> and <strong>rh_subcription_pool</strong>:</p>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vi custom-variables.yaml
rh_subcription_user: &lt;YOURRHUSERNAME&gt;
rh_subcription_password: &lt;YOURRHPASSWORD&gt;
rh_subcription_pool: &lt;YOURPOOLID&gt;
</code></pre></div>    </div>
  </li>
</ul>

<h4 id="download-a-pull-secret">Download a pull-secret</h4>
<p>Now you need to download a valid pull-secret.</p>

<ul>
  <li>go to <a href="https://console.redhat.com/">Red Hat Console</a>, click on <strong>OpenShift</strong> and then click on <strong>Create cluster</strong>:</li>
</ul>

<p><img src="/images/openshift/rhdeveloper/09-console.png" alt="Red Hat Console" /></p>

<ul>
  <li>click on <strong>Datacenter</strong> tab and then on <strong>Bare Metal(x86_64)</strong> link:</li>
</ul>

<p><img src="/images/openshift/rhdeveloper/10-console.png" alt="Red Hat Console" /></p>

<ul>
  <li>click on <strong>Installer-provisioned infrastructure</strong>:</li>
</ul>

<p><img src="/images/openshift/rhdeveloper/11-console.png" alt="Red Hat Console" /></p>

<ul>
  <li>click on <strong>Copy pull secret</strong>:</li>
</ul>

<p><img src="/images/openshift/rhdeveloper/12-console.png" alt="Red Hat Console" /></p>

<p>and paste it into the <strong>pull-secret.txt</strong> file, removing the trailing blank line:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vi pull-secret.txt
<span class="nv">$ </span><span class="nb">wc</span> <span class="nt">-l</span> pull-secret.txt 
1 pull-secret.txt
</code></pre></div></div>
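<p>Besides the line count, the pull secret must also be valid JSON containing an <code class="language-plaintext highlighter-rouge">auths</code> map; a minimal check (it uses a stand-in file here; point the commands at your real <strong>pull-secret.txt</strong> instead):</p>

```shell
# The pull secret must be one line of valid JSON with an "auths" map.
# Stand-in file with a dummy credential for illustration only.
echo '{"auths":{"quay.io":{"auth":"abc","email":"user@example.com"}}}' > /tmp/pull-secret.txt
python3 -c 'import json; d = json.load(open("/tmp/pull-secret.txt")); print("auths" in d)'  # prints True
wc -l < /tmp/pull-secret.txt   # prints 1
```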

<h4 id="download-rhel-86-qcow2">Download RHEL 8.6 qcow2</h4>
<p>Two virtual machines, <strong>utility</strong> and <strong>bastion</strong>, are based on the standard RHEL 8.6 KVM guest image (qcow2).</p>

<ul>
  <li>go to <a href="https://access.redhat.com/downloads/">Red Hat Downloads</a> and click on <strong>Red Hat Enterprise Linux</strong>:</li>
</ul>

<p><img src="/images/openshift/rhdeveloper/13-download.png" alt="Red Hat Downloads" /></p>

<ul>
  <li>select version <strong>8.6</strong>:</li>
</ul>

<p><img src="/images/openshift/rhdeveloper/14-download.png" alt="Red Hat Downloads" /></p>

<ul>
  <li>click <strong>Download Now</strong> button for <strong>Red Hat Enterprise Linux 8.6 KVM Guest Image</strong>:</li>
</ul>

<p><img src="/images/openshift/rhdeveloper/15-download.png" alt="Red Hat Downloads" /></p>

<p>Remember to put this qcow2 file on all your NUC hosts under <strong>/root/images/rhel-8.6-x86_64-kvm.qcow2</strong>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>scp rhel-8.6-x86_64-kvm.qcow2 &lt;NUC HOST&gt;:/root/images/rhel-8.6-x86_64-kvm.qcow2
</code></pre></div></div>

<h2 id="edit-ansible-inventory">Edit Ansible inventory</h2>

<p>If you’re installing all VMs on only one host / hypervisor / NUC, skip this chapter. Otherwise, to balance your VMs across multiple hosts, hypervisors, or NUCs, you need to specify how many workers you want and on which KVM host / NUC each virtual machine (utility, bastion, masters and workers) will be created. To do this, create a <strong>custom-bm-ansible-nodes.json</strong> file in which you specify, for each VM, the hypervisor (NUC), the <strong>redfish</strong> IP and port, and the MAC addresses; the redfish IP is usually the baremetal_ip defined in the hosts-kvmhost inventory file.</p>

<ul>
  <li>copy the all-in-one file <strong>bm-ansible-nodes.json</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cp </span>bm-ansible-nodes.json custom-bm-ansible-nodes.json
</code></pre></div>    </div>
  </li>
  <li>edit the <strong>custom-bm-ansible-nodes.json</strong> file with your customizations:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vi custom-bm-ansible-nodes.json
</code></pre></div>    </div>
  </li>
</ul>

<p>For example, to run the <strong>master-0</strong> node (VM) on the hypervisor centos01.example.com, establishing the Ansible SSH connection with the <strong>root</strong> user:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">...</span><span class="w">
    </span><span class="nl">"master_nodes"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
            </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"master-0"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"state"</span><span class="p">:</span><span class="w"> </span><span class="s2">"present"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"hypervisor_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"centos01.example.com"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"hypervisor_user"</span><span class="p">:</span><span class="w"> </span><span class="s2">"root"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"hypervisor_ssh_key"</span><span class="p">:</span><span class="w"> </span><span class="s2">"~/.ssh/id_rsa"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"hypervisor_image_dir"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/lib/libvirt/images"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"provisioning_mac"</span><span class="p">:</span><span class="w"> </span><span class="s2">"52:54:00:00:32:00"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"baremetal_mac"</span><span class="p">:</span><span class="w"> </span><span class="s2">"52:54:00:00:33:00"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"vbmc_pre_cmd"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
            </span><span class="nl">"vbmc_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"192.168.201.102"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"vbmc_port"</span><span class="p">:</span><span class="w"> </span><span class="s2">"623"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"redfish_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"192.168.203.1"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"redfish_port"</span><span class="p">:</span><span class="w"> </span><span class="s2">"8000"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"baremetal_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"192.168.203.53"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"baremetal_last"</span><span class="p">:</span><span class="w"> </span><span class="s2">"53"</span><span class="w">
        </span><span class="p">},</span><span class="w">
</span><span class="err">...</span><span class="w">
</span></code></pre></div></div>

<h2 id="run-the-installation">Run the Installation</h2>

<p>If you haven’t created a custom variables file (<strong>custom-variables.yaml</strong>) or a custom inventory file (<strong>custom-bm-ansible-nodes.json</strong>), just run the main.yaml playbook:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook main.yaml
</code></pre></div></div>

<p>otherwise pass them as Ansible extra variable files:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook <span class="nt">--extra-vars</span> <span class="s2">"@custom-variables.yaml"</span> <span class="nt">--extra-vars</span> <span class="s2">"@custom-bm-ansible-nodes.json"</span> main.yaml
</code></pre></div></div>

<p>then wait 1-3 hours until the installation completes.</p>

<h2 id="post-installation-checks">Post installation checks</h2>

<p>Connect to your bastion virtual machine:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh kni@&lt;BASTIONIP&gt;

<span class="c"># if you haven't changed IP this should be 192.168.203.50</span>

<span class="nv">$ </span>ssh kni@192.168.203.50
</code></pre></div></div>

<p>check clusterversion:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">export </span><span class="nv">KUBECONFIG</span><span class="o">=</span>/home/kni/ocp-lab/auth/kubeconfig

<span class="nv">$ </span>oc get clusterversion
</code></pre></div></div>

<p>check cluster operator:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># output should be empty</span>
<span class="nv">$ </span>oc get co | egrep <span class="nt">-v</span> <span class="s1">'4\.[0-9]+\.[0-9]+\s+True\s+False\s+False'</span>
</code></pre></div></div>
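<p>To see what that filter is doing, here is a small illustrative sketch run against fake <strong>oc get co</strong> output (operator names and versions below are made up): a healthy operator row shows AVAILABLE=True, PROGRESSING=False and DEGRADED=False, so the inverted match drops healthy rows and prints only the operators that need attention.</p>

```shell
# Illustrative only: fake 'oc get co' rows (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE)
sample='authentication   4.14.0   True   False   False   12m
console          4.14.0   True   False   False   30m
ingress          4.14.0   True   False   True    5m'

# Healthy rows match "True False False" and are removed by -v;
# anything left over is an operator worth investigating.
echo "$sample" | grep -Ev '4\.[0-9]+\.[0-9]+\s+True\s+False\s+False'
# prints only the ingress row (DEGRADED=True)
```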

<h2 id="optional---clean-up">Optional - Clean up</h2>

<p>If you want to clean up everything, run the <strong>cleanup.yaml</strong> playbook:</p>

<p><strong>WARNING:</strong> the following command will delete all resources created!</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook cleanup.yaml
</code></pre></div></div>

<p>if you have custom variable files, pass them to the ansible-playbook command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook <span class="nt">--extra-vars</span> <span class="s2">"@custom-variables.yaml"</span> <span class="nt">--extra-vars</span> <span class="s2">"@custom-bm-ansible-nodes.json"</span> cleanup.yaml
</code></pre></div></div>

<h2 id="optional---set-dynamic-dns-and-valid-certificate">Optional - Set dynamic dns and valid certificate</h2>

<p>If you want to give your cluster a dynamic DNS entry and, with it, a valid certificate, first create a valid token and domain on <a href="https://www.duckdns.org/">Duck DNS</a>; after this, edit your custom-variables.yaml with:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">$ vi custom-variables.yaml</span>
<span class="s">....</span>
<span class="na">duckdns_token</span><span class="pi">:</span> <span class="s">YOURTOKEN</span>       <span class="c1">#### &lt;= put here your valid token on duckdns.org</span>
<span class="na">cluster_name</span><span class="pi">:</span> <span class="s">YOURDOMAINONDD</span>   <span class="c1">#### &lt;= put here your valid domain on duckdns.org</span>
<span class="na">base_domain</span><span class="pi">:</span> <span class="s">duckdns.org</span>
<span class="na">domain</span><span class="pi">:</span> <span class="s2">"</span><span class="s">YOURDOMAINONDD.duckdns.org"</span>
<span class="na">enable_ddns_duckdns</span><span class="pi">:</span> <span class="s">True</span>
<span class="na">enable_letsencrypt</span><span class="pi">:</span> <span class="s">True</span>
</code></pre></div></div>

<p>then you can run the installation.</p>]]></content><author><name></name></author><category term="OpenShift" /><category term="OpenShift" /><category term="ipi" /><summary type="html"><![CDATA[Warning: This document / project / repository / playbooks should be used only for testing OpenShift Container Platform 4.x and NOT for production environments.]]></summary></entry><entry><title type="html">Use btrbk for remote backup solution with btrfs</title><link href="/backup/2021/08/18/Use-btrbk-for-backup-on-btrfs.html" rel="alternate" type="text/html" title="Use btrbk for remote backup solution with btrfs" /><published>2021-08-18T08:00:00+00:00</published><updated>2021-08-18T08:00:00+00:00</updated><id>/backup/2021/08/18/Use-btrbk-for-backup-on-btrfs</id><content type="html" xml:base="/backup/2021/08/18/Use-btrbk-for-backup-on-btrfs.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In this “simple” post I’ll show you how to configure <a href="https://digint.ch/btrbk/">btrbk</a> to send a subvolume to a remote Linux box in order to back up your data; I’ll also show you how to limit btrbk’s permissions using <strong>sudo</strong> and the <strong>ssh_filter_btrbk.sh</strong> script file.</p>

<p>The btrbk program uses the btrfs send / receive feature, simplifying the management of subvolumes and making it easy to send them from your source box to your target box over ssh.</p>

<h2 id="requirements">Requirements</h2>
<p>You will need:</p>

<ul>
  <li>Your source Linux box must use a btrfs subvolume; in this post I’ll use the <strong>@home</strong> subvolume to back up / send to the remote box, but you can adapt this to your needs and use another btrfs subvolume;</li>
  <li>Your destination Linux box must have one <strong>btrfs device / file system</strong>;</li>
  <li>Ability to <strong>install btrbk</strong> on your preferred Linux distro, on both source and destination Linux boxes;</li>
  <li>Ability to create a <strong>dedicated user</strong> for backup, on both source and destination Linux boxes, in this post I’ll use <strong>backupuser</strong>;</li>
  <li>Ability to configure <strong>sudo permissions</strong> on both source and destination Linux boxes.</li>
</ul>

<h2 id="install-btrbk">Install btrbk</h2>
<p>Install btrbk on both the source and destination hosts, using the package manager of your preferred distro; on the destination host we’ll only use the provided <strong>ssh_filter_btrbk.sh</strong> script, which is why we install the btrbk program there as well.</p>

<p>On Gentoo:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># emerge --ask app-backup/btrbk</span>
</code></pre></div></div>

<p>On Fedora:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># dnf install btrbk</span>
</code></pre></div></div>

<p>After the installation, identify where your distro installs the <strong>ssh_filter_btrbk.sh</strong> script; on Gentoo and Fedora this script is located at <strong>/usr/share/btrbk/scripts/ssh_filter_btrbk.sh</strong>.</p>

<h2 id="configure-your-source-box-to-backup">Configure your source box to backup</h2>
<h3 id="mount-btrfs-volume">Mount btrfs volume</h3>
<p>Mount your primary btrfs volume under the directory <strong>/mnt/btrbk_pool</strong>; doing this, you’ll be able to back up all subvolumes.</p>

<p>Create the directory:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># mkdir -p /mnt/btrbk_pool</span>
</code></pre></div></div>

<p>Identify your btrfs UUID, which in my case is a000eea9-d97c-4107-ae39-602049a6acaa:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># blkid | egrep 'TYPE=\"btrfs\"' | sed -E 's/.+\s+UUID=\"([0-9a-z\-]+)\"\s+.+/\1/g'</span>
a000eea9-d97c-4107-ae39-602049a6acaa
</code></pre></div></div>
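<p>The grep / sed pipeline above can be sanity-checked against a sample blkid line (the device and UUID below are illustrative); the sed capture group keeps only the UUID value:</p>

```shell
# Fake blkid output line standing in for real output
line='/dev/nvme0n1p3: UUID="a000eea9-d97c-4107-ae39-602049a6acaa" TYPE="btrfs"'

# Keep only btrfs lines, then strip everything except the captured UUID
echo "$line" | grep -E 'TYPE="btrfs"' | sed -E 's/.+\s+UUID="([0-9a-z\-]+)"\s+.+/\1/g'
# prints: a000eea9-d97c-4107-ae39-602049a6acaa
```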

<p>Now edit your <strong>/etc/fstab</strong> in order to mount your btrfs volume under <strong>/mnt/btrbk_pool</strong>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># vi /etc/fstab</span>
<span class="c"># grep btrbk_pool /etc/fstab</span>
<span class="nv">UUID</span><span class="o">=</span>a000eea9-d97c-4107-ae39-602049a6acaa       /mnt/btrbk_pool                 btrfs           noatime,relatime,compress<span class="o">=</span>no,ssd,space_cache,discard<span class="o">=</span>async                      0 0
</code></pre></div></div>

<p><strong>NOTE 1:</strong> remove <strong>ssd</strong> option if you’re using rotational disks</p>

<p><strong>NOTE 2:</strong> remove <strong>discard=async</strong> if you’re using a kernel &lt; 5.6</p>

<p>Mount the volume and check if the subvolume <strong>@home</strong> is present:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># mount -a</span>
<span class="c"># btrfs subvolume list /mnt/btrbk_pool | egrep -E '\@home$'</span>
ID 257 gen 84832 top level 5 path @home
</code></pre></div></div>
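<p>The anchored regex matters here: it must select the path that ends exactly in <strong>@home</strong>, not every subvolume containing that string. A quick illustrative check on fake listing output (IDs are made up):</p>

```shell
# Fake 'btrfs subvolume list' output
subvols='ID 256 gen 84832 top level 5 path @
ID 257 gen 84832 top level 5 path @home
ID 258 gen 84832 top level 5 path @snapshots'

# The $ anchor selects only the line whose path ends in @home
echo "$subvols" | grep -E '@home$'
# prints: ID 257 gen 84832 top level 5 path @home
```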

<h3 id="create-backupuser">Create backupuser</h3>
<p>Now you can create on your source box the new user <strong>backupuser</strong>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># useradd backupuser</span>
</code></pre></div></div>

<p>Add sudo permissions for backupuser by creating a new file /etc/sudoers.d/backupuser:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># cat /etc/sudoers.d/backupuser </span>
%backupuser <span class="nv">ALL</span><span class="o">=(</span>ALL<span class="o">)</span> NOPASSWD: /sbin/btrfs, /bin/readlink, /usr/bin/readlink
</code></pre></div></div>

<h3 id="create-ssh-key">Create ssh Key</h3>
<p>Create a new ssh key, which will be trusted on the destination box:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># mkdir /etc/btrbk/ssh</span>
<span class="c"># chown backupuser. /etc/btrbk/ssh/</span>
<span class="c"># chmod 0700 /etc/btrbk/ssh</span>
<span class="c"># su - backupuser</span>
backupuser@sourcebox ~ <span class="nv">$ </span>ssh-keygen <span class="nt">-t</span> rsa <span class="nt">-b</span> 4096 <span class="nt">-f</span> /etc/btrbk/ssh/id_rsa <span class="nt">-C</span> backuser@<span class="si">$(</span><span class="nb">hostname</span><span class="si">)</span> <span class="nt">-N</span> <span class="s2">""</span>
</code></pre></div></div>

<h3 id="configure-etcbtrbkbtrbkconf">Configure /etc/btrbk/btrbk.conf</h3>
<p>In this example, I’ll backup and send to the remote Linux box only the <strong>@home</strong> subvolume, but you can adapt it based on your needs.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># cat /etc/btrbk/btrbk.conf</span>
timestamp_format        long
ssh_identity /etc/btrbk/ssh/id_rsa
ssh_user backupuser

backend_remote btrfs-progs-sudo
backend btrfs-progs-sudo

snapshot_preserve_min   2d
snapshot_preserve      14d

target_preserve_min    no
target_preserve        20d 10w <span class="k">*</span>m

volume /mnt/btrbk_pool
  subvolume @home
    target ssh://&lt;FQDN&gt;/ssddata/backup/lapdog
</code></pre></div></div>

<p>Replace <strong>&lt;FQDN&gt;</strong> with your target box’s IP address or FQDN.</p>

<h2 id="configure-your-target-box-to-receive-backup">Configure your target box to receive backup</h2>
<p>Now we can configure the target box in order to receive the btrfs subvolume coming from our source box.</p>
<h3 id="create-a-new-backup-subvolume">Create a new @backup subvolume</h3>
<p>Identify your btrfs volume and create a new <strong>@backup</strong> subvolume; personally I’ve been using a LUKS device named <strong>“ssddata”</strong>, but you could use, for example, an HDD disk partition like /dev/sdX1.</p>

<p>Create a new subvolume:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># mount /dev/mapper/ssddata /mnt/ssddata</span>
<span class="c"># cd /mnt/ssddata</span>
<span class="c"># btrfs subvolume create @backup</span>
</code></pre></div></div>

<p>Update your <strong>/etc/fstab</strong> with an entry for the <strong>@backup</strong> subvolume, mounting it under <strong>/ssddata/backup</strong>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># mkdir -p /ssddata/backup</span>
<span class="c"># vi /etc/fstab</span>
<span class="c"># grep backup /etc/fstab</span>
<span class="nv">UUID</span><span class="o">=</span>aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee       /ssddata/backup         btrfs   noatime,relatime,compress<span class="o">=</span>lzo,ssd,space_cache,discard<span class="o">=</span>async,subvol<span class="o">=</span>@backup      0 0
</code></pre></div></div>

<p>mount the subvolume:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># mount -a</span>
</code></pre></div></div>

<p>and create the <strong>lapdog</strong> directory (if you want to change the name, remember to change it also in btrbk.conf on the source box):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># mkdir -p /ssddata/backup/lapdog</span>
</code></pre></div></div>

<h3 id="create-backupuser-1">Create backupuser</h3>
<p>Now you can create on your target box the new user <strong>backupuser</strong> (the same as we did on the source box :smile:):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># useradd backupuser</span>
</code></pre></div></div>

<p>Add sudo permissions for backupuser by creating a new file /etc/sudoers.d/backupuser:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># cat /etc/sudoers.d/backupuser </span>
%backupuser <span class="nv">ALL</span><span class="o">=(</span>ALL<span class="o">)</span> NOPASSWD: /sbin/btrfs, /bin/readlink, /usr/bin/readlink
</code></pre></div></div>

<h3 id="trust-ssh-key">Trust ssh key</h3>
<p>Copy the content of the ssh pub file from your source box:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># cat /etc/btrbk/ssh/id_rsa.pub</span>
</code></pre></div></div>

<p>Put the content of the file <strong>/etc/btrbk/ssh/id_rsa.pub</strong> in your clipboard and then go to your target box and run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># su - backupuser</span>
<span class="nv">$ </span><span class="nb">mkdir</span> <span class="nt">-p</span> ~/.ssh
<span class="nv">$ </span><span class="nb">chmod </span>0700 ~/.ssh
</code></pre></div></div>

<p>and then edit the backupuser’s <strong>~/.ssh/authorized_keys</strong> file, adding first <em>command="/usr/share/btrbk/scripts/ssh_filter_btrbk.sh -l --sudo --target --delete --info"</em> and then, on the same line separated by a single space, the content of the id_rsa.pub coming from your source box;</p>

<p>below is an example of the ~/.ssh/authorized_keys file:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /home/backupuser/.ssh/authorized_keys
<span class="nb">command</span><span class="o">=</span><span class="s2">"/usr/share/btrbk/scripts/ssh_filter_btrbk.sh -l --sudo --target --delete --info"</span> ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDwRPIBZMgomFVfXOyOwYm+CuSdWfWR7tMIh+aJgWGv1pK8zuTiZtoaCSnobrRVJNkWWNIeL672o9zgn8y5N2nb64pWxDCcJWFKHuxCZk3ZN1i70JPTZ25sUZ0YUQ8YCd4YtLIujPdIdCMNTESrB0QYe0CCyD6HnX2DRR36G3EVRbNmBpzeLLthIoZLzRpGXFeHMLIz3W9v5VrIwDYZGWdUptyqbh9YQd7x9+lqmaCSlAzRttMVk6HiH8hUuJLgseNtvamqqsEQcZGk3j4v3EbYR+oCQqb4njcxQ3YbPuKtc88PREIezNt/rcoo4m720nXOeKZCad5Ob0/gd9CnBPY3xo8Po1UZdOSrvUxr46moAhMMBVy8c9LO32AlJ7oKjgt2UFelOdWlx69vCZ7TezYRCSj5DS2ZtlYe4KN1pRfLwe1h+h/tt4QVMmKpbl771VKaTzb1xM3TwR8SSXRqct/NeXGWNm7CPrsPx1qK6NFsqx0KH/Wc93uIqbucC5fUhd7rc5yYX43yO3vDon+Omlc9OIAmfxtssTK8/XU7C9fQDMACgwcFxh7JixdPzqlVvJxNiJiSjpWSkixXubBRwgWlTf/L7hZppFcnS0j+ZtgsTiBofm5QiMHskQIUoZ0WmdyuzwJQVQMBV58rZdB7goxvLaW63SM/b6FVk5c0m9dV4w<span class="o">==</span> backuser@lapdog
</code></pre></div></div>
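<p>If you prefer not to assemble that line by hand, here is a small sketch of building it with printf; the public key below is a shortened placeholder, and the script path may differ on your distro:</p>

```shell
# Placeholder standing in for the real content of /etc/btrbk/ssh/id_rsa.pub
pubkey='ssh-rsa AAAAB3NzaC1yc2EAAAADAQABEXAMPLEKEY backupuser@sourcebox'

# Prefix the key with the forced command so this key can only run btrbk operations
printf 'command="/usr/share/btrbk/scripts/ssh_filter_btrbk.sh -l --sudo --target --delete --info" %s\n' "$pubkey"
# In real use, append the printed line to /home/backupuser/.ssh/authorized_keys
```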

<h2 id="run-your-first-backup-and-send-it-via-ssh-to-target-box">Run your first backup and send it via ssh to target box</h2>
<p>Now you can run the first backup using btrbk and it will automagically send the btrfs subvolumes through ssh to the target box:</p>

<p>From your source box switch to backupuser and run it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># su - backupuser</span>
<span class="nv">$ </span>btrbk <span class="nt">-c</span> /etc/btrbk/btrbk.conf <span class="nt">-v</span> run
</code></pre></div></div>

<p>after it ends, you can run the <strong>list all</strong> command in order to see all backups:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># su - backupuser</span>
<span class="nv">$ </span>btrbk <span class="nt">-c</span> /etc/btrbk/btrbk.conf <span class="nt">-n</span> list all
</code></pre></div></div>]]></content><author><name></name></author><category term="backup" /><category term="backup" /><category term="btrfs" /><category term="btrbk" /><summary type="html"><![CDATA[Introduction In this “simple” post I’ll show you how to configure btrbk to send to a remote Linux box your subvolume, in order to backup your data, also I’ll show you how to limit permissions to btrbk using sudo and ssh_filter_btrbk.sh script file.]]></summary></entry><entry><title type="html">Configure Gentoo to unlock LUKS root file system with FIDO2 key</title><link href="/gentoo/2021/04/25/Unlock-rootfs-with-fido2-key.html" rel="alternate" type="text/html" title="Configure Gentoo to unlock LUKS root file system with FIDO2 key" /><published>2021-04-25T08:00:00+00:00</published><updated>2021-04-25T08:00:00+00:00</updated><id>/gentoo/2021/04/25/Unlock-rootfs-with-fido2-key</id><content type="html" xml:base="/gentoo/2021/04/25/Unlock-rootfs-with-fido2-key.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In this post I’ll describe how to unlock your <strong>LUKS</strong> device, which contains the root file system, using a <strong>FIDO2</strong> hardware key.</p>

<p><strong>UPDATE: 03/08/2023</strong>: Updated the /etc/dracut.conf.d/fido2.conf file with systemd’s new fido2 requirements</p>

<h2 id="requirements">Requirements</h2>
<p>In order to unlock your LUKS device at boot using your FIDO2 hardware key, your Gentoo box needs to meet these conditions:</p>

<ul>
  <li>only <strong>LUKS2</strong> devices are supported; if your device is LUKS1 you can’t unlock it with a FIDO2 key. To check whether your device is LUKS2, simply run a cryptsetup status command:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># cryptsetup status &lt;YOUR LUKS DEVICE&gt; | grep type</span>
<span class="nb">type</span>: LUKS2
</code></pre></div>    </div>
  </li>
  <li>systemd version <strong>248</strong> or higher;
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># qlist -Idv | grep sys-apps/systemd</span>
sys-apps/systemd-248
</code></pre></div>    </div>
  </li>
  <li>a FIDO2 hardware key; for this I bought the <a href="https://solokeys.com/collections/all/products/solo-usb-a">Solokeys Solo USB-A</a>, but any FIDO2 key available on the market should work;</li>
  <li><strong>sys-kernel/dracut</strong> will be used for building the initial ram file system.</li>
</ul>

<h2 id="software-installation">Software installation</h2>
<h3 id="sys-appssystemd">sys-apps/systemd</h3>
<p>Recently, the Gentoo systemd maintainer added fido2 support; just add the fido2 USE flag:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># mkdir -p /etc/portage/package.use</span>
<span class="c"># echo "sys-apps/systemd fido2" &gt;&gt; /etc/portage/package.use/systemd</span>
</code></pre></div></div>

<p>After enabling the fido2 USE flag, re-emerge systemd:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># time emerge --ask --verbose --update --deep --with-bdeps=y --newuse  --keep-going --autounmask-write=y --backtrack=30  @world</span>
</code></pre></div></div>

<p>check if your systemd has FIDO2 capability:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># systemctl --version | grep FIDO | sed -E 's/.+([+-]FIDO2).+/\1/'</span>
+FIDO2
</code></pre></div></div>
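<p>That sed expression simply isolates the <strong>+FIDO2</strong> (or <strong>-FIDO2</strong>) feature flag from the long feature list. You can see it at work on a shortened, illustrative <strong>systemctl --version</strong> feature line:</p>

```shell
# Shortened fake feature line; a real one lists many more +/- flags
features='systemd 248 (+PAM +AUDIT +SELINUX +FIDO2 +IDN2 -IDN default-hierarchy=hybrid)'

# Keep only the FIDO2 flag, including its leading + or -
echo "$features" | grep FIDO | sed -E 's/.+([+-]FIDO2).+/\1/'
# prints: +FIDO2
```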
<p>reboot the system in order to use your new systemd:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># reboot</span>
</code></pre></div></div>
<h3 id="sys-kerneldracut">sys-kernel/dracut</h3>
<p>Install dracut if you haven’t done before:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># emerge --ask sys-kernel/dracut</span>
</code></pre></div></div>
<h2 id="enroll-your-fido2-key-in-your-luks-device">Enroll your FIDO2 key in your LUKS device</h2>
<p>Now you can plug in your FIDO2 hardware token and enroll a FIDO2 key on your LUKS device; in my case the device is /dev/nvme0n1p3, but change it according to your environment (for example /dev/sda3):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># systemd-cryptenroll --fido2-device=auto /dev/nvme0n1p3 </span>
🔐 Please enter current passphrase <span class="k">for </span>disk /dev/nvme0n1p3: <span class="o">(</span>no <span class="nb">echo</span><span class="o">)</span>               
Initializing FIDO2 credential on security token.
👆 <span class="o">(</span>Hint: This might require verification of user presence on security token.<span class="o">)</span>
Generating secret key on FIDO2 security token.
New FIDO2 token enrolled as key slot 1.
</code></pre></div></div>
<p>When requested, type the current passphrase for your LUKS device, and then, in case you’re using Solokeys, press the button on the token when the LED turns red.</p>

<p>If you want to check whether your LUKS device now has the FIDO2 key, you can run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># cryptsetup luksDump /dev/nvme0n1p3</span>
...
Tokens:
  0: systemd-fido2
        Keyslot:  1
...
</code></pre></div></div>
<h2 id="configure-etccrypttab">Configure /etc/crypttab</h2>
<p>Configure your <strong>/etc/crypttab</strong> to point to your LUKS device, giving it a <strong>human</strong>-readable name like <strong>rootvolume</strong> instead of the classical UUID:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># cat /etc/crypttab </span>
rootvolume /dev/nvme0n1p3 - fido2-device<span class="o">=</span>auto
</code></pre></div></div>
<h2 id="configure-dracut">Configure dracut</h2>
<p>By default, dracut doesn’t install libfido2 or the crypttab file; to fix this, create a new file <strong>/etc/dracut.conf.d/fido2.conf</strong> with the following contents:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># cat /etc/dracut.conf.d/fido2.conf </span>
add_dracutmodules+<span class="o">=</span><span class="s2">" fido2 "</span>
install_items+<span class="o">=</span><span class="s2">" /etc/crypttab "</span>
</code></pre></div></div>

<p>now you can rebuild your initial ram file system for your current kernel:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># dracut --force --kver $(uname -r)</span>
</code></pre></div></div>
<p>instead, if you want to rebuild it for a specific kernel, for example <strong>5.11.16-gentoo-amedeo05</strong>, run this command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># dracut --force --kver 5.11.16-gentoo-amedeo05</span>
</code></pre></div></div>
<h2 id="configure-grub">Configure grub</h2>
<p>Edit your grub <strong>GRUB_CMDLINE_LINUX</strong> parameter, removing all luks options which refer to your LUKS UUID; you should do this to make the boot procedure easier when booting without the FIDO2 key. Your GRUB_CMDLINE_LINUX should look something like this:</p>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">GRUB_CMDLINE_LINUX</span><span class="p">=</span><span class="s">"init=/usr/lib/systemd/systemd rd.luks.allow-discards root=UUID=a000eea9-d97c-4107-ae39-602049a6acaa rootflags=subvol=@"</span>
</code></pre></div></div>

<p><strong>Note:</strong> the above root=UUID refers to my btrfs UUID and not to the LUKS UUID</p>

<p>now you can run the <strong>grub-mkconfig</strong>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># grub-mkconfig -o /boot/grub/grub.cfg</span>
</code></pre></div></div>
<h2 id="reboot">Reboot</h2>
<p>Plug in your FIDO2 key, reboot, and check whether your newly built initial ram file system can unlock your root file system:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># reboot</span>
</code></pre></div></div>
<h2 id="bonus-unlock-your-root-file-system-using-the-passphrase">(Bonus) Unlock your root file system using the passphrase</h2>
<p>If you lose or forget your FIDO2 key, you can boot your system using a LiveCD, or by using the following trick.</p>

<p>When grub starts, press <strong><em>e</em></strong> to edit the boot parameters, then go to the <strong>linux</strong> entry and, after all your boot parameters (in my case init=/usr/lib/systemd/systemd rd.luks.allow-discards root=UUID=a000eea9-d97c-4107-ae39-602049a6acaa rootflags=subvol=@), add the following:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rd.break<span class="o">=</span>initqueue
</code></pre></div></div>
<p>Press <strong>Ctrl+x</strong> or <strong>F10</strong> and wait for the following messages:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Press Enter <span class="k">for </span>maintenance
<span class="o">(</span>or press Control-D to <span class="k">continue</span><span class="o">)</span>:
</code></pre></div></div>
<p>now press <strong>Enter</strong> and, at the prompt, first mask your systemd-cryptsetup@rootvolume.service unit:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># systemctl mask systemd-cryptsetup@rootvolume.service</span>
</code></pre></div></div>
<p>unlock your rootvolume by typing your passphrase:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># /lib/systemd/systemd-cryptsetup attach rootvolume /dev/nvme0n1p3</span>
</code></pre></div></div>
<p>now you can press <strong>Control-D</strong> to tell systemd to continue the boot process.</p>]]></content><author><name></name></author><category term="gentoo" /><category term="gentoo" /><category term="luks" /><category term="systemd" /><summary type="html"><![CDATA[Introduction In this post I’ll describe how to unlock your LUKS device, which contains the root file system, using a FIDO2 hardware key.]]></summary></entry><entry><title type="html">How to install Gentoo with UEFI LUKS Btrfs and systemd</title><link href="/gentoo/2020/12/26/install-gentoo-with-uefi-luks-btrfs-and-systemd.html" rel="alternate" type="text/html" title="How to install Gentoo with UEFI LUKS Btrfs and systemd" /><published>2020-12-26T08:00:00+00:00</published><updated>2020-12-26T08:00:00+00:00</updated><id>/gentoo/2020/12/26/install-gentoo-with-uefi-luks-btrfs-and-systemd</id><content type="html" xml:base="/gentoo/2020/12/26/install-gentoo-with-uefi-luks-btrfs-and-systemd.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In this post I’ll describe how to install <a href="https://gentoo.org/">Gentoo</a> with <strong>systemd</strong> stage3 tarball on <strong>UEFI</strong>  <strong>LUKS</strong> partition and <strong>Btrfs</strong> filesystem, using the standard de facto <strong>@ subvolume</strong> as root file system.</p>

<p>I’ve also written two guides for installing Gentoo on LUKS using an LVM volume group and an ext4 filesystem: if you’re interested in those, you can find <a href="https://amedeos.github.io/gentoo/2019/01/14/install-gentoo-with-luks-lvm-and-systemd.html">here</a> the guide for a BIOS partition, and <a href="https://amedeos.github.io/gentoo/2019/01/14/install-gentoo-with-luks-lvm-and-systemd.html">here</a> the guide for a UEFI partition.</p>

<p><strong>UPDATE 23/08/2023</strong>: Correct typo on GRUB_PLATFORMS</p>

<h2 id="disk-partitions">Disk partitions</h2>
<p>I’m going to create GPT partitions, with a small <a href="https://en.wikipedia.org/wiki/BIOS_boot_partition">BIOS boot partition</a> (2 MiB) used by GRUB for its second stage.</p>
<h3 id="partition-scheme">Partition scheme</h3>
<p>This is the quite simple partition scheme used in this guide.
I’m using an <strong>NVMe</strong> disk <em>/dev/nvme0n1</em>; if you have a SCSI disk such as <em>/dev/sda</em>, simply replace the device names throughout (note that NVMe partition names carry a <em>p</em>, so <em>nvme0n1p3</em> becomes <em>sda3</em>, and a plain <em>sed ‘s/nvme0n1/sda/g’</em> is not quite enough).</p>
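<p>For example, you can convert any command of this guide for a SCSI disk by piping it through <em>sed</em> with two substitutions, so the partition <em>p</em> is dropped (a sketch; adjust the target device name to your hardware):</p>

```shell
# Convert this guide's NVMe device names to SCSI names.
# The first rule handles partitions (nvme0n1p2 -> sda2),
# the second the bare disk (nvme0n1 -> sda).
cmd="mkfs.vfat -F32 /dev/nvme0n1p2"
echo "$cmd" | sed 's|nvme0n1p|sda|g; s|nvme0n1|sda|g'
# -> mkfs.vfat -F32 /dev/sda2
```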

<table>
  <tbody>
    <tr>
      <td><strong>Partition</strong></td>
      <td><strong>Filesystem</strong></td>
      <td><strong>Size</strong></td>
      <td><strong>Description</strong></td>
    </tr>
    <tr>
      <td>/dev/nvme0n1p1</td>
      <td>fat</td>
      <td>2MiB</td>
      <td>bios boot</td>
    </tr>
    <tr>
      <td>/dev/nvme0n1p2</td>
      <td>fat32</td>
      <td>800MiB</td>
      <td>Boot partition</td>
    </tr>
    <tr>
      <td>/dev/nvme0n1p3</td>
      <td>LUKS</td>
      <td>rest of the disk</td>
      <td>LUKS partition</td>
    </tr>
  </tbody>
</table>

<h3 id="creating-the-partitions">Creating the partitions</h3>
<p>Using the <strong>parted</strong> utility we can now create all the required partitions:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd ~# parted /dev/nvme0n1 
GNU Parted 3.3
Using /dev/nvme0n1
Welcome to GNU Parted! Type <span class="s1">'help'</span> to view a list of commands.
<span class="o">(</span>parted<span class="o">)</span> print 
Error: /dev/nvme0n1: unrecognised disk label
Model: WDC PC SN730 SDBQNTY-256G-1001 <span class="o">(</span>nvme<span class="o">)</span>                              
Disk /dev/nvme0n1: 256GB
Sector size <span class="o">(</span>logical/physical<span class="o">)</span>: 512B/512B
Partition Table: unknown
Disk Flags: 

<span class="o">(</span>parted<span class="o">)</span> mklabel gpt                                                      

<span class="o">(</span>parted<span class="o">)</span> mkpart primary fat32 1MiB 3MiB                                   

<span class="o">(</span>parted<span class="o">)</span> mkpart primary fat32 3MiB 803MiB

<span class="o">(</span>parted<span class="o">)</span> mkpart primary 803MiB <span class="nt">-1</span>                                         

<span class="o">(</span>parted<span class="o">)</span> name 1 grub                                                      
<span class="o">(</span>parted<span class="o">)</span> name 2 boot                                                      
<span class="o">(</span>parted<span class="o">)</span> name 3 luks                                                      

<span class="o">(</span>parted<span class="o">)</span> <span class="nb">set </span>1 bios_grub on
<span class="o">(</span>parted<span class="o">)</span> <span class="nb">set </span>2 boot on                                                    

<span class="o">(</span>parted<span class="o">)</span> print                                                            
Model: WDC PC SN730 SDBQNTY-256G-1001 <span class="o">(</span>nvme<span class="o">)</span>
Disk /dev/nvme0n1: 256GB
Sector size <span class="o">(</span>logical/physical<span class="o">)</span>: 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system  Name  Flags
 1      1049kB  3146kB  2097kB  fat32        grub  bios_grub
 2      3146kB  842MB   839MB   fat32        boot  boot, esp
 3      842MB   256GB   255GB                luks

<span class="o">(</span>parted<span class="o">)</span> quit
</code></pre></div></div>
<h2 id="create-fat-filesystems">Create fat filesystems</h2>
<p>On the BIOS boot partition we’ll create a plain FAT filesystem, while on the boot partition we’ll create a FAT32 filesystem:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd ~# mkfs.vfat /dev/nvme0n1p1 
mkfs.fat 4.1 <span class="o">(</span>2017-01-24<span class="o">)</span>
livecd ~# mkfs.vfat <span class="nt">-F32</span> /dev/nvme0n1p2 
mkfs.fat 4.1 <span class="o">(</span>2017-01-24<span class="o">)</span>
</code></pre></div></div>
<h2 id="encrypt-partition-with-luks">Encrypt partition with LUKS</h2>
<p>Now we can encrypt the third partition /dev/nvme0n1p3 with LUKS:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd ~# cryptsetup luksFormat <span class="nt">-c</span> aes-xts-plain64 <span class="nt">-s</span> 512 /dev/nvme0n1p3

WARNING!
<span class="o">========</span>
This will overwrite data on /dev/nvme0n1p3 irrevocably.

Are you sure? <span class="o">(</span>Type <span class="s1">'yes'</span> <span class="k">in </span>capital letters<span class="o">)</span>: YES
Enter passphrase <span class="k">for</span> /dev/nvme0n1p3: 
Verify passphrase: 
livecd ~# 
</code></pre></div></div>
<p>Open the LUKS device</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd ~# cryptsetup luksOpen /dev/nvme0n1p3 luksdev
Enter passphrase <span class="k">for</span> /dev/nvme0n1p3: 
livecd ~#
</code></pre></div></div>
<h2 id="create-btrfs-filesystem-and-subvolume">Create Btrfs filesystem and subvolume</h2>
<p>I’ll create the following Btrfs volume/filesystem and subvolumes:</p>

<table>
  <tbody>
    <tr>
      <td><strong>Volume</strong></td>
      <td><strong>Subvolume name</strong></td>
      <td><strong>Parent Volume</strong></td>
      <td><strong>Mount point</strong></td>
      <td><strong>Description</strong></td>
    </tr>
    <tr>
      <td>/dev/nvme0n1p3</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>Btrfs primary volume / device</td>
    </tr>
    <tr>
      <td>-</td>
      <td>@</td>
      <td>/dev/nvme0n1p3</td>
      <td>/</td>
      <td>Root filesystem</td>
    </tr>
    <tr>
      <td>-</td>
      <td>@home</td>
      <td>/dev/nvme0n1p3</td>
      <td>/home</td>
      <td>home filesystem</td>
    </tr>
    <tr>
      <td>-</td>
      <td>@snapshots</td>
      <td>/dev/nvme0n1p3</td>
      <td>/.snapshots</td>
      <td>snapshots filesystem</td>
    </tr>
  </tbody>
</table>

<p>Create the Btrfs filesystem with the <strong>Gentoo</strong> label and mount it under <strong>/mnt/gentoo</strong>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd ~# mkfs.btrfs <span class="nt">-L</span> Gentoo /dev/mapper/luksdev 
btrfs-progs v5.4.1 
See http://btrfs.wiki.kernel.org <span class="k">for </span>more information.

Detected a SSD, turning off metadata duplication.  Mkfs with <span class="nt">-m</span> dup <span class="k">if </span>you want to force metadata duplication.
Label:              Gentoo
UUID:               a000eea9-d97c-4107-ae39-602049a6acaa
Node size:          16384
Sector size:        4096
Filesystem size:    237.67GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         single            8.00MiB
  System:           single            4.00MiB
SSD detected:       <span class="nb">yes
</span>Incompat features:  extref, skinny-metadata
Checksum:           crc32c
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1   237.67GiB  /dev/mapper/luksdev

livecd ~# mount /dev/mapper/luksdev /mnt/gentoo
</code></pre></div></div>
<p>Create all subvolumes:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd ~# btrfs subvolume create /mnt/gentoo/@
Create subvolume <span class="s1">'/mnt/gentoo/@'</span>

livecd ~# btrfs subvolume create /mnt/gentoo/@home
Create subvolume <span class="s1">'/mnt/gentoo/@home'</span>

livecd ~# btrfs subvolume create /mnt/gentoo/@snapshots
Create subvolume <span class="s1">'/mnt/gentoo/@snapshots'</span>

livecd ~# btrfs subvolume list /mnt/gentoo
ID 256 gen 7 top level 5 path @
ID 257 gen 8 top level 5 path @home
ID 258 gen 9 top level 5 path @snapshots
livecd ~#
</code></pre></div></div>
<p><strong>umount</strong> /mnt/gentoo, because for the root filesystem we’ll use the <strong>@</strong> subvolume:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd ~# umount /mnt/gentoo
</code></pre></div></div>
<p>and finally we can mount our root filesystem based on @ subvolume:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd ~#  mount <span class="nt">-t</span> btrfs <span class="nt">-o</span> noatime,relatime,compress<span class="o">=</span>lzo,ssd,space_cache,subvol<span class="o">=</span>@ /dev/mapper/luksdev /mnt/gentoo
</code></pre></div></div>

<p><strong>WARNING</strong>: if you have a recent kernel (for example 5.19), the space_cache option must be <strong>space_cache=v2</strong>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd ~#  mount <span class="nt">-t</span> btrfs <span class="nt">-o</span> noatime,relatime,compress<span class="o">=</span>lzo,ssd,space_cache<span class="o">=</span>v2,subvol<span class="o">=</span>@ /dev/mapper/luksdev /mnt/gentoo
</code></pre></div></div>

<h2 id="gentoo-installation">Gentoo installation</h2>
<p>Now it’s time to get your hands dirty.</p>
<h3 id="install-systemd-stage3">Install systemd stage3</h3>
<p>Download the systemd stage3 tarball from <a href="https://gentoo.org/downloads/">Gentoo</a></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd ~# <span class="nb">cd</span> /mnt/gentoo/
livecd /mnt/gentoo <span class="c"># wget https://bouncer.gentoo.org/fetch/root/all/releases/amd64/autobuilds/20201222T005811Z/stage3-amd64-systemd-20201222T005811Z.tar.xz</span>
</code></pre></div></div>
<p>Unarchive the tarball</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd /mnt/gentoo <span class="c"># tar xpvf stage3-*.tar.xz --xattrs-include='*.*' --numeric-owner</span>
</code></pre></div></div>
<h3 id="configuring-compile-options">Configuring compile options</h3>
<p>Open /mnt/gentoo/etc/portage/make.conf file and configure the system with your preferred optimization variables. Have a look at <a href="https://wiki.gentoo.org/wiki/Handbook:AMD64/Full/Installation#Configuring_compile_options">Gentoo Handbook</a></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd /mnt/gentoo <span class="c"># vi /mnt/gentoo/etc/portage/make.conf</span>
</code></pre></div></div>
<p>For example, below you can find my make.conf optimization variables</p>
<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># These settings were set by the catalyst build script that automatically
# built this stage.
# Please consult /usr/share/portage/config/make.conf.example for a more
# detailed example.
</span><span class="py">COMMON_FLAGS</span><span class="p">=</span><span class="s">"-march=skylake -O2 -pipe"</span>
<span class="py">CFLAGS</span><span class="p">=</span><span class="s">"${COMMON_FLAGS}"</span>
<span class="py">CXXFLAGS</span><span class="p">=</span><span class="s">"${COMMON_FLAGS}"</span>
<span class="py">FCFLAGS</span><span class="p">=</span><span class="s">"${COMMON_FLAGS}"</span>
<span class="py">FFLAGS</span><span class="p">=</span><span class="s">"${COMMON_FLAGS}"</span>

<span class="py">CPU_FLAGS_X86</span><span class="p">=</span><span class="s">"aes avx avx2 f16c fma3 mmx mmxext pclmul popcnt sse sse2 sse3 sse4_1 sse4_2 ssse3"</span>

<span class="py">MAKEOPTS</span><span class="p">=</span><span class="s">"-j8"</span>

<span class="py">GRUB_PLATFORMS</span><span class="p">=</span><span class="s">"efi-64"</span>

<span class="py">L10N</span><span class="p">=</span><span class="s">"en it"</span>

<span class="py">USE</span><span class="p">=</span><span class="s">"-gtk -gnome systemd networkmanager pulseaudio spice usbredir udisks offensive cryptsetup ocr bluetooth bash-completion opengl opencl vulkan v4l x265 theora policykit vaapi vdpau lto cec cameras_ptp2 wayland"</span>

<span class="py">POLICY_TYPES</span><span class="p">=</span><span class="s">"targeted"</span>
<span class="py">INPUT_DEVICES</span><span class="p">=</span><span class="s">"libinput"</span>

<span class="py">ACCEPT_KEYWORDS</span><span class="p">=</span><span class="s">"~amd64"</span>
<span class="py">ACCEPT_LICENSE</span><span class="p">=</span><span class="s">"* -@EULA"</span>

<span class="py">VIDEO_CARDS</span><span class="p">=</span><span class="s">"intel i965 iris"</span>

<span class="py">LLVM_TARGETS</span><span class="p">=</span><span class="s">"BPF WebAssembly"</span>

<span class="py">PYTHON_TARGETS</span><span class="p">=</span><span class="s">"python2_7 python3_7 python3_8 python3_9"</span>

<span class="c"># NOTE: This stage was built with the bindist Use flag enabled
</span><span class="py">PORTDIR</span><span class="p">=</span><span class="s">"/var/db/repos/gentoo"</span>
<span class="py">DISTDIR</span><span class="p">=</span><span class="s">"/var/cache/distfiles"</span>
<span class="py">PKGDIR</span><span class="p">=</span><span class="s">"/var/cache/binpkgs"</span>

<span class="c"># This sets the language of build output to English.
# Please keep this setting intact when reporting bugs.
</span><span class="py">LC_MESSAGES</span><span class="p">=</span><span class="s">C</span>
</code></pre></div></div>
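<p>One common tweak, if you copy this make.conf, is to size <strong>MAKEOPTS</strong> to the number of available CPU threads instead of hard-coding -j8; a minimal sketch, assuming <em>nproc</em> from coreutils is available:</p>

```shell
# Generate a MAKEOPTS line sized to this machine's CPU thread count.
jobs=$(nproc)
echo "MAKEOPTS=\"-j${jobs}\""
```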
<h3 id="chrooting">Chrooting</h3>
<p>Copy DNS configurations:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd /mnt/gentoo <span class="c"># cp --dereference /etc/resolv.conf /mnt/gentoo/etc/</span>
</code></pre></div></div>
<p>Mount proc, dev and shm filesystems:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd /mnt/gentoo <span class="c"># mount -t proc /proc /mnt/gentoo/proc</span>
livecd /mnt/gentoo <span class="c"># mount --rbind /sys /mnt/gentoo/sys</span>
livecd /mnt/gentoo <span class="c"># mount --make-rslave /mnt/gentoo/sys</span>
livecd /mnt/gentoo <span class="c"># mount --rbind /dev /mnt/gentoo/dev</span>
livecd /mnt/gentoo <span class="c"># mount --make-rslave /mnt/gentoo/dev</span>
livecd /mnt/gentoo <span class="c"># test -L /dev/shm &amp;&amp; rm /dev/shm &amp;&amp; mkdir /dev/shm</span>
livecd /mnt/gentoo <span class="c"># mount -t tmpfs -o nosuid,nodev,noexec shm /dev/shm</span>
livecd /mnt/gentoo <span class="c"># chmod 1777 /dev/shm</span>
</code></pre></div></div>
<p>chroot to /mnt/gentoo:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>livecd /mnt/gentoo <span class="c"># chroot /mnt/gentoo /bin/bash </span>
livecd / <span class="c"># source /etc/profile</span>
livecd / <span class="c"># export PS1="(chroot) $PS1"</span>
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd / <span class="c">#</span>
</code></pre></div></div>
<p>mounting the boot partition:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd / <span class="c"># mount /dev/nvme0n1p2 /boot</span>
</code></pre></div></div>
<p>and then mount <strong>/home</strong> and <strong>/.snapshots</strong> subvolumes:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd / <span class="c"># mkdir /.snapshots</span>

<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd / <span class="c"># mount -t btrfs -o noatime,relatime,compress=lzo,ssd,subvol=@snapshots /dev/mapper/luksdev /.snapshots</span>

<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd / <span class="c"># mount -t btrfs -o noatime,relatime,compress=lzo,ssd,subvol=@home /dev/mapper/luksdev /home</span>
</code></pre></div></div>

<h3 id="updating-the-gentoo-ebuild-repository">Updating the Gentoo ebuild repository</h3>
<p>Update the Gentoo ebuild repository to the latest version:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# emerge <span class="nt">--sync</span>
</code></pre></div></div>
<h4 id="update-portage">Update portage</h4>
<p>If at the end of emerge sync you see a message like this:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">Action: sync for repo: gentoo, returned code = 0

 * An update to portage is available. It is _highly_ recommended
 * that you update portage now, before any other packages are updated.

 * To update portage, run 'emerge --oneshot sys-apps/portage' now.
</span></code></pre></div></div>
<p>If that’s your case, update portage:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# emerge <span class="nt">--oneshot</span> sys-apps/portage
</code></pre></div></div>
<h3 id="choosing-the-right-profile-with-systemd">Choosing the right profile (with systemd)</h3>
<p>Choose one of the available systemd profiles; for example, for my system I selected desktop/plasma/systemd:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~ <span class="c"># eselect profile list</span>
...
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~ <span class="c"># eselect profile set 9</span>
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~ <span class="c"># eselect profile list</span>
Available profile symlink targets:
...
  <span class="o">[</span>9]   default/linux/amd64/17.1/desktop/plasma/systemd <span class="o">(</span>stable<span class="o">)</span> <span class="k">*</span>
...
</code></pre></div></div>
<h3 id="timezone">Timezone</h3>
<p>Update the timezone, for example Europe/Rome</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# <span class="nb">echo </span>Europe/Rome <span class="o">&gt;</span> /etc/timezone
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# emerge <span class="nt">--config</span> sys-libs/timezone-data
</code></pre></div></div>
<h3 id="configure-locales">Configure locales</h3>
<p>If you want only a few locales on your system, for example C, en_US and it_IT:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd /etc <span class="c"># cat locale.gen </span>
...
C.UTF8 UTF-8
en_US ISO-8859-1
en_US.UTF-8 UTF-8
it_IT ISO-8859-1
it_IT.UTF-8 UTF-8

<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd /etc <span class="c"># locale-gen</span>
</code></pre></div></div>
<p>Now you can choose your preferred locale with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd /etc <span class="c"># eselect locale list</span>
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd /etc <span class="c"># eselect locale set 10</span>
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd /etc <span class="c"># eselect locale list</span>
...
<span class="o">[</span>10]  C.UTF8 <span class="k">*</span>
</code></pre></div></div>
<p>reload the environment</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd /etc <span class="c"># env-update &amp;&amp; source /etc/profile &amp;&amp; export PS1="(chroot) $PS1"</span>
</code></pre></div></div>
<h3 id="updating-the-world">Updating the world</h3>
<p>If you changed your profile, or changed your USE flags, run the update:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# <span class="nb">mkdir</span> <span class="nt">-p</span> /etc/portage/package.<span class="o">{</span>accept_keywords,license,mask,unmask,use<span class="o">}</span>
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# <span class="nb">time </span>emerge <span class="nt">--ask</span> <span class="nt">--verbose</span> <span class="nt">--update</span> <span class="nt">--deep</span> <span class="nt">--with-bdeps</span><span class="o">=</span>y <span class="nt">--newuse</span>  <span class="nt">--keep-going</span> <span class="nt">--backtrack</span><span class="o">=</span>30 @world
</code></pre></div></div>
<p>now you can take a coffee :coffee::coffee::coffee:</p>

<h3 id="optional---gcc-upgrade">Optional - GCC Upgrade</h3>
<p>If you are on amd64 testing, updating the world most probably installed a new version of GCC, so we can switch to it now:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# gcc-config <span class="nt">--list-profiles</span>
 <span class="o">[</span>1] x86_64-pc-linux-gnu-9.3.0 <span class="k">*</span>
 <span class="o">[</span>2] x86_64-pc-linux-gnu-10.2.0
</code></pre></div></div>
<p>Set the default profile to 2, which in the example above corresponds to GCC 10.2:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# gcc-config 2
 <span class="k">*</span> Switching native-compiler to x86_64-pc-linux-gnu-10.2.0 ...
<span class="o">&gt;&gt;&gt;</span> Regenerating /etc/ld.so.cache...                                                                                                                                                                                                                                    <span class="o">[</span> ok <span class="o">]</span>

 <span class="k">*</span> If you intend to use the gcc from the new profile <span class="k">in </span>an already
 <span class="k">*</span> running shell, please remember to <span class="k">do</span>:

 <span class="k">*</span>   <span class="nb">.</span> /etc/profile

<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# <span class="nb">source</span> /etc/profile
livecd ~# <span class="nb">export </span><span class="nv">PS1</span><span class="o">=</span><span class="s2">"(chroot) </span><span class="nv">$PS1</span><span class="s2">"</span>
</code></pre></div></div>
<p>After that, re-emerge libtool:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# emerge <span class="nt">--ask</span> <span class="nt">--oneshot</span> <span class="nt">--usepkg</span><span class="o">=</span>n sys-devel/libtool
</code></pre></div></div>
<h3 id="optional---install-vim">Optional - Install vim</h3>
<p>If, like me, you hate the nano editor, you can now install vim:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# emerge <span class="nt">--ask</span> app-editors/vim
</code></pre></div></div>
<h2 id="configure-fstab">Configure fstab</h2>
<p>This file defines which partitions are mounted, and where, at boot.
Run blkid to list the UUIDs:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# blkid 
/dev/loop0: <span class="nv">TYPE</span><span class="o">=</span><span class="s2">"squashfs"</span>
/dev/nvme0n1p1: <span class="nv">SEC_TYPE</span><span class="o">=</span><span class="s2">"msdos"</span> <span class="nv">UUID</span><span class="o">=</span><span class="s2">"1C3B-C680"</span> <span class="nv">BLOCK_SIZE</span><span class="o">=</span><span class="s2">"512"</span> <span class="nv">TYPE</span><span class="o">=</span><span class="s2">"vfat"</span> <span class="nv">PARTLABEL</span><span class="o">=</span><span class="s2">"grub"</span> <span class="nv">PARTUUID</span><span class="o">=</span><span class="s2">"39bd8ab8-3708-4ecf-b4e3-f5714a6e4ea1"</span>
/dev/nvme0n1p2: <span class="nv">UUID</span><span class="o">=</span><span class="s2">"1D97-3854"</span> <span class="nv">BLOCK_SIZE</span><span class="o">=</span><span class="s2">"512"</span> <span class="nv">TYPE</span><span class="o">=</span><span class="s2">"vfat"</span> <span class="nv">PARTLABEL</span><span class="o">=</span><span class="s2">"boot"</span> <span class="nv">PARTUUID</span><span class="o">=</span><span class="s2">"b44718e2-d7fe-4eba-a6c7-1d1beee11806"</span>
/dev/nvme0n1p3: <span class="nv">UUID</span><span class="o">=</span><span class="s2">"1ce717f4-5a82-49e7-ae1c-9a92e4c20251"</span> <span class="nv">TYPE</span><span class="o">=</span><span class="s2">"crypto_LUKS"</span> <span class="nv">PARTLABEL</span><span class="o">=</span><span class="s2">"luks"</span> <span class="nv">PARTUUID</span><span class="o">=</span><span class="s2">"c271c93e-6f59-446f-9139-a0b98afab820"</span>
/dev/sda1: <span class="nv">BLOCK_SIZE</span><span class="o">=</span><span class="s2">"2048"</span> <span class="nv">UUID</span><span class="o">=</span><span class="s2">"2020-12-22-12-44-06-70"</span> <span class="nv">LABEL</span><span class="o">=</span><span class="s2">"Gentoo amd64 20201222T005811Z"</span> <span class="nv">TYPE</span><span class="o">=</span><span class="s2">"iso9660"</span> <span class="nv">PTUUID</span><span class="o">=</span><span class="s2">"7437c9e0"</span> <span class="nv">PTTYPE</span><span class="o">=</span><span class="s2">"dos"</span> <span class="nv">PARTUUID</span><span class="o">=</span><span class="s2">"7437c9e0-01"</span>
/dev/sda2: <span class="nv">SEC_TYPE</span><span class="o">=</span><span class="s2">"msdos"</span> <span class="nv">LABEL_FATBOOT</span><span class="o">=</span><span class="s2">"GENTOOLIVE"</span> <span class="nv">LABEL</span><span class="o">=</span><span class="s2">"GENTOOLIVE"</span> <span class="nv">UUID</span><span class="o">=</span><span class="s2">"A168-D76E"</span> <span class="nv">BLOCK_SIZE</span><span class="o">=</span><span class="s2">"512"</span> <span class="nv">TYPE</span><span class="o">=</span><span class="s2">"vfat"</span> <span class="nv">PARTUUID</span><span class="o">=</span><span class="s2">"7437c9e0-02"</span>
/dev/mapper/luksdev: <span class="nv">LABEL</span><span class="o">=</span><span class="s2">"Gentoo"</span> <span class="nv">UUID</span><span class="o">=</span><span class="s2">"a000eea9-d97c-4107-ae39-602049a6acaa"</span> <span class="nv">UUID_SUB</span><span class="o">=</span><span class="s2">"d45b2afd-7250-4ba1-a896-b0e81a20fa4b"</span> <span class="nv">BLOCK_SIZE</span><span class="o">=</span><span class="s2">"4096"</span> <span class="nv">TYPE</span><span class="o">=</span><span class="s2">"btrfs"</span>
</code></pre></div></div>
<p>Copy the UUID of the root filesystem on the <strong>luksdev</strong> device (in the above example, a000eea9-d97c-4107-ae39-602049a6acaa), and also the UUID of the boot filesystem, which resides on the /dev/nvme0n1p2 partition (in the above example, 1D97-3854).</p>
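<p>Instead of copying the UUIDs by hand, you can print fstab-ready lines from a small helper (a sketch; <em>fstab_lines</em> is a hypothetical function, and the commented blkid call assumes the device paths used in this guide):</p>

```shell
# Print fstab-ready lines for the given boot and root UUIDs.
fstab_lines() {
    printf 'UUID=%s\t/boot\tvfat\tnoatime\t0 1\n' "$1"
    printf 'UUID=%s\t/\tbtrfs\tnoatime,compress=lzo,ssd,subvol=@\t0 0\n' "$2"
}
# On the real system, feed it the UUIDs straight from blkid:
#   fstab_lines "$(blkid -s UUID -o value /dev/nvme0n1p2)" \
#               "$(blkid -s UUID -o value /dev/mapper/luksdev)"
fstab_lines 1D97-3854 a000eea9-d97c-4107-ae39-602049a6acaa
```

Review the output before pasting it into /etc/fstab.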

<p>This is my fstab</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd /etc <span class="c"># cat fstab</span>
<span class="c"># /etc/fstab: static file system information.</span>
<span class="nv">UUID</span><span class="o">=</span>1D97-3854                                  /boot                           vfat            noatime                                                                         0 1
<span class="nv">UUID</span><span class="o">=</span>a000eea9-d97c-4107-ae39-602049a6acaa       /                               btrfs           noatime,relatime,compress<span class="o">=</span>lzo,ssd,discard<span class="o">=</span>async,subvol<span class="o">=</span>@            0 0
<span class="nv">UUID</span><span class="o">=</span>a000eea9-d97c-4107-ae39-602049a6acaa       /home                           btrfs           noatime,relatime,compress<span class="o">=</span>lzo,ssd,discard<span class="o">=</span>async,subvol<span class="o">=</span>@home        0 0
<span class="nv">UUID</span><span class="o">=</span>a000eea9-d97c-4107-ae39-602049a6acaa       /.snapshots                     btrfs           noatime,relatime,compress<span class="o">=</span>lzo,ssd,discard<span class="o">=</span>async,subvol<span class="o">=</span>@snapshots   0 0
<span class="c"># tmps</span>
tmpfs                                           /tmp                            tmpfs           defaults,size<span class="o">=</span>4G                                                                0 0
tmpfs                                           /run                            tmpfs           <span class="nv">size</span><span class="o">=</span>100M                                                                       0 0
<span class="c"># shm</span>
shm                                             /dev/shm                        tmpfs           nodev,nosuid,noexec                                                             0 0
</code></pre></div></div>
<p><strong>WARNING</strong>: I’ve been using the <strong>discard=async</strong> option because I’m running a kernel newer than 5.6; if you’re on kernel 5.4.x (or older), don’t use it!</p>
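<p>A quick sketch to check whether your running kernel is new enough for <strong>discard=async</strong> (the 5.6 threshold comes from the warning above; <em>supports_discard_async</em> is a hypothetical helper, not part of any tool):</p>

```shell
# Return 0 (true) when kernel version $1 is >= 5.6.
supports_discard_async() {
    major=${1%%.*}           # "5.4.38-gentoo" -> "5"
    rest=${1#*.}             # -> "4.38-gentoo"
    minor=${rest%%[!0-9]*}   # strip everything from the first non-digit -> "4"
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 6 ]; }
}

if supports_discard_async "$(uname -r)"; then
    echo "discard=async is fine on this kernel"
else
    echo "omit discard=async"
fi
```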

<h2 id="installing-the-sources">Installing the sources</h2>
<p>Install the kernel, genkernel and cryptsetup:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# emerge <span class="nt">--ask</span> sys-kernel/gentoo-sources
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# emerge <span class="nt">--ask</span> sys-kernel/genkernel
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# emerge <span class="nt">--ask</span> sys-fs/cryptsetup
</code></pre></div></div>
<h3 id="optional---installing-firmware">Optional - Installing firmware</h3>
<p>Some drivers require additional firmware; if you use any of them, you need to install the firmware package:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# emerge <span class="nt">--ask</span> sys-kernel/linux-firmware
</code></pre></div></div>
<h3 id="optional---installing-intel-microcode">Optional - Installing Intel microcode</h3>
<p>If you have an Intel CPU and want to update its microcode, install the intel-microcode package:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# <span class="nb">mkdir</span> <span class="nt">-p</span> /etc/portage/package.use
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# <span class="nb">echo</span> <span class="s2">"sys-firmware/intel-microcode initramfs"</span> <span class="o">&gt;</span> /etc/portage/package.use/intel-microcode
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# emerge <span class="nt">--ask</span> sys-firmware/intel-microcode
</code></pre></div></div>
<h3 id="configure-genkernelconf">Configure genkernel.conf</h3>
<p>Configure genkernel for systemd, LUKS and Btrfs:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# <span class="nb">cd</span> /etc/
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd /etc <span class="c"># cp -p genkernel.conf genkernel.conf.ORIG</span>
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd /etc <span class="c"># vim genkernel.conf</span>
...
<span class="nv">MAKEOPTS</span><span class="o">=</span><span class="s2">"</span><span class="si">$(</span>portageq envvar MAKEOPTS<span class="si">)</span><span class="s2">"</span>
...
<span class="nv">LUKS</span><span class="o">=</span><span class="s2">"yes"</span>
...
<span class="nv">BTRFS</span><span class="o">=</span><span class="s2">"yes"</span>
...
</code></pre></div></div>
<h3 id="run-genkernel">Run genkernel</h3>
<p>Configure your kernel with your preferred options, then build the kernel and the initramfs:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# <span class="nb">time </span>genkernel all
</code></pre></div></div>
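<p>Whatever other options you pick, the kernel configuration must include at least device-mapper, the LUKS crypto primitives and Btrfs; as a sketch, the relevant mainline Kconfig symbols are:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># minimal pieces for LUKS-on-Btrfs root (built in, or shipped in the initramfs)
CONFIG_BLK_DEV_DM=y
CONFIG_DM_CRYPT=y
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_XTS=y
CONFIG_CRYPTO_SHA256=y
CONFIG_BTRFS_FS=y
</code></pre></div></div>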
<h2 id="install-and-configure-grub">Install and configure grub</h2>
<h3 id="install-grub">Install grub</h3>
<p>Configure GRUB to use the efi-64 platform:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# <span class="nb">echo</span> <span class="s1">'GRUB_PLATFORMS="efi-64"'</span> <span class="o">&gt;&gt;</span> /etc/portage/make.conf
</code></pre></div></div>
<p>emerge grub:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# emerge <span class="nt">--ask</span> sys-boot/grub
</code></pre></div></div>
<h3 id="install-grub-1">Run grub-install</h3>
<p>Install GRUB into the EFI system partition mounted at /boot:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# grub-install <span class="nt">--target</span><span class="o">=</span>x86_64-efi <span class="nt">--efi-directory</span><span class="o">=</span>/boot
</code></pre></div></div>
<h3 id="configure-grub">Configure grub</h3>
<p>First find the LUKS UUID and the root filesystem UUID:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# blkid  | egrep <span class="s1">'(crypto_LUKS|luksdev)'</span>
/dev/nvme0n1p3: <span class="nv">UUID</span><span class="o">=</span><span class="s2">"1ce717f4-5a82-49e7-ae1c-9a92e4c20251"</span> <span class="nv">TYPE</span><span class="o">=</span><span class="s2">"crypto_LUKS"</span> <span class="nv">PARTLABEL</span><span class="o">=</span><span class="s2">"luks"</span> <span class="nv">PARTUUID</span><span class="o">=</span><span class="s2">"c271c93e-6f59-446f-9139-a0b98afab820"</span>
/dev/mapper/luksdev: <span class="nv">LABEL</span><span class="o">=</span><span class="s2">"Gentoo"</span> <span class="nv">UUID</span><span class="o">=</span><span class="s2">"a000eea9-d97c-4107-ae39-602049a6acaa"</span> <span class="nv">UUID_SUB</span><span class="o">=</span><span class="s2">"d45b2afd-7250-4ba1-a896-b0e81a20fa4b"</span> <span class="nv">BLOCK_SIZE</span><span class="o">=</span><span class="s2">"4096"</span> <span class="nv">TYPE</span><span class="o">=</span><span class="s2">"btrfs"</span>
</code></pre></div></div>
<p>In my case the LUKS UUID is <strong>1ce717f4-5a82-49e7-ae1c-9a92e4c20251</strong> and the root UUID is <strong>a000eea9-d97c-4107-ae39-602049a6acaa</strong>.</p>

<p>Edit <strong>/etc/default/grub</strong>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# <span class="nb">cd</span> /etc/default/
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd /etc/default <span class="c"># cp -p grub grub.ORIG</span>
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd /etc/default <span class="c"># vim grub</span>
</code></pre></div></div>
<p>Most importantly, configure <strong>GRUB_CMDLINE_LINUX</strong> with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">GRUB_CMDLINE_LINUX</span><span class="o">=</span><span class="s2">"init=/usr/lib/systemd/systemd crypt_root=UUID=1ce717f4-5a82-49e7-ae1c-9a92e4c20251 root=UUID=a000eea9-d97c-4107-ae39-602049a6acaa rootflags=subvol=@"</span>
</code></pre></div></div>
<p>where</p>

<table>
  <tbody>
    <tr>
      <td><strong>parameter</strong></td>
      <td><strong>description</strong></td>
    </tr>
    <tr>
      <td>init</td>
      <td>use systemd as init (/usr/lib/systemd/systemd)</td>
    </tr>
    <tr>
      <td>crypt_root</td>
      <td>the LUKS UUID (from blkid) of the encrypted third partition, /dev/nvme0n1p3</td>
    </tr>
    <tr>
      <td>root</td>
      <td>the Btrfs UUID (reported by blkid for the /dev/mapper/luksdev device)</td>
    </tr>
    <tr>
      <td>rootflags</td>
      <td>tells the kernel which Btrfs subvolume contains the root filesystem (@)</td>
    </tr>
  </tbody>
</table>
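<p>The same kernel command line can be assembled from the blkid output above; a sketch, with my UUIDs hard-coded (on a live system you could fetch them with <code>blkid -s UUID -o value /dev/nvme0n1p3</code> and the luksdev equivalent):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># UUIDs as reported by blkid above
LUKS_UUID="1ce717f4-5a82-49e7-ae1c-9a92e4c20251"
ROOT_UUID="a000eea9-d97c-4107-ae39-602049a6acaa"

CMDLINE="init=/usr/lib/systemd/systemd crypt_root=UUID=${LUKS_UUID} root=UUID=${ROOT_UUID} rootflags=subvol=@"
echo "GRUB_CMDLINE_LINUX=\"${CMDLINE}\""
</code></pre></div></div>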

<p>Finally, run grub-mkconfig:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd /etc/default <span class="c"># cd</span>
<span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# grub-mkconfig <span class="nt">-o</span> /boot/grub/grub.cfg
</code></pre></div></div>
<h2 id="set-the-root-password">Set the root password</h2>
<p>Remember to set the root password:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# passwd
</code></pre></div></div>
<h2 id="rebooting-the-system">Rebooting the system</h2>
<p>Exit the chrooted environment and reboot:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span><span class="nb">chroot</span><span class="o">)</span> livecd ~# <span class="nb">exit
</span>livecd /mnt/gentoo <span class="c"># shutdown -r now</span>
</code></pre></div></div>

<h2 id="optional-add-swap-subvolume-and-swapfile">Optional: Add swap subvolume and swapfile</h2>
<p>If you want to add a swapfile inside your Btrfs filesystem, then after the first reboot first check the name of the LUKS device mapper created by your initramfs; if you’re using genkernel this device mapper is called <strong>root</strong>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~# blkid | <span class="nb">grep </span>Gentoo
/dev/mapper/root: <span class="nv">LABEL</span><span class="o">=</span><span class="s2">"Gentoo"</span> <span class="nv">UUID</span><span class="o">=</span><span class="s2">"a000eea9-d97c-4107-ae39-602049a6acaa"</span> <span class="nv">UUID_SUB</span><span class="o">=</span><span class="s2">"d45b2afd-7250-4ba1-a896-b0e81a20fa4b"</span> <span class="nv">BLOCK_SIZE</span><span class="o">=</span><span class="s2">"4096"</span> <span class="nv">TYPE</span><span class="o">=</span><span class="s2">"btrfs"</span>
</code></pre></div></div>
<p>now mount it under the /mnt/gentoo directory:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~# <span class="nb">mkdir</span> <span class="nt">-p</span> /mnt/gentoo
~# mount /dev/mapper/root /mnt/gentoo
</code></pre></div></div>
<p>create the <strong>@swap</strong> subvolume:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~# btrfs subvolume create /mnt/gentoo/@swap
Create subvolume <span class="s1">'/mnt/gentoo/@swap'</span>
</code></pre></div></div>
<p>create an empty swapfile first (mine will be 4G, but adapt the size to your needs):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~# <span class="nb">truncate</span> <span class="nt">-s</span> 0 /mnt/gentoo/@swap/swapfile
</code></pre></div></div>
<p>disable the copy on write and Btrfs compression:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~# chattr +C /mnt/gentoo/@swap/swapfile

~# btrfs property <span class="nb">set</span> /mnt/gentoo/@swap/swapfile compression none
</code></pre></div></div>
<p>then allocate the 4G, fix the permissions, and format it as swap:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~# fallocate <span class="nt">-l</span> 4G /mnt/gentoo/@swap/swapfile

~# <span class="nb">chmod </span>0600 /mnt/gentoo/@swap/swapfile

~# mkswap /mnt/gentoo/@swap/swapfile
Setting up swapspace version 1, size <span class="o">=</span> 4 GiB <span class="o">(</span>4294963200 bytes<span class="o">)</span>
</code></pre></div></div>
<p>create the mountpoint <strong>/.swap</strong>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~# <span class="nb">mkdir</span> /.swap
</code></pre></div></div>
<p>and add two lines to the fstab: one for the <strong>@swap</strong> subvolume, and one for the swapfile <strong>/.swap/swapfile</strong>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~# <span class="nb">cd</span> /etc

/etc# <span class="nb">cat </span>fstab
<span class="c"># /etc/fstab: static file system information.</span>
<span class="nv">UUID</span><span class="o">=</span>1D97-3854                                  /boot                           vfat            noatime                                                                         0 1
<span class="nv">UUID</span><span class="o">=</span>a000eea9-d97c-4107-ae39-602049a6acaa       /                               btrfs           noatime,relatime,compress<span class="o">=</span>lzo,ssd,discard<span class="o">=</span>async,subvol<span class="o">=</span>@            0 0
<span class="nv">UUID</span><span class="o">=</span>a000eea9-d97c-4107-ae39-602049a6acaa       /home                           btrfs           noatime,relatime,compress<span class="o">=</span>lzo,ssd,discard<span class="o">=</span>async,subvol<span class="o">=</span>@home        0 0
<span class="nv">UUID</span><span class="o">=</span>a000eea9-d97c-4107-ae39-602049a6acaa       /.snapshots                     btrfs           noatime,relatime,compress<span class="o">=</span>lzo,ssd,discard<span class="o">=</span>async,subvol<span class="o">=</span>@snapshots   0 0
<span class="nv">UUID</span><span class="o">=</span>a000eea9-d97c-4107-ae39-602049a6acaa       /.swap                          btrfs           noatime,relatime,compress<span class="o">=</span>no,ssd,discard<span class="o">=</span>async,subvol<span class="o">=</span>@swap         0 0
/.swap/swapfile                                 none                            swap            sw                                                                              0 0
<span class="c"># tmps</span>
tmpfs                                           /tmp                            tmpfs           defaults,size<span class="o">=</span>4G                                                                0 0
tmpfs                                           /run                            tmpfs           <span class="nv">size</span><span class="o">=</span>100M                                                                       0 0
<span class="c"># shm</span>
shm                                             /dev/shm                        tmpfs           nodev,nosuid,noexec                                                             0 0
</code></pre></div></div>
<p>Remember to set <strong>compress=no</strong> to speed up the swapfile, and use <strong>discard=async</strong> only if your kernel is newer than 5.6.</p>
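<p>With those fstab entries in place you can also activate the swapfile immediately, without waiting for the reboot (a sketch; run as root):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~# mount /.swap
~# swapon /.swap/swapfile
~# swapon --show
</code></pre></div></div>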

<p>Finally, umount /mnt/gentoo and reboot:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~# umount /mnt/gentoo
~# reboot
</code></pre></div></div>]]></content><author><name></name></author><category term="gentoo" /><category term="gentoo" /><category term="luks" /><category term="systemd" /><summary type="html"><![CDATA[Introduction In this post I’ll describe how to install Gentoo with systemd stage3 tarball on UEFI LUKS partition and Btrfs filesystem, using the standard de facto @ subvolume as root file system.]]></summary></entry></feed>