Fernando Cejas

Rust cross-platform… The Android part…

2023-07-27T00:00:00+00:00

“Nothing is impossible. The word itself says ‘I’m possible!”

Introduction and Whys

Even in a world of Kotlin Multiplatform, there are other options, which might cover other specifc use case scenarios (more about it along this post).

This is actually the main reason, why I would like to present Rust as a candidate for code reusability across different platforms… in this case for Android Development.

DISCLAIMER: The idea is NOT to develop a full android application entirely in Rust, but to delegate specific functionality by integrating this language, which brings high performance and memory safety between its main characteristics.

Our Goal

Our project consists of an Android Application that will call Rust code in order to encrypt/decrypt a given String:

Our Android App calling Rust code.

Where is the code?

Before continuing, it is worth mentioning that the entire codebase sits in a Github Repository containig extra documentation and code comments to facilitate UNDERSTANDING and LEARNING.

The Big Picture

In a nutshell, our project will follow this flow:

Our global project overview.

Rust and Android interaction involves a bunch of parts (my approach is to have 2 separated projects that we can independently evolve):

Rust compilation takes place in the first place.
JNI artifacts (libraries) are generated for different android cpu architectures and instruction sets.
These artifacts (extension .so) should be placed in jniLibs folder inside the Android Project.
Android consumes them via Java Native Interface (JNI).

As a next step, let’s run the project, break things down and dive deeper into each part.

Running the Project

After cloning the repo, follow the steps below.

Requirements

Android SDK and NDK installed.
ANDROID_HOME env variable pointing to the Android Sdk location: mine is at /home/fernando/Android/Sdk.
Android NDK version should match the one inside the jni_crypto/build.rs file.
- In my case $ANDROID_HOME/ndk/25.2.9519653 matches with ANDROID_NDK_VERSION = "25.2.9519653".
Rust latest edition. If in trouble, check the project Cargo.toml file for the correct one.
Your IDE and Editor of preference.

Generating Rust artifacts (crates)

Go to rust-library/jni_cryptor folder.
Run cargo run --bin release.
Run cargo run --bin publish.
OPTIONAL: cargo test.

Running the Android App

Import in Android Studio the android-sample folder (build.gradle.kts file).
Run app via IDE.

Crypto: The Rust Project

The Rust project structure (called crypto) looks like this:

Our Rust ‘crypto’ project overview.

cryptor: Our core crate where we perform string encryption/decryption.
cryptor_global: As it name established, a global crate for code reusability.
cryptor_jni: Our JNI exposed API that act as a proxy by calling cryptor functions.

NOTE: We focus on the sub-projects that involve Android, so do not worry about the content of the other folders, since each of them is independent and they do not affect each other.

Crypto: Show me the code

Let’s use the example of text encryption (it is simplified by only base64-encoding a string). So here is our encrypt function in Rust, part of the crypto crate inside the cryptor/src/libs.rs file:

use base64::{
    Engine as _, 
    engine::general_purpose::STANDARD as base64Engine
};

///
/// Encrypts a String.
/// 
pub fn encrypt(to: &str) -> String {
    base64Engine.encode(String::from(to))
}

And a tiny test for it:

use cryptor;

#[test]
fn test_encrypt_string() {
    let to_encrypt = "hello_world_from_rust";
    let str_encoded_b64 = "aGVsbG9fd29ybGRfZnJvbV9ydXN0";

    let encrypted_result = cryptor::encrypt(&to_encrypt);
    
    assert_eq!(str_encoded_b64, encrypted_result);
}

Now we need our JNI Api in place, which makes use of our crypto crate (as showcased in the crypto project structure picture above). This sits inside the cryptor_jni/scr/libs.rs file:

///
/// [cfg(target_os = "android")]: Compiler flag ("cfg") which exposes
/// the JNI interface for targeting Android in this case
/// 
/// [allow(non_snake_case)]: Tells the compiler not to warn if
/// we are not using snake_case for a variable or function names.
/// For Android Development we want to be consistent with code style. 
/// 
#[cfg(target_os = "android")]
#[allow(non_snake_case)]
pub mod android {

    extern crate jni;
    
    // This is the interface to the JVM 
    // that we'll call the majority of our
    // methods on.
    // @See https://docs.rs/jni/latest/jni/
    use self::jni::JNIEnv;

    // These objects are what you should use as arguments to your 
    // native function. They carry extra lifetime information to 
    // prevent them escaping this context and getting used after 
    // being GC'd.
    use self::jni::objects::{JClass, JString};
    
    // This is just a pointer. We'll be returning it from our function. 
    // We can't return one of the objects with lifetime information 
    // because the lifetime checker won't let us.
    use self::jni::sys::jstring;
    
    use cryptor::encrypt;

    ///
    /// Encrypts a String.
    /// 
    #[no_mangle] // This keeps Rust from "mangling" the name so it is unique (crate).
    pub extern "system" fn Java_com_fernandocejas_rust_Cryptor_encrypt<'local>(
        mut env: JNIEnv<'local>,
        // This is the class that owns our static method. It's not going to be used,
        // but still must be present to match the expected signature of a static
        // native method.
        _class: JClass<'local>,
        input: JString<'local>,
    ) -> jstring {

        // First, we have to get the string out of Java. Check out the `strings`
        // module for more info on how this works.
        let to_encrypt: String = env.get_string(&input)
                                    .expect("Couldn't get java string!").into();

        // We encrypt our str calling the cryptor library
        let encrypted_str = encrypt(&to_encrypt);
        
        // Here we have to create a new Java string to return. Again, more info
        // in the `strings` module.
        let output = env.new_string(&encrypted_str)
                        .expect("Couldn't create Java String!");

        // Finally, extract the raw pointer to return.
        output.into_raw()
    }
}

Something to pay a bit of attention to, is the function signature, which we will cover in our android project part. But for now, let’s leave it here and focus on our artifact (crate) generation.

NOTE: I have used the jni crate for this purpose, which has excellent documentation.

Crypto: Artifact Generation

At this point, our Rust code is in place, and we need to generate our .so artifacts via cargo (Rust package manager).

When building the crypto_jni crate with the cargo build command (inside our crypto_jni foler), cargo first searches for a build script file (build.rs) in the root folder of the project in order to execute it.

AND HERE IS WHERE THE MAGIC HAPPENS!!!… so let’s have a look at what is inside our build.rs file:

...
static ANDROID_NDK_VERSION: &str = "25.2.9519653";
...
fn main() {
    system::rerun_if_changed("build.rs");

    create_android_targets_config_file();
    add_android_targets_to_toolchain();
}

Basically we are creating a cargo config file containing android targets information, needed by cargo to perform cross compilation.

Run cargo build inside the cryptor_jni folder and once done open the generated file at rust-library/cryptor_jni/.cargo/config, which should look similar to this:

[target.armv7-linux-androideabi]
ar = ".../ndk/25.2.9519653/.../linux-x86_64/bin/arm-linux-androideabi-ar"
linker = ".../ndk/25.2.9519653/.../linux-x86_64/bin/armv7a-linux-androideabi21-clang"

[target.i686-linux-android]
ar = ".../ndk/25.2.9519653/.../linux-x86_64/bin/i686-linux-android-ar"
linker = ".../ndk/25.2.9519653/.../linux-x86_64/bin/i686-linux-android21-clang"

[target.aarch64-linux-android]
ar = ".../ndk/25.2.9519653/.../linux-x86_64/bin/aarch64-linux-android-ar"
linker = ".../ndk/25.2.9519653/.../linux-x86_64/bin/aarch64-linux-android21-clang"

[target.x86_64-linux-android]
ar = ".../ndk/25.2.9519653/.../linux-x86_64/bin/x86_64-linux-android-ar"
linker = ".../ndk/25.2.9519653/.../linux-x86_64/bin/x86_64-linux-android21-clang"

Each target in the above config file derives from the Official Android Documentation on “Using NDK with other build systems” which basically states that in order to build for an specific cpu architecture/type and instruction set (ABI), the Android NDK provides pre-compiled toolchains that need to be used (Ex. arm-linux-androideabi-ar and armv7a-linux-androideabi21-clang).

Now cargo knows what to build and how, for instance, the next step is to add those targets to the rust toolchain, which is basically what this line of code is doing:

fn main() {
    ...
    /// ## Examples
    /// `rustup target add arm-linux-androideabi`
    ///
    /// Reference:
    /// - https://rust-lang.github.io/rustup/cross-compilation.html
    add_android_targets_to_toolchain();
}

The above code enables us to run the following commands if we wanted to individually build our targets:

cargo build --target armv7-linux-androideabi
cargo build --target i686-linux-android 
cargo build --target aarch64-linux-android
cargo build --target x86_64-linux-android

Although this is perfect valid, it is tedious… that is why it is a good practice to AUTOMATE ALL THE THINGS (as much as possible). And this is done by the cryptor_jni/src/bin/release.rs file, relying on cargo binary targets, which are basically programs that can be executed after comopilation:

cargo run --bin release

Last but not least, there is another binary target call publish (publish.rs file) that we can execute:

cargo run --bin publish

This will copy all the generated targets to its corresponding android directories in our android-sample project.

Crypto: Android ABIs

We have been mentioning ABIs previously along this article, but what is that exactly and how does an ABI relate to a target?

ABI stands for Application Binary Interface, which is a combination of a CPU type/architecture and instruction set. In Android Development any NDK target must be mapped to a specific directory in the project. This relationship is as following according to the documentation:

 -------------------------------------------------------------
  ANDROID TARGET                ABI (folder inside `jniLibs`)
 -------------------------------------------------------------
  armv7a-linux-androideabi ---> armeabi-v7a  
  aarch64-linux-android    ---> arm64-v8a    
  i686-linux-android       ---> x86	        
  x86_64-linux-android     ---> x86_64       
 -------------------------------------------------------------

Crypto: Infrastructure Improvements

So far, evertyhing is ready for development on the Rust side with some automation… But of course, there are a couple of IMPROVEMENTS that I did not want to skip… and even though these are OUT OF SCOPE of this article, they are definitely worth highlighting:

There is NO Semantic Versioning for crates, which is something required as soon as the project grows in complexity.
Artifacts are copied and overriden directly inside the android project (jniLibs directory): ideally these ones should be properly versioned (mentioned above) and uploaded to a crates repository or similar.

Android: Implementation Details

On the Android side of things, there are a couple of moving parts that we have to take into consideration.

Android: Setting Up the Build System

In the build.gradle.kts we need to add NDK configuration:

android {
    ...
    ndk {
      // Specifies the ABI configurations of your native
      // libraries Gradle should build and package with your APK.
      // Here is a list of supported ABIs:
      // https://developer.android.com/ndk/guides/abis
      abiFilters.addAll(
        setOf(
          "armeabi-v7a",
          "arm64-v8a",
          "x86",
          "x86_64"
        )
      )
    }
    ...
}

Android: Loading Rust Libraries

This is done at Android Application Class level:

class AndroidApplication : Application() {

    override fun onCreate() {
        super.onCreate()
        loadJNILibraries()
    }

    private fun loadJNILibraries() {
        /**
         * Loads the Crypto C++/Rust (via JNI) Library.
         *
         * IMPORTANT:
         * The name passed as argument () maps to the
         * original library name in our Rust project.
         */
        System.loadLibrary("cryptor_jni")
    }
}

Android: Calling Rust from Kotlin

In order to call Rust via JNI, we have to respect method/function signature. This is essential and a MUST so that classes, functions and methods can be found by the android runtime.

Remember this piece of code from our cryptor_jni project that encrypts a String:

...
pub extern "system" fn Java_com_fernandocejas_rust_Cryptor_encrypt<'local>(
    mut env: JNIEnv<'local>,
    _class: JClass<'local>,
    input: JString<'local>,
) -> jstring {
    ...
}
...

Invoking it from Kotlin, would mean creating a kotlin class RESPECTING package and function naming:

package com.fernandocejas.rust

/**
 * Helper that acts as an interface between native
 * code (in this case Rust via JNI) and Kotlin.
 *
 * By convention the function signatures should respect
 * the original ones from Rust via JNI Project.
 */
class Cryptor {

    /**
     * Encrypt a string.
     *
     * This is an external call to Rust using
     * the Java Native Interface (JNI).
     *
     * @link https://developer.android.com/ndk/samples/sample_hellojni
     */
    @Throws(IllegalArgumentException::class)
    external fun encrypt(string: String): String
...
}

We are done!!! Now we can inject our Cryptor class where it is needed and encrypt/decrypt Strings:

private val cryptor = Cryptor()
val encryptedString = cryptor.encrypt("something")

Use Case Scenarios

You might be wondering what is the real purpose of all this wiring for integrating Rust with Android… At the moment I can think of some real use cases:

Music Player: this comes from my experience at SoundCloud, where we had our project in C++ cross platform compiled.
Video Player or Media Library: similar use case as a Music Player but with video encoding/decoding.
Encryption Library: as showcased in this post but with more advanced functionality and low level details.
Existent Project: sometimes we cannot control 100% our environment, so how does Mozilla integrates its Gecko Engine on Firefox Android?

Conclusion

As a conlusion, I would say… WE DO NOT WANT to replace Kotlin with Rust. Kotlin is very good as what it is meant to be: Android Development in THIS CASE. But keep in mind that PICKING THE BEST TOOL FOR THE RIGHT JOB is essential to fulfill project requirements, and it is in this context where we can count on Rust as ONE MORE TOOL IN OUR TOOLBOX.

Ufff… that was a loooong post… but if you made it to this point, you should definitely feel proud of yourself… Till the next time and do not forget to provide FEEDBACK!!!

References

An over-engineered Home Lab with Docker and Kubernetes.

2023-01-06T00:00:00+00:00

“Instead of WORRYING about what you cannot control, SHIFT your ENERGY to what you can create.”

Introduction

After a long time of procrastination, I finally finished this blog post where I would like to share my journey on setting up my personal HOME LAB. This includes 2 different approaches: Kubernetes and Docker.

I will also highlight some of the problems I bumped into, the cost of maintenance and some tips and tricks.

SPOILER: I started with Kubernetes and ended up with a pure/plain Docker approach. Both are great tools and should be used for their intended purpose. So in order to understand what happened here, let’s start this journey and jump on this train!!!

DISCLAIMER: This article is OPINIONATED but also PROVIDES THE TECHNICAL KNOWLEDGE to get started with Docker and Kubernetes, so I definitely encourage you TO READ TILL THE END (with pacience).

But…Why???

I strongly believe that this is the first question we have to ask ourselves whenever we decide to go for such a complex project.

We have to establish our goals, so here are mine:

LEARNING PURPOSE: Kubernetes and Docker are mature tools that have been around for years and they are widely used.
DATA PRIVACY: Keeping and owning my own data is a HIGH PRIORITY for me.

Whenever we make such a decision, we need to keep in mind the commitment to the project, which includes the following time investment:

MAINTENANCE: we have to keep our infrastructure constantly ‘up to date’… and yes, automate all the things, but still checking release notes, incompatible breaking changes, migrations, etc, which could lead to headaches and extra time consumption.
TROUBLESHOOTING: there is always going to be issues, not only for the items mentiones above, but also because of the server(s) hardware, network, internet connection, etc. This is a project with complex moving parts, so we have to be prepared for it too.

I hope you are not already freaking out… just KEEP READING, there is always light at the end of the tunnel, and this is, by the way, the main reason why we are here too :).

Bare Metal

Software does not exists without Hardware backing it up… and to be honest, I could have done pretty much all this via some cloud providers, but then, neither this is no longer a HOME LAB nor a fully PRIVATE HOME SERVER, so I opted for the following bare metal:

2 Intel NUC Mini PCs.
1 Modem/Router hosting my Internet Connection with Port Forwarding Support.
1 Router with OpenWRT OS support.
1 Network Attached Storage (NAS).

Do not worry if you are not familiar with some of these mentioned concepts, I will try to make them clear and explain their responsibilities in the project’s architecture.

Arch Linux with LTS Kernel is my choice as my Home Lab Server.

Main Software Stack

Kubernetes: for managing containerized workloads and services.
Docker: for managing containerized applications.
Linux.
Arch: with LTS Kernel (for stability) as the main OS for the Servers.
OpenWRT: for network handling (DNS, Port Forwarding, DHCP, etc) inside my router.
Wireguad: for the encrypted virtual private network (VPN).

The Kubernetes Approach

This was my first approach, based on Kubernetes, so in order to understand what tha means, here is what the official website writes down:

“Kubernetes is a portable, extensible, open source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.”

At first I set up the official Kubernetes (k8s) but then I realized that there is a more lightweight version of it, which fulfilled my needs. It is also called Kubernetes (k3s), and it is basically a single binary that only takes up around 50 MB of space and has low resource usage of around 300 MB of RAM. Even thought there are tiny differences, they are both mostly compatible, so learning one will pretty much cover the other. That is perfect!!!

Kubernetes: The Main Components

Before continuing, it is VERY IMPORTANT to understand some of the concepts or fundamental blocks that are part of Kubernetes. Here is a summary in a very simplistic way:

Cluster: When we deploy Kubernetes, we get a cluster.
Node: A node is a working machine in Kubernetes.
Pod: A Pod represents a set of running containers in our cluster.
Control Plane: The container orchestration layer that exposes the API and interfaces to define, deploy and manage the lifecycle of containers.

In essence, a Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. Every cluster has at least one worker node. The worker node(s) host the Pods that are the components of the application workload. The control plane manages the worker nodes and the Pods in the cluster. In production environments, the control plane usually runs across multiple computers and a cluster usually runs multiple nodes, providing fault-tolerance and high availability.

Of course we are just scratching the surface here, but for our purpose, that is enough. There is way more, and for a deeper explanation, refer to the official Kubernetes Documentation.

Kubernetes: Home Lab Architecture

Now that we understand the Kubernetes main concepts, here is a raw picture of my Home Lab Infrastructure with Kubernetes (k3s):

Home Lab General Architecture with Kubernetes.

WHAT IS GOING ON? In a nutshell, here is the normal flow when accessing any of the hosted services in my Arch Linux servers:

Traffic comming from the Internet (via Dynamic DNS) is received by the Router.
The router runs OpenWRT as OS and hosts and manages the VPN using Wireguard (more on this in the Security Section).
Once the request passes the security checks of the VPN, we are inside our Local Area Network (LAN), zone of Linux Servers and for instance, the Kubernetes Cluster.
The Kubernetes Cluster is composed by 2 Nodes running Linux: one node being the master (Kubernetes Control Plane) and the other is a worker.
In Reality both Kubernetes Nodes can act as workers, meaning that the load is distributed between both nodes via Ingress (usually NGINX), which acts as a Load Balancer and Reverse Proxy.
Persistence is handled by NFS (Network File System), which means that there is only one single point where I store my data/information.
The NAS (Network Attached Storage) contains some services specific to it (from Synology) and acts as a drive (via NFS) that both Linux Servers see as a local drive.

Kubernetes: Application Flow

Now that we have the big picture on what is going on, mostly at hardware level (mentioned in the previous section), the next step would be to answer the following question:

What happens when I reach any hosted app contained in the Kubernetes Cluster?

A picture is worth a thousand words:

Kubernetes Application Flow.

As we can see, this is the flow:

A request enters our Kubernetes cluster from the outside (either from the Internet or LAN).
As mentioned before, an Ingress has these main functions:
- Load Balancer: when it is declared as Service.Type=LoadBalancer.
- Reverse Proxy: exposes HTTP and HTTPS routes from outside the cluster to services within the cluster.
A Service is a method for exposing a network application that is running as one or more Pods in our cluster (if we skip setting up a service, we are not gonna be able to reach our containerized apps).
Pods are the smallest deployable units of computing that we can create and manage. Each of them is a group of one or more containers, with shared storage and network resources, and a specification for how to run the containers.

Pods in Kubernetes are EPHIMERAL: they are intended to be disposable and replaceable. We cannot add a container to a Pod once it has been created. Instead, we usually delete and replace Pods in a controlled fashion using deployments.

Kubernetes: Cluster setup

Now we have to get our hands dirty and start setting up our cluster.

At this point in time, I assume that we have the minimum set of requirements in place:

A LINUX SERVER up and running (if it is just for testing, we could also use a couple of VMs).
SSH (or alternative/){: target=”_blank” }) properly configure in our headless server in order to manage it.
OPTIONAL: NFS up and running in our server, in case we want to store our data/information in an external network drive outside of our Linux Server.

DISCLAIMER: Since documentation tends to get out of date, this initial Kubernetes setup will be done by pointing to the official documentation for each of the components we have to configure/install.

These are the steps and list of ingredients we need for our recipe:

1. Install k3s MASTER and WORKER nodes.

2. Install kubectl (if not already installed after Step 1) in order to connect remotely to the cluster.

3. Install Helm if necessary, a package manager for Kubernetes, which will facilitate actually installing packages in our cluster.

4. Setup a Load Balancer. k3s already comes with ServiceLB, but I found MetalLB the right option for bare metal, because it makes the setup easier on clusters that are not running on cloud providers (we have to disable ServiceLB though){: target=”_blank” }.

5. Install Nginx Web/Reverse Proxy, which is our Ingress, in order to expose HTTP and HTTPS routes from outside the cluster to services within the cluster. k3s recommends another option too: Traefik, so up to you.

6. Install and configure cert-manager. I would label this as OPTIONAL but I guess we want to be able to have valid SSL/TSL Certificates to avoid our browser warning us when accessing our hosted applications.

7. Deploy and configure Kubernetes Dashboard, which is a web-based Kubernetes user interface.

If everything went well so far, we should be able to see information about our cluster by running:

$ kubectl get nodes -o wide

NAME           STATUS   ROLES    AGE     VERSION         INTERNAL-IP    EXTERNAL-IP
kube-master    Ready    master   44h     v1.25.0+k3s.1   192.168.0.22   
kube-worker    Ready       2m47s   v1.25.0+k3s.1   192.168.0.23   

Or we can also access our Kubernetes Dashboard (sample picture):

The kubernetes-dashboard provides a great UI to manage our cluster.

Kubernetes: Administration

We have a variety of tools in this area:

kubectl (already installed).
Kubernetes Dashboard (already installed).
Lens.
kuvenav.
k9s.

I would say that it is up to you, to choose the one that better fulfills your requirements.

Also, let’s not forget to check the Addons sections in the Kubernetes Official Documentation.

k9s is such a powerful Kubernetes Terminal Client.

Kubernetes: Application Example

This is a simple example where we will be deploying https://draw.io/ to our Kubernetes cluster.

First, we create a namespace for our draw.io application called home-cloud-drawio

$ kubectl create namespace home-cloud-drawio

Second, we create a file called drawio-app.yml with the following content:

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: home-cloud-drawio
  name: drawio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: drawio
  template:
    metadata:
      labels:
        app: drawio
    spec:
      containers:
      - name: drawio
        image: jgraph/drawio
        resources:
          limits:
            memory: "256Mi"
            cpu: "800m"
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  namespace: home-cloud-drawio
  name: drawio-service
spec:
  selector:
    app: drawio
  ports:
  - port: 5001
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: home-cloud-drawio
  name: drawio-ingress
  labels:
    name: drawio-ingress
spec:
  rules:
  - host: home-cloud-drawio
    http:
      paths:
      - pathType: Prefix
        path: "/"
        backend:
          service:
            name: drawio-service
            port: 
              number: 5001
  ingressClassName: "nginx"

As a third step, we apply the configuration contained in the drawio-app.yml file:

$ kubectl apply -f drawio-app.yml

BOOM!!! We have basically created a Deployment, which includes a Service and Ingress configuration to access our hosted app from outside the cluster.

Now let’s check the running services to corroborate that everything works as expected:

kubectl get services -o wide --all-namespaces

NAMESPACE            NAME              TYPE           CLUSTER-IP      EXTERNAL-IP
default              kubernetes        ClusterIP      10.43.0.1                  
kube-system          kube-dns          ClusterIP      10.43.0.10                 
kube-system          metrics-server    ClusterIP      10.43.33.97                
kube-system          nginx-ingress     LoadBalancer   10.43.196.229   192.168.0.200   
home-cloud-drawio    drawio            ClusterIP      10.43.35.88     

We can access our application by visiting http://192.168.0.200 in our browser (avoid the SSL/TSL warning).

In this example we have not added any extra complexity (for learning purpose), but if a hosted app requires Storage, we will have to create Kubernetes Persistent Volumes too. Same with, for example Let’s Encrypt Certificates

TIP: As a rule of thumb, all our infrastructure logic and files should be in a VCS like git.

Kubernetes: Useful Commands

kubectl is a very powerful CLI, it has great documentation and a very useful cheatsheet.

These are some of the most common commands I use:

# Cluster information
$ kubectl cluster-info
$ kubectl get nodes -o wide

# Check runnint Services
$ kubectl get services -o wide --all-namespaces

# Check running Ingress
$ kubectl get ingresses --all-namespaces

# Display all the running Pod
$ kubectl get pods -A -o wide

# Get logs for an specific Pod
$ kubectl logs -f  -n 

# Get information about an specific Pod
$ kubectl describe pod  -n 

Rules of (Over)-Complexity

Ok, so at this point in time…I LEARNED A LOT (and invested a lot of time too)…but I also HAD HEADACHES, and this is where the Rule of Seven applied:

THE RULE OF SEVEN: never try to juggle more than seven mental balls.

In the end, I had a bunch of moving parts (with Kubernetes) which turned to be super complicated for what I really needed, plus I had a cluster with a lot of capacity that I was barely using (refer to the Monitoring Section for more on this).

That is why I decided to apply what I ALWAYS encourage in my daily work life:

Reduce complexity by removing balls.
Do not reinvent the wheel.
YAGNI: You Aren’t Gonna Need It.

A simpler Docker Approach

Based on my previous points, then a pure Docker approach (with docker compose) was the way to go:

Home Lab General Architecture with Docker.

Upfront, this infrastructure architecture seems very similar to the one defined with Kubernetes, and indeed it is, the flow is the same as described above and server configuration is equal too. The biggest changeset has to do with implementation details:

I only need one server (load distribution is off here).
Handling configuration files with docker is easier.
Less moving parts, les complexity, for instance, less to maintain.
I do not need a system for Microservices Orchestration.

As an example, we will setup the same application as above: draw.io with docker compose:

We add this content to our docker compose file called home-lab.yml:

version: "3.8"

services:

  traefik:
    image: traefik:latest
    container_name: traefik
    command:
      # Dynamic Configuration: mostly used for TSL certificates
      - --providers.file.filename=/etc/traefik/dynamic_conf.yml
      # Entrypoints configuration
      - --entrypoints.web-secure.address=:443
    labels:
      - traefik.http.routers.traefik_route.rule=Host(`traefik.home.lab`)
      - traefik.http.routers.traefik_route.tls=true
      - traefik.http.routers.traefik_route.service=traefik_service
      - traefik.http.services.traefik_service.loadbalancer.server.port=8080
    ports:
      - 80:80
      - 443:443
    volumes:
      - ~/traefik/dynamic_conf.yml:/etc/traefik/dynamic_conf.yml
      - ~/traefik/_wildcard.home.lab.pem:/etc/traefik/_wildcard.home.lab.pem
      - ~/traefik/_wildcard.home.lab-key.pem:/etc/traefik/_wildcard.home.lab-key.pem
    networks:
      - home-lab-network
    restart: always

  drawio:
    image: jgraph/drawio:latest
    container_name: drawio
    labels:
      - traefik.http.routers.drawio_route.rule=Host(`drawio.home.lab`)
      - traefik.http.routers.drawio_route.tls=true
      - traefik.http.routers.drawio_route.service=drawio_service
      - traefik.http.services.drawio_service.loadbalancer.server.port=8080
    networks:
      - home-lab-network
    restart: always

Let’s understand first what is going on within this file:

We define 2 services: traefik and drawio.
Traefik is our reverse proxy:
- Acts as our home lab entry point and forward requests to app containers.
- Manages SSL/TSL Certificates: I use self-signed certificates with mkcert for my custom domain: home.lab.
Traefik SSL/TLS configuration uses the dynamic_conf.yml defined in our docker home-lab.yml file, volumes section, which looks like this:

tls:
  certificates:
    - certFile: /etc/traefik/_wildcard.home.lab.pem
      keyFile: /etc/traefik/_wildcard.home.lab-key.pem
      stores:
        - default

  stores:
    default:
      defaultCertificate:
        certFile: /etc/traefik/_wildcard.home.lab.pem
        keyFile: /etc/traefik/_wildcard.home.lab-key.pem

As a next step, we execute the following command to run our containers:

$ docker compose -f home-lab.yml

BOOM!!! Working!!! Let’s double check:

$ docker ps -a

CONTAINER ID   IMAGE           STATUS       PORTS                    NAMES
de20745cda65   traefik:latest  Up 5 hours   0.0.0.0:80->80/tcp       traefik
as24545tda76   drawio:latest   Up 5 hours   0.0.0.0:8080->8080/tcp   drawio

To access our hosted app, let’s just open a browser and go to https://drawio.home.lab.

Useful Docker Commands

First, it is mandatory to check the official documentation and the docker CLI cheatsheet.

# Running containers
$ docker ps -a 
$ docker container ls -a

# Container management/handling
$ docker container stop 
$ docker container restart 
$ docker container rm 

# Image management/handling
$ docker images 
$ docker image rm 

# Existent Volumes
$ docker volume ls

Monitoring

We can use 4 main services for Alerting and Monitoring:

Prometheus: an open-source systems monitoring and alerting toolkit originally built at SoundCloud.
Grafana: allows us to query, visualize, alert on and understand metrics.
cAdvisor: provides an understanding of the resource usage and performance characteristics of running containers.
Portainer: is one of the most popular container management platform nowadays.

Useful official setup guides:

Here a screenshot of my Home Lab Monitoring/Alerting via the mentioned services/tools, where Prometheus scraps cAdvisor performance data and it is display on a Grafana dashboard:

Grafana - Prometheus - cAdvisor combo for Alerting and Monitoring.

Extra ball: We can use ctop locally in our Linux Server:

ctop provides a concise overview of real-time metrics for multiple containers.

Security

There is ‘NO 100%’ secure system, but we can always reduce risk. Personally:

I do not expose my NAS or Services to the Internet.
I only have a random port open in my router for my Wireguard VPN access.
Server and NAS are both encrypted.
I apply the latest security updates/patches (OS, Services and Infrastructure).

Alternatives to a VPN?

So far, I have mentioned that probably the safest way to access our Home Lab is to setup a Wireguard VPN, but there are a couple of alternatives to still setup our Home Lab for external access:

Cloudflare Tunnel: an encrypted tunnel between our origin web server and Cloudflare’s nearest data center, all without opening any public inbound ports.
Tailscale: in the end a bit of a Zero-Config VPN.

Honestly, I have no experience with them, since one of my main goals is PRIVACY, and it would be hard to proof whether they store META-DATA or INFORMATION about traffic.

Fault Tolerance and Resilience

Fault Tolerance simply means a system’s ability to continue operating uninterrupted despite the failure of one or more of its components.

A system is resilient if it continues to carry out its mission in the face of adversity.

Revisiting these concepts trigger a couple of questions we need to answer…

How can we make sure our Home Lab is highly available?

No silver bullets here, and I also gotta say that in this space our approach with Kubernetes clearly wins, especially due to the capacity of having multiple worker nodes (high availability by nature), thus if one of them fails, the other could continue operating and take the load of the down one. The downside is whether our Kubernetes control plane fails, then we are in the same situation as with our single Server approach with docker (check docker swarm for high availability).

In case of failure with our simpler docker approach, we have an ADVANTAGE too: it is relatively easy to re-run the entire infrastructure, which means only ONE COMMAND. And as this happened to me (so far once and fingers crossed), I just grabbed a backup of my data and configured everything in NO TIME on my local computer until I figured out the issue.

How can we keep our data/information safe?

Data redundancy occurs when the same piece of data is stored in two or more separate places.

My approach for DATA REDUNDACY includes 2 practices:

I use btrfs as file system with RAID 5 (4 hard drives) for data redundacy in case of a hardware failure.
3-2-1 Backup Strategy.

Server Administration

NOTE: The server should be HEADLESS, meaning that we should be able to fully CONTROL and RESTART it REMOTELY without the need of peripherals like a mouse or keyboard.

Assuming that our Server/NAS hard drives are encrypted and need to be decrypted remotely when restarting our Linux Server, we have a couple of options:

Start an SSH Service in our Server Initial RAM Disk (FREE).
Use a KVM Switch (MOSTLY EXPENSIVE).
Open and inexpensive DIY IP-KVM on Raspberry Pi (CHEAP).

Maintenance

The final Result

My Home Lab Dashboard using Homer.

Tips and Tricks

Always document: a WIKI is our friend:
Have a troubleshooting section.
Runbook approach.
Automate ALL the things.

Alternatives

If you reachead this point of the article and you are not convinced of using one or the other, then here I have a couple more alternatives to explore:

Proxmox: is an open-source virtualization platform.
Docker Swarm: is for natively managing a cluster of Docker Engines called a swarm.

Other Infrastructure tooling

I would not finish this article without mentioning some of the biggest players in IT-Infrastructure:

Terraform: it enables infrastructure automation for provisioning, compliance, and management of any cloud, datacenter, and service.
Ansible: is the simplest solution for automating routine IT tasks.
Packer: used for creating identical machine images for multiple platforms from a single source configuration.

Conclusion

Well, after many months or hard work, I’m finally writing this conclusion: It has been (and still is) a long journey, which let me dive into this amazing world of Infrastructure, full of challenges but also with tons of lessons learned. I can only say that this post aims to be a time saver for you, and a source of knowledge share and struggles.

As ALWAYS, any feedback is more than welcome! See you soon :).

References

Arch Linux System Maintainance.

2022-03-30T00:00:00+00:00

“No matter what you’re going through, there’s a light at the end of the tunnel.”

Introduction

System maintainance (and software maintainance in general) is an ongoing process that requires attention and responsibility.

So in this blog post I will summarize the key actions we can take in order to keep our arch linux installation healthy, optmized and fully working.

BTW, If you are NOT using Arch yet, I have a guide explaining how to install it from scracth and also a tiny wiki with information about daily tasks, process and guides.

DISCLAIMER: There is NO better place to everything related to Arch than the Arch Linux Wiki, but in this ocassion, I would like to save us some time, and summarize and pin point the most basic/important stuff.

System update/upgrade

It is very important to have the latest version of the system up and running (including users apps and packages). I gotta say that sometimes things get broken due to the nature of the rolling release model, but since each installation is different, we are responsible for checking Arch Linux latest news in the Arch Linux website..

Once done, we can proceed to perform a system update/upgrade by running:

$ sudo pacman -Syu 

or if you are using any AUR helper (in my case yay):

$ yay -Syu

Troubleshooting

In case of package as marginal trust:

error: : signature from "Someone " is marginal trust
 ...
Do you want to delete it? [Y/n] 

Then update the keyrings as following and run again the full system upgrade command:

$ sudo pacman -Sy archlinux-keyring

Clean pacman cache

The package manager is our source of truth when it comes to what we use in our system but its cache grows exponentially since it keeps ALL versions that we are installing/upgrading. This is of course usefull when it comes to system stability and rolling things back (by using pacman -U /var/cache/pacman/pkg/name-version.pkg.tar.gz) but it requires maintenance.

Let’s perform a bunch of checks before:

$ sudo ls /var/cache/pacman/pkg/ | wc -l  //cached packages
$ du -sh /var/cache/pacman/pkg/           //space used

We can use paccache for this purpose, so let’s install it first (if we do not already have it):

$ sudo pacman -Sy pacman-contrib

Now we can easily clean everything up and keep the latest 3 versions (default behavior):

$ sudo paccache -r

Remove orphan packages

Orphans are no more than unneeded dependencies that are a result of a package which was uninstalled. They waste storage space so they required attention too.

Let’s list all the orphans in our system:

$ sudo pacman -Qdtq

To remove all orphans let’s run:

$ sudo pacman -Qtdq | sudo pacman -Rns -

Troubleshooting

In case of error: argument '-' specified with empty stdin:

We do not have to worry, that means there are no orphans in our system. :)

Remove unwanted packages

Let’s list all the installed packages first in order to check whether we have software we are no longer using:

$ pacman -Qei | awk '/^Name/{name=$3} /^Installed Size/{print $4$5, name}' | sort -h

We can also list the ones installed from the AUR:

$ pacman -Qim | awk '/^Name/{name=$3} /^Installed Size/{print $4$5, name}' | sort -h

If we want to unistall all unneeded packages and their unused dependencies and configuration files:

$ sudo pacman -Rns $(pacman -Qdtq)

In case we want to individually uninstall packages, we use this command instead:

$ sudo pacman -Rns 

Clean /home directory cache

Our cache takes a lot of space as long as we use our system, so it is a good idea to check it out and clean it up accordingly. With the following command we can check its size:

$ sudo du -sh ~/.cache
$ 32G  /home/fernando/.cache

If we want to clear it up, we just remove its content:

$ rm -rf ~/.cache/*

System logs clean-up

System logs are always important to fix issues and to know what is going on within our Linux distro but again, they need a bit of maintenance.

Let’s first perform a system check to see how much space is being consumed by our logs:

$ journalctl --disk-usage

In order to remove logs we use the same command by limiting it by time (check the man for size limit and other alternatives):

$ sudo journalctl --vacuum-time=7d

If we want to permantely set this up by size, we can uncomment SystemMaxUse in /etc/systemd/journald.conf configuration file, to limit the disk usage used by these files by, for example: SystemMaxUse=500M in my case.

Conclusion

That is it… at the least the minimum and basic things…. Just know, that not all the mentioned steps are mandatory and should be done in one shot one after the other, but we want to make sure we care about the healthiness of our system by from time to time giving it a bit of love.

References

Cooking Effective Code Reviews.

2021-08-04T00:00:00+00:00

“HAPPINESS is not something ready made. It comes from your OWN ACTIONS.”

Introduction

Reviewing code is not an easy task. Most of the time code reviews come in a format of Pull Requests (PRs from now on). We all have our own style and guidelines when addressing them, so in this post I will share my own tips in order to make them effective and valuable.

As a starting point I would like to bring up a couple of inspirational coding quotes, which I keep in mind when writing code:

“Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.” - John F. Woods

“Programs must be written for people to read, and only incidentally for machines to execute.” - Harold Abelson

“I’m not a great programmer; I’m just a good programmer with great habits.” - Kent Beck

IT WORKS ON MY MACHINE…

Let’s get started by exploring a bunch of areas, which will help us to create structure and organization within our code reviews, plus things to pay attention to within the process.

Purpose and Importance

Before jumping into the process of reviewing code, we need to understand the whys behind code reviews and how they contribute to better software development.

Let’s enumerate them:

They ensure code quality: four eyes see more than two.
They act as documentation: they could be used to understand, learn and go back in time to check technical decision making.
They encourage collaboration and contribution: team work for the win.
They cultivate engineering culture: a great opportunity to ask questions, share expertise and suggest changes and fixes.

Effective code reviews ensure code quality.

PR Size

The first and most important part of a code review is its PR size. It is key to provide a concise size in order to facilitate the review: keep it short and straight to the point.

As a rule of thumb and in my experience:

200 lines of code are great.
400 lines of code are fine and manageable.
More than 500 lines of code is where things start to become overwhelming.

Remember that effective code reviews are small and often and if you feel you are breaking this rule, try to break code down into tinier chunks.

In situations like renamings or refactors which involve a bigger changeset (due to coupling, legacy code or tech debt: we have already been there right?), you can specify in the PR description or even *pair with someone and commit the changes directly.

PR Description

A good PR description is key in order to rapidly acquire context on what needs to be reviewed. Descriptions and PR themselves should follow the Single Responsibility Principle: do one thing and do it well.

They should basically contain (as short as possible):

Ticket: Link to a ticket from the issue tracker you are using.
Purpose: What?
Reason: Why?
Implementation Details: How?
Testing: Quick explanation of the test cases.
Documentation: Link to the documentation if required.

Boy Scout Rule: I know it is tempting to refactor something out of scope of a PR and include it in it, but… try to not fall into this trap and be gentle with the person reviewing your code. Being strict in this sense is important: you can always create another small PR with this changeset.

PRO TIP: If you are using Github, you can create templates (for consistency) with your desired fields to include within the PR description. This template will be triggered every time you create a PR.. Here is an example

Attitude and Communication

Communication in PRs is by nature asynchronous and because of this fact, we need to be careful to not block people. Reviewing code is a responsibility, it is not a “touch and go” process, so let’s ensure that we monitor our conversations and/or requested changes.

When it comes to attitude my tips are:

Always provide constructive feedback: Positive comments.
Remain objective: try to remove personal taste and always have technical reasons that back up your technical arguments.
Do not take any comment personally.
Impersonate comments and discussions: “Could you rename this constant?” Should be “this constant needs to be renamed”.

We should always have a positive attitude and provide constructive feedback.

Code Quality

In the beginning of the article we have highlighted code quality as one of the strongest benefits of reviewing code, which rises the questions of: What do we need to pay attention to at this level?

Tests

The first thing I do when reviewing code is to scroll down to the bottom of the PR in order to see the existence of tests:

if I do not find any, then I will automatically request changes.
If I do find them, then I make sure:
- I understand them, which helps to better understand the scope and purpose of the PR (tests by nature act as documentation).
- They are well designed.

PRO TIP: There is existent tooling to measure code coverage overall in your codebase or per PR, like codecov:

It is very important to include code coverage within PRs.

Design

To detect issues at code level, even though it is key to be proficient with the technology we are evaluating, it is even more important to be familiar with code anti-patterns, code smells, software engineering design principle and best practices, thus, we have this knowledge which will provide us with extra tools to easily detect potential issues.

In this aspect, I check:

Code smells: God classes/functions, magic numbers, naming conventions, code intention, etc.
Code is well-designed and consistent with the rest of the codebase.
There is no unnecessary complexity.
The functionality is friendly for the rest of developers to use/read/understand.
Changes look good If there is UI involved.
Parallel programming, security and any other aspect that belong to the scope of the PR.
Code Style: indentation, spaces/tabs/lines or any other aspect that conforms to the style guidelines we have.
Code documentation: for example an api or open source project.

Effective Code Reviews contribute to increase code quality.

Tooling

Your IDE: The best way is to actually pull your PR branch and run it locally.
Your Git repository hosting service: For example Github or Gitlab, they offer friendly web tooling to compare code, comment, highlight, track changes, commit, etc.
If you manage your own Git Service, then Gerrit is a well known open source alternative developed by Google.
A couple of paid tools: Crucible or Upsource.

Conclusion

Code reviews done wrong could become a big evil, leading to critical issues, not only at a technical level but also around collaboration and team morale. If we create consistency, structure and organization when reviewing our code, we will be contributing to better processes, discussions and higher software quality.

There’s a lot to gain out of conducting effective code reviews and I hope this post has shaded some light on the topic. As usual, any feedback or tips are more than welcome! Happy Coding!

Writing First-Class Features: BDD and Gherkin.

2021-01-23T00:00:00+00:00

“Simple things should be SIMPLE, complex things should be POSSIBLE”

Introduction

The title of this article might be a bit confusing…but in this post we are not going to talk about programming languages or architecture…

I do not want to break your expectations though, in essence, we are going to go a bit technical…but we will mostly focus on a Core Part of Product Development:

How to write First Class Features, always keeping in mind Engineering and its impact on the rest of the organization.

We are going to use the term Functionality, Feature and User Story interchangeably.

You can find more info on these definitions here.

The search of a common language

One of the problems that arise and that I see in organizations is global communication. Something that on paper should be easy to manage is most of the time compromised because of the lack of frameworks, tools or common language/vocabulary, thus, leading to misundestandings, incoordination and of course creating stress, pressure and friction.

Dealing with communication is one of the most challenging parts in an organization.

As our product evolves, there is the need to adopt a common vocabulary/language, interpreted by all the moving parts of our organization: business users, analysts, managers, engineers, etc. The idea is to effectively bridge communication gaps between different areas of an organization.

BDD (framework) and Gherkin (language) could help us to achieve this goal by favoring a more consistent communication channel, so let’s define both and see how we can make a good use of them.

What is Behavior Driven Development?

Let me quote Wikipedia here, which perfectly describes this concept:

Behavior-driven development (BDD) is a process that encourages collaboration among developers, QA and non-technical or business participants in a software project. It encourages teams to use conversation and concrete examples to formalize a shared understanding of how the application should behave.

Fundamentally BDD advocates the usage of a common vocabulary to create a domain specific language (DSL) in order to convert structured natural language statements into scenarios with acceptance criteria for a given function, and the tests used to validate that functionality.

Here is a representation if you are comming from the technical side of things:

GIVEN-WHEN-THEN are fundamental in BDD.

What is Gherkin?

Gherkin is a Business Readable, Domain Specific Language created especially for behavior descriptions. It gives us the ability to remove logic details from behavior tests, which turns it into a language that could be understood by anyone without getting deep into implementation details (from an Engineering Perspective).

It serves two main purposes:

Project’s documentation.
Automated tests.

As we can see, there is a strong relationship between Gherkin and BDD, so for instance, we can also agree that Gherkin is an implementation of BDD, thus, reponding very well to the approach GIVEN-WHEN-THEN and THREE AMIGOS collaboration:

Three amigos working together to get the best possible outcome.

Do not worry if you are a bit confused and you did not get it yet, an example and a real case scenario will make it more clear. Keep reading :).

Basic Syntax

The most basic building block consists of a feature description plus an scenario. This scenario consists of a list of steps, which must start with one of the keywords Given, When, Then. But or And are also allowed keyboards.

Here a quick example for a login functionality:

FEATURE: User Login
  In order to use our mobile client, users should be 
  able to authenticate.  

  SCENARIO 1: Login with email 
    GIVEN there is a login screen
    AND I have introduce my email and password
    WHEN I press the login button
    THEN I should be authenticated
    AND taken to the welcome screen

TIP: I tend to use Gherkin keyboards in uppercase in order to distinguish between COMMON LANGUAGE and DSL.

Multiple Scenarios

It is very common to have multiple scenarios that satisfy a functionality. Let’s take our example above to a new level by adding 2 more scenarios:

FEATURE: User Login
  In order to use our mobile client, users should be 
  able to authenticate.  

  SCENARIO 1: Login with email 
    GIVEN there is a login screen
    AND I have introduce my email and password
    WHEN I press the login button
    THEN I should be authenticated
    AND taken to the welcome screen

  SCENARIO 2: Login with phone number 
    GIVEN there is a login screen
    AND I have introduce my phone number and pin
    WHEN I press the login button
    THEN I should be authenticated
    AND taken to the welcome screen

Scenario Outlines

When we have similar scenarios with similar information, copying and pasting can become tedious and repetitive. There is a way to avoid this given the following example:

FEATURE: Tip Calculator
  After calculaing the total of the check, users
  should be able to optionally provide a tip. 

  SCENARIO 1: Tip out 5% of the total
    GIVEN the total of the bill is 100 euros
    WHEN I tip out 5% of the total
    THEN I should pay 105 euros

  SCENARIO 2: Tip out 10% of the total
    GIVEN the total of the bill is 200 euros
    WHEN I tip out 10% of the total
    THEN I should pay 220 euros 

  SCENARIO 2: Tip out 15% of the total
    GIVEN the total of the bill is 100 euros
    WHEN I tip out 15% of the total
    THEN I should pay 115 euros 

By using scenario outlining, we translate our previous example into:

FEATURE: Tip Calculator
  ...

  SCENARIO OUTLINE: Calculating tips
    GIVEN the total of the bill is 
    WHEN I tip out  of the total
    THEN I should pay  euros

    EXAMPLES:
      | TOTAL | TIP  | PAYMENT |
      |  100  |  5%  |   105   |
      |  200  | 10%  |   220   |
      |  100  | 15%  |   115   |

Backgrounds

Backgrounds allows us to add some context to all scenarios in a single feature:

FEATURE: Conversation Administrator Role
  ...

  BACKGROUNG:
    GIVEN A global administrator named "Fernando"
    AND A conversation group called "Android"
    AND a user called "Antje" not belonging to any conversation

  SCENARIO 1: Fernando rename conversation
    GIVEN I am logged in as Fernando
    WHEN I change the conversation name to "iOS"
    THEN I should see the new conversation name "iOS"
    AND I should see a message "Conversation name changed"

  SCENARIO 2: Fernando add member to conversation
    GIVEN I am logged in as Fernando
    WHEN I add "Antje" to the "Android" conversation
    THEN I should see a message "Antje added to the conversation"

Real World Example

@Wire is secure collaboration platform and as part of leadership, one of my responsibilities is to contribute with product coordination between stakeholders and mobile engineering. Currently we are in the process of improving the platform, and in this case, I wanted to share the re-writing of one of our functionalities: Email Verification.

Step 1: Feature Description

The global description of the functionality is in plain english and has an overview of it. There is no scenario definition and we can add here any kind of useful information that we consider useful for the understanding and further development.

Human-readable feature description with extra information.

Step 2: Tasks Break Down

It is worth mentioning that level of granularity in terms of breaking down tasks into smaller ones, will depend on the complexity of the feature. Always keep in mind the Divide and Conquer approach.

Diving and Conquer and Keep it simple are very important in sub-diving tasks.

Step 3: Define Scenarios

At this point, we can fully apply what we have learned so far: Scenario definition and acceptance criteria with Gerkin at sub-task level.

Gherkin is a Business Readable, Domain Specific Language created especially for behavior descriptions.

Tips for Writing User Stories

Writing features is not straightforward: many different people profiles are involved and they ALL should understand what they mean.

As an Extra Ball, here are some useful tips, which could help in order to write better Feature/Functionalities/Issues/User Stories, apart from the ones reviewed so far:

Create Features in a collaborative way. You might start by yourself but involve stakeholders as much as possible without creating a communication overhead.
Users Come first. A Feature/User Story describes how a potential user utilizes the functionality.
Keep Features simple and concise. As mentioned, they should be short and easy to understand using a commong language.
Start with Epics. You might start simple and move towards complexity, you get the global picture and afterwards you can refine it by breaking it down into smaller Features/User Stories.
Refine the Stories until they are ready. A good technique would be by using a Product Backlog Refinement Session.
Add Acceptance Criteria. This is a must since it will allow you to describe the conditions that have to be fulfilled so that the story/feature is done.
Keep everything visible. Use a board or a tool like an Isuues Tracker which is accessible and visible for everyone in the organization.

Writing Features in a collaborative way is a must.

Conclusion

Communications is very important and it dictates how well structured and coordinated an organization is.

In this post we have seen how a Cross-Communication Framework like BDD in combination with A DSL like Gherkin can help us to mitigate communication issues. Gherkin is also adaptable and flexible, we can even establish your own rules to take it to the next level.

Are you using BDD or Gherkin? As usual, I will finish this way: Any Feedback is more than welcome, feel free to ping me in order to share your thoughts and ideas.

References

Learn Linux… Install Arch with Full Disk Encryption.

2020-12-28T00:00:00+00:00

“There are 10 kinds of people in the world: those who understand binary numerals, and those who don’t.”

UPDATES

APRIL 2023:
- Some typos and commands fixed. Thanks @juantelez for the feedback.
NOVEMBER 2022:
- Arch has a built-in installer (it is basically a python script) which can facilitate part of this process. So after booting, we can run it with this command:

$ archinstall --script guided

It is ALSO VERY IMPORTANT to use the Arch Linux Official Installation Guide and follow this guide to understand what is going on, thus, you acquire knowledge and polish/customize your installation. Trust me, the lessons learned here are inmense!!!

Introduction

I’m a Linux fan, and the main reason is because of its open source nature: I have been using it for years and I gotta say a lot has changed since the early days… If you remember re-compiling the kernel in order to install an application, you know what I’m talking about… Fortunately that does not happen anymore(?), so do not freak out, not yet :).

This article will act as a looooong guide, which is goint to help you to install (and understand) Arch Linux with full disk encryption. We will review some of the concepts involved during our process, thus, we have a better picture of what we are doing.

DISCLAIMER: Installing ARCH LINUX is about learning the OS, so give yourself time, it is a process, and of course you will need patience, but I promise you will learn, have fun and in the end it will pay off. Keep also in mind that if you are not a DO-IT-YOURSELF person, then ARCH LINUX might not be the right distro for you and other based on it could fit way better(Ex. Manjaro).

Why Arch Linux?

In the past, I used SuSe, Red Hat, Debian, Ubuntu and Arch, in that order. I gotta say with Arch… was Love at First Sight (also thanks to my friend Oriol).

Here are some of the reasons which motivated me:

100% Community based, built from scratch independent of any other Linux distribution.
The Arch Wiki.
The Arch Linux Community.
Perfect Learning Base.
Community driven Arch User Repository.
Rolling release with always the latest versions of everything.
Pacman Package Manager(pacman).
Full Flexibility and Customization.
Stability and Reliability.

Assumptions

You have basic knowledge about using the command line.
You have already tried out any other linux distro: I will do my best to explain but I might take basic concepts for granted.
You know how to flash a USB Device with a .iso image in order to create a bootable disk for an Operating System.

Hardware for this Guide

I have an Intel based system, in this case a Dell XPS 13 (9310) where we will install everything from scratch. I have also used this guide for installing my Intel NUC too, so most of the content in this article would apply to other hardware. In case there are some specifics I will mention them.

It is important that you check your hardware in the Official Arch Linux Wiki for tips, tricks, troubleshooting and extra specific steps when setting up your linux environment.

Preparing the Terrain

As a first step we need a bootable USB Disk, so in order to create it we need an .iso we can download from here.

Plug your USB Drive Stick and check its location by running lsblk:

NAME            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda               8:0    1   7,6G  0 disk  
├─sda1            8:1    1   621M  0 part  /run/media/fernando/ARCH_202012
├─sda2            8:2    1    61M  0 part  
└─sda3            8:3    1   300K  0 part  

In my computer that was /dev/sda, so let’s burn the .iso with the dd tool:

dd bs=4M if=//archlinux-2020.12.01-x86_64.iso of=/dev/sda status=progress oflag=sync

That is ALL WE NEED at the moment, so let’s get to the next section to start learning :).

Why Disk Encryption?

There is a short answer for this: Security. In a time where (almost) all our information is binary and for instance, our lives, are mostly inside of devices, I personally want to ensure that my sensitive information is hard to get even if my laptop lands on the street, due to it being stolen or lost (hopefully not but never say never…).

So it is time to jump deeper in the core of this article:

The result is going to be a Full Arch Linux installation with Disk Encryption(FDE).

What is Block Device Encryption?

Block device encryption encrypts/decrypts the data transparently as it is written/read from block devices, the underlying block device sees only encrypted data. To mount encrypted block devices we must provide a passphrase to activate the decryption key.

Some systems require the encryption key to be the same as for decryption, and other systems require a specific key for encryption and specific second key for enabling decryption.

Encrypting with dm-crypt/LUKS

LUKS (Linux Unified Key Setup) is a specification for block device encryption (nowadays a standard for Linux). It establishes an on-disk format for the data, as well as a passphrase/key management policy.

LUKS uses the kernel device mapper subsystem via the dm-crypt module. This arrangement provides a low-level mapping that handles encryption and decryption of the device’s data. User-level operations, such as creating and accessing encrypted devices, are accomplished through the use of the cryptsetup utility.

What LUKS does:
- LUKS encrypts entire block devices
  - LUKS is thereby well-suited for protecting the contents of mobile devices such as:
    - Removable storage media
    - Laptop disk drives
- The underlying contents of the encrypted block device are arbitrary.
  - This makes it useful for encrypting swap devices.
  - This can also be useful with certain databases that use specially formatted block devices for data storage.
- LUKS uses the existing device mapper kernel subsystem.
  - This is the same subsystem used by LVM, so it is well tested.
- LUKS provides passphrase strengthening.
  - This protects against dictionary attacks.
- LUKS devices contain multiple key slots.
  - This allows users to add backup keys/passphrases.
What LUKS does not do:
- LUKS is not well-suited for applications requiring many (more than eight) users to have distinct access keys to the same device.
- LUKS is not well-suited for applications requiring file-level encryption.

Fedora Project

LVM: Logical Volume Manager

Logical Volume Management utilizes the kernel’s device-mapper feature to provide a system of partitions independent of underlying disk layout. With LVM you abstract your storage and have “virtual partitions”, making extending/shrinking easier (subject to potential filesystem limitations).

Virtual partitions allow addition and removal without worry of whether you have enough contiguous space on a particular disk, getting caught up fdisking a disk in use (and wondering whether the kernel is using the old or new partition table), or, having to move other partitions out of the way.

LVM on LUKS

The straightforward method is to set up LVM on top of the encrypted partition. Technically the LVM is setup inside one big encrypted blockdevice:

+-----------------------------------------------------------------------+ +----------------+
| Logical volume 1      | Logical volume 2      | Logical volume 3      | | Boot partition |
|                       |                       |                       | |                |
| [SWAP]                | /                     | /home                 | | /boot          |
|                       |                       |                       | |                |
| /dev/MyVolGroup/swap  | /dev/MyVolGroup/root  | /dev/MyVolGroup/home  | |                |
|_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _| | (may be on     |
|                                                                       | | other device)  |
|                         LUKS encrypted partition                      | |                |
|                           /dev/sda1                                   | | /dev/sdb1      |
+-----------------------------------------------------------------------+ +----------------+

Lesson Summary

LUKS is the encryption type; dm-crypt is the device mapper target mechanism which encrypts/decrypts LVM volumes; cryptsetup is the utility you use to configure it all.

Example of encrypted disk layout using LVM on LUKS

Booting from Arch Linux Live

That is enough theory for now and it is (finally?) time to dip our toe in practical linux water.

As a first step we need to boot our system with our already created Arch Linux Bootable USB Disk.

Note: Arch Linux installation images do not support Secure Boot.

We will have to disable TPM and SecureBoot, otherwise our USB drive with the Arch Linux .iso image will not be recognized. Do not worry, you can enable it later.
We also have to Disable RAID and enable AHCI/NVMe (or disable all Operating Mode of the integrated storage device controller). Apparently in many DELL Laptops with Windows, this is only for compatibility and some Intel Features which depend on this functionality under Windows. By the way, RAID mode offers no benefit in this case (on an XPS 13 that only supports a single SSD). Check this official thread for more info.

If you want to have a dual-boot with Windows, disabling RAID will make it unusable but you can follow the next steps to avoid this.

On Windows in order to switch RAID to AHCI (AVOID THIS if you do not want dual-boot Linux-Windows):

Open the Command Prompt as an administrator. Right Click and Run as administrator.
Type the following command: bcdedit/set safeboot minimal.
Reboot the system pressing the F2 key to open the BIOS menu.
Under System Configuration->SATA Operation, you’ll observe RAID on.
Switch to AHCI mode, ignoring the warnings and applying and rebooting.
Repeat step 1 by typing: bcdedit/deletevalue safeboot.
Reboot Windows and Voila! You have finally switched from RAID to AHCI.

If we have reached this point, that means that we have loaded the Arch Linux Live USB and booted from it. The proof is that we find ourselves at a prompt: root@archiso ~ #. Well done!

First steps

TIP: Even though I try to keep this article up to date for reference, it is a good practice to check the Official Arch Linux Intallation Guide and have it as a reference.

At this point we should be in front of a prompt:

root@archiso ~ #

This is our root prompt and I’ll be shortening that to $ in this post.

This is an OPTIONAL step but if the console font is too small or not readable, we can set it up:

$ setfont latarcyrheb-sun32

We need an internet connection, so let’s configure the network. I connected via ethernet so everything worked out of the box. if you need WiFi, you can set it up by launching iwctl (interaction mode with autocompletion). Here are some useful commands:

`iwctl`
`station list`                        # Display your wifi stations
`station  scan`            # Start looking for networks with a station
`station  get-networks`    # Display the networks found by a station
`station  connect `  # Connect to a network with a station

We also need to update our system clock. Let’s use timedatectl(1) to ensure the system clock is accurate:

$ timedatectl set-ntp true

To check the service status, we can use timedatectl status.

My hostname for my system is android10-xps-arch. I tend to use this personal naming convention to identify my hardware with different operating systems, you will see this name around when setting things up in a few places, especially when setting up the volumes in LVM. Swap it out for your own :).

Once that’s done, we can start building up to the installation.

Disk Partitioning

Note: If we want a dual-boot setup with Windows, it is very likely that we already have an EFI Boot Partition, so we can AVOID ITS CREATION and we also DO NOT HAVE TO WIPE OUT THE ENTIRE PARTITION TABLE, only create the linux ones by using the remaining empty space. On Windows we can shrink our C:\ by going to the Device Manager, right click and Shrink Hard Drive.

This is my disk layout (run lsblk to get this output):

NAME            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme0n1         259:0    0 476,9G  0 disk  
├─nvme0n1p1     259:1    0   512M  0 part  /boot
└─nvme0n1p2     259:2    0 476,4G  0 part  
  └─luks        254:0    0 476,4G  0 crypt 
    ├─main-root 254:1    0    50G  0 lvm   /
    └─main-home 254:2    0   110G  0 lvm   /home

This results in a System with Full Disk Encryption (FDE), aside from the boot partition.

For this I used the parted utility for manipulating the partition table:

$ parted /dev/nvme0n1

(parted) mklabel gpt  # WARNING: wipes out existing partitioning
(parted) mkpart ESP fat32 1MiB 513MiB  # create the UEFI boot partition
(parted) set 1 boot on  # mark the first partition as bootable
(parted) mkpart primary  # turn the remaining space in one big partition
      File system type: ext2  # don't worry about this, we'll format it after anyway
      Start: 514MiB
      End: 100%

Now you can check the created layout:

(parted) print
  Model: Unknown (unknown)
  Disk /dev/nvme0n1: 512GB
  Sector size (logical/physical): 512B/512B
  Partition Table: gpt
  Disk Flags: 
      
  Number  Start   End    Size   File system  Name  Flags
    1      1049kB  538MB  537MB  fat32              boot, esp
    2      539MB   512GB  512GB  ext2

(parted) exit

Setting up Disk Encryption

This will encrypt the second partition, which we’ll then hand off to LVM to manage the rest of our partitions. Doing it this way means everything is protected by a single password.

$ cryptsetup luksFormat /dev/nvme0n1p2
    WARNING!
    ========
    This will overwrite data on /dev/nvme0n1p2 irrevocably.
    
    Are you sure? (Type uppercase yes): YES
    Enter passphrase: 
    Verify passphrase:

Now we need to open the encrypted disk so LVM can do its thing:

$ cryptsetup open /dev/nvme0n1p2 luks

Enter passphrase for /dev/nvme0n1p2: 

Setting up LVM

In this section (since we already know about LVM) we will need:

A Physical Volume: mandatory as a container for LVM.
A Volume Group: where we will add our partitions:
- Root Partition.
- Home Partition.
- Swap.

Let’s proceed with the commands then:

$ pvcreate /dev/mapper/luks  # create the physical volume
Physical volume "/dev/mapper/luks" (Fernando) successfully created.? 

$ vgcreate main /dev/mapper/luks  # create the volume group (Fernando)
 Volume group "luks" successfully created

$ lvcreate -L 100G main -n root  # create a 100GB root partition
 Logical volume "root" created.

$ lvcreate -L 18G main -n swap  # create a RAM+2GB swap, bigger than RAM for hibernate
 Logical volume "swap" created.

$ lvcreate -l 100%FREE main -n home  # assign the rest to home
 Logical volume "home" created.

We can check the layout by running lvs:

$ lvs
  LV   VG    Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  home main      -wi-a----- 308.43g                                                    
  root main      -wi-a----- 100.00g                                                    
  swap main      -wi-a-----  18.00g                                                    

Format All The Partitions

Now we’re going to format all the partitions we’ve created so we can actually use them.

First the root partition.

$ mkfs.ext4 /dev/mapper/main-root
...
Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (65536 blocks): done
Writing superblocks and filesystem accounting information: done

Now our home partition.

$ mkfs.ext4 /dev/mapper/main-home
...
Writing superblocks and filesystem accounting information: done

Time for the Swap.

$ mkswap /dev/mapper/main-swap
Setting up swapspace version 1, size = 18 GiB (19327348736 bytes)
...

Finally boot partition ONLY when you DO NOT WANT a DUAL-BOOT setup. This must be a FAT32 formatted partition b/c UEFI.

$ mkfs.fat -F32 /dev/nvme0n1p1
...

Installing The Base System

It’s time to install the base system, which we can then chroot into in order to further customise our installation.

A chroot is an operation that changes the apparent root directory for the current running process and their children. A program that is run in such a modified environment cannot access files and commands outside that environmental directory tree. This modified environment is called a chroot jail.

Mounting All The Partitions

Before we can install the OS we need to mount all the partitions and then chroot into the mountpoint of the root partition.

mount /dev/mapper/main-root /mnt
mount /dev/mapper/main-home /mnt/home
mount /dev/nvme0n1p1 /mnt/boot
swapon /dev/mapper/main-swap

Setting Up The Mirrorlist

Next step is to edit /etc/pacman.d/mirrorlist and put the mirrors closest to us at the top. This’ll help speed up the installation.

It is highly recommended that we generate a mirrorlist and unckeck the http checkbox so we only use mirrors we can fetch from over https. (Feel free to mark IPv6 if your connection supports it.)

In my case I generated it for Germany and used curl to get them. Here the steps:

mv /etc/pacman.d/mirrorlist /etc/pacman.d/mirrorlist.bak #Backup just in case.

curl https://archlinux.org/mirrorlist/?country=DE&protocol=https&ip_version=4&ip_version=6 
      >> /etc/pacman.d/mirrorlist #Get the mirror list.

rm /etc/pacman.d/mirrorlist.backup #Success: remove the backup file.

TIP: Uncomment 8 favorite mirrors and place them at the top of the mirrorlist file. That way it’s easy to find them and move them around if the first mirror on the list has problems. It also makes merging mirrorlist updates easier. HTTP mirrors are faster than FTP due to persistent HTTP connection.

vim /etc/pacman.d/mirrorlist

We can also have a look at the status of the mirrors and even for more info, as usual, we can go to the Arch Linux Wiki.

Installing Basic Components

Now that everything is set up we need to bootstrap the OS:

# In case the below command FAILS, we can first run: 
# pacman-key --init
# pacman-key --populate archlinux`

pacstrap -i /mnt base linux linux-firmware base-devel lvm2 vim

Let’s break down all these packages we are installing:

base: Minimal package set to define a basic Arch Linux installation.
linux: The Linux kernel.
linux-firmware: Firmware files for Linux
base-devel: Package group that includes tools needed for building (compiling and linking).
lvm2: Logical Volume Manager 2 utilities.
vim: Vim editor for customising configurations.

It’ll now prompt us to confirm our package selection and then start with the installation of the base system. Picking the defaults should be safe and fine.

Configuring the new installation

Now that the base system is there, we can chroot into it to customise our installation and finish it.

Fstab

DEFINITION:: The fstab(5) file can be used to define how disk partitions, various other block devices, or remote filesystems should be mounted into the filesystem.

First we generate an fstab file (use -U or -L to define by UUID or labels, respectively):

$ genfstab -U /mnt >> /mnt/etc/fstab

Check the resulting /mnt/etc/fstab file, and edit it in case of errors.

$ cat /mnt/etc/fstab

Here is an example of my /etc/fstab:

$ cat etc/fstab

# Static information about the filesystems.
# See fstab(5) for details.

#      
# /dev/mapper/main-root
UUID=xxxxxxx-3c01-xxxx-xxxx-ab120fexxxxx	/ ext4 rw,relatime	0 1

# /dev/nvme0n1p1
UUID=52CE-47A9 /boot vfat  rw,relatime,fmask=0022,dmask=0022,
                           codepage=437,iocharset=iso8859-1,
                           shortname=mixed,utf8,errors=remount-ro	0 2

# /dev/mapper/main-home
UUID=xxxxxxx-3c01-xxxx-xxxx-ab120xxxxxxx	/home ext4 rw,relatime	0 2

Now change root into the new system:

$ arch-chroot /mnt

Your prompt will now change to: [root@archiso /]#.

Locale

DEFINITION:: Locales are used by glibc and other locale-aware programs or libraries for rendering text, correctly displaying regional monetary values, time and date formats, alphabetic idiosyncrasies, and other locale-specific standards.

Let’s edit our Locale Information by opening the /etc/locale.gen file and uncommenting en_US.UTF-8 UTF-8 and other needed locales: In my case also: de_DE.UTF-8 UTF 8 (since I live in Germany).

Once we are done, we need to generate them by running:

$ locale-gen

As a last step in this section, let’s execute the following in order to create the locale.conf(5) file and set the LANG variable accordingly:

$ echo LANG=en_US.UTF-8 > /etc/locale.conf
$ export LANG=en_US.UTF-8

Timezone

Let’s set our timezone by running:

$ tzselect

Once we have selected our timezone we need to update a few more things. First override the /etc/localtime file and symlink it to your timezone with this format:

$ ln -sf /usr/share/zoneinfo// /etc/localtime

In my case (Berlin):

$ ln -sf /usr/share/zoneinfo/Europe/Berlin /etc/localtime

Time so sync the clock settings and set the hardware clock to UTC by running hwclock(8) to generate /etc/adjtime:

$ hwclock --systohc --utc

Vconsole

This part will set the keyboard layout and font to be used by the virtual console as default values.

Let’s create the /etc/vconsole.conf configuration file and add keyboard configuration:

$ vim /etc/vconsole.conf

KEYMAP=us

At this point we could also (OPTIONAL) set another font by adding this to the mentioned file:

FONT=latarcyrheb-sun32
KEYMAP=us

Hostname

Time to give your system a name by adding that to /etc/hostname. As mentioned earlier, I am:

android10-xps-arch

Also, add a line for that same hostname to /etc/hosts:

$ vim /etc/hosts

# Static table lookup for hostnames.
# See hosts(5) for details.
127.0.0.1	localhost
::1               localhost
127.0.1.1	android10-xps-arch.localdomain android10-xps-arch

If the system has a permanent IP address, it should be used instead of 127.0.1.1.

GPU Power Saving

For this purpose, we have to create /etc/modprobe.d/i915.conf with the following content:

options i915 enable_guc_loading=-1 enable_guc_submission=-1

Mkinitcpio

DEFINITION:: mkinitcpio is what is used to generate the initramfs you’ll soon boot from. However, due to the hardware in this specific laptop and our disk partitioning we have to update it a bit. This configuration will use a full systemd based boot stack.

We need to modify /etc/mkinitcpio.conf and add the following information

set MODULES to: (nvme i915 intel_agp)
set HOOKS to: (base autodetect systemd block sd-vconsole sd-encrypt sd-lvm2 fsck keyboard filesystems)

Let’s regenerate the initramfs: For LVM, system encryption or RAID, modify mkinitcpio.conf(5) and recreate the initramfs by executing the following command:

$ mkinitcpio -p linux

If the command fail (it happened to me) and you see something like:

`specified kernel image does not exist /boot/vmlinuz-linux`

You might need to first reintall the linux kernel and then re run the above command:

$ pacman -S linux

$ mkinitcpio -p linux

The command output should look like this:

==> Building image from preset: /etc/mkinitcpio.d/linux.preset: 'default'
  -> -k /boot/vmlinuz-linux -c /etc/mkinitcpio.conf -g /boot/initramfs-linux.img
==> Starting build: 4.13.9-1-ARCH
  -> Running build hook: [base]
  -> Running build hook: [systemd]
  -> Running build hook: [autodetect]
  -> Running build hook: [keyboard]
  -> Running build hook: [sd-vconsole]

...

  -> Running build hook: [block]
==> WARNING: Possibly missing firmware for module: wd719x
==> WARNING: Possibly missing firmware for module: aic94xx
  -> Running build hook: [sd-encrypt]
  -> Running build hook: [sd-lvm2]
  -> Running build hook: [filesystems]
  -> Running build hook: [fsck]
==> Generating module dependencies
==> Creating gzip-compressed initcpio image: /boot/initramfs-linux-fallback.img
==> Image generation successful

Don’t worry about those two warnings, the XPS 13 doesn’t have any hardware on board that needs those drivers.

Microcode

Sometimes bugs are discovered in processors for which microcode updates are released. These updates provide bug fixes that can be critical to the stability of your system. Without them, you may experience spurious crashes or unexpected system halts that can be difficult to track down.

This module is loaded together with the initramfs when your system boots, let’s install the package for it:

$ pacman -Sy intel-ucode

Setting Up The Bootloader

We will be using systemd-boot as our bootloader.

DEFINITION:: systemd-boot is a simple UEFI boot manager which executes configured EFI images. The default entry is selected by a configured pattern (glob) or an on-screen menu to be navigated via arrow-keys. It is included with systemd, which is installed on an Arch system by default.

In order to start, we need to tell bootctl to install the necessary things onto /boot:

$ bootctl install --path=/boot

In the future we won’t need to call install, but update instead. The good thing is that there is a hook that can be installed which will do this automatically every time we perform a full system upgrade. We are going to do it later once we have a full system up and running.

Let’s edit /boot/loader/loader.conf and make it look like this:

timeout 10
default arch
editor 1

By setting editor 1 it’s possible for anyone to edit the kernel boot parameters, add init=/bin/bash and become root on your system. However, since the disk is still encrypted at this point they can’t do much with it. For instance I think this is very convenient to be able to edit those options when something does go wrong.

We now need to create the boot entry named arch. To that end, create the file /boot/loader/entries/arch.conf with the following content:

title Arch Linux
linux /vmlinuz-linux
initrd /intel-ucode.img
initrd /initramfs-linux.img
options luks.uuid=$UUID luks.name=$UUID=luks 
        root=/dev/mapper/main-root rw 
        resume=/dev/mapper/main-swap ro 
        intel_iomu=igfx_off quiet mem_sleep_default=deep
        snd_hda_intel.dmic_detect=0

# NOTE: options should be in the same line separated by 
# spaces. Here I formatted this way for better understanding.

Replace $UUID with the value from this command:

$ cryptsetup luksUUID /dev/nvme0n1p2

TIP: It is also a GOOD IDEA to create another entry that allows us to boot with resume support disabled, in case that’s broken. To that end, create a file like /boot/loader/entries/arch-noresume.conf with the same content as above, but simply omit the resume=/dev/mapper/main-swap ro option.

OPTIONAL: Windows Dual-Boot

In case we already have a Windows Intallation, here are GOOD NEWS from the Arch Linux Wiki:

systemd-boot will automatically check at boot time for Windows Boot Manager at the location /EFI/Microsoft/Boot/Bootmgfw.efi, EFI Shell /shellx64.efi and EFI Default Loader /EFI/BOOT/bootx64.efi, as well as specially prepared kernel files found in /EFI/Linux/. When detected, corresponding entries with titles auto-windows, auto-efi-shell and auto-efi-default, respectively, will be generated. These entries DO NOT require manual loader configuration. However, it does not auto-detect other EFI applications (unlike rEFInd), so for booting the Linux kernel, manual configuration entries must be created.

I performed this step in another installation and everything was recognized automatically and added to the bootloader entries.

Sudo

For commands execution, it is always preferable to use sudo over changing to root. In order to do so we need to install the sudo package and update its configuration:

$ pacman -Sy sudo

Now let’s go to the configuration file:

$ sudo visudo

Next step is to update the configuration and uncomment the line that reads %wheel ALL=(ALL) ALL plus add some extra configuration at this point to save time when creating our first user, (here fernando is going to be my username):

...
##
## User privilege specification
##
root ALL=(ALL) ALL

# Options
Defaults editor=/usr/bin/vim, !env_editor
Defaults insults

# Full Access
fernando ALL=(ALL) ALL

# Last rule as a safety guard
fernando ALL=/usr/sbin/visudo

# Uncomment to allow member of group wheel to execute any command
%wheel ALL=(ALL) ALL
...

Creating a User Account

We now have to create a user account mentioned in the step above for ourselves and ensure we are added to the wheelgroup:

$ useradd -m -G wheel,users -s /bin/bash fernando
$ passwd fernando
    New password: 
    Retype new password: 
    passwd: password updated successfully

Installing GNOME

We need a Graphical User Interface. There are many options out there and I’m not going to point out which is better or worse, personally I think it is a matter of taste. Here are the most popular ones:

GNOME 3 (our choice in this guide).
KDE Plasma.
Xfce.

We will take advantage of this step and add a couple of extras, so let’s do it by executing the following commands:

$ pacman -Sy gnome gnome-extra dhclient iw dialog

$ pacman -Sy networkmanager network-manager-applet xf86-input-libinput

Something worth mentioning is that I researched a lot to build up this guide and part of it was inspired by one created by Daniele Sluijters who gave a good point about what is the reason behind installing dhclient over dhcpd:

I explicitly install dhclient because dhcpd isn’t very good at dealing with non-spec compliant DHCP implementations. Especially if you have a D-Link router or might encounter one, install this package. It also avoids some issues I’ve had on large networks like at the office, Eduroam etc.

After we are done with the installation, we have to enable both services GDM (gnome) and Network Manager:

$ systemctl enable gdm
$ systemctl enable NetworkManager

Booting into the System

So the time has come…If you are still there, I have to say WOW! Congratulation! You have survived to your first (or one more) installation of Arch Linux, which is great. I’m proud of you and I am also sure you have been learning a lot so far.

So one of our last steps will be to exit the chroot:

$ exit

Unmount our filesystems:

$ umount -R /mnt

And finally reboot

$ reboot

Arch Linux up and running with GNOME 3

One Last Upgrade

You though you were done right? The answer to this question is: Yes and No :).

One last thing: Since you might have used an ISO image, it could be that it is not the latest so let’s do a full system upgrade before continuing:

$ sudo pacman -Syu 

Post Installation MUSTs

LTS Kernel

By default we have (installed) the latest stable linux kernel version. LTS stands for: Long Term Support, which means that this kernel will not be updated as frequently as the most recent one.

Here is a char that simplify differences between them:

The good news is that we can have both installed and choose with which one we want to start our system.

It does not hurt at all to have both installed, actually in my experience, when we want to try the latest state of the art kernel functionalities by using the latest version, it could happen that either something stops working or there is some misbehaviour, so by having the LTS version could also be a life saver to fix things (refer to the troubleshooting section).

Let’s proceed by running this command in order to install the LTS Kernel:

$ sudo pacman -S linux-lts 

We need to add an entry for our boot loader, so we copy our /boot/loader/entries/arch.conf file and name it arch-lts.conf:

$ sudo cp /boot/loader/entries/arch.conf /boot/loader/entries/arch-lts.conf

Afterwards we open the file (sudo vim /boot/loader/entries/arch-lts.conf) and ONLY modify the first 2 LINES:

title Arch Linux LTS
linux /vmlinuz-linux-lts
...

We can now restart our system, choose the ‘Arch Linux LTS’ option during boot, and check out our current Linux Kernel with this command:

$ uname -r

AUR Helper

Arch Linux has an amazing Package Manager (pacman) but one of the things which makes Arch AMAZING is its community. There will be cases where pacman is not going to be enough and you will need to use a AUR Helper in order to download and install software created/maintained by the community.

There are several AUR Helpers out there and nowadays people are talking very good about Yay (written in Go) but I will personally stick to the old school way and use Pacaur, which has been written in bash and pretty much emulates pacman behavior.

I will also ad the steps to install Yay and give it a try. The choice is yours.

Since AUR Helpers are NOT part of the Core Arch Linux Repository we need to install them manually.

Installing Pacaur

Let’s install first required packages:

$ sudo pacman -S binutils make gcc fakeroot expac yajl jq wget gtest gmock wget --noconfirm

Pacaur relies on auracle in order to install AUR packages, so let’s set it up:

# We get the auracle `.tar.xz` file
$ wget https://aur.archlinux.org/cgit/aur.git/snapshot/auracle-git.tar.gz

# We need to uncompress the downloaded file
$ tar -xzf auracle-git.tar.gz

# Now we build the package
$ cd auracle-git
$ makepkg PKGBUILD --skippgpcheck --noconfirm

# We use pacman to install the already generated package
$ pacman -U auracle-git-*

We have to create a temporary working directory for installing Pacaur:

$ mkdir -p ~/tmp/pacaur_install
$ cd ~/tmp/pacaur_install

We install pacaur from AUR: Download the files from git, build a .tar.xz file and then we install it:

$ curl -o PKGBUILD https://aur.archlinux.org/cgit/aur.git/plain/PKGBUILD?h=pacaur
$ makepkg -i PKGBUILD --noconfirm
$ sudo pacman -U pacaur*.tar.xz --noconfirm

As a last step, let’s clean up the system: temporary directories deletion:

$ rm -r ~/tmp/pacaur_install
$ cd

Installing Yay

If you have already intalled Pacaur following the instructions above, it is pretty straightforwrd, it only requires you to run this command:

$ pacaur -S yay

Otherwise follow these steps:

pacman -S --needed git base-devel
git clone https://aur.archlinux.org/yay.git
cd yay
makepkg -si

Updating the EFI Boot Manager

In the section Setting Up The Bootloader, we have mentioned, that whenever there is a new version of systemd-boot, the boot manager can be optionally reinstalled by the user (we are the users :)).

This can be performed manually (REMEMBER: Automate all things!) or the update can be automatically triggered using pacman hooks, which is what we are going to do here by just installing the package systemd-boot-pacman-hook (in the AUR Repository), which will automate this process:

$ pacaur -S systemd-boot-pacman-hook

Extras

Everything from here is entirely OPTIONAL and based on PERSONAL TASTING. I just want to share my full setup with the hope that you can also get something useful out of it :).

Tools and Utilities

Tweeks Tool: I use Gnome 3 so its counterpart for this purpose is Gnome-Tweaks which let me to unlock and setup hidden functionalities.
Browsers: I like to have different ones, so not a surprise: Firefox, Chromium and Google Chrome.
Video Player: Vlc, which contains all the necessary codecs to play pretty much anything.
Photo Editor: Gimp, a must if you are using Linux.
Image Viewer: Imv, A tiny one that I can even call from the command line when browsing directories.
Image Utilities: GraphicsMagick, is the swiss army knife of image processing and manipulation.

Let’s install all this: We can do it in a batch processing fashion:

$ pacaur -S firefox google-chrome chromium vlc gimp gnome-tweaks imv graphicsmagick

Default Shell

And my choice here is oh-my-zsh due to its flexibility, customization and plugins system. You can do anything you want.

I also opted for the PowerLevel10k theme… check the final result:

oh-my-zsh customized using the PowerLevel10k them.

Step 1: Install Zsh (if not currently using it):

$ pacaur -S zsh oh-my-zsh-git

Step 2: Make Zsh your default shell (restart so your shell change takes effect):

$ chsh -l
$ chsh -s /usr/bin/zsh

Step 3: Install and enable the Powerlevel10k theme:

$ yay -S --noconfirm zsh-theme-powerlevel10k-git
$ echo 'source /usr/share/zsh-theme-powerlevel10k/powerlevel10k.zsh-theme' >>~/.zshrc

Step 4: Install Nerd Fonts Hack:

$ pacaur -S nerd-fonts-hack

Step 5: Migrate from Bash (skip if you are already using it):

We will have to move some content from your files .bash_profile and .bashrc to .zshrc and .zprofile respectively.

Step 6: This is my theme configuration in .zshrc file with the plugins (copy and paste :)):

# Enable Powerlevel10k instant prompt. Should stay close to the top of ~/.zshrc.
# Initialization code that may require console input (password prompts, [y/n]
# confirmations, etc.) must go above this block; everything else may go below.
if [[ -r "${XDG_CACHE_HOME:-$HOME/.cache}/p10k-instant-prompt-${(%):-%n}.zsh" ]]; then
  source "${XDG_CACHE_HOME:-$HOME/.cache}/p10k-instant-prompt-${(%):-%n}.zsh"
fi

# Path to your oh-my-zsh installation.
ZSH=/usr/share/oh-my-zsh/

export DEFAULT_USER="fernando"
export TERM="xterm-256color"
export ZSH=/usr/share/oh-my-zsh
export ZSH_POWER_LEVEL_THEME=/usr/share/zsh-theme-powerlevel10k

source $ZSH_POWER_LEVEL_THEME/powerlevel10k.zsh-theme

plugins=(archlinux 
	bundler 
	docker 
	jsontools 
	vscode web-search 
	k 
	tig 
	gitfast 
	colored-man-pages
	colorize 
	command-not-found 
	cp 
	dirhistory 
	autojump 
	sudo
	zsh-syntax-highlighting
	zsh-autosuggestions) 
# /!\ zsh-syntax-highlighting and then zsh-autosuggestions must be at the end

source $ZSH/oh-my-zsh.sh

# Uncomment the following line to disable bi-weekly auto-update checks.
DISABLE_AUTO_UPDATE="true"

ZSH_CACHE_DIR=$HOME/.cache/oh-my-zsh
if [[ ! -d $ZSH_CACHE_DIR ]]; then
  mkdir $ZSH_CACHE_DIR
fi

source $ZSH/oh-my-zsh.sh

Step 7: You can also customize even more if you go to the .p10k.zsh file, which is very well documented:

$ vim ~/.p10k.zsh

Step 8: Setup font in GNOME (skip if not a GNOME user):
1. Install GNOME Tweaks in case you have GNOME.
2. Set the system monospace font to “Hack Nerd Font Regular” and size, current one + 1.
3. In the Terminal’s Font Preference, I leave the Custom Font option unchecked, .i.e use system font.

My Color Palette in the Terminal Preferences.

Developer Tools

Here are the ones:

Git: The by default free and open source distributed version control system.
Asdf: A CLI tool for managing multiple runtime versions and programming languages.
Android Studio: is the official IDE for Google’s Android operating system, built on JetBrains’ IntelliJ IDEA software and designed specifically for Android development.
Docker: A set of platform as a service products that use OS-level virtualization to deliver software in packages called containers.
VSCode: I use the open source release version which is called Code (Microsoft VSCode uses this one as base for its product).
Zeal: An offline documentation browser for software developers.
Intellij: IDE for mainly Kotlin, Java and Scala. But is supports many programming languages through its plugin system.
Slack: Communication and Collaboration Tool.

Installation:

$ pacaur -S git asdf android-studio docker code zeal intellij slack

Troubleshooting

In case we face problems, it is important to have written down all the necessary steps just for the sake of properly starting our system with a Rescue Disk (the same USB Drive we set up already).

We plug our Bootable USB Drive and boot into the system.
We need to open the encrypted disk so LVM can do its thing:

$ cryptsetup open /dev/nvme0n1p2 luks

Enter passphrase for /dev/nvme0n1p2: 

We setup the internet connection with iwctl:

`iwctl station list` # Display your wifi stations
`iwctl station station scan` # Start looking for networks with a station
`iwctl station station get-networks` # Display the networks found by a station
`iwctl station station connect network_name` # Connect to a network with a station

We need to mount all the partitions:

$ mount /dev/mapper/main-root /mnt
$ mount /dev/mapper/main-home /mnt/home
$ mount /dev/nvme0n1p1 /mnt/boot
$ swapon /dev/mapper/main-swap

Change root into the new system:

$ arch-chroot /mnt

Now you are ready to work and fix Arch Linux just in case something unexpected ocurred.

And when we are done, we exit the chroot:

$ exit

Unmount our filesystems:

$ umount -R /mnt

Reboot our system:

$ reboot

This specific installation uses ‘ext4’ as a file system but if you use ‘btrfs’, I have added troubleshooting information in my Linux Wiki

Conclusion

This has been such a ride! A very long but (hopefully) a fulfilling process. I have no more words than saying THANKS for READING!

I hope you found the material useful and of course any feedback is more than welcome, so feel free to drop me a line in any of the social networks that appear in this website.

From here, I will leave up to you to continue diving deeper into the Linux Ecosystem and finish up with an inspirational quote:

“Wisdom is not a product of schooling but of the lifelong attempt to acquire it.”

References

Learn out of mistakes: Postmortems to the rescue!

2020-06-21T00:00:00+00:00

“Continuos improvement and learning should be a must in every organization’s culture.”

Introduction

Some time ago, I wrote about how to build a company’s culture based on human values, and mentioned a bunch of ideas that I would like to bring back:

Understand we make mistakes: there are always good intentions but we are human beings and we are not perfect.
No finger pointing: learn out of failure and create retrospectives in order to not repeat those failures again, but please do not blame people, we are always together either in the good moments and when going through difficulties.
Be positive: there is always light at the end of the tunnel.

As we can see, these ideas reflect human behavior and attitudes against a problem, thus, I would like to take the opportunity to build on them and explore how we can leverage failures in order to learn from them. Let’s get started!

Make sure you visit my postmortems sections in order to complement your learnings in this article.

A culture of lessons learned

I bumped into this inspirational quote and wrote it down some time ago (please ping me if you know the author):

“Whatever you do, you will make mistakes, errors and blunders. Everyone does. Some less, some more, but no one is exempt. Yet, success is nothing more than mistakes you managed to overcome. That is why a mistake is not important. What counts are the things you do after one.”

In my opinion, it is our reponsibility to minimize failure, but company’s cultures should provide room for failing, and when this happens, we should get something positive and constructive out of it: a lesson learned.

Moreover, let me quote myself:

“Learning something without sharing it, is senseless”

We want to share what we have learned:

…to gain experience over something new.
…to acquire knowledge which will lead to success in similar situations.
…to have more tools in oder to recover from similar problems.
…to not repeat the same mistake again and again.

“The more we fail, the more resilient we become. Let’s let this knowledge to be shared in order to spread and build resilience.”

It is at this point when postmortems come in to play, so in the next section we are going to define what is a postmortem and how this concept can help us to acquire and transmit knowledge based on failure.

What is a postmortem?

DEFINITION: A postmortem (or post-mortem) is a process intended to help you learn from past incidents. It typically involves an analysis or discussion soon after an event has taken place.

I definitely agree with the definition, but from my perspective, postmortems should not be restricted/limited to only technical issues (outages, technical incidents, failures), because otherwise we miss a lot of valuable information from other areas that are helpful for continuous improvement.

They should also include situational issues we might encounter like:

Communication problems.
Conflicts.
Anything you can learn from.

It is VERY IMPORTANT when writing them to:

Use blameless language.
Be annonymous without fingerpointing.
Be constructive.

Which information should a postmortem contain?

An overview to provide context.
A description of the problem (what happened?).
The root cause (Why it happened?).
A resolution (How we solved the issue).
A lesson learned.

How we organize this information, is a matter of taste, but we should keep consistency within the format to facilitate/favor readability. Some other extra information that could be useful to include:

Screenshots.
Tickets.
Charts.
Timelines.
External feedback.

Here is an screenshot of a postmortem out of a situation I experienced recently:

You can also visit my Postmortems Section if you are curious about my experiences in this field.

Tooling and Templates

There are no silver bullets when it comes to tooling and templates for writing and sharing postmortems.

In the example above I used a simple web using github pages that I created some time ago, but documents, wikis or anything that could be shared company wise is more than valid. I will leave it up to you, the only thing I would add here: make sure postmortems are visible and easy to find.

I have also forked a github repository with a bunch of useful templates that I adapted myself according to my needs.

Wrapping up

Postmortems are a useful tool/technique that can be used in order to contantly improve and learn.

Personally I’m strong believer of building a culture based on the following ideas:

“Make mistakes and embrace them, it is nothing to be ashamed of.”

“Learning something without sharing it is senseless.”

“Foment continuous learning and sharing.”

“Encourage a culture based on postmortems and lessons learned.”

I hope you have enjoyed this article and found it useful. As usual, remember that any feedback is more than welcome, I’m easily found on Twitter and other social networks. Cheers!

Real World Examples

Fernando Cejas Postmortems Section.

Resources and further reading

Learn out of mistakes: Postmortems to the rescue!

2020-06-21T00:00:00+00:00

“Continuos improvement and learning should be a must in every organization’s culture.”

Introduction

Some time ago, I wrote about how to build a company’s culture based on human values, and mentioned a bunch of ideas that I would like to bring back:

Understand we make mistakes: there are always good intentions but we are human beings and we are not perfect.
No finger pointing: learn out of failure and create retrospectives in order to not repeat those failures again, but please do not blame people, we are always together either in the good moments and when going through difficulties.
Be positive: there is always light at the end of the tunnel.

Make sure you visit my postmortems sections in order to complement your learnings in this article.

A culture of lessons learned

I bumped into this inspirational quote and wrote it down some time ago (please ping me if you know the author):

“Whatever you do, you will make mistakes, errors and blunders. Everyone does. Some less, some more, but no one is exempt. Yet, success is nothing more than mistakes you managed to overcome. That is why a mistake is not important. What counts are the things you do after one.”

Moreover, let me quote myself:

“Learning something without sharing it, is senseless”

We want to share what we have learned:

…to gain experience over something new.
…to acquire knowledge which will lead to success in similar situations.
…to have more tools in oder to recover from similar problems.
…to not repeat the same mistake again and again.

“The more we fail, the more resilient we become. Let’s let this knowledge to be shared in order to spread and build resilience.”

What is a postmortem?

DEFINITION: A postmortem (or post-mortem) is a process intended to help you learn from past incidents. It typically involves an analysis or discussion soon after an event has taken place.

They should also include situational issues we might encounter like:

Communication problems.
Conflicts.
Anything you can learn from.

It is VERY IMPORTANT when writing them to:

Use blameless language.
Be annonymous without fingerpointing.
Be constructive.

Which information should a postmortem contain?

An overview to provide context.
A description of the problem (what happened?).
The root cause (Why it happened?).
A resolution (How we solved the issue).
A lesson learned.

Screenshots.
Tickets.
Charts.
Timelines.
External feedback.

Here is an screenshot of a postmortem out of a situation I experienced recently:

You can also visit my Postmortems Section if you are curious about my experiences in this field.

Tooling and Templates

There are no silver bullets when it comes to tooling and templates for writing and sharing postmortems.

I have also forked a github repository with a bunch of useful templates that I adapted myself according to my needs.

Wrapping up

Postmortems are a useful tool/technique that can be used in order to contantly improve and learn.

Personally I’m strong believer of building a culture based on the following ideas:

“Make mistakes and embrace them, it is nothing to be ashamed of.”

“Learning something without sharing it is senseless.”

“Foment continuous learning and sharing.”

“Encourage a culture based on postmortems and lessons learned.”

I hope you have enjoyed this article and found it useful. As usual, remember that any feedback is more than welcome, I’m easily found on Twitter and other social networks. Cheers!

Real World Examples

Fernando Cejas Postmortems Section.

Resources and further reading

Technical Debt… GURU LEVEL UNLOCKED!

2020-03-13T00:00:00+00:00

“Technical Debt is the additional effort and work required to complete any software development.”

Introduction

This is a long article that I wanted to squeeze in a smaller one but it was almost mission impossible to get rid of some important/valuable information. I hope you enjoy and find it helpful.

Feel free to provide feedback, which as usual, is more than welcome.

With that being said, I would like to start with a quote from Robert C. Martin:

“Bad code is always imprudent”

I cannot agree more with this, and no matter what I sell you in this post :), there is NEVER a good reason to write bad code.

The Questions

We as Engineers, Tech Leads and Managers know that technical debt is one of our worst enemies when it comes to codebases and software projects in general It can be very frustrating and demotivating thus making our life a bit more complicated…But…

What is technical debt really?
And Legacy code?
Is there a proportional relationship between them?
How can we measure and determine the healthiness of our project?
And once we measure it, how can we finally address the problem?

Let’s try to answer these questions and also explore in depth different techniques and strategies that will help us effectively deal with it.

Fact: Our Software is Terrible

In an ideal world, a project would be:

Finished on time.
With a clean code design.
Additional features.
Tested twice.
On Budget.

If that is your reality, then you can stop reading this post, luckily you have UNLOCKED the LEVEL SUPERHEROE, so please share your thoughts and ideas, I am more than curious to know how you have achieved it.

Otherwise, welcome to my world, where we create authentic monsters: giant beasts full of technical debt, legacy code, issues and bugs.

And if you let me add more, that also includes coordination and communication problems across the entire organization. Yes! Our Software is terrible and we know it is TRUE, which does not make it any special, right?

What is Legacy Code?

There are many definitions of legacy code and some of them, in my opinion, contradict themselves, so since you are familiar with the concept, let’s keep it simple:

“Legacy code is code without tests.”

Testing nowadays should be implicit in our engineering process when writing code. So if you are not at least unit testing your codebase, run and do it, it is a command :).

What is Reckless Debt?

I came across this term lately and it looks like we can use it as a synonym of technical debt, but in reality, here is the formal definition:

“Reckless Debt is code that violates design principles and good practices.”

That means that all code generated by us and our team is junk (not done on purpose of course).

Moreover, Reckless Debt will lead to Technical Debt in the short/mid term and it is also a signal that your team needs more training, or you have too many inexperienced or junior developers.

What is Technical Debt?

Here I will rely on Martin Fowler:

“Technical Debt is a metaphor developed by Ward Cunningham to help us think about this problem. Like a financial debt, technical debt incurs interest payments, which come in the form of the extra effort that we have to do in future development because of the quick and dirty design choice. We can choose to continue paying the interest, or we can pay down the principal by refactoring the quick and dirty design into the better design.”

In summary: Technical Debt is the additional effort and work required to complete any software development.

Real case scenario: Adding a new feature

So let’s put our day to day life back into our heads. In this case we have decided to add a new functionality to our project, so here we have 2 well defined options:

The “easy” way, built up with messy design and code, which will get us there way faster: REMEMBER WE NEED TO PAY THE INTEREST.
The “hard” way, built up with cleaner code and a meaningful and consistent architecture. Without a doubt this will take more time but it is going to be more EFFICIENT IN TERMS OF INTEREST COST.

“Accept some short term Technical Debt for tactical reason.”

It is not uncommon that at some point we need to develop something quickly because of time to market (or market experiment), or perhaps there is a new internal component that needs to be shipped in order to be used across the entire organization and we are contributing to it (a module for example), and we code it fast with not the best design until we can come up with a more robust and effective solution.

“No matter what is the reason, but part of this decision to accept technical debt is to also accept the need to pay it down at some point in the future. Having good regression testing assets in place, assures that refactoring accepted technical debt in the future, can be done with low risk.”

Let’s move on and see how we can analyze and inspect our codebase in order to detect technical debt.

ROOKIE Level Unlocked! Static Code Analysis

It is the most basic and fundamental building block when it comes to measuring technical debt at a code level.

Most of us are familiar with this practice since it aims to highlight potential bugs, vulnerabilities and complexity.

But first, in order to interpret the results of static code analysis and quantify technical debt, we need to be familiar with a bunch of code metrics:

Cyclomatic Complexity: stands for the complexity of classes and methods by analyzing the number of functional paths in the code (if clauses for example).
Code coverage: A lack of unit tests is a source of technical debt. This is the amount of code covered by unit tests (we should take this one responsibly, since testing getters and setter can also increase code coverage).
SQALE-rating: Broad evaluation of software quality. The scale goes from A to E, with A being the highest quality.
Number of rules: Number of rules violated from a given set of coding conventions.
Bug count: As technical debt increases, the quality of the software decreases. The number of bugs will likely grow (We can complement this one with information coming from our bug tracker).

There is a variety of tools out there (free for open source projects), which provide the above information out of the box, and most of the time, they can be easily integrated either with your CI infrastructure or directly with our version control system tools like github/gitlab/git.

Here is a screenshot of one codebase example using the online open source tool SonarQube:

Lint is also a very flexible and popular one (there are plugins for the most popular IDEs and you can write your own custom rules, in this case on Android):

Static code analysis should be our first mandatory step to tackle and measure technical debt.

So let’s make sure we include as a regular practice in our engineering process.

EXPERIENCED Level Unlocked! Tech Debt Radar

A Tech Debt Radar is a very simple tool that has personally given me really good results (while working at @SoundCloud, within the android team, it was (and it is AFAIK) a regular practice).

“We should know that this is not an automated tool (like the ones mentioned above) and I define it as a Social Technical Debt Detector by Experience”.

The idea is pretty simple: all the feedback related to how difficult is to work with the current codebase, comes from actually the developers working with it (by experience).

You can see a Tech Debt Radar in the picture below:

As we can see, there is a board with a few post-its representing each one either a feature or even an area of the codebase which eventually is hard to work with.

Then we have 2 axys:

X: represents the level of pain when working with a specific part of the codebase.
Y: represents how much development time would it take to improve the mentioned piece of code.

At a process level, this is done in a meeting with the development team and a technical debt captain (someone who will be in charge of analysing technical debt).

Basically each member of the team, will have the chance to place these post-its depending on how much pain (X axis) each is causing, and how much development time (Y axis) is required to fix it.

This would be mostly common sense (with strong arguments and an explanation of the whys) in the beginning but I can ensure that it will get better over time with the accumulated experience.

As an example on the board, let’s look at the DI card (Dependency Injection). It looks like it is a very painful area in our project and refactoring it will require a big effort. On the other hand, Login is causing a lot of pain and fixing it will not be very complicated.

With this in mind you can get some conclusions:

By addressing all features that are painful and at the same time require little development time (the ones placed upper left corner), we will be able to provide a lot of value and improvement by fixing them.
The rest of the functionalities will require some workout to be prioritized and refactored. As a rule of thumb, discuss with the team and use a divide and conquer approach (split up big problem into smaller ones).

Once we gather all this information, we need to keep track of all the collected feedback, so feel free to use your favorite tool for that purpose.

Even a document might do the job: this is a matter of taste, as soon as you have a place to store all this data and see the evolution over time.

A Technical Debt Radar will not provide the level of granularity and details that any other automated tool out there might do, but it is totally worth a try, and a very valuable method that perfectly complements our codebase analysis with the purpose of understanding the most painful spots, and the most important, is that this information comes from us, from the feedback of the people who daily work with the code.

Remember to have these meetings regularly (minimum once every 2-3 weeks) in order to keep an eye on how much progress (positive or negative) has been done.

GURU Level Unlocked! Behavioral Code Analysis

It is obvious that technical debt have a 1 to 1 relationship with legacy code but there is another important factor to take into consideration: the social part of our organization, which basically emphasizes in how we as human beings interact with each other (as a team), with customers, with the rest of the organization and with the code itself.

All this comes from the fact, that over the years there has seen changes in the way we work and interact with each other, which led to modifications in collaboration techniques, tools and again, the code itself.

References like Adam Tornhill in the area of human psychology and code are helping us to understand a bit more this social part.

Before continuing, let’s recap what a traditional static code analysis tool can do for us:

…focus on a snapshot of the code as it looks right now.
…find code that is overly complex.
…find code which has heavy dependencies on other parts.

In conclusion, static analysis is a very useful tool and as pointed out above, should be our first step when it comes to code inspection, but there is an important gap to fill in:

“Static analysis will never be able to tell you if that excess code complexity actually matters – just because a piece of code is complex doesn’t mean it is a problem.”

Social aspects of software development like coordination, communication, and motivation issues increase in importance and all these softer aspects of software development are invisible in our code:

“Adam Thornill: if you pick up a piece of code from your system there’s no way of telling if it has been written by a single developer or if that code is a coordination bottleneck for five development teams. That is, we miss an important piece of information: the people side of code.”

Behavioral code analysis emphasizes trends in the development of our codebase by mining version-control data.

Since version-control data is also social data, we know exactly which programmer that wrote each piece of code and with this in mind, it is possible to build up knowledge maps of a codebase, for example, like the one in the next figure which shows the main developers behind each module:

For the purpose of better understanding way more of what we are talking about, we will be diving deeper into this online toolset called Codescene.io, which is free for open source projects.

Needless to say, apart from being a great helper with a nice UI, the platform is mostly based on an open source project called code-maat from the same author.

Let’s see what Codescene is capable of…

Hotspots

In essence, a hotspot is complicated code that you have to work with often.

Its calculation is pretty simple:

With a Hotspot analysis we can get a hierarchical map that lets us analyze our codebase interactively.

By using one of the examples of the platform, we can check the following visualizations where each file is represented as a circle:

As we can see, we can also identify clusters of Hotspots that indicate problematic sub-systems.

By clicking on a Hotspot we can get more details to get deeper information:

The main benefits of a Hotspot analysis include:

Maintenance problems identification: Information on where sits complicated code that we have to work with often. This is useful to prioritize re-designs.
Risk management: It could be risky to change/extend functionality in a Hotspot for example. We can identify those areas up-front and schedule additional time or allocate extra testing efforts.
Defects Detector: It could identify parts of the codebase that seem unstable with lots of development activity.

Here is the full documentation with more details.

Code Biomarkers

In medicine, biomarkers stand for measurements that might indicate a particular disease or physiological state of an organism. We can do the same for code to get a high-level summary of the state of our hotspots and the direction our code is moving in.

Code biomarkers act like a virtual code reviewer that looks for patterns that might indicate problems.

They are scored from A to E where A is the best and E indicates code with severe potential problems.

Let’s have a look at a couple of examples listing risky areas of our code base:

In conclusion we can use Code Biomarkers to:

To decide when it’s time to invest in technical improvements instead of adding new features at a high pace.
Get immediate feedback on improvements.

Same as with hotspots, here is also the biomarkers full documentation.

There is way more to cover in this field like:

But from here I will leave it to you, otherwise this article will be too long and, by the way, the idea was to wake up your curiosity (luckily I have achieved it) and shade some light on what is possible by exploring the social side of the code.

“Behavioral code analysis helps you ask the right questions, and points your attention to the aspects of your system – both social and technical – that are most likely to need it. You use this information to find parts of the code that may have to be split and modularized to facilitate parallel development by separate teams, or, find opportunities to introduce a new team into your organization to take on a shared responsibility.”

Where should we focus improvements?
Where are the risk areas in the code?
Any team productivity bottleneck?

I definitely encourage you to give Codescene a try either within an open source repo or within the existent samples, you will be surprised how much curious stuff you find :).

Extra Ball

I would like to introduce an open source repository visualization tool called Gource.

Here is how the author describes it:

“Software projects are displayed by Gource as an animated tree with the root directory of the project at its centre. Directories appear as branches with files as leaves. Developers can be seen working on the tree at the times they contributed to the project.

In essence you can grab your git repository, run gource on it and the result is something like this (This is an example of the Bitcoin repository and its evolution):

The documentation sits at the Gource Github Wiki.

As a trick we have had it in a monitor during sprints to make more visible and transparent how we move around our codebase. Really fun!

Paying Technical Debt

“The best way to reduce technical debt in new projects is to include technical debt in the conversation early on.”

As this quote suggests, this is more at a process level, and even though we have our refactoring toolbox, without the effort of the team, would be impossible to minimize future technical debt and repair existent one.

So let’s see how we can deal with these contexts by pointing out a few tips for the action plan.

At Team level:
- Prioritize and keep track of technical debt: During the sprint planning for example.
- Allocate time to address technical debt: Also During the sprint planning or when estimating a task that requires touching a sick part of the code.
- Tech Debt Days: Another great practice where the team spends an entire day only focused on repairing affected code.
At Company level:
- Educate people about its existence: Cost of delay: This metric helps to make visible how much time a team loses due to technical debt.
- Make it transparent: Talk, talk and talk and always bring it up to the table.
- Communicate it properly: An idea would be to add a tech debt update meeting about the current state of it.

As a conclusion, I would like to finish this section with a bunch of quotes from Adam Tornhill (a reference in this field):

“Technical debt can be a frustrating and de-motivating topic for many Development Teams.”

“The keyword is transparency.”

“Explain the cost of low-quality code by using the transparent metaphor of ‘technical debt’.”

“Make technical debt visible in the code using a variety of objective metrics, and frequently re-evaluate these metrics.”

“Finally, make technical debt visible on the Product and/or Sprint Backlog.”

“Don’t hide Technical Debt from the Product Owner and the broader organization.”

Wrapping up

Technical debt is a ticking bomb and as our lovely Batman from 1966 (characterized by Adam West) would say (you can check the full 2 minutes video here, BTW one of my favorite scenes ever):

“Some days you just cannot get rid of a bomb…”

And based on this inspiring quote let me rephrase it to:

“Sometimes it is not easy to get rid of a bomb…”

It is a reality that technical debt exists in 99% of the codebases, it is also an important challenge we must face to keep the healthiness and maintenance of our software projects.

Hopefully there is light at the end of the tunnel and with the different techniques mentioned above, now you have a couple of new tools in your toolbox to address it effectively.

Have fun and do not let technical debt beat you.

Congratulations! Technical Debt GURU Level Unlocked!

Part of this article came out of a talk I gave about TECHNICAL DEBT recently, you can check the slides:

There is also a sketch that perfectly summarizes the main idea of my talk, courtesy of @lariki and @Miqubel:

And finally a video of my talk at Mobiconf:

Books for reference

Technical Debt… GURU LEVEL UNLOCKED!

2020-03-13T00:00:00+00:00

“Technical Debt is the additional effort and work required to complete any software development.”

Introduction

This is a long article that I wanted to squeeze in a smaller one but it was almost mission impossible to get rid of some important/valuable information. I hope you enjoy and find it helpful.

Feel free to provide feedback, which as usual, is more than welcome.

With that being said, I would like to start with a quote from Robert C. Martin:

“Bad code is always imprudent”

I cannot agree more with this, and no matter what I sell you in this post :), there is NEVER a good reason to write bad code.

The Questions

What is technical debt really?
And Legacy code?
Is there a proportional relationship between them?
How can we measure and determine the healthiness of our project?
And once we measure it, how can we finally address the problem?

Let’s try to answer these questions and also explore in depth different techniques and strategies that will help us effectively deal with it.

Fact: Our Software is Terrible

In an ideal world, a project would be:

Finished on time.
With a clean code design.
Additional features.
Tested twice.
On Budget.

Otherwise, welcome to my world, where we create authentic monsters: giant beasts full of technical debt, legacy code, issues and bugs.

What is Legacy Code?

There are many definitions of legacy code and some of them, in my opinion, contradict themselves, so since you are familiar with the concept, let’s keep it simple:

“Legacy code is code without tests.”

Testing nowadays should be implicit in our engineering process when writing code. So if you are not at least unit testing your codebase, run and do it, it is a command :).

What is Reckless Debt?

I came across this term lately and it looks like we can use it as a synonym of technical debt, but in reality, here is the formal definition:

“Reckless Debt is code that violates design principles and good practices.”

That means that all code generated by us and our team is junk (not done on purpose of course).

Moreover, Reckless Debt will lead to Technical Debt in the short/mid term and it is also a signal that your team needs more training, or you have too many inexperienced or junior developers.

What is Technical Debt?

Here I will rely on Martin Fowler:

“Technical Debt is a metaphor developed by Ward Cunningham to help us think about this problem. Like a financial debt, technical debt incurs interest payments, which come in the form of the extra effort that we have to do in future development because of the quick and dirty design choice. We can choose to continue paying the interest, or we can pay down the principal by refactoring the quick and dirty design into the better design.”

In summary: Technical Debt is the additional effort and work required to complete any software development.

Real case scenario: Adding a new feature

So let’s put our day to day life back into our heads. In this case we have decided to add a new functionality to our project, so here we have 2 well defined options:

The “easy” way, built up with messy design and code, which will get us there way faster: REMEMBER WE NEED TO PAY THE INTEREST.
The “hard” way, built up with cleaner code and a meaningful and consistent architecture. Without a doubt this will take more time but it is going to be more EFFICIENT IN TERMS OF INTEREST COST.

“Accept some short term Technical Debt for tactical reason.”

“No matter what is the reason, but part of this decision to accept technical debt is to also accept the need to pay it down at some point in the future. Having good regression testing assets in place, assures that refactoring accepted technical debt in the future, can be done with low risk.”

Let’s move on and see how we can analyze and inspect our codebase in order to detect technical debt.

ROOKIE Level Unlocked! Static Code Analysis

It is the most basic and fundamental building block when it comes to measuring technical debt at a code level.

Most of us are familiar with this practice since it aims to highlight potential bugs, vulnerabilities and complexity.

But first, in order to interpret the results of static code analysis and quantify technical debt, we need to be familiar with a bunch of code metrics:

Cyclomatic Complexity: stands for the complexity of classes and methods by analyzing the number of functional paths in the code (if clauses for example).
Code coverage: A lack of unit tests is a source of technical debt. This is the amount of code covered by unit tests (we should take this one responsibly, since testing getters and setter can also increase code coverage).
SQALE-rating: Broad evaluation of software quality. The scale goes from A to E, with A being the highest quality.
Number of rules: Number of rules violated from a given set of coding conventions.
Bug count: As technical debt increases, the quality of the software decreases. The number of bugs will likely grow (We can complement this one with information coming from our bug tracker).

Here is a screenshot of one codebase example using the online open source tool SonarQube:

Lint is also a very flexible and popular one (there are plugins for the most popular IDEs and you can write your own custom rules, in this case on Android):

Static code analysis should be our first mandatory step to tackle and measure technical debt.

So let’s make sure we include as a regular practice in our engineering process.

EXPERIENCED Level Unlocked! Tech Debt Radar

A Tech Debt Radar is a very simple tool that has personally given me really good results (while working at @SoundCloud, within the android team, it was (and it is AFAIK) a regular practice).

“We should know that this is not an automated tool (like the ones mentioned above) and I define it as a Social Technical Debt Detector by Experience”.

The idea is pretty simple: all the feedback related to how difficult is to work with the current codebase, comes from actually the developers working with it (by experience).

You can see a Tech Debt Radar in the picture below:

As we can see, there is a board with a few post-its representing each one either a feature or even an area of the codebase which eventually is hard to work with.

Then we have 2 axys:

X: represents the level of pain when working with a specific part of the codebase.
Y: represents how much development time would it take to improve the mentioned piece of code.

At a process level, this is done in a meeting with the development team and a technical debt captain (someone who will be in charge of analysing technical debt).

Basically each member of the team, will have the chance to place these post-its depending on how much pain (X axis) each is causing, and how much development time (Y axis) is required to fix it.

This would be mostly common sense (with strong arguments and an explanation of the whys) in the beginning but I can ensure that it will get better over time with the accumulated experience.

With this in mind you can get some conclusions:

By addressing all features that are painful and at the same time require little development time (the ones placed upper left corner), we will be able to provide a lot of value and improvement by fixing them.
The rest of the functionalities will require some workout to be prioritized and refactored. As a rule of thumb, discuss with the team and use a divide and conquer approach (split up big problem into smaller ones).

Once we gather all this information, we need to keep track of all the collected feedback, so feel free to use your favorite tool for that purpose.

Even a document might do the job: this is a matter of taste, as soon as you have a place to store all this data and see the evolution over time.

A Technical Debt Radar will not provide the level of granularity and details that any other automated tool out there might do, but it is totally worth a try, and a very valuable method that perfectly complements our codebase analysis with the purpose of understanding the most painful spots, and the most important, is that this information comes from us, from the feedback of the people who daily work with the code.

Remember to have these meetings regularly (minimum once every 2-3 weeks) in order to keep an eye on how much progress (positive or negative) has been done.

GURU Level Unlocked! Behavioral Code Analysis

References like Adam Tornhill in the area of human psychology and code are helping us to understand a bit more this social part.

Before continuing, let’s recap what a traditional static code analysis tool can do for us:

…focus on a snapshot of the code as it looks right now.
…find code that is overly complex.
…find code which has heavy dependencies on other parts.

In conclusion, static analysis is a very useful tool and as pointed out above, should be our first step when it comes to code inspection, but there is an important gap to fill in:

“Static analysis will never be able to tell you if that excess code complexity actually matters – just because a piece of code is complex doesn’t mean it is a problem.”

“Adam Thornill: if you pick up a piece of code from your system there’s no way of telling if it has been written by a single developer or if that code is a coordination bottleneck for five development teams. That is, we miss an important piece of information: the people side of code.”

Behavioral code analysis emphasizes trends in the development of our codebase by mining version-control data.

For the purpose of better understanding way more of what we are talking about, we will be diving deeper into this online toolset called Codescene.io, which is free for open source projects.

Needless to say, apart from being a great helper with a nice UI, the platform is mostly based on an open source project called code-maat from the same author.

Let’s see what Codescene is capable of…

Hotspots

In essence, a hotspot is complicated code that you have to work with often.

Its calculation is pretty simple:

With a Hotspot analysis we can get a hierarchical map that lets us analyze our codebase interactively.

By using one of the examples of the platform, we can check the following visualizations where each file is represented as a circle:

As we can see, we can also identify clusters of Hotspots that indicate problematic sub-systems.

By clicking on a Hotspot we can get more details to get deeper information:

The main benefits of a Hotspot analysis include:

Maintenance problems identification: Information on where sits complicated code that we have to work with often. This is useful to prioritize re-designs.
Risk management: It could be risky to change/extend functionality in a Hotspot for example. We can identify those areas up-front and schedule additional time or allocate extra testing efforts.
Defects Detector: It could identify parts of the codebase that seem unstable with lots of development activity.

Here is the full documentation with more details.

Code Biomarkers

In medicine, biomarkers stand for measurements that might indicate a particular disease or physiological state of an organism. We can do the same for code to get a high-level summary of the state of our hotspots and the direction our code is moving in.

Code biomarkers act like a virtual code reviewer that looks for patterns that might indicate problems.

They are scored from A to E where A is the best and E indicates code with severe potential problems.

Let’s have a look at a couple of examples listing risky areas of our code base:

In conclusion we can use Code Biomarkers to:

To decide when it’s time to invest in technical improvements instead of adding new features at a high pace.
Get immediate feedback on improvements.

Same as with hotspots, here is also the biomarkers full documentation.

There is way more to cover in this field like:

“Behavioral code analysis helps you ask the right questions, and points your attention to the aspects of your system – both social and technical – that are most likely to need it. You use this information to find parts of the code that may have to be split and modularized to facilitate parallel development by separate teams, or, find opportunities to introduce a new team into your organization to take on a shared responsibility.”

Where should we focus improvements?
Where are the risk areas in the code?
Any team productivity bottleneck?

I definitely encourage you to give Codescene a try either within an open source repo or within the existent samples, you will be surprised how much curious stuff you find :).

Extra Ball

I would like to introduce an open source repository visualization tool called Gource.

Here is how the author describes it:

“Software projects are displayed by Gource as an animated tree with the root directory of the project at its centre. Directories appear as branches with files as leaves. Developers can be seen working on the tree at the times they contributed to the project.

In essence you can grab your git repository, run gource on it and the result is something like this (This is an example of the Bitcoin repository and its evolution):

The documentation sits at the Gource Github Wiki.

As a trick we have had it in a monitor during sprints to make more visible and transparent how we move around our codebase. Really fun!

Paying Technical Debt

“The best way to reduce technical debt in new projects is to include technical debt in the conversation early on.”

So let’s see how we can deal with these contexts by pointing out a few tips for the action plan.

At Team level:
- Prioritize and keep track of technical debt: During the sprint planning for example.
- Allocate time to address technical debt: Also During the sprint planning or when estimating a task that requires touching a sick part of the code.
- Tech Debt Days: Another great practice where the team spends an entire day only focused on repairing affected code.
At Company level:
- Educate people about its existence: Cost of delay: This metric helps to make visible how much time a team loses due to technical debt.
- Make it transparent: Talk, talk and talk and always bring it up to the table.
- Communicate it properly: An idea would be to add a tech debt update meeting about the current state of it.

As a conclusion, I would like to finish this section with a bunch of quotes from Adam Tornhill (a reference in this field):

“Technical debt can be a frustrating and de-motivating topic for many Development Teams.”

“The keyword is transparency.”

“Explain the cost of low-quality code by using the transparent metaphor of ‘technical debt’.”

“Make technical debt visible in the code using a variety of objective metrics, and frequently re-evaluate these metrics.”

“Finally, make technical debt visible on the Product and/or Sprint Backlog.”

“Don’t hide Technical Debt from the Product Owner and the broader organization.”

Wrapping up

Technical debt is a ticking bomb and as our lovely Batman from 1966 (characterized by Adam West) would say (you can check the full 2 minutes video here, BTW one of my favorite scenes ever):

“Some days you just cannot get rid of a bomb…”

And based on this inspiring quote let me rephrase it to:

“Sometimes it is not easy to get rid of a bomb…”

It is a reality that technical debt exists in 99% of the codebases, it is also an important challenge we must face to keep the healthiness and maintenance of our software projects.

Hopefully there is light at the end of the tunnel and with the different techniques mentioned above, now you have a couple of new tools in your toolbox to address it effectively.

Have fun and do not let technical debt beat you.

Congratulations! Technical Debt GURU Level Unlocked!

Part of this article came out of a talk I gave about TECHNICAL DEBT recently, you can check the slides:

There is also a sketch that perfectly summarizes the main idea of my talk, courtesy of @lariki and @Miqubel:

And finally a video of my talk at Mobiconf: