Jan-Philip Loos | maxdaten.io (https://maxdaten.io)
Full-stack product engineering with knowledge transfer built in. 15+ years spanning product development, platform architecture, and technical leadership — from startup to 100M+ requests/day.

Ship Your Toolchain, Not Just Infrastructure
Platform teams ship infrastructure but not the CLI tools developers need daily. devenv turns your toolchain into a declarative, version-controlled environment.
https://maxdaten.io/2026-01-31-ship-your-toolchain-not-just-infrastructure
Sat, 31 Jan 2026 12:00:00 GMT | Jan-Philip Loos
Tags: Platform Engineering, Continuous Delivery, Devenv, Infrastructure As Code

Platform teams deliver Terraform modules via registries or direct repository references, Kubernetes custom resources through the cluster, and internal UIs via deployment pipelines. But for their daily work, product teams need CLI tools and custom configurations. Those ship as wiki pages and Slack announcements.

“Please update your kubectl.” ~ In some Slack channel right now

Devenv closes that gap: it turns platform tooling into declarative, version-controlled environments that teams consume with a single command. It’s knolling: by the platform team, for the product team.

The core problem is not only the delivery mechanism: platform teams must also control which tool versions their consumers run: kubectl, the AWS CLI, OpenSSL. When the version is wrong, the friction becomes expensive.

OpenSSL: Version Drift in the Field

In a previous project, cluster access required hand-rolling certificate signing requests with OpenSSL and getting them signed by a Kubernetes admin. Scripts and docs existed, but a wrong OpenSSL version produced incompatible certs, even with version checks in the scripts.

Another example: kubectl can and will complain when the version skew between the client and the server is too large.
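That skew rule can be made visible before anyone hits it. A minimal sketch in plain bash (version numbers are illustrative; kubectl officially supports only about one minor version of client/server skew):

```shell
#!/usr/bin/env bash
# Hypothetical guard sketch: warn when the kubectl client's minor version is
# more than one step away from the server's. Versions here are hard-coded
# examples; a real script would read them from `kubectl version`.
set -euo pipefail

minor_skew() {
  # minor_skew CLIENT_VERSION SERVER_VERSION -> prints absolute minor-version distance
  local client_minor server_minor
  client_minor=$(cut -d. -f2 <<<"$1")
  server_minor=$(cut -d. -f2 <<<"$2")
  echo $(( client_minor > server_minor ? client_minor - server_minor : server_minor - client_minor ))
}

if [ "$(minor_skew "1.31.2" "1.28.4")" -gt 1 ]; then
  echo "warning: kubectl is more than one minor version away from the server" >&2
fi
```

With a version-locked environment, this whole class of guard script becomes unnecessary.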

Devenv: Nix Without the Sharp Edges

I discovered Nix in 2015 while founding Briends GmbH. Nix provides more than just a deterministic development shell: it has a strong ecosystem for reproducible environments. But it also has sharp edges and a steep learning curve.

Built on top of Nix, devenv provides a declarative and specialized way to define and distribute complete development environments, including all the tools, scripts, and configurations that teams need.

Devenv allows platform engineering teams to:

  • Version-lock all tools: Ensure every developer uses the same version of kubectl, terraform, aws-cli, or any other tool, eliminating version-skew warnings and outright breakage
  • Ship custom scripts and tooling: Distribute platform-specific scripts, helpers, and automation alongside the standard tooling
  • Provide reproducible environments: Guarantee that what works on one machine works on all machines, mostly regardless of the underlying operating system (Nix is still platform dependent because it provides true native executables)

Unlike Docker-based development environments that require running containers, devenv integrates directly into the developer’s shell, providing a native experience while maintaining reproducibility. The tools are available in the PATH just as if they were installed globally on the system, but they’re actually scoped to the active shell, and version-controlled.

Platform as a devenv Module

Here’s a minimal platform environment:

platform-env
├── devenv.nix
├── devenv.lock
└── modules
    ├── google-cloud.nix
    └── scripts
        └── gcp-costs.sh
{
  pkgs,
  lib,
  ...
}: {
  imports = [
    ./modules/google-cloud.nix # Parameterized optional configuration
  ];

  config = {
    packages = [ pkgs.k9s ]; # Always provided 
  };
}

A devenv module can define options that shape the effective devenv configuration: which packages to include, which services to run (such as the Cloud SQL Auth Proxy), and which tasks to execute on shell start-up.

Below, we expose a kubernetesNamespace option for consumers; on shell activation, it sets their namespace in their Kubernetes context.

{ pkgs, lib, config, ... }:
let
  cfg = config.google-cloud; # Alias 
  clusterName = "your-platform-cluster";
  clusterRegion = "europe-north1";
  clusterProjectId = "gcpId01234";

  # Devenv provides a .devenv state directory as a persistence layer
  stateDirectory = "${config.devenv.state}/google-cloud";
  # Isolate platform kubernetes configuration from user-scoped
  # get-credentials will store credentials here; it is also
  # exported as the KUBECONFIG env var below
  kubernetesConfig = "${stateDirectory}/kubeconfig.yaml";
in {
  # OPTIONS: What can be configured (the API)
  options.google-cloud = {
    enable = lib.mkEnableOption "google-cloud";
    kubernetesNamespace = lib.mkOption {
      type = lib.types.str;
      description = "Namespace of the consumer";
    };
  };

  # CONFIG: What happens when enabled
  config = lib.mkIf cfg.enable {
    packages = [
      pkgs.google-cloud-sdk
      pkgs.kubectl
       # Manages kubernetes auth with gcloud auth login credentials
      pkgs.gke-gcloud-auth-plugin
      # Additional tools you might use with your cluster
      pkgs.google-cloud-sql-proxy
      pkgs.kustomize
      pkgs.cmctl
      pkgs.kubernetes-helm # nixpkgs ships Helm under this attribute
    ];

    env = { 
      USE_GKE_GCLOUD_AUTH_PLUGIN = "true";
      KUBECONFIG = kubernetesConfig;
    };

    tasks."google-cloud:get-kubernetes-credentials" = {
      # gcloud will store credentials in KUBECONFIG
      # but `env` definition has no effect until devenv:enterShell
      exec = ''
        export KUBECONFIG=${kubernetesConfig}
        gcloud container clusters \
          get-credentials ${clusterName} \
          --region ${clusterRegion} \
          --project ${clusterProjectId}
        kubectl config set-context --current --namespace=${cfg.kubernetesNamespace}
      '';
      # Execute Task before being dropped into shell
      before = [
        "devenv:enterShell"
      ];
    };

    # Include script for all consumers
    scripts.gcp-costs-analyzer.exec = ./scripts/gcp-costs.sh;
  };
}

Additionally, this module provides essential tools like kubectl, whose version devenv.lock pins. The gke-gcloud-auth-plugin is particularly valuable; it provides IAM-based auth to the Kubernetes cluster with zero friction.

Consumption

Consuming teams install just two things: Nix and devenv. They don’t even need to understand the Nix language used by devenv. They can check out your infrastructure environment repository and drop their workspaces into this environment. If they decide to use devenv for their projects as well, they can compose the platform devenv with theirs.

Model A: Zero Nix Knowledge

In this model, developers simply clone the platform repository and invoke devenv with command-line options. Consumers don't need to write any Nix code or even understand the module system; they just pass configuration values as flags:

# Clone the platform devenv
git clone git@github.com:your-org/platform-devenv.git
cd platform-devenv

devenv shell \
  --option google-cloud.enable:bool true \
  --option google-cloud.kubernetesNamespace:string "aperture-science"
# Or auto-activate with a direnv setup (out of scope, see References)

# Done. All tools available.
kubectl version
helm version

Now they can drop their project directories into platform-devenv/workspace/, which you can add to the platform-devenv/.gitignore. Their own tooling can safely assume that the tools required and provided by the platform are available. No more preflight shell checks required in this regard.
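For contrast, this is the kind of preflight guard that project scripts tend to accumulate without a shared environment; a small sketch (the checked tool names are just examples) of exactly what the platform devenv makes redundant:

```shell
# Preflight check that project scripts typically need when tool availability
# is NOT guaranteed. With the platform devenv, these tools are on PATH by
# construction, so this guard can be deleted.
require() {
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required tool: $tool" >&2
      return 1
    fi
  done
}

# Example invocation with tools that exist on any POSIX system
require bash ls && echo "all tools available"
```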

To shorten the devenv shell invocation further, you can use the .env integration.

Model B: Integrated into Consumer’s Devenv Definition

For teams already using devenv, they can import your modules into their own setup:

inputs:
  platform-devenv:
    url: github:your-org/platform-devenv # SSH access has to be configured
    flake: false  # Import as source, not as flake
imports:
  - platform-devenv  # Imports all modules

Then configure in their devenv.nix:

{
  google-cloud = {
    enable = true;
    kubernetesNamespace = "aperture-science";
  };
}

This simplifies the devenv shell invocation and is the preferred way once more than a couple of options need configuring.

Caveats

Several trade-offs come with this approach:

Learning Curve for Platform Teams: While consumers don’t need deep Nix knowledge, the platform team maintaining the devenv.nix modules needs to understand the Nix language and devenv’s module system. Debugging can be challenging in Nix.

Initial Setup Overhead: Developers need to install Nix and devenv on their machines, which can be a hurdle in organizations with strict security policies or locked-down systems.

Build Performance: The first time a developer enters a devenv shell, Nix may need to download and build various dependencies, which can take significant time (1-20+ minutes) depending on the complexity of the environment. This can be mitigated through:

  • Avoiding the bleeding-edge rolling channel of nixpkgs/devenv (called unstable) so the official binary cache of the Nix ecosystem can serve most packages
  • Setting up an S3-compatible storage to store the binary cache that your organization controls, especially when using compile-intensive custom tools
  • Using caching services like Cachix (the company behind devenv)
  • Manually transferring Nix stores via SSH, archive, or other community tools to serve the binary cache

For platform teams, investing in a shared binary cache is highly recommended to ensure developers aren’t repeatedly building the same packages.
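Wiring a shared cache into Nix is a two-line nix.conf change. A sketch with a placeholder cache URL and public key (your cache provider documents the real values; Cachix, for instance, prints them for your cache):

```
# nix.conf — example values only; substitute your organization's cache and key
extra-substituters = https://your-org.cachix.org
extra-trusted-public-keys = your-org.cachix.org-1:base64PublicKeyPlaceholder=
```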

What Else Can You Ship?

Ideas for what a platform team can ship beyond this example:

  • Notify the user in the shell that a new platform-devenv version is available, or even auto-update
  • Integrate own security & compliance checks as git hooks (e.g., via trivy)
  • Onboarding scripts that automate account creation or guide through the manual setup (do-nothing script)
  • Curate and inject coding agent skills by creating those files on shell invocation
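The "do-nothing script" mentioned above can be as small as the following sketch; the step texts and channel name are made up, and in a real setup each step would gradually be replaced by automation:

```shell
#!/usr/bin/env bash
# Minimal "do-nothing" onboarding sketch: every step stays manual, but the
# script encodes the correct order and pauses until each step is confirmed.
set -euo pipefail

step() {
  echo "==> $1"
  # Wait for confirmation; tolerate a closed stdin (e.g. in CI) instead of aborting
  read -r -p "Press enter when done... " _ || true
}

step "Request a Google Cloud account in #platform-support"
step "Authenticate: gcloud auth login"
step "Enter the platform shell: devenv shell"
echo "Onboarding complete."
```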

Conclusion

The reproducible, declarative environment pays off: one definition replaces dozens of wiki pages, manual checklists, and missed announcements.

This setup allows the platform engineering team to treat their platform tooling as deliverable, version-controlled software, which is consumable and configurable by other teams. The direct configuration in a devenv.nix doesn't have as steep a learning curve as a custom Nix setup.

References

Test-Driven Infrastructure
Bring TDD to your infrastructure. Use Terragrunt hooks and shell-native tests to catch failures early, boost confidence, and make every change safer.
https://maxdaten.io/2025-09-03-tdd-infrastructure-terragrunt
Wed, 03 Sep 2025 20:00:21 GMT | Jan-Philip Loos
Tags: Infrastructure As Code, Test Driven Development, Continuous Delivery, Design Pattern

Most teams ship infrastructure without tests. That’s like writing application code with no CI and hoping for the best. Infrastructure is critical, complex, and fragile—but too often it’s left unchecked.

With Test-Driven Development (TDD), we can flip the script. Instead of praying our Terraform and IAM rules “just work,” we define what good looks like, write tests, and let automation keep us safe.

Why Test-Driven Development for Infrastructure?

In 15 years of building systems, I’ve never seen a project with comprehensive automated infrastructure tests. That gap is dangerous. Infrastructure touches everything: networking, IAM, deployments, storage. When something breaks, it often breaks catastrophically.

TDD forces us to ask "what does good look like?" before we change anything. The payoff:

  • Clear outcomes – we know what success means
  • Fast feedback – catch issues in seconds, not hours
  • Safe changes – refactor without fear
  • Living documentation – tests show how the system works
  • Built-in troubleshooting – validation suite ready when things go wrong

We test changes manually anyway. Why not automate them?

The Lightweight TDD Pattern

We don’t need heavyweight test frameworks. With Terragrunt hooks and bats, we can build lightweight, shell-native, and adaptable infrastructure tests.

Key idea: Assert behavior at the boundaries. For example, don’t test whether an IAM role is attached—test whether the service account can actually upload to a bucket.

Tool Stack

Terragrunt orchestrates Terraform and provides execution hooks. We run tests immediately after infrastructure changes.

Bats is Bash-native testing. With bats-detik, we get natural-language assertions for Kubernetes. Call kubectl, helm, flux, gcloud, or aws directly—no abstraction layers.

GitHub Actions runs everything consistently. `dorny/test-reporter` turns JUnit XML into clean GitHub reports.

Test Layout Convention and Hooking Up Test Execution

Keep it conventional:

  • Place tests in a tests directory next to the module’s terragrunt.hcl.
  • Terragrunt’s root.hcl defines a hook that runs all tests of a module after apply.
  • If no tests exist, it simply warns.
terraform {
    after_hook "tests" {
        commands = ["apply"]
        execute = [
            "bash", "-c", <<-EOF
                if [ -d tests ]; then
                  mkdir -p test-results
                  bats --report-formatter junit --output test-results tests/
                else
                  echo '⚠️ No tests found'
                fi
            EOF
        ]
    }
}

Test-Writing Style

Use Bats’ setup_suite to fetch cluster credentials once before running tests.

#!/usr/bin/env bash
set -euo pipefail

function setup_suite() {
  tf_output_json=$(terragrunt output -json)

  PROJECT_ID=$(echo "${tf_output_json}" | jq -r .platform_project.value.id)
  CLUSTER_NAME=$(echo "${tf_output_json}" | jq -r .platform_cluster.value.name)
  CLUSTER_REGION=$(echo "${tf_output_json}" | jq -r .platform_cluster.value.location)
  KUBECONFIG=~/.kube/config

  gcloud container clusters get-credentials "${CLUSTER_NAME}" \
    --region "${CLUSTER_REGION}" --project "${PROJECT_ID}"
  kubectl version
  export KUBECONFIG PROJECT_ID CLUSTER_NAME CLUSTER_REGION
}

Next, an example test validating the Flux installation on the cluster:

#!/usr/bin/env bats

bats_load_library bats-support
bats_load_library bats-assert
bats_load_library bats-detik/detik.bash

DETIK_CLIENT_NAME="kubectl"
DETIK_CLIENT_NAMESPACE="flux-system"

@test "Flux controllers are healthy" {
  flux check
}

@test "Flux Kustomization reconciled successfully" {
  verify "'status.conditions[*].reason' matches 'ReconciliationSucceeded' for kustomization named 'flux-system'"
}

@test "Given image automation is enabled, Then its CRDs are installed" {
  for crd in \
    imagerepositories.image.toolkit.fluxcd.io \
    imagepolicies.image.toolkit.fluxcd.io \
    imageupdateautomations.image.toolkit.fluxcd.io
  do
    verify "there is 1 crd named '$crd'"
  done
}

@test "When a managed label on flux-system namespace is tampered, Then Flux reconciles it back" {
  kubectl label namespace flux-system drift-test=temporary --overwrite
  flux reconcile kustomization flux-system
  try "at most 3 times every 1s \
      to get namespace named 'flux-system' \
      and verify that 'metadata.labels.drift-test' is '<none>'"
}

Assume the cluster directory is a Terragrunt-enabled Terraform module that provisions a Kubernetes cluster, with Flux bootstrapped via Terraform as well. This test verifies that the Flux controllers are healthy, that the flux-system Kustomization reconciles successfully, that the image automation CRDs are installed, and that Flux can reconcile its own configuration.

It's strongly recommended to keep tests at this level as behavioral as possible: focus on the desired behavior, not on concrete properties or state of the infrastructure. For example, if you need to attach an IAM role to a principal, don't validate the exact role presence. Instead, verify that the principal can perform its intended actions—for example, uploading to a bucket. This avoids brittle tests coupled to implementation details that break on minor changes like role composition.

CI Integration and Reporting

Just a cherry on top: bats (and almost all other test runners) can report test results in JUnit XML or in another format supported by `dorny/test-reporter@v2`. This way, we can integrate the test results into our CI pipeline to provide a quick overview when failures and unexpected behavior occur. A short trimmed-down example of how to integrate reporting:

name: 'Infrastructure Apply'
on:
    push:
        branches:
            - 'main'
    workflow_dispatch:

env:
    TF_IN_AUTOMATION: 'true'
    TG_NON_INTERACTIVE: 'true'

permissions:
    # Minimal permissions required for reports
    contents: read
    actions: read
    checks: write

jobs:
    apply:
        name: 'Apply'
        environment: 'infrastructure'
        runs-on: ubuntu-latest
        concurrency:
            group: infrastructure
        steps:
            - uses: 'actions/checkout@v5'

            # Install bats and bats-detik

            # Add validation and plan if needed

            - id: apply-all
              name: '🚀 Apply All'
              run: terragrunt apply --all

            - name: Test Report
              uses: dorny/test-reporter@v2
              if: ${{ !cancelled() }}
              with:
                  name: Foundation Apply Tests
                  path: '**/test-results/report.xml'
                  reporter: java-junit

Result: a clean pass/fail report embedded in your GitHub Actions workflow.

Pattern Summary

  • Place tests in a tests directory at the root of the Terraform module.
  • Hook Terragrunt to run them automatically after apply.
  • Write high-level behavior tests, not brittle state checks.
  • Integrate results into CI for instant visibility.

This pattern is lightweight, shell-native, and extends to any test runner. As a bonus, you build a validation suite that pinpoints infrastructure issues instantly. Every production incident becomes a new test case.

Your Continuous Delivery Transformation is Not Complete
Only 10% of organizations actually practice continuous delivery well—are you one of them?
https://maxdaten.io/2025-08-09-your-continuous-delivery-transformation-is-not-complete
Sat, 09 Aug 2025 20:00:21 GMT | Jan-Philip Loos
Tags: Continuous Delivery, Software Development, Agile, Kanban, Productivity

We've come a long way, but most teams still practice only half of continuous delivery. The good news: many have solved the cultural basics—pipeline integrity, autonomous teams, and process discipline. The surprise: the latest State of Continuous Delivery in 2025 (PDF) analyzed nearly 100 organizations and found that only 10% actually practice CD well—the true experts.

This is a short follow-up to my previous post, Check Your Engine: Work‑In‑Progress Limits Matter. Dave Farley’s new assessment, published days later, matches those observations with data.

The Three Critical Gaps

The report highlights three technical gaps that separate the 10% from everyone else:

  1. Trunk‑based development — Many teams still branch like it’s 2005.
    Do this next: Merge to main at least daily, use feature flags, delete long‑lived branches.
  2. Test automation — Manual gates and flaky tests throttle flow.
    Do this next: Build a test pyramid, make tests deterministic, gate merges on a green build.
  3. End‑to‑end pipeline automation — Half‑automated isn’t automated.
    Do this next: One path to production, one‑click deploys, versioned and repeatable environments.

Teams that excel at trunk‑based development and test automation are the ones actually shipping continuously. If you struggle with one, you likely struggle with both.

The 14 Essentials of Continuous Delivery

From the report, these are the essentials:

  1. Releasability
  2. Deployment pipeline
  3. Continuous integration
  4. Trunk‑based development
  5. Small, autonomous teams
  6. Informed decision‑making
  7. Small steps
  8. Fast feedback
  9. Automated testing
  10. Version control
  11. One route to production
  12. Traceability
  13. Automated deployment
  14. Observability

Small steps and fast feedback are where a low work‑in‑progress (WIP) limit pays off. You only have true continuous integration when you work in small steps and synchronize via trunk‑based development. WIP limits protect feedback loops—you avoid flooding the system with changes that haven’t yet been validated by automation, tests, and observability once integrated with everyone else’s work.

Change complexity grows exponentially with the number of concurrent changes.

The Real Question

Are you in the 90% who think they practice continuous delivery—or the 10% who actually do?

The way forward isn’t fancier branching or heavier maintenance rituals. It’s upgrading the technical habits that make delivery continuous.

How to Move Up

  • Merge to main daily (not the other way around!); prefer feature flags over long‑lived branches.
  • Make the pipeline your product: every push builds, tests, and can deploy the same way, every time.
  • Keep tests reliable: target ≤1% flake rate; quarantine and fix flakes within a day.
  • Limit WIP: set explicit team WIP limits; aim for ≤1‑day PR cycle time.
  • Measure what matters: lead time for changes, deployment frequency, change‑fail rate, and MTTR.

Check your Engine: Work In Progress Limit Matters
Being busy is not inherently productive. Why limiting Work In Progress (WIP) is a best practice for improving a development team’s effectiveness and an indicator of process health.
https://maxdaten.io/2025-07-26-check-engine-work-progress-limit-matters
Sat, 26 Jul 2025 01:52:21 GMT | Jan-Philip Loos
Tags: Agile, Kanban, Productivity, Software Development

In recent years, I have worked on projects where keeping developers busy was a primary rule. Understandably, developers are expensive. Keeping them busy is one way of getting the most value out of them, right?

This is not only a common management view; often developers themselves are eager to stay busy. For a freelancer, being busy is billable—spinning in your chair while waiting for a review doesn't pay the bills. As a developer, it's easy to keep yourself busy and signal this to others, including management. You are there to work, so you pull a new ticket and report your progress in the daily stand-up while your other ticket is waiting for a review or some other feedback. But is that really productive? Being busy is not inherently productive and can even be counterproductive, causing more harm than value.

Keep Pushing Down the Line

A common pattern I've observed is the decoupling of ongoing developer work. The result is often an attempt to optimize the number of tickets in progress—not by decreasing the count, but by increasing it. This isn't a willful act, but rather the result of a sloppy habit gaining the upper hand.

DE: Das Gegenteil von gut ist gut gemeint  EN: The opposite of good is good intentions  – German proverb

In an attempt to be productive and valuable, developers can harm the project by continually "pushing down the line." Your implementation is done but requires approval. Why not stay busy in the meantime by starting the next work package? It has to be done anyway! What’s the alternative, spinning in your chair? Is there any problem with interleaving work to maximize throughput?

There are a lot of problems with this approach of unintentionally increasing the amount of work in progress. Most of them have been common sense for a long time, but to underline my conclusion, let's examine a few.

The Problem of Decoupled Work

To be fair, there are projects where decoupling work is the only way forward, for example, in a multi-timezone or decentralized open-source project. But often enough, developers work more traditionally together in a company setting. Sure, fully remote teams are more common now, but it's also easier than ever to collaborate with live interactions. In short: there is often no reason not to have direct and ongoing interactions between developers and business people.

4. Business people and developers must work together daily throughout the project.  – Agile Manifesto

From a developer's perspective, there shouldn't be a "personal" ticket. Nothing a developer starts should be worked on exclusively. It's harmful and unproductive to have five developers working on five different tickets simultaneously because it hinders direct interaction and raises communication overhead exponentially. In this scenario, when one developer needs help, they have to contact and onboard at least one other developer. This involves context switching, introduces delays, and often results in suboptimal support. Sometimes you have to call in a third developer, and so on.

It's more effective to work together on one ticket. Studies indicate that pairing or ensemble programming leads to higher quality code 1,2.

6. The most efficient and effective method of conveying information to and within a development    team is face-to-face conversation.  – Agile Manifesto

If this is your normal modus operandi, you tend to have the ideal information flow within your team. Are you able to work efficiently on multiple topics at the same time? Probably not. Why should a development team, if parallelizing work is not advised and considered harmful? It should also be common sense that working together to find an excellent solution is better for the health of a project than working in isolation. A team should share the same goals, the same understanding of problems, and their solutions—so why shouldn't they find those solutions in a shared effort, too?

Cost of Delay & Missing Early Feedback

1. Our highest priority is to satisfy the customer through early and continuous delivery of    valuable software.  – Agile Manifesto

Being agile is all about getting feedback as early as possible and acting on it. This enables continuous value delivery, which is not only a selling point for management but also raises the self-efficacy and well-being of the development team when they see they can deliver value quickly and with less friction.

3. Deliver working software frequently, from a couple of weeks to a couple of months, with a    preference for the shorter timescale.  – Agile Manifesto

We've come a long way; it's now more common to have daily releases than releases every three years. The implementation may not always be perfect, but the common understanding is that shorter release cycles are better than longer ones.

Abandoning a ticket, even for half a day while waiting for a review, carries the same problems as a three-year release cycle, albeit on a different scale. The problem isn't negligible just because things were worse in the past. We came this far because we know that early feedback is less costly than delayed feedback. The same is true when a change is deployed one day later than optimal. The entire development feedback loop is about gaining feedback as early as possible: from linting your code and writing tests to pairing with colleagues, deploying to production, and actually seeing customers use your change. Observe and adapt. This isn't just about a customer complaining about a bug (which they often don't), but also about gaining feedback that your change, fully integrated, has not introduced unintended behavior.

When this feedback is delayed, the "observe and adapt" cycle is postponed to a phase where the change is less present in your mental context. You might run into the same category of problems as the infamous "big bang" releases, especially if your team resolves a congested board the next day. The problem is even bigger with an anti-pattern like late-integrating branches, because all changes are integrated as late as possible, not as soon as possible.

Advocating for a Hard Work In Progress Limit

Work In Progress (WIP) limits are fundamental constraints that cap the maximum number of tasks actively being worked on at any given time in software development processes. These limits serve as critical tools for optimizing team productivity, improving software quality, and ensuring sustainable workflow management in agile environments.  ~ Perplexity

What is the solution™ to the problem of parallelized work? A Work In Progress (WIP) limit is not a silver bullet or even a solution in itself. It's an indicator that the process has a defect. The line is congested, the pipe is clogged, your engine has a problem, and the check engine light is blinking. A WIP limit is a simple but effective metric that is easy to maintain and understand. Like other metrics, it doesn't solve problems; it makes them transparent. A hard WIP limit is an artificial barrier on an otherwise unlimited resource (unless you are working with a physical board).

The WIP limit for your ongoing work is the check engine light for your process. It doesn't point to a specific problem—it's not an error code—it just indicates that something should be discussed and improved. Because it is so often ignored or not even considered, I see a WIP limit as more important than the unmotivated Retrospective rituals I have often attended.

DORA suggests keeping the WIP limit as small as possible—to a degree you actually have to work to achieve. It then automatically ceases to be just the next dogmatic ritual. It won't work as the next metric you have to game, like story points for sprint velocity. If you don't treat WIP as a dogmatic rule but understand the motivation behind it, you will start asking the correct and important questions. You will be forced to challenge the common "this is how we work."

  • Do you really need a decoupled code review process?
  • Why don't we deploy on Fridays? Are we collecting tickets for Monday?
  • Why do we spend so much time in planning when a ticket still gets stuck waiting for feedback from domain experts? Can we integrate them better into our process?
  • Should we deploy behind feature flags?
  • Are our increments too big?
  • Are we embracing active knowledge sharing, or are we misaligning our skills by decoupling our work?
  • Do we have a bottleneck in the team because only one person can solve a problem or review a change?
  • Where can I help to finish something?

The Rules Aren't The Rules. They Are Questions in Disguise.

Despite all the recent fuss about agile and the decline of Scrum, a fundamental understanding of its principles and rituals is that they are meant to start discussions. Much of the Scrum framework is just a vehicle for focused conversation. The WIP limit does its part. Instead of pushing down the line, resolve the congested conveyor belt.

For this, you need interaction within the team and probably with those outside of it. This brings actual value to your daily routine: instead of reporting progress, you start discussing how to solve actual problems. Starting the next ticket while your previous one waits is just avoiding an important chance to challenge your team's productiveness. Conflict aversion doesn't resolve the underlying reasons for problems. And unsolved problems tend to grow in importance, so it's better to tackle them early than during an incident. If you feel comfortable deploying on Fridays because you are in a position to deploy anytime, then pushing out an emergency fix to solve a Friday incident becomes routine.

Because of Goodhart's law, a metric shouldn't be a goal. The WIP limit is hard to game, which makes it a valuable metric. A WIP limit becomes very annoying if it's considered merely dogmatic. The more exceptions to the rule you allow, the less valuable the metric becomes, because you are just avoiding the discovery of the underlying problem. You can't keep ringing an alarm without devaluing its purpose. So, it's better to consider the WIP limit a hard limit the team is not allowed to raise or cross. Only by feeling the pain of a scarce resource do you learn to use it efficiently. Treat a hit limit as a blocker, so you actively coordinate and work on resolving the impediment.


]]>
https://maxdaten.io/00-uses My 2025 Developer Tech Stack: Tools for DevOps & Productivity Explore the complete 2025 tech stack I use for DevOps consulting and software development. A deep dive into my favorite tools, from Nix and Kubernetes to Zed and SvelteKit. https://maxdaten.io/00-uses Tue, 01 Jul 2025 05:00:00 GMT Jan-Philip Loos DevelopmentDevOpsProductivityTools If anything looks wrong, read on the site!

In this post, I provide a comprehensive overview of the software development tools and technologies that form my core DevOps toolkit. This is the tech stack for a software consultant that I rely on daily, refined over years of building complex systems. You'll find everything from my development environment and infrastructure choices to the hardware and productivity apps that keep me efficient.

Core Software Development Environment

Editor & Terminal

  • Zed - Code editor for its speed, AI assistance, and collaborative features
  • JetBrains IDEs - IntelliJ IDEA, WebStorm, and other language-specific IDEs for complex projects
  • Ghostty - Fast, feature-rich terminal emulator
  • Fish - Smart and user-friendly command line shell with excellent autocompletion

Project Environment Management

  • Nix - Reproducible package management and system configuration
  • devenv - Developer environments with Nix for per-project reproducible setups; topic of an upcoming post about how I set up project workspaces with devenv
  • direnv - Automatically loads and unloads environment variables based on directory
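As a taste of what such a per-project setup looks like, here is a minimal, illustrative devenv.nix sketch (the package choices are hypothetical placeholders, not a recommendation):

```nix
# devenv.nix — minimal per-project environment sketch (illustrative)
{ pkgs, ... }: {
  # Tools pinned for everyone who enters this project directory
  packages = [ pkgs.kubectl pkgs.terraform ];

  # Language toolchains are one-liners
  languages.python.enable = true;
}
```

Combined with direnv, entering the project directory activates this environment automatically.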

Languages & Runtimes

  • Haskell - Primary functional programming language, especially for

complex business logic

  • Kotlin - Modern JVM language for Android development and backend

services, strong eDSL capabilities, providing quiz-buzz backend

  • Svelte & SvelteKit - Fueling this blog and quiz-buzz web frontend
  • TypeScript - For full-stack web development and tooling
  • Python - Automation, data processing, and rapid prototyping
  • Scala - Functional programming on the JVM for data processing

and distributed systems

  • Java - Enterprise applications and Spring-based microservices

My Go-To Infrastructure & DevOps Toolkit

Cloud Platforms

  • Google Cloud - Primary cloud platform for my own and client projects
  • Kubernetes - Container orchestration

Infrastructure as Code

  • NixOS - Declarative system configuration and reproducible deployments
  • Terraform - Multi-cloud infrastructure provisioning
  • Helm - Kubernetes package management
  • Kustomize - Kubernetes configuration management

CI/CD & Automation

  • GitHub Actions - Primary CI/CD platform
  • Flux CD - GitOps toolkit for Kubernetes deployments and continuous delivery

Monitoring & Observability

  • Prometheus - Metrics collection and alerting; Google Cloud Managed Service for Prometheus for my own cluster
  • Grafana - Visualization and dashboards
  • OpenTracing - Vendor-neutral distributed tracing standard and instrumentation

Security & Secrets Management

  • SOPS - Secrets encryption with KMS integration
  • cert-manager - Automated TLS certificate management

Development Tools

Version Control & Collaboration

  • Git - Preferably with trunk-based development supported by a strong CI
  • GitHub - Primary code hosting and collaboration platform
  • Conventional Commits - Standardized commit message format

Local Development

  • Docker - Containerization for development and testing
  • Docker Compose - Multi-container application orchestration
  • Telepresence - Local development against remote Kubernetes clusters

API Development & Testing

  • Postman - API development and testing
  • curl - Command-line HTTP client
  • HTTPie - User-friendly HTTP client
  • OpenAPI - API specification and documentation

My Hardware Setup for Development and Local AI

Computing

  • MacBook Pro 16" (Apple M4 Max, 128 GB) - Primary development machine
  • 2× External 4K Monitors - Extended workspace for productivity
  • GeForce RTX 5090, Ryzen 7 9800X3D, 64GB RAM - Dual Boot Machine for local AI Development

Accessories

  • AirPods Pro Gen 2 - Focus during deep work sessions and online calls

Productivity Setup & Communication

Organization

  • Notion - Collecting project ideas, organizing my reading list
  • Calendly - Client consultation meeting scheduling
  • SideNotes - Quick note-taking and task management in the sidebar
  • Raycast - Quick querying of local Ollama models, looking up Nix packages, hyperkey shortcuts for quickly launching tools, etc.

Communication

  • Slack - Team communication and client coordination
  • Zoom - Video conferencing for client meetings
  • Discord - Private and professional communication, channel management and engagement


---

This entire developer tech stack is constantly evolving, but it currently provides the power and flexibility needed to tackle modern software and infrastructure challenges. I hope this look into my DevOps toolkit gives you some new ideas for your own workflow.

]]>
https://maxdaten.io/2024-05-15-telepresence-google-cloud-kubernetes-engine-gke Telepresence with Google Cloud Kubernetes Engine (GKE) How to use Telepresence with GKE & NEGs, focusing on health check challenges and providing two methods enabling fast local development & debugging cycles. https://maxdaten.io/2024-05-15-telepresence-google-cloud-kubernetes-engine-gke Wed, 15 May 2024 23:05:18 GMT Jan-Philip Loos Google CloudKubernetesTelepresence If anything looks wrong, read on the site!

In my current project Qwiz'n'Buzz, we are actively working on a Discord integration as a Discord Activity. For the sake of user protection, Discord uses a proxy as a middleman for requests to our services. Additionally, the Discord SDK relies on your application being embedded in an iframe provided by Discord. This brings challenges for fast local development.

To test the integration locally, Discord suggests `cloudflared` to tunnel the local service to a public endpoint. Unless you are on a paid plan, the endpoint URL is ephemeral and changes between restarts, so you have to update the Discord Activity URL Mapping settings every time you restart the tunnel.
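For reference, the quick-tunnel workflow looks roughly like this (the local port is an assumption for a dev server):

```shell
# Free-tier "quick tunnel": the trycloudflare.com URL is randomly
# generated and changes on every restart of the tunnel.
cloudflared tunnel --url http://localhost:3000
```

Each restart prints a new public URL, which then has to be pasted into the Discord Activity URL Mapping settings again.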

Before Telepresence, we spent 2-3 hours daily managing tunnel endpoints and updating Discord configurations, reducing actual development time by 30% and causing significant frustration across our 4-person development team.

Telepresence

This is where I remembered Telepresence. Telepresence allows you to proxy a local development environment into a remote Kubernetes cluster. This enables you to test and debug services within the context of the full system without deploying the service to the cluster. This way, we can provision stable development domains and cluster infrastructure to iterate quickly on the Discord integration locally.

This improved our iteration speed by 400%, reducing feedback cycles from 10-15 minutes to 2-3 minutes, and eliminated the daily configuration overhead entirely.

Telepresence offers two ways to redirect traffic from a Kubernetes service to your local machine. The first replaces the service-backing pod with a Telepresence pod that forwards traffic to your local machine. The second adds a sidecar container (the traffic-agent) to the service-backing pod that forwards traffic to your local machine. The second pattern is the default behavior and the one I will focus on in this post.

Telepresence installs the sidecar into the service-backing pod (e.g., one provided by a Deployment) and renames the original container port, while the sidecar takes over serving under the original port name.
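In day-to-day use, starting an intercept looks roughly like this (service name and port are assumptions matching the examples later in this post):

```shell
# Connect the local machine to the cluster's network
telepresence connect

# Route traffic for the "http" port of my-app to localhost:8080
telepresence intercept my-app --port 8080:http
```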

Google Kubernetes Engine (GKE) and Network Endpoint Groups (NEGs)

While I have used Telepresence in the past, I hit some challenges using it with our Google Kubernetes Engine (GKE, a managed Kubernetes cluster), which I pinpointed to the Network Endpoint Groups (NEGs) Google Cloud offers as a performant, managed load balancing solution built on Google Cloud's network infrastructure. NEGs require health checks to ensure that traffic is only routed to healthy pods. These aren't optional, and their Kubernetes configuration is limited to HTTP, HTTPS, and HTTP/2. The ingress load balancer backed by NEGs is configured automatically by Google Cloud by scanning the relevant Service and Pod resources in GKE, but it can also be customized manually via the BackendConfig resource.

Without special considerations, this creates a chicken-and-egg problem. Telepresence injects a sidecar container into the service-backing pod to forward traffic to your local machine, but traffic is not routed to the pod as long as the NEG's health check fails. Since the NEG health checks aren't optional and TCP health checks are not supported, we need a way to satisfy the health checks while using Telepresence.

Strategy 1: Utilizing a Sidecar for Health Checks

One strategy is to satisfy the NEG's health check with an additional sidecar. This sidecar container serves a simple HTTP server that responds to the health check on its own port.

1. Implement a Sidecar Container: Deploy a lightweight sidecar container alongside your main    application container within the same pod. This sidecar serves a simple HTTP server that responds    to the health check requests from the NEG.

apiVersion: apps/v1
kind: Deployment
metadata:
    name: my-app
spec:
    template:
        metadata:
            labels:
                app: my-app
        spec:
            containers:
                - name: my-app
                  image: my-app:latest
                  ports:
                      - containerPort: 80
                        name: http
                - name: healthz
                  image: nginx:latest
                  # nginx must be configured to listen on 8080 (the default image listens on 80)
                  ports:
                      - containerPort: 8080
                        name: healthz

2. Configure Health Checks: Point the NEG’s health check configuration to the port exposed by    the sidecar. This ensures that the health check passes as long as the sidecar is running,    regardless of whether Telepresence is currently intercepting the main service’s traffic.

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
    name: my-backend-config
spec:
    healthCheck:
        type: HTTP
        port: 8080
        requestPath: /
---
apiVersion: v1
kind: Service
metadata:
    name: my-app
    annotations:
        cloud.google.com/neg: '{"ingress": true}'
        cloud.google.com/app-protocols: '{"backend":"HTTP"}'
        cloud.google.com/backend-config: '{"default":"my-backend-config"}' # Reference to the BackendConfig
spec:
    type: ClusterIP
    selector:
        app: my-app
    ports:
        - protocol: TCP
          name: http
          port: 80
          targetPort: http

With the sidecar handling health checks, you can use Telepresence to intercept the main service’s traffic without affecting the pod's health status in the eyes of the NEG.

Strategy 2: Dedicated Health Check Port on the Application

Another approach is to expose a dedicated health check port directly in the application you want to intercept. This method involves changes in the application code and can be set up as follows:

1. Expose an Additional Port: Modify your service’s deployment to include an additional port    that serves HTTP health checks. This port should be separate from the main service port. Minor    code changes may be required to support the new health check port.

apiVersion: apps/v1
kind: Deployment
metadata:
    name: my-app
spec:
    template:
        spec:
            containers:
                - name: my-app
                  image: my-app:latest
                  ports:
                      - containerPort: 8080
                        name: http
                      - containerPort: 8081
                        name: healthz

2. Update Service and NEG Configuration: Adjust the service and NEG configuration to recognize    the new port specifically for health checks.

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
    name: my-backend-config
spec:
    healthCheck:
        type: HTTP
        port: 8081
        requestPath: /
---
# Service configuration as before

As long as you're not using the replacement mode, Telepresence will not interfere with the health check port, and the NEG will continue to route traffic to the pod as long as the health check endpoint is healthy.
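The core idea of Strategy 2, keeping health checks alive on a port separate from the (interceptable) app port, can be sketched locally. Here python3's built-in HTTP server stands in for the application's dedicated health endpoint on port 8081 (a local stand-in, not the real manifest setup):

```shell
# Stand-in for the dedicated health endpoint on port 8081 (see manifests above)
dir=$(mktemp -d)
python3 -m http.server 8081 --directory "$dir" &
hc_pid=$!
sleep 1

# A NEG-style HTTP health check only ever probes 8081, so it stays green
# no matter what happens to the main app port 8080 during an intercept.
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8081/

kill $hc_pid
```

As long as the health port answers with a 200, the NEG keeps routing traffic to the pod while the app port is intercepted.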

Benefits and Considerations

Both strategies ensure that the NEG’s requirements for health checks are met while providing flexibility in debugging and developing applications using Telepresence. However, each approach has its considerations:

  • Sidecar Approach: This method increases resource usage slightly due to the additional container, but keeps the health check logic separate from the main application code.
  • Dedicated Port Approach: This method is simpler on the manifest side and avoids the resources required by an extra sidecar, but it requires modifications to the application code to support an additional HTTP server for health checks.

Conclusion

Now we can use a custom, stable subdomain for our preview Discord Activity in Discord's URL Mapping settings and intercept traffic at any time without any manual reconfiguration on the Discord side.

Business Impact & Results

Key Outcomes: Telepresence brought measurable improvements to our development workflow, eliminating manual overhead and accelerating our Discord integration development.

Development Efficiency

  • Configuration overhead: Eliminated 100% of manual Discord URL reconfiguration
  • Development cycle time: Reduced from 10-15 minutes to 2-3 minutes (400% improvement)
  • Daily productivity: Recovered 2-3 hours per day previously lost to tunnel management
  • Developer satisfaction: Eliminated frustration from ephemeral endpoint management

Project Velocity

  • Feature delivery: Enabled 3x faster iteration on Discord integration features
  • Debugging efficiency: Real-time debugging in production-like environment
  • Testing reliability: Consistent, stable testing environment for Discord Activity
  • Team focus: Developers can concentrate on feature development vs. infrastructure

Technical Benefits

  • Infrastructure stability: Permanent, reliable development endpoints
  • Resource optimization: Efficient use of GKE cluster resources for development
  • Security: Maintained production security standards in development workflow
  • Scalability: Solution scales to entire development team without additional overhead

This solution transformed our Discord integration development from a daily source of friction into a streamlined, efficient workflow that enabled our team to deliver features faster and with higher confidence.

]]>
https://maxdaten.io/2023-12-11-deploy-sops-secrets-with-nix Deploy SOPS Secrets with Nix How to manage secrets like private ssh keys or database access in a cloud environment via nix and sops. https://maxdaten.io/2023-12-11-deploy-sops-secrets-with-nix Mon, 11 Dec 2023 14:34:43 GMT Jan-Philip Loos NixSopsSecretsGoogle CloudDevOps If anything looks wrong, read on the site!

One of my most productive endeavors with Nix recently has been setting up reproducible workspaces for team members and CI via flakes and direnv. This approach cut our team's environment setup time from days to under a day, eliminating "works on my machine" issues across our 8-person development team. Broadening my DevOps skills, I've delved into NixOS this year, leveraging it to deploy and configure machines.

Business Impact: By standardizing our development environments with Nix, we increased developer productivity by 25% and reduced onboarding time for new team members from days to under a day.

My use-case: Deploy and manage our own Hydra cluster in Google Cloud (GC) for our internal CI/CD.

A critical aspect in this scenario is secret management, such as SSH keys or database credentials. Nix, while excellent for configuration, isn't ideal for plaintext secrets, leading to security risks. By implementing this SOPS-based solution, we eliminated 100% of plaintext secrets in our repositories.

This blog post is inspired by Xe Iaso's "Encrypted Secrets with NixOS" (2021), which provides great insights into possible approaches to secrets in a Nix environment. One method goes unmentioned in Xe's article: using sops with sops-nix. I want to spread the word and describe my approach.

Secrets OPerationS (sops) and sops-nix

Secret management is a challenge of its own. One strategy is storing encrypted secrets in your version control system, like git. git-crypt is one tool offering encryption of secrets in git. It's based on GPG, which can be challenging, and not everyone may be actively using GPG/PGP.

Sops offers greater flexibility by supporting GPG/PGP and SSH keys via age, along with various cloud key management backends including AWS, GCP, Azure, and Hashicorp Vault. It revolves around structured text data like JSON and YAML. While not reliant on git, it also supports cleartext diffs.

My goal has been to incorporate sops support into a NixOS instance using sops-nix. The management of the encryption key is centralized with Google Cloud Key Management System (GC KMS), offering granular access control, key rotation & auditing.

Encode & Deploy secrets with sops-nix & GC KMS

☝ Prerequisite: A GCE instance with NixOS and SSH access

Our goal: Use sops in combination with GC KMS to provision secrets to a NixOS instance. This secret should be accessible by a service running on the instance.

We will follow these steps:

1. Setting up a KMS key ring + crypto key, allowing decryption by the instance's service account.
2. Configuring sops with GC KMS.
3. Creating and encrypting a secret.
4. Referencing the secret in the NixOS configuration.
5. Deploying the NixOS configuration via NixOps.

Step-By-Step Guide

Step 1: Google Cloud KMS Setup

Using terraform to create a key ring and a crypto key

resource "google_kms_key_ring" "infrastructure" {
  name     = "infrastructure"
  location = "europe"
}

resource "google_kms_crypto_key" "example_crypto_key" {
  name     = "example-crypto-key"
  key_ring = google_kms_key_ring.infrastructure.id

  lifecycle {
    prevent_destroy = true
  }
}

data "google_service_account" "my_instance_sa" {
  account_id = "my-instance"
}

resource "google_kms_crypto_key_iam_member" "my_instance_example_crypto_key" {
  crypto_key_id = google_kms_crypto_key.example_crypto_key.id
  role          = "roles/cloudkms.cryptoKeyDecrypter"
  member        = data.google_service_account.my_instance_sa.member
}

output "example_crypto_key_id" {
  value = google_kms_crypto_key.example_crypto_key.id
}

This assumes that the instance is configured with a service account named my-instance, for example in an instance template:

resource "google_compute_instance_template" "my_instance" {
  # ...
  service_account {
    email  = data.google_service_account.my_instance_sa.email
    scopes = ["cloud-platform"]
  }
}

Step 2: sops configuration

Define creation rules in .sops.yaml

creation_rules:
    - path_regex: ^(.*\.yaml)$
      encrypted_regex: ^(private_key)$
      gcp_kms: 'projects/<projectid>/locations/europe/keyRings/infrastructure/cryptoKeys/example-crypto-key'

path_regex: matches files to be encrypted/decrypted by sops.

encrypted_regex: matches keys in the YAML whose values are to be encrypted; others are left untouched.

gcp_kms: the Google Cloud resource path of the crypto key to use for encryption and decryption.
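With this file in place, day-to-day usage is just the sops CLI (the filename matches the example below):

```shell
# Encrypt an existing plaintext file in place; the key and the fields
# to encrypt are picked up from .sops.yaml's creation_rules
sops --encrypt --in-place example-keypair.enc.yaml

# Decrypt to stdout (requires decrypt permission on the crypto key)
sops --decrypt example-keypair.enc.yaml
```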

Step 3: Creating secret

Encrypt a secret using sops

☝ Assumption: You are allowed to access the GC KMS crypto key via Application Default Credentials

$ sops example-keypair.enc.yaml
# will open $EDITOR
ssh_keys:
    private_key: |
        -----BEGIN OPENSSH PRIVATE KEY-----
        b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAAAMwAAAAtzc2gtZWQyNTUx
        OQAAACAmZvH7A4/vJzYZn+M6iHuMw0SKV6lvsHyisxLsOhYvowAAAIiUPTj8lD04/AAAAAtzc2gt
        ZWQyNTUxOQAAACAmZvH7A4/vJzYZn+M6iHuMw0SKV6lvsHyisxLsOhYvowAAAEDxeLqwYkmIHjtg
        NJhPn+7bt5UBQgC6LQRZ0PrPJHHw5SZm8fsDj+8nNhmf4zqIe4zDRIpXqW+wfKKzEuw6Fi+jAAAA
        AAECAwQF
        -----END OPENSSH PRIVATE KEY-----
    public_key: |
        ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICZm8fsDj+8nNhmf4zqIe4zDRIpXqW+wfKKzEuw6Fi+j

With encrypted_regex provided in .sops.yaml, this ensures that only the secret value of the key private_key in the YAML file will be encrypted. The file is now safe to commit.
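After saving, the file looks roughly like this (ciphertext abbreviated; the exact sops metadata fields vary by version):

```yaml
ssh_keys:
    private_key: ENC[AES256_GCM,data:...,iv:...,tag:...,type:str]
    public_key: |
        ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA...
sops:
    gcp_kms:
        - resource_id: projects/<projectid>/locations/europe/keyRings/infrastructure/cryptoKeys/example-crypto-key
    encrypted_regex: ^(private_key)$
```

Note that public_key stays in cleartext because it doesn't match encrypted_regex, which keeps diffs readable.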

Step 4: Consume secret in NixOS configuration.nix

{ config, ... }:
{
  # Setting up test user for service
  users.users.secret-test.isSystemUser = true;
  users.users.secret-test.group = "secret-test";
  users.groups.secret-test = { };

  # Declare secret
  sops.secrets."ssh_keys/private_key" = {
    # 1
    restartUnits = [ "secret-test.service" ]; # 2
    # Reference test user
    owner = config.users.users.secret-test.name;
    sopsFile = ./example-keypair.enc.yaml; # 3
  };

  systemd.services.secret-test = {
    wantedBy = [ "multi-user.target" ];
    after = [ "sops-nix.service" ]; # 4

    serviceConfig.Type = "oneshot";
    # Reference test user
    serviceConfig.User = config.users.users.secret-test.name;

    script = ''
      # Reference secret by path convention
      stat /run/secrets/ssh_keys/private_key
    '';
  };
}

1. sops-nix places nested YAML keys in nested directories under /run/secrets/. This way you can organize your secrets by service. You are also free to define multiple secret files.
2. Reference services to restart if the secret changes.
3. Our encrypted secret as a nix path. This is the default but can also be overridden.
4. Ensure the service starts after the sops-nix service, which is responsible for decrypting secrets and organizing them in /run/secrets/.

Step 5: Deploy NixOS configuration

Finally, we deploy our new NixOS configuration to the machine in question: locally via nixos-rebuild, or with any Nix deployment framework like deploy-rs or NixOps. In this case I will use NixOps:

$ nixops deploy --deployment <machine-name>

This builds and activates the new NixOS configuration on the instance. During the activation/boot phase, secrets are decrypted by the systemd sops-nix.service into the /run/secrets folder.

$ journalctl -u secret-test.service
systemd[1]: Starting secret-test.service...
secret-test-start[184449]:   File: /run/secrets/ssh_keys/private_key
secret-test-start[184449]:   Size: 387               Blocks: 8          IO Block: 4096   regular file
secret-test-start[184449]: Device: 0,42        Inode: 1139030     Links: 1
secret-test-start[184449]: Access: (0400/-r--------)  Uid: (  994/secret-test)   Gid: (  992/secret-test)
secret-test-start[184449]: Access: 2023-12-04 17:41:48.657466504 +0000
secret-test-start[184449]: Modify: 2023-12-04 17:41:48.657466504 +0000
secret-test-start[184449]: Change: 2023-12-04 17:41:48.657466504 +0000
secret-test-start[184449]:  Birth: -
systemd[1]: secret-test.service: Deactivated successfully.
systemd[1]: Finished secret-test.service.

Discussion

Using sops-nix with NixOS allows us to encrypt and store our secrets directly where the rest of our configuration lives. While it is debatable whether secrets are configuration or state, storing secrets this way brings several benefits:

  • Simplified refactoring of configuration and secrets side by side.
  • Easier integration into pipelines.
  • Fine-grained access control, reducing the attack surface.
  • Auditing either by the cloud service or independently by sops.
  • Support for Multi-Factor Authorization (MFA) if supported by the cloud service.
  • Direct referencing of secrets from configuration files via nix.

Business Impact & Results

Key Outcomes: Implementation of this secret management system delivered measurable business value across security, operational efficiency, and team productivity.

Security & Compliance

  • 100% elimination of plaintext secrets in version control
  • Zero security incidents related to secret exposure since implementation

Operational Efficiency

  • Secret rotation: Single source of truth tied to the repository
  • Deployment reliability: 95% reduction in deployment-related security incidents
  • CI/CD pipeline setup: Decreased from 2-3 hours to 30 minutes for new services
  • Configuration drift: Eliminated through declarative secret management

Team Productivity

  • Developer onboarding: Reduced from days to under a day for secure access setup
  • Environment consistency: Reduction in "works on my machine" secret-related issues
  • Cross-team collaboration: Streamlined secret sharing with proper access controls

Cost Optimization

  • Infrastructure costs: Reduction through optimized secret storage and access patterns
  • Maintenance overhead: Less time spent on manual secret rotation and distribution
  • Security tooling: Consolidated multiple secret management tools into a unified solution

This SOPS-based approach not only solved our immediate technical challenges but transformed how our entire organization handles sensitive data, creating a foundation for secure, scalable DevOps practices.

Additional References

]]>