DEV Community: Michael Nikitochkin The latest articles on DEV Community by Michael Nikitochkin (@miry). https://dev.to/miry How I Deployed Woodpecker CI on Fedora IoT Michael Nikitochkin Mon, 26 Jan 2026 20:26:44 +0000 https://dev.to/miry/how-i-deployed-woodpecker-ci-on-fedora-iot-4pbh <p>I wanted a self-hosted CI/CD system that was lightweight and container-native, without the overhead of enterprise solutions. This is the story of how I deployed <strong>Woodpecker CI</strong><sup id="fnref1">1</sup> on my <strong>Fedora IoT</strong><sup id="fnref2">2</sup> server. Along the way, I had to navigate DNS conflicts, SELinux hurdles, and the challenge of secure external access.</p> <p>In this setup, I used <strong>OpenTofu</strong><sup id="fnref3">3</sup> for infrastructure automation and <strong>Cloudflare Tunnel</strong><sup id="fnref4">4</sup> to bridge the gap between my local network and the web.</p> <h2> Table of Contents </h2> <ul> <li>Why I Chose Woodpecker CI</li> <li>My Infrastructure Layout</li> <li>The Road to Deployment</li> <li>Step 1: Handling Security &amp; Secrets</li> <li>Step 2: Bridging with Cloudflare Tunnel</li> <li>Step 3: Orchestrating the Server</li> <li>Step 4: Taming the Agent &amp; Networking</li> <li>How I Verified Everything</li> <li>Final Thoughts &amp; Troubleshooting</li> </ul> <h2> Why I Chose Woodpecker CI </h2> <p>I needed a platform that felt modern but stayed out of my way. 
<strong>Woodpecker CI</strong> fit the bill because:</p> <ul> <li> <strong>Isolation</strong>: I liked that every build runs in its own ephemeral container.</li> <li> <strong>Integration</strong>: It has seamless support for the GitHub OAuth flow I already use.</li> <li> <strong>Simplicity</strong>: I could define my pipelines in a familiar <code>.woodpecker.yaml</code> format.</li> </ul> <blockquote> <p><strong>Researching the bits:</strong> I spent some time with the <strong>Woodpecker</strong> architecture docs<sup id="fnref5">5</sup> to understand how the server and agent communicate over gRPC.</p> </blockquote> <h2> My Infrastructure Layout </h2> <p>I decided to use <strong>Cloudflare Tunnel</strong> so I wouldn't have to touch my firewall or handle SSL certificates manually. My <strong>Woodpecker Server</strong> acts as the brain, while the <strong>Agent</strong> does the heavy lifting via the <strong>Podman</strong> socket.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tzpopngw9htfxzln2fi.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tzpopngw9htfxzln2fi.png" alt="System Architecture Diagram" width="800" height="800"></a></p> <h3> My Core Decisions </h3> <ul> <li> <strong>DNS</strong>: I solved the port 53 conflict by using a custom bridge network for internal resolution.</li> <li> <strong>Access</strong>: I mapped <code>https://ci.homelab.example</code> directly to my local instance.</li> </ul> <h2> The Road to Deployment </h2> <p>My process involved four main phases. 
I used <strong>OpenTofu</strong> to make the deployment repeatable and <strong>Podman Quadlets</strong><sup id="fnref6">6</sup> to manage the container lifecycles as systemd services.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ylfejfc32u8m9vt444.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ylfejfc32u8m9vt444.png" alt="Deployment Flow Diagram" width="800" height="800"></a></p> <h2> Step 1: Handling Security &amp; Secrets </h2> <p>I started with the security groundwork. First, I set up a new OAuth application on GitHub to handle authentication.</p> <h3> My GitHub OAuth Setup </h3> <ol> <li>I went to <strong>GitHub settings</strong> &gt; <strong>Developer settings</strong> &gt; <strong>OAuth Apps</strong> &gt; <strong>New OAuth App</strong><sup id="fnref7">7</sup>.</li> <li> <strong>Homepage URL</strong>: <code>https://ci.homelab.example</code> </li> <li> <strong>Authorization callback URL</strong>: <code>https://ci.homelab.example/authorize</code> </li> <li>I made sure to store the <strong>Client ID</strong> and <strong>Client Secret</strong> securely for later.</li> </ol> <h3> Generating the Agent Secret </h3> <p>The server and agent need a shared secret to talk to each other. I generated a secure random string using openssl:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>openssl rand <span class="nt">-hex</span> 32 </code></pre> </div> <h2> Step 2: Bridging with Cloudflare Tunnel </h2> <p>I chose to use a tunnel because it's much simpler than managing port forwarding on my router. 
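</p> <p>The appeal, for me, is that the entire edge configuration collapses into a few lines of ingress rules. As a minimal sketch (the tunnel ID below is a placeholder), the <code>config.yml</code> that this step ultimately produces simply maps the public hostname to the local Woodpecker port:</p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code>tunnel: TUNNEL_ID
credentials-file: /etc/cloudflared/TUNNEL_ID.json
ingress:
  - hostname: ci.homelab.example
    service: http://localhost:8000
  - service: http_status:404
</code></pre> </div> <p>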
</p> <h3> My Initial Authentication </h3> <p>I had to run this once on my Fedora IoT server to link it to my Cloudflare account:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>cloudflared tunnel login </code></pre> </div> <h3> Automation with OpenTofu </h3> <p>I wrote an OpenTofu resource to automate the installation of <code>cloudflared</code>, create the tunnel, and set up the systemd service. Here is the configuration I used:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight hcl"><code><span class="nx">resource</span> <span class="s2">"null_resource"</span> <span class="s2">"setup_cloudflare_tunnel"</span> <span class="p">{</span> <span class="nx">connection</span> <span class="p">{</span> <span class="nx">type</span> <span class="p">=</span> <span class="s2">"ssh"</span> <span class="nx">user</span> <span class="p">=</span> <span class="s2">"admin"</span> <span class="nx">host</span> <span class="p">=</span> <span class="s2">"192.168.1.100"</span> <span class="nx">private_key</span> <span class="p">=</span> <span class="nx">file</span><span class="p">(</span><span class="s2">"~/.ssh/id_ed25519"</span><span class="p">)</span> <span class="p">}</span> <span class="nx">provisioner</span> <span class="s2">"file"</span> <span class="p">{</span> <span class="nx">content</span> <span class="p">=</span> <span class="nx">local</span><span class="p">.</span><span class="nx">tunnel_script_content</span> <span class="nx">destination</span> <span class="p">=</span> <span class="s2">"/tmp/setup_cloudflare_tunnel.sh"</span> <span class="p">}</span> <span class="nx">provisioner</span> <span class="s2">"remote-exec"</span> <span class="p">{</span> <span class="nx">inline</span> <span class="p">=</span> <span class="p">[</span> <span class="s2">"chmod +x /tmp/setup_cloudflare_tunnel.sh"</span><span class="p">,</span> <span class="s2">"/tmp/setup_cloudflare_tunnel.sh"</span><span class="p">,</span> <span 
class="p">]</span> <span class="p">}</span> <span class="p">}</span> <span class="nx">locals</span> <span class="p">{</span> <span class="nx">tunnel_script_content</span> <span class="p">=</span> <span class="o">&lt;&lt;-</span><span class="no">EOF</span><span class="sh"> #!/bin/bash set -euo pipefail # Fedora IoT uses rpm-ostree, install cloudflared from binary if missing if ! command -v cloudflared &amp;&gt; /dev/null; then ARCH=$(uname -m) case $ARCH in x86_64) DOWNLOAD_URL="https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64" ;; aarch64) DOWNLOAD_URL="https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-arm64" ;; *) echo "❌ Unsupported architecture: $ARCH"; exit 1 ;; esac curl -L "$DOWNLOAD_URL" -o /tmp/cloudflared sudo install -m 755 /tmp/cloudflared /usr/local/bin/cloudflared fi # Create tunnel and route DNS (requires manual 'cloudflared tunnel login' first) TUNNEL_NAME="woodpecker" cloudflared tunnel create "$TUNNEL_NAME" || true TUNNEL_ID=$(cloudflared tunnel list | grep "$TUNNEL_NAME" | awk '{print $1}') cloudflared tunnel route dns "$TUNNEL_NAME" ci.homelab.example 2&gt;/dev/null || true # Setup system user and config sudo useradd --system --home /var/lib/cloudflared --shell /usr/sbin/nologin cloudflared 2&gt;/dev/null || true sudo mkdir -p /etc/cloudflared /var/lib/cloudflared sudo cp "$HOME/.cloudflared/$TUNNEL_ID.json" /etc/cloudflared/ sudo chown -R cloudflared:cloudflared /etc/cloudflared /var/lib/cloudflared sudo tee /etc/cloudflared/config.yml &gt; /dev/null &lt;&lt;CONFIG tunnel: $TUNNEL_ID credentials-file: /etc/cloudflared/$TUNNEL_ID.json ingress: - hostname: ci.homelab.example service: http://localhost:8000 - service: http_status:404 CONFIG # Setup and start Systemd service sudo tee /etc/systemd/system/cloudflared.service &gt; /dev/null &lt;&lt;SERVICE [Unit] Description=Cloudflare Tunnel After=network.target [Service] Type=simple User=cloudflared Group=cloudflared 
ExecStart=/usr/local/bin/cloudflared tunnel --config /etc/cloudflared/config.yml run Restart=on-failure [Install] WantedBy=multi-user.target SERVICE sudo systemctl daemon-reload sudo systemctl enable --now cloudflared </span><span class="no">EOF </span><span class="p">}</span> </code></pre> </div> <p>My shell script (which I managed in a <code>local</code> block) handles the heavy lifting of installing the binary and configuring the <code>cloudflared</code> system user.</p> <h2> Step 3: Orchestrating the Server </h2> <p>For the Woodpecker Server, I used a <strong>Podman Quadlet</strong><sup id="fnref6">6</sup> container. This allowed me to manage the container-native service directly through systemd.</p> <h3> My Quadlet Configuration (<code>woodpecker-server.container</code>) </h3> <div class="highlight js-code-highlight"> <pre class="highlight ini"><code><span class="nn">[Container]</span> <span class="py">ContainerName</span><span class="p">=</span><span class="s">woodpecker-server</span> <span class="py">Image</span><span class="p">=</span><span class="s">docker.io/woodpeckerci/woodpecker-server:v3</span> <span class="py">PublishPort</span><span class="p">=</span><span class="s">8000:8000</span> <span class="py">PublishPort</span><span class="p">=</span><span class="s">9000:9000</span> <span class="py">Volume</span><span class="p">=</span><span class="s">/var/lib/woodpecker/server:/var/lib/woodpecker:Z</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_ADMIN=admin</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_HOST=https://ci.homelab.example</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_GITHUB=true</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_GITHUB_CLIENT=Iv1.a629723b814c123e</span> <span class="py">Environment</span><span class="p">=</span><span 
class="s">WOODPECKER_GITHUB_SECRET=ghs_1234567890abcdef1234567890abcdef12345678</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_AGENT_SECRET=a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456</span> <span class="nn">[Service]</span> <span class="py">Restart</span><span class="p">=</span><span class="s">always</span> <span class="nn">[Install]</span> <span class="py">WantedBy</span><span class="p">=</span><span class="s">multi-user.target</span> </code></pre> </div> <h3> My Automation Flow </h3> <p>I used OpenTofu to push the container file and refresh the systemd daemon. This made it easy to iterate on my configuration.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight hcl"><code><span class="nx">resource</span> <span class="s2">"null_resource"</span> <span class="s2">"woodpecker_server"</span> <span class="p">{</span> <span class="c1"># ... connection details ...</span> <span class="nx">provisioner</span> <span class="s2">"file"</span> <span class="p">{</span> <span class="nx">content</span> <span class="p">=</span> <span class="nx">local</span><span class="p">.</span><span class="nx">woodpecker_server_config</span> <span class="nx">destination</span> <span class="p">=</span> <span class="s2">"/etc/containers/systemd/woodpecker-server.container"</span> <span class="p">}</span> <span class="nx">provisioner</span> <span class="s2">"remote-exec"</span> <span class="p">{</span> <span class="nx">inline</span> <span class="p">=</span> <span class="p">[</span> <span class="s2">"sudo mkdir -p /var/lib/woodpecker/server"</span><span class="p">,</span> <span class="s2">"sudo chown $USER:$USER -R /var/lib/woodpecker"</span><span class="p">,</span> <span class="s2">"sudo systemctl daemon-reload"</span><span class="p">,</span> <span class="s2">"sudo systemctl enable --now woodpecker-server.service"</span><span class="p">,</span> <span class="p">]</span> <span class="p">}</span> <span 
class="p">}</span> </code></pre> </div> <h2> Step 4: Taming the Agent &amp; Networking </h2> <p>This was the trickiest part of my journey. I had to deal with SELinux and an annoying DNS conflict.</p> <h3> My DNS Port 53 Conflict </h3> <p>Since I run <strong>PiHole</strong> on the same machine, it was listening on <code>0.0.0.0:53</code>. This completely blocked Podman from starting its own internal DNS service for my build networks.</p> <p><strong>The symptoms I saw:</strong></p> <ul> <li>My agent containers refused to start.</li> <li>I found <code>"Address already in use (os error 98)"</code> in the logs.</li> </ul> <p><strong>How I fixed it:</strong></p> <ol> <li> <strong>I reconfigured PiHole</strong>: I forced it to bind only to my LAN and localhost IPs.</li> <li> <strong>I disabled Podman DNS</strong>: I made a global change to Podman's config so it wouldn't try to claim port 53.</li> </ol> <h4> My Podman Fix (<code>/etc/containers/containers.conf</code>) </h4> <div class="highlight js-code-highlight"> <pre class="highlight ini"><code><span class="nn">[network]</span> <span class="py">dns_enabled</span> <span class="p">=</span> <span class="s">false</span> </code></pre> </div> <h3> My Agent Configuration </h3> <p>I also had to disable SELinux labels for the agent so it could talk to the Podman socket without being blocked.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight ini"><code><span class="nn">[Container]</span> <span class="py">ContainerName</span><span class="p">=</span><span class="s">woodpecker-agent</span> <span class="py">Image</span><span class="p">=</span><span class="s">docker.io/woodpeckerci/woodpecker-agent:v3</span> <span class="py">User</span><span class="p">=</span><span class="s">root</span> <span class="py">SecurityLabelDisable</span><span class="p">=</span><span class="s">true</span> <span class="py">Volume</span><span class="p">=</span><span class="s">/run/podman/podman.sock:/var/run/docker.sock</span> <span 
class="py">Volume</span><span class="p">=</span><span class="s">/etc/containers:/etc/containers:ro</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_SERVER=192.168.1.100:9000</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_AGENT_SECRET=a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_BACKEND=docker</span> </code></pre> </div> <h2> How I Verified Everything </h2> <p>Once the services were up, I ran a quick status check and set up my first pipeline.</p> <h3> 1. Verification </h3> <p>I checked that my three core services were healthy:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nb">sudo </span>systemctl status cloudflared woodpecker-server woodpecker-agent </code></pre> </div> <h3> 2. My First Pipeline </h3> <p>I logged into my new dashboard at <code>https://ci.homelab.example</code> and added this <code>.woodpecker.yaml</code> to one of my repos:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">steps</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">test</span> <span class="na">image</span><span class="pi">:</span> <span class="s">alpine:latest</span> <span class="na">commands</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">echo "Hello from Woodpecker CI!"</span> </code></pre> </div> <p>Watching the first green checkmark appear was a great feeling.</p> <h2> Final Thoughts &amp; Troubleshooting </h2> <p>The biggest hurdle for me was definitely the DNS conflict. If you find your builds can't resolve hostnames, check if Podman is fighting another service for port 53.</p> <p>I'm really happy with how this turned out. 
It's a clean, efficient CI/CD setup that runs perfectly on my Fedora IoT hardware.</p> <ol> <li id="fn1"> <p><a href="proxy.php?url=https://woodpecker-ci.org/" rel="noopener noreferrer">Woodpecker CI</a> ↩</p> </li> <li id="fn2"> <p><a href="proxy.php?url=https://iot.fedoraproject.org/" rel="noopener noreferrer">Fedora IoT</a> ↩</p> </li> <li id="fn3"> <p><a href="proxy.php?url=https://opentofu.org/" rel="noopener noreferrer">OpenTofu</a> ↩</p> </li> <li id="fn4"> <p><a href="proxy.php?url=https://developers.cloudflare.com/cloudflare-one/connections/connect-apps/" rel="noopener noreferrer">Cloudflare Tunnel</a> ↩</p> </li> <li id="fn5"> <p><a href="proxy.php?url=https://woodpecker-ci.org/docs/development/architecture" rel="noopener noreferrer">Woodpecker CI Documentation</a> ↩</p> </li> <li id="fn6"> <p><a href="proxy.php?url=https://docs.podman.io/en/latest/markdown/podman-systemd.unit.5.html" rel="noopener noreferrer">Podman Quadlet Guide</a> ↩</p> </li> <li id="fn7"> <p><a href="proxy.php?url=https://github.com/settings/applications/new" rel="noopener noreferrer">GitHub New OAuth Application</a> ↩</p> </li> </ol> tofu fedora cloudflare cicd From 4 Minutes to 3 Seconds: How Database Transaction Rollback Revolutionized Test Suite Michael Nikitochkin Sat, 24 Jan 2026 16:15:32 +0000 https://dev.to/miry/from-4-minutes-to-3-seconds-how-database-transaction-rollback-revolutionized-test-suite-4olh <h2> Executive Summary </h2> <p>In a single afternoon, I transformed my <strong>Crystal/Marten</strong> test suite from a 4-minute ordeal into a 3-second sprint by replacing expensive database <code>TRUNCATE</code> with lightning-fast <strong>transaction rollback</strong>. 
This 98.8% performance improvement didn't just make developers happier—it fundamentally changed how I approach testing.</p> <p><strong>The Bottom Line:</strong> 447 tests now run in <strong>2.84 seconds</strong> instead of <strong>245.87 seconds</strong>—an <strong>86.5x speedup</strong> that makes <strong>test-driven development (TDD)</strong> practical again.</p> <h2> The Problem: When Tests Become a Bottleneck </h2> <h3> The Performance Crisis </h3> <p>The test suite was destroying developer productivity. Every pull request meant waiting, every bug fix meant coffee breaks. With truncation-based test isolation, the team was hemorrhaging time:</p> <p><strong>Individual test example:</strong> <code>UserTest#test_email_validation</code> - <strong>0.527 seconds</strong></p> <h3> The Root Cause: Truncation Hell </h3> <p>The culprit was my test isolation strategy: <strong>database truncation</strong>. Before each test, I was:</p> <ol> <li> <strong>Dropping and recreating data</strong> across 20+ tables</li> <li> <strong>Performing expensive I/O operations</strong> that scaled with data size</li> </ol> <p>Each test was paying a <strong>~500ms</strong> tax just to clean up after itself. With 447 tests, that's over 3 minutes of pure overhead.</p> <h2> The Solution: Transaction Rollback Strategy </h2> <p>Instead of physically deleting data and resetting sequences, I could simply <strong>wrap each test in a database transaction</strong> and <strong>always roll back</strong>. 
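</p> <p>The database-level behavior is easy to see in isolation. Here is a minimal sketch using the <code>sqlite3</code> CLI purely for illustration (the project itself runs PostgreSQL, but the <code>BEGIN</code>/<code>ROLLBACK</code> semantics shown are the same): rows written inside a rolled-back transaction leave no trace.</p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>db=$(mktemp)
sqlite3 "$db" "CREATE TABLE users (email TEXT);"
# Everything between BEGIN and ROLLBACK is discarded
sqlite3 "$db" "BEGIN; INSERT INTO users VALUES ('a@example.com'); ROLLBACK;"
sqlite3 "$db" "SELECT COUNT(*) FROM users;"   # prints 0
rm -f "$db"
</code></pre> </div> <p>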
<strong>PostgreSQL</strong> transactions are designed for exactly this—atomic operations that can be discarded instantly.</p> <h3> How It Works </h3> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># The core insight: Wrap ONLY the test execution</span> <span class="k">def</span> <span class="nf">run_one</span><span class="p">(</span><span class="nb">name</span> <span class="p">:</span> <span class="no">String</span><span class="p">,</span> <span class="nb">proc</span> <span class="p">:</span> <span class="no">Test</span> <span class="o">-&gt;</span><span class="p">)</span> <span class="p">:</span> <span class="no">Nil</span> <span class="c1"># 1. Setup runs OUTSIDE transaction</span> <span class="n">before_setup</span> <span class="n">setup</span> <span class="n">after_setup</span> <span class="c1"># 2. Test runs INSIDE transaction</span> <span class="no">Marten</span><span class="o">::</span><span class="no">DB</span><span class="o">::</span><span class="no">Connection</span><span class="p">.</span><span class="nf">default</span><span class="p">.</span><span class="nf">transaction</span> <span class="k">do</span> <span class="nb">proc</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="nb">self</span><span class="p">)</span> <span class="c1"># Run the actual test</span> <span class="k">raise</span> <span class="no">Marten</span><span class="o">::</span><span class="no">DB</span><span class="o">::</span><span class="no">Errors</span><span class="o">::</span><span class="no">Rollback</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="s2">"Test cleanup"</span><span class="p">)</span> <span class="k">end</span> <span class="c1"># 3. 
Teardown runs AFTER rollback</span> <span class="n">before_teardown</span> <span class="n">teardown</span> <span class="n">after_teardown</span> <span class="k">end</span> </code></pre> </div> <p><strong>Why This Is Magical:</strong></p> <ol> <li> <strong>Setup/teardown run once</strong> - Database schema loading, migrations, etc.</li> <li> <strong>Only test data is transactional</strong> - All changes disappear instantly</li> <li> <strong>No I/O overhead</strong> - <code>rollback</code> is just a memory operation</li> <li> <strong>Clean isolation</strong> - Each test gets a fresh slate automatically</li> </ol> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b69eoniw4u3xvk519b0.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b69eoniw4u3xvk519b0.png" alt=" " width="800" height="533"></a></p> <h2> Performance Transformation </h2> <h3> Dramatic Individual Test Speedup </h3> <p><strong>Example: <code>User::ValidateTest#test_email_validation</code></strong></p> <div class="table-wrapper-paragraph"><table> <thead> <tr> <th>Metric</th> <th>Before</th> <th>After</th> <th>Improvement</th> </tr> </thead> <tbody> <tr> <td>Test Duration</td> <td>0.527s</td> <td>0.002s</td> <td><strong>263x faster</strong></td> </tr> </tbody> </table></div> <h3> Test Suite Revolution </h3> <div class="table-wrapper-paragraph"><table> <thead> <tr> <th>Metric</th> <th>Before (Truncation)</th> <th>After (Transaction)</th> <th>Improvement</th> </tr> </thead> <tbody> <tr> <td>Total Duration</td> <td>00:04:05.872s (245.87s)</td> <td>00:00:02.841s (2.84s)</td> <td><strong>86.5x faster</strong></td> </tr> <tr> <td>Runs per Second</td> <td>0.00407 runs/s</td> <td>0.352 
runs/s</td> <td><strong>86.5x improvement</strong></td> </tr> <tr> <td>Test Count</td> <td>447 tests</td> <td>447 tests</td> <td>Same coverage</td> </tr> </tbody> </table></div> <h3> What This Means for Developers </h3> <p><strong>Before (Truncation Hell):</strong></p> <ul> <li>"I'll run tests while grabbing coffee" - waiting kills productivity</li> <li>1.8 tests per second - glacial feedback</li> <li>4+ minute feedback loop - context switching inevitable</li> <li> <strong>TDD feels painful</strong> - testing becomes optional</li> </ul> <p><strong>After (Transaction Magic):</strong></p> <ul> <li>"I'll run tests before every commit" - instant gratification</li> <li>9,450 tests per minute - blazing speed</li> <li>3-second feedback loop - you're still thinking about the code</li> <li> <strong>TDD feels effortless</strong> - testing becomes second nature</li> </ul> <h2> Technical Deep Dive </h2> <h3> The Problem: Expensive Database Surgery </h3> <p>Each test paid roughly 40-60ms of truncation overhead per table:</p> <ul> <li> <strong>TRUNCATE operations:</strong> 20-30ms (disk I/O)</li> <li> <strong>Sequence resets:</strong> 10-15ms (catalog updates)</li> <li> <strong>Connection overhead:</strong> 5-10ms</li> </ul> <p>Across the 20+ truncated tables, that is the ~500ms per-test tax measured above; with 447 tests, it adds up to over 3 minutes of pure cleanup time.</p> <h3> The Solution: Instant Rollback </h3> <p>Transaction rollback costs ~2-4ms per test (more than a 99% reduction in per-test cleanup cost):</p> <ul> <li> <strong>Transaction start:</strong> 1-2ms (memory allocation)</li> <li> <strong>Rollback:</strong> 1-2ms (memory discard)</li> <li> <strong>No disk I/O:</strong> Pure memory operation</li> </ul> <h2> Implementation Blueprint </h2> <p>Override <strong>Minitest</strong>'s <code>run_one</code> method to wrap test execution in database transactions:</p> <ol> <li> <strong>Setup OUTSIDE transaction</strong> - Database schema and reference data loads once per test suite</li> <li> <strong>Test INSIDE transaction</strong> - Each test runs atomically and always rolls back</li> <li> <strong>Teardown AFTER rollback</strong> - Clear caches (<code>Marten::Cache</code>, converted models, email collectors, <strong>WebMock</strong>)</li> </ol> <p><strong>Why Override <code>run_one</code> Instead of Using Standard Hooks?</strong></p> <p>The core issue is how <strong>Marten</strong>'s <code>DB.transaction</code> works<sup id="fnref1">1</sup>:</p> <ul> <li> <strong>Marten</strong> uses <code>yield</code> to execute code inside a transaction block</li> <li>It sets thread-local variables to track the active transaction connection</li> <li>Individual database operations check these thread variables to reuse the transaction connection</li> <li>This pattern requires wrapping the <em>entire test execution</em> from the outside</li> </ul> <p>Standard <strong>Minitest</strong> lifecycle hooks have a fatal limitation<sup id="fnref2">2</sup>, as there is no <code>around_run</code> or wrapper mechanism.</p> <p>By overriding <code>run_one</code>, we control the <em>order</em> of operations and can properly use <code>yield</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>1. Setup (OUTSIDE transaction) ← Reference data persists 2. Begin transaction with yield 3. Test code (INSIDE transaction) ← Thread variables track connection 4. Rollback transaction 5. 
Teardown (OUTSIDE transaction) ← Clear caches, email collectors </code></pre> </div> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4djsu6cuxt86l0mw0r20.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4djsu6cuxt86l0mw0r20.png" alt=" " width="800" height="387"></a></p> <h2> References </h2> <h3> Related Articles in My CI/CD Optimization Journey </h3> <p>This article is part of my ongoing challenge to optimize tests and CI/CD pipelines for <strong>Crystal</strong> projects. If you're interested in the full optimization story, check out my previous articles:</p> <ul> <li> <a href="proxy.php?url=https://dev.to/miry/crystal-minitest-and-the-shutdown-order-problem-jcn">Crystal Minitest and the Shutdown Order Problem</a> - Understanding test execution lifecycle</li> <li> <a href="proxy.php?url=https://dev.to/miry/optimizing-crystal-build-time-in-woodpecker-ci-415s-to-196s-with-caching-1o5k">Optimizing Crystal Build Time in Woodpecker CI: 415s to 196s with Caching</a> - Build acceleration strategies</li> <li> <a href="proxy.php?url=https://dev.to/miry/speeding-up-postgresql-in-containers-1eeg">Speeding Up PostgreSQL in Containers</a> - Database performance in Containers</li> </ul> <h3> Source Code References </h3> <ol> <li id="fn1"> <p><a href="proxy.php?url=https://github.com/martenframework/marten/blob/02e37d55bbf680bafa6c7b065871a45df68ae2ee/src/marten/db/connection/base.cr#L137" rel="noopener noreferrer">Marten DB Connection - Transaction Implementation</a> ↩</p> </li> <li id="fn2"> <p><a href="proxy.php?url=https://github.com/ysbaddaden/minitest.cr/blob/6d41b570f52e1b424aa5053dae88e2a1014e5bd1/src/test.cr#L28" rel="noopener 
noreferrer">Minitest - <code>run_one</code> Implementation</a> ↩</p> </li> </ol> testing database performance crystal Crystal Minitest and the Shutdown Order Problem Michael Nikitochkin Sat, 24 Jan 2026 11:26:17 +0000 https://dev.to/miry/crystal-minitest-and-the-shutdown-order-problem-jcn <blockquote> <p><strong>TL;DR</strong>: Optimizing <em>Minitest</em> setup by moving initialization from <code>before_setup</code> to module-level broke logging: <code>Log::AsyncDispatcher</code> creates resources early, and they're cleaned up before tests run (which happens in <code>at_exit</code>). Solution: use <code>DirectDispatcher</code> for tests instead.</p> </blockquote> <h2> Story: The Optimization and Discovery </h2> <p>When setting up a <em>Marten</em><sup id="fnref1">1</sup> application to work with <em>Minitest</em><sup id="fnref2">2</sup>, I moved all initialization instructions to <code>before_setup</code> hooks:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="k">class</span> <span class="nc">Minitest</span><span class="o">::</span><span class="no">Test</span> <span class="k">def</span> <span class="nf">before_setup</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">setup</span> <span class="k">if</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">apps</span><span class="p">.</span><span class="nf">app_configs</span><span class="p">.</span><span class="nf">empty?</span> <span class="c1"># Runs before EVERY test</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Spec</span><span class="p">.</span><span class="nf">setup_databases</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>After tests were working reliably, I optimized: why run this repetitive setup for every test? 
If I moved it to module-level initialization, it would run once when the test file loads:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="nb">require</span> <span class="s2">"minitest/autorun"</span> <span class="nb">require</span> <span class="s2">"../src/project"</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">setup</span> <span class="c1"># Runs once at module load</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Spec</span><span class="p">.</span><span class="nf">setup_databases</span> </code></pre> </div> <p>Tests passed initially. But when I ran with <code>DEBUG=1</code> to enable verbose output, something broke:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">DEBUG</span><span class="o">=</span>1 crystal run <span class="nb">test</span>/users_test.cr </code></pre> </div> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>Channel::ClosedError: Channel is closed /usr/share/crystal/src/channel.cr:142:8 in 'send' /usr/share/crystal/src/log/dispatch.cr:55:7 in 'dispatch' </code></pre> </div> <p>The logging channel was closed. I reverted to <code>before_setup</code> as a workaround, but I needed to understand: Why does <em>when</em> setup runs matter more than <em>that</em> it runs?</p> <h2> Investigation: The Key Discovery </h2> <h3> Why Minitest Is Different </h3> <p><em>Minitest</em> doesn't run tests during normal program execution. 
Instead:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># minitest/src/autorun.cr:8-11</span> <span class="nb">at_exit</span> <span class="k">do</span> <span class="nb">exit</span><span class="p">(</span><span class="no">Minitest</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span><span class="no">ARGV</span><span class="p">))</span> <span class="c1"># Tests run during shutdown!</span> <span class="k">end</span> </code></pre> </div> <p>This creates a timing problem. With module-level setup:</p> <ol> <li> <code>Marten.setup</code> initializes a Log<sup id="fnref3">3</sup> instance with <code>Log::AsyncDispatcher</code><sup id="fnref4">4</sup> at boot</li> <li> <code>Log::AsyncDispatcher</code> spawns a background fiber with a channel</li> <li>Main program completes → <em>Crystal</em> runtime cleanup begins</li> <li>Channel may be closed by garbage collection (this is my assumption)</li> <li> <code>at_exit</code> fires → <em>Minitest</em> runs tests</li> <li>Tests call <code>Log.info</code> → channel already closed → <code>Channel::ClosedError</code> </li> </ol> <p>With <code>before_setup</code>, setup happens inside <code>at_exit</code>, so the <code>Log</code> instance is created during shutdown, not during normal execution.</p> <h3> How <code>Log::AsyncDispatcher</code> Works </h3> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># Simplified from: crystal/src/log/dispatch.cr</span> <span class="k">class</span> <span class="nc">Log</span><span class="o">::</span><span class="no">AsyncDispatcher</span> <span class="vi">@channel</span> <span class="o">=</span> <span class="no">Channel</span><span class="p">(</span><span class="no">Entry</span><span class="p">).</span><span class="nf">new</span> <span class="k">def</span> <span class="nf">initialize</span> <span class="vi">@fiber</span> <span class="o">=</span> <span 
class="n">spawn</span> <span class="k">do</span> <span class="kp">loop</span> <span class="k">do</span> <span class="n">entry</span> <span class="o">=</span> <span class="vi">@channel</span><span class="p">.</span><span class="nf">receive</span> <span class="n">write_entry</span><span class="p">(</span><span class="n">entry</span><span class="p">)</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> <span class="k">def</span> <span class="nf">dispatch</span><span class="p">(</span><span class="n">entry</span><span class="p">)</span> <span class="vi">@channel</span><span class="p">.</span><span class="nf">send</span><span class="p">(</span><span class="n">entry</span><span class="p">)</span> <span class="c1"># ← Assumes channel is open!</span> <span class="k">end</span> <span class="k">def</span> <span class="nf">close</span> <span class="p">:</span> <span class="no">Nil</span> <span class="c1"># TODO: this might fail if being closed from different threads</span> <span class="k">unless</span> <span class="vi">@channel</span><span class="p">.</span><span class="nf">closed?</span> <span class="vi">@channel</span><span class="p">.</span><span class="nf">close</span> <span class="vi">@done</span><span class="p">.</span><span class="nf">receive</span> <span class="k">end</span> <span class="k">end</span> <span class="k">def</span> <span class="nf">finalize</span> <span class="p">:</span> <span class="no">Nil</span> <span class="n">close</span> <span class="c1"># ← Channel gets closed here during GC/shutdown</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>During shutdown, garbage collection calls <code>finalize() -&gt; close()</code>, which closes the channel. 
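</p>
<p>Sending on a closed channel is exactly what fails later. The same shape can be reproduced outside <em>Crystal</em> with Ruby's <code>Queue</code>, which behaves like <code>Channel</code> for this purpose (a minimal sketch, not the article's code; Ruby raises <code>ClosedQueueError</code> where Crystal raises <code>Channel::ClosedError</code>):</p>

```ruby
# Analog of Log::AsyncDispatcher's lifecycle, sketched with Ruby's Queue
# (illustrative only: Ruby raises ClosedQueueError where Crystal's
# Channel raises Channel::ClosedError).
channel = Queue.new

# Background "dispatcher", like the fiber spawned in #initialize.
consumer = Thread.new do
  while (entry = channel.pop) # pop returns nil once the queue is closed and drained
    # write_entry(entry) would go here
  end
end

channel.close # what finalize -> close does during GC/shutdown
consumer.join

begin
  channel << "Log.info entry" # dispatching after cleanup, like the at_exit tests
rescue ClosedQueueError => e
  puts "dispatch failed: #{e.class}"
end
# prints: dispatch failed: ClosedQueueError
```

<p>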
But <code>at_exit</code> hooks run <em>after</em> some cleanup, creating the race condition.</p> <h2> Solution: The Fix </h2> <p>The issue isn't optimization itself—it's that <code>Log::AsyncDispatcher</code> isn't suitable for code that runs before <code>at_exit</code>. The solution is to use a dispatcher without background resources for tests:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># config/settings/test.cr</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">configure</span> <span class="ss">:test</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="n">config</span><span class="p">.</span><span class="nf">log_backend</span> <span class="o">=</span> <span class="no">Log</span><span class="o">::</span><span class="no">IOBackend</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span> <span class="ss">dispatcher: </span><span class="no">Log</span><span class="o">::</span><span class="no">DispatchMode</span><span class="o">::</span><span class="no">Direct</span> <span class="c1"># No background fiber</span> <span class="p">)</span> <span class="k">if</span> <span class="no">ENV</span><span class="p">.</span><span class="nf">has_key?</span><span class="p">(</span><span class="s2">"DEBUG"</span><span class="p">)</span> <span class="n">config</span><span class="p">.</span><span class="nf">debug</span> <span class="o">=</span> <span class="kp">true</span> <span class="n">config</span><span class="p">.</span><span class="nf">log_level</span> <span class="o">=</span> <span class="no">Log</span><span class="o">::</span><span class="no">Severity</span><span class="o">::</span><span class="no">Trace</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <h3> Dispatcher Options </h3> <div class="table-wrapper-paragraph"><table> <thead> <tr> <th>Dispatcher</th> 
<th>Best For</th> <th>Pros</th> <th>Cons</th> </tr> </thead> <tbody> <tr> <td><code>AsyncDispatcher</code></td> <td>Production</td> <td>Non-blocking, efficient</td> <td>Unreliable at shutdown</td> </tr> <tr> <td><code>SyncDispatcher</code></td> <td>Threaded tests</td> <td>Thread-safe, reliable</td> <td>Slight mutex overhead</td> </tr> <tr> <td><code>DirectDispatcher</code></td> <td>Single-threaded tests</td> <td>Zero overhead, simple</td> <td>Not thread-safe</td> </tr> </tbody> </table></div> <p>For tests with module-level init, <code>Log::DirectDispatcher</code> is ideal.</p> <h2> Conclusion </h2> <p>The optimization itself was sound—module-level initialization <em>is</em> faster. The issue was incompatibility with <code>Log::AsyncDispatcher</code> in a shutdown context.</p> <p>More broadly, any code that relies on objects with finalize methods could be affected by garbage collection events and <code>at_exit</code> timing—background resources created during normal execution may be cleaned up before tests run.</p> <p>The fix is simple (one configuration change) but reveals a broader principle:</p> <p><strong>Test infrastructure must account for shutdown order and resource cleanup timing, regardless of framework.</strong> Some libraries use <code>at_exit</code> handlers for cleanup in multi-threaded applications, and when tests run after these handlers, any finalized objects (channels, connections, files, caches) become inaccessible.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuha6n3nncsynyrw4sea4.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuha6n3nncsynyrw4sea4.png" alt=" " width="800" height="387"></a></p> <h2> References 
</h2> <ol> <li id="fn1"> <p><a href="https://martenframework.com/" rel="noopener noreferrer">Marten Framework</a> ↩</p> </li> <li id="fn2"> <p><a href="https://github.com/ysbaddaden/minitest.cr" rel="noopener noreferrer">Minitest Crystal Port</a> ↩</p> </li> <li id="fn3"> <p><a href="https://crystal-lang.org/api/Log.html" rel="noopener noreferrer"><em>Crystal</em> Log Documentation</a> ↩</p> </li> <li id="fn4"> <p><a href="https://crystal-lang.org/api/1.19.1/Log/AsyncDispatcher.html" rel="noopener noreferrer">class Log::AsyncDispatcher</a> ↩</p> </li> </ol> crystal testing minitest Optimizing Crystal Build Time in Woodpecker CI: 415s to 196s with Caching Michael Nikitochkin Wed, 21 Jan 2026 06:10:11 +0000 https://dev.to/miry/optimizing-crystal-build-time-in-woodpecker-ci-415s-to-196s-with-caching-1o5k https://dev.to/miry/optimizing-crystal-build-time-in-woodpecker-ci-415s-to-196s-with-caching-1o5k <h2> The Problem </h2> <p><strong>Crystal</strong> test builds in <strong>Woodpecker CI</strong> were taking <strong>415 seconds</strong> to complete.<br> Every pipeline run would recompile dependencies from scratch, even though most changes affected application code, not third-party libraries.</p> <p><strong>Crystal</strong>'s compiler is thorough and safe, but recompiling everything on each <strong>CI</strong> run was costly - especially when running tests multiple times per day.</p> <h2> The Solution </h2> <p>Persistent <strong>Crystal</strong> cache storage was implemented using named volumes combined with a custom cache directory:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">test</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">crystallang/crystal:${CRYSTAL_VERSION}</span> <span class="na">environment</span><span class="pi">:</span> <span class="na">CRYSTAL_CACHE_DIR</span><span class="pi">:</span> 
<span class="s">/cache/crystal</span> <span class="na">volumes</span><span class="pi">:</span> <span class="c1"># Persistent cache for compiled Crystal modules</span> <span class="pi">-</span> <span class="s">crystal-cache-${CRYSTAL_VERSION}:/cache/crystal</span> <span class="na">commands</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">rake test:build</span> <span class="pi">-</span> <span class="s">rake test:run</span> </code></pre> </div> <p><strong>Note:</strong> Testing also included adding <code>tmpfs: ["/tmp:size=2g"]</code> but it provided no measurable improvement. The persistent cache is where the real optimization happens.</p> <h2> How It Works </h2> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mt1flixhw7blgg4rvks.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mt1flixhw7blgg4rvks.png" alt=" " width="800" height="533"></a></p> <h3> 1. Custom Cache Directory </h3> <p>By setting <code>CRYSTAL_CACHE_DIR: /cache/crystal</code>, the <strong>Crystal</strong> compiler stores compiled artifacts in a predictable location instead of the default temporary directory.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">environment</span><span class="pi">:</span> <span class="na">CRYSTAL_CACHE_DIR</span><span class="pi">:</span> <span class="s">/cache/crystal</span> </code></pre> </div> <p>This provides control over where <strong>Crystal</strong> stores:</p> <ul> <li>Compiled standard library modules</li> <li>Compiled dependency (shard) modules</li> <li>Precompiled object files</li> <li> <strong>LLVM</strong> intermediate representations</li> </ul> <h3> 2. 
Named Volumes for Persistence </h3> <p><strong>Woodpecker CI</strong> uses container volumes to persist data between pipeline runs. A named volume is mounted at the cache directory:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">volumes</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">crystal-cache-${CRYSTAL_VERSION}:/cache/crystal</span> </code></pre> </div> <p><strong>Key insight:</strong> Using <code>${CRYSTAL_VERSION}</code> in the volume name maintains separate caches for different <strong>Crystal</strong> versions (e.g., <code>crystal-cache-1.16.3</code> and <code>crystal-cache-nightly</code>). This prevents cache conflicts when testing against multiple <strong>Crystal</strong> versions in matrix builds.</p> <h3> 3. What About tmpfs for /tmp? </h3> <p>Initial assumptions suggested that adding <code>tmpfs</code> for <code>/tmp</code> would help, since <strong>Crystal</strong> might write temporary files there during compilation. Testing was performed:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">tmpfs</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">/tmp:size=2g</span> </code></pre> </div> <p><strong>Result:</strong> No measurable improvement (~196s with or without <code>tmpfs</code>).</p> <p><strong>Why?</strong> <strong>Crystal</strong>'s compilation model doesn't write much to temporary directories:</p> <ul> <li>Most I/O goes to <code>CRYSTAL_CACHE_DIR</code> (which is already optimized with persistent volumes)</li> <li> <strong>Crystal</strong> keeps most compilation state in memory</li> <li>The compiler creates minimal intermediate temp files</li> <li>Container overlay filesystem is "good enough" for the small amount of <code>/tmp</code> usage</li> </ul> <h3> 4. 
Container Overlay Storage Optimization </h3> <p><strong>Woodpecker</strong> agents using container overlay storage drivers (such as the overlay driver used by <strong>Podman</strong>) benefit from efficient layer caching.<br> The named volume persists between runs, and only changed files need to be written - this combines with the container runtime's copy-on-write mechanism for optimal performance.</p> <p><strong>Important note:</strong> Named volumes are local to each agent. If multiple <strong>Woodpecker</strong> agents are running, each maintains its own separate cache. This means:</p> <ul> <li>First run on a new agent will be slow (cold cache)</li> <li>Subsequent runs on the same agent will be fast (warm cache)</li> <li>Pipelines may experience variable build times depending on which agent executes them</li> </ul> <p>For shared caching across multiple agents, consider external cache storage solutions (<strong>S3</strong>, network volumes, or distributed cache systems).</p> <h2> The Impact </h2> <p><strong>Before (no optimizations):</strong> 415 seconds per test build<br><br> <strong>After (persistent cache only):</strong> 196 seconds per test build<br><br> <strong>After (cache + <code>tmpfs</code> for <code>/tmp</code>):</strong> ~196 seconds (no significant change)<br><br> <strong>Improvement:</strong> 2.1x faster (53% time reduction) ⚡</p> <h3> Why Persistent Cache Helps, But tmpfs Doesn't </h3> <p><strong>Persistent cache (<code>/cache/crystal</code>) - BIG WIN:</strong></p> <ul> <li>✅ Avoids recompiling unchanged dependencies between builds</li> <li>✅ Speeds up subsequent builds dramatically (415s → 196s)</li> <li>✅ Survives across pipeline runs</li> </ul> <p><strong><code>tmpfs</code> for <code>/tmp</code> - Minimal Impact:</strong></p> <ul> <li>⚠️ <strong>Crystal</strong> doesn't write much to <code>/tmp</code> during compilation</li> <li>⚠️ Most I/O goes to <code>CRYSTAL_CACHE_DIR</code> (already on persistent volume)</li> <li>⚠️ Modern container overlay filesystem is "good 
enough" for the small amount of temp files</li> </ul> <h3> Why This Makes Sense </h3> <p><strong>Crystal</strong>'s compiler architecture is smart:</p> <ol> <li> <strong>Compiled artifacts go to cache directory</strong> - this is where the bulk of I/O happens</li> <li> <strong>Temporary files are minimal</strong> - <strong>Crystal</strong> doesn't create many intermediate temp files</li> <li> <strong>Most work is in-memory</strong> - the compiler keeps most data structures in <strong>RAM</strong> during compilation</li> <li> <strong>Cache hits dominate</strong> - on warm cache, there's very little new compilation happening</li> </ol> <p>The persistent cache eliminated the <strong>expensive recompilation</strong> (hundreds of seconds). The remaining time is mostly:</p> <ul> <li>Linking compiled objects</li> <li>Running tests (database operations)</li> <li>Test framework overhead</li> </ul> <p>Adding <code>tmpfs</code> for <code>/tmp</code> doesn't help because there's simply not much disk I/O happening there.</p> <h3> What Gets Cached? 
</h3> <p>On the first run, <strong>Crystal</strong> compiles everything:</p> <ul> <li>Standard library (~100MB of compiled code)</li> <li>Third-party shards (dependencies)</li> <li>Application code</li> </ul> <p>On subsequent runs, <strong>Crystal</strong> reuses:</p> <ul> <li>✅ Unchanged standard library modules</li> <li>✅ Unchanged dependency code</li> <li>✅ Unchanged application code</li> <li>❌ Only recompiles what changed</li> </ul> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm6xxu5q4lgf7nzj4w80.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm6xxu5q4lgf7nzj4w80.png" alt=" " width="800" height="533"></a></p> <h2> Matrix Builds: One Cache Per Version </h2> <p><strong>CI</strong> testing runs against multiple <strong>Crystal</strong> versions:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">matrix</span><span class="pi">:</span> <span class="na">CRYSTAL_VERSION</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">1.16.3</span> <span class="pi">-</span> <span class="s">1.19.1</span> <span class="pi">-</span> <span class="s">nightly</span> </code></pre> </div> <p>Each version gets its own cache:</p> <ul> <li> <code>crystal-cache-1.16.3</code> - old version cache</li> <li> <code>crystal-cache-1.19.1</code> - stable version cache</li> <li> <code>crystal-cache-nightly</code> - nightly version cache</li> </ul> <p>This prevents cache corruption when compiler internals change between versions.</p> <h2> Implementation Details </h2> <h3> Complete Woodpecker Configuration </h3> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span 
class="na">steps</span><span class="pi">:</span> <span class="na">test</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">crystallang/crystal:${CRYSTAL_VERSION}</span> <span class="na">environment</span><span class="pi">:</span> <span class="na">DATABASE_URL</span><span class="pi">:</span> <span class="s">postgres://postgres:dbpgpassword@postgres:5432/api_test</span> <span class="na">CRYSTAL_CACHE_DIR</span><span class="pi">:</span> <span class="s">/cache/crystal</span> <span class="na">volumes</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">crystal-cache-${CRYSTAL_VERSION}:/cache/crystal</span> <span class="na">commands</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">crystal env</span> <span class="pi">-</span> <span class="s">rake test</span> </code></pre> </div> <h3> Repository Trust Requirement </h3> <p><strong>Important:</strong> Using <code>volumes:</code> and <code>tmpfs:</code> in <strong>Woodpecker CI</strong> requires the repository to be marked as <strong>"Trusted"</strong>.</p> <p><strong>Why trust is required:</strong></p> <ul> <li> <code>volumes:</code> - Allows mounting host volumes into containers, providing access to persistent storage</li> <li> <code>tmpfs:</code> - Allows mounting in-memory filesystems, requiring elevated container privileges</li> </ul> <p>Both features give containers more access to the host system. 
<strong>Woodpecker</strong> restricts these features to prevent potentially malicious pipeline configurations from compromising the CI infrastructure.</p> <p><strong>How to enable trust:</strong></p> <ol> <li>Navigate to repository settings in <strong>Woodpecker UI</strong> </li> <li>Enable the <strong>"Trusted"</strong> checkbox</li> <li>Only repository administrators can modify this setting</li> </ol> <p><strong>Security consideration:</strong> Only enable trust for repositories with controlled access and reviewed pipeline configurations. Trusted pipelines can potentially access sensitive data on the CI host system.</p> <h2> Key Takeaways </h2> <ol> <li> <strong>Named volumes survive pipeline runs</strong> - <strong>Woodpecker</strong>'s volume mounting is key for persistence</li> <li> <strong>Caches are per-agent</strong> - each <strong>Woodpecker</strong> agent maintains its own cache, not shared across agents</li> <li> <strong>Repository must be marked as "Trusted"</strong> - required for using <code>volumes:</code> and <code>tmpfs:</code> features</li> <li> <strong>2x speedup is achievable</strong> - especially for projects with many dependencies</li> <li> <strong>Focus on what matters</strong> - persistent cache is the killer feature for <strong>Crystal CI</strong> </li> </ol> <h2> Potential Issues and Solutions </h2> <h3> Problem: Cache grows too large </h3> <p><strong>Solution:</strong> Periodically clean old caches or set retention policies in <strong>Woodpecker</strong> agent configuration.</p> <h3> Problem: Cache corruption after <strong>Crystal</strong> upgrade </h3> <p><strong>Solution:</strong> The <code>${CRYSTAL_VERSION}</code> suffix naturally creates new caches for new versions.</p> <h3> Problem: Multiple agents don't share caches </h3> <p><strong>Solution:</strong> This is by design - named volumes are local to each agent. 
Each agent maintains its own cache, which means:</p> <ul> <li> <strong>First build on each agent</strong> will take the full 415 seconds (cold cache)</li> <li> <strong>Subsequent builds on the same agent</strong> will take ~196 seconds (warm cache)</li> <li> <strong>Build times vary</strong> depending on which agent picks up the job</li> </ul> <p>For consistent performance across all agents, consider:</p> <ul> <li>Use agent labels to pin jobs to specific agents</li> <li>Implement external cache storage (<strong>S3</strong>, <strong>NFS</strong>, network volumes)</li> <li>Accept variable build times as a trade-off for distributed load</li> </ul> <h3> Problem: Volumes not mounting or permission errors </h3> <p><strong>Solution:</strong> Verify the repository has <strong>"Trusted"</strong> status enabled in <strong>Woodpecker</strong> settings. Without trust, <code>volumes:</code> and <code>tmpfs:</code> directives are silently ignored or produce permission errors.</p> <h2> Comparison with Other CI Systems </h2> <div class="table-wrapper-paragraph"><table> <thead> <tr> <th> <strong>CI</strong> System</th> <th>Cache Strategy</th> </tr> </thead> <tbody> <tr> <td><strong>GitHub Actions</strong></td> <td> <code>actions/cache</code> with path <code>/home/runner/.cache/crystal</code> </td> </tr> <tr> <td><strong>GitLab CI</strong></td> <td> <code>cache:</code> directive with <code>key: ${CI_COMMIT_REF_SLUG}</code> </td> </tr> <tr> <td><strong>Woodpecker</strong></td> <td>Named volumes with version-specific keys</td> </tr> </tbody> </table></div> <p><strong>Woodpecker</strong>'s approach is simpler - no cache upload/download steps, just persistent volumes.</p> crystal ci woodpecker performance Speeding Up PostgreSQL in Containers Michael Nikitochkin Mon, 19 Jan 2026 23:45:50 +0000 https://dev.to/miry/speeding-up-postgresql-in-containers-1eeg https://dev.to/miry/speeding-up-postgresql-in-containers-1eeg <h2> The Problem </h2> <p>Running a test suite on an older 
<strong>CI</strong> machine with slow disks revealed <strong>PostgreSQL</strong> as a major bottleneck. Each test run was taking over <strong>1 hour</strong> to complete. The culprit? Tests performing numerous database operations, with <code>TRUNCATE</code> commands cleaning up data after each test.</p> <p>With slow disk I/O, <strong>PostgreSQL</strong> was spending most of its time syncing data to disk - operations that were completely unnecessary in an ephemeral <strong>CI</strong> environment where data persistence doesn't matter.</p> <h3> Catching PostgreSQL in the Act </h3> <p>Running <code>top</code> during test execution revealed the smoking gun:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>242503 postgres 20 0 184592 49420 39944 R 81.7 0.3 0:15.66 postgres: postgres api_test 10.89.5.6(43216) TRUNCATE TABLE </code></pre> </div> <p><strong>PostgreSQL</strong> was consuming <strong>81.7% CPU</strong> just to truncate a table! This single <code>TRUNCATE</code> operation ran for over <strong>15 seconds</strong>. 
On a machine with slow disks, <strong>PostgreSQL</strong> was spending enormous amounts of time on fsync operations, waiting for the kernel to confirm data was written to physical storage - even though we were just emptying tables between tests.</p> <h2> The Solution </h2> <p>Three simple <strong>PostgreSQL</strong> configuration tweaks made a dramatic difference:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">postgres</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">postgres:16.11-alpine</span> <span class="na">environment</span><span class="pi">:</span> <span class="na">POSTGRES_INITDB_ARGS</span><span class="pi">:</span> <span class="s2">"</span><span class="s">--nosync"</span> <span class="na">POSTGRES_SHARED_BUFFERS</span><span class="pi">:</span> <span class="s">256MB</span> <span class="na">tmpfs</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">/var/lib/postgresql/data:size=1g</span> </code></pre> </div> <h3> 1. <code>--nosync</code> Flag </h3> <p>The <code>--nosync</code> flag tells <strong>PostgreSQL</strong> to skip <code>fsync()</code> calls during database initialization. In a <strong>CI</strong> environment, we don't care about data durability - if the container crashes, we'll just start over. This eliminates expensive disk sync operations that were slowing down database setup.</p> <h3> 2. Increased Shared Buffers </h3> <p>Setting <code>POSTGRES_SHARED_BUFFERS: 256MB</code> (up from the default ~128MB) gives <strong>PostgreSQL</strong> more memory to cache frequently accessed data. This is especially helpful when running tests that repeatedly access the same tables.</p> <h3> 3. 
tmpfs for Data Directory (The Game Changer) </h3> <p>The biggest performance win came from mounting <strong>PostgreSQL</strong>'s data directory on <code>tmpfs</code> - an in-memory filesystem.<br> This completely eliminates disk I/O for database operations:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">tmpfs</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">/var/lib/postgresql/data:size=1g</span> </code></pre> </div> <p>With <code>tmpfs</code>, all database operations happen in <strong>RAM</strong>. This is especially impactful for:</p> <ul> <li> <strong>TRUNCATE operations</strong> - instant cleanup between tests</li> <li> <strong>Index updates</strong> - no disk seeks required</li> <li> <strong>WAL (Write-Ahead Log) writes</strong> - purely memory operations</li> <li> <strong>Checkpoint operations</strong> - no waiting for disk flushes</li> </ul> <p>The 1GB size limit is generous for most test databases. Adjust based on your test data volume.</p> <h2> The Impact </h2> <p><strong>Before:</strong> ~60 minutes per test run<br><br> <strong>After:</strong> ~10 minutes per test run<br><br> <strong>Improvement:</strong> 6x faster! 🚀</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yr2zjm6grgkxl9s0m5c.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yr2zjm6grgkxl9s0m5c.png" alt=" " width="800" height="533"></a></p> <h3> Real Test Performance Examples </h3> <p>You should have seen my surprise when I first saw a single test taking 30 seconds in containers.<br> I knew something was terribly wrong. 
But when I applied the in-memory optimization and<br> saw the numbers drop to what you'd expect on a normal machine - I literally got tears in my eyes.</p> <p><strong>Before <code>tmpfs</code> optimization:</strong><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>API::FilamentSupplierAssortmentsTest#test_create_validation_negative_price = 25.536s API::FilamentSupplierAssortmentsTest#test_list_with_a_single_assortment = 29.996s API::FilamentSupplierAssortmentsTest#test_list_missing_token = 25.952s </code></pre> </div> <p>Each test was taking <strong>25-30 seconds</strong> even though the actual test logic was minimal!<br> Most of this time was spent waiting for <strong>PostgreSQL</strong> to sync data to disk.</p> <p><strong>After <code>tmpfs</code> optimization:</strong><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>API::FilamentSupplierAssortmentsTest#test_list_as_uber_curator = 0.474s API::FilamentSupplierAssortmentsTest#test_list_as_assistant = 0.466s API::FilamentSupplierAssortmentsTest#test_for_pressman_without_filament_supplier = 0.420s </code></pre> </div> <p>These same tests now complete in <strong>0.4-0.5 seconds</strong> - a <strong>50-60x improvement per test</strong>! 
🎉</p> <h3> Where the Time Was Going </h3> <p>The biggest gains came from reducing disk I/O during:</p> <ul> <li> <strong>TRUNCATE operations between tests</strong> - <strong>PostgreSQL</strong> was syncing empty table states to disk</li> <li> <strong>Database initialization</strong> at the start of each <strong>CI</strong> run</li> <li> <strong>INSERT operations during test setup</strong> - creating test fixtures (users, roles, ...)</li> <li> <strong>Transaction commits</strong> - each test runs in a transaction that gets rolled back</li> <li> <strong>Frequent small writes</strong> during test execution</li> </ul> <p>With slow disks, even simple operations like creating a test user or truncating a table would take seconds instead of milliseconds. The <code>top</code> output above shows a single <code>TRUNCATE TABLE</code> operation taking 15+ seconds and consuming 81.7% <strong>CPU</strong> - most of that was <strong>PostgreSQL</strong> waiting for disk I/O. Multiply that across hundreds of tests, and you get hour-long <strong>CI</strong> runs.</p> <h3> The Math </h3> <ul> <li> <strong>24 tests</strong> in this file alone</li> <li> <strong>Before:</strong> ~27 seconds average per test = <strong>~648 seconds (10.8 minutes)</strong> for one test file</li> <li> <strong>After:</strong> ~0.45 seconds average per test = <strong>~11 seconds</strong> for the same file</li> <li> <strong>Per-file speedup:</strong> 59x faster!</li> </ul> <p>With dozens of test files, the cumulative time savings are massive.</p> <h2> Why This Works for CI </h2> <p>In production, you absolutely want <code>fsync()</code> enabled and conservative settings to ensure data durability. 
But in <strong>CI</strong>:</p> <ul> <li> <strong>Data is ephemeral</strong> - containers are destroyed after each run</li> <li> <strong>Speed matters more than durability</strong> - faster feedback loops improve developer productivity</li> <li> <strong>Disk I/O is often the bottleneck</strong> - especially on older/slower <strong>CI</strong> machines</li> </ul> <p>By telling <strong>PostgreSQL</strong> "don't worry about crashes, we don't need this data forever," we eliminated unnecessary overhead.</p> <h2> Key Takeaways </h2> <ol> <li> <strong>Profile your CI pipeline</strong> - we discovered disk I/O was the bottleneck, not <strong>CPU</strong> or memory</li> <li> <strong>CI databases don't need production settings</strong> - optimize for speed, not durability</li> <li> <code>tmpfs</code> <strong>is the ultimate disk I/O eliminator</strong> - everything in <strong>RAM</strong> means zero disk bottleneck</li> <li> <strong>Small configuration changes can have big impacts</strong> - three settings saved us 50 minutes per run</li> <li> <strong>Consider your hardware</strong> - these optimizations were especially important on older machines with slow disks</li> <li> <strong>Watch your memory usage</strong> - <code>tmpfs</code> consumes <strong>RAM</strong>; ensure your <strong>CI</strong> runners have enough (1GB+ for the database)</li> </ol> <h2> Implementation in Woodpecker CI </h2> <p>Here's our complete <strong>PostgreSQL</strong> service configuration:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">postgres</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">postgres:16.11-alpine</span> <span class="na">environment</span><span class="pi">:</span> <span class="na">POSTGRES_USER</span><span class="pi">:</span> <span class="s">postgres</span> <span class="na">POSTGRES_PASSWORD</span><span 
class="pi">:</span> <span class="s">dbpgpassword</span> <span class="na">POSTGRES_DB</span><span class="pi">:</span> <span class="s">api_test</span> <span class="na">POSTGRES_INITDB_ARGS</span><span class="pi">:</span> <span class="s2">"</span><span class="s">--nosync"</span> <span class="na">POSTGRES_SHARED_BUFFERS</span><span class="pi">:</span> <span class="s">256MB</span> <span class="na">ports</span><span class="pi">:</span> <span class="pi">-</span> <span class="m">5432</span> <span class="na">tmpfs</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">/var/lib/postgresql/data:size=1g</span> </code></pre> </div> <p><strong>Note:</strong> The <code>tmpfs</code> field is officially supported in Woodpecker CI's backend (defined in <a href="proxy.php?url=https://github.com/woodpecker-ci/woodpecker/blob/d1b7e35ca857f183e76171f2ab72841fbed3daf9/pipeline/backend/types/step.go#L35" rel="noopener noreferrer"><code>pipeline/backend/types/step.go</code></a>). If you see schema validation warnings, they may be from outdated documentation - the feature works perfectly.</p> <p><strong>Lucky us!</strong> Not all CI platforms support <code>tmpfs</code> configuration this easily. Woodpecker CI makes it trivial with native Docker support - just add a <code>tmpfs:</code> field and you're done. If you're on GitHub Actions, GitLab CI, or other platforms, you might need workarounds like <code>docker run</code> with <code>--tmpfs</code> flags or custom runner configurations.</p> <p>Simple, effective, and no code changes required - just smarter configuration for the CI environment.</p> <h2> Why Not Just Tune PostgreSQL Settings Instead of tmpfs? </h2> <p><strong>TL;DR: I tried. <code>tmpfs</code> is still faster AND simpler.</strong></p> <p>After seeing the dramatic improvements with <code>tmpfs</code>, I wondered: "Could we achieve similar performance by aggressively tuning <strong>PostgreSQL</strong> settings instead?" 
This would be useful for environments where <code>tmpfs</code> isn't available or <strong>RAM</strong> is limited.</p> <h3> Tested Aggressive Disk-Based Tuning </h3> <p>Experimenting with disabling all durability features:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">postgres</span><span class="pi">:</span> <span class="na">command</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">postgres</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">fsync=off</span> <span class="c1"># Skip forced disk syncs</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">synchronous_commit=off</span> <span class="c1"># Async WAL writes</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">wal_level=minimal</span> <span class="c1"># Minimal WAL overhead</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">full_page_writes=off</span> <span class="c1"># Less WAL volume</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">autovacuum=off</span> <span class="c1"># No background vacuum</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">max_wal_size=1GB</span> <span class="c1"># Fewer checkpoints</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">shared_buffers=256MB</span> <span class="c1"># More memory cache</span> </code></pre> </div> <h3> The Results: tmpfs Still Wins </h3> <p>Even with all these aggressive settings, <code>tmpfs</code> was <strong>still faster</strong>.</p> <p><strong>Disk-based (even with <code>fsync=off</code>):</strong></p> <ul> <li>❌ File system overhead - ext4/xfs metadata operations</li> <li>❌ 
Disk seeks - mechanical latency on <strong>HDDs</strong>, limited <strong>IOPS</strong> on <strong>SSDs</strong> </li> <li>❌ Kernel buffer cache - memory copies between user/kernel space</li> <li>❌ <strong>Docker</strong> overlay2 - additional storage driver overhead</li> <li>❌ Complexity - 7+ settings to manage and understand</li> </ul> <p><strong><code>tmpfs</code>-based:</strong></p> <ul> <li>✅ Pure <strong>RAM</strong> operations - no physical storage involved</li> <li>✅ Zero disk I/O - everything happens in memory</li> <li>✅ Simple configuration - just one <code>tmpfs</code> line</li> <li>✅ Maximum performance - nothing faster than <strong>RAM</strong> </li> </ul> <h2> Bonus: Other PostgreSQL CI Optimizations to Consider </h2> <p>If you're still looking for more speed improvements:</p> <ul> <li> <strong>Disable query logging</strong> - reduces I/O overhead: </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code> <span class="na">command</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">postgres</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">log_statement=none</span> <span class="c1"># Don't log any statements</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">log_min_duration_statement=-1</span> <span class="c1"># Don't log slow queries</span> </code></pre> </div> <ul> <li> <strong>Use <code>fsync=off</code> in postgresql.conf</strong> - similar to <code>--nosync</code> but for runtime (redundant with <code>tmpfs</code>)</li> <li> <strong>Increase <code>work_mem</code></strong> - helps with complex queries in tests</li> </ul> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1tp8ey4rsyhcgtmj007.png" class="article-body-image-wrapper"><img 
src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1tp8ey4rsyhcgtmj007.png" alt=" " width="800" height="387"></a></p> postgres woodpecker ci performance Debugging a Double Free in Crystal with libxml2, GDB, and Valgrind Michael Nikitochkin Sun, 07 Dec 2025 19:01:29 +0000 https://dev.to/miry/debugging-a-double-free-in-crystal-with-libxml2-gdb-and-valgrind-17h7 https://dev.to/miry/debugging-a-double-free-in-crystal-with-libxml2-gdb-and-valgrind-17h7 <p>This is a personal note about how I tracked down and fixed a double-free bug caused by Crystal’s garbage collector interacting with <code>libxml2</code>. I used <code>gdb</code> and <code>valgrind</code> to trace the issue, understand where memory was allocated and freed, and eventually identify the root cause. I am not an advanced user of these tools, so this write-up serves as a reminder of the steps I took and what I learned along the way.</p> <h2> The Problem </h2> <p>For a few days, some of my tests had been crashing intermittently against the nightly builds of Crystal:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">$ </span>bin/drar_test <span class="nt">--seed</span> 6690 <span class="nt">--verbose</span> <span class="nt">--parallel</span> 1 free<span class="o">()</span>: double free detected <span class="k">in </span>tcache 2 </code></pre> </div> <ul> <li>The crashes were <strong>non-deterministic</strong>: they didn’t always occur locally, and sometimes didn’t even appear in CI.</li> <li>The error message itself wasn’t very helpful at first, and I wasn’t sure where the issue was coming from.</li> </ul> <h2> Step 1: Reproducing the Issue </h2> <p>I started by reproducing the failure locally:</p> <ul> <li>I used the same app configuration as in CI.</li> <li>I tested different <code>--seed</code> arguments until I 
found a seed that reliably triggered the crash.</li> </ul> <h2> Step 2: Initial Investigation with GDB </h2> <p>Since the error originated from <code>free()</code>, I wanted to see what was happening at the crash:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">$ </span>crystal <span class="nt">--version</span> Crystal 1.17.0 <span class="o">[</span>d2c705b53] <span class="o">(</span>2025-07-16<span class="o">)</span> <span class="nv">$ </span>crystal build <span class="nt">--stats</span> <span class="nt">--threads</span> 1 <span class="nt">--time</span> <span class="nt">-o</span> bin/drar_test ./test/ext/std/openssl/cipher_test.cr ... <span class="nv">$ </span>gdb <span class="nt">--args</span> bin/drar_test <span class="nt">--seed</span> 6690 <span class="nt">--verbose</span> <span class="nt">--parallel</span> 1 <span class="o">&gt;</span> run Run options: <span class="nt">--seed</span> 6690 <span class="nt">--verbose</span> <span class="nt">--parallel</span> 1... free<span class="o">()</span>: double free detected <span class="k">in </span>tcache 2 Program received signal SIGABRT, Aborted. __pthread_kill_implementation <span class="o">(</span><span class="nv">threadid</span><span class="o">=</span>&lt;optimized out&gt;, <span class="nv">signo</span><span class="o">=</span>signo@entry<span class="o">=</span>6, <span class="nv">no_tid</span><span class="o">=</span>no_tid@entry<span class="o">=</span>0<span class="o">)</span> at pthread_kill.c:44 44 <span class="k">return </span>INTERNAL_SYSCALL_ERROR_P <span class="o">(</span>ret<span class="o">)</span> ? INTERNAL_SYSCALL_ERRNO <span class="o">(</span>ret<span class="o">)</span> : 0<span class="p">;</span> </code></pre> </div> <p>At the crash, I used <code>bt</code> to print the backtrace:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>&gt; bt ... 
#6 0x00007ffff7258ad8 in tcache_double_free_verify (e=&lt;optimized out&gt;) at malloc.c:3350 #7 0x00007ffff7e7413b in xmlFreeNodeList (cur=0x1da4db00) at /usr/src/debug/libxml2-2.12.10-5.fc43.x86_64/tree.c:3662 #8 0x00007ffff7e73e68 in xmlFreeDoc (cur=0x1da4b710) at /usr/src/debug/libxml2-2.12.10-5.fc43.x86_64/tree.c:1212 #9 0x0000000003e480a0 in finalize () at /home/miry/src/crystal/crystal/src/xml/document.cr:67 #10 0x0000000000482b86 in -&gt; () at /home/miry/src/crystal/crystal/src/gc/boehm.cr:340 #11 0x00007ffff73cc517 in GC_invoke_finalizers () at extra/../finalize.c:1255 #12 0x00007ffff73cc801 in GC_notify_or_invoke_finalizers () at extra/../finalize.c:1342 #13 GC_notify_or_invoke_finalizers () at extra/../finalize.c:1282 #14 0x00007ffff73d8e77 in GC_generic_malloc_many (lb=&lt;optimized out&gt;, k=1, result=0x7ffff750b130 &lt;first_thread+496&gt;) at extra/../mallocx.c:336 #15 0x00007ffff73e67b5 in GC_malloc_kind (bytes=&lt;optimized out&gt;, kind=&lt;optimized out&gt;) at extra/../thread_local_alloc.c:187 </code></pre> </div> <ul> <li> <strong>The backtrace</strong> revealed that Crystal’s GC was finalizing an <code>XML::Document</code> and calling <code>xmlFreeDoc</code>.</li> <li>This was my first clue that the crash involved <strong>XML nodes being freed twice</strong>.</li> </ul> <h2> Step 3: Using Valgrind </h2> <p>I then ran the same binary under Valgrind:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">$ </span>valgrind <span class="nt">--track-origins</span><span class="o">=</span><span class="nb">yes</span> <span class="nt">--leak-check</span><span class="o">=</span>full bin/drar_test 2&gt; valgrind.logs </code></pre> </div> <p>In the logs, I found:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code> ==232953== Invalid free() / delete / delete[] / realloc() ==232953== at 0x1E2C6E43: free (vg_replace_malloc.c:990) ==232953== by 0x1E33B39B: xmlFreePropList 
(tree.c:2052) ==232953== by 0x1E33B39B: xmlFreePropList (tree.c:2047) ==232953== by 0x1E33B0A7: xmlFreeNodeList (tree.c:3638) ==232953== by 0x1E33AE67: xmlFreeDoc (tree.c:1212) ==232953== by 0x3EAB88F: *XML::Document#finalize:Nil (document.cr:67) ... ==232953== Address 0x1f4b6780 is 0 bytes inside a block of size 96 free'd ==232953== at 0x1E2C6E43: free (vg_replace_malloc.c:990) ==232953== by 0x1E33B39B: xmlFreePropList (tree.c:2052) ==232953== by 0x1E33B39B: xmlFreePropList (tree.c:2047) ==232953== by 0x1E33B0A7: xmlFreeNodeList (tree.c:3638) ==232953== by 0x1E33AE67: xmlFreeDoc (tree.c:1212) ==232953== by 0x3EAB88F: *XML::Document#finalize:Nil (document.cr:67) ... ==232953== Block was alloc'd at ==232953== at 0x1E2C3B26: malloc (vg_replace_malloc.c:447) ==232953== by 0x1E3374C5: xmlSAX2AttributeNs (SAX2.c:1880) ==232953== by 0x1E3393E8: xmlSAX2StartElementNs (SAX2.c:2299) ==232953== by 0x1E3289F1: xmlParseStartTag2.constprop.0 (parser.c:10091) ==232953== by 0x1E328EBB: xmlParseElementStart (parser.c:10473) ==232953== by 0x1E32AF84: xmlParseElement (parser.c:10406) ==232953== by 0x1E32B267: xmlParseDocument (parser.c:11190) ==232953== by 0x1E332F38: xmlDoRead (parser.c:14835) ==232953== by 0x3EAAE19: *XML::parse&lt;String&gt;:XML::Document (xml.cr:61) ==232953== by 0xFF0FF3: *ActionText::RichText#to_html:String (rich_text.cr:41) ==232953== by 0x41F7A90: *ActionText::RichTextTest#test_render_html_with_image_and_tags:Bool (rich_text_test.cr:70) ==232953== by 0x4B1003: ~proc223Proc(Minitest::Test, Nil)@lib/minitest/src/runnable.cr:17 (runnable.cr:17) </code></pre> </div> <ul> <li> <strong>Valgrind</strong> confirmed the double free: the same memory address was freed twice by the Garbage Collector.</li> <li>It also showed the allocation site, pointing back to XML parsing, which confirmed the findings from <strong>GDB</strong>.</li> <li>The extra information helped identify <strong>where the object was allocated</strong>.</li> </ul> <p><a 
href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnebg3becy74hewrvli4.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnebg3becy74hewrvli4.png" alt=" " width="800" height="800"></a></p> <h2> Step 4: Root Cause </h2> <p>The root cause of the problem was indeed my own hack with extra bindings.<br> In my approach I had extended Crystal's XML with bindings from libxml2:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="nd">@[Link("xml2")]</span> <span class="k">lib</span> <span class="no">LibXML</span> <span class="k">fun</span> <span class="n">xmlAddChild</span><span class="p">(</span><span class="n">parent</span> <span class="p">:</span> <span class="no">Node</span><span class="o">*</span><span class="p">,</span> <span class="n">child</span> <span class="p">:</span> <span class="no">Node</span><span class="o">*</span><span class="p">)</span> <span class="k">end</span> <span class="k">class</span> <span class="nc">XML</span><span class="o">::</span><span class="no">Node</span> <span class="k">def</span> <span class="nf">add_child</span><span class="p">(</span><span class="n">child</span> <span class="p">:</span> <span class="no">Node</span><span class="p">)</span> <span class="no">LibXML</span><span class="p">.</span><span class="nf">xmlAddChild</span><span class="p">(</span><span class="nb">self</span><span class="p">,</span> <span class="n">child</span><span class="p">)</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>The actual issue appeared in the way I was using it:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight 
crystal"><code><span class="n">html</span> <span class="o">=</span> <span class="no">XML</span><span class="p">.</span><span class="nf">parse_html</span> <span class="s2">"&lt;article&gt;some text &lt;/article&gt;"</span><span class="p">,</span> <span class="ss">options: </span><span class="no">XML</span><span class="o">::</span><span class="no">HTMLParserOptions</span><span class="o">::</span><span class="no">RECOVER</span> <span class="o">|</span> <span class="no">XML</span><span class="o">::</span><span class="no">HTMLParserOptions</span><span class="o">::</span><span class="no">NOIMPLIED</span> <span class="o">|</span> <span class="no">XML</span><span class="o">::</span><span class="no">HTMLParserOptions</span><span class="o">::</span><span class="no">NODEFDTD</span> <span class="n">html</span><span class="p">.</span><span class="nf">xpath_nodes</span><span class="p">(</span><span class="s2">"//action-text-attachment"</span><span class="p">).</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">parent</span><span class="o">|</span> <span class="o">...</span> <span class="n">image</span> <span class="o">=</span> <span class="no">XML</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="s2">"&lt;img src='</span><span class="si">#{</span><span class="n">blob</span><span class="p">.</span><span class="nf">redirect_url</span><span class="si">}</span><span class="s2">'&gt;"</span><span class="p">).</span><span class="nf">xpath_node</span><span class="p">(</span><span class="s2">"//img"</span><span class="p">).</span><span class="nf">not_nil!</span> <span class="n">parent</span><span class="p">.</span><span class="nf">add_child</span><span class="p">(</span><span class="n">image</span><span class="p">)</span> <span class="k">end</span> </code></pre> </div> <p>The problem:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span 
class="n">image</span> <span class="o">=</span> <span class="no">XML</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="s2">"&lt;img src='</span><span class="si">#{</span><span class="n">blob</span><span class="p">.</span><span class="nf">redirect_url</span><span class="si">}</span><span class="s2">'&gt;"</span><span class="p">).</span><span class="nf">xpath_node</span><span class="p">(</span><span class="s2">"//img"</span><span class="p">).</span><span class="nf">not_nil!</span> </code></pre> </div> <p>The double-free happens because <strong>libxml2 Nodes belong to a single Document</strong>:</p> <ol> <li> <code>XML.parse</code> creates a temporary <code>XML::Document</code>.</li> <li> <a href="proxy.php?url=https://crystal-lang.org/api/master/XML/Node.html#xpath_node%28path%2Cnamespaces%3Dnil%2Cvariables%3Dnil%29-instance-method" rel="noopener noreferrer"><code>xpath_node</code></a> returns a child node (<code>image</code>) that still belongs to this temporary document.</li> <li> <code>parent.add_child(image)</code> inserts the node into another document (<code>parent</code>) without detaching or copying it.</li> <li>When the temporary document is finalized by Crystal’s GC, it frees all its nodes, including <code>image</code>.</li> <li>The <code>parent</code> document still references the same <code>image</code> node. Later, when the parent document is finalized, it tries to free <code>image</code> again → <strong>double free</strong>.</li> </ol> <p>Valgrind and GDB confirmed this pattern: the same address (<code>0x1f4b6780</code>) was freed twice — first by the temporary document finalizer, second by the <code>parent</code> document finalizer.</p> <h2> Step 5: Fixing the Problem </h2> <p>The solution is to <strong>insert a copy of the node</strong> into the document, rather than the original. 
The copy is fully independent and can safely be added to another document.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="nd">@[Link("xml2")]</span> <span class="k">lib</span> <span class="no">LibXML</span> <span class="k">fun</span> <span class="n">xmlAddChild</span><span class="p">(</span><span class="n">parent</span> <span class="p">:</span> <span class="no">Node</span><span class="o">*</span><span class="p">,</span> <span class="n">child</span> <span class="p">:</span> <span class="no">Node</span><span class="o">*</span><span class="p">)</span> <span class="k">fun</span> <span class="n">xmlCopyNode</span><span class="p">(</span><span class="n">old</span> <span class="p">:</span> <span class="no">Node</span><span class="o">*</span><span class="p">,</span> <span class="n">extended</span> <span class="p">:</span> <span class="no">Int</span><span class="p">)</span> <span class="p">:</span> <span class="no">Node</span><span class="o">*</span> <span class="k">end</span> <span class="k">class</span> <span class="nc">XML</span><span class="o">::</span><span class="no">Node</span> <span class="k">def</span> <span class="nf">add_child</span><span class="p">(</span><span class="n">child</span> <span class="p">:</span> <span class="no">Node</span><span class="p">)</span> <span class="n">copied_node</span> <span class="o">=</span> <span class="no">LibXML</span><span class="p">.</span><span class="nf">xmlCopyNode</span><span class="p">(</span><span class="n">child</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="no">LibXML</span><span class="p">.</span><span class="nf">xmlAddChild</span><span class="p">(</span><span class="nb">self</span><span class="p">,</span> <span class="n">copied_node</span><span class="p">)</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <ul> <li>Now, the temporary document used to create the node can be <strong>freed 
safely</strong> without causing crashes.</li> <li>Any attributes, children, or other memory owned by the original document are copied correctly.</li> </ul> <h2> Step 6: Lessons Learned </h2> <ol> <li>Nodes belong to a single document in <code>libxml2</code>; sharing them across documents without copying or detaching is unsafe.</li> <li> <strong>Valgrind</strong> and <strong>GDB</strong> are invaluable debugging tools: <ul> <li> <strong>Valgrind</strong> detects invalid frees and memory issues.</li> <li> <strong>GDB</strong> lets you inspect the backtrace at the crash.</li> </ul> </li> <li>Valgrind can be misleading at first because it does not trigger a crash; instead, you need to read the logs carefully to identify double-free memory addresses. Once found, it shows the allocation and free sites, which greatly helps in investigating the root cause.</li> </ol> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfirxettk2avee1vad46.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfirxettk2avee1vad46.png" alt="That's all folks" width="800" height="387"></a></p> crystal crystallang programming Instrumenting a Marten App with OpenTelemetry Michael Nikitochkin Tue, 13 May 2025 18:39:59 +0000 https://dev.to/miry/instrumenting-a-marten-app-with-opentelemetry-4f21 https://dev.to/miry/instrumenting-a-marten-app-with-opentelemetry-4f21 <p>This article demonstrates how to instrument a <strong>Crystal</strong> application using <strong>OpenTelemetry</strong> with the <strong>Marten web framework</strong><sup id="fnref1">1</sup>.<br> It begins with a basic setup, covers visualizing traces in <strong>Jaeger</strong>, and introduces HTTP request tracing 
using middleware.<br> Finally, it connects two services and propagates trace context between them to build a complete distributed trace.</p> <h2> 1. Project Setup </h2> <p>Begin by creating a new Marten application named <em>DrukArmy</em> (inspired by <a href="proxy.php?url=https://drukarmy.org.ua/en" rel="noopener noreferrer">https://drukarmy.org.ua/en</a>):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>marten new project drukarmy <span class="nb">cd </span>drukarmy </code></pre> </div> <p>Add the <code>opentelemetry-sdk</code><sup id="fnref2">2</sup> shard to the <code>shard.yml</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="c1"># shard.yml</span> <span class="na">dependencies</span><span class="pi">:</span> <span class="s">...</span> <span class="s">opentelemetry-sdk</span><span class="err">:</span> <span class="na">github</span><span class="pi">:</span> <span class="s">wyhaines/opentelemetry-sdk.cr</span> </code></pre> </div> <p>Next, create an initializer in <code>config/initializers/opentelemetry.cr</code> to configure the <strong>OpenTelemetry SDK</strong>.<br> To verify that the setup is working, emit a test span:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># config/initializers/opentelemetry.cr</span> <span class="nb">require</span> <span class="s2">"opentelemetry-sdk"</span> <span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">configure</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="n">config</span><span class="p">.</span><span class="nf">service_name</span> <span class="o">=</span> <span class="s2">"drukarmy"</span> <span class="n">config</span><span class="p">.</span><span class="nf">exporter</span> <span class="o">=</span> <span class="no">OpenTelemetry</span><span class="o">::</span><span
class="no">Exporter</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="ss">variant: :stdout</span><span class="p">)</span> <span class="k">end</span> <span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">tracer</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"startup"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">root_span</span><span class="o">|</span> <span class="n">root_span</span><span class="p">.</span><span class="nf">consumer!</span> <span class="k">end</span> </code></pre> </div> <p>Run the server to confirm that spans are being emitted:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>marten serve </code></pre> </div> <p>You should see a span named <code>startup</code> printed to the terminal.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight json"><code><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"trace"</span><span class="p">,</span><span class="w"> </span><span class="nl">"traceId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2bc89670000edfb4dab7470af935d3e9"</span><span class="p">,</span><span class="w"> </span><span class="nl">"resource"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"service.name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"drukarmy"</span><span class="p">,</span><span class="w"> </span><span class="err">...</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"spans"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span 
class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"span"</span><span class="p">,</span><span class="w"> </span><span class="nl">"traceId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2bc89670000edfb4dab7470af935d3e9"</span><span class="p">,</span><span class="w"> </span><span class="nl">"spanId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0edfb4dab7000001"</span><span class="p">,</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"startup"</span><span class="p">,</span><span class="w"> </span><span class="err">...</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre> </div> <h2> 2. Viewing Traces </h2> <p>To make this more useful, let’s view spans in <strong>Jaeger</strong><sup id="fnref3">3</sup>, a lightweight UI for working with trace data.<br> Run <strong>Jaeger</strong> in a container:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker run <span class="nt">--rm</span> <span class="nt">-p</span> 16686:16686 <span class="nt">-p</span> 4318:4318 quay.io/jaegertracing/jaeger:2.6.0 </code></pre> </div> <p>Next, update the OpenTelemetry configuration to use the <code>http</code> exporter instead of <code>stdout</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># config/initializers/opentelemetry.cr</span> <span class="nb">require</span> <span class="s2">"opentelemetry-sdk"</span> <span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">configure</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="n">config</span><span class="p">.</span><span class="nf">service_name</span> 
<span class="o">=</span> <span class="s2">"drukarmy"</span> <span class="n">config</span><span class="p">.</span><span class="nf">exporter</span> <span class="o">=</span> <span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">Exporter</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="ss">variant: :http</span><span class="p">)</span> <span class="c1"># changed from :stdout</span> <span class="k">end</span> <span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">tracer</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"startup"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">root_span</span><span class="o">|</span> <span class="n">root_span</span><span class="p">.</span><span class="nf">consumer!</span> <span class="k">end</span> </code></pre> </div> <p>This configuration sends spans to the default HTTP endpoint: <a href="proxy.php?url=http://localhost:4318/v1/traces" rel="noopener noreferrer">http://localhost:4318/v1/traces</a>.</p> <p>Visit the <a href="proxy.php?url=http://localhost:16686" rel="noopener noreferrer">Jaeger UI</a> to explore the emitted traces.</p> <blockquote> <p><strong>Tip:</strong> Set the <code>DEBUG=1</code> environment variable to enable more verbose logging from the OpenTelemetry library.</p> </blockquote> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsr8kot3rk6pc0r9onlc.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsr8kot3rk6pc0r9onlc.png" alt=" " width="800" height="415"></a></p> <h2> 3. 
Instrumenting HTTP Requests </h2> <h3> 3.1 Add a Middleware </h3> <p>To trace incoming HTTP requests, you can use a <strong>Marten middleware</strong><sup id="fnref4">4</sup>,<br> which allows you to insert tracing logic around request handling.</p> <p>For reference, other frameworks also offer OpenTelemetry instrumentation libraries,<br> which may serve as inspiration<sup id="fnref5">5</sup>.</p> <p>Create a middleware for tracing. For simplicity, place it alongside other handlers:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># src/handlers/opentelemetry_middleware.cr</span> <span class="k">class</span> <span class="nc">OpenTelemetryMiddleware</span> <span class="o">&lt;</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Middleware</span> <span class="k">def</span> <span class="nf">call</span><span class="p">(</span><span class="n">request</span> <span class="p">:</span> <span class="no">Marten</span><span class="o">::</span><span class="no">HTTP</span><span class="o">::</span><span class="no">Request</span><span class="p">,</span> <span class="n">get_response</span> <span class="p">:</span> <span class="no">Proc</span><span class="p">(</span><span class="no">Marten</span><span class="o">::</span><span class="no">HTTP</span><span class="o">::</span><span class="no">Response</span><span class="p">))</span> <span class="p">:</span> <span class="no">Marten</span><span class="o">::</span><span class="no">HTTP</span><span class="o">::</span><span class="no">Response</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">tracer</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"process_request"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">span</span><span class="o">|</span> <span class="n">span</span><span 
class="p">.</span><span class="nf">server!</span> <span class="c1"># Add standard HTTP attributes</span> <span class="n">span</span><span class="p">[</span><span class="s2">"http.request.method"</span><span class="p">]</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">method</span> <span class="n">span</span><span class="p">[</span><span class="s2">"url.path"</span><span class="p">]</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">path</span> <span class="n">response</span> <span class="o">=</span> <span class="n">get_response</span><span class="p">.</span><span class="nf">call</span> <span class="n">span</span><span class="p">[</span><span class="s2">"http.response.status_code"</span><span class="p">]</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="nf">status</span> <span class="n">response</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>Marten's <code>request</code> and <code>response</code> APIs are described in the official documentation<sup id="fnref6">6</sup>.<br> The attributes above follow OpenTelemetry's semantic conventions for HTTP spans<sup id="fnref7">7</sup>.</p> <p>Register the middleware in <code>config/settings/base.cr</code>, placing it at the top of the middleware stack:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="no">Marten</span><span class="p">.</span><span class="nf">configure</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="o">...</span> <span class="n">config</span><span class="p">.</span><span class="nf">middleware</span> <span class="o">=</span> <span class="p">[</span> <span class="no">OpenTelemetryMiddleware</span><span class="p">,</span> <span class="o">...</span> <span 
class="p">]</span> <span class="o">...</span> <span class="k">end</span> </code></pre> </div> <p>After adding the middleware, check the <strong>Jaeger UI</strong> to confirm that HTTP request traces are being captured.</p> <h3> 3.2 Create a Sample Handler </h3> <p>Next, define a basic handler to test HTTP request tracing:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># src/handlers/home_handler.cr</span> <span class="k">class</span> <span class="nc">HomeHandler</span> <span class="o">&lt;</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Handler</span> <span class="k">def</span> <span class="nf">get</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">tracer</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"render_the_page"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">span</span><span class="o">|</span> <span class="n">span</span><span class="p">.</span><span class="nf">set_attribute</span><span class="p">(</span><span class="s2">"custom_logic"</span><span class="p">,</span> <span class="s2">"true"</span><span class="p">)</span> <span class="n">respond</span> <span class="sx">%[{"message": "Hello!"}]</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>Update the route configuration to map the root path to the handler:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># config/routes.cr</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">routes</span><span class="p">.</span><span class="nf">draw</span> <span class="k">do</span> <span class="n">path</span> <span class="s2">"/"</span><span class="p">,</span> <span class="no">HomeHandler</span><span class="p">,</span> <span 
class="ss">name: </span><span class="s2">"home"</span> <span class="o">...</span> </code></pre> </div> <p>Now, when you visit <a href="proxy.php?url=http://localhost:8000/" rel="noopener noreferrer">http://localhost:8000/</a> (e.g. <code>curl localhost:8000</code>), a span named <code>render_the_page</code> will appear in <strong>Jaeger</strong>.<br> At this point, you should be able to explore how a single application can generate traces and visualize them using <strong>Jaeger</strong>.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8n1vawpyl0xilbtwaxq.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8n1vawpyl0xilbtwaxq.png" alt="Trace with a span per request" width="800" height="723"></a></p> <h2> 4.
Distributed Tracing </h2> <p>One of the key benefits of <strong>OpenTelemetry</strong> is the ability to correlate telemetry data across multiple services involved in handling the same request.</p> <p>To demonstrate this, we’ll set up a second application and observe how traces from both services can be linked together.</p> <p>The following diagram outlines the interaction between two applications and the expected trace behavior:</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjq6yih1vct3xx8k41yd.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjq6yih1vct3xx8k41yd.png" alt="sequence diagram" width="800" height="266"></a></p> <p>Spans across services are connected using a shared <em>TraceID</em>.<br> Additionally, the parent <em>SpanID</em> helps define the relationship and order of spans within the trace:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="go">request (SERVER, trace=t1, span=s1, service=drukarmy) | -- GET /backend - 200 (CLIENT, trace=t1, span=s2, parent=s1, service=drukarmy) | --- server (SERVER, trace=t1, span=s3, parent=s2, service=backend) </span></code></pre> </div> <p>In the next steps, we’ll build and connect two services and propagate the tracing context between them.</p> <h3> 4.1. Set Up a Second App </h3> <p>To simulate a multi-service environment, duplicate the existing application to serve as a second service:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nb">cd</span> .. 
<span class="nb">cp</span> <span class="nt">-a</span> drukarmy backend <span class="nb">cd </span>backend </code></pre> </div> <p>Update the service name and port to allow both applications to run concurrently:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># &lt;backend&gt;/config/initializers/opentelemetry.cr</span> <span class="nb">require</span> <span class="s2">"opentelemetry-sdk"</span> <span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">configure</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="n">config</span><span class="p">.</span><span class="nf">service_name</span> <span class="o">=</span> <span class="s2">"backend"</span> <span class="c1"># changed from "drukarmy"</span> <span class="n">config</span><span class="p">.</span><span class="nf">exporter</span> <span class="o">=</span> <span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">Exporter</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="ss">variant: :http</span><span class="p">)</span> <span class="k">end</span> <span class="o">...</span> </code></pre> </div> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># &lt;backend&gt;/config/settings/development.cr</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">configure</span> <span class="ss">:development</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="n">config</span><span class="p">.</span><span class="nf">debug</span> <span class="o">=</span> <span class="kp">true</span> <span class="n">config</span><span class="p">.</span><span class="nf">host</span> <span class="o">=</span> <span class="s2">"127.0.0.1"</span> <span class="n">config</span><span class="p">.</span><span 
class="nf">port</span> <span class="o">=</span> <span class="mi">8001</span> <span class="c1"># changed from 8000</span> <span class="k">end</span> </code></pre> </div> <p>Create a dedicated handler for this service:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># &lt;backend&gt;/src/handlers/backend_handler.cr</span> <span class="k">class</span> <span class="nc">BackendHandler</span> <span class="o">&lt;</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Handler</span> <span class="k">def</span> <span class="nf">get</span> <span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"backend_process"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">span</span><span class="o">|</span> <span class="n">respond</span> <span class="sx">%[{"message": "Hello from Backend"}]</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>Update the routes accordingly:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># &lt;backend&gt;/config/routes.cr</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">routes</span><span class="p">.</span><span class="nf">draw</span> <span class="k">do</span> <span class="n">path</span> <span class="s2">"/backend"</span><span class="p">,</span> <span class="no">BackendHandler</span><span class="p">,</span> <span class="ss">name: </span><span class="s2">"backend"</span> <span class="o">...</span> <span class="k">end</span> </code></pre> </div> <p>Now run the second application:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>marten serve </code></pre> </div> <p>You should now have two services running:</p> <ul> <li> <code>drukarmy</code> on port 8000</li> <li> 
<code>backend</code> on port 8001</li> </ul> <p>You can test the <code>backend</code> app and then validate the spans in the <strong>Jaeger</strong> UI by running:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">$ </span>curl localhost:8001/backend <span class="o">{</span><span class="s2">"message"</span>: <span class="s2">"Hello from Backend"</span><span class="o">}</span> </code></pre> </div> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn8t3q047zfelzjhcpvhn.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn8t3q047zfelzjhcpvhn.png" alt=" " width="800" height="390"></a></p> <h3> 4.2. Chain of HTTP Calls </h3> <p>Update the <code>HomeHandler</code> in the original <code>drukarmy</code> application to make an outgoing HTTP request<br> to the backend service and propagate the tracing context:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># &lt;drukarmy&gt;/src/handlers/home_handler.cr</span> <span class="k">class</span> <span class="nc">HomeHandler</span> <span class="o">&lt;</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Handler</span> <span class="k">def</span> <span class="nf">dispatch</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">tracer</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"render_the_page"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">span</span><span class="o">|</span> <span class="n">respond</span> <span 
class="n">client_request</span> <span class="k">end</span> <span class="k">end</span> <span class="k">def</span> <span class="nf">client_request</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">tracer</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"client_request"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">span</span><span class="o">|</span> <span class="n">span</span><span class="p">.</span><span class="nf">client!</span> <span class="n">url</span> <span class="o">=</span> <span class="s2">"http://localhost:8001/backend"</span> <span class="n">span</span><span class="p">[</span><span class="s2">"url.full"</span><span class="p">]</span> <span class="o">=</span> <span class="n">url</span> <span class="n">headers</span> <span class="o">=</span> <span class="no">HTTP</span><span class="o">::</span><span class="no">Headers</span><span class="p">.</span><span class="nf">new</span> <span class="c1"># Propagate the trace context via HTTP headers</span> <span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">Propagation</span><span class="o">::</span><span class="no">TraceContext</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">span</span><span class="p">.</span><span class="nf">context</span><span class="p">).</span><span class="nf">inject</span><span class="p">(</span><span class="n">headers</span><span class="p">)</span> <span class="n">response</span> <span class="o">=</span> <span class="no">HTTP</span><span class="o">::</span><span class="no">Client</span><span class="p">.</span><span class="nf">get</span> <span class="n">url</span><span class="p">,</span> <span class="ss">headers: </span><span class="n">headers</span> <span class="n">span</span><span class="p">[</span><span class="s2">"http.response.status_code"</span><span 
class="p">]</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="nf">status_code</span> <span class="k">if</span> <span class="n">response</span><span class="p">.</span><span class="nf">status_code</span> <span class="o">!=</span> <span class="mi">200</span> <span class="n">span</span><span class="p">.</span><span class="nf">status</span><span class="p">.</span><span class="nf">error!</span><span class="p">(</span><span class="s2">"Error: </span><span class="si">#{</span><span class="n">response</span><span class="p">.</span><span class="nf">status_code</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> <span class="k">end</span> <span class="n">response</span><span class="p">.</span><span class="nf">body</span> <span class="k">end</span> <span class="k">rescue</span> <span class="n">ex</span> <span class="p">:</span> <span class="no">Socket</span><span class="o">::</span><span class="no">ConnectError</span> <span class="sx">%[{"error": "Something went wrong"}]</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>This handler does two things:</p> <ol> <li>It starts a span for the incoming request (<code>render_the_page</code>).</li> <li>It performs an outgoing HTTP request to the <code>backend</code> service within a child span (<code>client_request</code>).</li> </ol> <p>The key detail here is the use of <code>OpenTelemetry::Propagation::TraceContext</code>,<br> which injects the trace and span identifiers into the request headers.<br> This allows the <code>backend</code> service to associate its span with the same trace.</p> <p>This mechanism is based on the "W3C Trace Context specification"<sup id="fnref8">8</sup>,<br> which defines how trace context should be propagated using standard HTTP headers like <code>traceparent</code> and <code>tracestate</code>.</p> <p><a
href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc09ok44dtnu2i7sj9otp.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc09ok44dtnu2i7sj9otp.png" alt="Traces with client span" width="800" height="723"></a></p> <h3> 4.3. Receive Trace Context in Backend </h3> <p>In the previous step, we propagated the trace context from the <code>drukarmy</code> app to the <code>backend</code> app.<br> However, the <code>backend</code> service still needs to extract and respect that context in order to properly link its span to the original trace.</p> <p>To do this, update the OpenTelemetry middleware in both applications to extract the context using <code>OpenTelemetry::Propagation::TraceContext</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># &lt;drukarmy&gt;/src/handlers/opentelemetry_middleware.cr</span> <span class="c1"># &lt;backend&gt;/src/handlers/opentelemetry_middleware.cr</span> <span class="k">class</span> <span class="nc">OpenTelemetryMiddleware</span> <span class="o">&lt;</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Middleware</span> <span class="k">def</span> <span class="nf">call</span><span class="p">(</span><span class="n">request</span> <span class="p">:</span> <span class="no">Marten</span><span class="o">::</span><span class="no">HTTP</span><span class="o">::</span><span class="no">Request</span><span class="p">,</span> <span class="n">get_response</span> <span class="p">:</span> <span class="no">Proc</span><span class="p">(</span><span class="no">Marten</span><span class="o">::</span><span class="no">HTTP</span><span class="o">::</span><span 
class="no">Response</span><span class="p">))</span> <span class="p">:</span> <span class="no">Marten</span><span class="o">::</span><span class="no">HTTP</span><span class="o">::</span><span class="no">Response</span> <span class="n">trace</span> <span class="o">=</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">trace</span> <span class="n">traceparent_header</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">headers</span><span class="p">[</span><span class="s2">"traceparent"</span><span class="p">]?</span> <span class="c1"># Extract and assign trace_id from headers</span> <span class="k">if</span> <span class="n">traceparent_header</span> <span class="n">traceparent</span> <span class="o">=</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">Propagation</span><span class="o">::</span><span class="no">TraceContext</span><span class="o">::</span><span class="no">TraceParent</span><span class="p">.</span><span class="nf">from_string</span><span class="p">(</span><span class="n">traceparent_header</span><span class="p">)</span> <span class="n">trace</span><span class="p">.</span><span class="nf">trace_id</span> <span class="o">=</span> <span class="n">traceparent</span><span class="p">.</span><span class="nf">trace_id</span> <span class="n">trace</span><span class="p">.</span><span class="nf">span_context</span><span class="p">.</span><span class="nf">trace_id</span> <span class="o">=</span> <span class="n">traceparent</span><span class="p">.</span><span class="nf">trace_id</span> <span class="k">end</span> <span class="n">trace</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"process_request"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">span</span><span class="o">|</span> <span 
class="n">span</span><span class="p">.</span><span class="nf">server!</span> <span class="c1"># Reconstruct parent span and span context if traceparent header is present</span> <span class="k">if</span> <span class="n">traceparent_header</span> <span class="n">parent_span</span> <span class="o">=</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">Span</span><span class="p">.</span><span class="nf">build</span><span class="p">(</span><span class="s2">"Phantom Parent"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">pspan</span><span class="o">|</span> <span class="n">pspan</span><span class="p">.</span><span class="nf">is_recording</span> <span class="o">=</span> <span class="kp">false</span> <span class="n">pspan</span><span class="p">.</span><span class="nf">context</span> <span class="o">=</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">Propagation</span><span class="o">::</span><span class="no">TraceContext</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">span</span><span class="p">.</span><span class="nf">context</span><span class="p">).</span><span class="nf">extract</span><span class="p">(</span><span class="n">request</span><span class="p">.</span><span class="nf">headers</span><span class="p">).</span><span class="nf">as</span><span class="p">(</span><span class="o">::</span><span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">SpanContext</span><span class="p">)</span> <span class="c1"># Prevent duplicate propagation</span> <span class="n">request</span><span class="p">.</span><span class="nf">headers</span><span class="p">.</span><span class="nf">delete</span><span class="p">(</span><span class="s2">"traceparent"</span><span class="p">)</span> <span class="n">request</span><span class="p">.</span><span 
class="nf">headers</span><span class="p">.</span><span class="nf">delete</span><span class="p">(</span><span class="s2">"tracestate"</span><span class="p">)</span> <span class="k">end</span> <span class="n">span</span><span class="p">.</span><span class="nf">parent</span> <span class="o">=</span> <span class="n">parent_span</span> <span class="k">if</span> <span class="n">parent_span</span> <span class="k">end</span> <span class="c1"># Add standard HTTP attributes</span> <span class="n">span</span><span class="p">[</span><span class="s2">"http.request.method"</span><span class="p">]</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">method</span> <span class="n">span</span><span class="p">[</span><span class="s2">"url.path"</span><span class="p">]</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">path</span> <span class="n">response</span> <span class="o">=</span> <span class="n">get_response</span><span class="p">.</span><span class="nf">call</span> <span class="n">span</span><span class="p">[</span><span class="s2">"http.response.status_code"</span><span class="p">]</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="nf">status</span> <span class="n">response</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>This middleware:</p> <ul> <li>Extracts the <code>traceparent</code> header and sets the trace ID.</li> <li>Reconstructs the parent span from the incoming context.</li> <li>Starts a server span that now belongs to the same trace as the original request in <code>drukarmy</code>.</li> </ul> <p>Now, when you run:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>curl localhost:8000 </code></pre> </div> <p>You’ll see a fully connected distributed trace in <strong>Jaeger</strong> that spans both the 
<code>drukarmy</code> and <code>backend</code> services.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcq6gqwr8qc5bvxhl7sqk.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcq6gqwr8qc5bvxhl7sqk.png" alt="Trace view with 2 services" width="800" height="636"></a></p> <p><strong>Example Propagation Headers</strong></p> <p>When the client span from <code>drukarmy</code> calls the <code>backend</code>, it sends:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>traceparent: 00-05f1aec8000edfb2caa1c7444e97e4d0-0edfb2caa1000004-01 tracestate: </code></pre> </div> <p>Here’s what it means:</p> <ul> <li> <code>trace-id</code>: <code>05f1aec8000edfb2caa1c7444e97e4d0</code> → the shared trace for this request</li> <li> <code>parent-id</code>: <code>0edfb2caa1000004</code> → the span from the drukarmy client</li> <li> <code>trace-flags</code>: <code>01</code> → marks the trace as sampled</li> </ul> <p>With this setup complete, <strong>Jaeger</strong> will show a coherent trace tree with spans from both services correctly linked.</p> <h2> What’s Next? </h2> <p>Now that you have basic and distributed tracing working with OpenTelemetry in a Marten application,<br> here are a few directions to explore next:</p> <ul> <li> <strong>Automated Instrumentation</strong> Use <code>opentelemetry-instrumentation.cr</code><sup id="fnref5">5</sup> to automatically instrument HTTP server and client requests, reducing the need for manual span management.</li> <li> <strong>Semantic Conventions</strong> Enhance the value of your traces by adopting "OpenTelemetry semantic conventions"<sup id="fnref7">7</sup>. 
These help standardize span attributes such as <code>http.method</code>, <code>db.system</code>, and <code>messaging.operation</code>.</li> <li> <strong>Learn More About Tracing in Crystal</strong> For a more in-depth guide, check out my previous article: "How to begin with Traces in Crystal"<sup id="fnref9">9</sup>.</li> </ul> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimdlpzv9lns9o4zww9vw.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimdlpzv9lns9o4zww9vw.png" alt="That's all folks" width="800" height="387"></a></p> <h2> References </h2> <ol> <li id="fn1"> <p><a href="proxy.php?url=https://martenframework.com" rel="noopener noreferrer">Marten Framework</a> ↩</p> </li> <li id="fn2"> <p><a href="proxy.php?url=https://github.com/wyhaines/opentelemetry-sdk.cr" rel="noopener noreferrer">opentelemetry-sdk.cr</a> ↩</p> </li> <li id="fn3"> <p><a href="proxy.php?url=https://www.jaegertracing.io/" rel="noopener noreferrer">Jaeger</a> ↩</p> </li> <li id="fn4"> <p><a href="proxy.php?url=https://martenframework.com/docs/handlers-and-http/middlewares" rel="noopener noreferrer">Marten Middlewares</a> ↩</p> </li> <li id="fn5"> <p><a href="proxy.php?url=https://github.com/wyhaines/opentelemetry-instrumentation.cr/tree/main/src/opentelemetry/instrumentation/frameworks" rel="noopener noreferrer">opentelemetry-instrumentation.cr: frameworks</a> ↩</p> </li> <li id="fn6"> <p><a href="proxy.php?url=https://martenframework.com/docs/handlers-and-http/introduction#the-request-and-response-objects" rel="noopener noreferrer">Marten: The request and response objects</a> ↩</p> </li> <li id="fn7"> <p><a 
href="proxy.php?url=https://opentelemetry.io/docs/specs/semconv/http/http-spans/" rel="noopener noreferrer">Semantic conventions for HTTP spans</a> ↩</p> </li> <li id="fn8"> <p><a href="proxy.php?url=https://www.w3.org/TR/trace-context/" rel="noopener noreferrer">W3C: Trace Context</a> ↩</p> </li> <li id="fn9"> <p><a href="proxy.php?url=https://medium.com/p/2fd6a0255447" rel="noopener noreferrer">How to begin with Traces in Crystal by Michael</a> ↩</p> </li> </ol> crystal crystallang programming marten Speeding Up Crystal CI/CD: Fast Drafts, Optimized Builds Michael Nikitochkin Fri, 02 May 2025 20:11:37 +0000 https://dev.to/miry/speeding-up-crystal-cicd-fast-drafts-optimized-builds-47c9 https://dev.to/miry/speeding-up-crystal-cicd-fast-drafts-optimized-builds-47c9 <p>I have started working on a production web application built with Crystal and Marten.<br> With every new feature I add to the project, the compilation time keeps growing—almost exponentially.</p> <p>I found that waiting 50 minutes to build an image isn't worth it for quick experiments. 
I realized I don't need full performance for development builds.</p> <p>Here's my view on how I can address the problem:</p> <p>I'd like to introduce a "draft" image that builds in around 3 minutes and is ready for deployment—it even starts deploying to the production clusters.</p> <p>After that, it would trigger a "pristine" build with all optimizations enabled, which might take 60 minutes.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wmox25fzvg8qiuvctfr.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wmox25fzvg8qiuvctfr.png" alt="Man in front of northern lights" width="364" height="1012"></a></p> <p>If a new build is triggered in the meantime, the pristine build is cancelled and replaced by the most recent one.</p> <p>With this approach, I can still build and test quickly, while eventually delivering a highly optimized version for better performance.</p> crystal crystallang marten cicd Why Infrastructure Engineers Should Start with Backend Development Michael Nikitochkin Thu, 10 Apr 2025 06:00:51 +0000 https://dev.to/miry/why-infrastructure-engineers-should-start-with-backend-development-34cf https://dev.to/miry/why-infrastructure-engineers-should-start-with-backend-development-34cf <p>Infrastructure engineering has evolved far beyond managing servers and spinning up cloud resources. Today, it’s about crafting resilient platforms, improving developer experience, and obsessing over user needs. That’s why I believe every infrastructure or production engineer should spend at least five years building backend applications before moving into infra roles.</p> <p>Here’s why.</p> <h2> 1.
Code Quality and Developer Empathy </h2> <p>Working on backend products teaches you the fundamentals of writing clean, maintainable code. You develop a natural sensitivity to things like variable naming, code structure, and debugging workflows. More importantly, you learn how to think like the developers who will be your users in an infra role. This empathy helps you build tools that others actually want to use—not just ones that “work.”</p> <h2> 2. UX Isn't Just for Designers </h2> <p>When you’ve been on the receiving end of poorly documented, overly complex internal tooling, you start to appreciate good UX—yes, even in infra. Backend experience wires your brain to care about latency, clarity, and consistency, not just uptime and throughput. It trains you to ask: Will this make someone’s life easier?</p> <h2> 3. Avoiding Infra for Infra’s Sake </h2> <p>Without application experience, it’s easy to fall into the trap of building systems that are technically impressive but practically unusable. You end up with setups only infra teams understand—and nobody else wants to touch. A strong backend foundation keeps you grounded, reminding you that the goal isn’t to build fancy pipelines or run bleeding-edge stacks. The goal is to support real teams solving real problems.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4y6g8cl4abxm5g2bjj5.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4y6g8cl4abxm5g2bjj5.png" alt=" " width="800" height="533"></a></p> <h2> Final Thoughts </h2> <p>Yes, infrastructure is fun. There’s joy in automation, orchestration, and performance tuning. 
But without first getting your hands dirty with backend development, you risk building solutions in a vacuum. Start with the app layer, feel the pain, and then go fix it with empathy and purpose.</p> <p><em>That’s what makes a great infra engineer.</em></p> programming infrastructure On-Call Requirements Michael Nikitochkin Mon, 31 Mar 2025 22:19:59 +0000 https://dev.to/miry/on-call-requirements-4955 https://dev.to/miry/on-call-requirements-4955 <h2> Summary </h2> <p>This document outlines on-call requirements for global companies. Since employees are spread across various countries, each with its own labor laws, it's essential to align expectations before joining an on-call rotation.</p> <p>Being on-call comes with responsibilities and limitations. It affects your social life, sleep schedule, and availability. You serve as a crucial safety net for the organization.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu89uly1d6h21qlt5hqj.jpg" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu89uly1d6h21qlt5hqj.jpg" alt=" " width="800" height="430"></a></p> <h2> Hardware </h2> <h3> Phone </h3> <p>The company should provide a dedicated on-call phone. 
It doesn’t need to be high-end but must be secure and support the following apps:</p> <ul> <li> <strong>PagerDuty</strong> or <strong>Opsgenie</strong> – for incident notifications</li> <li> <strong>Mail</strong> – for alerts and updates</li> <li> <strong>Slack</strong>, <strong>Discord</strong>, or <strong>Google Meet</strong> – for team communication</li> <li> <strong>Browser</strong> – for troubleshooting</li> <li> <strong>1Password</strong> (or similar) – for credential management</li> <li> <strong>Two-Factor Authentication</strong> (2FA) apps – FreeOTP, Yubico Authenticator, Authy, etc.</li> </ul> <h3> Mobile Contract </h3> <p>The mobile plan should support incoming calls from <strong>PagerDuty</strong> and allow incident acknowledgment via mobile data. A <em>10GB</em> monthly data plan is typically sufficient. In case of an incident, the developer should be able to connect their laptop and triage the issue from wherever they are. Outgoing calls are only needed for escalation when other methods fail.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuz7emxufqn3s8xzjmcu.jpg" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuz7emxufqn3s8xzjmcu.jpg" alt=" " width="800" height="430"></a></p> <h2> Balanced On-Call </h2> <p>To avoid burnout, no more than 25% of an engineer's time should be spent on-call. 
Following this rule:</p> <ul> <li>A single-site team requires at least eight engineers for a sustainable rotation.</li> <li>A dual-site team should have at least six engineers per site.</li> <li>Each shift should include both a primary and secondary on-call engineer.</li> </ul> <h2> Quality Balance </h2> <p>Engineers need time for incident response and follow-ups, including writing postmortems. An incident is defined as a sequence of events related to the same contributing factor and should be treated as a single issue.</p> <h2> On-Call Policies &amp; Practices </h2> <p>A well-structured on-call system requires clear policies to ensure smooth operations. Engineers should not have to figure things out when an alert goes off. Instead, proactive planning should include:</p> <ul> <li><strong>Incident severity definitions</strong></li> <li><strong>Playbooks for common issues</strong></li> <li><strong>Clear escalation rules</strong></li> </ul> <p>Aligning these elements in advance helps create an effective and manageable on-call process.</p> sre incidentresponse incidentmanagement resiliency Recording My Crystal Snippets from Today’s Learning Michael Nikitochkin Sat, 15 Mar 2025 11:41:04 +0000 https://dev.to/miry/recording-my-crystal-snippets-from-todays-learning-21ej https://dev.to/miry/recording-my-crystal-snippets-from-todays-learning-21ej <p>I want to document some snippets from today’s learning while working on open-source projects. </p> <h2> 0. Printing Available Methods for an Object of a Class </h2> <p>I found it useful to debug an object’s methods in a way similar to Ruby.
Here’s a snippet that helped me with this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># Print the available methods of a Crystal class</span> <span class="c1"># Usage: `puts Spec::CLI.methods.sort`</span> <span class="k">class</span> <span class="nc">Object</span> <span class="k">macro</span> <span class="nf">methods</span> <span class="p">{{</span> <span class="vi">@type</span><span class="p">.</span><span class="nf">methods</span><span class="p">.</span><span class="nf">map</span> <span class="o">&amp;</span><span class="p">.</span><span class="nf">name</span><span class="p">.</span><span class="nf">stringify</span> <span class="p">}}</span> <span class="k">end</span> <span class="k">end</span> <span class="nb">puts</span> <span class="no">Spec</span><span class="o">::</span><span class="no">CLI</span><span class="p">.</span><span class="nf">methods</span><span class="p">.</span><span class="nf">sort</span> <span class="c1"># =&gt; ["abort!", "add_formatter", ...]</span> </code></pre> </div> <p>More macro examples can be found in the <a href="proxy.php?url=https://crystal-lang.org/reference/1.15/syntax_and_semantics/macros/macro_methods.html" rel="noopener noreferrer">Crystal Macro Methods</a> documentation. </p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgckammc4harxop12z3h.jpeg" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgckammc4harxop12z3h.jpeg" alt="phantasy crystal spec" width="800" height="457"></a></p> <h2> 1. 
Filtering Crystal Spec Tests Based on Tags </h2> <p>I worked on executing different types of tests, including unit and integration tests.<br><br> There are multiple approaches to separating them, and many ideas can be found in this <a href="proxy.php?url=https://forum.crystal-lang.org/t/exclude-all-tests-with-tags/6861/1" rel="noopener noreferrer">Crystal Forum discussion</a>. </p> <p>Here’s the approach I took: </p> <h3> Project Folder Structure </h3> <p>My project’s test structure looks like this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>$ tree spec spec ├── awscr-s3 ├── awscr-s3_spec.cr ├── fixtures.cr ├── integration │   ├── compose.yml │   └── minio_spec.cr └── spec_helper.cr </code></pre> </div> <p>The integration tests are marked with the tag <code>"integration"</code>. </p> <h3> Filtering Tests Based on Tags </h3> <p>One of the things I love about Crystal is that the code is simple and intuitive to read.<br><br> While learning about the <code>Spec</code> module in the <a href="proxy.php?url=https://crystal-lang.org/api/master/Spec.html" rel="noopener noreferrer">Crystal Spec Documentation</a>, I found links to the source code, which helped me understand how filtering works. 
</p> <p>Here’s how I implemented tag-based filtering:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># spec/spec_helper.cr</span> <span class="k">class</span> <span class="nc">Spec</span><span class="o">::</span><span class="no">CLI</span> <span class="k">def</span> <span class="nf">tags</span> <span class="vi">@tags</span> <span class="k">end</span> <span class="k">end</span> <span class="no">Spec</span><span class="p">.</span><span class="nf">around_each</span> <span class="k">do</span> <span class="o">|</span><span class="n">example</span><span class="o">|</span> <span class="n">tags</span> <span class="o">=</span> <span class="no">Spec</span><span class="p">.</span><span class="nf">cli</span><span class="p">.</span><span class="nf">tags</span> <span class="c1"># By default, skip tagged tests and run only unit tests</span> <span class="k">next</span> <span class="k">if</span> <span class="p">(</span><span class="n">tags</span><span class="p">.</span><span class="nf">nil?</span> <span class="o">||</span> <span class="n">tags</span><span class="p">.</span><span class="nf">empty?</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">example</span><span class="p">.</span><span class="nf">example</span><span class="p">.</span><span class="nf">all_tags</span><span class="p">.</span><span class="nf">empty?</span> <span class="n">example</span><span class="p">.</span><span class="nf">run</span> <span class="k">end</span> </code></pre> </div> <h4> Explanation </h4> <p><code>Spec.cli</code> is a command-line interface that parses options and stores them internally in the <code>@tags</code> variable. 
</p> <p>For example, when running:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>$ crystal spec --tag 'integration' </code></pre> </div> <p>The <code>"integration"</code> tag is stored as a <code>Set</code> in <code>@tags</code>. This allows me to check which filters were enabled without manually parsing the command-line arguments. </p> <p>However, there’s a small drawback: <code>@tags</code> is not publicly accessible. To work around this, I extended the <code>Spec::CLI</code> class and exposed it. (There may be a better way to do this.) </p> <p>The second part of the code is a simple filtering mechanism implemented using <code>Spec.around_each</code>: </p> <ul> <li>It checks the provided tags and then validates the test’s tags. </li> <li>If no tags are specified, all tagged tests are skipped by default. </li> </ul> <p>A simple debug statement like <code>pp! example</code> can help explore more filtering options. </p> <h2> 2. Configuring Test Dependencies Based on Tags </h2> <p>Integration tests allow sending real requests.<br><br> Instead of adding tags to every integration test individually, we can leverage the folder structure (e.g., placing them in an <code>integration</code> folder). 
</p> <p>Here’s one way to configure <code>WebMock</code> dynamically based on test tags or file location:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="no">Spec</span><span class="p">.</span><span class="nf">around_each</span> <span class="k">do</span> <span class="o">|</span><span class="n">example</span><span class="o">|</span> <span class="n">integration</span> <span class="o">=</span> <span class="n">example</span><span class="p">.</span><span class="nf">example</span><span class="p">.</span><span class="nf">all_tags</span><span class="p">.</span><span class="nf">includes?</span><span class="p">(</span><span class="s2">"integration"</span><span class="p">)</span> <span class="o">||</span> <span class="n">example</span><span class="p">.</span><span class="nf">example</span><span class="p">.</span><span class="nf">file</span><span class="p">.</span><span class="nf">includes?</span><span class="p">(</span><span class="s2">"spec/integration"</span><span class="p">)</span> <span class="no">WebMock</span><span class="p">.</span><span class="nf">reset</span> <span class="no">WebMock</span><span class="p">.</span><span class="nf">allow_net_connect</span> <span class="o">=</span> <span class="n">integration</span> <span class="n">example</span><span class="p">.</span><span class="nf">run</span> <span class="k">end</span> </code></pre> </div> <p>That’s all for today! 
🚀 </p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iav7tzz2q1zwal1r5i1.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iav7tzz2q1zwal1r5i1.png" alt="That's all folks" width="800" height="387"></a></p> crystal programming Involving the Right People in an Incident Michael Nikitochkin Sun, 16 Feb 2025 23:27:04 +0000 https://dev.to/miry/involving-the-right-people-in-an-incident-all-vs-correct-1p2a https://dev.to/miry/involving-the-right-people-in-an-incident-all-vs-correct-1p2a <p>It's been a while since I last wrote about incidents. Lately, I’ve been more focused on backend development in Ruby and Crystal projects, but after handling a few recent incidents, I wanted to jot down my thoughts.</p> <h3> The Problem: Over-Involving Teams During an Incident </h3> <p>It’s common for an Incident Commander to be paged when something isn’t working. As the Incident Commander, you might see reports from customers. Your responsibility is to identify contributing factors and bring the right people together to stop the bleeding.</p> <p>However, the concept of bringing the "correct" people is sometimes misunderstood. Some Incident Commanders assume this means inviting <em>everyone</em> who might be remotely involved. They create massive video or audio calls, hoping someone will figure out the problem. While this might seem like a thorough approach, it often leads to frustration among teams who are pulled into the incident but have nothing to contribute.
They end up waiting passively, leading to wasted time and effort.</p> <p>This broad approach may give Incident Commanders a false sense of control—believing that if all teams are present, they’ve done everything possible. But in reality, each team may assume the issue lies elsewhere, leading to passive listening rather than active problem-solving.</p> <h3> The Consequences of Over-Involvement </h3> <p>Bringing too many people into an incident can have several negative effects:</p> <ul> <li> <strong>High-cost meetings with low productivity:</strong> More people in the call means more noise, more conflicting theories, and a harder time reaching a consensus.</li> <li> <strong>Blame-shifting and distraction:</strong> Each team might focus on their own long-standing issues rather than identifying the real root cause.</li> <li> <strong>Loss of the bigger picture:</strong> With too many perspectives, the core problem can become obscured, making it harder to pinpoint the actual failure.</li> </ul> <p>In complex systems, problems can be hidden under layers of dependencies, making a broad approach ineffective. 
It’s crucial to separate valid long-term concerns from immediate incident causes.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkz41pritx60wtupg14r6.jpeg" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkz41pritx60wtupg14r6.jpeg" alt="Situation room with a lot of folks" width="800" height="800"></a></p> <h3> How to Solve an Incident Without Over-Involving Teams </h3> <p>So, how can an Incident Commander solve an unknown issue efficiently without involving too many people, while still resolving the problem as quickly as possible?</p> <ol> <li> <strong>Stop the bleeding</strong> using all available tools. Start by narrowing down the issue to its closest impact point—typically where users are directly affected. Investigate progressively deeper into microservices and vendor solutions.</li> <li> <strong>Analyze patterns</strong> by building a timeline and reproducing the problem as closely as possible to the reported issue. This is often the hardest step, especially if the issue is intermittent or device-specific.</li> <li> <strong>Leverage observability tools</strong> across mobile, backend services, and database profiling. These tools should be a core part of every playbook.</li> <li> <strong>Identify the success lines</strong> in monitoring reports to determine possible mitigation steps.</li> <li> <strong>Engage teams incrementally</strong> — bring in only the necessary teams one at a time, verifying details with each and syncing on next steps before continuing the investigation. 
Even if details are shared in an incident channel, it's more effective to request targeted help in short bursts.</li> <li> <strong>Consult experts when needed</strong> — if someone has experience with a similar issue, involve them, but avoid defaulting to large group calls.</li> <li> <strong>Track multiple leads separately</strong> in different threads, summarizing findings regularly.</li> <li> <strong>Mitigation over resolution:</strong> Depending on the incident’s criticality, full resolution might not be immediate. Collaborate closely with 1-2 relevant teams to assess mitigation strategies before broadening involvement.</li> <li> <strong>Maintain focused escalation:</strong> Always escalate and page when necessary. Most people are willing to help, but ensure they have a clear role rather than keeping them in a call unnecessarily.</li> </ol> <h3> Conclusion: All vs. Correct </h3> <p>Should you bring <em>everyone</em> into an incident call? Or should you focus on identifying the <em>correct</em> people? While including all teams might seem like a faster way to solve the problem, understanding the issue through observability tools and selectively involving the right teams is a more effective approach. This minimizes stress and improves resolution time.</p> <p>Does this mean you should hesitate to escalate? Absolutely not — always escalate when necessary. People are generally willing to help, but ensure they have a clear role rather than keeping them in a call unnecessarily.</p> <p>By shifting from an <em>“all-in”</em> approach to a <em>targeted</em>, <em>observability-driven</em> strategy, Incident Commanders can handle incidents more efficiently, reduce noise, and ensure faster recovery.</p> <p>Of course, this isn’t something that can be perfected during an active incident. Understanding company structure, service dependencies, mitigation practices, and observability tools requires preparation. 
One of the best ways to improve is by reviewing past incidents and occasionally practicing simulated ones using exercises like <em>Wheel of Misfortune</em>.</p> <p>And now, I trust you to make the right call!</p> <p>Check out these resources to learn more:</p> <ul> <li><a href="proxy.php?url=https://cloud.google.com/blog/products/management-tools/shrinking-the-time-to-mitigate-production-incidents" rel="noopener noreferrer">Shrinking the time to mitigate production incidents—CRE life lessons</a></li> </ul> sre incidentresponse incidentmanagement