DEV Community: Michael Nikitochkin The latest articles on DEV Community by Michael Nikitochkin (@miry). https://dev.to/miry How I Deployed Woodpecker CI on Fedora IoT Michael Nikitochkin Mon, 26 Jan 2026 20:26:44 +0000 https://dev.to/miry/how-i-deployed-woodpecker-ci-on-fedora-iot-4pbh <p>I wanted a self-hosted CI/CD system that was lightweight and container-native, without the overhead of enterprise solutions. This is the story of how I deployed <strong>Woodpecker CI</strong><sup id="fnref1">1</sup> on my <strong>Fedora IoT</strong><sup id="fnref2">2</sup> server. Along the way, I had to navigate DNS conflicts, SELinux hurdles, and the challenge of secure external access.</p> <p>In this setup, I used <strong>OpenTofu</strong><sup id="fnref3">3</sup> for infrastructure automation and <strong>Cloudflare Tunnel</strong><sup id="fnref4">4</sup> to bridge the gap between my local network and the web.</p> <h2> Table of Contents </h2> <ul> <li>Why I Chose Woodpecker CI</li> <li>My Infrastructure Layout</li> <li>The Road to Deployment</li> <li>Step 1: Handling Security &amp; Secrets</li> <li>Step 2: Bridging with Cloudflare Tunnel</li> <li>Step 3: Orchestrating the Server</li> <li>Step 4: Taming the Agent &amp; Networking</li> <li>How I Verified Everything</li> <li>Final Thoughts &amp; Troubleshooting</li> </ul> <h2> Why I Chose Woodpecker CI </h2> <p>I needed a platform that felt modern but stayed out of my way. 
<strong>Woodpecker CI</strong> fit the bill because:</p> <ul> <li> <strong>Isolation</strong>: I liked that every build runs in its own ephemeral container.</li> <li> <strong>Integration</strong>: It has seamless support for the GitHub OAuth flow I already use.</li> <li> <strong>Simplicity</strong>: I could define my pipelines in a familiar <code>.woodpecker.yaml</code> format.</li> </ul> <blockquote> <p><strong>Researching the bits:</strong> I spent some time with the <strong>Woodpecker</strong> architecture docs<sup id="fnref5">5</sup> to understand how the server and agent communicate over gRPC.</p> </blockquote> <h2> My Infrastructure Layout </h2> <p>I decided to use <strong>Cloudflare Tunnel</strong> so I wouldn't have to touch my firewall or handle SSL certificates manually. My <strong>Woodpecker Server</strong> acts as the brain, while the <strong>Agent</strong> does the heavy lifting via the <strong>Podman</strong> socket.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tzpopngw9htfxzln2fi.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tzpopngw9htfxzln2fi.png" alt="System Architecture Diagram" width="800" height="800"></a></p> <h3> My Core Decisions </h3> <ul> <li> <strong>DNS</strong>: I solved the port 53 conflict by using a custom bridge network for internal resolution.</li> <li> <strong>Access</strong>: I mapped <code>https://ci.homelab.example</code> directly to my local instance.</li> </ul> <h2> The Road to Deployment </h2> <p>My process involved four main phases. 
I used <strong>OpenTofu</strong> to make the deployment repeatable and <strong>Podman Quadlets</strong><sup id="fnref6">6</sup> to manage the container lifecycles as systemd services.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ylfejfc32u8m9vt444.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ylfejfc32u8m9vt444.png" alt="Deployment Flow Diagram" width="800" height="800"></a></p> <h2> Step 1: Handling Security &amp; Secrets </h2> <p>I started with the security groundwork. First, I set up a new OAuth application on GitHub to handle authentication.</p> <h3> My GitHub OAuth Setup </h3> <ol> <li>I went to <strong>GitHub settings</strong> &gt; <strong>Developer settings</strong> &gt; <strong>OAuth Apps</strong> &gt; <strong>New OAuth App</strong><sup id="fnref7">7</sup>.</li> <li> <strong>Homepage URL</strong>: <code>https://ci.homelab.example</code> </li> <li> <strong>Authorization callback URL</strong>: <code>https://ci.homelab.example/authorize</code> </li> <li>I made sure to store the <strong>Client ID</strong> and <strong>Client Secret</strong> securely for later.</li> </ol> <h3> Generating the Agent Secret </h3> <p>The server and agent need a shared secret to talk to each other. I generated a secure random string using openssl:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>openssl rand <span class="nt">-hex</span> 32 </code></pre> </div> <h2> Step 2: Bridging with Cloudflare Tunnel </h2> <p>I chose to use a tunnel because it's much simpler than managing port forwarding on my router. 
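</p> <p>The appeal, for me, is that the entire edge configuration collapses into a few lines of ingress rules. As a minimal sketch (the tunnel ID below is a placeholder), the <code>config.yml</code> that this step ultimately produces simply maps the public hostname to the local Woodpecker port:</p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code>tunnel: TUNNEL_ID
credentials-file: /etc/cloudflared/TUNNEL_ID.json
ingress:
  - hostname: ci.homelab.example
    service: http://localhost:8000
  - service: http_status:404
</code></pre> </div> <p>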
</p> <h3> My Initial Authentication </h3> <p>I had to run this once on my Fedora IoT server to link it to my Cloudflare account:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>cloudflared tunnel login </code></pre> </div> <h3> Automation with OpenTofu </h3> <p>I wrote an OpenTofu resource to automate the installation of <code>cloudflared</code>, create the tunnel, and set up the systemd service. Here is the configuration I used:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight hcl"><code><span class="nx">resource</span> <span class="s2">"null_resource"</span> <span class="s2">"setup_cloudflare_tunnel"</span> <span class="p">{</span> <span class="nx">connection</span> <span class="p">{</span> <span class="nx">type</span> <span class="p">=</span> <span class="s2">"ssh"</span> <span class="nx">user</span> <span class="p">=</span> <span class="s2">"admin"</span> <span class="nx">host</span> <span class="p">=</span> <span class="s2">"192.168.1.100"</span> <span class="nx">private_key</span> <span class="p">=</span> <span class="nx">file</span><span class="p">(</span><span class="s2">"~/.ssh/id_ed25519"</span><span class="p">)</span> <span class="p">}</span> <span class="nx">provisioner</span> <span class="s2">"file"</span> <span class="p">{</span> <span class="nx">content</span> <span class="p">=</span> <span class="nx">local</span><span class="p">.</span><span class="nx">tunnel_script_content</span> <span class="nx">destination</span> <span class="p">=</span> <span class="s2">"/tmp/setup_cloudflare_tunnel.sh"</span> <span class="p">}</span> <span class="nx">provisioner</span> <span class="s2">"remote-exec"</span> <span class="p">{</span> <span class="nx">inline</span> <span class="p">=</span> <span class="p">[</span> <span class="s2">"chmod +x /tmp/setup_cloudflare_tunnel.sh"</span><span class="p">,</span> <span class="s2">"/tmp/setup_cloudflare_tunnel.sh"</span><span class="p">,</span> <span 
class="p">]</span> <span class="p">}</span> <span class="p">}</span> <span class="nx">locals</span> <span class="p">{</span> <span class="nx">tunnel_script_content</span> <span class="p">=</span> <span class="o">&lt;&lt;-</span><span class="no">EOF</span><span class="sh"> #!/bin/bash set -euo pipefail # Fedora IoT uses rpm-ostree, install cloudflared from binary if missing if ! command -v cloudflared &amp;&gt; /dev/null; then ARCH=$(uname -m) case $ARCH in x86_64) DOWNLOAD_URL="https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64" ;; aarch64) DOWNLOAD_URL="https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-arm64" ;; *) echo "❌ Unsupported architecture: $ARCH"; exit 1 ;; esac curl -L "$DOWNLOAD_URL" -o /tmp/cloudflared sudo install -m 755 /tmp/cloudflared /usr/local/bin/cloudflared fi # Create tunnel and route DNS (requires manual 'cloudflared tunnel login' first) TUNNEL_NAME="woodpecker" cloudflared tunnel create "$TUNNEL_NAME" || true TUNNEL_ID=$(cloudflared tunnel list | grep "$TUNNEL_NAME" | awk '{print $1}') cloudflared tunnel route dns "$TUNNEL_NAME" ci.homelab.example 2&gt;/dev/null || true # Setup system user and config sudo useradd --system --home /var/lib/cloudflared --shell /usr/sbin/nologin cloudflared 2&gt;/dev/null || true sudo mkdir -p /etc/cloudflared /var/lib/cloudflared sudo cp "$HOME/.cloudflared/$TUNNEL_ID.json" /etc/cloudflared/ sudo chown -R cloudflared:cloudflared /etc/cloudflared /var/lib/cloudflared sudo tee /etc/cloudflared/config.yml &gt; /dev/null &lt;&lt;CONFIG tunnel: $TUNNEL_ID credentials-file: /etc/cloudflared/$TUNNEL_ID.json ingress: - hostname: ci.homelab.example service: http://localhost:8000 - service: http_status:404 CONFIG # Setup and start Systemd service sudo tee /etc/systemd/system/cloudflared.service &gt; /dev/null &lt;&lt;SERVICE [Unit] Description=Cloudflare Tunnel After=network.target [Service] Type=simple User=cloudflared Group=cloudflared 
ExecStart=/usr/local/bin/cloudflared tunnel --config /etc/cloudflared/config.yml run Restart=on-failure [Install] WantedBy=multi-user.target SERVICE sudo systemctl daemon-reload sudo systemctl enable --now cloudflared </span><span class="no">EOF </span><span class="p">}</span> </code></pre> </div> <p>My shell script (which I managed in a <code>local</code> block) handles the heavy lifting of installing the binary and configuring the <code>cloudflared</code> system user.</p> <h2> Step 3: Orchestrating the Server </h2> <p>For the Woodpecker Server, I used a <strong>Podman Quadlet</strong><sup id="fnref6">6</sup> container. This allowed me to manage the container-native service directly through systemd.</p> <h3> My Quadlet Configuration (<code>woodpecker-server.container</code>) </h3> <div class="highlight js-code-highlight"> <pre class="highlight ini"><code><span class="nn">[Container]</span> <span class="py">ContainerName</span><span class="p">=</span><span class="s">woodpecker-server</span> <span class="py">Image</span><span class="p">=</span><span class="s">docker.io/woodpeckerci/woodpecker-server:v3</span> <span class="py">PublishPort</span><span class="p">=</span><span class="s">8000:8000</span> <span class="py">PublishPort</span><span class="p">=</span><span class="s">9000:9000</span> <span class="py">Volume</span><span class="p">=</span><span class="s">/var/lib/woodpecker/server:/var/lib/woodpecker:Z</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_ADMIN=admin</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_HOST=https://ci.homelab.example</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_GITHUB=true</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_GITHUB_CLIENT=Iv1.a629723b814c123e</span> <span class="py">Environment</span><span class="p">=</span><span 
class="s">WOODPECKER_GITHUB_SECRET=ghs_1234567890abcdef1234567890abcdef12345678</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_AGENT_SECRET=a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456</span> <span class="nn">[Service]</span> <span class="py">Restart</span><span class="p">=</span><span class="s">always</span> <span class="nn">[Install]</span> <span class="py">WantedBy</span><span class="p">=</span><span class="s">multi-user.target</span> </code></pre> </div> <h3> My Automation Flow </h3> <p>I used OpenTofu to push the container file and refresh the systemd daemon. This made it easy to iterate on my configuration.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight hcl"><code><span class="nx">resource</span> <span class="s2">"null_resource"</span> <span class="s2">"woodpecker_server"</span> <span class="p">{</span> <span class="c1"># ... connection details ...</span> <span class="nx">provisioner</span> <span class="s2">"file"</span> <span class="p">{</span> <span class="nx">content</span> <span class="p">=</span> <span class="nx">local</span><span class="p">.</span><span class="nx">woodpecker_server_config</span> <span class="nx">destination</span> <span class="p">=</span> <span class="s2">"/etc/containers/systemd/woodpecker-server.container"</span> <span class="p">}</span> <span class="nx">provisioner</span> <span class="s2">"remote-exec"</span> <span class="p">{</span> <span class="nx">inline</span> <span class="p">=</span> <span class="p">[</span> <span class="s2">"sudo mkdir -p /var/lib/woodpecker/server"</span><span class="p">,</span> <span class="s2">"sudo chown $USER:$USER -R /var/lib/woodpecker"</span><span class="p">,</span> <span class="s2">"sudo systemctl daemon-reload"</span><span class="p">,</span> <span class="s2">"sudo systemctl enable --now woodpecker-server.service"</span><span class="p">,</span> <span class="p">]</span> <span class="p">}</span> <span 
class="p">}</span> </code></pre> </div> <h2> Step 4: Taming the Agent &amp; Networking </h2> <p>This was the trickiest part of my journey. I had to deal with SELinux and an annoying DNS conflict.</p> <h3> My DNS Port 53 Conflict </h3> <p>Since I run <strong>PiHole</strong> on the same machine, it was listening on <code>0.0.0.0:53</code>. This completely blocked Podman from starting its own internal DNS service for my build networks.</p> <p><strong>The symptoms I saw:</strong></p> <ul> <li>My agent containers refused to start.</li> <li>I found <code>"Address already in use (os error 98)"</code> in the logs.</li> </ul> <p><strong>How I fixed it:</strong></p> <ol> <li> <strong>I reconfigured PiHole</strong>: I forced it to bind only to my LAN and localhost IPs.</li> <li> <strong>I disabled Podman DNS</strong>: I made a global change to Podman's config so it wouldn't try to claim port 53.</li> </ol> <h4> My Podman Fix (<code>/etc/containers/containers.conf</code>) </h4> <div class="highlight js-code-highlight"> <pre class="highlight ini"><code><span class="nn">[network]</span> <span class="py">dns_enabled</span> <span class="p">=</span> <span class="s">false</span> </code></pre> </div> <h3> My Agent Configuration </h3> <p>I also had to disable SELinux labels for the agent so it could talk to the Podman socket without being blocked.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight ini"><code><span class="nn">[Container]</span> <span class="py">ContainerName</span><span class="p">=</span><span class="s">woodpecker-agent</span> <span class="py">Image</span><span class="p">=</span><span class="s">docker.io/woodpeckerci/woodpecker-agent:v3</span> <span class="py">User</span><span class="p">=</span><span class="s">root</span> <span class="py">SecurityLabelDisable</span><span class="p">=</span><span class="s">true</span> <span class="py">Volume</span><span class="p">=</span><span class="s">/run/podman/podman.sock:/var/run/docker.sock</span> <span 
class="py">Volume</span><span class="p">=</span><span class="s">/etc/containers:/etc/containers:ro</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_SERVER=192.168.1.100:9000</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_AGENT_SECRET=a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456</span> <span class="py">Environment</span><span class="p">=</span><span class="s">WOODPECKER_BACKEND=docker</span> </code></pre> </div> <h2> How I Verified Everything </h2> <p>Once the services were up, I ran a quick status check and set up my first pipeline.</p> <h3> 1. Verification </h3> <p>I checked that my three core services were healthy:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nb">sudo </span>systemctl status cloudflared woodpecker-server woodpecker-agent </code></pre> </div> <h3> 2. My First Pipeline </h3> <p>I logged into my new dashboard at <code>https://ci.homelab.example</code> and added this <code>.woodpecker.yaml</code> to one of my repos:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">steps</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">test</span> <span class="na">image</span><span class="pi">:</span> <span class="s">alpine:latest</span> <span class="na">commands</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">echo "Hello from Woodpecker CI!"</span> </code></pre> </div> <p>Watching the first green checkmark appear was a great feeling.</p> <h2> Final Thoughts &amp; Troubleshooting </h2> <p>The biggest hurdle for me was definitely the DNS conflict. If you find your builds can't resolve hostnames, check if Podman is fighting another service for port 53.</p> <p>I'm really happy with how this turned out. 
It's a clean, efficient CI/CD setup that runs perfectly on my Fedora IoT hardware.</p> <ol> <li id="fn1"> <p><a href="proxy.php?url=https://woodpecker-ci.org/" rel="noopener noreferrer">Woodpecker CI</a> ↩</p> </li> <li id="fn2"> <p><a href="proxy.php?url=https://iot.fedoraproject.org/" rel="noopener noreferrer">Fedora IoT</a> ↩</p> </li> <li id="fn3"> <p><a href="proxy.php?url=https://opentofu.org/" rel="noopener noreferrer">OpenTofu</a> ↩</p> </li> <li id="fn4"> <p><a href="proxy.php?url=https://developers.cloudflare.com/cloudflare-one/connections/connect-apps/" rel="noopener noreferrer">Cloudflare Tunnel</a> ↩</p> </li> <li id="fn5"> <p><a href="proxy.php?url=https://woodpecker-ci.org/docs/development/architecture" rel="noopener noreferrer">Woodpecker CI Documentation</a> ↩</p> </li> <li id="fn6"> <p><a href="proxy.php?url=https://docs.podman.io/en/latest/markdown/podman-systemd.unit.5.html" rel="noopener noreferrer">Podman Quadlet Guide</a> ↩</p> </li> <li id="fn7"> <p><a href="proxy.php?url=https://github.com/settings/applications/new" rel="noopener noreferrer">GitHub New OAuth Application</a> ↩</p> </li> </ol> tofu fedora cloudflare cicd From 4 Minutes to 3 Seconds: How Database Transaction Rollback Revolutionized Test Suite Michael Nikitochkin Sat, 24 Jan 2026 16:15:32 +0000 https://dev.to/miry/from-4-minutes-to-3-seconds-how-database-transaction-rollback-revolutionized-test-suite-4olh <h2> Executive Summary </h2> <p>In a single afternoon, I transformed my <strong>Crystal/Marten</strong> test suite from a 4-minute ordeal into a 3-second sprint by replacing expensive database <code>TRUNCATE</code> with lightning-fast <strong>transaction rollback</strong>. 
This 98.8% performance improvement didn't just make developers happier—it fundamentally changed how I approach testing.</p> <p><strong>The Bottom Line:</strong> 447 tests now run in <strong>2.84 seconds</strong> instead of <strong>245.87 seconds</strong>—an <strong>86.5x speedup</strong> that makes <strong>test-driven development (TDD)</strong> practical again.</p> <h2> The Problem: When Tests Become a Bottleneck </h2> <h3> The Performance Crisis </h3> <p>The test suite was destroying developer productivity. Every pull request meant waiting, every bug fix meant coffee breaks. With truncation-based test isolation, the team was hemorrhaging time:</p> <p><strong>Individual test example:</strong> <code>UserTest#test_email_validation</code> - <strong>0.527 seconds</strong></p> <h3> The Root Cause: Truncation Hell </h3> <p>The culprit was my test isolation strategy: <strong>database truncation</strong>. Before each test, I was:</p> <ol> <li> <strong>Dropping and recreating data</strong> across 20+ tables</li> <li> <strong>Performing expensive I/O operations</strong> that scaled with data size</li> </ol> <p>Each test was paying a <strong>~500ms</strong> tax just to clean up after itself. With 447 tests, that's over 3 minutes of pure overhead.</p> <h2> The Solution: Transaction Rollback Strategy </h2> <p>Instead of physically deleting data and resetting sequences, I could simply <strong>wrap each test in a database transaction</strong> and <strong>always roll back</strong>. 
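</p> <p>The database-level behavior is easy to see in isolation. Here is a minimal sketch using the <code>sqlite3</code> CLI purely for illustration (the project itself runs PostgreSQL, but the <code>BEGIN</code>/<code>ROLLBACK</code> semantics shown are the same): rows written inside a rolled-back transaction leave no trace.</p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>db=$(mktemp)
sqlite3 "$db" "CREATE TABLE users (email TEXT);"
# Everything between BEGIN and ROLLBACK is discarded
sqlite3 "$db" "BEGIN; INSERT INTO users VALUES ('a@example.com'); ROLLBACK;"
sqlite3 "$db" "SELECT COUNT(*) FROM users;"   # prints 0
rm -f "$db"
</code></pre> </div> <p>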
<strong>PostgreSQL</strong> transactions are designed for exactly this—atomic operations that can be discarded instantly.</p> <h3> How It Works </h3> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># The core insight: Wrap ONLY the test execution</span> <span class="k">def</span> <span class="nf">run_one</span><span class="p">(</span><span class="nb">name</span> <span class="p">:</span> <span class="no">String</span><span class="p">,</span> <span class="nb">proc</span> <span class="p">:</span> <span class="no">Test</span> <span class="o">-&gt;</span><span class="p">)</span> <span class="p">:</span> <span class="no">Nil</span> <span class="c1"># 1. Setup runs OUTSIDE transaction</span> <span class="n">before_setup</span> <span class="n">setup</span> <span class="n">after_setup</span> <span class="c1"># 2. Test runs INSIDE transaction</span> <span class="no">Marten</span><span class="o">::</span><span class="no">DB</span><span class="o">::</span><span class="no">Connection</span><span class="p">.</span><span class="nf">default</span><span class="p">.</span><span class="nf">transaction</span> <span class="k">do</span> <span class="nb">proc</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="nb">self</span><span class="p">)</span> <span class="c1"># Run the actual test</span> <span class="k">raise</span> <span class="no">Marten</span><span class="o">::</span><span class="no">DB</span><span class="o">::</span><span class="no">Errors</span><span class="o">::</span><span class="no">Rollback</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="s2">"Test cleanup"</span><span class="p">)</span> <span class="k">end</span> <span class="c1"># 3. 
Teardown runs AFTER rollback</span> <span class="n">before_teardown</span> <span class="n">teardown</span> <span class="n">after_teardown</span> <span class="k">end</span> </code></pre> </div> <p><strong>Why This Is Magical:</strong></p> <ol> <li> <strong>Setup/teardown run once</strong> - Database schema loading, migrations, etc.</li> <li> <strong>Only test data is transactional</strong> - All changes disappear instantly</li> <li> <strong>No I/O overhead</strong> - <code>rollback</code> is just a memory operation</li> <li> <strong>Clean isolation</strong> - Each test gets a fresh slate automatically</li> </ol> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b69eoniw4u3xvk519b0.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b69eoniw4u3xvk519b0.png" alt=" " width="800" height="533"></a></p> <h2> Performance Transformation </h2> <h3> Dramatic Individual Test Speedup </h3> <p><strong>Example: <code>User::ValidateTest#test_email_validation</code></strong></p> <div class="table-wrapper-paragraph"><table> <thead> <tr> <th>Metric</th> <th>Before</th> <th>After</th> <th>Improvement</th> </tr> </thead> <tbody> <tr> <td>Test Duration</td> <td>0.527s</td> <td>0.002s</td> <td><strong>263x faster</strong></td> </tr> </tbody> </table></div> <h3> Test Suite Revolution </h3> <div class="table-wrapper-paragraph"><table> <thead> <tr> <th>Metric</th> <th>Before (Truncation)</th> <th>After (Transaction)</th> <th>Improvement</th> </tr> </thead> <tbody> <tr> <td>Total Duration</td> <td>00:04:05.872s (245.87s)</td> <td>00:00:02.841s (2.84s)</td> <td><strong>86.5x faster</strong></td> </tr> <tr> <td>Runs per Second</td> <td>0.00407 runs/s</td> <td>0.352 
runs/s</td> <td><strong>86.5x improvement</strong></td> </tr> <tr> <td>Test Count</td> <td>447 tests</td> <td>447 tests</td> <td>Same coverage</td> </tr> </tbody> </table></div> <h3> What This Means for Developers </h3> <p><strong>Before (Truncation Hell):</strong></p> <ul> <li>"I'll run tests while grabbing coffee" - waiting kills productivity</li> <li>1.8 tests per second - glacial feedback</li> <li>4+ minute feedback loop - context switching inevitable</li> <li> <strong>TDD feels painful</strong> - testing becomes optional</li> </ul> <p><strong>After (Transaction Magic):</strong></p> <ul> <li>"I'll run tests before every commit" - instant gratification</li> <li>9,450 tests per minute - blazing speed</li> <li>3-second feedback loop - you're still thinking about the code</li> <li> <strong>TDD feels effortless</strong> - testing becomes second nature</li> </ul> <h2> Technical Deep Dive </h2> <h3> The Problem: Expensive Database Surgery </h3> <p>Each test paid roughly 40-60ms of truncation overhead per table:</p> <ul> <li> <strong>TRUNCATE operations:</strong> 20-30ms (disk I/O)</li> <li> <strong>Sequence resets:</strong> 10-15ms (catalog updates)</li> <li> <strong>Connection overhead:</strong> 5-10ms</li> </ul> <p>Across the 20+ truncated tables, that is the ~500ms per-test tax measured above; with 447 tests, it adds up to over 3 minutes of pure cleanup time.</p> <h3> The Solution: Instant Rollback </h3> <p>Transaction rollback costs ~2-4ms per test (more than a 99% reduction in per-test cleanup cost):</p> <ul> <li> <strong>Transaction start:</strong> 1-2ms (memory allocation)</li> <li> <strong>Rollback:</strong> 1-2ms (memory discard)</li> <li> <strong>No disk I/O:</strong> Pure memory operation</li> </ul> <h2> Implementation Blueprint </h2> <p>Override <strong>Minitest</strong>'s <code>run_one</code> method to wrap test execution in database transactions:</p> <ol> <li> <strong>Setup OUTSIDE transaction</strong> - Database schema and reference data loads once per test suite</li> <li> <strong>Test INSIDE transaction</strong> - Each test runs atomically and always rolls back</li> <li> <strong>Teardown AFTER rollback</strong> - Clear caches (<code>Marten::Cache</code>, converted models, email collectors, <strong>WebMock</strong>)</li> </ol> <p><strong>Why Override <code>run_one</code> Instead of Using Standard Hooks?</strong></p> <p>The core issue is how <strong>Marten</strong>'s <code>DB.transaction</code> works<sup id="fnref1">1</sup>:</p> <ul> <li> <strong>Marten</strong> uses <code>yield</code> to execute code inside a transaction block</li> <li>It sets thread-local variables to track the active transaction connection</li> <li>Individual database operations check these thread variables to reuse the transaction connection</li> <li>This pattern requires wrapping the <em>entire test execution</em> from the outside</li> </ul> <p>Standard <strong>Minitest</strong> lifecycle hooks have a fatal limitation<sup id="fnref2">2</sup>, as there is no <code>around_run</code> or wrapper mechanism.</p> <p>By overriding <code>run_one</code>, we control the <em>order</em> of operations and can properly use <code>yield</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>1. Setup (OUTSIDE transaction) ← Reference data persists 2. Begin transaction with yield 3. Test code (INSIDE transaction) ← Thread variables track connection 4. Rollback transaction 5. 
Teardown (OUTSIDE transaction) ← Clear caches, email collectors </code></pre> </div> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4djsu6cuxt86l0mw0r20.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4djsu6cuxt86l0mw0r20.png" alt=" " width="800" height="387"></a></p> <h2> References </h2> <h3> Related Articles in My CI/CD Optimization Journey </h3> <p>This article is part of my ongoing challenge to optimize tests and CI/CD pipelines for <strong>Crystal</strong> projects. If you're interested in the full optimization story, check out my previous articles:</p> <ul> <li> <a href="proxy.php?url=https://dev.to/miry/crystal-minitest-and-the-shutdown-order-problem-jcn">Crystal Minitest and the Shutdown Order Problem</a> - Understanding test execution lifecycle</li> <li> <a href="proxy.php?url=https://dev.to/miry/optimizing-crystal-build-time-in-woodpecker-ci-415s-to-196s-with-caching-1o5k">Optimizing Crystal Build Time in Woodpecker CI: 415s to 196s with Caching</a> - Build acceleration strategies</li> <li> <a href="proxy.php?url=https://dev.to/miry/speeding-up-postgresql-in-containers-1eeg">Speeding Up PostgreSQL in Containers</a> - Database performance in Containers</li> </ul> <h3> Source Code References </h3> <ol> <li id="fn1"> <p><a href="proxy.php?url=https://github.com/martenframework/marten/blob/02e37d55bbf680bafa6c7b065871a45df68ae2ee/src/marten/db/connection/base.cr#L137" rel="noopener noreferrer">Marten DB Connection - Transaction Implementation</a> ↩</p> </li> <li id="fn2"> <p><a href="proxy.php?url=https://github.com/ysbaddaden/minitest.cr/blob/6d41b570f52e1b424aa5053dae88e2a1014e5bd1/src/test.cr#L28" rel="noopener 
noreferrer">Minitest - <code>run_one</code> Implementation</a> ↩</p> </li> </ol> testing database performance crystal Crystal Minitest and the Shutdown Order Problem Michael Nikitochkin Sat, 24 Jan 2026 11:26:17 +0000 https://dev.to/miry/crystal-minitest-and-the-shutdown-order-problem-jcn <blockquote> <p><strong>TL;DR</strong>: Optimizing <em>Minitest</em> setup by moving initialization from <code>before_setup</code> to module-level broke logging: <code>Log::AsyncDispatcher</code> creates resources early, and they're cleaned up before tests run (which happens in <code>at_exit</code>). Solution: use <code>DirectDispatcher</code> for tests instead.</p> </blockquote> <h2> Story: The Optimization and Discovery </h2> <p>When setting up a <em>Marten</em><sup id="fnref1">1</sup> application to work with <em>Minitest</em><sup id="fnref2">2</sup>, I moved all initialization instructions to <code>before_setup</code> hooks:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="k">class</span> <span class="nc">Minitest</span><span class="o">::</span><span class="no">Test</span> <span class="k">def</span> <span class="nf">before_setup</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">setup</span> <span class="k">if</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">apps</span><span class="p">.</span><span class="nf">app_configs</span><span class="p">.</span><span class="nf">empty?</span> <span class="c1"># Runs before EVERY test</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Spec</span><span class="p">.</span><span class="nf">setup_databases</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>After tests were working reliably, I optimized: why run this repetitive setup for every test? 
If I moved it to module-level initialization, it would run once when the test file loads:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="nb">require</span> <span class="s2">"minitest/autorun"</span> <span class="nb">require</span> <span class="s2">"../src/project"</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">setup</span> <span class="c1"># Runs once at module load</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Spec</span><span class="p">.</span><span class="nf">setup_databases</span> </code></pre> </div> <p>Tests passed initially. But when I ran with <code>DEBUG=1</code> to enable verbose output, something broke:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">DEBUG</span><span class="o">=</span>1 crystal run <span class="nb">test</span>/users_test.cr </code></pre> </div> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>Channel::ClosedError: Channel is closed /usr/share/crystal/src/channel.cr:142:8 in 'send' /usr/share/crystal/src/log/dispatch.cr:55:7 in 'dispatch' </code></pre> </div> <p>The logging channel was closed. I reverted to <code>before_setup</code> as a workaround, but I needed to understand: Why does <em>when</em> setup runs matter more than <em>that</em> it runs?</p> <h2> Investigation: The Key Discovery </h2> <h3> Why Minitest Is Different </h3> <p><em>Minitest</em> doesn't run tests during normal program execution. 
Instead:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># minitest/src/autorun.cr:8-11</span> <span class="nb">at_exit</span> <span class="k">do</span> <span class="nb">exit</span><span class="p">(</span><span class="no">Minitest</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span><span class="no">ARGV</span><span class="p">))</span> <span class="c1"># Tests run during shutdown!</span> <span class="k">end</span> </code></pre> </div> <p>This creates a timing problem. With module-level setup:</p> <ol> <li> <code>Marten.setup</code> initializes a Log<sup id="fnref3">3</sup> instance with <code>Log::AsyncDispatcher</code><sup id="fnref4">4</sup> at boot</li> <li> <code>Log::AsyncDispatcher</code> spawns a background fiber with a channel</li> <li>Main program completes → <em>Crystal</em> runtime cleanup begins</li> <li>Channel may be closed by garbage collection (this is my assumption)</li> <li> <code>at_exit</code> fires → <em>Minitest</em> runs tests</li> <li>Tests call <code>Log.info</code> → channel already closed → <code>Channel::ClosedError</code> </li> </ol> <p>With <code>before_setup</code>, setup happens inside <code>at_exit</code>, so the <code>Log</code> instance is created during shutdown, not during normal execution.</p> <h3> How <code>Log::AsyncDispatcher</code> Works </h3> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># Simplified from: crystal/src/log/dispatch.cr</span> <span class="k">class</span> <span class="nc">Log</span><span class="o">::</span><span class="no">AsyncDispatcher</span> <span class="vi">@channel</span> <span class="o">=</span> <span class="no">Channel</span><span class="p">(</span><span class="no">Entry</span><span class="p">).</span><span class="nf">new</span> <span class="k">def</span> <span class="nf">initialize</span> <span class="vi">@fiber</span> <span class="o">=</span> <span 
class="n">spawn</span> <span class="k">do</span> <span class="kp">loop</span> <span class="k">do</span> <span class="n">entry</span> <span class="o">=</span> <span class="vi">@channel</span><span class="p">.</span><span class="nf">receive</span> <span class="n">write_entry</span><span class="p">(</span><span class="n">entry</span><span class="p">)</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> <span class="k">def</span> <span class="nf">dispatch</span><span class="p">(</span><span class="n">entry</span><span class="p">)</span> <span class="vi">@channel</span><span class="p">.</span><span class="nf">send</span><span class="p">(</span><span class="n">entry</span><span class="p">)</span> <span class="c1"># ← Assumes channel is open!</span> <span class="k">end</span> <span class="k">def</span> <span class="nf">close</span> <span class="p">:</span> <span class="no">Nil</span> <span class="c1"># TODO: this might fail if being closed from different threads</span> <span class="k">unless</span> <span class="vi">@channel</span><span class="p">.</span><span class="nf">closed?</span> <span class="vi">@channel</span><span class="p">.</span><span class="nf">close</span> <span class="vi">@done</span><span class="p">.</span><span class="nf">receive</span> <span class="k">end</span> <span class="k">end</span> <span class="k">def</span> <span class="nf">finalize</span> <span class="p">:</span> <span class="no">Nil</span> <span class="n">close</span> <span class="c1"># ← Channel gets closed here during GC/shutdown</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>During shutdown, garbage collection calls <code>finalize() -&gt; close()</code>, which closes the channel. 
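</p>
<p>Sending on a closed channel is exactly what fails later. The same shape can be reproduced outside <em>Crystal</em> with Ruby's <code>Queue</code>, which behaves like <code>Channel</code> for this purpose (a minimal sketch, not the article's code; Ruby raises <code>ClosedQueueError</code> where Crystal raises <code>Channel::ClosedError</code>):</p>

```ruby
# Analog of Log::AsyncDispatcher's lifecycle, sketched with Ruby's Queue
# (illustrative only: Ruby raises ClosedQueueError where Crystal's
# Channel raises Channel::ClosedError).
channel = Queue.new

# Background "dispatcher", like the fiber spawned in #initialize.
consumer = Thread.new do
  while (entry = channel.pop) # pop returns nil once the queue is closed and drained
    # write_entry(entry) would go here
  end
end

channel.close # what finalize -> close does during GC/shutdown
consumer.join

begin
  channel << "Log.info entry" # dispatching after cleanup, like the at_exit tests
rescue ClosedQueueError => e
  puts "dispatch failed: #{e.class}"
end
# prints: dispatch failed: ClosedQueueError
```

<p>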
But <code>at_exit</code> hooks run <em>after</em> some cleanup, creating the race condition.</p> <h2> Solution: The Fix </h2> <p>The issue isn't optimization itself—it's that <code>Log::AsyncDispatcher</code> isn't suitable for code that runs before <code>at_exit</code>. The solution is to use a dispatcher without background resources for tests:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># config/settings/test.cr</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">configure</span> <span class="ss">:test</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="n">config</span><span class="p">.</span><span class="nf">log_backend</span> <span class="o">=</span> <span class="no">Log</span><span class="o">::</span><span class="no">IOBackend</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span> <span class="ss">dispatcher: </span><span class="no">Log</span><span class="o">::</span><span class="no">DispatchMode</span><span class="o">::</span><span class="no">Direct</span> <span class="c1"># No background fiber</span> <span class="p">)</span> <span class="k">if</span> <span class="no">ENV</span><span class="p">.</span><span class="nf">has_key?</span><span class="p">(</span><span class="s2">"DEBUG"</span><span class="p">)</span> <span class="n">config</span><span class="p">.</span><span class="nf">debug</span> <span class="o">=</span> <span class="kp">true</span> <span class="n">config</span><span class="p">.</span><span class="nf">log_level</span> <span class="o">=</span> <span class="no">Log</span><span class="o">::</span><span class="no">Severity</span><span class="o">::</span><span class="no">Trace</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <h3> Dispatcher Options </h3> <div class="table-wrapper-paragraph"><table> <thead> <tr> <th>Dispatcher</th> 
<th>Best For</th> <th>Pros</th> <th>Cons</th> </tr> </thead> <tbody> <tr> <td><code>AsyncDispatcher</code></td> <td>Production</td> <td>Non-blocking, efficient</td> <td>Unreliable at shutdown</td> </tr> <tr> <td><code>SyncDispatcher</code></td> <td>Threaded tests</td> <td>Thread-safe, reliable</td> <td>Slight mutex overhead</td> </tr> <tr> <td><code>DirectDispatcher</code></td> <td>Single-threaded tests</td> <td>Zero overhead, simple</td> <td>Not thread-safe</td> </tr> </tbody> </table></div> <p>For tests with module-level init, <code>Log::DirectDispatcher</code> is ideal.</p> <h2> Conclusion </h2> <p>The optimization itself was sound—module-level initialization <em>is</em> faster. The issue was incompatibility with <code>Log::AsyncDispatcher</code> in a shutdown context.</p> <p>More broadly, any code that relies on objects with finalize methods could be affected by garbage collection events and <code>at_exit</code> timing—background resources created during normal execution may be cleaned up before tests run.</p> <p>The fix is simple (one configuration change) but reveals a broader principle:</p> <p><strong>Test infrastructure must account for shutdown order and resource cleanup timing, regardless of framework.</strong> Some libraries use <code>at_exit</code> handlers for cleanup in multi-threaded applications, and when tests run after these handlers, any finalized objects (channels, connections, files, caches) become inaccessible.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuha6n3nncsynyrw4sea4.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuha6n3nncsynyrw4sea4.png" alt=" " width="800" height="387"></a></p> <h2> References 
</h2> <ol> <li id="fn1"> <p><a href="https://martenframework.com/" rel="noopener noreferrer">Marten Framework</a> ↩</p> </li> <li id="fn2"> <p><a href="https://github.com/ysbaddaden/minitest.cr" rel="noopener noreferrer">Minitest Crystal Port</a> ↩</p> </li> <li id="fn3"> <p><a href="https://crystal-lang.org/api/Log.html" rel="noopener noreferrer"><em>Crystal</em> Log Documentation</a> ↩</p> </li> <li id="fn4"> <p><a href="https://crystal-lang.org/api/1.19.1/Log/AsyncDispatcher.html" rel="noopener noreferrer">class Log::AsyncDispatcher</a> ↩</p> </li> </ol> crystal testing minitest Optimizing Crystal Build Time in Woodpecker CI: 415s to 196s with Caching Michael Nikitochkin Wed, 21 Jan 2026 06:10:11 +0000 https://dev.to/miry/optimizing-crystal-build-time-in-woodpecker-ci-415s-to-196s-with-caching-1o5k https://dev.to/miry/optimizing-crystal-build-time-in-woodpecker-ci-415s-to-196s-with-caching-1o5k <h2> The Problem </h2> <p><strong>Crystal</strong> test builds in <strong>Woodpecker CI</strong> were taking <strong>415 seconds</strong> to complete.<br> Every pipeline run would recompile dependencies from scratch, even though most changes affected application code, not third-party libraries.</p> <p><strong>Crystal</strong>'s compiler is thorough and safe, but recompiling everything on each <strong>CI</strong> run was costly - especially when running tests multiple times per day.</p> <h2> The Solution </h2> <p>Persistent <strong>Crystal</strong> cache storage was implemented using named volumes combined with a custom cache directory:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">test</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">crystallang/crystal:${CRYSTAL_VERSION}</span> <span class="na">environment</span><span class="pi">:</span> <span class="na">CRYSTAL_CACHE_DIR</span><span class="pi">:</span> 
<span class="s">/cache/crystal</span> <span class="na">volumes</span><span class="pi">:</span> <span class="c1"># Persistent cache for compiled Crystal modules</span> <span class="pi">-</span> <span class="s">crystal-cache-${CRYSTAL_VERSION}:/cache/crystal</span> <span class="na">commands</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">rake test:build</span> <span class="pi">-</span> <span class="s">rake test:run</span> </code></pre> </div> <p><strong>Note:</strong> Testing also included adding <code>tmpfs: ["/tmp:size=2g"]</code> but it provided no measurable improvement. The persistent cache is where the real optimization happens.</p> <h2> How It Works </h2> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mt1flixhw7blgg4rvks.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mt1flixhw7blgg4rvks.png" alt=" " width="800" height="533"></a></p> <h3> 1. Custom Cache Directory </h3> <p>By setting <code>CRYSTAL_CACHE_DIR: /cache/crystal</code>, the <strong>Crystal</strong> compiler stores compiled artifacts in a predictable location instead of the default temporary directory.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">environment</span><span class="pi">:</span> <span class="na">CRYSTAL_CACHE_DIR</span><span class="pi">:</span> <span class="s">/cache/crystal</span> </code></pre> </div> <p>This provides control over where <strong>Crystal</strong> stores:</p> <ul> <li>Compiled standard library modules</li> <li>Compiled dependency (shard) modules</li> <li>Precompiled object files</li> <li> <strong>LLVM</strong> intermediate representations</li> </ul> <h3> 2. 
Named Volumes for Persistence </h3> <p><strong>Woodpecker CI</strong> uses container volumes to persist data between pipeline runs. A named volume is mounted at the cache directory:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">volumes</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">crystal-cache-${CRYSTAL_VERSION}:/cache/crystal</span> </code></pre> </div> <p><strong>Key insight:</strong> Using <code>${CRYSTAL_VERSION}</code> in the volume name maintains separate caches for different <strong>Crystal</strong> versions (e.g., <code>crystal-cache-1.16.3</code> and <code>crystal-cache-nightly</code>). This prevents cache conflicts when testing against multiple <strong>Crystal</strong> versions in matrix builds.</p> <h3> 3. What About tmpfs for /tmp? </h3> <p>Initial assumptions suggested that adding <code>tmpfs</code> for <code>/tmp</code> would help, since <strong>Crystal</strong> might write temporary files there during compilation. Testing was performed:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">tmpfs</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">/tmp:size=2g</span> </code></pre> </div> <p><strong>Result:</strong> No measurable improvement (~196s with or without <code>tmpfs</code>).</p> <p><strong>Why?</strong> <strong>Crystal</strong>'s compilation model doesn't write much to temporary directories:</p> <ul> <li>Most I/O goes to <code>CRYSTAL_CACHE_DIR</code> (which is already optimized with persistent volumes)</li> <li> <strong>Crystal</strong> keeps most compilation state in memory</li> <li>The compiler creates minimal intermediate temp files</li> <li>Container overlay filesystem is "good enough" for the small amount of <code>/tmp</code> usage</li> </ul> <h3> 4. 
Container Overlay Storage Optimization </h3> <p><strong>Woodpecker</strong> agents using container overlay storage drivers (such as the overlay driver used by <strong>Podman</strong>) benefit from efficient layer caching.<br> The named volume persists between runs, and only changed files need to be written - this combines with the container runtime's copy-on-write mechanism for optimal performance.</p> <p><strong>Important note:</strong> Named volumes are local to each agent. If multiple <strong>Woodpecker</strong> agents are running, each maintains its own separate cache. This means:</p> <ul> <li>First run on a new agent will be slow (cold cache)</li> <li>Subsequent runs on the same agent will be fast (warm cache)</li> <li>Pipelines may experience variable build times depending on which agent executes them</li> </ul> <p>For shared caching across multiple agents, consider external cache storage solutions (<strong>S3</strong>, network volumes, or distributed cache systems).</p> <h2> The Impact </h2> <p><strong>Before (no optimizations):</strong> 415 seconds per test build<br><br> <strong>After (persistent cache only):</strong> 196 seconds per test build<br><br> <strong>After (cache + <code>tmpfs</code> for <code>/tmp</code>):</strong> ~196 seconds (no significant change)<br><br> <strong>Improvement:</strong> 2.1x faster (53% time reduction) ⚡</p> <h3> Why Persistent Cache Helps, But tmpfs Doesn't </h3> <p><strong>Persistent cache (<code>/cache/crystal</code>) - BIG WIN:</strong></p> <ul> <li>✅ Avoids recompiling unchanged dependencies between builds</li> <li>✅ Speeds up subsequent builds dramatically (415s → 196s)</li> <li>✅ Survives across pipeline runs</li> </ul> <p><strong><code>tmpfs</code> for <code>/tmp</code> - Minimal Impact:</strong></p> <ul> <li>⚠️ <strong>Crystal</strong> doesn't write much to <code>/tmp</code> during compilation</li> <li>⚠️ Most I/O goes to <code>CRYSTAL_CACHE_DIR</code> (already on persistent volume)</li> <li>⚠️ Modern container overlay filesystem is "good 
enough" for the small amount of temp files</li> </ul> <h3> Why This Makes Sense </h3> <p><strong>Crystal</strong>'s compiler architecture is smart:</p> <ol> <li> <strong>Compiled artifacts go to cache directory</strong> - this is where the bulk of I/O happens</li> <li> <strong>Temporary files are minimal</strong> - <strong>Crystal</strong> doesn't create many intermediate temp files</li> <li> <strong>Most work is in-memory</strong> - the compiler keeps most data structures in <strong>RAM</strong> during compilation</li> <li> <strong>Cache hits dominate</strong> - on warm cache, there's very little new compilation happening</li> </ol> <p>The persistent cache eliminated the <strong>expensive recompilation</strong> (hundreds of seconds). The remaining time is mostly:</p> <ul> <li>Linking compiled objects</li> <li>Running tests (database operations)</li> <li>Test framework overhead</li> </ul> <p>Adding <code>tmpfs</code> for <code>/tmp</code> doesn't help because there's simply not much disk I/O happening there.</p> <h3> What Gets Cached? 
</h3> <p>On the first run, <strong>Crystal</strong> compiles everything:</p> <ul> <li>Standard library (~100MB of compiled code)</li> <li>Third-party shards (dependencies)</li> <li>Application code</li> </ul> <p>On subsequent runs, <strong>Crystal</strong> reuses:</p> <ul> <li>✅ Unchanged standard library modules</li> <li>✅ Unchanged dependency code</li> <li>✅ Unchanged application code</li> <li>❌ Only recompiles what changed</li> </ul> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm6xxu5q4lgf7nzj4w80.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm6xxu5q4lgf7nzj4w80.png" alt=" " width="800" height="533"></a></p> <h2> Matrix Builds: One Cache Per Version </h2> <p><strong>CI</strong> testing runs against multiple <strong>Crystal</strong> versions:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">matrix</span><span class="pi">:</span> <span class="na">CRYSTAL_VERSION</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">1.16.3</span> <span class="pi">-</span> <span class="s">1.19.1</span> <span class="pi">-</span> <span class="s">nightly</span> </code></pre> </div> <p>Each version gets its own cache:</p> <ul> <li> <code>crystal-cache-1.16.3</code> - old version cache</li> <li> <code>crystal-cache-1.19.1</code> - stable version cache</li> <li> <code>crystal-cache-nightly</code> - nightly version cache</li> </ul> <p>This prevents cache corruption when compiler internals change between versions.</p> <h2> Implementation Details </h2> <h3> Complete Woodpecker Configuration </h3> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span 
class="na">steps</span><span class="pi">:</span> <span class="na">test</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">crystallang/crystal:${CRYSTAL_VERSION}</span> <span class="na">environment</span><span class="pi">:</span> <span class="na">DATABASE_URL</span><span class="pi">:</span> <span class="s">postgres://postgres:dbpgpassword@postgres:5432/api_test</span> <span class="na">CRYSTAL_CACHE_DIR</span><span class="pi">:</span> <span class="s">/cache/crystal</span> <span class="na">volumes</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">crystal-cache-${CRYSTAL_VERSION}:/cache/crystal</span> <span class="na">commands</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">crystal env</span> <span class="pi">-</span> <span class="s">rake test</span> </code></pre> </div> <h3> Repository Trust Requirement </h3> <p><strong>Important:</strong> Using <code>volumes:</code> and <code>tmpfs:</code> in <strong>Woodpecker CI</strong> requires the repository to be marked as <strong>"Trusted"</strong>.</p> <p><strong>Why trust is required:</strong></p> <ul> <li> <code>volumes:</code> - Allows mounting host volumes into containers, providing access to persistent storage</li> <li> <code>tmpfs:</code> - Allows mounting in-memory filesystems, requiring elevated container privileges</li> </ul> <p>Both features give containers more access to the host system. 
<strong>Woodpecker</strong> restricts these features to prevent potentially malicious pipeline configurations from compromising the CI infrastructure.</p> <p><strong>How to enable trust:</strong></p> <ol> <li>Navigate to repository settings in <strong>Woodpecker UI</strong> </li> <li>Enable the <strong>"Trusted"</strong> checkbox</li> <li>Only repository administrators can modify this setting</li> </ol> <p><strong>Security consideration:</strong> Only enable trust for repositories with controlled access and reviewed pipeline configurations. Trusted pipelines can potentially access sensitive data on the CI host system.</p> <h2> Key Takeaways </h2> <ol> <li> <strong>Named volumes survive pipeline runs</strong> - <strong>Woodpecker</strong>'s volume mounting is key for persistence</li> <li> <strong>Caches are per-agent</strong> - each <strong>Woodpecker</strong> agent maintains its own cache, not shared across agents</li> <li> <strong>Repository must be marked as "Trusted"</strong> - required for using <code>volumes:</code> and <code>tmpfs:</code> features</li> <li> <strong>2x speedup is achievable</strong> - especially for projects with many dependencies</li> <li> <strong>Focus on what matters</strong> - persistent cache is the killer feature for <strong>Crystal CI</strong> </li> </ol> <h2> Potential Issues and Solutions </h2> <h3> Problem: Cache grows too large </h3> <p><strong>Solution:</strong> Periodically clean old caches or set retention policies in <strong>Woodpecker</strong> agent configuration.</p> <h3> Problem: Cache corruption after <strong>Crystal</strong> upgrade </h3> <p><strong>Solution:</strong> The <code>${CRYSTAL_VERSION}</code> suffix naturally creates new caches for new versions.</p> <h3> Problem: Multiple agents don't share caches </h3> <p><strong>Solution:</strong> This is by design - named volumes are local to each agent. 
Each agent maintains its own cache, which means:</p> <ul> <li> <strong>First build on each agent</strong> will take the full 415 seconds (cold cache)</li> <li> <strong>Subsequent builds on the same agent</strong> will take ~196 seconds (warm cache)</li> <li> <strong>Build times vary</strong> depending on which agent picks up the job</li> </ul> <p>For consistent performance across all agents, consider:</p> <ul> <li>Use agent labels to pin jobs to specific agents</li> <li>Implement external cache storage (<strong>S3</strong>, <strong>NFS</strong>, network volumes)</li> <li>Accept variable build times as a trade-off for distributed load</li> </ul> <h3> Problem: Volumes not mounting or permission errors </h3> <p><strong>Solution:</strong> Verify the repository has <strong>"Trusted"</strong> status enabled in <strong>Woodpecker</strong> settings. Without trust, <code>volumes:</code> and <code>tmpfs:</code> directives are silently ignored or produce permission errors.</p> <h2> Comparison with Other CI Systems </h2> <div class="table-wrapper-paragraph"><table> <thead> <tr> <th> <strong>CI</strong> System</th> <th>Cache Strategy</th> </tr> </thead> <tbody> <tr> <td><strong>GitHub Actions</strong></td> <td> <code>actions/cache</code> with path <code>/home/runner/.cache/crystal</code> </td> </tr> <tr> <td><strong>GitLab CI</strong></td> <td> <code>cache:</code> directive with <code>key: ${CI_COMMIT_REF_SLUG}</code> </td> </tr> <tr> <td><strong>Woodpecker</strong></td> <td>Named volumes with version-specific keys</td> </tr> </tbody> </table></div> <p><strong>Woodpecker</strong>'s approach is simpler - no cache upload/download steps, just persistent volumes.</p> crystal ci woodpecker performance Speeding Up PostgreSQL in Containers Michael Nikitochkin Mon, 19 Jan 2026 23:45:50 +0000 https://dev.to/miry/speeding-up-postgresql-in-containers-1eeg https://dev.to/miry/speeding-up-postgresql-in-containers-1eeg <h2> The Problem </h2> <p>Running a test suite on an older 
<strong>CI</strong> machine with slow disks revealed <strong>PostgreSQL</strong> as a major bottleneck. Each test run was taking over <strong>1 hour</strong> to complete. The culprit? Tests performing numerous database operations, with <code>TRUNCATE</code> commands cleaning up data after each test.</p> <p>With slow disk I/O, <strong>PostgreSQL</strong> was spending most of its time syncing data to disk - operations that were completely unnecessary in an ephemeral <strong>CI</strong> environment where data persistence doesn't matter.</p> <h3> Catching PostgreSQL in the Act </h3> <p>Running <code>top</code> during test execution revealed the smoking gun:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>242503 postgres 20 0 184592 49420 39944 R 81.7 0.3 0:15.66 postgres: postgres api_test 10.89.5.6(43216) TRUNCATE TABLE </code></pre> </div> <p><strong>PostgreSQL</strong> was consuming <strong>81.7% CPU</strong> just to truncate a table! This single <code>TRUNCATE</code> operation ran for over <strong>15 seconds</strong>. 
On a machine with slow disks, <strong>PostgreSQL</strong> was spending enormous amounts of time on fsync operations, waiting for the kernel to confirm data was written to physical storage - even though we were just emptying tables between tests.</p> <h2> The Solution </h2> <p>Three simple <strong>PostgreSQL</strong> configuration tweaks made a dramatic difference:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">postgres</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">postgres:16.11-alpine</span> <span class="na">environment</span><span class="pi">:</span> <span class="na">POSTGRES_INITDB_ARGS</span><span class="pi">:</span> <span class="s2">"</span><span class="s">--nosync"</span> <span class="na">POSTGRES_SHARED_BUFFERS</span><span class="pi">:</span> <span class="s">256MB</span> <span class="na">tmpfs</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">/var/lib/postgresql/data:size=1g</span> </code></pre> </div> <h3> 1. <code>--nosync</code> Flag </h3> <p>The <code>--nosync</code> flag tells <strong>PostgreSQL</strong> to skip <code>fsync()</code> calls during database initialization. In a <strong>CI</strong> environment, we don't care about data durability - if the container crashes, we'll just start over. This eliminates expensive disk sync operations that were slowing down database setup.</p> <h3> 2. Increased Shared Buffers </h3> <p>Setting <code>POSTGRES_SHARED_BUFFERS: 256MB</code> (up from the default ~128MB) gives <strong>PostgreSQL</strong> more memory to cache frequently accessed data. This is especially helpful when running tests that repeatedly access the same tables.</p> <h3> 3. 
tmpfs for Data Directory (The Game Changer) </h3> <p>The biggest performance win came from mounting <strong>PostgreSQL</strong>'s data directory on <code>tmpfs</code> - an in-memory filesystem.<br> This completely eliminates disk I/O for database operations:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">tmpfs</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">/var/lib/postgresql/data:size=1g</span> </code></pre> </div> <p>With <code>tmpfs</code>, all database operations happen in <strong>RAM</strong>. This is especially impactful for:</p> <ul> <li> <strong>TRUNCATE operations</strong> - instant cleanup between tests</li> <li> <strong>Index updates</strong> - no disk seeks required</li> <li> <strong>WAL (Write-Ahead Log) writes</strong> - purely memory operations</li> <li> <strong>Checkpoint operations</strong> - no waiting for disk flushes</li> </ul> <p>The 1GB size limit is generous for most test databases. Adjust based on your test data volume.</p> <h2> The Impact </h2> <p><strong>Before:</strong> ~60 minutes per test run<br><br> <strong>After:</strong> ~10 minutes per test run<br><br> <strong>Improvement:</strong> 6x faster! 🚀</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yr2zjm6grgkxl9s0m5c.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yr2zjm6grgkxl9s0m5c.png" alt=" " width="800" height="533"></a></p> <h3> Real Test Performance Examples </h3> <p>You should have seen my surprise when I first saw a single test taking 30 seconds in containers.<br> I knew something was terribly wrong. 
But when I applied the in-memory optimization and<br> saw the numbers drop to what you'd expect on a normal machine - I literally got tears in my eyes.</p> <p><strong>Before <code>tmpfs</code> optimization:</strong><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>API::FilamentSupplierAssortmentsTest#test_create_validation_negative_price = 25.536s API::FilamentSupplierAssortmentsTest#test_list_with_a_single_assortment = 29.996s API::FilamentSupplierAssortmentsTest#test_list_missing_token = 25.952s </code></pre> </div> <p>Each test was taking <strong>25-30 seconds</strong> even though the actual test logic was minimal!<br> Most of this time was spent waiting for <strong>PostgreSQL</strong> to sync data to disk.</p> <p><strong>After <code>tmpfs</code> optimization:</strong><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>API::FilamentSupplierAssortmentsTest#test_list_as_uber_curator = 0.474s API::FilamentSupplierAssortmentsTest#test_list_as_assistant = 0.466s API::FilamentSupplierAssortmentsTest#test_for_pressman_without_filament_supplier = 0.420s </code></pre> </div> <p>These same tests now complete in <strong>0.4-0.5 seconds</strong> - a <strong>50-60x improvement per test</strong>! 
🎉</p> <h3> Where the Time Was Going </h3> <p>The biggest gains came from reducing disk I/O during:</p> <ul> <li> <strong>TRUNCATE operations between tests</strong> - <strong>PostgreSQL</strong> was syncing empty table states to disk</li> <li> <strong>Database initialization</strong> at the start of each <strong>CI</strong> run</li> <li> <strong>INSERT operations during test setup</strong> - creating test fixtures (users, roles, ...)</li> <li> <strong>Transaction commits</strong> - each test runs in a transaction that gets rolled back</li> <li> <strong>Frequent small writes</strong> during test execution</li> </ul> <p>With slow disks, even simple operations like creating a test user or truncating a table would take seconds instead of milliseconds. The <code>top</code> output above shows a single <code>TRUNCATE TABLE</code> operation taking 15+ seconds and consuming 81.7% <strong>CPU</strong> - most of that was <strong>PostgreSQL</strong> waiting for disk I/O. Multiply that across hundreds of tests, and you get hour-long <strong>CI</strong> runs.</p> <h3> The Math </h3> <ul> <li> <strong>24 tests</strong> in this file alone</li> <li> <strong>Before:</strong> ~27 seconds average per test = <strong>~648 seconds (10.8 minutes)</strong> for one test file</li> <li> <strong>After:</strong> ~0.45 seconds average per test = <strong>~11 seconds</strong> for the same file</li> <li> <strong>Per-file speedup:</strong> 59x faster!</li> </ul> <p>With dozens of test files, the cumulative time savings are massive.</p> <h2> Why This Works for CI </h2> <p>In production, you absolutely want <code>fsync()</code> enabled and conservative settings to ensure data durability. 
But in <strong>CI</strong>:</p> <ul> <li> <strong>Data is ephemeral</strong> - containers are destroyed after each run</li> <li> <strong>Speed matters more than durability</strong> - faster feedback loops improve developer productivity</li> <li> <strong>Disk I/O is often the bottleneck</strong> - especially on older/slower <strong>CI</strong> machines</li> </ul> <p>By telling <strong>PostgreSQL</strong> "don't worry about crashes, we don't need this data forever," we eliminated unnecessary overhead.</p> <h2> Key Takeaways </h2> <ol> <li> <strong>Profile your CI pipeline</strong> - we discovered disk I/O was the bottleneck, not <strong>CPU</strong> or memory</li> <li> <strong>CI databases don't need production settings</strong> - optimize for speed, not durability</li> <li> <code>tmpfs</code> <strong>is the ultimate disk I/O eliminator</strong> - everything in <strong>RAM</strong> means zero disk bottleneck</li> <li> <strong>Small configuration changes can have big impacts</strong> - three settings saved us 50 minutes per run</li> <li> <strong>Consider your hardware</strong> - these optimizations were especially important on older machines with slow disks</li> <li> <strong>Watch your memory usage</strong> - <code>tmpfs</code> consumes <strong>RAM</strong>; ensure your <strong>CI</strong> runners have enough (1GB+ for the database)</li> </ol> <h2> Implementation in Woodpecker CI </h2> <p>Here's our complete <strong>PostgreSQL</strong> service configuration:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">postgres</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">postgres:16.11-alpine</span> <span class="na">environment</span><span class="pi">:</span> <span class="na">POSTGRES_USER</span><span class="pi">:</span> <span class="s">postgres</span> <span class="na">POSTGRES_PASSWORD</span><span 
class="pi">:</span> <span class="s">dbpgpassword</span> <span class="na">POSTGRES_DB</span><span class="pi">:</span> <span class="s">api_test</span> <span class="na">POSTGRES_INITDB_ARGS</span><span class="pi">:</span> <span class="s2">"</span><span class="s">--nosync"</span> <span class="na">POSTGRES_SHARED_BUFFERS</span><span class="pi">:</span> <span class="s">256MB</span> <span class="na">ports</span><span class="pi">:</span> <span class="pi">-</span> <span class="m">5432</span> <span class="na">tmpfs</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">/var/lib/postgresql/data:size=1g</span> </code></pre> </div> <p><strong>Note:</strong> The <code>tmpfs</code> field is officially supported in Woodpecker CI's backend (defined in <a href="proxy.php?url=https://github.com/woodpecker-ci/woodpecker/blob/d1b7e35ca857f183e76171f2ab72841fbed3daf9/pipeline/backend/types/step.go#L35" rel="noopener noreferrer"><code>pipeline/backend/types/step.go</code></a>). If you see schema validation warnings, they may be from outdated documentation - the feature works perfectly.</p> <p><strong>Lucky us!</strong> Not all CI platforms support <code>tmpfs</code> configuration this easily. Woodpecker CI makes it trivial with native Docker support - just add a <code>tmpfs:</code> field and you're done. If you're on GitHub Actions, GitLab CI, or other platforms, you might need workarounds like <code>docker run</code> with <code>--tmpfs</code> flags or custom runner configurations.</p> <p>Simple, effective, and no code changes required - just smarter configuration for the CI environment.</p> <h2> Why Not Just Tune PostgreSQL Settings Instead of tmpfs? </h2> <p><strong>TL;DR: I tried. <code>tmpfs</code> is still faster AND simpler.</strong></p> <p>After seeing the dramatic improvements with <code>tmpfs</code>, I wondered: "Could we achieve similar performance by aggressively tuning <strong>PostgreSQL</strong> settings instead?" 
This would be useful for environments where <code>tmpfs</code> isn't available or <strong>RAM</strong> is limited.</p> <h3> Tested Aggressive Disk-Based Tuning </h3> <p>Experimenting with disabling all durability features:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">postgres</span><span class="pi">:</span> <span class="na">command</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">postgres</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">fsync=off</span> <span class="c1"># Skip forced disk syncs</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">synchronous_commit=off</span> <span class="c1"># Async WAL writes</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">wal_level=minimal</span> <span class="c1"># Minimal WAL overhead</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">full_page_writes=off</span> <span class="c1"># Less WAL volume</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">autovacuum=off</span> <span class="c1"># No background vacuum</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">max_wal_size=1GB</span> <span class="c1"># Fewer checkpoints</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">shared_buffers=256MB</span> <span class="c1"># More memory cache</span> </code></pre> </div> <h3> The Results: tmpfs Still Wins </h3> <p>Even with all these aggressive settings, <code>tmpfs</code> was <strong>still faster</strong>.</p> <p><strong>Disk-based (even with <code>fsync=off</code>):</strong></p> <ul> <li>❌ File system overhead - ext4/xfs metadata operations</li> <li>❌ 
Disk seeks - mechanical latency on <strong>HDDs</strong>, limited <strong>IOPS</strong> on <strong>SSDs</strong> </li> <li>❌ Kernel buffer cache - memory copies between user/kernel space</li> <li>❌ <strong>Docker</strong> overlay2 - additional storage driver overhead</li> <li>❌ Complexity - 7+ settings to manage and understand</li> </ul> <p><strong><code>tmpfs</code>-based:</strong></p> <ul> <li>✅ Pure <strong>RAM</strong> operations - no physical storage involved</li> <li>✅ Zero disk I/O - everything happens in memory</li> <li>✅ Simple configuration - just one <code>tmpfs</code> line</li> <li>✅ Maximum performance - nothing faster than <strong>RAM</strong> </li> </ul> <h2> Bonus: Other PostgreSQL CI Optimizations to Consider </h2> <p>If you're still looking for more speed improvements:</p> <ul> <li> <strong>Disable query logging</strong> - reduces I/O overhead: </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code> <span class="na">command</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">postgres</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">log_statement=none</span> <span class="c1"># Don't log any statements</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">log_min_duration_statement=-1</span> <span class="c1"># Don't log slow queries</span> </code></pre> </div> <ul> <li> <strong>Use <code>fsync=off</code> in postgresql.conf</strong> - similar to <code>--nosync</code> but for runtime (redundant with <code>tmpfs</code>)</li> <li> <strong>Increase <code>work_mem</code></strong> - helps with complex queries in tests</li> </ul> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1tp8ey4rsyhcgtmj007.png" class="article-body-image-wrapper"><img 
src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1tp8ey4rsyhcgtmj007.png" alt=" " width="800" height="387"></a></p> postgres woodpecker ci performance Debugging a Double Free in Crystal with libxml2, GDB, and Valgrind Michael Nikitochkin Sun, 07 Dec 2025 19:01:29 +0000 https://dev.to/miry/debugging-a-double-free-in-crystal-with-libxml2-gdb-and-valgrind-17h7 https://dev.to/miry/debugging-a-double-free-in-crystal-with-libxml2-gdb-and-valgrind-17h7 <p>This is a personal note about how I tracked down and fixed a double-free bug caused by Crystal’s garbage collector interacting with <code>libxml2</code>. I used <code>gdb</code> and <code>valgrind</code> to trace the issue, understand where memory was allocated and freed, and eventually identify the root cause. I am not an advanced user of these tools, so this write-up serves as a reminder of the steps I took and what I learned along the way.</p> <h2> The Problem </h2> <p>For a few days, some of my tests had been crashing intermittently against the nightly builds of Crystal:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">$ </span>bin/drar_test <span class="nt">--seed</span> 6690 <span class="nt">--verbose</span> <span class="nt">--parallel</span> 1 free<span class="o">()</span>: double free detected <span class="k">in </span>tcache 2 </code></pre> </div> <ul> <li>The crashes were <strong>non-deterministic</strong>: they didn’t always occur locally, and sometimes didn’t even appear in CI.</li> <li>The error message itself wasn’t very helpful at first, and I wasn’t sure where the issue was coming from.</li> </ul> <h2> Step 1: Reproducing the Issue </h2> <p>I started by reproducing the failure locally:</p> <ul> <li>I used the same app configuration as in CI.</li> <li>I tested different <code>--seed</code> arguments until I 
found a seed that reliably triggered the crash.</li> </ul> <h2> Step 2: Initial Investigation with GDB </h2> <p>Since the error originated from <code>free()</code>, I wanted to see what was happening at the crash:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">$ </span>crystal <span class="nt">--version</span> Crystal 1.17.0 <span class="o">[</span>d2c705b53] <span class="o">(</span>2025-07-16<span class="o">)</span> <span class="nv">$ </span>crystal build <span class="nt">--stats</span> <span class="nt">--threads</span> 1 <span class="nt">--time</span> <span class="nt">-o</span> bin/drar_test ./test/ext/std/openssl/cipher_test.cr ... <span class="nv">$ </span>gdb <span class="nt">--args</span> bin/drar_test <span class="nt">--seed</span> 6690 <span class="nt">--verbose</span> <span class="nt">--parallel</span> 1 <span class="o">&gt;</span> run Run options: <span class="nt">--seed</span> 6690 <span class="nt">--verbose</span> <span class="nt">--parallel</span> 1... free<span class="o">()</span>: double free detected <span class="k">in </span>tcache 2 Program received signal SIGABRT, Aborted. __pthread_kill_implementation <span class="o">(</span><span class="nv">threadid</span><span class="o">=</span>&lt;optimized out&gt;, <span class="nv">signo</span><span class="o">=</span>signo@entry<span class="o">=</span>6, <span class="nv">no_tid</span><span class="o">=</span>no_tid@entry<span class="o">=</span>0<span class="o">)</span> at pthread_kill.c:44 44 <span class="k">return </span>INTERNAL_SYSCALL_ERROR_P <span class="o">(</span>ret<span class="o">)</span> ? INTERNAL_SYSCALL_ERRNO <span class="o">(</span>ret<span class="o">)</span> : 0<span class="p">;</span> </code></pre> </div> <p>At the crash, I used <code>bt</code> to print the backtrace:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>&gt; bt ... 
#6 0x00007ffff7258ad8 in tcache_double_free_verify (e=&lt;optimized out&gt;) at malloc.c:3350 #7 0x00007ffff7e7413b in xmlFreeNodeList (cur=0x1da4db00) at /usr/src/debug/libxml2-2.12.10-5.fc43.x86_64/tree.c:3662 #8 0x00007ffff7e73e68 in xmlFreeDoc (cur=0x1da4b710) at /usr/src/debug/libxml2-2.12.10-5.fc43.x86_64/tree.c:1212 #9 0x0000000003e480a0 in finalize () at /home/miry/src/crystal/crystal/src/xml/document.cr:67 #10 0x0000000000482b86 in -&gt; () at /home/miry/src/crystal/crystal/src/gc/boehm.cr:340 #11 0x00007ffff73cc517 in GC_invoke_finalizers () at extra/../finalize.c:1255 #12 0x00007ffff73cc801 in GC_notify_or_invoke_finalizers () at extra/../finalize.c:1342 #13 GC_notify_or_invoke_finalizers () at extra/../finalize.c:1282 #14 0x00007ffff73d8e77 in GC_generic_malloc_many (lb=&lt;optimized out&gt;, k=1, result=0x7ffff750b130 &lt;first_thread+496&gt;) at extra/../mallocx.c:336 #15 0x00007ffff73e67b5 in GC_malloc_kind (bytes=&lt;optimized out&gt;, kind=&lt;optimized out&gt;) at extra/../thread_local_alloc.c:187 </code></pre> </div> <ul> <li> <strong>The backtrace</strong> revealed that Crystal’s GC was finalizing an <code>XML::Document</code> and calling <code>xmlFreeDoc</code>.</li> <li>This was my first clue that the crash involved <strong>XML nodes being freed twice</strong>.</li> </ul> <h2> Step 3: Using Valgrind </h2> <p>I then ran the same binary under Valgrind:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">$ </span>valgrind <span class="nt">--track-origins</span><span class="o">=</span><span class="nb">yes</span> <span class="nt">--leak-check</span><span class="o">=</span>full bin/drar_test 2&gt; valgrind.logs </code></pre> </div> <p>In the logs, I found:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code> ==232953== Invalid free() / delete / delete[] / realloc() ==232953== at 0x1E2C6E43: free (vg_replace_malloc.c:990) ==232953== by 0x1E33B39B: xmlFreePropList 
(tree.c:2052) ==232953== by 0x1E33B39B: xmlFreePropList (tree.c:2047) ==232953== by 0x1E33B0A7: xmlFreeNodeList (tree.c:3638) ==232953== by 0x1E33AE67: xmlFreeDoc (tree.c:1212) ==232953== by 0x3EAB88F: *XML::Document#finalize:Nil (document.cr:67) ... ==232953== Address 0x1f4b6780 is 0 bytes inside a block of size 96 free'd ==232953== at 0x1E2C6E43: free (vg_replace_malloc.c:990) ==232953== by 0x1E33B39B: xmlFreePropList (tree.c:2052) ==232953== by 0x1E33B39B: xmlFreePropList (tree.c:2047) ==232953== by 0x1E33B0A7: xmlFreeNodeList (tree.c:3638) ==232953== by 0x1E33AE67: xmlFreeDoc (tree.c:1212) ==232953== by 0x3EAB88F: *XML::Document#finalize:Nil (document.cr:67) ... ==232953== Block was alloc'd at ==232953== at 0x1E2C3B26: malloc (vg_replace_malloc.c:447) ==232953== by 0x1E3374C5: xmlSAX2AttributeNs (SAX2.c:1880) ==232953== by 0x1E3393E8: xmlSAX2StartElementNs (SAX2.c:2299) ==232953== by 0x1E3289F1: xmlParseStartTag2.constprop.0 (parser.c:10091) ==232953== by 0x1E328EBB: xmlParseElementStart (parser.c:10473) ==232953== by 0x1E32AF84: xmlParseElement (parser.c:10406) ==232953== by 0x1E32B267: xmlParseDocument (parser.c:11190) ==232953== by 0x1E332F38: xmlDoRead (parser.c:14835) ==232953== by 0x3EAAE19: *XML::parse&lt;String&gt;:XML::Document (xml.cr:61) ==232953== by 0xFF0FF3: *ActionText::RichText#to_html:String (rich_text.cr:41) ==232953== by 0x41F7A90: *ActionText::RichTextTest#test_render_html_with_image_and_tags:Bool (rich_text_test.cr:70) ==232953== by 0x4B1003: ~proc223Proc(Minitest::Test, Nil)@lib/minitest/src/runnable.cr:17 (runnable.cr:17) </code></pre> </div> <ul> <li> <strong>Valgrind</strong> confirmed the double free: the same memory address was freed twice by the Garbage Collector.</li> <li>It also showed the allocation site, pointing back to XML parsing, which confirmed the findings from <strong>GDB</strong>.</li> <li>The extra information helped identify <strong>where the object was allocated</strong>.</li> </ul> <p><a 
href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnebg3becy74hewrvli4.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnebg3becy74hewrvli4.png" alt=" " width="800" height="800"></a></p> <h2> Step 4: Root Cause </h2> <p>The root cause of the problem was indeed my own hack with extra bindings.<br> In my approach I had extended Crystal's XML with bindings from libxml2:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="nd">@[Link("xml2")]</span> <span class="k">lib</span> <span class="no">LibXML</span> <span class="k">fun</span> <span class="n">xmlAddChild</span><span class="p">(</span><span class="n">parent</span> <span class="p">:</span> <span class="no">Node</span><span class="o">*</span><span class="p">,</span> <span class="n">child</span> <span class="p">:</span> <span class="no">Node</span><span class="o">*</span><span class="p">)</span> <span class="k">end</span> <span class="k">class</span> <span class="nc">XML</span><span class="o">::</span><span class="no">Node</span> <span class="k">def</span> <span class="nf">add_child</span><span class="p">(</span><span class="n">child</span> <span class="p">:</span> <span class="no">Node</span><span class="p">)</span> <span class="no">LibXML</span><span class="p">.</span><span class="nf">xmlAddChild</span><span class="p">(</span><span class="nb">self</span><span class="p">,</span> <span class="n">child</span><span class="p">)</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>The actual issue appeared in the way I was using it:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight 
crystal"><code><span class="n">html</span> <span class="o">=</span> <span class="no">XML</span><span class="p">.</span><span class="nf">parse_html</span> <span class="s2">"&lt;article&gt;some text &lt;/article&gt;"</span><span class="p">,</span> <span class="ss">options: </span><span class="no">XML</span><span class="o">::</span><span class="no">HTMLParserOptions</span><span class="o">::</span><span class="no">RECOVER</span> <span class="o">|</span> <span class="no">XML</span><span class="o">::</span><span class="no">HTMLParserOptions</span><span class="o">::</span><span class="no">NOIMPLIED</span> <span class="o">|</span> <span class="no">XML</span><span class="o">::</span><span class="no">HTMLParserOptions</span><span class="o">::</span><span class="no">NODEFDTD</span> <span class="n">html</span><span class="p">.</span><span class="nf">xpath_nodes</span><span class="p">(</span><span class="s2">"//action-text-attachment"</span><span class="p">).</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">parent</span><span class="o">|</span> <span class="o">...</span> <span class="n">image</span> <span class="o">=</span> <span class="no">XML</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="s2">"&lt;img src='</span><span class="si">#{</span><span class="n">blob</span><span class="p">.</span><span class="nf">redirect_url</span><span class="si">}</span><span class="s2">'&gt;"</span><span class="p">).</span><span class="nf">xpath_node</span><span class="p">(</span><span class="s2">"//img"</span><span class="p">).</span><span class="nf">not_nil!</span> <span class="n">parent</span><span class="p">.</span><span class="nf">add_child</span><span class="p">(</span><span class="n">image</span><span class="p">)</span> <span class="k">end</span> </code></pre> </div> <p>The problem:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span 
class="n">image</span> <span class="o">=</span> <span class="no">XML</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="s2">"&lt;img src='</span><span class="si">#{</span><span class="n">blob</span><span class="p">.</span><span class="nf">redirect_url</span><span class="si">}</span><span class="s2">'&gt;"</span><span class="p">).</span><span class="nf">xpath_node</span><span class="p">(</span><span class="s2">"//img"</span><span class="p">).</span><span class="nf">not_nil!</span> </code></pre> </div> <p>The double-free happens because <strong>libxml2 Nodes belong to a single Document</strong>:</p> <ol> <li> <code>XML.parse</code> creates a temporary <code>XML::Document</code>.</li> <li> <a href="proxy.php?url=https://crystal-lang.org/api/master/XML/Node.html#xpath_node%28path%2Cnamespaces%3Dnil%2Cvariables%3Dnil%29-instance-method" rel="noopener noreferrer"><code>xpath_node</code></a> returns a child node (<code>image</code>) that still belongs to this temporary document.</li> <li> <code>parent.add_child(image)</code> inserts the node into another document (<code>parent</code>) without detaching or copying it.</li> <li>When the temporary document is finalized by Crystal’s GC, it frees all its nodes, including <code>image</code>.</li> <li>The <code>parent</code> document still references the same <code>image</code> node. Later, when the parent document is finalized, it tries to free <code>image</code> again → <strong>double free</strong>.</li> </ol> <p>Valgrind and GDB confirmed this pattern: the same address (<code>0x1f4b6780</code>) was freed twice — first by the temporary document finalizer, second by the <code>parent</code> document finalizer.</p> <h2> Step 5: Fixing the Problem </h2> <p>The solution is to <strong>insert a copy of the node</strong> into the document, rather than the original. 
The copy is fully independent and can safely be added to another document.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="nd">@[Link("xml2")]</span> <span class="k">lib</span> <span class="no">LibXML</span> <span class="k">fun</span> <span class="n">xmlAddChild</span><span class="p">(</span><span class="n">parent</span> <span class="p">:</span> <span class="no">Node</span><span class="o">*</span><span class="p">,</span> <span class="n">child</span> <span class="p">:</span> <span class="no">Node</span><span class="o">*</span><span class="p">)</span> <span class="k">fun</span> <span class="n">xmlCopyNode</span><span class="p">(</span><span class="n">old</span> <span class="p">:</span> <span class="no">Node</span><span class="o">*</span><span class="p">,</span> <span class="n">extended</span> <span class="p">:</span> <span class="no">Int</span><span class="p">)</span> <span class="p">:</span> <span class="no">Node</span><span class="o">*</span> <span class="k">end</span> <span class="k">class</span> <span class="nc">XML</span><span class="o">::</span><span class="no">Node</span> <span class="k">def</span> <span class="nf">add_child</span><span class="p">(</span><span class="n">child</span> <span class="p">:</span> <span class="no">Node</span><span class="p">)</span> <span class="n">copied_node</span> <span class="o">=</span> <span class="no">LibXML</span><span class="p">.</span><span class="nf">xmlCopyNode</span><span class="p">(</span><span class="n">child</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="no">LibXML</span><span class="p">.</span><span class="nf">xmlAddChild</span><span class="p">(</span><span class="nb">self</span><span class="p">,</span> <span class="n">copied_node</span><span class="p">)</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <ul> <li>Now, the temporary document used to create the node can be <strong>freed 
safely</strong> without causing crashes.</li> <li>Any attributes, children, or other memory owned by the original document are copied correctly.</li> </ul> <h2> Step 6: Lessons Learned </h2> <ol> <li>Nodes belong to a single document in <code>libxml2</code>; sharing them across documents without copying or detaching is unsafe.</li> <li> <strong>Valgrind</strong> and <strong>GDB</strong> are invaluable debugging tools: <ul> <li> <strong>Valgrind</strong> detects invalid frees and memory issues.</li> <li> <strong>GDB</strong> lets you inspect the backtrace at the crash.</li> </ul> </li> <li>Valgrind can be misleading at first because it does not trigger a crash; instead, you need to read the logs carefully to identify double-free memory addresses. Once found, it shows the allocation and free sites, which greatly helps in investigating the root cause.</li> </ol> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfirxettk2avee1vad46.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfirxettk2avee1vad46.png" alt="That's all folks" width="800" height="387"></a></p> crystal crystallang programming Instrumenting a Marten App with OpenTelemetry Michael Nikitochkin Tue, 13 May 2025 18:39:59 +0000 https://dev.to/miry/instrumenting-a-marten-app-with-opentelemetry-4f21 https://dev.to/miry/instrumenting-a-marten-app-with-opentelemetry-4f21 <p>This article demonstrates how to instrument a <strong>Crystal</strong> application using <strong>OpenTelemetry</strong> with the <strong>Marten web framework</strong><sup id="fnref1">1</sup>.<br> It begins with a basic setup, covers visualizing traces in <strong>Jaeger</strong>, and introduces HTTP request tracing 
using middleware.<br> Finally, it connects two services and propagates trace context between them to build a complete distributed trace.</p> <h2> 1. Project Setup </h2> <p>Begin by creating a new Marten application named <em>DrukArmy</em> (inspired by <a href="proxy.php?url=https://drukarmy.org.ua/en" rel="noopener noreferrer">https://drukarmy.org.ua/en</a>):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>marten new project drukarmy <span class="nb">cd </span>drukarmy </code></pre> </div> <p>Add the <code>opentelemetry-sdk</code><sup id="fnref2">2</sup> shard to the <code>shard.yml</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="c1"># shard.yml</span> <span class="na">dependencies</span><span class="pi">:</span> <span class="s">...</span> <span class="s">opentelemetry-sdk</span><span class="err">:</span> <span class="na">github</span><span class="pi">:</span> <span class="s">wyhaines/opentelemetry-sdk.cr</span> </code></pre> </div> <p>Next, create an initializer in <code>config/initializers/opentelemetry.cr</code> to configure the <strong>OpenTelemetry SDK</strong>.<br> To verify that the setup is working, emit a test span:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># config/initializers/opentelemetry.cr</span> <span class="nb">require</span> <span class="s2">"opentelemetry-sdk"</span> <span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">configure</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="n">config</span><span class="p">.</span><span class="nf">service_name</span> <span class="o">=</span> <span class="s2">"drukarmy"</span> <span class="n">config</span><span class="p">.</span><span class="nf">exporter</span> <span class="o">=</span> <span class="no">OpenTelemetry</span><span class="o">::</span><span
class="no">Exporter</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="ss">variant: :stdout</span><span class="p">)</span> <span class="k">end</span> <span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">tracer</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"startup"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">root_span</span><span class="o">|</span> <span class="n">root_span</span><span class="p">.</span><span class="nf">consumer!</span> <span class="k">end</span> </code></pre> </div> <p>Run the server to confirm that spans are being emitted:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>marten serve </code></pre> </div> <p>You should see a span named <code>startup</code> printed to the terminal.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight json"><code><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"trace"</span><span class="p">,</span><span class="w"> </span><span class="nl">"traceId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2bc89670000edfb4dab7470af935d3e9"</span><span class="p">,</span><span class="w"> </span><span class="nl">"resource"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"service.name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"drukarmy"</span><span class="p">,</span><span class="w"> </span><span class="err">...</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"spans"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span 
class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"span"</span><span class="p">,</span><span class="w"> </span><span class="nl">"traceId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2bc89670000edfb4dab7470af935d3e9"</span><span class="p">,</span><span class="w"> </span><span class="nl">"spanId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0edfb4dab7000001"</span><span class="p">,</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"startup"</span><span class="p">,</span><span class="w"> </span><span class="err">...</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre> </div> <h2> 2. Viewing Traces </h2> <p>To make this more useful, let’s view spans in <strong>Jaeger</strong><sup id="fnref3">3</sup>, a lightweight UI for working with trace data.<br> Run <strong>Jaeger</strong> in a container:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker run <span class="nt">--rm</span> <span class="nt">-p</span> 16686:16686 <span class="nt">-p</span> 4318:4318 quay.io/jaegertracing/jaeger:2.6.0 </code></pre> </div> <p>Next, update the OpenTelemetry configuration to use the <code>http</code> exporter instead of <code>stdout</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># config/initializers/opentelemetry.cr</span> <span class="nb">require</span> <span class="s2">"opentelemetry-sdk"</span> <span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">configure</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="n">config</span><span class="p">.</span><span class="nf">service_name</span> 
<span class="o">=</span> <span class="s2">"drukarmy"</span> <span class="n">config</span><span class="p">.</span><span class="nf">exporter</span> <span class="o">=</span> <span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">Exporter</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="ss">variant: :http</span><span class="p">)</span> <span class="c1"># changed from :stdout</span> <span class="k">end</span> <span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">tracer</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"startup"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">root_span</span><span class="o">|</span> <span class="n">root_span</span><span class="p">.</span><span class="nf">consumer!</span> <span class="k">end</span> </code></pre> </div> <p>This configuration sends spans to the default HTTP endpoint: <a href="proxy.php?url=http://localhost:4318/v1/traces" rel="noopener noreferrer">http://localhost:4318/v1/traces</a>.</p> <p>Visit the <a href="proxy.php?url=http://localhost:16686" rel="noopener noreferrer">Jaeger UI</a> to explore the emitted traces.</p> <blockquote> <p><strong>Tip:</strong> Set the <code>DEBUG=1</code> environment variable to enable more verbose logging from the OpenTelemetry library.</p> </blockquote> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsr8kot3rk6pc0r9onlc.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsr8kot3rk6pc0r9onlc.png" alt=" " width="800" height="415"></a></p> <h2> 3. 
Instrumenting HTTP Requests </h2> <h3> 3.1 Add a Middleware </h3> <p>To trace incoming HTTP requests, you can use a <strong>Marten middleware</strong><sup id="fnref4">4</sup>,<br> which allows you to insert tracing logic around request handling.</p> <p>For reference, other frameworks also offer OpenTelemetry instrumentation libraries,<br> which may serve as inspiration<sup id="fnref5">5</sup>.</p> <p>Create a middleware for tracing. For simplicity, place it alongside other handlers:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># src/handlers/opentelemetry_middleware.cr</span> <span class="k">class</span> <span class="nc">OpenTelemetryMiddleware</span> <span class="o">&lt;</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Middleware</span> <span class="k">def</span> <span class="nf">call</span><span class="p">(</span><span class="n">request</span> <span class="p">:</span> <span class="no">Marten</span><span class="o">::</span><span class="no">HTTP</span><span class="o">::</span><span class="no">Request</span><span class="p">,</span> <span class="n">get_response</span> <span class="p">:</span> <span class="no">Proc</span><span class="p">(</span><span class="no">Marten</span><span class="o">::</span><span class="no">HTTP</span><span class="o">::</span><span class="no">Response</span><span class="p">))</span> <span class="p">:</span> <span class="no">Marten</span><span class="o">::</span><span class="no">HTTP</span><span class="o">::</span><span class="no">Response</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">tracer</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"process_request"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">span</span><span class="o">|</span> <span class="n">span</span><span 
class="p">.</span><span class="nf">server!</span> <span class="c1"># Add standard HTTP attributes</span> <span class="n">span</span><span class="p">[</span><span class="s2">"http.request.method"</span><span class="p">]</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">method</span> <span class="n">span</span><span class="p">[</span><span class="s2">"url.path"</span><span class="p">]</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">path</span> <span class="n">response</span> <span class="o">=</span> <span class="n">get_response</span><span class="p">.</span><span class="nf">call</span> <span class="n">span</span><span class="p">[</span><span class="s2">"http.response.status_code"</span><span class="p">]</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="nf">status</span> <span class="n">response</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>Marten's <code>request</code> and <code>response</code> APIs are described in the official documentation<sup id="fnref6">6</sup>.<br> The attributes above follow OpenTelemetry's semantic conventions for HTTP spans<sup id="fnref7">7</sup>.</p> <p>Register the middleware in <code>config/settings/base.cr</code>, placing it at the top of the middleware stack:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="no">Marten</span><span class="p">.</span><span class="nf">configure</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="o">...</span> <span class="n">config</span><span class="p">.</span><span class="nf">middleware</span> <span class="o">=</span> <span class="p">[</span> <span class="no">OpenTelemetryMiddleware</span><span class="p">,</span> <span class="o">...</span> <span 
class="p">]</span> <span class="o">...</span> <span class="k">end</span> </code></pre> </div> <p>After adding the middleware, check the <strong>Jaeger UI</strong> to confirm that HTTP request traces are being captured.</p> <h3> 3.2 Create a Sample Handler </h3> <p>Next, define a basic handler to test HTTP request tracing:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># src/handlers/home_handler.cr</span> <span class="k">class</span> <span class="nc">HomeHandler</span> <span class="o">&lt;</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Handler</span> <span class="k">def</span> <span class="nf">get</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">tracer</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"render_the_page"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">span</span><span class="o">|</span> <span class="n">span</span><span class="p">.</span><span class="nf">set_attribute</span><span class="p">(</span><span class="s2">"custom_logic"</span><span class="p">,</span> <span class="s2">"true"</span><span class="p">)</span> <span class="n">respond</span> <span class="sx">%[{"message": "Hello!"}]</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>Update the route configuration to map the root path to the handler:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># config/routes.cr</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">routes</span><span class="p">.</span><span class="nf">draw</span> <span class="k">do</span> <span class="n">path</span> <span class="s2">"/"</span><span class="p">,</span> <span class="no">HomeHandler</span><span class="p">,</span> <span 
class="ss">name: </span><span class="s2">"home"</span> <span class="o">...</span> </code></pre> </div> <p>Now, when you visit <a href="proxy.php?url=http://localhost:8000/" rel="noopener noreferrer">http://localhost:8000/</a> (e.g. <code>curl localhost:8000</code>), a span named <code>render_the_page</code> will appear in <strong>Jaeger</strong>.<br> At this point, you should be able to explore how a single application can generate traces and visualize them using <strong>Jaeger</strong>.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8n1vawpyl0xilbtwaxq.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8n1vawpyl0xilbtwaxq.png" alt="Trace with a span per request" width="800" height="723"></a></p> <h2> 4.
Distributed Tracing </h2> <p>One of the key benefits of <strong>OpenTelemetry</strong> is the ability to correlate telemetry data across multiple services involved in handling the same request.</p> <p>To demonstrate this, we’ll set up a second application and observe how traces from both services can be linked together.</p> <p>The following diagram outlines the interaction between two applications and the expected trace behavior:</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjq6yih1vct3xx8k41yd.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjq6yih1vct3xx8k41yd.png" alt="sequence diagram" width="800" height="266"></a></p> <p>Spans across services are connected using a shared <em>TraceID</em>.<br> Additionally, the parent <em>SpanID</em> helps define the relationship and order of spans within the trace:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="go">request (SERVER, trace=t1, span=s1, service=drukarmy) | -- GET /backend - 200 (CLIENT, trace=t1, span=s2, parent=s1, service=drukarmy) | --- server (SERVER, trace=t1, span=s3, parent=s2, service=backend) </span></code></pre> </div> <p>In the next steps, we’ll build and connect two services and propagate the tracing context between them.</p> <h3> 4.1. Set Up a Second App </h3> <p>To simulate a multi-service environment, duplicate the existing application to serve as a second service:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nb">cd</span> .. 
<span class="nb">cp</span> <span class="nt">-a</span> drukarmy backend <span class="nb">cd </span>backend </code></pre> </div> <p>Update the service name and port to allow both applications to run concurrently:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># &lt;backend&gt;/config/initializers/opentelemetry.cr</span> <span class="nb">require</span> <span class="s2">"opentelemetry-sdk"</span> <span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">configure</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="n">config</span><span class="p">.</span><span class="nf">service_name</span> <span class="o">=</span> <span class="s2">"backend"</span> <span class="c1"># changed from "drukarmy"</span> <span class="n">config</span><span class="p">.</span><span class="nf">exporter</span> <span class="o">=</span> <span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">Exporter</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="ss">variant: :http</span><span class="p">)</span> <span class="k">end</span> <span class="o">...</span> </code></pre> </div> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># &lt;backend&gt;/config/settings/development.cr</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">configure</span> <span class="ss">:development</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="n">config</span><span class="p">.</span><span class="nf">debug</span> <span class="o">=</span> <span class="kp">true</span> <span class="n">config</span><span class="p">.</span><span class="nf">host</span> <span class="o">=</span> <span class="s2">"127.0.0.1"</span> <span class="n">config</span><span class="p">.</span><span 
class="nf">port</span> <span class="o">=</span> <span class="mi">8001</span> <span class="c1"># changed from 8000</span> <span class="k">end</span> </code></pre> </div> <p>Create a dedicated handler for this service:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># &lt;backend&gt;/src/handlers/backend_handler.cr</span> <span class="k">class</span> <span class="nc">BackendHandler</span> <span class="o">&lt;</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Handler</span> <span class="k">def</span> <span class="nf">get</span> <span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"backend_process"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">span</span><span class="o">|</span> <span class="n">respond</span> <span class="sx">%[{"message": "Hello from Backend"}]</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>Update the routes accordingly:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># &lt;backend&gt;/config/routes.cr</span> <span class="no">Marten</span><span class="p">.</span><span class="nf">routes</span><span class="p">.</span><span class="nf">draw</span> <span class="k">do</span> <span class="n">path</span> <span class="s2">"/backend"</span><span class="p">,</span> <span class="no">BackendHandler</span><span class="p">,</span> <span class="ss">name: </span><span class="s2">"backend"</span> <span class="o">...</span> <span class="k">end</span> </code></pre> </div> <p>Now run the second application:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>marten serve </code></pre> </div> <p>You should now have two services running:</p> <ul> <li> <code>drukarmy</code> on port 8000</li> <li> 
<code>backend</code> on port 8001</li> </ul> <p>You can test the <code>backend</code> app and then validate the spans in the <strong>Jaeger</strong> UI by running:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">$ </span>curl localhost:8001/backend <span class="o">{</span><span class="s2">"message"</span>: <span class="s2">"Hello from Backend"</span><span class="o">}</span> </code></pre> </div> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn8t3q047zfelzjhcpvhn.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn8t3q047zfelzjhcpvhn.png" alt=" " width="800" height="390"></a></p> <h3> 4.2. Chain of HTTP Calls </h3> <p>Update the <code>HomeHandler</code> in the original <code>drukarmy</code> application to make an outgoing HTTP request<br> to the backend service and propagate the tracing context:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># &lt;drukarmy&gt;/src/handlers/home_handler.cr</span> <span class="k">class</span> <span class="nc">HomeHandler</span> <span class="o">&lt;</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Handler</span> <span class="k">def</span> <span class="nf">dispatch</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">tracer</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"render_the_page"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">span</span><span class="o">|</span> <span class="n">respond</span> <span 
class="n">client_request</span> <span class="k">end</span> <span class="k">end</span> <span class="k">def</span> <span class="nf">client_request</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">tracer</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"client_request"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">span</span><span class="o">|</span> <span class="n">span</span><span class="p">.</span><span class="nf">client!</span> <span class="n">url</span> <span class="o">=</span> <span class="s2">"http://localhost:8001/backend"</span> <span class="n">span</span><span class="p">[</span><span class="s2">"url.full"</span><span class="p">]</span> <span class="o">=</span> <span class="n">url</span> <span class="n">headers</span> <span class="o">=</span> <span class="no">HTTP</span><span class="o">::</span><span class="no">Headers</span><span class="p">.</span><span class="nf">new</span> <span class="c1"># Propagate the trace context via HTTP headers</span> <span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">Propagation</span><span class="o">::</span><span class="no">TraceContext</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">span</span><span class="p">.</span><span class="nf">context</span><span class="p">).</span><span class="nf">inject</span><span class="p">(</span><span class="n">headers</span><span class="p">)</span> <span class="n">response</span> <span class="o">=</span> <span class="no">HTTP</span><span class="o">::</span><span class="no">Client</span><span class="p">.</span><span class="nf">get</span> <span class="n">url</span><span class="p">,</span> <span class="ss">headers: </span><span class="n">headers</span> <span class="n">span</span><span class="p">[</span><span class="s2">"http.response.status_code"</span><span 
class="p">]</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="nf">status_code</span> <span class="k">if</span> <span class="n">response</span><span class="p">.</span><span class="nf">status_code</span> <span class="o">!=</span> <span class="mi">200</span> <span class="n">span</span><span class="p">.</span><span class="nf">status</span><span class="p">.</span><span class="nf">error!</span><span class="p">(</span><span class="s2">"Error: </span><span class="si">#{</span><span class="n">response</span><span class="p">.</span><span class="nf">status_code</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> <span class="k">end</span> <span class="n">response</span><span class="p">.</span><span class="nf">body</span> <span class="k">end</span> <span class="k">rescue</span> <span class="n">ex</span> <span class="p">:</span> <span class="no">Socket</span><span class="o">::</span><span class="no">ConnectError</span> <span class="sx">%[{"error": "Something went wrong"}]</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>This handler does two things:</p> <ol> <li>It starts a span for the incoming request (<code>render_the_page</code>).</li> <li>It performs an outgoing HTTP request to the <code>backend</code> service within a child span (<code>client_request</code>).</li> </ol> <p>The key detail here is the use of <code>OpenTelemetry::Propagation::TraceContext</code>,<br> which injects the trace and span identifiers into the request headers.<br> This allows the <code>backend</code> service to associate its span with the same trace.</p> <p>This mechanism is based on the "W3C Trace Context specification"<sup id="fnref8">8</sup>,<br> which defines how trace context should be propagated using standard HTTP headers like <code>traceparent</code> and <code>tracestate</code>.</p> <p><a
href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc09ok44dtnu2i7sj9otp.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc09ok44dtnu2i7sj9otp.png" alt="Traces with client span" width="800" height="723"></a></p> <h3> 4.3. Receive Trace Context in Backend </h3> <p>In the previous step, we propagated the trace context from the <code>drukarmy</code> app to the <code>backend</code> app.<br> However, the <code>backend</code> service still needs to extract and respect that context in order to properly link its span to the original trace.</p> <p>To do this, update the OpenTelemetry middleware in both applications to extract the context using <code>OpenTelemetry::Propagation::TraceContext</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># &lt;drukarmy&gt;/src/handlers/opentelemetry_middleware.cr</span> <span class="c1"># &lt;backend&gt;/src/handlers/opentelemetry_middleware.cr</span> <span class="k">class</span> <span class="nc">OpenTelemetryMiddleware</span> <span class="o">&lt;</span> <span class="no">Marten</span><span class="o">::</span><span class="no">Middleware</span> <span class="k">def</span> <span class="nf">call</span><span class="p">(</span><span class="n">request</span> <span class="p">:</span> <span class="no">Marten</span><span class="o">::</span><span class="no">HTTP</span><span class="o">::</span><span class="no">Request</span><span class="p">,</span> <span class="n">get_response</span> <span class="p">:</span> <span class="no">Proc</span><span class="p">(</span><span class="no">Marten</span><span class="o">::</span><span class="no">HTTP</span><span class="o">::</span><span 
class="no">Response</span><span class="p">))</span> <span class="p">:</span> <span class="no">Marten</span><span class="o">::</span><span class="no">HTTP</span><span class="o">::</span><span class="no">Response</span> <span class="n">trace</span> <span class="o">=</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="p">.</span><span class="nf">trace</span> <span class="n">traceparent_header</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">headers</span><span class="p">[</span><span class="s2">"traceparent"</span><span class="p">]?</span> <span class="c1"># Extract and assign trace_id from headers</span> <span class="k">if</span> <span class="n">traceparent_header</span> <span class="n">traceparent</span> <span class="o">=</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">Propagation</span><span class="o">::</span><span class="no">TraceContext</span><span class="o">::</span><span class="no">TraceParent</span><span class="p">.</span><span class="nf">from_string</span><span class="p">(</span><span class="n">traceparent_header</span><span class="p">)</span> <span class="n">trace</span><span class="p">.</span><span class="nf">trace_id</span> <span class="o">=</span> <span class="n">traceparent</span><span class="p">.</span><span class="nf">trace_id</span> <span class="n">trace</span><span class="p">.</span><span class="nf">span_context</span><span class="p">.</span><span class="nf">trace_id</span> <span class="o">=</span> <span class="n">traceparent</span><span class="p">.</span><span class="nf">trace_id</span> <span class="k">end</span> <span class="n">trace</span><span class="p">.</span><span class="nf">in_span</span><span class="p">(</span><span class="s2">"process_request"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">span</span><span class="o">|</span> <span 
class="n">span</span><span class="p">.</span><span class="nf">server!</span> <span class="c1"># Reconstruct parent span and span context if traceparent header is present</span> <span class="k">if</span> <span class="n">traceparent_header</span> <span class="n">parent_span</span> <span class="o">=</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">Span</span><span class="p">.</span><span class="nf">build</span><span class="p">(</span><span class="s2">"Phantom Parent"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">pspan</span><span class="o">|</span> <span class="n">pspan</span><span class="p">.</span><span class="nf">is_recording</span> <span class="o">=</span> <span class="kp">false</span> <span class="n">pspan</span><span class="p">.</span><span class="nf">context</span> <span class="o">=</span> <span class="o">::</span><span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">Propagation</span><span class="o">::</span><span class="no">TraceContext</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">span</span><span class="p">.</span><span class="nf">context</span><span class="p">).</span><span class="nf">extract</span><span class="p">(</span><span class="n">request</span><span class="p">.</span><span class="nf">headers</span><span class="p">).</span><span class="nf">as</span><span class="p">(</span><span class="o">::</span><span class="no">OpenTelemetry</span><span class="o">::</span><span class="no">SpanContext</span><span class="p">)</span> <span class="c1"># Prevent duplicate propagation</span> <span class="n">request</span><span class="p">.</span><span class="nf">headers</span><span class="p">.</span><span class="nf">delete</span><span class="p">(</span><span class="s2">"traceparent"</span><span class="p">)</span> <span class="n">request</span><span class="p">.</span><span 
class="nf">headers</span><span class="p">.</span><span class="nf">delete</span><span class="p">(</span><span class="s2">"tracestate"</span><span class="p">)</span> <span class="k">end</span> <span class="n">span</span><span class="p">.</span><span class="nf">parent</span> <span class="o">=</span> <span class="n">parent_span</span> <span class="k">if</span> <span class="n">parent_span</span> <span class="k">end</span> <span class="c1"># Add standard HTTP attributes</span> <span class="n">span</span><span class="p">[</span><span class="s2">"http.request.method"</span><span class="p">]</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">method</span> <span class="n">span</span><span class="p">[</span><span class="s2">"url.path"</span><span class="p">]</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">path</span> <span class="n">response</span> <span class="o">=</span> <span class="n">get_response</span><span class="p">.</span><span class="nf">call</span> <span class="n">span</span><span class="p">[</span><span class="s2">"http.response.status_code"</span><span class="p">]</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="nf">status</span> <span class="n">response</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> </code></pre> </div> <p>This middleware:</p> <ul> <li>Extracts the <code>traceparent</code> header and sets the trace ID.</li> <li>Reconstructs the parent span from the incoming context.</li> <li>Starts a server span that now belongs to the same trace as the original request in <code>drukarmy</code>.</li> </ul> <p>Now, when you run:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>curl localhost:8000 </code></pre> </div> <p>You’ll see a fully connected distributed trace in <strong>Jaeger</strong> that spans both the 
<code>drukarmy</code> and <code>backend</code> services.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcq6gqwr8qc5bvxhl7sqk.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcq6gqwr8qc5bvxhl7sqk.png" alt="Trace view with 2 services" width="800" height="636"></a></p> <p><strong>Example Propagation Headers</strong></p> <p>When the client span from <code>drukarmy</code> calls the <code>backend</code>, it sends:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>traceparent: 00-05f1aec8000edfb2caa1c7444e97e4d0-0edfb2caa1000004-01 tracestate: </code></pre> </div> <p>Here’s what it means:</p> <ul> <li> <code>trace-id</code>: <code>05f1aec8000edfb2caa1c7444e97e4d0</code> → the shared trace for this request</li> <li> <code>parent-id</code>: <code>0edfb2caa1000004</code> → the span from the drukarmy client</li> <li> <code>trace-flags</code>: <code>01</code> → marks the trace as sampled</li> </ul> <p>With this setup complete, <strong>Jaeger</strong> will show a coherent trace tree with spans from both services correctly linked.</p> <h2> What’s Next? </h2> <p>Now that you have basic and distributed tracing working with OpenTelemetry in a Marten application,<br> here are a few directions to explore next:</p> <ul> <li> <strong>Automated Instrumentation</strong> Use <code>opentelemetry-instrumentation.cr</code><sup id="fnref5">5</sup> to automatically instrument HTTP server and client requests, reducing the need for manual span management.</li> <li> <strong>Semantic Conventions</strong> Enhance the value of your traces by adopting "OpenTelemetry semantic conventions"<sup id="fnref7">7</sup>. 
These help standardize span attributes such as <code>http.method</code>, <code>db.system</code>, and <code>messaging.operation</code>.</li> <li> <strong>Learn More About Tracing in Crystal</strong> For a more in-depth guide, check out my previous article: "How to begin with Traces in Crystal"<sup id="fnref9">9</sup>.</li> </ul> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimdlpzv9lns9o4zww9vw.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimdlpzv9lns9o4zww9vw.png" alt="That's all folks" width="800" height="387"></a></p> <h2> References </h2> <ol> <li id="fn1"> <p><a href="proxy.php?url=https://martenframework.com" rel="noopener noreferrer">Marten Framework</a> ↩</p> </li> <li id="fn2"> <p><a href="proxy.php?url=https://github.com/wyhaines/opentelemetry-sdk.cr" rel="noopener noreferrer">opentelemetry-sdk.cr</a> ↩</p> </li> <li id="fn3"> <p><a href="proxy.php?url=https://www.jaegertracing.io/" rel="noopener noreferrer">Jaeger</a> ↩</p> </li> <li id="fn4"> <p><a href="proxy.php?url=https://martenframework.com/docs/handlers-and-http/middlewares" rel="noopener noreferrer">Marten Middlewares</a> ↩</p> </li> <li id="fn5"> <p><a href="proxy.php?url=https://github.com/wyhaines/opentelemetry-instrumentation.cr/tree/main/src/opentelemetry/instrumentation/frameworks" rel="noopener noreferrer">opentelemetry-instrumentation.cr: frameworks</a> ↩</p> </li> <li id="fn6"> <p><a href="proxy.php?url=https://martenframework.com/docs/handlers-and-http/introduction#the-request-and-response-objects" rel="noopener noreferrer">Marten: The request and response objects</a> ↩</p> </li> <li id="fn7"> <p><a 
href="proxy.php?url=https://opentelemetry.io/docs/specs/semconv/http/http-spans/" rel="noopener noreferrer">Semantic conventions for HTTP spans</a> ↩</p> </li> <li id="fn8"> <p><a href="proxy.php?url=https://www.w3.org/TR/trace-context/" rel="noopener noreferrer">W3C: Trace Context</a> ↩</p> </li> <li id="fn9"> <p><a href="proxy.php?url=https://medium.com/p/2fd6a0255447" rel="noopener noreferrer">How to begin with Traces in Crystal by Michael</a> ↩</p> </li> </ol> crystal crystallang programming marten Speeding Up Crystal CI/CD: Fast Drafts, Optimized Builds Michael Nikitochkin Fri, 02 May 2025 20:11:37 +0000 https://dev.to/miry/speeding-up-crystal-cicd-fast-drafts-optimized-builds-47c9 https://dev.to/miry/speeding-up-crystal-cicd-fast-drafts-optimized-builds-47c9 <p>I have started working on a production web application built with Crystal and Marten.<br> With every new feature I add to the project, the compilation time keeps growing—almost exponentially.</p> <p>I found that waiting 50 minutes to build an image isn't worth it for quick experiments. 
I realized I don't need full performance for development builds.</p> <p>Here's my view on how I can address the problem:</p> <p>I'd like to introduce a "draft" image that builds in around 3 minutes and is ready for deployment—it even starts deploying to the production clusters.</p> <p>After that, it would trigger a "pristine" build with all optimizations enabled, which might take 60 minutes.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wmox25fzvg8qiuvctfr.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wmox25fzvg8qiuvctfr.png" alt="Man in front of northern lights" width="364" height="1012"></a></p> <p>If a new build is triggered in the meantime, the pristine build is cancelled and replaced by the most recent one.</p> <p>With this approach, I can still build and test quickly, while eventually delivering a highly optimized version for better performance.</p> crystal crystallang marten cicd Why Infrastructure Engineers Should Start with Backend Development Michael Nikitochkin Thu, 10 Apr 2025 06:00:51 +0000 https://dev.to/miry/why-infrastructure-engineers-should-start-with-backend-development-34cf https://dev.to/miry/why-infrastructure-engineers-should-start-with-backend-development-34cf <p>Infrastructure engineering has evolved far beyond managing servers and spinning up cloud resources. Today, it’s about crafting resilient platforms, improving developer experience, and obsessing over user needs. That’s why I believe every infrastructure or production engineer should spend at least five years building backend applications before moving into infra roles.</p> <p>Here’s why.</p> <h2> 1.
Code Quality and Developer Empathy </h2> <p>Working on backend products teaches you the fundamentals of writing clean, maintainable code. You develop a natural sensitivity to things like variable naming, code structure, and debugging workflows. More importantly, you learn how to think like the developers who will be your users in an infra role. This empathy helps you build tools that others actually want to use—not just ones that “work.”</p> <h2> 2. UX Isn't Just for Designers </h2> <p>When you’ve been on the receiving end of poorly documented, overly complex internal tooling, you start to appreciate good UX—yes, even in infra. Backend experience wires your brain to care about latency, clarity, and consistency, not just uptime and throughput. It trains you to ask: Will this make someone’s life easier?</p> <h2> 3. Avoiding Infra for Infra’s Sake </h2> <p>Without application experience, it’s easy to fall into the trap of building systems that are technically impressive but practically unusable. You end up with setups only infra teams understand—and nobody else wants to touch. A strong backend foundation keeps you grounded, reminding you that the goal isn’t to build fancy pipelines or run bleeding-edge stacks. The goal is to support real teams solving real problems.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4y6g8cl4abxm5g2bjj5.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4y6g8cl4abxm5g2bjj5.png" alt=" " width="800" height="533"></a></p> <h2> Final Thoughts </h2> <p>Yes, infrastructure is fun. There’s joy in automation, orchestration, and performance tuning. 
But without first getting your hands dirty with backend development, you risk building solutions in a vacuum. Start with the app layer, feel the pain, and then go fix it with empathy and purpose.</p> <p><em>That’s what makes a great infra engineer.</em></p> programming infrastructure On-Call Requirements Michael Nikitochkin Mon, 31 Mar 2025 22:19:59 +0000 https://dev.to/miry/on-call-requirements-4955 https://dev.to/miry/on-call-requirements-4955 <h2> Summary </h2> <p>This document outlines on-call requirements for global companies. Since employees are spread across various countries, each with its own labor laws, it's essential to align expectations before joining an on-call rotation.</p> <p>Being on-call comes with responsibilities and limitations. It affects your social life, sleep schedule, and availability. You serve as a crucial safety net for the organization.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu89uly1d6h21qlt5hqj.jpg" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu89uly1d6h21qlt5hqj.jpg" alt=" " width="800" height="430"></a></p> <h2> Hardware </h2> <h3> Phone </h3> <p>The company should provide a dedicated on-call phone. 
It doesn’t need to be high-end but must be secure and support the following apps:</p> <ul> <li> <strong>PagerDuty</strong> or <strong>Opsgenie</strong> – for incident notifications</li> <li> <strong>Mail</strong> – for alerts and updates</li> <li> <strong>Slack</strong>, <strong>Discord</strong>, or <strong>Google Meet</strong> – for team communication</li> <li> <strong>Browser</strong> – for troubleshooting</li> <li> <strong>1Password</strong> (or similar) – for credential management</li> <li> <strong>Two-Factor Authentication</strong> (2FA) apps – FreeOTP, Yubico Authenticator, Authy, etc.</li> </ul> <h3> Mobile Contract </h3> <p>The mobile plan should support incoming calls from <strong>PagerDuty</strong> and allow incident acknowledgment via mobile data. A <em>10GB</em> monthly data plan is typically sufficient. In case of an incident, the developer should be able to connect their laptop and triage the issue from wherever they are. Outgoing calls are only needed for escalation when other methods fail.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuz7emxufqn3s8xzjmcu.jpg" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuz7emxufqn3s8xzjmcu.jpg" alt=" " width="800" height="430"></a></p> <h2> Balanced On-Call </h2> <p>To avoid burnout, no more than 25% of an engineer's time should be spent on-call. 
Following this rule:</p> <ul> <li>A single-site team requires at least eight engineers for a sustainable rotation.</li> <li>A dual-site team should have at least six engineers per site.</li> <li>Each shift should include both a primary and secondary on-call engineer.</li> </ul> <h2> Quality Balance </h2> <p>Engineers need time for incident response and follow-ups, including writing postmortems. An incident is defined as a sequence of events related to the same contributing factor and should be treated as a single issue.</p> <h2> On-Call Policies &amp; Practices </h2> <p>A well-structured on-call system requires clear policies to ensure smooth operations. Engineers should not have to figure things out when an alert goes off. Instead, proactive planning should include:</p> <ul> <li><strong>Incident severity definitions</strong></li> <li><strong>Playbooks for common issues</strong></li> <li><strong>Clear escalation rules</strong></li> </ul> <p>Aligning these elements in advance helps create an effective and manageable on-call process.</p> sre incidentresponse incidentmanagement resiliency Recording My Crystal Snippets from Today’s Learning Michael Nikitochkin Sat, 15 Mar 2025 11:41:04 +0000 https://dev.to/miry/recording-my-crystal-snippets-from-todays-learning-21ej https://dev.to/miry/recording-my-crystal-snippets-from-todays-learning-21ej <p>I want to document some snippets from today’s learning while working on open-source projects. </p> <h2> 0. Printing Available Methods for an Object of a Class </h2> <p>I found it useful to debug an object’s methods in a way similar to Ruby.
Here’s a snippet that helped me with this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># Print the available methods of a Crystal class</span> <span class="c1"># Usage: `puts Spec::CLI.methods.sort`</span> <span class="k">class</span> <span class="nc">Object</span> <span class="k">macro</span> <span class="nf">methods</span> <span class="p">{{</span> <span class="vi">@type</span><span class="p">.</span><span class="nf">methods</span><span class="p">.</span><span class="nf">map</span> <span class="o">&amp;</span><span class="p">.</span><span class="nf">name</span><span class="p">.</span><span class="nf">stringify</span> <span class="p">}}</span> <span class="k">end</span> <span class="k">end</span> <span class="nb">puts</span> <span class="no">Spec</span><span class="o">::</span><span class="no">CLI</span><span class="p">.</span><span class="nf">methods</span><span class="p">.</span><span class="nf">sort</span> <span class="c1"># =&gt; ["abort!", "add_formatter", ...]</span> </code></pre> </div> <p>More macro examples can be found in the <a href="proxy.php?url=https://crystal-lang.org/reference/1.15/syntax_and_semantics/macros/macro_methods.html" rel="noopener noreferrer">Crystal Macro Methods</a> documentation. </p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgckammc4harxop12z3h.jpeg" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgckammc4harxop12z3h.jpeg" alt="phantasy crystal spec" width="800" height="457"></a></p> <h2> 1. 
Filtering Crystal Spec Tests Based on Tags </h2> <p>I worked on executing different types of tests, including unit and integration tests.<br><br> There are multiple approaches to separating them, and many ideas can be found in this <a href="proxy.php?url=https://forum.crystal-lang.org/t/exclude-all-tests-with-tags/6861/1" rel="noopener noreferrer">Crystal Forum discussion</a>. </p> <p>Here’s the approach I took: </p> <h3> Project Folder Structure </h3> <p>My project’s test structure looks like this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>$ tree spec spec ├── awscr-s3 ├── awscr-s3_spec.cr ├── fixtures.cr ├── integration │   ├── compose.yml │   └── minio_spec.cr └── spec_helper.cr </code></pre> </div> <p>The integration tests are marked with the tag <code>"integration"</code>. </p> <h3> Filtering Tests Based on Tags </h3> <p>One of the things I love about Crystal is that the code is simple and intuitive to read.<br><br> While learning about the <code>Spec</code> module in the <a href="proxy.php?url=https://crystal-lang.org/api/master/Spec.html" rel="noopener noreferrer">Crystal Spec Documentation</a>, I found links to the source code, which helped me understand how filtering works. 
</p> <p>Here’s how I implemented tag-based filtering:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="c1"># spec/spec_helper.cr</span> <span class="k">class</span> <span class="nc">Spec</span><span class="o">::</span><span class="no">CLI</span> <span class="k">def</span> <span class="nf">tags</span> <span class="vi">@tags</span> <span class="k">end</span> <span class="k">end</span> <span class="no">Spec</span><span class="p">.</span><span class="nf">around_each</span> <span class="k">do</span> <span class="o">|</span><span class="n">example</span><span class="o">|</span> <span class="n">tags</span> <span class="o">=</span> <span class="no">Spec</span><span class="p">.</span><span class="nf">cli</span><span class="p">.</span><span class="nf">tags</span> <span class="c1"># By default, skip tagged tests and run only unit tests</span> <span class="k">next</span> <span class="k">if</span> <span class="p">(</span><span class="n">tags</span><span class="p">.</span><span class="nf">nil?</span> <span class="o">||</span> <span class="n">tags</span><span class="p">.</span><span class="nf">empty?</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">example</span><span class="p">.</span><span class="nf">example</span><span class="p">.</span><span class="nf">all_tags</span><span class="p">.</span><span class="nf">empty?</span> <span class="n">example</span><span class="p">.</span><span class="nf">run</span> <span class="k">end</span> </code></pre> </div> <h4> Explanation </h4> <p><code>Spec.cli</code> is a command-line interface that parses options and stores them internally in the <code>@tags</code> variable. 
</p> <p>For example, when running:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>$ crystal spec --tag 'integration' </code></pre> </div> <p>The <code>"integration"</code> tag is stored as a <code>Set</code> in <code>@tags</code>. This allows me to check which filters were enabled without manually parsing the command-line arguments. </p> <p>However, there’s a small drawback: <code>@tags</code> is not publicly accessible. To work around this, I extended the <code>Spec::CLI</code> class and exposed it. (There may be a better way to do this.) </p> <p>The second part of the code is a simple filtering mechanism implemented using <code>Spec.around_each</code>: </p> <ul> <li>It checks the provided tags and then validates the test’s tags. </li> <li>If no tags are specified, all tagged tests are skipped by default. </li> </ul> <p>A simple debug statement like <code>pp! example</code> can help explore more filtering options. </p> <h2> 2. Configuring Test Dependencies Based on Tags </h2> <p>Integration tests allow sending real requests.<br><br> Instead of adding tags to every integration test individually, we can leverage the folder structure (e.g., placing them in an <code>integration</code> folder). 
</p> <p>Here’s one way to configure <code>WebMock</code> dynamically based on test tags or file location:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight crystal"><code><span class="no">Spec</span><span class="p">.</span><span class="nf">around_each</span> <span class="k">do</span> <span class="o">|</span><span class="n">example</span><span class="o">|</span> <span class="n">integration</span> <span class="o">=</span> <span class="n">example</span><span class="p">.</span><span class="nf">example</span><span class="p">.</span><span class="nf">all_tags</span><span class="p">.</span><span class="nf">includes?</span><span class="p">(</span><span class="s2">"integration"</span><span class="p">)</span> <span class="o">||</span> <span class="n">example</span><span class="p">.</span><span class="nf">example</span><span class="p">.</span><span class="nf">file</span><span class="p">.</span><span class="nf">includes?</span><span class="p">(</span><span class="s2">"spec/integration"</span><span class="p">)</span> <span class="no">WebMock</span><span class="p">.</span><span class="nf">reset</span> <span class="no">WebMock</span><span class="p">.</span><span class="nf">allow_net_connect</span> <span class="o">=</span> <span class="n">integration</span> <span class="n">example</span><span class="p">.</span><span class="nf">run</span> <span class="k">end</span> </code></pre> </div> <p>That’s all for today! 
🚀 </p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iav7tzz2q1zwal1r5i1.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iav7tzz2q1zwal1r5i1.png" alt="That's all folks" width="800" height="387"></a></p> crystal programming Involving the Right People in an Incident Michael Nikitochkin Sun, 16 Feb 2025 23:27:04 +0000 https://dev.to/miry/involving-the-right-people-in-an-incident-all-vs-correct-1p2a https://dev.to/miry/involving-the-right-people-in-an-incident-all-vs-correct-1p2a <p>It's been a while since I last wrote about incidents. Lately, I’ve been more focused on backend development in Ruby and Crystal projects, but after handling a few recent incidents, I wanted to jot down my thoughts.</p> <h3> The Problem: Over-Involving Teams During an Incident </h3> <p>It’s common for an Incident Commander to be paged when something isn’t working. As the Incident Commander, you might see reports from customers. Your responsibility is to identify contributing factors and bring the right people together to stop the bleeding.</p> <p>However, the concept of bringing the "correct" people is sometimes misunderstood. Some Incident Commanders assume this means inviting <em>everyone</em> who might be remotely involved. They create massive video or audio calls, hoping someone will figure out the problem. While this might seem like a thorough approach, it often leads to frustration among teams who are pulled into the incident but have nothing to contribute.
They end up waiting passively, leading to wasted time and effort.</p> <p>This broad approach may give Incident Commanders a false sense of control—believing that if all teams are present, they’ve done everything possible. But in reality, each team may assume the issue lies elsewhere, leading to passive listening rather than active problem-solving.</p> <h3> The Consequences of Over-Involvement </h3> <p>Bringing too many people into an incident can have several negative effects:</p> <ul> <li> <strong>High-cost meetings with low productivity:</strong> More people in the call means more noise, more conflicting theories, and a harder time reaching a consensus.</li> <li> <strong>Blame-shifting and distraction:</strong> Each team might focus on their own long-standing issues rather than identifying the real root cause.</li> <li> <strong>Loss of the bigger picture:</strong> With too many perspectives, the core problem can become obscured, making it harder to pinpoint the actual failure.</li> </ul> <p>In complex systems, problems can be hidden under layers of dependencies, making a broad approach ineffective. 
It’s crucial to separate valid long-term concerns from immediate incident causes.</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkz41pritx60wtupg14r6.jpeg" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkz41pritx60wtupg14r6.jpeg" alt="Situation room with a lot of folks" width="800" height="800"></a></p> <h3> How to Solve an Incident Without Over-Involving Teams </h3> <p>So, how can an Incident Commander solve an unknown issue efficiently without involving too many people, while still resolving the problem as quickly as possible?</p> <ol> <li> <strong>Stop the bleeding</strong> using all available tools. Start by narrowing down the issue to its closest impact point—typically where users are directly affected. Investigate progressively deeper into microservices and vendor solutions.</li> <li> <strong>Analyze patterns</strong> by building a timeline and reproducing the problem as closely as possible to the reported issue. This is often the hardest step, especially if the issue is intermittent or device-specific.</li> <li> <strong>Leverage observability tools</strong> across mobile, backend services, and database profiling. These tools should be a core part of every playbook.</li> <li> <strong>Identify the success lines</strong> in monitoring reports to determine possible mitigation steps.</li> <li> <strong>Engage teams incrementally</strong> — bring in only the necessary teams one at a time, verifying details with each and syncing on next steps before continuing the investigation. 
Even if details are shared in an incident channel, it's more effective to request targeted help in short bursts.</li> <li> <strong>Consult experts when needed</strong> — if someone has experience with a similar issue, involve them, but avoid defaulting to large group calls.</li> <li> <strong>Track multiple leads separately</strong> in different threads, summarizing findings regularly.</li> <li> <strong>Mitigation over resolution:</strong> Depending on the incident’s criticality, full resolution might not be immediate. Collaborate closely with 1-2 relevant teams to assess mitigation strategies before broadening involvement.</li> <li> <strong>Maintain focused escalation:</strong> Always escalate and page when necessary. Most people are willing to help, but ensure they have a clear role rather than keeping them in a call unnecessarily.</li> </ol> <h3> Conclusion: All vs. Correct </h3> <p>Should you bring <em>everyone</em> into an incident call? Or should you focus on identifying the <em>correct</em> people? While including all teams might seem like a faster way to solve the problem, understanding the issue through observability tools and selectively involving the right teams is a more effective approach. This minimizes stress and improves resolution time.</p> <p>Does this mean you should hesitate to escalate? Absolutely not — always escalate when necessary. People are generally willing to help, but ensure they have a clear role rather than keeping them in a call unnecessarily.</p> <p>By shifting from an <em>“all-in”</em> approach to a <em>targeted</em>, <em>observability-driven</em> strategy, Incident Commanders can handle incidents more efficiently, reduce noise, and ensure faster recovery.</p> <p>Of course, this isn’t something that can be perfected during an active incident. Understanding company structure, service dependencies, mitigation practices, and observability tools requires preparation. 
One of the best ways to improve is by reviewing past incidents and occasionally practicing simulated ones using exercises like <em>Wheel of Misfortune</em>.</p> <p>And now, I trust you to make the right call!</p> <p>Check out these resources to learn more:</p> <ul> <li><a href="proxy.php?url=https://cloud.google.com/blog/products/management-tools/shrinking-the-time-to-mitigate-production-incidents" rel="noopener noreferrer">Shrinking the time to mitigate production incidents—CRE life lessons</a></li> </ul> sre incidentresponse incidentmanagement