DEV Community: fock1e The latest articles on DEV Community by fock1e (@fock1e). https://dev.to/fock1e en Why Linux Kills Your App Without Warning (The OOM Killer, Explained) fock1e Sun, 19 Apr 2026 10:38:00 +0000 https://dev.to/fock1e/why-linux-kills-your-app-without-warning-the-oom-killer-explained-53d0 <h2> Why Linux Kills Your App Without Warning (The OOM Killer, Explained) </h2> <p>There's a specific flavor of production incident that everyone encounters eventually: your app just... disappears. No stack trace. No crash log. No application errors. The process is running, and then it isn't. If you've ever been on-call at 2am staring at systemd telling you <code>Main process exited, code=killed, signal=KILL</code> with zero explanation, this post is for you.</p> <p><em>I turned this exact scenario into one of 18 debugging exercises at <a href="https://scenar.site" rel="noopener noreferrer">scenar.site</a> - practice it interactively with an AI interviewer. Details at the end.</em></p> <h2> The Setup </h2> <p>Java service on an 8GB server. Runs for hours, sometimes days, then dies. Developers swear there's no bug. You check the logs - nothing. 
Just normal operation right up until the end, then silence.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight systemd"><code>$ systemctl status myapp
● myapp.service - My Java Application
   Loaded: loaded (/etc/systemd/system/myapp.service; enabled)
   Active: failed (Result: signal) since Tue 2026-01-20 14:23:15 UTC
Jan 20 14:23:15 app-server-01 systemd[1]: myapp.service: Main process exited, code=killed, signal=KILL
Jan 20 14:23:15 app-server-01 systemd[1]: myapp.service: Failed with result 'signal'.
</code></pre> </div> <p><code>signal=KILL</code>. SIGKILL. The nuclear option. The signal you can't catch, can't ignore, can't clean up from. Something sent SIGKILL to your Java process.</p> <h2> The First Wrong Turn </h2> <p>Most people (past me included) start grepping application logs. You won't find anything. SIGKILL doesn't let the process finish its current syscall, let alone flush a log buffer. 
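</p>

<p><em>A quick way to see this behavior from any shell (a generic sketch, not output from this incident): a process that dies from SIGKILL exits with status 128 + 9 = 137, and none of its cleanup code ever runs.</em></p>

```shell
#!/bin/sh
# Start a throwaway background process, then kill it with SIGKILL.
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid"
status=$?
# 137 = 128 + 9 (SIGKILL). No trap, exit hook, or log flush ever ran.
echo "exit status: $status"   # prints: exit status: 137
```

<p>That 137 is worth memorizing - it's the same exit code Docker reports for containers killed by the kernel's OOM killer.</p>

<p>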
The app didn't crash - it was executed.</p> <p>The right first move is to ask: <em>who could have killed this?</em> Your options are:</p> <ol> <li>A human with sudo (ask around)</li> <li>A monitoring tool with an auto-remediation rule (check)</li> <li>The kernel itself</li> </ol> <p>If it's #3, there's exactly one culprit: the OOM killer.</p> <h2> Finding the OOM Killer in the Act </h2> <p>The OOM killer writes to the kernel ring buffer. You read it with <code>dmesg</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>dmesg | <span class="nb">grep</span> <span class="nt">-i</span> oom <span class="go">[482341.234] myapp invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 [482341.234] Out of memory: Killed process 8921 (java) total-vm:6291456kB, anon-rss:4194304kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:8192kB oom_score_adj:0 [482341.235] oom_reaper: reaped process 8921 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB </span></code></pre> </div> <p>There it is. The kernel killed PID 8921 (your Java process) to free memory. <code>anon-rss:4194304kB</code> = your process was using 4GB of anonymous RSS when it got killed.</p> <h2> Why Does the Kernel Do This </h2> <p>Linux overcommits memory. When you <code>malloc(1GB)</code>, the kernel says "sure" without actually having 1GB free. It's a bet that most processes won't use all the memory they request. Usually that's fine. When it's not fine, the kernel has to pick a process to kill to free memory - because the alternative is the whole system locking up.</p> <p>The selection is based on <code>oom_score</code> (higher = more likely to be killed). 
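</p>

<p><em>Every process exposes its current score under <code>/proc/&lt;pid&gt;/oom_score</code>. Here's a small sketch (a hypothetical helper, assuming a Linux <code>/proc</code>) that ranks the likely victims:</em></p>

```shell
#!/bin/sh
# List the ten processes the OOM killer would most likely pick,
# by reading each process's badness score from /proc.
for dir in /proc/[0-9]*; do
  [ -r "$dir/oom_score" ] || continue
  score=$(cat "$dir/oom_score")
  # cmdline is NUL-separated; turn it into a readable, truncated string
  cmd=$(tr '\0' ' ' < "$dir/cmdline" | cut -c1-60)
  printf '%6s  %6s  %s\n' "$score" "${dir#/proc/}" "$cmd"
done | sort -rn | head
```

<p>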
Check it live:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>ps aux <span class="nt">--sort</span><span class="o">=</span>-%mem | <span class="nb">head</span> <span class="go">USER PID %CPU %MEM RSS COMMAND mysql 2341 15.2 45.5 3728000 /usr/sbin/mysqld root 3456 5.1 28.3 2320000 /usr/bin/prometheus elastic 4567 8.4 18.2 1490000 /usr/share/elasticsearch/jdk/bin/java </span></code></pre> </div> <p>And the OOM scores:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span><span class="nb">cat</span> /proc/2341/oom_score <span class="go">456 </span><span class="gp">$</span><span class="w"> </span><span class="nb">cat</span> /proc/8921/oom_score <span class="go">512 </span></code></pre> </div> <p>The Java app had the highest score, so it got picked. Scores weight recent memory usage, process age, and <code>oom_score_adj</code>.</p> <h2> The Short-Term Fix </h2> <p>You have three options for right now:</p> <p><strong>1. Reduce the Java heap.</strong> If your systemd unit has <code>-Xmx4g</code>, the JVM will absolutely use 4GB. Drop it:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">ExecStart</span><span class="o">=</span>/usr/bin/java <span class="nt">-Xmx2g</span> <span class="nt">-jar</span> /opt/myapp/app.jar </code></pre> </div> <p><strong>2. 
Protect critical processes with <code>oom_score_adj</code>.</strong> Range is -1000 (never kill) to 1000 (kill first):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">$ </span><span class="nb">echo</span> <span class="nt">-500</span> <span class="o">&gt;</span> /proc/<span class="si">$(</span>pidof java<span class="si">)</span>/oom_score_adj </code></pre> </div> <p>Don't set -1000 unless you're <em>sure</em> you want that process to be the last thing running before the kernel panics. I've seen "protected" apps keep a dying system from recovering.</p> <p><strong>3. Add swap.</strong> Buys you time but swapping kills performance. Emergency only.</p> <h2> The Long-Term Fix </h2> <p>The real problem is usually that the server is overcommitted. MySQL wants 4GB, Prometheus wants 2GB, Elasticsearch wants 1.5GB, Java wants 4GB - on an 8GB box. That's 11.5GB of ask on 8GB of capacity. Something has to give eventually.</p> <p>The right answers:</p> <ul> <li> <strong>Use cgroups / systemd <code>MemoryMax=</code></strong> to enforce limits per service. This is the proper fix. Each service gets a guaranteed ceiling. If it exceeds its own cgroup limit, it gets killed inside its cgroup (via <code>oom_kill_disable=0</code>) without taking the whole box down. </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight systemd"><code><span class="k">[Service]</span> <span class="nt">ExecStart</span><span class="p">=</span>/usr/bin/java -Xmx2g -jar /opt/myapp/app.jar <span class="nt">MemoryMax</span><span class="p">=</span>2.5G <span class="nt">MemoryHigh</span><span class="p">=</span>2G </code></pre> </div> <ul> <li><p><strong>Move workloads off shared hosts.</strong> Put the DB on its own box, the app servers on theirs. Stop co-locating memory-hungry services.</p></li> <li><p><strong>Monitor memory pressure</strong>, not just memory usage. 
<code>/proc/pressure/memory</code> (PSI) tells you when processes are stalled waiting for memory, which is a much earlier signal than "out of memory" alerts.</p></li> </ul> <h2> Prevention Checklist </h2> <p>Before the next OOM kill:</p> <ol> <li>Every service has a <code>MemoryMax=</code> in its systemd unit</li> <li>Alert on <code>memory available &lt; 10%</code> for 5 minutes, not just on events after death</li> <li>Alert on memory PSI (<code>avg10 &gt; 10</code>) - catches swapping and thrashing before OOM</li> <li>Java apps have <code>-XX:+HeapDumpOnOutOfMemoryError</code> so you get something when the JVM itself runs out of heap (different from OS OOM)</li> <li>Document which processes are "protected" (<code>oom_score_adj &lt; 0</code>) and why</li> </ol> <h2> What Interviewers Look For </h2> <p>If this comes up in an SRE interview, they're testing for:</p> <ul> <li>Do you know SIGKILL can't be caught, so absence of logs is a clue, not a failure?</li> <li>Do you go to <code>dmesg</code> without being told to?</li> <li>Can you explain <em>why</em> the kernel kills processes (overcommit, not "a bug")?</li> <li>Do you talk about cgroups as the structural fix, not just tuning <code>oom_score_adj</code>?</li> <li>Can you distinguish OS-level OOM from JVM-level OutOfMemoryError?</li> </ul> <p>The last one catches a surprising number of people - they conflate the two and assume bumping <code>-Xmx</code> fixes everything, when actually a bigger heap makes OS-level OOM <em>more</em> likely.</p> <h2> Practice It Interactively </h2> <p>This scenario is one of 18 on <a href="https://scenar.site" rel="noopener noreferrer">scenar.site</a>. You describe your debugging approach in plain English, an AI simulates the broken server with realistic command output, and plays interviewer by pushing back on your reasoning. Free tier gets you started, no credit card.</p> linux devops sre sysadmin Disk Has Space But Can't Create Files? 
(Linux Inode Exhaustion) fock1e Fri, 10 Apr 2026 20:51:57 +0000 https://dev.to/fock1e/disk-has-space-but-cant-create-files-linux-inode-exhaustion-8ll <p>One of the most confusing Linux errors I've debugged: a production server reporting "No space left on device" while <code>df -h</code> clearly showed 50GB free. I lost an hour to it the first time. Here's what was actually going on.</p> <p><em>I turned scenarios like this into an interactive practice tool at <a href="https://scenar.site" rel="noopener noreferrer">scenar.site</a> - you debug simulated servers by talking to an AI interviewer. More at the end.</em></p> <h2> The Setup </h2> <p>I was on-call for a logging pipeline. Rsyslog kept crashing, and the logs were full of this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>rsyslog[8421]: cannot create '/var/log/syslog.1': No space left on device
systemd[1]: rsyslog.service: Main process exited, code=exited, status=1/FAILURE
</code></pre> </div> <p>First instinct: the disk is full. Easy fix, right?</p> <h2> The Investigation </h2> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span><span class="nb">df</span> <span class="nt">-h</span> <span class="go">Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   45G   50G  48% /
/dev/sda2        20G  8.0G   11G  42% /var/log
</span></code></pre> </div> <p>Wait, what? 50GB free on root, 11GB free on /var/log. The disk isn't full. But the error clearly said "No space left on device". So what's going on?</p> <p>This is the moment where a lot of people (including past me) start doing random things: restarting services, clearing caches, rebooting the machine. 
None of it works.</p> <h2> The Key Insight </h2> <p>A Linux filesystem tracks two resources, not one:</p> <ol> <li> <strong>Disk space</strong> (what <code>df -h</code> shows) - how many bytes are used</li> <li> <strong>Inodes</strong> - how many files can exist</li> </ol> <p>Every file, directory, and symlink on the filesystem consumes exactly one inode. When you run out of inodes, you can't create new files even if you have terabytes of free space. The kernel returns <code>ENOSPC</code> which the userspace translates to "No space left on device" - the same error as being actually out of space. That's where the confusion comes from.</p> <h2> The Actual Diagnosis </h2> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span><span class="nb">df</span> <span class="nt">-i</span> <span class="go">Filesystem Inodes IUsed IFree IUse% Mounted on /dev/sda1 6553600 6553598 2 100% / /dev/sda2 1310720 1310718 2 100% /var/log </span></code></pre> </div> <p>There it is. 100% inodes used. Zero free. The filesystem literally cannot create another file.</p> <p>Now: where are all these inodes going? Inode exhaustion almost always means "a lot of small files in one place". Time to find them:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>find /var/log <span class="nt">-type</span> f | <span class="nb">wc</span> <span class="nt">-l</span> <span class="go">1310715 </span></code></pre> </div> <p>Over a million files in /var/log. That's the culprit. Let me see what they look like:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span><span class="nb">ls</span> /var/log/ | <span class="nb">head</span> <span class="go">session_000001.log session_000002.log session_000003.log session_000004.log session_000005.log </span><span class="c">... 
</span></code></pre> </div> <p>Session logs. Let me check the sizes:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>find /var/log <span class="nt">-name</span> <span class="s1">'session_*'</span> <span class="nt">-printf</span> <span class="s1">'%s\n'</span> | <span class="nb">sort</span> <span class="nt">-u</span> <span class="go">0 </span></code></pre> </div> <p>Every single one is 0 bytes. Millions of empty files. Someone wrote a debug script, forgot to clean up, and it's been creating empty session logs for months. Each file is 0 bytes of disk space but consumes exactly one inode.</p> <h2> The Fix </h2> <p>Delete the empty files. Don't do it with <code>rm</code> directly - the argument list will be too long. Use <code>find</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>find /var/log <span class="nt">-type</span> f <span class="nt">-name</span> <span class="s1">'session_*'</span> <span class="nt">-delete</span> </code></pre> </div> <p>This took about 30 seconds on that machine. 
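</p>

<p><em>You can reproduce the whole failure mode in miniature with a throwaway directory (a self-contained sketch, nothing from the real server): empty files consume inodes but no data blocks, and <code>find -delete</code> removes them without ever hitting ARG_MAX, because no file names are passed on a command line.</em></p>

```shell
#!/bin/sh
# Miniature repro: many empty files, cleaned up with find -delete.
dir=$(mktemp -d)
i=1
while [ "$i" -le 1000 ]; do
  : > "$dir/session_$i.log"   # 0 bytes of data, 1 inode each
  i=$((i + 1))
done
before=$(find "$dir" -type f | wc -l | tr -d ' ')
find "$dir" -type f -name 'session_*' -delete
after=$(find "$dir" -type f | wc -l | tr -d ' ')
echo "before=$before after=$after"   # prints: before=1000 after=0
rmdir "$dir"
```

<p>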
Then verify:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span><span class="nb">df</span> <span class="nt">-i</span> <span class="go">Filesystem Inodes IUsed IFree IUse% Mounted on /dev/sda1 6553600 2883 6550717 1% / /dev/sda2 1310720 1003 1309717 1% /var/log </span></code></pre> </div> <p>Restart the service:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>systemctl restart rsyslog <span class="gp">$</span><span class="w"> </span>systemctl status rsyslog <span class="go"> Active: active (running) </span></code></pre> </div> <p>Fixed.</p> <h2> Prevention </h2> <p>A few things I put in place after this:</p> <ol> <li> <strong>Monitor inode usage, not just disk space.</strong> Most monitoring setups check <code>df -h</code> but forget <code>df -i</code>. Add an alert at 85% inode usage.</li> <li> <strong>Set up logrotate</strong> for any directory that accumulates log files. The default logrotate config handles most system logs but custom paths need their own config.</li> <li> <strong>Code review any script that creates files in production.</strong> The script that caused this was "just a debug helper" that was never removed.</li> <li> <strong>Use <code>find ... -delete</code> for cleanup</strong>, not <code>rm</code> with glob patterns. Glob expansion will hit the ARG_MAX limit with millions of files.</li> </ol> <h2> What Interviewers Look For </h2> <p>If this comes up in an SRE interview, the interviewer isn't just checking if you know <code>df -i</code>. They want to see:</p> <ul> <li>Do you check the actual error message carefully? ("No space left" has two possible causes)</li> <li>Do you form a hypothesis before running commands? (Running <code>df -h</code>, <code>df -i</code>, <code>find</code>, each answering a specific question)</li> <li>Can you explain the underlying concept? 
(inodes as a separate resource)</li> <li>Do you think about prevention, not just the immediate fix?</li> </ul> <h2> Practice This Interactively </h2> <p>I built <a href="https://scenar.site" rel="noopener noreferrer">scenar.site</a> to practice exactly these kinds of scenarios. You describe your debugging approach in plain English, and an AI simulates a broken server, returns realistic command output, and tracks your reasoning. This scenario is one of 18 built-in ones. Free tier gets you started, no credit card.</p> linux devops sre sysadmin Why I Built Scenar.io - An AI-Powered DevOps Interview Practice Tool fock1e Tue, 07 Apr 2026 12:24:32 +0000 https://dev.to/fock1e/why-i-built-scenario-an-ai-powered-devops-interview-practice-tool-2ha4 <h1> Why I Built Scenar.io </h1> <h2> How It Started </h2> <p>I was prepping for a Google SRE interview and struggling with the debugging portion. Not the knowledge - I knew the commands, I'd fixed real incidents at work. The problem was practicing under interview conditions: thinking out loud, explaining your reasoning, having someone challenge your approach.</p> <p>I started using Claude in the terminal to simulate it. I'd describe a scenario, ask it to act as a broken server, and practice talking through my debugging process. After a few weeks I realized I was spending more time setting up the prompts than actually practicing. I had this whole system - hidden server states, clue tracking, difficulty levels - and it hit me: this should just be a tool.</p> <p>I looked at what already existed. SadServers makes you type exact commands into a real terminal. LeetCode is for coding, not ops. Flashcards test recall, not problem-solving. 
Nothing let you practice the way interviews actually work - talking through your approach, getting realistic output, having someone push back on your reasoning.</p> <p>So I turned my Claude practice sessions into a proper product.</p> <h2> What Scenar.io Actually Is </h2> <p>Scenar.io is an interactive debugging practice tool where an AI plays the role of both the server you're investigating and the interviewer watching you work. You describe what you'd do in natural language - "I'd check if nginx is running" - and the AI returns realistic command output based on a hidden server state, then asks follow-up questions like a real interviewer would.</p> <p>The key distinction: this is a conversation, not a terminal. You say "let me check what's listening on port 80" instead of typing <code>ss -tlnp | grep :80</code>. The AI understands your intent and shows you what you'd actually see.</p> <h2> A Quick Example </h2> <p>Here's what a session looks like with the "Nginx Won't Start" scenario:</p> <blockquote> <p><strong>Interviewer:</strong> You're on-call and receive an alert: "Website is down!" The web server (web-server-01) is reporting that nginx won't start. How would you begin investigating?</p> <p><strong>You:</strong> First I'd check the status of the nginx service to see what error it's throwing.</p> <p><strong>Interviewer:</strong></p> <pre class="highlight plaintext"><code>$ systemctl status nginx nginx.service - A high performance web server Loaded: loaded (/lib/systemd/system/nginx.service; enabled) Active: failed (Result: exit-code) since Wed 2024-01-17 14:30:22 UTC Process: 4521 ExitCode=1/FAILURE Jan 17 14:30:22 web-server-01 nginx[4521]: nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use) </code></pre> <p>Interesting - the bind failed. What does that tell you, and what would you check next?</p> </blockquote> <p>From there you work through the problem: find that Apache is squatting on port 80, stop it, start nginx, verify the fix. 
The AI tracks your progress through stages - root cause identification, short-term fix, long-term prevention - just like a real interview rubric.</p> <h2> Three Ways to Practice </h2> <p>Beyond the 18 built-in debugging scenarios (disk full, OOM killer, DNS failures, container crashloops, compromised binaries, and more), there's:</p> <ul> <li> <strong>Verbal interviews</strong> - Conceptual Q&amp;A on Linux, networking, containers, security, and system design. The AI scores your answers on accuracy, completeness, and communication.</li> <li> <strong>Sandbox mode</strong> - Open-ended exploration of simulated servers with no specific bug to find. Practice poking around a Kubernetes node or auditing a web stack.</li> </ul> <p>You pick your interviewer style too: supportive mentor (easy), neutral professional (medium), or Socratic challenger (hard) who makes you justify every decision.</p> <h2> The Tech Behind It </h2> <p>For the Dev.to crowd - the stack is Svelte 5 on the frontend with a Bun + Hono backend, Turso (libSQL) for the database with Drizzle ORM, and Claude Sonnet 4.5 via OpenRouter for the AI. Deployed on Fly.io with GitHub Actions.</p> <p>The interesting technical bit is how the AI simulation works. Each scenario has a <code>hidden_state</code> - a JSON blob describing the full server state (running processes, disk usage, service statuses, log files, network connections). The AI receives this state along with the user's command and returns realistic output that's consistent with the hidden state. A hallucination detection layer compares the AI's output against the state to catch fabricated data.</p> <p>The AI prompt has a dual-role structure: first act as a "server simulator" that must produce command output, then act as an "interviewer" that asks follow-up questions. 
This prevents the common failure mode where the AI skips the output and just says "Good thinking, what else would you check?"</p> <h2> What's Free, What's Not </h2> <p>The free tier gives you 5 debugging sessions, 3 verbal interviews, and 2 sandbox sessions per month. Enough to get a feel for it and practice regularly.</p> <p>Pro is $9/month for unlimited everything, custom scenario generation (describe any topic and the AI builds a scenario for you), and all difficulty modes.</p> <p><strong>If you're reading this early: the first 100 subscribers get Pro for $5/month with code <code>M3OTEYOQ</code>.</strong> That price locks in permanently.</p> <h2> Try It </h2> <p>The whole thing is live at <a href="https://scenar.site" rel="noopener noreferrer">scenar.site</a>. Sign in with GitHub, pick a scenario, and start debugging. No credit card needed for the free tier.</p> <p>I built this because I needed it. If you're prepping for DevOps or SRE interviews - or you just want to sharpen your debugging instincts - I'd genuinely appreciate you giving it a shot and telling me what you think.</p> <p>What scenarios would you want to see? What would make this more useful for your prep? I'm one engineer building this, and feedback directly shapes what gets built next.</p> devops sre interview ai