DEV Community: fock1e The latest articles on DEV Community by fock1e (@fock1e). https://dev.to/fock1e en Why Linux Kills Your App Without Warning (The OOM Killer, Explained) fock1e Sun, 19 Apr 2026 10:38:00 +0000 https://dev.to/fock1e/why-linux-kills-your-app-without-warning-the-oom-killer-explained-53d0 <h2> Why Linux Kills Your App Without Warning (The OOM Killer, Explained) </h2> <p>There's a specific flavor of production incident that everyone encounters eventually: your app just... disappears. No stack trace. No crash log. No application errors. The process is running, and then it isn't. If you've ever been on-call at 2am staring at systemd telling you <code>Main process exited, code=killed, signal=KILL</code> with zero explanation, this post is for you.</p> <p><em>I turned this exact scenario into one of 18 debugging exercises at <a href="https://scenar.site" rel="noopener noreferrer">scenar.site</a> - practice it interactively with an AI interviewer. Details at the end.</em></p> <h2> The Setup </h2> <p>Java service on an 8GB server. Runs for hours, sometimes days, then dies. Developers swear there's no bug. You check the logs - nothing. 
Just normal operation right up until the end, then silence.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight systemd"><code>$ systemctl status myapp
● myapp.service - My Java Application
   Loaded: loaded (/etc/systemd/system/myapp.service; enabled)
   Active: failed (Result: signal) since Tue 2026-01-20 14:23:15 UTC
Jan 20 14:23:15 app-server-01 systemd[1]: myapp.service: Main process exited, code=killed, signal=KILL
Jan 20 14:23:15 app-server-01 systemd[1]: myapp.service: Failed with result 'signal'.
</code></pre> </div> <p><code>signal=KILL</code>. SIGKILL. The nuclear option. The signal you can't catch, can't ignore, can't clean up from. Something sent SIGKILL to your Java process.</p> <h2> The First Wrong Turn </h2> <p>Most people (past me included) start grepping application logs. You won't find anything. SIGKILL doesn't let the process finish its current syscall, let alone flush a log buffer. 
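</p>

<p><em>A quick way to see this behavior from any shell (a generic sketch, not output from this incident): a process that dies from SIGKILL exits with status 128 + 9 = 137, and none of its cleanup code ever runs.</em></p>

```shell
#!/bin/sh
# Start a throwaway background process, then kill it with SIGKILL.
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid"
status=$?
# 137 = 128 + 9 (SIGKILL). No trap, exit hook, or log flush ever ran.
echo "exit status: $status"   # prints: exit status: 137
```

<p>That 137 is worth memorizing - it's the same exit code Docker reports for containers killed by the kernel's OOM killer.</p>

<p>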
The app didn't crash - it was executed.</p> <p>The right first move is to ask: <em>who could have killed this?</em> Your options are:</p> <ol> <li>A human with sudo (ask around)</li> <li>A monitoring tool with an auto-remediation rule (check)</li> <li>The kernel itself</li> </ol> <p>If it's #3, there's exactly one culprit: the OOM killer.</p> <h2> Finding the OOM Killer in the Act </h2> <p>The OOM killer writes to the kernel ring buffer. You read it with <code>dmesg</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>dmesg | <span class="nb">grep</span> <span class="nt">-i</span> oom <span class="go">[482341.234] myapp invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 [482341.234] Out of memory: Killed process 8921 (java) total-vm:6291456kB, anon-rss:4194304kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:8192kB oom_score_adj:0 [482341.235] oom_reaper: reaped process 8921 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB </span></code></pre> </div> <p>There it is. The kernel killed PID 8921 (your Java process) to free memory. <code>anon-rss:4194304kB</code> = your process was using 4GB of anonymous RSS when it got killed.</p> <h2> Why Does the Kernel Do This </h2> <p>Linux overcommits memory. When you <code>malloc(1GB)</code>, the kernel says "sure" without actually having 1GB free. It's a bet that most processes won't use all the memory they request. Usually that's fine. When it's not fine, the kernel has to pick a process to kill to free memory - because the alternative is the whole system locking up.</p> <p>The selection is based on <code>oom_score</code> (higher = more likely to be killed). 
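</p>

<p><em>Every process exposes its current score under <code>/proc/&lt;pid&gt;/oom_score</code>. Here's a small sketch (a hypothetical helper, assuming a Linux <code>/proc</code>) that ranks the likely victims:</em></p>

```shell
#!/bin/sh
# List the ten processes the OOM killer would most likely pick,
# by reading each process's badness score from /proc.
for dir in /proc/[0-9]*; do
  [ -r "$dir/oom_score" ] || continue
  score=$(cat "$dir/oom_score")
  # cmdline is NUL-separated; turn it into a readable, truncated string
  cmd=$(tr '\0' ' ' < "$dir/cmdline" | cut -c1-60)
  printf '%6s  %6s  %s\n' "$score" "${dir#/proc/}" "$cmd"
done | sort -rn | head
```

<p>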
Check it live:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>ps aux <span class="nt">--sort</span><span class="o">=</span>-%mem | <span class="nb">head</span> <span class="go">USER PID %CPU %MEM RSS COMMAND mysql 2341 15.2 45.5 3728000 /usr/sbin/mysqld root 3456 5.1 28.3 2320000 /usr/bin/prometheus elastic 4567 8.4 18.2 1490000 /usr/share/elasticsearch/jdk/bin/java </span></code></pre> </div> <p>And the OOM scores:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span><span class="nb">cat</span> /proc/2341/oom_score <span class="go">456 </span><span class="gp">$</span><span class="w"> </span><span class="nb">cat</span> /proc/8921/oom_score <span class="go">512 </span></code></pre> </div> <p>The Java app had the highest score, so it got picked. Scores weight recent memory usage, process age, and <code>oom_score_adj</code>.</p> <h2> The Short-Term Fix </h2> <p>You have three options for right now:</p> <p><strong>1. Reduce the Java heap.</strong> If your systemd unit has <code>-Xmx4g</code>, the JVM will absolutely use 4GB. Drop it:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">ExecStart</span><span class="o">=</span>/usr/bin/java <span class="nt">-Xmx2g</span> <span class="nt">-jar</span> /opt/myapp/app.jar </code></pre> </div> <p><strong>2. 
Protect critical processes with <code>oom_score_adj</code>.</strong> Range is -1000 (never kill) to 1000 (kill first):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">$ </span><span class="nb">echo</span> <span class="nt">-500</span> <span class="o">&gt;</span> /proc/<span class="si">$(</span>pidof java<span class="si">)</span>/oom_score_adj </code></pre> </div> <p>Don't set -1000 unless you're <em>sure</em> you want that process to be the last thing running before the kernel panics. I've seen "protected" apps keep a dying system from recovering.</p> <p><strong>3. Add swap.</strong> Buys you time but swapping kills performance. Emergency only.</p> <h2> The Long-Term Fix </h2> <p>The real problem is usually that the server is overcommitted. MySQL wants 4GB, Prometheus wants 2GB, Elasticsearch wants 1.5GB, Java wants 4GB - on an 8GB box. That's 11.5GB of ask on 8GB of capacity. Something has to give eventually.</p> <p>The right answers:</p> <ul> <li> <strong>Use cgroups / systemd <code>MemoryMax=</code></strong> to enforce limits per service. This is the proper fix. Each service gets a guaranteed ceiling. If it exceeds its own cgroup limit, it gets killed inside its cgroup (via <code>oom_kill_disable=0</code>) without taking the whole box down. </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight systemd"><code><span class="k">[Service]</span> <span class="nt">ExecStart</span><span class="p">=</span>/usr/bin/java -Xmx2g -jar /opt/myapp/app.jar <span class="nt">MemoryMax</span><span class="p">=</span>2.5G <span class="nt">MemoryHigh</span><span class="p">=</span>2G </code></pre> </div> <ul> <li><p><strong>Move workloads off shared hosts.</strong> Put the DB on its own box, the app servers on theirs. Stop co-locating memory-hungry services.</p></li> <li><p><strong>Monitor memory pressure</strong>, not just memory usage. 
<code>/proc/pressure/memory</code> (PSI) tells you when processes are stalled waiting for memory, which is a much earlier signal than "out of memory" alerts.</p></li> </ul> <h2> Prevention Checklist </h2> <p>Before the next OOM kill:</p> <ol> <li>Every service has a <code>MemoryMax=</code> in its systemd unit</li> <li>Alert on <code>memory available &lt; 10%</code> for 5 minutes, not just on events after death</li> <li>Alert on memory PSI (<code>avg10 &gt; 10</code>) - catches swapping and thrashing before OOM</li> <li>Java apps have <code>-XX:+HeapDumpOnOutOfMemoryError</code> so you get something when the JVM itself runs out of heap (different from OS OOM)</li> <li>Document which processes are "protected" (<code>oom_score_adj &lt; 0</code>) and why</li> </ol> <h2> What Interviewers Look For </h2> <p>If this comes up in an SRE interview, they're testing for:</p> <ul> <li>Do you know SIGKILL can't be caught, so absence of logs is a clue, not a failure?</li> <li>Do you go to <code>dmesg</code> without being told to?</li> <li>Can you explain <em>why</em> the kernel kills processes (overcommit, not "a bug")?</li> <li>Do you talk about cgroups as the structural fix, not just tuning <code>oom_score_adj</code>?</li> <li>Can you distinguish OS-level OOM from JVM-level OutOfMemoryError?</li> </ul> <p>The last one catches a surprising number of people - they conflate the two and assume bumping <code>-Xmx</code> fixes everything, when actually a bigger heap makes OS-level OOM <em>more</em> likely.</p> <h2> Practice It Interactively </h2> <p>This scenario is one of 18 on <a href="https://scenar.site" rel="noopener noreferrer">scenar.site</a>. You describe your debugging approach in plain English, an AI simulates the broken server with realistic command output, and plays interviewer by pushing back on your reasoning. Free tier gets you started, no credit card.</p> linux devops sre sysadmin Disk Has Space But Can't Create Files? 
(Linux Inode Exhaustion) fock1e Fri, 10 Apr 2026 20:51:57 +0000 https://dev.to/fock1e/disk-has-space-but-cant-create-files-linux-inode-exhaustion-8ll <p>One of the most confusing Linux errors I've debugged: a production server reporting "No space left on device" while <code>df -h</code> clearly showed 50GB free. I lost an hour to it the first time. Here's what was actually going on.</p> <p><em>I turned scenarios like this into an interactive practice tool at <a href="https://scenar.site" rel="noopener noreferrer">scenar.site</a> - you debug simulated servers by talking to an AI interviewer. More at the end.</em></p> <h2> The Setup </h2> <p>I was on-call for a logging pipeline. Rsyslog kept crashing, and the logs were full of this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>rsyslog[8421]: cannot create '/var/log/syslog.1': No space left on device
systemd[1]: rsyslog.service: Main process exited, code=exited, status=1/FAILURE
</code></pre> </div> <p>First instinct: the disk is full. Easy fix, right?</p> <h2> The Investigation </h2> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span><span class="nb">df</span> <span class="nt">-h</span> <span class="go">Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   45G   50G  48% /
/dev/sda2        20G  8.0G   11G  42% /var/log
</span></code></pre> </div> <p>Wait, what? 50GB free on root, 11GB free on /var/log. The disk isn't full. But the error clearly said "No space left on device". So what's going on?</p> <p>This is the moment where a lot of people (including past me) start doing random things: restarting services, clearing caches, rebooting the machine. 
None of it works.</p> <h2> The Key Insight </h2> <p>A Linux filesystem tracks two resources, not one:</p> <ol> <li> <strong>Disk space</strong> (what <code>df -h</code> shows) - how many bytes are used</li> <li> <strong>Inodes</strong> - how many files can exist</li> </ol> <p>Every file, directory, and symlink on the filesystem consumes exactly one inode. When you run out of inodes, you can't create new files even if you have terabytes of free space. The kernel returns <code>ENOSPC</code> which the userspace translates to "No space left on device" - the same error as being actually out of space. That's where the confusion comes from.</p> <h2> The Actual Diagnosis </h2> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span><span class="nb">df</span> <span class="nt">-i</span> <span class="go">Filesystem Inodes IUsed IFree IUse% Mounted on /dev/sda1 6553600 6553598 2 100% / /dev/sda2 1310720 1310718 2 100% /var/log </span></code></pre> </div> <p>There it is. 100% inodes used. Zero free. The filesystem literally cannot create another file.</p> <p>Now: where are all these inodes going? Inode exhaustion almost always means "a lot of small files in one place". Time to find them:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>find /var/log <span class="nt">-type</span> f | <span class="nb">wc</span> <span class="nt">-l</span> <span class="go">1310715 </span></code></pre> </div> <p>Over a million files in /var/log. That's the culprit. Let me see what they look like:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span><span class="nb">ls</span> /var/log/ | <span class="nb">head</span> <span class="go">session_000001.log session_000002.log session_000003.log session_000004.log session_000005.log </span><span class="c">... 
</span></code></pre> </div> <p>Session logs. Let me check the sizes:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>find /var/log <span class="nt">-name</span> <span class="s1">'session_*'</span> <span class="nt">-printf</span> <span class="s1">'%s\n'</span> | <span class="nb">sort</span> <span class="nt">-u</span> <span class="go">0 </span></code></pre> </div> <p>Every single one is 0 bytes. Millions of empty files. Someone wrote a debug script, forgot to clean up, and it's been creating empty session logs for months. Each file is 0 bytes of disk space but consumes exactly one inode.</p> <h2> The Fix </h2> <p>Delete the empty files. Don't do it with <code>rm</code> directly - the argument list will be too long. Use <code>find</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>find /var/log <span class="nt">-type</span> f <span class="nt">-name</span> <span class="s1">'session_*'</span> <span class="nt">-delete</span> </code></pre> </div> <p>This took about 30 seconds on that machine. 
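</p>

<p><em>You can reproduce the whole failure mode in miniature with a throwaway directory (a self-contained sketch, nothing from the real server): empty files consume inodes but no data blocks, and <code>find -delete</code> removes them without ever hitting ARG_MAX, because no file names are passed on a command line.</em></p>

```shell
#!/bin/sh
# Miniature repro: many empty files, cleaned up with find -delete.
dir=$(mktemp -d)
i=1
while [ "$i" -le 1000 ]; do
  : > "$dir/session_$i.log"   # 0 bytes of data, 1 inode each
  i=$((i + 1))
done
before=$(find "$dir" -type f | wc -l | tr -d ' ')
find "$dir" -type f -name 'session_*' -delete
after=$(find "$dir" -type f | wc -l | tr -d ' ')
echo "before=$before after=$after"   # prints: before=1000 after=0
rmdir "$dir"
```

<p>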
Then verify:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span><span class="nb">df</span> <span class="nt">-i</span> <span class="go">Filesystem Inodes IUsed IFree IUse% Mounted on /dev/sda1 6553600 2883 6550717 1% / /dev/sda2 1310720 1003 1309717 1% /var/log </span></code></pre> </div> <p>Restart the service:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>systemctl restart rsyslog <span class="gp">$</span><span class="w"> </span>systemctl status rsyslog <span class="go"> Active: active (running) </span></code></pre> </div> <p>Fixed.</p> <h2> Prevention </h2> <p>A few things I put in place after this:</p> <ol> <li> <strong>Monitor inode usage, not just disk space.</strong> Most monitoring setups check <code>df -h</code> but forget <code>df -i</code>. Add an alert at 85% inode usage.</li> <li> <strong>Set up logrotate</strong> for any directory that accumulates log files. The default logrotate config handles most system logs but custom paths need their own config.</li> <li> <strong>Code review any script that creates files in production.</strong> The script that caused this was "just a debug helper" that was never removed.</li> <li> <strong>Use <code>find ... -delete</code> for cleanup</strong>, not <code>rm</code> with glob patterns. Glob expansion will hit the ARG_MAX limit with millions of files.</li> </ol> <h2> What Interviewers Look For </h2> <p>If this comes up in an SRE interview, the interviewer isn't just checking if you know <code>df -i</code>. They want to see:</p> <ul> <li>Do you check the actual error message carefully? ("No space left" has two possible causes)</li> <li>Do you form a hypothesis before running commands? (Running <code>df -h</code>, <code>df -i</code>, <code>find</code>, each answering a specific question)</li> <li>Can you explain the underlying concept? 
(inodes as a separate resource)</li> <li>Do you think about prevention, not just the immediate fix?</li> </ul> <h2> Practice This Interactively </h2> <p>I built <a href="https://scenar.site" rel="noopener noreferrer">scenar.site</a> to practice exactly these kinds of scenarios. You describe your debugging approach in plain English, and an AI simulates a broken server, returns realistic command output, and tracks your reasoning. This scenario is one of 18 built-in ones. Free tier gets you started, no credit card.</p> linux devops sre sysadmin Why I Built Scenar.io - An AI-Powered DevOps Interview Practice Tool fock1e Tue, 07 Apr 2026 12:24:32 +0000 https://dev.to/fock1e/why-i-built-scenario-an-ai-powered-devops-interview-practice-tool-2ha4 <h1> Why I Built Scenar.io </h1> <h2> How It Started </h2> <p>I was prepping for a Google SRE interview and struggling with the debugging portion. Not the knowledge - I knew the commands, I'd fixed real incidents at work. The problem was practicing under interview conditions: thinking out loud, explaining your reasoning, having someone challenge your approach.</p> <p>I started using Claude in the terminal to simulate it. I'd describe a scenario, ask it to act as a broken server, and practice talking through my debugging process. After a few weeks I realized I was spending more time setting up the prompts than actually practicing. I had this whole system - hidden server states, clue tracking, difficulty levels - and it hit me: this should just be a tool.</p> <p>I looked at what already existed. SadServers makes you type exact commands into a real terminal. LeetCode is for coding, not ops. Flashcards test recall, not problem-solving. 
Nothing let you practice the way interviews actually work - talking through your approach, getting realistic output, having someone push back on your reasoning.</p> <p>So I turned my Claude practice sessions into a proper product.</p> <h2> What Scenar.io Actually Is </h2> <p>Scenar.io is an interactive debugging practice tool where an AI plays the role of both the server you're investigating and the interviewer watching you work. You describe what you'd do in natural language - "I'd check if nginx is running" - and the AI returns realistic command output based on a hidden server state, then asks follow-up questions like a real interviewer would.</p> <p>The key distinction: this is a conversation, not a terminal. You say "let me check what's listening on port 80" instead of typing <code>ss -tlnp | grep :80</code>. The AI understands your intent and shows you what you'd actually see.</p> <h2> A Quick Example </h2> <p>Here's what a session looks like with the "Nginx Won't Start" scenario:</p> <blockquote> <p><strong>Interviewer:</strong> You're on-call and receive an alert: "Website is down!" The web server (web-server-01) is reporting that nginx won't start. How would you begin investigating?</p> <p><strong>You:</strong> First I'd check the status of the nginx service to see what error it's throwing.</p> <p><strong>Interviewer:</strong></p> <pre class="highlight plaintext"><code>$ systemctl status nginx nginx.service - A high performance web server Loaded: loaded (/lib/systemd/system/nginx.service; enabled) Active: failed (Result: exit-code) since Wed 2024-01-17 14:30:22 UTC Process: 4521 ExitCode=1/FAILURE Jan 17 14:30:22 web-server-01 nginx[4521]: nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use) </code></pre> <p>Interesting - the bind failed. What does that tell you, and what would you check next?</p> </blockquote> <p>From there you work through the problem: find that Apache is squatting on port 80, stop it, start nginx, verify the fix. 
The AI tracks your progress through stages - root cause identification, short-term fix, long-term prevention - just like a real interview rubric.</p> <h2> Three Ways to Practice </h2> <p>Beyond the 18 built-in debugging scenarios (disk full, OOM killer, DNS failures, container crashloops, compromised binaries, and more), there's:</p> <ul> <li> <strong>Verbal interviews</strong> - Conceptual Q&amp;A on Linux, networking, containers, security, and system design. The AI scores your answers on accuracy, completeness, and communication.</li> <li> <strong>Sandbox mode</strong> - Open-ended exploration of simulated servers with no specific bug to find. Practice poking around a Kubernetes node or auditing a web stack.</li> </ul> <p>You pick your interviewer style too: supportive mentor (easy), neutral professional (medium), or Socratic challenger (hard) who makes you justify every decision.</p> <h2> The Tech Behind It </h2> <p>For the Dev.to crowd - the stack is Svelte 5 on the frontend with a Bun + Hono backend, Turso (libSQL) for the database with Drizzle ORM, and Claude Sonnet 4.5 via OpenRouter for the AI. Deployed on Fly.io with GitHub Actions.</p> <p>The interesting technical bit is how the AI simulation works. Each scenario has a <code>hidden_state</code> - a JSON blob describing the full server state (running processes, disk usage, service statuses, log files, network connections). The AI receives this state along with the user's command and returns realistic output that's consistent with the hidden state. A hallucination detection layer compares the AI's output against the state to catch fabricated data.</p> <p>The AI prompt has a dual-role structure: first act as a "server simulator" that must produce command output, then act as an "interviewer" that asks follow-up questions. 
This prevents the common failure mode where the AI skips the output and just says "Good thinking, what else would you check?"</p> <h2> What's Free, What's Not </h2> <p>The free tier gives you 5 debugging sessions, 3 verbal interviews, and 2 sandbox sessions per month. Enough to get a feel for it and practice regularly.</p> <p>Pro is $9/month for unlimited everything, custom scenario generation (describe any topic and the AI builds a scenario for you), and all difficulty modes.</p> <p><strong>If you're reading this early: the first 100 subscribers get Pro for $5/month with code <code>M3OTEYOQ</code>.</strong> That price locks in permanently.</p> <h2> Try It </h2> <p>The whole thing is live at <a href="https://scenar.site" rel="noopener noreferrer">scenar.site</a>. Sign in with GitHub, pick a scenario, and start debugging. No credit card needed for the free tier.</p> <p>I built this because I needed it. If you're prepping for DevOps or SRE interviews - or you just want to sharpen your debugging instincts - I'd genuinely appreciate you giving it a shot and telling me what you think.</p> <p>What scenarios would you want to see? What would make this more useful for your prep? I'm one engineer building this, and feedback directly shapes what gets built next.</p> devops sre interview ai