Debug Production Outages in Your Browser
Errloom is an interactive learning platform that teaches developers how to debug real-world production outages. Practice troubleshooting common issues like 404 errors, N+1 queries, memory leaks, CORS errors, and environment variable misconfigurations—all in your browser with no setup required.
Interactive debugging scenarios with real production outage examples
Step-by-step guided debugging with code editor, logs, and interactive tasks
- First Contributions Welcome!
- Features
- Available Scenarios
- Quick Start
- Project Structure
- Contributing New Scenarios
- Tech Stack
- Contributing
- Deployment
- License
New to open source? Start here! We've made it super easy to make your first contribution:
- 🎓 Create a New Beginner Scenario (Estimated: 30-45 minutes)
  - Use our scenario template to create a debugging scenario
  - Examples: Race condition bug, API timeout, SQL injection fix, Docker networking issue
  - Follow the step-by-step guide
  - Just add one YAML file, no code changes needed!
  - Why it's easy: Template provided, auto-validation via CI, clear examples to follow
- 📝 Improve Scenario Documentation (Estimated: 15-20 minutes)
  - Pick any scenario in public/scenarios/ and enhance the explanations
  - Add real-world context, better hints, or clearer task instructions
  - Improve completion summaries or add helpful resource links
  - Why it's easy: Small, focused changes with immediate impact
- 💡 Add Code Comments to Core Components (Estimated: 20-30 minutes)
  - Document functions in ScenarioPlayer.tsx or TaskPanel.tsx
  - Add JSDoc comments explaining complex logic
  - Help future contributors understand the codebase
  - Why it's easy: No functional changes, just explanatory comments
# 1. Fork and clone
git clone https://github.com/YOUR_USERNAME/errloom.git
cd errloom
# 2. Create a branch
git checkout -b my-first-contribution
# 3. Make your changes and test
npm install
npm run dev
# 4. Commit and push
git add .
git commit -m "Add: brief description of your change"
git push origin my-first-contribution
# 5. Open a Pull Request on GitHub
Need help? Open an issue with the good first issue label or check our Code of Conduct and Issue Templates.
- 🎯 Interactive Scenarios - Real-world production debugging exercises from beginner to advanced
- 📊 Three Difficulty Levels - Beginner (5 scenarios), Intermediate (5 scenarios), Advanced (5 real incidents)
- 💻 Multiple Task Types - Multiple choice questions, code fixes, and interactive log analysis
- 🎨 Modern Dark Theme UI - Beautiful dark gradient interface with orange/red accents
- 🔧 Monaco Editor - Full-featured code editor with syntax highlighting (VS Code's editor)
- 📈 Progress Tracking - Automatic progress tracking with localStorage persistence
- ⏱️ Scenario Timer - Track your completion time and beat target times
- 🔄 Smart Navigation - Auto-navigate to next scenario after completion
- 📊 Analytics Integration - Built-in Vercel Analytics for usage tracking
- 🎓 Educational Resources - Curated links to additional learning materials
- 🏆 Real Incident Context - Learn from actual outages at Reddit, GitLab, Discord, AWS, and Cloudflare
- 🚀 No Setup Required - 100% browser-based, no cloud or backend needed
- 💾 Persistent Progress - Your progress is saved locally and persists across sessions
- The 404 Error (15 min)
  - Learn: HTTP, Logs, Routing
  - Debug a missing route causing 404 errors
- The Slow API (20 min)
  - Learn: Performance, Database Queries, N+1 Problem
  - Fix an N+1 query problem slowing down your homepage
- The Memory Leak (18 min)
  - Learn: Memory Management, Event Listeners, Cleanup
  - Track down event listeners causing memory leaks
- The CORS Error (17 min)
  - Learn: CORS, HTTP Headers, Browser Security
  - Understand why your API works in Postman but not in browsers
- The Missing Environment Variable (16 min)
  - Learn: Environment Variables, Configuration, Deployment
  - Fix production crashes caused by missing configuration
- Database Connection Pool Exhausted (25 min)
  - Learn: Connection Pooling, Resource Management, Database Scaling
  - Black Friday traffic spike! Your database is idle but all requests time out
- The Cache Stampede (28 min)
  - Learn: Caching Strategies, Cache Invalidation, Race Conditions
  - You cleared the cache. Now 10,000 users are hammering your database
- The Failed Deployment (22 min)
  - Learn: Deployment Strategies, Rollback Procedures, Database Migrations
  - Rolling back the code made things WORSE. The database still has the new schema
- The Message Queue Backup (24 min)
  - Learn: Message Queues, Async Processing, Backpressure
  - 8,000 welcome emails stuck in the queue. Users are waiting hours
- The Rate Limit Cascade (23 min)
  - Learn: Rate Limiting, API Dependencies, Circuit Breakers
  - Stripe is rate limiting you. User retries make it 8x worse
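To make the N+1 problem from "The Slow API" concrete, here is a minimal sketch in TypeScript. It is not taken from the Errloom scenario itself: `FakeDb`, its sample rows, and both `homepage*` functions are hypothetical names invented for illustration. The fake database simply counts how many queries it serves, so the two approaches can be compared.

```typescript
// Hypothetical in-memory "database" that counts every query it serves.
type Post = { id: number; authorId: number; title: string };
type Author = { id: number; name: string };

class FakeDb {
  queryCount = 0;
  private posts: Post[] = [
    { id: 1, authorId: 10, title: "Launch day" },
    { id: 2, authorId: 11, title: "Postmortem" },
    { id: 3, authorId: 10, title: "Scaling up" },
  ];
  private authors: Author[] = [
    { id: 10, name: "Ada" },
    { id: 11, name: "Grace" },
  ];

  listPosts(): Post[] {
    this.queryCount++;
    return [...this.posts];
  }

  findAuthor(id: number): Author | undefined {
    this.queryCount++;
    return this.authors.find((a) => a.id === id);
  }

  findAuthors(ids: number[]): Author[] {
    this.queryCount++;
    return this.authors.filter((a) => ids.includes(a.id));
  }
}

// N+1 pattern: one query for the list, then one more query per row.
function homepageNaive(db: FakeDb): string[] {
  return db.listPosts().map((post) => {
    const author = db.findAuthor(post.authorId);
    return `${post.title} by ${author?.name ?? "unknown"}`;
  });
}

// Batched pattern: one query for the list, one query for all authors.
function homepageBatched(db: FakeDb): string[] {
  const posts = db.listPosts();
  const ids = Array.from(new Set(posts.map((p) => p.authorId)));
  const byId = new Map<number, Author>();
  for (const author of db.findAuthors(ids)) {
    byId.set(author.id, author);
  }
  return posts.map(
    (post) => `${post.title} by ${byId.get(post.authorId)?.name ?? "unknown"}`
  );
}
```

With three posts, `homepageNaive` issues four queries and `homepageBatched` issues two. The gap grows linearly with the number of rows, which is why the problem tends to surface only under production-sized data.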
Real production disasters from major tech companies!
- Reddit's Kubernetes Meltdown (35 min)
  - Learn: Kubernetes, Auto-scaling, Resource Limits, Cascading Failures
  - Death spiral: 50 healthy pods → 500 crashing pods in 60 seconds
- GitLab's Accidental Database Deletion (30 min)
  - Learn: Database Replication, Backup & Recovery, Human Error Prevention
  - You ran rm -rf on production. 276GB deleted. Backups are broken. Now what?
- Discord's Redis Cascade Failure (32 min)
  - Learn: Redis Clustering, Cache Failures, Circuit Breakers, Graceful Degradation
  - @everyone to 5M users. 50K req/sec to one key. Redis melting down
- AWS S3 Cascade Failure (33 min)
  - Learn: Service Dependencies, Blast Radius, Circuit Breakers, Incident Response
  - Typo removed 500 servers. S3 down. Half the internet down. Can't report outage!
- Cloudflare's BGP Route Leak (30 min)
  - Learn: BGP Routing, Network Infrastructure, Global Outages, Internet Architecture
  - Small ISP misconfigured BGP. Global traffic routing through Pennsylvania
- Node.js 18+
- npm or yarn
# Clone the repository
git clone https://github.com/OSP06/errloom.git
cd errloom
# Install dependencies
npm install
# Start the development server
npm run dev
The dev server will be available at http://localhost:5173. The hosted version of the app lives at https://errloom.dev.
# Build the application
npm run build
# Preview the production build
npm run preview
errloom/
├── public/
│   └── scenarios/                        # YAML scenario definitions
│       ├── beginner/
│       │   ├── 404-error.yaml
│       │   ├── slow-api.yaml
│       │   ├── memory-leak.yaml
│       │   ├── cors-error.yaml
│       │   └── env-variable.yaml
│       ├── intermediate/
│       │   ├── db-pool-exhausted.yaml
│       │   ├── cache-stampede.yaml
│       │   ├── failed-deployment.yaml
│       │   ├── queue-backup.yaml
│       │   └── rate-limit-cascade.yaml
│       └── advanced/
│           ├── reddit-k8s-outage.yaml
│           ├── gitlab-data-loss.yaml
│           ├── discord-redis-failure.yaml
│           ├── aws-cascade-failure.yaml
│           └── cloudflare-bgp-hijack.yaml
├── src/
│   ├── components/                       # React components
│   │   ├── ScenarioPlayer.tsx            # Main scenario player with timer
│   │   ├── TaskPanel.tsx                 # Task handling (MCQ, code, logs)
│   │   ├── InteractiveLogViewer.tsx      # Clickable log entries
│   │   ├── RealIncidentContext.tsx       # Real incident information
│   │   ├── ScenarioTimer.tsx             # Timer and performance tracking
│   │   ├── CodeEditor.tsx                # Monaco editor wrapper
│   │   └── TabNavigation.tsx             # Tab navigation component
│   ├── lib/
│   │   ├── types.ts                      # TypeScript type definitions
│   │   ├── scenarioLoader.ts             # YAML scenario loader
│   │   └── progressStore.ts              # Zustand store for progress tracking
│   ├── pages/
│   │   ├── Landing.tsx                   # Landing page with dark theme
│   │   └── ScenarioList.tsx              # Scenario list with progress bar
│   ├── App.tsx
│   └── main.tsx                          # Entry point with Analytics
├── vercel.json                           # Vercel deployment configuration
├── package.json
├── vite.config.ts
├── tailwind.config.js
└── tsconfig.json
Easy Contribution Process: Just add a YAML file! The scenario manifest auto-generates during build, and CI automatically validates your submission.
We provide a comprehensive scenario template to get you started:
# Copy the template
cp scenarios/templates/scenario-template.yaml public/scenarios/beginner/my-scenario.yaml
# Edit your scenario
# See scenarios/templates/README.md for detailed guidance
Here's what a complete scenario looks like (e.g., public/scenarios/beginner/404-error.yaml):
public/scenarios/beginner/
└── 404-error.yaml # Single YAML file defines everything!
What gets tested when you submit:
- ✅ YAML syntax validation
- ✅ Required fields check (id, level, title, duration, description, teaches, context, tabs, tasks, completion)
- ✅ ID matches filename
- ✅ Level matches directory (beginner/intermediate/advanced)
- ✅ Task validation strings (must_contain) actually appear in solutions
- ✅ YAML linting for consistent formatting
All validation runs automatically via GitHub Actions when you open a PR!
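The checks listed above can be pictured with a small TypeScript sketch. This is not the project's actual CI script: `validateScenario`, the `Scenario` and `Task` types, and the error messages are hypothetical, and the sketch assumes the YAML has already been parsed into a plain object (YAML syntax checking and linting are left out).

```typescript
// Hypothetical shape of a parsed scenario; only the fields the checks need.
type Task = {
  type: string;
  solution?: string;
  validation?: { must_contain?: string[] };
};

type Scenario = {
  id?: string;
  level?: string;
  tasks?: Task[];
  [key: string]: unknown;
};

// Field names taken from the "Required fields check" item in the list above.
const REQUIRED_FIELDS = [
  "id", "level", "title", "duration", "description",
  "teaches", "context", "tabs", "tasks", "completion",
];

// filePath is expected to look like "public/scenarios/<level>/<id>.yaml".
function validateScenario(filePath: string, scenario: Scenario): string[] {
  const errors: string[] = [];
  const parts = filePath.split("/");
  const fileName = parts[parts.length - 1] ?? "";
  const directory = parts[parts.length - 2] ?? "";
  const fileId = fileName.replace(/\.yaml$/, "");

  for (const field of REQUIRED_FIELDS) {
    if (scenario[field] === undefined || scenario[field] === null) {
      errors.push(`missing required field: ${field}`);
    }
  }
  if (scenario.id !== fileId) {
    errors.push(`id "${String(scenario.id)}" does not match filename "${fileId}"`);
  }
  if (scenario.level !== directory) {
    errors.push(`level "${String(scenario.level)}" does not match directory "${directory}"`);
  }
  for (const task of scenario.tasks ?? []) {
    const solution = task.solution ?? "";
    for (const needle of task.validation?.must_contain ?? []) {
      if (!solution.includes(needle)) {
        errors.push(`must_contain string not found in solution: ${needle}`);
      }
    }
  }
  return errors;
}
```

An empty array means the scenario passed every check. Returning a list of messages rather than throwing on the first failure lets a contributor fix all problems in one pass.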
Create a new file in public/scenarios/{level}/{your-scenario-id}.yaml:
id: "your-scenario-id" # Must match filename (without .yaml)
level: "beginner" # beginner | intermediate | advanced
title: "Your Scenario Title"
duration: "20 minutes" # Estimated completion time
description: "Brief one-line description for the scenario list" # REQUIRED
teaches:
  - "Concept 1"
  - "Concept 2"
  - "Concept 3"
context: |
  Background story and setup for the scenario...
  Explain the situation the user is facing.
tabs:
  - name: "Error"
    type: "text"
    content: |
      Error message or information...
  - name: "Logs"
    type: "logs"
    content:
      - time: "10:23:45"
        level: "ERROR"
        message: "Error description"
        is_answer: true # Marks critical log entries
  - name: "Code"
    type: "code"
    content:
      language: "javascript" # javascript | python | yaml | etc.
      content: |
        // Code snippet
tasks:
  - type: "multiple-choice"
    question: "What's causing the issue?"
    options:
      - "Option 1"
      - "Option 2"
      - "Correct answer"
    correct: 2 # 0-based index
    explanation: |
      Explanation of the correct answer...
  - type: "code-fix"
    instructions: "Fix the code below"
    starting_code: |
      // Broken code here
    validation:
      must_contain:
        - "required string 1"
        - "required string 2"
    solution: |
      // Fixed code here
    explanation: |
      Explanation of the fix...
completion:
  summary: |
    # What You Learned
    Summary of key concepts...
  resources:
    - title: "Resource Title"
      url: "https://example.com"
# Optional: For real incident scenarios
real_incident:
  company: "Company Name"
  date: "YYYY-MM-DD"
  duration: "X hours"
  impact: "Impact description"
  cause: "Root cause"
  story: "Detailed story of the incident"
No code changes needed! When you run npm run build, the manifest automatically:
- Discovers your new scenario
- Extracts metadata (id, title, description, duration, teaches)
- Generates public/scenarios/index.json
- Updates scenario counts and durations on the landing page
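As a rough illustration of the manifest steps above, the following TypeScript sketch reduces already-parsed scenarios to their metadata and aggregates counts and durations per level. The real generate-manifest script and the actual layout of index.json may differ: `buildManifest`, `parseMinutes`, and the `Manifest` shape are assumptions made for this example.

```typescript
type Level = "beginner" | "intermediate" | "advanced";

// Metadata fields named in the list above, plus the level.
type ScenarioMeta = {
  id: string;
  level: Level;
  title: string;
  description: string;
  duration: string;
  teaches: string[];
};

// A parsed scenario carries more than metadata (tabs, tasks, ...).
type ParsedScenario = ScenarioMeta & { [key: string]: unknown };

// Hypothetical manifest layout; the real index.json may be organized differently.
type Manifest = {
  scenarios: ScenarioMeta[];
  counts: Record<Level, number>;
  totalMinutes: Record<Level, number>;
};

// Parses strings like "15 minutes" into 15; returns 0 when no number is found.
function parseMinutes(duration: string): number {
  const match = /(\d+)/.exec(duration);
  return match ? Number(match[1]) : 0;
}

function buildManifest(parsed: ParsedScenario[]): Manifest {
  const scenarios: ScenarioMeta[] = [];
  const counts: Record<Level, number> = { beginner: 0, intermediate: 0, advanced: 0 };
  const totalMinutes: Record<Level, number> = { beginner: 0, intermediate: 0, advanced: 0 };

  for (const { id, level, title, description, duration, teaches } of parsed) {
    // Keep only metadata so the manifest stays small; tabs and tasks are dropped.
    scenarios.push({ id, level, title, description, duration, teaches });
    counts[level] += 1;
    totalMinutes[level] += parseMinutes(duration);
  }
  return { scenarios, counts, totalMinutes };
}
```

Serializing the result with `JSON.stringify(manifest, null, 2)` would give a file the landing page can fetch once, instead of loading every scenario YAML just to show counts.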
# Generate manifest manually (optional - build does this automatically)
npm run generate-manifest
# Start dev server
npm run dev
# Your scenario will appear in the appropriate level!
Before submitting your scenario:
- id matches filename (without .yaml extension)
- description field is present (required for scenario list)
- duration is realistic (test your scenario)
- teaches array has 2-4 clear concepts
- All must_contain strings appear in your solution
- explanation fields are helpful and educational
- Test the scenario end-to-end
Here's a real example from 404-error.yaml:
File: public/scenarios/beginner/404-error.yaml
id: "404-error"
level: "beginner"
title: "The 404 Error"
duration: "15 minutes"
description: "Your first production bug. Learn to read logs and understand HTTP status codes."
teaches:
  - "HTTP Status Codes"
  - "Log Analysis"
  - "Routing Debugging"
context: |
  You just launched your first website! A user reports: "The About link is broken."
  Your job: Figure out what's wrong and fix it.
tabs:
  - name: "Error"
    type: "text"
    content: |
      404 Not Found
      The page you're looking for doesn't exist.
  - name: "Logs"
    type: "logs"
    content:
      - time: "10:24:12"
        level: "INFO"
        message: "GET /home → 200 OK"
      - time: "10:24:15"
        level: "ERROR"
        message: "GET /about → 404 Not Found"
        is_answer: true # This log entry is highlighted for learning
tasks:
  - type: "multiple-choice"
    question: "What does HTTP 404 mean?"
    options:
      - "Server error"
      - "Resource not found"
      - "Forbidden access"
    correct: 1 # 0-indexed: "Resource not found"
    explanation: |
      Correct! HTTP 404 means the requested resource was not found on the server.
  - type: "code-fix"
    instructions: "Fix the routing code to handle /about"
    starting_code: |
      app.get('/home', (req, res) => res.send('Home'));
      // Missing /about route!
    validation:
      must_contain: ["'/about'", "res.send"]
    solution: |
      app.get('/home', (req, res) => res.send('Home'));
      app.get('/about', (req, res) => res.send('About'));
    explanation: |
      You added the missing route handler for /about. Great job!
completion:
  summary: |
    # 🎉 Congratulations!
    You've debugged your first 404 error! Key lessons:
    1. HTTP 404 = Resource Not Found
    2. Always check server logs for error details
    3. Routing configuration must match expected URLs
Testing Your Scenario:
# 1. Create your scenario YAML file
vim public/scenarios/beginner/my-scenario.yaml
# 2. Generate manifest (or just build)
npm run generate-manifest
# 3. Start dev server
npm run dev
# 4. Navigate to http://localhost:5173/scenario/beginner/my-scenario
# 5. Complete all tasks to verify they work correctly
# 6. Check the completion time matches your estimate
What you'll see in the UI:
- Clean terminal-style interface with dark theme
- Tabs for Error, Logs, and Code
- Interactive log viewer with clickable entries
- Monaco code editor for code-fix tasks
- Timer tracking your completion time
- Progress saving automatically
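When you test tasks end-to-end, it helps to have a mental model of how an answer might be graded. The TypeScript sketch below assumes that multiple-choice tasks compare the selected index against the 0-based `correct` field and that code-fix tasks do a plain substring check against `must_contain`. Whether TaskPanel.tsx works exactly this way is an assumption; `gradeMultipleChoice` and `missingStrings` are hypothetical names.

```typescript
type MultipleChoiceTask = {
  type: "multiple-choice";
  options: string[];
  correct: number; // 0-based index into options
};

type CodeFixTask = {
  type: "code-fix";
  validation: { must_contain: string[] };
};

// True when the learner picked the option at the 0-based "correct" index.
function gradeMultipleChoice(task: MultipleChoiceTask, selectedIndex: number): boolean {
  return selectedIndex === task.correct;
}

// Returns the required strings that are still missing from the submitted code.
// An empty result means the submission passes the substring check.
function missingStrings(task: CodeFixTask, submittedCode: string): string[] {
  return task.validation.must_contain.filter(
    (needle) => !submittedCode.includes(needle)
  );
}
```

Under this model, a substring check also matches text inside comments, so it pays to choose `must_contain` strings that are unlikely to appear in the starting code. In the 404 example, `'/about'` (with quotes) works because the starting code only mentions /about unquoted.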
Each scenario includes metadata for better organization:
- Duration: Realistic time estimate (test it yourself!)
- Teaches: 2-4 key concepts (e.g., "Caching", "Database", "Networking")
- Level: beginner | intermediate | advanced
- Tags (coming soon): Filter scenarios by topic (Redis, Kubernetes, API, etc.)
See existing scenarios for reference:
- Beginner: 404-error.yaml - HTTP routing and logs
- Intermediate: cache-stampede.yaml - Cache invalidation and race conditions
- Advanced: reddit-k8s-outage.yaml - Real Kubernetes incident from Reddit 2018
- Frontend Framework: React 19.2 with TypeScript 5.9
- Build Tool: Vite 7.2 - Lightning fast HMR and builds
- Routing: React Router DOM 7.9 - Client-side routing
- State Management: Zustand 5.0 - Lightweight state management with persistence
- Styling: Tailwind CSS 4.1 - Utility-first CSS with dark theme
- Code Editor: Monaco Editor 4.7 - VS Code's powerful editor
- Icons: Lucide React 0.554 - Beautiful icon library
- YAML Parsing: js-yaml 4.1 - Scenario configuration parsing
- Analytics: Vercel Analytics 1.5 - Privacy-friendly analytics
- Deployment: Vercel - Serverless deployment platform
Contributions are welcome! We've built a contributor-friendly process with templates, automated validation, and clear guidelines.
- Add New Scenarios - Use our scenario template for beginner/intermediate/advanced scenarios
- Improve Existing Scenarios - Enhance explanations, add hints, or improve task descriptions
- Bug Fixes - Report bugs using our bug report template
- Feature Requests - Suggest improvements via feature request template
- Documentation - Improve documentation and examples
- Code Reviews - Help review scenario submissions from other contributors
# Fork the repository on GitHub, then clone your fork
git clone https://github.com/{your_username}/errloom.git
cd errloom
# Create a feature branch
git checkout -b feature/your-feature-name
# Install dependencies, then make your changes and test
npm install
npm run dev
# Run linting
npm run lint
# Build to verify
npm run build
# Commit and push
git add .
git commit -m "Add your feature description"
git push origin feature/your-feature-name
# Open a Pull Request using our PR template
When you open a PR:
- Automated CI Checks run automatically:
  - TypeScript type checking
  - ESLint code quality checks
  - Build verification
  - Scenario YAML validation (for scenario PRs)
- Peer Review: PRs are labeled for triage:
  - under review - Awaiting maintainer review
  - needs changes - Feedback provided, changes requested
  - accepted - Approved and ready to merge
- Scenario PRs: For new scenarios, include a brief test report in your PR:
  - Did you complete it end-to-end?
  - Does the completion time match the estimate?
  - Are validation strings working correctly?
Please read our Code of Conduct before contributing. We're committed to providing a welcoming and inclusive environment for all contributors.
Questions? Open an issue with the question label or check our Issue Templates for guidance.
Live site: The main Errloom site at errloom.dev is hosted on Vercel and automatically deploys when changes are pushed to main.
Errloom is a 100% static site with no backend required.
Quick Deploy to Vercel:
- Click the button above
- Connect your GitHub account
- Vercel auto-configures and deploys
- Future pushes to main auto-deploy (no manual steps needed!)
Build the static files and deploy to any host:
npm run build
# Output: dist/ folder with static HTML, CSS, JS
Compatible with: Netlify, GitHub Pages, AWS S3, Cloudflare Pages, Docker (nginx), or any static host.
Self-hosting benefits:
- Full control and customization
- No vendor lock-in (standard static files)
- Free on most platforms
- Optional: Remove analytics for complete privacy
Errloom automatically tracks your progress using localStorage:
- Persistent Progress: Your completion status is saved locally
- Cross-Session: Progress persists even after closing the browser
- Level-Based Tracking: Each difficulty level tracks separately
- Visual Indicators: Progress bars and completion badges
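The persistence behaviour described above can be sketched without any framework. The project itself uses a Zustand store (src/lib/progressStore.ts), so the hand-rolled version below is only an illustration of the idea: the storage key, the `Progress` shape, and the function names are all hypothetical. A tiny `MemoryStorage` stands in for `localStorage` so the sketch also runs outside a browser.

```typescript
// Minimal subset of the Web Storage API that the sketch relies on.
interface KeyValueStorage {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// In-memory stand-in for localStorage (useful in tests or Node).
class MemoryStorage implements KeyValueStorage {
  private data = new Map<string, string>();
  getItem(key: string): string | null {
    return this.data.get(key) ?? null;
  }
  setItem(key: string, value: string): void {
    this.data.set(key, value);
  }
}

type Level = "beginner" | "intermediate" | "advanced";
type Progress = Record<Level, string[]>; // completed scenario ids per level

const STORAGE_KEY = "errloom-progress"; // hypothetical key name

function loadProgress(storage: KeyValueStorage): Progress {
  const empty: Progress = { beginner: [], intermediate: [], advanced: [] };
  const raw = storage.getItem(STORAGE_KEY);
  if (raw === null) return empty;
  try {
    return { ...empty, ...(JSON.parse(raw) as Partial<Progress>) };
  } catch {
    return empty; // corrupted data falls back to a clean slate
  }
}

function markCompleted(storage: KeyValueStorage, level: Level, id: string): Progress {
  const progress = loadProgress(storage);
  if (!progress[level].includes(id)) {
    progress[level] = [...progress[level], id];
  }
  storage.setItem(STORAGE_KEY, JSON.stringify(progress));
  return progress;
}
```

In a browser, passing `window.localStorage` as the `storage` argument gives the cross-session behaviour described above, since it exposes the same `getItem` and `setItem` methods. Keeping one array per level mirrors the level-based tracking.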
- Color Palette: Dark theme with orange (#f97316) and red (#dc2626) accents
- Typography: System fonts with clear hierarchy
- Components: Consistent rounded corners, shadows, and hover states
- Responsive: Mobile-first design that scales to desktop
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by real-world production incidents and outage postmortems from Reddit, GitLab, Discord, AWS, and Cloudflare
- Built with modern web technologies and best practices
- Community-driven scenario contributions
- Powered by Vercel for deployment and analytics
If you find this project helpful, please consider:
- ⭐ Starring the repository
- 🐛 Reporting bugs or issues
- 💡 Suggesting new scenarios
- 🔗 Sharing with others
For questions, suggestions, or feedback:
- Open an issue on GitHub
- Email:[email protected]
Made with ❤️ for developers learning production debugging
🔥 Start debugging now at errloom.dev