🧠 AIOps Bootcamp

Transform from DevOps/SWE to Job-Ready AIOps Engineer in 8 Weeks
The most comprehensive open-source AIOps curriculum on the internet.

Quick Start • Resources • Roadmap • Why This? • Tech Stack • Capstones • Careers

💡 Why AIOps? Why Now?

"By 2026, 70% of enterprises will have adopted AIOps to automate IT operations, up from 10% in 2020." — Gartner

The operations landscape is shifting. Manual monitoring, alert-per-metric runbooks, and 3 AM on-call fire drills are being replaced by intelligent systems that detect anomalies before users notice, correlate 10,000 alerts into 1 actionable insight, and self-heal without human intervention.

This bootcamp is your bridge from traditional DevOps to the AI-powered future of operations.

📖 Read the full background story →

📊 Bootcamp at a Glance

mindmap
  root((AIOps Bootcamp))
    Foundations
      Observability
      Prometheus & Grafana
      OpenTelemetry
    Data Engineering
      Log Pipelines
      Metrics at Scale
      Time Series DBs
    Machine Learning
      Supervised & Unsupervised
      Anomaly Detection
      Deep Learning (LSTM)
    Operations Intelligence
      Alert Correlation
      Root Cause Analysis
      Auto-Remediation
    Generative AI
      LLM Agents
      RAG for Runbooks
      ChatOps Bots
    Cloud & Enterprise
      AWS Deployments
      Terraform IaC
      Glean Knowledge Graph
      Internal Developer Portals

📈 By the Numbers

Metric	Value
📅 Duration	8 weeks (320+ hours of content)
📆 Daily sessions	47+ structured learning days
📁 Total files	6,795 files across the repository
💻 Lines of code	1,980,000+ lines (Python, Terraform, Shell, YAML)
🐍 Python scripts	2,957 files
📝 Documentation	279 Markdown files (32,800+ lines)
📊 Mermaid diagrams	104 architecture & flow diagrams
☁️ Terraform configs	5 production-ready IaC modules
🐳 Docker configs	3 containerized environments
🧪 Hands-on projects	15+ end-to-end projects
🏗️ Capstone projects	6 enterprise-grade capstones

🚀 Quick Start

# 1. Clone the repository
git clone https://github.com/pxkundu/AIOps-Bootcamp.git
cd AIOps-Bootcamp

# 2. Check prerequisites
cat PREREQUISITES.md

# 3. Start your journey
cd week-01-fundamentals
cat README.md

📋 Prerequisites

Skill	Level	What You Need
🐍 Python	Intermediate	Functions, classes, pandas, numpy
🖥️ Linux/CLI	Basic	Terminal navigation, shell scripts
📦 Git/GitHub	Basic	Clone, commit, push, branches
☁️ Cloud	Foundational	VMs, containers, networking concepts
🔧 DevOps	Foundational	CI/CD concepts, basic monitoring

👉 Detailed Prerequisites Guide

📚 Reference & Resources

Curated docs beyond the weekly modules—books, cheatsheets, interview prep, and AI tooling.

Resource	What you'll find
Reading list	Books, papers, blogs, and official tool docs
Model Context Protocol (MCP)	MCP concepts, MCP hosts by LLM ecosystem, server discovery & registry, security
Claude Code CLI + MCP	Best practices for Claude Code, MCP power patterns, use-case architectures (Mermaid)
Cheatsheets	PromQL, Ansible, Docker, Kubernetes, and more
Interview prep	Common AIOps interview questions

🗺️ The 8-Week Roadmap

graph LR
    W1["🏗️ Week 1<br/>Foundations"]
    W2["📊 Week 2<br/>Data Eng"]
    W3["🧠 Week 3<br/>ML Basics"]
    W4["🔍 Week 4<br/>Anomaly Detection"]
    W5["⚡ Week 5<br/>Remediation"]
    W6["🔔 Week 6<br/>Alerting"]
    W7["🤖 Week 7<br/>Gen AI + LLMs"]
    W8["🏆 Week 8<br/>Capstones"]

    W1 --> W2 --> W3 --> W4 --> W5 --> W6 --> W7 --> W8

    style W1 fill:#6c5ce7,color:#fff
    style W2 fill:#0984e3,color:#fff
    style W3 fill:#00b894,color:#fff
    style W4 fill:#fdcb6e,color:#000
    style W5 fill:#e17055,color:#fff
    style W6 fill:#d63031,color:#fff
    style W7 fill:#6c5ce7,color:#fff
    style W8 fill:#2d3436,color:#fff

Week-by-Week Breakdown

🏗️ Week 1: Observability Fundamentals

"You can't fix what you can't see."

5 days covering the AIOps landscape, the 3 pillars of observability (metrics, logs, traces), tool evaluation, and hands-on instrumentation.

Day	Topic
1	Introduction to AIOps & Industry Landscape
2	Three Pillars: Metrics, Logs, Traces
3	The Observability Stack
4	Instrumentation & OpenTelemetry
5	Tool Evaluation & Comparison

🔧 Tools: Prometheus, Grafana, Jaeger, OpenTelemetry 📦 Start Week 1 →

📊 Week 2: Data Engineering for AIOps

"Data is the fuel for intelligent operations."

6 days mastering log pipelines, metrics processing, feature engineering, and time-series database architecture.

Day	Topic
1	Log Collection & Parsing
2	Storage & Analytics
3–4	Metrics Processing & Feature Engineering
5–6	Time Series Databases (Prometheus, InfluxDB)

🔧 Tools: ELK Stack, Fluentd, InfluxDB, Loki 📦 Start Week 2 →

🧠 Week 3: ML Fundamentals for Operations

"Teaching machines to understand your infrastructure."

5 days from statistical foundations to supervised/unsupervised learning, model evaluation, and AutoML — all with real ops data.

Day	Topic
1	Statistics & Probability for Ops
2	Exploratory Data Analysis (EDA)
3	Supervised Learning (Classification, Regression)
4	Unsupervised Learning (Clustering, PCA)
5	Model Evaluation & AutoML

🔧 Tools: scikit-learn, pandas, matplotlib, MLflow 📦 Start Week 3 →

🔍 Week 4: Anomaly Detection & Forecasting

"Detecting the unknown unknowns in your systems."

5 days covering time-series analysis, forecasting with ARIMA/Prophet, detection algorithms (Isolation Forest, DBSCAN), and deep learning with LSTM.

Day	Topic
1	Time Series Fundamentals
2	Forecasting (ARIMA, Prophet)
3	Detection Algorithms
4	Deep Learning (LSTM, Autoencoders)
5	Anomaly Detection Capstone

🔧 Tools: Prophet, PyTorch, LSTM, Isolation Forest 📦 Start Week 4 →
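
To make "detection" concrete, here is a minimal sketch that flags points deviating sharply from a trailing window. It uses a plain rolling z-score as a simpler stand-in, not the Isolation Forest, DBSCAN, or LSTM approaches this week covers. The function name, window size, threshold, and sample data are illustrative assumptions and are not taken from the bootcamp code.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilisation samples (percent); index 8 is a spike.
cpu = [41, 42, 40, 43, 41, 42, 40, 41, 95, 42, 41]
print(rolling_zscore_anomalies(cpu))  # [8]
```

A trailing window keeps the baseline local, but a spike also inflates the window's standard deviation for the next few points, which is one reason more robust detectors are usually preferred in practice.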

⚡ Week 5: Auto-Remediation & Self-Healing

"From 'knowing' to 'acting' — building the self-healing cloud."

5 days progressing from rule-based to context-aware to event-driven remediation, including reinforcement learning for control systems.

Day	Topic
1	Rule-Based Remediation
2	Context-Aware Decisions
3	Event-Driven Automation
4	RL for Control Systems
5	Self-Healing Capstone

🔧 Tools: Ansible, Kubernetes, Python, RL Agents 📦 Start Week 5 →
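
The rule-based starting point of this progression can be sketched as a list of condition/action pairs evaluated against a metrics snapshot. This is only an illustration under stated assumptions: the `Rule` class, the metric keys, the thresholds, and the action labels are all hypothetical, and the sketch only plans actions rather than executing anything.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    condition: Callable[[dict], bool]  # receives a metrics snapshot
    action: str                        # a label only; nothing is executed


# Hypothetical rules; thresholds are placeholders, not recommendations.
RULES = [
    Rule("disk-nearly-full", lambda m: m.get("disk_used_pct", 0) >= 90, "rotate_logs"),
    Rule("memory-pressure", lambda m: m.get("mem_used_pct", 0) >= 95, "restart_service"),
    Rule("high-error-rate", lambda m: m.get("error_rate", 0.0) > 0.05, "rollback_release"),
]


def plan_remediation(metrics, rules=RULES):
    """Return the actions whose rule conditions match, in rule order."""
    return [rule.action for rule in rules if rule.condition(metrics)]


snapshot = {"disk_used_pct": 93, "mem_used_pct": 71, "error_rate": 0.08}
print(plan_remediation(snapshot))  # ['rotate_logs', 'rollback_release']
```

Separating planning from execution makes a dry run possible before any action touches real infrastructure.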

🔔 Week 6: Intelligent Alerting & RCA

"From alert fatigue to actionable insights."

5 days + master project covering commercial tools (Datadog, Dynatrace), Grafana/Prometheus alerting, topology-aware RCA, and causal inference.

Day	Topic
1	Datadog Intelligent Monitoring
2	Dynatrace AI-Powered Ops
3	Grafana & Prometheus Alerting
4	Topology-Aware RCA
5	Causal Inference for RCA

🔧 Tools: Datadog, Dynatrace, Grafana, NetworkX 📦 Start Week 6 →
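
One simple reading of topology-aware RCA: among the services currently alerting, the likeliest root causes are those with no alerting dependency of their own. The sketch below uses a plain dictionary instead of NetworkX and only looks at direct dependencies. The function name, service names, and topology are hypothetical.

```python
def root_cause_candidates(depends_on, alerting):
    """Return alerting services that have no alerting direct dependency.

    `depends_on` maps each service to the services it calls directly.
    """
    alerting = set(alerting)
    candidates = []
    for service in sorted(alerting):
        upstream = depends_on.get(service, [])
        if not any(dep in alerting for dep in upstream):
            candidates.append(service)
    return candidates


# Hypothetical topology: web -> api -> (db, cache)
topology = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": [],
    "cache": [],
}
print(root_cause_candidates(topology, ["web", "api", "db"]))  # ['db']
```

Here three alerts collapse into a single candidate, `db`, because both `api` and `web` sit downstream of it.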

🤖 Week 7: Generative AI for AIOps

"LLMs meet infrastructure — the future is here."

7 days covering RAG for runbooks, LLM agents, ChatOps bots, and a complete game-based learning experience.

Day	Topic
1	AI-Powered Runbooks
2	Feedback Loops & Continuous Learning
3	LLM Integration for Ops
4	Autonomous LLM Agents
5	ChatOps & Slack Bots
6	AIOps Game-Based Learning
7	Gen AI Capstone

🔧 Tools: OpenAI, LangChain, RAG, ChromaDB 📦 Start Week 7 →
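
The retrieval half of RAG for runbooks can be illustrated without any external service: score each runbook against the incident description and hand the best match to the model as context. The sketch below uses bag-of-words cosine similarity as a stand-in for the embedding model and vector store (such as ChromaDB) that a real pipeline would use. The runbook texts and function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, runbooks, top_k=1):
    """Return the titles of the `top_k` runbooks most similar to the query."""
    q = Counter(tokenize(query))
    scored = sorted(
        runbooks.items(),
        key=lambda item: cosine(q, Counter(tokenize(item[1]))),
        reverse=True,
    )
    return [title for title, _ in scored[:top_k]]


# Hypothetical runbook snippets.
RUNBOOKS = {
    "disk-full": "When disk usage is high, rotate logs and clear the temp directory.",
    "db-failover": "If the primary database is unreachable, promote the replica.",
    "cert-expiry": "Renew the TLS certificate before it expires and reload the proxy.",
}
print(retrieve("primary database unreachable", RUNBOOKS))  # ['db-failover']
```

In a full pipeline the retrieved runbook text would be placed into the LLM prompt alongside the incident details. That generation step is omitted here.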

🏆 Week 8: Enterprise Capstone Projects

"Prove your skills with real-world enterprise solutions."

6 capstone projects deploying production-ready platforms on AWS using Glean, OpenClaw, Terraform, and Knowledge Graphs.

Day	Capstone
1	Glean Analytics & Pipeline Security
2	AWS Lightsail & OpenClaw AI
3	IDP: OpenWebUI + Bedrock on EC2
4	Enterprise Observability Hub
5	Knowledge Graph IDP on AWS
6	AI Transformation Platform

🔧 Tools: Glean, AWS, Terraform, ECS Fargate, RDS 📦 Start Week 8 →

🌟 What Makes This Bootcamp Different?

🏗️

Production-Ready

Not toy examples. Every project deploys on AWS with Terraform, Docker, and CI/CD — the way real companies do it.

🧠

AI-Native

From scikit-learn to LLM agents. This isn't just monitoring with dashboards — it's intelligence.

🎮

Gamified Learning

Quests, arena challenges, and interactive games keep you engaged. Learning AIOps should be fun.

💰

Enterprise-Grade

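Week 8 capstones build real enterprise platforms: IDPs, observability hubs, and knowledge graphs. The Knowledge Graph IDP capstone demonstrates a platform that can save an enterprise $48.6 million/year in recovered productivity.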

🛠️ Technology Universe

graph TD
    subgraph "Observability & Monitoring"
        PROM["Prometheus"]
        GRAF["Grafana"]
        JAEG["Jaeger"]
        OTEL["OpenTelemetry"]
        DD["Datadog"]
        DYN["Dynatrace"]
    end

    subgraph "Data & ML"
        SKL["scikit-learn"]
        PT["PyTorch"]
        PROPH["Prophet"]
        MLF["MLflow"]
        PD["pandas"]
    end

    subgraph "AI & LLMs"
        OAI["OpenAI"]
        LC["LangChain"]
        RAG["RAG Pipeline"]
        GLEAN["Glean KG"]
    end

    subgraph "Cloud & Infrastructure"
        AWS["AWS (ECS, RDS, S3)"]
        TF["Terraform"]
        K8S["Kubernetes"]
        DOCK["Docker"]
    end

    subgraph "Operations"
        ANS["Ansible"]
        PD2["PagerDuty"]
        SLACK["Slack"]
        JIRA["Jira"]
    end

📋 Full Technology Stack

Category	Technologies
Observability	Prometheus, Grafana, Jaeger, OpenTelemetry, Loki
Commercial APM	Datadog, Dynatrace
Logging	Elasticsearch, Fluentd, Kibana, Loki
ML/AI	scikit-learn, PyTorch, Prophet, ARIMA, Isolation Forest, LSTM
LLM/GenAI	OpenAI API, LangChain, RAG, ChromaDB, AWS Bedrock
Enterprise AI	Glean Knowledge Graph, Glean Connectors, Model Context Protocol (MCP)
Automation	Ansible, Python, Kubernetes Operators
IaC	Terraform, CloudFormation
Containers	Docker, Docker Compose, ECS Fargate
Cloud	AWS (EC2, ECS, RDS, S3, ALB, IAM, Bedrock, Lightsail)
Databases	PostgreSQL, InfluxDB, Redis, ChromaDB
CI/CD	GitHub Actions
Incident Mgmt	PagerDuty, Slack, Jira

🏆 Capstone Projects

Week 8 features six enterprise-grade platform builds plus a Day 7 governance capstone on safe AI implementation:

#	Project	Tech Stack	What You Build
1	Glean-SEC Analytics	Glean, Python, Flask	Security-focused data analytics with Glean connectors and MCP tools
2	Cloud AI with OpenClaw	AWS Lightsail, Bedrock	AI-powered operations platform on AWS with 3 AIOps use cases
3	IDP on EC2	OpenWebUI, Terraform, RDS	Internal Developer Platform with dual LLM support (OpenAI + Bedrock)
4	Observability Hub	Glean Connectors, Flask	Multi-source monitoring with alert correlation and MCP action server
5	Knowledge Graph IDP	Glean KG, ECS Fargate, RDS	Enterprise IDP on AWS implementing all 4 Knowledge Graph pillars
6	AI Transformation Platform	Glean WAI, Flask, Python	10-pillar maturity assessment with AI agents for sludge detection and ROI
7	AI Governance & Guardrails	Flask, YAML, policy-as-code	Governance control plane: guardrails, risk tiers, audit trail (replace stub LLM with your provider)

🤯 Things You Probably Didn't Know

Facts about the AIOps industry and this bootcamp that might surprise you.

💡	Fact
🔔	The average enterprise SRE team receives 4,000+ alerts per week — 95% of which are noise. Week 6 teaches you to fix that.
💸	A single hour of downtime costs Fortune 500 companies $300,000 on average (Gartner). AIOps reduces MTTR by 70%.
🔍	Engineers spend 1.7 hours every day searching for information across 80–200 SaaS tools. The Knowledge Graph IDP in Week 8 Day 5 solves this.
🧠	This bootcamp contains 1.98 million lines of code — more than the Linux kernel 1.0 release (176K lines).
📝	The 279 Markdown documentation files in this repo contain 32,800+ lines — equivalent to a 500-page technical book.
📊	There are 104 Mermaid architecture diagrams embedded in the docs — you could print a wall-sized architecture gallery.
🤖	Week 7's LLM agents can autonomously diagnose and remediate infrastructure incidents — no human in the loop.
🏗️	The Week 8 Knowledge Graph IDP demonstrates a platform that can save an enterprise $48.6 million/year in recovered productivity.
🎮	This bootcamp includes game-based learning — arena challenges, quests, and interactive scenarios to make learning AIOps fun.
⏱️	The entire bootcamp represents 320+ hours of structured learning — equivalent to a full semester university course.

📁 Repository Structure

AIOps-Bootcamp/
│
├── 📖 README.md                    ← You are here
├── 📖 BACKGROUND.md                ← Why AIOps matters
├── 📖 PREREQUISITES.md             ← What you need to start
├── 📖 CONTRIBUTING.md              ← How to contribute
│
├── 🏗️ week-01-fundamentals/        ← Observability & AIOps basics
│   ├── day-01-intro/               ← AIOps landscape
│   ├── day-02-pillars/             ← Metrics, Logs, Traces
│   ├── day-03-stack/               ← The observability stack
│   ├── day-04-instrumentation/     ← OpenTelemetry hands-on
│   ├── day-05-tools/               ← Tool evaluation
│   └── final-assessment/           ← Week 1 assessment
│
├── 📊 week-02-data-engineering/     ← Data pipelines for ops
│   ├── day-01-logs/                ← Log collection & parsing
│   ├── day-02-storage-analytics/   ← Storage solutions
│   ├── day-03-metrics/             ← Metrics processing
│   └── day-05-06-tsdb/             ← Time series databases
│
├── 🧠 week-03-ml-fundamentals/     ← ML for operations
│   ├── day-01-stats/               ← Statistics & probability
│   ├── day-02-eda/                 ← Exploratory data analysis
│   ├── day-03-supervised/          ← Classification & regression
│   ├── day-04-unsupervised/        ← Clustering & PCA
│   └── day-05-evaluation-automl/   ← Model eval & AutoML
│
├── 🔍 week-04-anomaly-detection/   ← Detecting the unknown unknowns
│   ├── day-01-time-series/         ← Time series fundamentals
│   ├── day-02-forecasting/         ← ARIMA & Prophet
│   ├── day-03-algorithms/          ← Isolation Forest, DBSCAN
│   ├── day-04-deep-learning/       ← LSTM & Autoencoders
│   └── day-05-capstone/            ← Detection capstone
│
├── ⚡ week-05-auto-healing/          ← Self-healing systems
│   ├── day-01-rule-based/          ← Rule-based remediation
│   ├── day-02-context/             ← Context-aware decisions
│   ├── day-03-event-driven/        ← Event-driven automation
│   ├── day-04-rl-control/          ← Reinforcement learning
│   └── day-05-capstone/            ← Remediation capstone
│
├── 🔔 week-06-alerting/            ← Intelligent alerting & RCA
│   ├── day-01-datadog/             ← Datadog monitoring
│   ├── day-02-dynatrace/           ← Dynatrace AI ops
│   ├── day-03-grafana-prom/        ← Grafana & Prometheus
│   ├── day-04-topology-rca/        ← Topology-aware RCA
│   ├── day-05-causality/           ← Causal inference
│   └── master-project/             ← Week 6 master project
│
├── 🤖 week-07-genai-ops/            ← Generative AI for AIOps
│   ├── day-01-runbooks/            ← AI-powered runbooks
│   ├── day-02-loops/               ← Feedback loops
│   ├── day-03-llm/                 ← LLM integration
│   ├── day-04-llm-agents/          ← Autonomous agents
│   ├── day-05-chatops/             ← ChatOps & Slack bots
│   ├── day-06-aiops-game/          ← Game-based learning
│   └── day-07-capstone/            ← Gen AI capstone
│
├── 🏆 week-08-capstone/            ← Enterprise capstone projects
│   ├── day-01-glean-analytics/     ← Glean security analytics
│   ├── day-02-openclaw-aws/        ← OpenClaw on AWS Lightsail
│   ├── day-03-idp-platform/        ← IDP: OpenWebUI + Bedrock
│   ├── day-04-observability/       ← Observability Hub
│   ├── day-05-knowledge-graph-idp/ ← Knowledge Graph IDP on AWS
│   ├── day-06-ai-transformation/   ← AI Transformation Platform
│   ├── day-07-ai-governance-guardrails/ ← AI Implementation & Guardrails
│   └── day-08-mcp-deep-dive/       ← Model Context Protocol (MCP) Deep Dive
│
├── 📚 resources/                    ← Reading list, MCP & Claude Code guides, cheatsheets, interview prep
├── 🐳 infrastructure/              ← Docker & K8s configs
└── 💬 community/                   ← Discussions & showcase

🎯 Learning Journey

graph TD
    START["🎯 START HERE"]
    
    START --> F["🏗️ FOUNDATIONS<br/>Weeks 1-2"]
    F --> ML["🧠 ML & DETECTION<br/>Weeks 3-4"]
    ML --> OPS["⚡ OPERATIONS<br/>Weeks 5-6"]
    OPS --> AI["🤖 AI & ENTERPRISE<br/>Weeks 7-8"]
    AI --> CERT["🏆 CERTIFICATE<br/>AIOps Engineer"]

    F -.- F1["Prometheus<br/>Grafana<br/>OpenTelemetry"]
    ML -.- ML1["scikit-learn<br/>LSTM<br/>Anomaly Detection"]
    OPS -.- OPS1["Auto-Remediation<br/>Alert Correlation<br/>RCA"]
    AI -.- AI1["LLM Agents<br/>Knowledge Graphs<br/>Enterprise IDPs"]

    style START fill:#6c5ce7,color:#fff
    style CERT fill:#00b894,color:#fff
    style F fill:#0984e3,color:#fff
    style ML fill:#fdcb6e,color:#000
    style OPS fill:#e17055,color:#fff
    style AI fill:#d63031,color:#fff

📊 Assessment & Certification

Component	Weight	Description
🔬 Weekly Labs	40%	Hands-on coding exercises each day
🏗️ Mini-Projects	30%	End-of-week integration projects
❓ Quizzes	15%	Conceptual understanding checks
👥 Peer Reviews	15%	Code review and collaboration

Complete all weeks and the capstone with an overall score of 70% or higher → 🎓 AIOps Engineer Certificate & GitHub Badge
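
As a worked example of how the weights combine, the sketch below computes an overall score from the four components. It assumes the 70% threshold applies to the weighted total, which the table implies but does not state outright. The component scores are hypothetical.

```python
# Weights in percent, taken from the assessment table above.
WEIGHTS = {"labs": 40, "mini_projects": 30, "quizzes": 15, "peer_reviews": 15}
PASS_MARK = 70  # assumed to apply to the weighted total


def weighted_score(scores):
    """Combine per-component scores (0-100) into one weighted percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical learner: strong labs, weaker quizzes and peer reviews.
scores = {"labs": 80, "mini_projects": 70, "quizzes": 60, "peer_reviews": 50}
total = weighted_score(scores)
print(total, total >= PASS_MARK)  # 69.5 False
```

Because labs and mini-projects carry 70% of the weight between them, this learner lands just below the threshold despite solid hands-on results.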

🎯 Career Outcomes

🧠

AIOps Engineer
$130K – $200K

🛠️

ML Platform Engineer
$140K – $210K

🔧

SRE (AI/ML Focus)
$135K – $195K

📊

Observability Engineer
$125K – $180K

Market Signal: AIOps job postings grew 42% YoY in 2025 (LinkedIn Economic Graph). The demand far outstrips supply of qualified engineers.

🤝 Contributing

We welcome contributions, whether that means fixing a typo, adding a new exercise, or contributing a whole week of content.

See CONTRIBUTING.md for guidelines.

📜 License

This project is licensed under the MIT License — see LICENSE for details.

Ready to become an AIOps Engineer?

⭐ Star this repo if you find it useful; it helps others discover it!
