🔧 AIOps Self-Healing Infrastructure

Autonomous cloud infrastructure that detects, diagnoses, and remediates failures without human intervention — reducing incident recovery time by 60%+ and sustaining 99.9%+ uptime.

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        AWS Cloud (us-east-1)                        │
│                                                                     │
│   ┌──────────────────────────────────────────────────────────┐     │
│   │                    VPC (10.0.0.0/16)                     │     │
│   │                                                          │     │
│   │   ┌─────────────┐          ┌─────────────┐              │     │
│   │   │  Public      │          │  Public      │              │     │
│   │   │  Subnet AZ1  │          │  Subnet AZ2  │              │     │
│   │   │  10.0.1.0/24 │          │  10.0.2.0/24 │              │     │
│   │   └──────┬───────┘          └───────┬──────┘              │     │
│   │          │                          │                     │     │
│   │   ┌──────▼──────────────────────────▼──────┐              │     │
│   │   │       Auto Scaling Group (1–4)          │              │     │
│   │   │   ┌──────────┐    ┌──────────┐         │              │     │
│   │   │   │  EC2 #1  │    │  EC2 #2  │  ...    │              │     │
│   │   │   │ +CW Agent│    │ +CW Agent│         │              │     │
│   │   │   │ +SSM     │    │ +SSM     │         │              │     │
│   │   │   └────┬─────┘    └────┬─────┘         │              │     │
│   │   └────────┼───────────────┼───────────────┘              │     │
│   └────────────┼───────────────┼──────────────────────────────┘     │
│                │               │                                     │
│         Metrics│               │                                     │
│                ▼               ▼                                     │
│   ┌────────────────────────────────────┐                            │
│   │     CloudWatch                      │                            │
│   │  ┌──────────────────────────────┐  │                            │
│   │  │ Anomaly Detection Band (CPU) │  │                            │
│   │  │ Status Check Alarm           │  │                            │
│   │  │ High CPU Threshold Alarm     │  │                            │
│   │  └──────────────┬───────────────┘  │                            │
│   └─────────────────┼──────────────────┘                            │
│                     │ ALARM                                          │
│                     ▼                                                │
│   ┌─────────────────────────────────┐                               │
│   │         SNS Topic               │                               │
│   │   (self-healing-infra-alerts)   │                               │
│   └───────┬─────────────────────────┘                               │
│           │                   │                                      │
│    Lambda │                   │ Email                                │
│           ▼                   ▼                                      │
│   ┌───────────────┐   ┌──────────────┐                              │
│   │    Lambda     │   │  Operator    │                              │
│   │  remediate()  │   │  (optional)  │                              │
│   │               │   └──────────────┘                              │
│   │ 1. Diagnose   │                                                  │
│   │ 2. SSM Cmds   │──────SSM RunCommand──────► EC2 Instances        │
│   │ 3. Remediate  │                                                  │
│   │ 4. Notify     │──────SNS Outcome──────────► Operator            │
│   └───────────────┘                                                  │
└─────────────────────────────────────────────────────────────────────┘

Remediation Flow

CloudWatch Alarm FIRES
        │
        ▼
   SNS publishes
        │
        ▼
Lambda triggered
        │
        ├─► Get InService instances from ASG
        │
        ├─► Run SSM diagnostics (CPU, memory, services, disk)
        │
        ├─► Tier 1: Restart services (httpd, CW agent, clear cache)
        │
        ├─► Tier 2: Mark instance Unhealthy → ASG replaces it
        │
        └─► Publish outcome report to SNS

Features

Feature	Implementation
Anomaly-based alerting	CloudWatch Anomaly Detection Band on CPU
Status check monitoring	CloudWatch StatusCheckFailed alarm
Autonomous remediation	Lambda + SSM RunShellScript
Tiered response	Restart → Mark Unhealthy → ASG replacement
Diagnostic collection	CPU, memory, disk, service health via SSM
Auto-scaling	Scale-out on sustained high CPU
Observability	CloudWatch Dashboard with live alarm widgets
IaC — Terraform	Full Terraform in `/terraform`
IaC — CloudFormation	Full CFN template in `/cloudformation`
Chaos testing	`scripts/chaos_test.py` simulates failures

Project Structure

01-self-healing-infrastructure/
├── terraform/
│   ├── main.tf          # VPC, EC2, ASG, Lambda, CloudWatch, SNS
│   ├── variables.tf     # All configurable inputs
│   └── outputs.tf       # Key resource references
├── cloudformation/
│   └── template.yaml    # Equivalent CloudFormation stack
├── lambda/
│   └── remediate.py     # Full remediation logic with diagnostics
├── scripts/
│   └── chaos_test.py    # Simulate CPU/memory/service/ASG failures
└── README.md

Quick Start

Prerequisites

AWS CLI configured (aws configure)
Terraform >= 1.3.0 OR AWS Console access for CloudFormation
Python 3.8+ (for chaos testing)

Deploy with Terraform

cd terraform

# Initialise
terraform init

# Preview changes
terraform plan -var="[email protected]"

# Deploy
terraform apply -var="[email protected]"

# Get outputs
terraform output

Deploy with CloudFormation

aws cloudformation create-stack \
  --stack-name self-healing-infra \
  --template-body file://cloudformation/template.yaml \
  --parameters \
      ParameterKey=ProjectName,ParameterValue=self-healing-infra \
      ParameterKey=AlertEmail,[email protected] \
  --capabilities CAPABILITY_NAMED_IAM \
  --region us-east-1

Run Chaos Tests

# Install dependencies
pip install boto3

# Get an instance ID from your ASG
INSTANCE_ID=$(aws autoscaling describe-auto-scaling-instances \
  --query 'AutoScalingInstances[0].InstanceId' --output text)

# Simulate CPU spike
python scripts/chaos_test.py --instance-id $INSTANCE_ID --test cpu

# Kill a service
python scripts/chaos_test.py --instance-id $INSTANCE_ID --test service-kill

# Mark an instance unhealthy (triggers ASG replacement)
python scripts/chaos_test.py \
  --asg-name self-healing-infra-asg --test mark-unhealthy

# Watch alarms in real-time
python scripts/chaos_test.py --project self-healing-infra --test watch

Configuration

Variable	Default	Description
`aws_region`	`us-east-1`	Deployment region
`instance_type`	`t2.micro`	Free-tier eligible
`asg_desired`	`2`	Starting instance count
`asg_min` / `asg_max`	`1` / `4`	Scaling bounds
`cpu_threshold`	`75`	% CPU for scale-out
`alert_email`	`""`	Email for notifications

Cost Estimate (Free Tier)

Resource	Free Tier	Cost if exceeded
EC2 t2.micro	750 hrs/month	~$0.0116/hr
Lambda	1M requests/month	~$0.20/1M
CloudWatch Alarms	10 free	$0.10/alarm/month
SNS	1M publishes	$0.50/1M
Estimated monthly	$0 in free tier	< $5/month

Key Design Decisions

Why SNS between CloudWatch and Lambda? Decoupling the alarm from the remediation function means multiple consumers can react to the same event — Lambda remediates, email notifies an operator, and future integrations (PagerDuty, Slack) can subscribe without changing the alarm configuration.

Why tiered remediation? Not every failure requires the nuclear option of terminating an instance. Restarting a crashed service is faster and cheaper than ASG replacement. The tier system mirrors how a skilled SRE would triage: least invasive first.

Why CloudWatch Anomaly Detection over static thresholds? Static thresholds (e.g. "alert at 80% CPU") generate false positives during expected load spikes — deployments, batch jobs, morning traffic. Anomaly Detection learns the baseline pattern and alerts only on genuinely unexpected behaviour, reducing alert fatigue.

Teardown

# Terraform
terraform destroy

# CloudFormation
aws cloudformation delete-stack --stack-name self-healing-infra

Related Projects

02-observability-ml-alerting — Full-stack Prometheus/Grafana/ELK observability platform
03-secure-aws-infrastructure — IaC with KMS, IAM hardening, VPC security
04-kubernetes-orchestration — EKS with zero-downtime deployments
05-devsecops-pipeline — CI/CD with SonarQube and Trivy

Built by Thomas Asamba | github.com/thomasasamba-bot

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🔧 AIOps Self-Healing Infrastructure

Architecture

Remediation Flow

Features

Project Structure

Quick Start

Prerequisites

Deploy with Terraform

Deploy with CloudFormation

Run Chaos Tests

Configuration

Cost Estimate (Free Tier)

Key Design Decisions

Teardown

Related Projects

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
cloudformation		cloudformation
lambda		lambda
scripts		scripts
terraform		terraform
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

🔧 AIOps Self-Healing Infrastructure

Architecture

Remediation Flow

Features

Project Structure

Quick Start

Prerequisites

Deploy with Terraform

Deploy with CloudFormation

Run Chaos Tests

Configuration

Cost Estimate (Free Tier)

Key Design Decisions

Teardown

Related Projects

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages