Skip to content

thomasasamba-bot/01-self-healing-infrastructure

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🔧 AIOps Self-Healing Infrastructure

Autonomous cloud infrastructure that detects, diagnoses, and remediates failures without human intervention — reducing incident recovery time by 60%+ and sustaining 99.9%+ uptime.


Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        AWS Cloud (us-east-1)                        │
│                                                                     │
│   ┌──────────────────────────────────────────────────────────┐     │
│   │                    VPC (10.0.0.0/16)                     │     │
│   │                                                          │     │
│   │   ┌─────────────┐          ┌─────────────┐              │     │
│   │   │  Public      │          │  Public      │              │     │
│   │   │  Subnet AZ1  │          │  Subnet AZ2  │              │     │
│   │   │  10.0.1.0/24 │          │  10.0.2.0/24 │              │     │
│   │   └──────┬───────┘          └───────┬──────┘              │     │
│   │          │                          │                     │     │
│   │   ┌──────▼──────────────────────────▼──────┐              │     │
│   │   │       Auto Scaling Group (1–4)          │              │     │
│   │   │   ┌──────────┐    ┌──────────┐         │              │     │
│   │   │   │  EC2 #1  │    │  EC2 #2  │  ...    │              │     │
│   │   │   │ +CW Agent│    │ +CW Agent│         │              │     │
│   │   │   │ +SSM     │    │ +SSM     │         │              │     │
│   │   │   └────┬─────┘    └────┬─────┘         │              │     │
│   │   └────────┼───────────────┼───────────────┘              │     │
│   └────────────┼───────────────┼──────────────────────────────┘     │
│                │               │                                     │
│         Metrics│               │                                     │
│                ▼               ▼                                     │
│   ┌────────────────────────────────────┐                            │
│   │     CloudWatch                      │                            │
│   │  ┌──────────────────────────────┐  │                            │
│   │  │ Anomaly Detection Band (CPU) │  │                            │
│   │  │ Status Check Alarm           │  │                            │
│   │  │ High CPU Threshold Alarm     │  │                            │
│   │  └──────────────┬───────────────┘  │                            │
│   └─────────────────┼──────────────────┘                            │
│                     │ ALARM                                          │
│                     ▼                                                │
│   ┌─────────────────────────────────┐                               │
│   │         SNS Topic               │                               │
│   │   (self-healing-infra-alerts)   │                               │
│   └───────┬─────────────────────────┘                               │
│           │                   │                                      │
│    Lambda │                   │ Email                                │
│           ▼                   ▼                                      │
│   ┌───────────────┐   ┌──────────────┐                              │
│   │    Lambda     │   │  Operator    │                              │
│   │  remediate()  │   │  (optional)  │                              │
│   │               │   └──────────────┘                              │
│   │ 1. Diagnose   │                                                  │
│   │ 2. SSM Cmds   │──────SSM RunCommand──────► EC2 Instances        │
│   │ 3. Remediate  │                                                  │
│   │ 4. Notify     │──────SNS Outcome──────────► Operator            │
│   └───────────────┘                                                  │
└─────────────────────────────────────────────────────────────────────┘

Remediation Flow

CloudWatch Alarm FIRES
        │
        ▼
   SNS publishes
        │
        ▼
Lambda triggered
        │
        ├─► Get InService instances from ASG
        │
        ├─► Run SSM diagnostics (CPU, memory, services, disk)
        │
        ├─► Tier 1: Restart services (httpd, CW agent, clear cache)
        │
        ├─► Tier 2: Mark instance Unhealthy → ASG replaces it
        │
        └─► Publish outcome report to SNS

Features

Feature Implementation
Anomaly-based alerting CloudWatch Anomaly Detection Band on CPU
Status check monitoring CloudWatch StatusCheckFailed alarm
Autonomous remediation Lambda + SSM RunShellScript
Tiered response Restart → Mark Unhealthy → ASG replacement
Diagnostic collection CPU, memory, disk, service health via SSM
Auto-scaling Scale-out on sustained high CPU
Observability CloudWatch Dashboard with live alarm widgets
IaC — Terraform Full Terraform in /terraform
IaC — CloudFormation Full CFN template in /cloudformation
Chaos testing scripts/chaos_test.py simulates failures

Project Structure

01-self-healing-infrastructure/
├── terraform/
│   ├── main.tf          # VPC, EC2, ASG, Lambda, CloudWatch, SNS
│   ├── variables.tf     # All configurable inputs
│   └── outputs.tf       # Key resource references
├── cloudformation/
│   └── template.yaml    # Equivalent CloudFormation stack
├── lambda/
│   └── remediate.py     # Full remediation logic with diagnostics
├── scripts/
│   └── chaos_test.py    # Simulate CPU/memory/service/ASG failures
└── README.md

Quick Start

Prerequisites

  • AWS CLI configured (aws configure)
  • Terraform >= 1.3.0 OR AWS Console access for CloudFormation
  • Python 3.8+ (for chaos testing)

Deploy with Terraform

cd terraform

# Initialise
terraform init

# Preview changes
terraform plan -var="[email protected]"

# Deploy
terraform apply -var="[email protected]"

# Get outputs
terraform output

Deploy with CloudFormation

aws cloudformation create-stack \
  --stack-name self-healing-infra \
  --template-body file://cloudformation/template.yaml \
  --parameters \
      ParameterKey=ProjectName,ParameterValue=self-healing-infra \
      ParameterKey=AlertEmail,[email protected] \
  --capabilities CAPABILITY_NAMED_IAM \
  --region us-east-1

Run Chaos Tests

# Install dependencies
pip install boto3

# Get an instance ID from your ASG
INSTANCE_ID=$(aws autoscaling describe-auto-scaling-instances \
  --query 'AutoScalingInstances[0].InstanceId' --output text)

# Simulate CPU spike
python scripts/chaos_test.py --instance-id $INSTANCE_ID --test cpu

# Kill a service
python scripts/chaos_test.py --instance-id $INSTANCE_ID --test service-kill

# Mark an instance unhealthy (triggers ASG replacement)
python scripts/chaos_test.py \
  --asg-name self-healing-infra-asg --test mark-unhealthy

# Watch alarms in real-time
python scripts/chaos_test.py --project self-healing-infra --test watch

Configuration

Variable Default Description
aws_region us-east-1 Deployment region
instance_type t2.micro Free-tier eligible
asg_desired 2 Starting instance count
asg_min / asg_max 1 / 4 Scaling bounds
cpu_threshold 75 % CPU for scale-out
alert_email "" Email for notifications

Cost Estimate (Free Tier)

Resource Free Tier Cost if exceeded
EC2 t2.micro 750 hrs/month ~$0.0116/hr
Lambda 1M requests/month ~$0.20/1M
CloudWatch Alarms 10 free $0.10/alarm/month
SNS 1M publishes $0.50/1M
Estimated monthly $0 in free tier < $5/month

Key Design Decisions

Why SNS between CloudWatch and Lambda? Decoupling the alarm from the remediation function means multiple consumers can react to the same event — Lambda remediates, email notifies an operator, and future integrations (PagerDuty, Slack) can subscribe without changing the alarm configuration.

Why tiered remediation? Not every failure requires the nuclear option of terminating an instance. Restarting a crashed service is faster and cheaper than ASG replacement. The tier system mirrors how a skilled SRE would triage: least invasive first.

Why CloudWatch Anomaly Detection over static thresholds? Static thresholds (e.g. "alert at 80% CPU") generate false positives during expected load spikes — deployments, batch jobs, morning traffic. Anomaly Detection learns the baseline pattern and alerts only on genuinely unexpected behaviour, reducing alert fatigue.


Teardown

# Terraform
terraform destroy

# CloudFormation
aws cloudformation delete-stack --stack-name self-healing-infra

Related Projects


Built by Thomas Asamba | github.com/thomasasamba-bot

About

AIOps-driven self-healing AWS platform — CloudWatch anomaly detection → SNS → Lambda/SSM auto-remediation. Autonomously detects and recovers degraded EC2 instances without human intervention. 60%+ MTTR reduction. Terraform + CloudFormation + chaos testing scripts.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors