Autonomous cloud infrastructure that detects, diagnoses, and remediates failures without human intervention — reducing incident recovery time by 60%+ and sustaining 99.9%+ uptime.
┌─────────────────────────────────────────────────────────────────────┐
│                        AWS Cloud (us-east-1)                        │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────┐       │
│  │                    VPC (10.0.0.0/16)                     │       │
│  │                                                          │       │
│  │      ┌─────────────┐            ┌─────────────┐          │       │
│  │      │ Public      │            │ Public      │          │       │
│  │      │ Subnet AZ1  │            │ Subnet AZ2  │          │       │
│  │      │ 10.0.1.0/24 │            │ 10.0.2.0/24 │          │       │
│  │      └──────┬──────┘            └──────┬──────┘          │       │
│  │             │                          │                 │       │
│  │      ┌──────▼──────────────────────────▼──────┐          │       │
│  │      │        Auto Scaling Group (1–4)        │          │       │
│  │      │   ┌──────────┐    ┌──────────┐         │          │       │
│  │      │   │ EC2 #1   │    │ EC2 #2   │  ...    │          │       │
│  │      │   │ +CW Agent│    │ +CW Agent│         │          │       │
│  │      │   │ +SSM     │    │ +SSM     │         │          │       │
│  │      │   └────┬─────┘    └────┬─────┘         │          │       │
│  │      └────────┼───────────────┼───────────────┘          │       │
│  └───────────────┼───────────────┼──────────────────────────┘       │
│                  │               │                                  │
│           Metrics│               │                                  │
│                  ▼               ▼                                  │
│        ┌────────────────────────────────────┐                       │
│        │             CloudWatch             │                       │
│        │  ┌──────────────────────────────┐  │                       │
│        │  │ Anomaly Detection Band (CPU) │  │                       │
│        │  │ Status Check Alarm           │  │                       │
│        │  │ High CPU Threshold Alarm     │  │                       │
│        │  └──────────────┬───────────────┘  │                       │
│        └─────────────────┼──────────────────┘                       │
│                          │ ALARM                                    │
│                          ▼                                          │
│         ┌─────────────────────────────────┐                         │
│         │            SNS Topic            │                         │
│         │   (self-healing-infra-alerts)   │                         │
│         └───────┬───────────────────┬─────┘                         │
│                 │                   │                               │
│          Lambda │                   │ Email                         │
│                 ▼                   ▼                               │
│         ┌───────────────┐    ┌──────────────┐                       │
│         │ Lambda        │    │ Operator     │                       │
│         │ remediate()   │    │ (optional)   │                       │
│         │               │    └──────────────┘                       │
│         │ 1. Diagnose   │                                           │
│         │ 2. SSM Cmds   │──────SSM RunCommand──────► EC2 Instances  │
│         │ 3. Remediate  │                                           │
│         │ 4. Notify     │──────SNS Outcome─────────► Operator       │
│         └───────────────┘                                           │
└─────────────────────────────────────────────────────────────────────┘
CloudWatch Alarm FIRES
│
▼
SNS publishes
│
▼
Lambda triggered
│
├─► Get InService instances from ASG
│
├─► Run SSM diagnostics (CPU, memory, services, disk)
│
├─► Tier 1: Restart services (httpd, CW agent, clear cache)
│
├─► Tier 2: Mark instance Unhealthy → ASG replaces it
│
└─► Publish outcome report to SNS
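The AWS calls behind the flow above can be sketched in Python. This is a minimal illustration, not the contents of lambda/remediate.py: the function names and the injected `autoscaling` and `ssm` client parameters are assumptions made for this example. The client methods used (`describe_auto_scaling_groups`, `send_command` with the `AWS-RunShellScript` document, `set_instance_health`) are standard boto3 calls.

```python
from typing import Any, List


def in_service_instances(autoscaling: Any, asg_name: str) -> List[str]:
    """Return the IDs of instances the ASG currently reports as InService."""
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )
    groups = response.get("AutoScalingGroups", [])
    if not groups:
        return []
    return [
        instance["InstanceId"]
        for instance in groups[0].get("Instances", [])
        if instance.get("LifecycleState") == "InService"
    ]


def send_shell_commands(ssm: Any, instance_ids: List[str], commands: List[str]) -> str:
    """Run shell commands on the instances via SSM and return the command ID."""
    response = ssm.send_command(
        InstanceIds=instance_ids,
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": commands},
    )
    return response["Command"]["CommandId"]


def mark_unhealthy(autoscaling: Any, instance_id: str) -> None:
    """Tier 2: flag the instance so the ASG terminates and replaces it."""
    autoscaling.set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Unhealthy",
        ShouldRespectGracePeriod=False,
    )
```

Inside a Lambda function the two clients would typically come from `boto3.client("autoscaling")` and `boto3.client("ssm")`. Passing them in as parameters keeps the functions easy to exercise with stand-in objects.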
| Feature | Implementation |
|---|---|
| Anomaly-based alerting | CloudWatch Anomaly Detection Band on CPU |
| Status check monitoring | CloudWatch StatusCheckFailed alarm |
| Autonomous remediation | Lambda + SSM Run Command (AWS-RunShellScript) |
| Tiered response | Restart → Mark Unhealthy → ASG replacement |
| Diagnostic collection | CPU, memory, disk, service health via SSM |
| Auto-scaling | Scale-out on sustained high CPU |
| Observability | CloudWatch Dashboard with live alarm widgets |
| IaC — Terraform | Full Terraform in /terraform |
| IaC — CloudFormation | Full CFN template in /cloudformation |
| Chaos testing | scripts/chaos_test.py simulates failures |
01-self-healing-infrastructure/
├── terraform/
│ ├── main.tf # VPC, EC2, ASG, Lambda, CloudWatch, SNS
│ ├── variables.tf # All configurable inputs
│ └── outputs.tf # Key resource references
├── cloudformation/
│ └── template.yaml # Equivalent CloudFormation stack
├── lambda/
│ └── remediate.py # Full remediation logic with diagnostics
├── scripts/
│ └── chaos_test.py # Simulate CPU/memory/service/ASG failures
└── README.md
- AWS CLI configured (aws configure)
- Terraform >= 1.3.0 OR AWS Console access for CloudFormation
- Python 3.8+ (for chaos testing)
cd terraform
# Initialise
terraform init
# Preview changes
terraform plan -var="alert_email=<your-email>"
# Deploy
terraform apply -var="alert_email=<your-email>"
# Get outputs
terraform output

aws cloudformation create-stack \
--stack-name self-healing-infra \
--template-body file://cloudformation/template.yaml \
--parameters \
ParameterKey=ProjectName,ParameterValue=self-healing-infra \
ParameterKey=AlertEmail,ParameterValue=<your-email> \
--capabilities CAPABILITY_NAMED_IAM \
--region us-east-1

# Install dependencies
pip install boto3
# Get an instance ID from your ASG
INSTANCE_ID=$(aws autoscaling describe-auto-scaling-instances \
--query 'AutoScalingInstances[0].InstanceId' --output text)
# Simulate CPU spike
python scripts/chaos_test.py --instance-id $INSTANCE_ID --test cpu
# Kill a service
python scripts/chaos_test.py --instance-id $INSTANCE_ID --test service-kill
# Mark an instance unhealthy (triggers ASG replacement)
python scripts/chaos_test.py \
--asg-name self-healing-infra-asg --test mark-unhealthy
# Watch alarms in real-time
python scripts/chaos_test.py --project self-healing-infra --test watch

| Variable | Default | Description |
|---|---|---|
| aws_region | us-east-1 | Deployment region |
| instance_type | t2.micro | Free-tier eligible |
| asg_desired | 2 | Starting instance count |
| asg_min / asg_max | 1 / 4 | Scaling bounds |
| cpu_threshold | 75 | % CPU for scale-out |
| alert_email | "" | Email for notifications |
| Resource | Free Tier | Cost if exceeded |
|---|---|---|
| EC2 t2.micro | 750 hrs/month | ~$0.0116/hr |
| Lambda | 1M requests/month | ~$0.20/1M |
| CloudWatch Alarms | 10 free | $0.10/alarm/month |
| SNS | 1M publishes | $0.50/1M |
| Estimated monthly | $0 in free tier | < $5/month |
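The EC2 row in the table above can be turned into a rough estimate. The sketch below is illustrative arithmetic only: the function name is hypothetical, the 730-hour month is an assumed average, and the 750-hour allowance and $0.0116/hr rate are taken from the table. It assumes the account is still eligible for the free tier and that the allowance is shared across all t2.micro instances.

```python
def ec2_monthly_cost(
    instance_count: int,
    hours_in_month: float = 730.0,   # assumed average month length
    free_hours: float = 750.0,       # free-tier allowance from the table
    hourly_rate: float = 0.0116,     # t2.micro on-demand rate from the table
) -> float:
    """Estimate the monthly EC2 charge in USD for always-on instances."""
    total_hours = instance_count * hours_in_month
    billable_hours = max(0.0, total_hours - free_hours)
    return round(billable_hours * hourly_rate, 2)


if __name__ == "__main__":
    for count in (1, 2):
        print(f"{count} instance(s): ${ec2_monthly_cost(count):.2f}")
```

Under these assumptions a single always-on instance stays inside the allowance. Two instances running for a full month (the asg_desired default) use about 1,460 instance-hours, which works out to roughly $8.24. The estimate in the table therefore fits best when the stack is torn down after testing or scaled to one instance.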
Why SNS between CloudWatch and Lambda? Decoupling the alarm from the remediation function means multiple consumers can react to the same event — Lambda remediates, email notifies an operator, and future integrations (PagerDuty, Slack) can subscribe without changing the alarm configuration.
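To illustrate what the Lambda subscriber receives, here is a minimal sketch of unpacking a CloudWatch alarm delivered through SNS. The function name and the returned dictionary shape are assumptions for this example. The event layout (`Records[].Sns.Message` carrying a JSON string with `AlarmName`, `NewStateValue`, and `NewStateReason`) follows the standard SNS-to-Lambda and CloudWatch alarm notification formats.

```python
import json
from typing import Any, Dict, List


def alarms_from_sns_event(event: Dict[str, Any]) -> List[Dict[str, str]]:
    """Extract alarm name, state and reason from an SNS-triggered Lambda event."""
    alarms = []
    for record in event.get("Records", []):
        # CloudWatch puts the alarm details in the SNS message body as JSON.
        message = json.loads(record["Sns"]["Message"])
        alarms.append(
            {
                "name": message.get("AlarmName", "unknown"),
                "state": message.get("NewStateValue", "unknown"),
                "reason": message.get("NewStateReason", ""),
            }
        )
    return alarms


def needs_remediation(alarm: Dict[str, str]) -> bool:
    """Only act when the alarm has entered ALARM, not on OK or INSUFFICIENT_DATA."""
    return alarm["state"] == "ALARM"
```

Because every subscriber gets the same message, an email subscription or a later chat integration can consume it independently of this parsing step.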
Why tiered remediation? Not every failure requires the nuclear option of terminating an instance. Restarting a crashed service is faster and cheaper than ASG replacement. The tier system mirrors how a skilled SRE would triage: least invasive first.
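The "least invasive first" rule can be expressed as a small decision function. This is a sketch under assumptions: the `Action` names, the inputs, and the escalation criteria (a failed status check goes straight to replacement, one restart attempt before escalating) are hypothetical choices for illustration, not the logic confirmed to be in lambda/remediate.py.

```python
from enum import Enum
from typing import Dict


class Action(Enum):
    NONE = "none"
    RESTART_SERVICES = "tier1-restart-services"
    REPLACE_INSTANCE = "tier2-mark-unhealthy"


def choose_action(
    status_check_failed: bool,
    services_ok: Dict[str, bool],
    restart_attempts: int,
    max_restarts: int = 1,  # assumed limit before escalating to Tier 2
) -> Action:
    """Pick the least invasive action that can plausibly fix the instance."""
    if status_check_failed:
        # A failed status check cannot be fixed by restarting a service.
        return Action.REPLACE_INSTANCE
    if all(services_ok.values()):
        return Action.NONE
    if restart_attempts < max_restarts:
        return Action.RESTART_SERVICES
    # Restarts were already tried and a service is still down: escalate.
    return Action.REPLACE_INSTANCE
```

For example, `choose_action(False, {"httpd": False}, restart_attempts=0)` selects the Tier 1 restart, while the same call with `restart_attempts=1` escalates to replacement.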
Why CloudWatch Anomaly Detection over static thresholds? Static thresholds (e.g. "alert at 80% CPU") generate false positives during expected load spikes — deployments, batch jobs, morning traffic. Anomaly Detection learns the baseline pattern and alerts only on genuinely unexpected behaviour, reducing alert fatigue.
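The difference between a band and a fixed threshold can be shown with a toy calculation. This is a simplified stand-in, not CloudWatch's model: CloudWatch trains on the metric's history and accounts for trend and seasonality, whereas the sketch below just uses the mean and standard deviation of recent samples from a comparable period. The function names and sample values are made up for illustration.

```python
from statistics import mean, stdev
from typing import List, Tuple


def anomaly_band(history: List[float], width: float = 2.0) -> Tuple[float, float]:
    """Return (low, high) as mean +/- width standard deviations of the history."""
    centre = mean(history)
    spread = stdev(history)
    return centre - width * spread, centre + width * spread


def is_anomalous(value: float, history: List[float], width: float = 2.0) -> bool:
    """True when the value falls outside the band learned from the history."""
    low, high = anomaly_band(history, width)
    return value < low or value > high


if __name__ == "__main__":
    static_threshold = 80.0
    morning_peak = [82, 85, 88, 84, 86, 83, 87]  # CPU % at previous peaks
    overnight = [9, 11, 10, 12, 8, 10, 10]       # CPU % on previous nights

    # 88% exceeds the static threshold but is normal for the morning peak.
    print(88 > static_threshold, is_anomalous(88, morning_peak))
    # 45% is below the static threshold but unusual for overnight load.
    print(45 > static_threshold, is_anomalous(45, overnight))
```

The first case is the false positive a static threshold produces during expected load, and the second is unexpected behaviour that a static threshold misses entirely.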
# Terraform
terraform destroy
# CloudFormation
aws cloudformation delete-stack --stack-name self-healing-infra

- 02-observability-ml-alerting: Full-stack Prometheus/Grafana/ELK observability platform
- 03-secure-aws-infrastructure: IaC with KMS, IAM hardening, VPC security
- 04-kubernetes-orchestration: EKS with zero-downtime deployments
- 05-devsecops-pipeline: CI/CD with SonarQube and Trivy
Built by Thomas Asamba | github.com/thomasasamba-bot