thisis-Shitanshu/sre-mini-prod-aws

Write up: SRE Mini Production Stack on AWS

SRE Mini Production Stack on AWS

A mini production-like SRE environment on a single AWS EC2 VM that demonstrates core ops work: Infrastructure as Code, monitoring, alerting, centralized logging, runbooks, and incident simulations.

What’s inside

  • Terraform: EC2 + security group with safe exposure (app public, UIs IP-restricted)
  • Flask service: /healthz, /metrics, plus controlled failure/latency endpoints
  • Prometheus + Alertmanager: scrape + alert rules + severity routing
  • Grafana: dashboards for Golden Signals and host health
  • OpenSearch + Fluent Bit: centralized JSON logs + searchable incident evidence
  • Runbooks + Postmortems: repeatable incident response artifacts
  • Incident simulations: 5xx spike + latency spike with evidence across metrics/alerts/logs

Architecture (high level)

  • Public: App on port 80
  • Restricted to your IP (via Security Group): Grafana 3000, Prometheus 9090, Alertmanager 9093, OpenSearch Dashboards 5601
  • Not public: OpenSearch 9200 (bound to localhost in Docker)

Prerequisites

Local machine:

  • Terraform, AWS CLI, Git, SSH client
  • AWS credentials configured (aws configure or env vars)
  • An EC2 key pair (.pem)
  • Your public IP (for /32 restrictions): curl -s https://checkip.amazonaws.com
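
The `/32` value used for the IP restriction can be sanity-checked before it reaches Terraform, so that an empty or malformed response from the IP lookup does not end up in the security group. A minimal sketch, assuming a POSIX shell; `to_cidr32` is a hypothetical helper name and is not part of the repo:

```shell
# to_cidr32 (hypothetical helper): validate a dotted-quad IPv4 address
# and print it as a single-host CIDR block (a.b.c.d/32).
to_cidr32() {
  ip="$1"
  # Reject anything that is not four dot-separated groups of 1-3 digits.
  if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "not an IPv4 address: $ip" >&2
    return 1
  fi
  # Split on dots and reject any octet above 255.
  old_ifs="$IFS"; IFS=.
  set -- $ip
  IFS="$old_ifs"
  for octet in "$@"; do
    [ "$octet" -le 255 ] || { echo "octet out of range: $octet" >&2; return 1; }
  done
  printf '%s/32\n' "$ip"
}
```

One possible use is `MY_IP_CIDR="$(to_cidr32 "$(curl -s https://checkip.amazonaws.com)")"`, then passing `-var="my_ip_cidr=$MY_IP_CIDR"` to Terraform.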

AWS:

  • Default region is ap-south-1 (change in terraform/variables.tf if needed)

Quick start

1) Provision the VM (Terraform)

cd terraform
terraform init
terraform apply \
  -var="key_name=<YOUR_KEYPAIR_NAME>" \
  -var="my_ip_cidr=$(curl -s https://checkip.amazonaws.com)/32"

SSH into the VM:

ssh -i ~/.ssh/path/to/key.pem ubuntu@<PUBLIC_IP>

2) Deploy the stack (Docker Compose)

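From your local machine, create the target directory on the VM (scp with multiple sources needs it to exist), then copy the repo:

ssh -i ~/.ssh/path/to/key.pem ubuntu@<PUBLIC_IP> 'mkdir -p ~/sre-mini-prod-aws'
scp -i ~/.ssh/path/to/key.pem -r \
  app observability logging scripts runbooks postmortems docs README.md .gitignore \
  ubuntu@<PUBLIC_IP>:/home/ubuntu/sre-mini-prod-aws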

On the VM:

cd ~/sre-mini-prod-aws/observability
docker compose up -d --build
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
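
The status table is easy to skim, but a scripted check can flag containers that did not come up. A minimal sketch, assuming tab-separated `name<TAB>status` input as produced by `docker ps -a --format '{{.Names}}\t{{.Status}}'`; `not_running` is a hypothetical helper name:

```shell
# not_running (hypothetical helper): read "name<TAB>status" lines on stdin
# and print the names whose status does not start with "Up"
# (for example "Exited (1) ..." or "Restarting (1) ...").
not_running() {
  awk -F '\t' '$2 !~ /^Up/ { print $1 }'
}

# Example use on the VM (left as a comment so the sketch has no side effects):
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | not_running
```

Empty output means every container reports an "Up" status. This does not prove the service inside is healthy, which is what the endpoint checks in the next step are for.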

3) Verify endpoints

On the VM:

curl -s http://localhost/healthz
curl -s http://localhost/metrics | head

From your laptop:

  • App: http://<PUBLIC_IP>/
  • Grafana: http://<PUBLIC_IP>:3000/ (admin / admin123)
  • Prometheus: http://<PUBLIC_IP>:9090/
  • Alertmanager: http://<PUBLIC_IP>:9093/
  • OpenSearch Dashboards: http://<PUBLIC_IP>:5601/

Dashboards

Create two dashboards in Grafana and record the PromQL for each panel in observability/grafana/dashboards/README.md:

  • Dashboard A (Golden Signals): Traffic, Errors, Latency p95, Saturation (CPU/Mem)
  • Dashboard B (Host): CPU, Memory, Disk /, Network RX/TX

Incident simulations

On the VM:

cd ~/sre-mini-prod-aws
chmod +x scripts/*.sh
bash scripts/load_normal.sh http://localhost
bash scripts/generate_5xx.sh http://localhost
bash scripts/generate_latency.sh http://localhost

Evidence to check:

  • Grafana panels spike (errors or p95 latency)
  • Prometheus alerts fire
  • Alert receiver prints payloads: docker logs -f alert_receiver
  • OpenSearch Discover shows logs (index pattern sre-mini-logs*, time field ts)
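
The "alerts fire" check can also be done from the command line through Prometheus' HTTP API, which lists active alerts at `/api/v1/alerts`. A minimal sketch that counts firing alerts with plain text matching; `count_firing` is a hypothetical helper name, and a JSON-aware tool such as jq would be more robust if it is installed:

```shell
# count_firing (hypothetical helper): read the JSON body of Prometheus'
# /api/v1/alerts on stdin and print how many alerts are in state "firing".
# This is a text match, not a JSON parser: it counts "state":"firing" pairs.
count_firing() {
  { grep -oE '"state"[[:space:]]*:[[:space:]]*"firing"' || true; } | wc -l | tr -d ' '
}

# Example use on the VM (left as a comment so the sketch has no side effects):
#   curl -s http://localhost:9090/api/v1/alerts | count_firing
```

A result of `0` during a simulation usually means the alert is still pending (its `for:` duration has not elapsed) or the rule did not match.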

Cleanup (avoid AWS costs)

cd terraform
terraform destroy \
  -var="key_name=<YOUR_KEYPAIR_NAME>" \
  -var="my_ip_cidr=$(curl -s https://checkip.amazonaws.com)/32"
