Write up: SRE Mini Production Stack on AWS
A production-style SRE environment on a single AWS EC2 VM that demonstrates ops work: Infrastructure as Code, monitoring, alerting, centralized logging, runbooks, and incident simulations.
- Terraform: EC2 + security group with safe exposure (app public, UIs IP-restricted)
- Flask service: /healthz, /metrics, plus controlled failure/latency endpoints
- Prometheus + Alertmanager: scrape + alert rules + severity routing
- Grafana: dashboards for Golden Signals and host health
- OpenSearch + Fluent Bit: centralized JSON logs + searchable incident evidence
- Runbooks + Postmortems: repeatable incident response artifacts
- Incident simulations: 5xx spike + latency spike with evidence across metrics/alerts/logs
- Public: App on port 80
- Restricted to your IP (via Security Group): Grafana 3000, Prometheus 9090, Alertmanager 9093, OpenSearch Dashboards 5601
- Not public: OpenSearch 9200 (bound to localhost in Docker)
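The exposure rules above can be spot-checked with a plain TCP connect test. The following is a minimal sketch, not part of the repo: the names `EXPECTED_PORTS`, `is_port_open`, and `exposure_report` are illustrative, and the port list is copied from the bullets above.

```python
import socket

# Ports taken from the exposure list above. Which ones should answer
# depends on where the check runs (allowed IP, other network, or the VM).
EXPECTED_PORTS = {
    80: "App",
    3000: "Grafana",
    9090: "Prometheus",
    9093: "Alertmanager",
    5601: "OpenSearch Dashboards",
    9200: "OpenSearch",
}


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def exposure_report(host: str, ports=EXPECTED_PORTS, timeout: float = 2.0) -> dict:
    """Map each port number to True (reachable) or False (closed or filtered)."""
    return {port: is_port_open(host, port, timeout) for port in ports}
```

Run from the allowed IP, `exposure_report("<PUBLIC_IP>")` should report every port except 9200 as reachable. Run from any other network, only port 80 should be reachable.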
Local machine:
- Terraform, AWS CLI, Git, SSH client
- AWS credentials configured (aws configure or env vars)
- An EC2 key pair (.pem)
- Your public IP (for /32 restrictions): curl -s https://checkip.amazonaws.com
AWS:
- Default region is ap-south-1 (change in terraform/variables.tf if needed)
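Because the /32 value ends up in a security group rule, it can be worth validating before it is handed to Terraform. A minimal sketch, assuming the checkip endpoint returns a bare IPv4 address followed by a newline; `to_cidr` and `fetch_public_cidr` are illustrative names, not part of the repo:

```python
import ipaddress
import urllib.request

CHECKIP_URL = "https://checkip.amazonaws.com"  # same endpoint used in the prerequisites


def to_cidr(raw_ip: str) -> str:
    """Turn a raw IPv4 string into a /32 CIDR; raises ValueError if it is not IPv4."""
    address = ipaddress.IPv4Address(raw_ip.strip())
    return f"{address}/32"


def fetch_public_cidr(url: str = CHECKIP_URL, timeout: float = 5.0) -> str:
    """Fetch the caller's public IP and return it as a /32 CIDR."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return to_cidr(response.read().decode("ascii"))
```

An empty or malformed response (for example an HTML error page from a captive portal) raises `ValueError` here instead of silently producing a broken `my_ip_cidr` value.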
cd terraform
terraform init
terraform apply \
-var="key_name=sre-mini-key" \
-var="my_ip_cidr=$(curl -s https://checkip.amazonaws.com)/32"
SSH into the VM:
ssh -i ~/.ssh/path/to/key.pem ubuntu@<PUBLIC_IP>
From your local machine, copy the repo to the VM:
scp -i ~/.ssh/path/to/key.pem -r \
app observability logging scripts runbooks postmortems docs README.md .gitignore \
ubuntu@<PUBLIC_IP>:/home/ubuntu/sre-mini-prod-aws
On the VM:
cd ~/sre-mini-prod-aws/observability
docker compose up -d --build
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
On the VM:
curl -s http://localhost/healthz
curl -s http://localhost/metrics | head
From your laptop:
- App: http://<PUBLIC_IP>/
- Grafana: http://<PUBLIC_IP>:3000/ (admin / admin123)
- Prometheus: http://<PUBLIC_IP>:9090/
- OpenSearch Dashboards: http://<PUBLIC_IP>:5601/
Create two dashboards in Grafana and store panel PromQL in observability/grafana/dashboards/README.md:
- Dashboard A (Golden Signals): Traffic, Errors, Latency p95, Saturation (CPU/Mem)
- Dashboard B (Host): CPU, Memory, Disk (/), Network RX/TX
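As a starting point for the panel queries, the sketch below pairs candidate PromQL expressions with a helper that builds a URL for Prometheus's instant-query endpoint (`/api/v1/query`). The metric and label names are assumptions: `http_requests_total`, its `status` label, and `http_request_duration_seconds_bucket` are hypothetical names for the Flask app's metrics, and `node_cpu_seconds_total` assumes node_exporter is part of the stack. Check the real names on /metrics before using them.

```python
from urllib.parse import urlencode

# Hypothetical metric names: replace with whatever the app and host
# exporter actually expose on their /metrics endpoints.
GOLDEN_SIGNAL_QUERIES = {
    "traffic_rps": "sum(rate(http_requests_total[5m]))",
    "error_ratio": (
        'sum(rate(http_requests_total{status=~"5.."}[5m]))'
        " / sum(rate(http_requests_total[5m]))"
    ),
    "latency_p95_seconds": (
        "histogram_quantile(0.95, "
        "sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
    ),
    "cpu_busy_ratio": '1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))',
}


def instant_query_url(prometheus_base: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint, /api/v1/query."""
    base = prometheus_base.rstrip("/")
    return base + "/api/v1/query?" + urlencode({"query": promql})
```

Opening `instant_query_url("http://<PUBLIC_IP>:9090", GOLDEN_SIGNAL_QUERIES["error_ratio"])` in a browser is a quick way to confirm that a query returns data before pasting it into a Grafana panel.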
On the VM, from ~/sre-mini-prod-aws:
chmod +x scripts/*.sh
bash scripts/load_normal.sh http://localhost
bash scripts/generate_5xx.sh http://localhost
bash scripts/generate_latency.sh http://localhost
Evidence to check:
- Grafana panels spike (errors or p95 latency)
- Prometheus alerts fire
- Alert receiver prints payloads: docker logs -f alert_receiver
- OpenSearch Discover shows logs (index pattern sre-mini-logs*, time field ts)
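The "Prometheus alerts fire" check can also be scripted against Prometheus's `/api/v1/alerts` endpoint, which is handy when capturing evidence for a postmortem. A minimal sketch, assuming Prometheus is reachable on localhost:9090 from the VM and that the alert rules set a `severity` label; the function names are illustrative:

```python
import json
import urllib.request


def firing_alerts(payload: dict) -> list:
    """Extract (alertname, severity) pairs for alerts in the 'firing' state."""
    alerts = payload.get("data", {}).get("alerts", [])
    result = []
    for alert in alerts:
        if alert.get("state") != "firing":
            continue  # skip alerts that are still pending
        labels = alert.get("labels", {})
        # The severity label is an assumption about how the rules are written.
        result.append((labels.get("alertname", "unknown"), labels.get("severity", "none")))
    return result


def fetch_firing_alerts(prometheus_base: str = "http://localhost:9090",
                        timeout: float = 5.0) -> list:
    """Query the Prometheus alerts API and return the firing alerts."""
    url = prometheus_base.rstrip("/") + "/api/v1/alerts"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return firing_alerts(json.load(response))
```

Calling `fetch_firing_alerts()` before and after `generate_5xx.sh` gives a simple before/after record to paste into the incident timeline.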
cd terraform
terraform destroy \
-var="key_name=<YOUR_KEYPAIR_NAME>" \
-var="my_ip_cidr=$(curl -s https://checkip.amazonaws.com)/32"