SRE Mini Production Stack on AWS

A production-style SRE environment on a single AWS EC2 instance that demonstrates day-to-day operations work: Infrastructure as Code, monitoring, alerting, centralized logging, runbooks, and incident simulations.

What’s inside

Terraform: EC2 + security group with safe exposure (app public, UIs IP-restricted)
Flask service: /healthz, /metrics, plus controlled failure/latency endpoints
Prometheus + Alertmanager: scrape + alert rules + severity routing
Grafana: dashboards for Golden Signals and host health
OpenSearch + Fluent Bit: centralized JSON logs + searchable incident evidence
Runbooks + Postmortems: repeatable incident response artifacts
Incident simulations: 5xx spike + latency spike with evidence across metrics/alerts/logs

Architecture (high level)

Public: App on port 80
Restricted to your IP (via Security Group): Grafana 3000, Prometheus 9090, Alertmanager 9093, OpenSearch Dashboards 5601
Not public: OpenSearch 9200 (bound to localhost in Docker)
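
The exposure rules above can be modelled as a small lookup table. This is only an illustrative sketch, not the Terraform in the repo: `INGRESS`, `MY_IP_CIDR`, and `is_reachable` are hypothetical names, and `203.0.113.7` is a documentation address standing in for your own IP.

```python
import ipaddress

# Hypothetical stand-in for the my_ip_cidr Terraform variable.
MY_IP_CIDR = "203.0.113.7/32"

# Port -> CIDRs allowed to reach it, mirroring the list above.
INGRESS = {
    80: ["0.0.0.0/0"],    # app: public
    3000: [MY_IP_CIDR],   # Grafana
    9090: [MY_IP_CIDR],   # Prometheus
    9093: [MY_IP_CIDR],   # Alertmanager
    5601: [MY_IP_CIDR],   # OpenSearch Dashboards
    # 9200 (OpenSearch) is deliberately absent: it is bound to localhost.
}


def is_reachable(port: int, source_ip: str) -> bool:
    """Return True if source_ip is inside any CIDR allowed for the port."""
    addr = ipaddress.ip_address(source_ip)
    return any(
        addr in ipaddress.ip_network(cidr) for cidr in INGRESS.get(port, [])
    )
```

For example, `is_reachable(3000, "198.51.100.20")` is false for an arbitrary outside address, while `is_reachable(80, "198.51.100.20")` is true because the app is public.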

Prerequisites

Local machine:

Terraform, AWS CLI, Git, SSH client
AWS credentials configured (aws configure or env vars)
An EC2 key pair (.pem)
Your public IP (for /32 restrictions): curl -s https://checkip.amazonaws.com
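
If you prefer to validate the address before passing it to Terraform, the following sketch does the same lookup as the `curl` command and rejects anything that is not a plain IPv4 address. The function names are hypothetical and not part of the repo.

```python
import ipaddress
import urllib.request


def to_host_cidr(ip_text: str) -> str:
    """Validate an IPv4 address string and return it as a /32 CIDR."""
    addr = ipaddress.IPv4Address(ip_text.strip())  # raises ValueError if invalid
    return f"{addr}/32"


def fetch_public_ip(
    url: str = "https://checkip.amazonaws.com", timeout: float = 5.0
) -> str:
    """Fetch the caller's public IP as plain text (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()
```

Calling `to_host_cidr(fetch_public_ip())` yields a value suitable for the `my_ip_cidr` variable used in the Quick start.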

AWS:

Default region is ap-south-1 (change in terraform/variables.tf if needed)

Quick start

1) Provision the VM (Terraform)

cd terraform
terraform init
terraform apply \
  -var="key_name=sre-mini-key" \
  -var="my_ip_cidr=$(curl -s https://checkip.amazonaws.com)/32"

SSH into the VM:

ssh -i ~/.ssh/path/to/key.pem ubuntu@<PUBLIC_IP>

2) Deploy the stack (Docker Compose)

From your local machine, copy the repo to the VM:

scp -i ~/.ssh/path/to/key.pem -r \
  app observability logging scripts runbooks postmortems docs README.md .gitignore \
  ubuntu@<PUBLIC_IP>:/home/ubuntu/sre-mini-prod-aws

On the VM:

cd ~/sre-mini-prod-aws/observability
docker compose up -d --build
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

3) Verify endpoints

On the VM:

curl -s http://localhost/healthz
curl -s http://localhost/metrics | head
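
To check the `/metrics` output programmatically rather than by eye, a minimal parser for the Prometheus text exposition format might look like the sketch below. It is an assumption-laden illustration: the metric names in `SAMPLE` are made up for the example and are not taken from the app.

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition lines into {series: value}.

    Minimal on purpose: skips comment lines (# HELP / # TYPE) and blank
    lines, and ignores the optional trailing timestamp.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            cut = line.rindex("}") + 1
            series, rest = line[:cut], line[cut:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if rest:
            samples[series] = float(rest[0])
    return samples


# Illustrative payload only; the real app's metric names may differ.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 42
http_requests_total{method="GET",status="500"} 3
process_cpu_seconds_total 1.5
"""

if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE).items():
        print(series, value)
```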

From your laptop:

App: http://<PUBLIC_IP>/
Grafana: http://<PUBLIC_IP>:3000/ (admin / admin123)
Prometheus: http://<PUBLIC_IP>:9090/
OpenSearch Dashboards: http://<PUBLIC_IP>:5601/

Dashboards

Create two dashboards in Grafana and record each panel's PromQL query in observability/grafana/dashboards/README.md:

Dashboard A (Golden Signals): Traffic, Errors, Latency p95, Saturation (CPU/Mem)
Dashboard B (Host): CPU, Memory, Disk /, Network RX/TX
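
For readers unfamiliar with how a "Latency p95" panel gets its number from histogram buckets, here is a simplified sketch of the linear interpolation that PromQL's `histogram_quantile` applies to cumulative buckets. The bucket bounds and counts are invented for illustration; the repo's actual histogram configuration is not shown in this README.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with (math.inf, total),
    like the `le` buckets of a Prometheus histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Quantile lands in the +Inf bucket: report the highest
                # finite upper bound instead of infinity.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Invented example: 100 requests, latency buckets in seconds.
EXAMPLE_BUCKETS = [(0.1, 50), (0.25, 80), (0.5, 94), (1.0, 99), (math.inf, 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.95, EXAMPLE_BUCKETS))  # about 0.6 seconds
```

With these buckets the 95th request falls in the 0.5 to 1.0 second bucket, one fifth of the way through it, so the estimate is roughly 0.6 seconds. The estimate can only be as precise as the bucket boundaries allow.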

Incident simulations

On the VM:

cd ~/sre-mini-prod-aws
chmod +x scripts/*.sh
bash scripts/load_normal.sh http://localhost
bash scripts/generate_5xx.sh http://localhost
bash scripts/generate_latency.sh http://localhost

Evidence to check:

Grafana panels spike (errors or p95 latency)
Prometheus alerts fire
Alert receiver prints payloads:
```
docker logs -f alert_receiver
```
OpenSearch Discover shows logs (index pattern sre-mini-logs*, time field ts)
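
For logs to be searchable with `ts` as the time field, the app has to emit one JSON object per line that includes that field. The sketch below shows one way to do this with Python's standard `logging` module. It is not the repo's actual logging code: apart from `ts`, every field name here (`level`, `logger`, `message`, `path`, `status`, `latency_ms`) is an assumption.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single-line JSON object with a `ts` field."""

    # Hypothetical request fields passed via logger.info(..., extra={...}).
    EXTRA_KEYS = ("path", "status", "latency_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.EXTRA_KEYS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name: str = "sre-mini-app") -> logging.Logger:
    """Return a logger that writes JSON lines to the console stream."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger().error("upstream failed", extra={"path": "/checkout", "status": 500})` then produces a line that a log shipper like Fluent Bit can forward without further parsing rules beyond JSON decoding.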

Cleanup (avoid AWS costs)

cd terraform
terraform destroy \
  -var="key_name=<YOUR_KEYPAIR_NAME>" \
  -var="my_ip_cidr=$(curl -s https://checkip.amazonaws.com)/32"
