π Building & operating scalable cloud-native systems
π€ Working on AI/ML & GPU infrastructure
π Focused on production reliability, incident response & troubleshooting
- π§ Design and operate Kubernetes clusters (EKS / GKE)
- π Handle production incidents and troubleshoot distributed systems
- π Build observability stacks (Prometheus, Grafana, Loki)
- π€ Support AI/ML workloads (LLMs, GPU infrastructure)
- βοΈ Automate infrastructure using Terraform & GitOps
- π Improve system performance, scalability, and cost efficiency



