Cloud DevOps & SRE Engineer with 3+ years of hands-on experience designing, automating, and operating cloud-native infrastructure at scale. Passionate about building resilient systems, automating everything, and driving observability from day one.
- Working extensively on AWS, GCP, Azure & Huawei Cloud – certified across all four
- Managing 50+ microservices on production Kubernetes clusters
- Built HA Kubernetes clusters from scratch on on-premises VMs
- Managing multiple K8s clusters using Rancher as a centralized control plane across on-prem and cloud
- GitOps with ArgoCD ApplicationSets – multi-app, multi-environment (dev/staging/prod) via Kustomize overlays
- Running PostgreSQL HA clusters in K8s using CloudNativePG (CNPG) with WAL archiving & PITR
- Performed zero-downtime Kubernetes cluster upgrades (v1.30 → v1.34) across 4 minor versions on bare-metal
- Deep expertise in full-stack observability – Prometheus, Grafana, Loki, Alertmanager & more
- Currently advancing skills in Platform Engineering & FinOps
- Ask me about K8s, Docker, CI/CD, IaC, Monitoring, SRE practices
- Reach me at [email protected]
| Cloud Provider | Certification |
|---|---|
| AWS | AWS Certified Solutions Architect |
| Google Cloud | GCP Associate Cloud Engineer |
| Microsoft Azure | Azure Fundamentals (AZ-900) |
| Huawei Cloud | HCCDA |
Designed and deployed a production-grade, highly available Kubernetes cluster on bare-metal VMs with multi-master setup, etcd clustering, and automated failover, all managed via Rancher.
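A multi-master setup like this is typically bootstrapped with a kubeadm `ClusterConfiguration`; a minimal sketch (the load-balancer endpoint and pod CIDR below are hypothetical and must match your environment):

```yaml
# kubeadm-config.yaml – used with `kubeadm init --config kubeadm-config.yaml --upload-certs`
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.30.0
# VIP / load balancer in front of all control-plane nodes (hypothetical hostname)
controlPlaneEndpoint: "k8s-api.lab.internal:6443"
etcd:
  local:
    dataDir: /var/lib/etcd      # stacked etcd, clustered across the masters
networking:
  podSubnet: 10.244.0.0/16      # must match the CNI plugin's expected CIDR
```

Additional control-plane nodes then join with `kubeadm join ... --control-plane`, which is what keeps etcd quorum and API availability during node failures.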
Deployed and managed Rancher as a centralized control plane for managing multiple Kubernetes clusters across on-premises and cloud environments.
- Imported and managed multiple K8s clusters (on-prem HA + cloud-managed) from a single Rancher dashboard
- Configured role-based access control (RBAC) across clusters, mapping teams to namespaces and projects with fine-grained permissions
- Used Rancher Projects to group namespaces and enforce resource quotas and network policies across clusters
- Managed cluster catalogs and Helm app deployments via Rancher Apps & Marketplace
- Monitored all clusters centrally using Rancher's integrated Prometheus & Grafana stack
- Used Rancher to perform node pool scaling, OS upgrades, and certificate rotation without touching kubeconfig directly
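Under the hood, Rancher project quotas translate into standard Kubernetes `ResourceQuota` objects propagated to each namespace in the project; a minimal sketch of what gets enforced (namespace and limits are hypothetical):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments-dev        # hypothetical namespace inside a Rancher project
spec:
  hard:
    requests.cpu: "10"           # total CPU requests allowed in the namespace
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "15"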
Built a complete observability stack: Prometheus (metrics), Grafana (dashboards), Loki + Promtail (logs), and Alertmanager (notifications via Slack/PagerDuty), plus Blackbox Exporter for external API/endpoint uptime monitoring.
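Blackbox probing uses the standard relabeling pattern: Prometheus rewrites each target into a `?target=` parameter on the exporter's `/probe` endpoint. A sketch of the scrape config (the endpoint URL and exporter service address are hypothetical):

```yaml
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]               # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://api.example.com/healthz   # hypothetical external endpoint
    relabel_configs:
      - source_labels: [__address__]   # move the real target into the URL param
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance         # keep the probed URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115   # assumed exporter service address
```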
Deployed and maintained a self-hosted Harbor registry with role-based access control, image vulnerability scanning, and replication policies integrated into CI/CD pipelines.
Configured ephemeral self-hosted runners on Kubernetes for secure, scalable CI/CD β reducing pipeline costs and enabling workloads that require access to private network resources.
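Assuming GitHub Actions with actions-runner-controller (the source does not name the CI system), ephemeral runners can be declared roughly like this; the repository name is hypothetical:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ephemeral-runners
spec:
  replicas: 2
  template:
    spec:
      repository: example-org/example-repo   # hypothetical repo
      ephemeral: true      # each runner pod handles exactly one job, then is replaced
      labels:
        - self-hosted
        - k8s
```

Ephemeral mode is what makes the runners safe for untrusted workloads: no state survives between jobs, while pods still reach private-network resources from inside the cluster.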
Implemented a GitOps platform with ArgoCD managing 50+ microservices across dev, staging, and production environments. Used ArgoCD ApplicationSets with Git directory generators to auto-deploy new apps from the repo structure – zero manual ArgoCD config per app. Each environment maps to a dedicated overlay in a monorepo (`apps/<service>/overlays/<env>/`) using Kustomize, with Helm charts for third-party dependencies. Sync waves enforce deployment ordering; health checks gate promotions between environments.
```
gitops-repo/
├── apps/
│   ├── payment-service/
│   │   ├── base/
│   │   └── overlays/
│   │       ├── dev/          # lower replicas, debug logging
│   │       ├── staging/      # mirror of prod resources
│   │       └── production/   # HPA, PDB, resource limits enforced
│   └── auth-service/ ...
├── infrastructure/
│   ├── monitoring/           # Prometheus, Grafana, Loki stack
│   ├── ingress/              # Nginx Ingress + cert-manager
│   └── cnpg/                 # CloudNativePG operator + clusters
└── applicationsets/          # ArgoCD ApplicationSet manifests
```
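The directory-generator pattern described above can be wired up with an ApplicationSet along these lines (repo URL is hypothetical; paths follow the `apps/<service>/overlays/<env>/` convention):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: microservices
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/example/gitops-repo.git   # hypothetical repo
        revision: main
        directories:
          - path: apps/*/overlays/*    # one Application per service/env overlay
  template:
    metadata:
      name: '{{path[1]}}-{{path.basename}}'   # e.g. payment-service-dev
    spec:
      project: default
      source:
        repoURL: https://github.com/example/gitops-repo.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path[1]}}'
      syncPolicy:
        automated:
          prune: true      # adding or deleting a directory adds/removes the app
          selfHeal: true
```

Adding a new service is then just a new directory in Git – the generator discovers it on the next refresh with no per-app ArgoCD configuration.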
Deployed and operated the CloudNativePG operator to run highly available PostgreSQL clusters natively inside Kubernetes, replacing external managed DB services for cost savings and full control.
- Provisioned primary + 2 replica PostgreSQL clusters with streaming replication and automatic failover
- Configured continuous WAL archiving to S3-compatible object storage for point-in-time recovery (PITR)
- Managed scheduled backups, connection pooling via PgBouncer, and TLS-encrypted client connections
- Integrated CNPG cluster credentials with an External Secrets Operator → HashiCorp Vault pipeline
- Monitored replication lag, WAL sender/receiver status, and backup freshness via dedicated Grafana dashboards (CNPG community dashboard)
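A CNPG cluster with the features listed above boils down to a single custom resource; a sketch (bucket, endpoint, and secret names are hypothetical):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main
spec:
  instances: 3                  # 1 primary + 2 streaming replicas, automatic failover
  storage:
    size: 50Gi
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:          # continuous WAL archiving for PITR
      destinationPath: s3://pg-backups/pg-main        # hypothetical bucket
      endpointURL: https://s3.example.internal        # S3-compatible endpoint (assumed)
      s3Credentials:
        accessKeyId:
          name: pg-backup-creds # hypothetical Secret, e.g. synced by ESO from Vault
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: pg-backup-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
```

The operator handles replication slots, promotion on failover, and certificate management; restores to a point in time are declared as a new `Cluster` with a `recovery` bootstrap source.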
Performed a zero-downtime, rolling in-place upgrade of a production on-premises Kubernetes cluster across 4 minor versions (1.30 → 1.31 → 1.32 → 1.33 → 1.34), following Kubernetes' one-minor-version-at-a-time policy.
Upgrade sequence per version:
1. Review API deprecations & release notes for each target version
2. Upgrade kubeadm on the first control-plane node → apply the new control-plane config
3. Upgrade remaining control-plane nodes (HA etcd stays healthy throughout)
4. Upgrade kubelet + kubectl on all control-plane nodes
5. Per worker: cordon → drain → upgrade kubeadm/kubelet/kubectl → uncordon
6. Validate: `kubectl get nodes`, pod health, etcd member list, CNI/CSI compatibility
- Pre-validated deprecated API removals (e.g., `policy/v1beta1` PodSecurityPolicy, gone in 1.25+; `flowcontrol.apiserver.k8s.io/v1beta2` in 1.32) and migrated manifests ahead of the upgrade
- Verified CNI plugin (Calico/Flannel) and CSI driver compatibility matrices before each hop
- Ran Rancher UI upgrade path in parallel for clusters managed via Rancher, using its built-in node drain + upgrade orchestration
- Validated workloads, Ingress, PVCs, and CNPG cluster health at every version boundary before proceeding
- SLO / SLA Definition – error budgets for every critical service
- Blameless Post-mortems – RCA docs after every incident
- Traffic Management – canary & blue-green deployments via K8s + Argo Rollouts
- Secrets Management – HashiCorp Vault + External Secrets Operator
- GitOps – ArgoCD for declarative, auditable deployments
- Capacity Planning – HPA / VPA / Cluster Autoscaler on cloud & on-prem
- Service Mesh – Istio for mTLS, traffic shaping & observability
- Security Hardening – Pod Security Admission, NetworkPolicies, image scanning
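The canary strategy mentioned above maps to an Argo Rollouts `Rollout` resource; a minimal sketch (service name, image, and traffic weights are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 4
  strategy:
    canary:
      steps:
        - setWeight: 20          # shift 20% of traffic to the new version
        - pause: {duration: 5m}  # soak while dashboards/alerts are watched
        - setWeight: 50
        - pause: {}              # indefinite pause = manual promotion gate
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:v2   # hypothetical image
```

Pairing the pauses with Prometheus-backed `AnalysisTemplate` checks turns the manual gate into automated promotion or rollback.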


