Cloud DevOps & SRE Engineer with 3+ years of hands-on experience designing, automating, and operating cloud-native infrastructure at scale. Passionate about building resilient systems, automating everything, and driving observability from day one.
- Working extensively on AWS, GCP, Azure & Huawei Cloud – certified across all four
- Managing 50+ microservices on production Kubernetes clusters
- Built HA Kubernetes clusters from scratch on on-premises VMs
- Managing multiple K8s clusters using Rancher as a centralized control plane across on-prem and cloud
- GitOps with ArgoCD ApplicationSets – multi-app, multi-environment (dev/staging/prod) via Kustomize overlays
- Running PostgreSQL HA clusters in K8s using CloudNativePG (CNPG) with WAL archiving & PITR
- Performed zero-downtime Kubernetes cluster upgrades (v1.30 → v1.34) across 4 minor versions on bare-metal
- Deep expertise in full-stack observability – Prometheus, Grafana, Loki, Alertmanager & more
- Currently advancing skills in Platform Engineering & FinOps
- Ask me about K8s, Docker, CI/CD, IaC, Monitoring, SRE practices
- Reach me at [email protected]
| Cloud Provider | Certification |
|---|---|
| AWS | AWS Certified Solutions Architect |
| Google Cloud | GCP Associate Cloud Engineer |
| Microsoft Azure | Azure Fundamentals (AZ-900) |
| Huawei Cloud | HCCDA |
Designed and deployed a production-grade, highly available Kubernetes cluster on bare-metal VMs with multi-master setup, etcd clustering, and automated failover, all managed via Rancher.
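A multi-master setup like this is typically bootstrapped with a kubeadm `ClusterConfiguration`; a minimal sketch (the load-balancer endpoint and pod CIDR below are hypothetical and must match your environment):

```yaml
# kubeadm-config.yaml – used with `kubeadm init --config kubeadm-config.yaml --upload-certs`
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.30.0
# VIP / load balancer in front of all control-plane nodes (hypothetical hostname)
controlPlaneEndpoint: "k8s-api.lab.internal:6443"
etcd:
  local:
    dataDir: /var/lib/etcd      # stacked etcd, clustered across the masters
networking:
  podSubnet: 10.244.0.0/16      # must match the CNI plugin's expected CIDR
```

Additional control-plane nodes then join with `kubeadm join ... --control-plane`, which is what keeps etcd quorum and API availability during node failures.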
Deployed and managed Rancher as a centralized control plane for managing multiple Kubernetes clusters across on-premises and cloud environments.
- Imported and managed multiple K8s clusters (on-prem HA + cloud-managed) from a single Rancher dashboard
- Configured role-based access control (RBAC) across clusters, mapping teams to namespaces and projects with fine-grained permissions
- Used Rancher Projects to group namespaces and enforce resource quotas and network policies across clusters
- Managed cluster catalogs and Helm app deployments via Rancher Apps & Marketplace
- Monitored all clusters centrally using Rancher's integrated Prometheus & Grafana stack
- Used Rancher to perform node pool scaling, OS upgrades, and certificate rotation without touching kubeconfig directly
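Under the hood, Rancher project quotas translate into standard Kubernetes `ResourceQuota` objects propagated to each namespace in the project; a minimal sketch of what gets enforced (namespace and limits are hypothetical):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments-dev        # hypothetical namespace inside a Rancher project
spec:
  hard:
    requests.cpu: "10"           # total CPU requests allowed in the namespace
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "15"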
Built a complete observability stack: Prometheus (metrics), Grafana (dashboards), Loki + Promtail (logs), and Alertmanager (notifications via Slack/PagerDuty), plus Blackbox Exporter for external API/endpoint uptime monitoring.
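Blackbox probing uses the standard relabeling pattern: Prometheus rewrites each target into a `?target=` parameter on the exporter's `/probe` endpoint. A sketch of the scrape config (the endpoint URL and exporter service address are hypothetical):

```yaml
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]               # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://api.example.com/healthz   # hypothetical external endpoint
    relabel_configs:
      - source_labels: [__address__]   # move the real target into the URL param
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance         # keep the probed URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115   # assumed exporter service address
```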
Deployed and maintained a self-hosted Harbor registry with role-based access control, image vulnerability scanning, and replication policies integrated into CI/CD pipelines.
Configured ephemeral self-hosted runners on Kubernetes for secure, scalable CI/CD β reducing pipeline costs and enabling workloads that require access to private network resources.
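Assuming GitHub Actions with actions-runner-controller (the source does not name the CI system), ephemeral runners can be declared roughly like this; the repository name is hypothetical:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ephemeral-runners
spec:
  replicas: 2
  template:
    spec:
      repository: example-org/example-repo   # hypothetical repo
      ephemeral: true      # each runner pod handles exactly one job, then is replaced
      labels:
        - self-hosted
        - k8s
```

Ephemeral mode is what makes the runners safe for untrusted workloads: no state survives between jobs, while pods still reach private-network resources from inside the cluster.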
Implemented a GitOps platform with ArgoCD managing 50+ microservices across dev, staging, and production environments. Used ArgoCD ApplicationSets with Git directory generators to auto-deploy new apps from the repo structure – zero manual ArgoCD config per app. Each environment maps to a dedicated overlay in a monorepo (`apps/<service>/overlays/<env>/`) using Kustomize, with Helm charts for third-party dependencies. Sync waves enforce deployment ordering; health checks gate promotions between environments.
```
gitops-repo/
├── apps/
│   ├── payment-service/
│   │   ├── base/
│   │   └── overlays/
│   │       ├── dev/          # lower replicas, debug logging
│   │       ├── staging/      # mirror of prod resources
│   │       └── production/   # HPA, PDB, resource limits enforced
│   └── auth-service/ ...
├── infrastructure/
│   ├── monitoring/           # Prometheus, Grafana, Loki stack
│   ├── ingress/              # Nginx Ingress + cert-manager
│   └── cnpg/                 # CloudNativePG operator + clusters
└── applicationsets/          # ArgoCD ApplicationSet manifests
```
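The directory-generator pattern described above can be wired up with an ApplicationSet along these lines (repo URL is hypothetical; paths follow the `apps/<service>/overlays/<env>/` convention):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: microservices
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/example/gitops-repo.git   # hypothetical repo
        revision: main
        directories:
          - path: apps/*/overlays/*    # one Application per service/env overlay
  template:
    metadata:
      name: '{{path[1]}}-{{path.basename}}'   # e.g. payment-service-dev
    spec:
      project: default
      source:
        repoURL: https://github.com/example/gitops-repo.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path[1]}}'
      syncPolicy:
        automated:
          prune: true      # adding or deleting a directory adds/removes the app
          selfHeal: true
```

Adding a new service is then just a new directory in Git – the generator discovers it on the next refresh with no per-app ArgoCD configuration.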
Deployed and operated the CloudNativePG operator to run highly available PostgreSQL clusters natively inside Kubernetes, replacing external managed DB services for cost savings and full control.
- Provisioned primary + 2 replica PostgreSQL clusters with streaming replication and automatic failover
- Configured continuous WAL archiving to S3-compatible object storage for point-in-time recovery (PITR)
- Managed scheduled backups, connection pooling via PgBouncer, and TLS-encrypted client connections
- Integrated CNPG cluster credentials with an External Secrets Operator → HashiCorp Vault pipeline
- Monitored replication lag, WAL sender/receiver status, and backup freshness via dedicated Grafana dashboards (CNPG community dashboard)
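A CNPG cluster with the features listed above boils down to a single custom resource; a sketch (bucket, endpoint, and secret names are hypothetical):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main
spec:
  instances: 3                  # 1 primary + 2 streaming replicas, automatic failover
  storage:
    size: 50Gi
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:          # continuous WAL archiving for PITR
      destinationPath: s3://pg-backups/pg-main        # hypothetical bucket
      endpointURL: https://s3.example.internal        # S3-compatible endpoint (assumed)
      s3Credentials:
        accessKeyId:
          name: pg-backup-creds # hypothetical Secret, e.g. synced by ESO from Vault
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: pg-backup-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
```

The operator handles replication slots, promotion on failover, and certificate management; restores to a point in time are declared as a new `Cluster` with a `recovery` bootstrap source.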
Performed a zero-downtime, rolling in-place upgrade of a production on-premises Kubernetes cluster across 4 minor versions (1.30 → 1.31 → 1.32 → 1.33 → 1.34), following Kubernetes' one-minor-version-at-a-time policy.
Upgrade sequence per version:
1. Review API deprecations & release notes for each target version
2. Upgrade kubeadm on the first control-plane node → apply the new control-plane config
3. Upgrade remaining control-plane nodes (HA etcd stays healthy throughout)
4. Upgrade kubelet + kubectl on all control-plane nodes
5. Per worker: cordon → drain → upgrade kubeadm/kubelet/kubectl → uncordon
6. Validate: `kubectl get nodes`, pod health, etcd member list, CNI/CSI compatibility
- Pre-validated deprecated API removals (e.g., `policy/v1beta1` PodSecurityPolicy, gone in 1.25+; `flowcontrol.apiserver.k8s.io/v1beta2` in 1.32) and migrated manifests ahead of the upgrade
- Verified CNI plugin (Calico/Flannel) and CSI driver compatibility matrices before each hop
- Ran Rancher UI upgrade path in parallel for clusters managed via Rancher, using its built-in node drain + upgrade orchestration
- Validated workloads, Ingress, PVCs, and CNPG cluster health at every version boundary before proceeding
- SLO / SLA Definition – error budgets for every critical service
- Blameless Post-mortems – RCA docs after every incident
- Traffic Management – canary & blue-green deployments via K8s + Argo Rollouts
- Secrets Management – HashiCorp Vault + External Secrets Operator
- GitOps – ArgoCD for declarative, auditable deployments
- Capacity Planning – HPA / VPA / Cluster Autoscaler on cloud & on-prem
- Service Mesh – Istio for mTLS, traffic shaping & observability
- Security Hardening – Pod Security Admission, NetworkPolicies, image scanning
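The canary strategy mentioned above maps to an Argo Rollouts `Rollout` resource; a minimal sketch (service name, image, and traffic weights are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 4
  strategy:
    canary:
      steps:
        - setWeight: 20          # shift 20% of traffic to the new version
        - pause: {duration: 5m}  # soak while dashboards/alerts are watched
        - setWeight: 50
        - pause: {}              # indefinite pause = manual promotion gate
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:v2   # hypothetical image
```

Pairing the pauses with Prometheus-backed `AnalysisTemplate` checks turns the manual gate into automated promotion or rollback.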


