🎯 Executive Summary

I'm a Full-Stack Systems Engineer with deep expertise in distributed data systems, cloud infrastructure, and backend software engineering. I design and build production systems that scale to millions of events/hour and terabyte-scale datasets, with obsessive attention to reliability, performance, and operational excellence.

Key Focus Areas:

  • 🏗️ Systems Architecture — Distributed systems, microservices, event-driven architectures
  • 🔄 Data Pipelines — Real-time streaming, batch processing, end-to-end data platforms
  • ☁️ Cloud Infrastructure — AWS/GCP/Azure, Kubernetes, infrastructure-as-code
  • 🔧 Backend Engineering — API design, database optimization, performance tuning
  • 📈 Data Quality & Observability — Monitoring, alerting, reliability engineering

📍 Seattle / Bellevue, WA | Open to: Data Engineer, SDE, Platform Engineer roles


💼 Professional Experience

Data Engineer — Cognizant (2023–2024)

Real-time analytics platform for financial risk & operations

Impact & Scale:

  • ⚙️ Architected Kafka + Spark Structured Streaming pipeline processing 5M+ events/hour with sub-second latency
  • 🏗️ Designed & implemented Medallion architecture (Bronze → Silver → Gold layers) across Spark, dbt, and Redshift
  • 📐 Modeled 28+ dimension & fact tables using Star Schema with SCD Type 2 for complex business entities
  • 🧪 Built data quality framework with 20+ automated checks (freshness, schema validation, reconciliation, completeness)
  • 🚀 Performance optimization — reduced analytics query latency by ~75% through Spark tuning & warehouse indexing
  • 📊 Enabled 50+ BI dashboards serving risk, operations, and business intelligence teams in production
  • 🔄 Orchestrated 150+ daily workflows with Airflow, managing SLA compliance across dependent pipelines

Technical Stack: Kafka, Apache Spark (SQL/Streaming), Python, Airflow, dbt, Redshift, AWS, SQL, Git
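The SCD Type 2 modeling mentioned above follows a standard pattern: when a tracked attribute changes, the current dimension row is expired and a new versioned row is appended. A minimal plain-Python sketch of that merge (production work expressed this in dbt/SQL on Redshift; the keys, attributes, and function names here are hypothetical illustrations):

```python
from datetime import date

# Minimal SCD Type 2 merge: expire the current row and append a new
# versioned row whenever a tracked attribute changes.
# (Illustrative sketch only; production systems express this in dbt/SQL.)

OPEN_END = date(9999, 12, 31)  # sentinel end date for the current row

def scd2_merge(dim_rows, change, today):
    """dim_rows: list of dicts with keys: key, attrs, start, end, current."""
    out = []
    for row in dim_rows:
        if row["key"] == change["key"] and row["current"] and row["attrs"] != change["attrs"]:
            # Expire the old version of this entity
            out.append(dict(row, end=today, current=False))
        else:
            out.append(row)
    # Insert a new version if the key is new or its attributes changed
    needs_new = not any(
        r["key"] == change["key"] and r["current"] and r["attrs"] == change["attrs"]
        for r in out
    )
    if needs_new:
        out.append({"key": change["key"], "attrs": change["attrs"],
                    "start": today, "end": OPEN_END, "current": True})
    return out

dim = [{"key": "acct-1", "attrs": {"tier": "gold"},
        "start": date(2023, 1, 1), "end": OPEN_END, "current": True}]
dim = scd2_merge(dim, {"key": "acct-1", "attrs": {"tier": "platinum"}}, date(2023, 6, 1))
```

Re-applying the same change is a no-op, which is what makes SCD Type 2 loads safe to retry.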

Key Learnings:

  • Production data systems require obsessive focus on SLAs, data quality, and operational observability
  • Performance at scale demands deep understanding of distributed computing trade-offs
  • Data modeling directly impacts analytics velocity and business decision-making

Graduate Data Engineering Researcher — University at Buffalo (2022–2023)

Research in distributed data systems and scalable ETL

Projects & Contributions:

  • 🧬 Large-scale data processing — Built Spark pipelines processing 2.5TB+ datasets
  • ☁️ Cloud-native ETL — Designed containerized workflows using Docker and Airflow
  • 📦 Data platform infrastructure — Implemented reliable ingest, transform, and serve layers
  • 🔬 Research & optimization — Evaluated trade-offs between batch and streaming processing and across storage formats

Technical Stack: Spark, Python, Docker, Airflow, SQL, Cloud platforms


🔧 Core Competencies

Software Engineering

  • Languages: Python, Scala, Java, SQL
  • Design Patterns: Microservices, event-driven systems, CQRS
  • API Design: REST/gRPC, async/streaming APIs, schema evolution
  • Testing: Unit, integration, contract testing; test-driven development
  • Code Quality: Design patterns, SOLID principles, refactoring

Data Systems & Engineering

  • Streaming: Kafka, Spark Structured Streaming, message queue design
  • Batch Processing: Apache Spark, distributed SQL, DAG orchestration
  • Data Warehousing: Redshift, Snowflake, BigQuery, dimensional modeling
  • Data Transformation: dbt, SQL (advanced), Spark SQL
  • Data Quality: Great Expectations, custom validators, SLA monitoring
  • ETL/ELT: End-to-end pipeline design, CDC patterns, idempotency
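The CDC and idempotency bullets above go together: applying change events as keyed upserts makes a pipeline safe to replay. A minimal pure-Python sketch of the pattern (the event shape and names are hypothetical, not from any specific CDC tool):

```python
# Idempotent CDC apply: upsert by primary key, so replaying the same
# batch of change events produces the same target state (no duplicates).
def apply_cdc(target, events):
    """target: dict keyed by primary key; events: list of {'op', 'key', 'row'}."""
    for e in events:
        if e["op"] == "delete":
            target.pop(e["key"], None)   # deleting a missing key is a no-op
        else:                            # 'insert' and 'update' both upsert
            target[e["key"]] = e["row"]
    return target

events = [
    {"op": "insert", "key": 1, "row": {"amount": 10}},
    {"op": "update", "key": 1, "row": {"amount": 15}},
    {"op": "delete", "key": 2, "row": None},
]
state = apply_cdc({}, events)
replayed = apply_cdc(dict(state), events)  # replay leaves the state unchanged
```

An append-only apply would double-count on retry; keyed upserts are what make "at-least-once delivery plus idempotent writes" behave like exactly-once.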

Cloud & Infrastructure

  • Cloud Platforms: AWS (EC2, S3, RDS, Redshift, Lambda), GCP (BigQuery, Dataflow, Compute Engine), Azure
  • Container & Orchestration: Docker, Kubernetes, Helm
  • Infrastructure-as-Code: Terraform, CloudFormation
  • Networking: VPCs, security groups, API gateways
  • Monitoring & Observability: CloudWatch, DataDog, Prometheus, custom dashboards

Database Systems

  • Relational: PostgreSQL, MySQL, Redshift (columnar optimization)
  • NoSQL: MongoDB, DynamoDB
  • Data Formats: Parquet, Avro, Delta Lake
  • Query Optimization: Indexing strategies, execution plans, partitioning
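The indexing bullet can be demonstrated end-to-end with stdlib sqlite3 (the production work targeted PostgreSQL/Redshift; sqlite stands in here, and the table and column names are hypothetical). Adding an index on the filter column changes the query plan from a full scan to an index search:

```python
import sqlite3

# Show that an index changes the plan from a full table scan to an index search.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, account_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO events (account_id, amount) VALUES (?, ?)",
                 [(i % 100, float(i)) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the plan description in the last column
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT SUM(amount) FROM events WHERE account_id = 7"
before = plan(query)  # typically mentions SCAN (full table scan)
conn.execute("CREATE INDEX idx_events_account ON events (account_id)")
after = plan(query)   # typically mentions SEARCH ... USING INDEX
```

The same discipline applies on columnar warehouses, where sort/dist keys and partitioning play the role the index plays here.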

🛠️ Technical Stack

| Category | Technologies |
| --- | --- |
| Languages | Python, Scala, Java, SQL, Bash |
| Streaming & Messaging | Apache Kafka, Spark Structured Streaming, RabbitMQ |
| Batch & Processing | Apache Spark, Hadoop, Databricks |
| Workflow Orchestration | Apache Airflow, Prefect, Dagster |
| Data Transformation | dbt, SQL, PySpark, Scala |
| Data Warehouses | Redshift, Snowflake, BigQuery, Postgres |
| NoSQL & Caching | MongoDB, DynamoDB, Redis, Cassandra |
| Cloud Platforms | AWS (primary), GCP, Azure |
| Container & DevOps | Docker, Kubernetes, Terraform, Git |
| Monitoring | CloudWatch, DataDog, Prometheus, custom metrics |
| Version Control | Git, GitHub, GitLab, feature branching |
| Development Tools | Jupyter, VS Code, IntelliJ, DataGrip |

🏗️ Featured Projects

1. 💰 Financial Data Platform (Production)

Problem: A financial services organization needed real-time risk analytics with sub-second query latency
Solution:

  • Architecture: Kafka ingest → Spark Structured Streaming processing → Redshift warehouse → BI dashboards
  • Key Features:
    • Real-time ingestion of 5M+ financial events/hour
    • 28+ curated fact & dimension tables (Star Schema + SCD Type 2)
    • 20+ automated data quality checks with alerting
    • Sub-second query latency for risk dashboards
  • Impact: Enabled real-time risk monitoring across 50+ production dashboards
  • Technologies: Kafka, Spark, Python, Redshift, dbt, Airflow, AWS

2. ⚡ Real-Time Streaming ETL Pipeline

Problem: High-throughput event ingestion with exactly-once semantics and failure recovery
Solution:

  • Architecture: Kafka topics → Spark Streaming (micro-batching) → distributed storage
  • Key Features:
    • Exactly-once processing semantics with idempotent writes
    • Automatic retry & checkpoint management
    • Schema evolution handling with Avro
    • Real-time SLA monitoring
  • Impact: 99.99% uptime SLA with <5min recovery from failures
  • Technologies: Kafka, Spark Streaming, Python, AWS S3/RDS, Monitoring
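Exactly-once semantics in micro-batch systems usually come from combining a replayable offset checkpoint with idempotent writes, so that recovery after a crash replays records harmlessly. A minimal pure-Python sketch of that pattern (class and field names are hypothetical; production used Kafka offsets with Spark checkpoints):

```python
# Effectively-once processing: track the last committed offset so a restart
# replays from the checkpoint, while idempotent (keyed) writes make the
# replay harmless. (Illustrative sketch; not a real Kafka/Spark API.)
class CheckpointedConsumer:
    def __init__(self):
        self.committed_offset = 0     # durable checkpoint store in production
        self.sink = {}                # idempotent keyed sink

    def process(self, log):
        """log: list of (offset, key, value) records, offsets ascending."""
        for offset, key, value in log:
            if offset < self.committed_offset:
                continue              # already processed before the crash
            self.sink[key] = value    # idempotent write: last value wins
            self.committed_offset = offset + 1

log = [(0, "a", 1), (1, "b", 2), (2, "a", 3)]
c = CheckpointedConsumer()
c.process(log[:2])   # "crash" after two records...
c.process(log)       # ...restart replays the full log from the checkpoint
```

In a real deployment the offset commit and the sink write must be atomic (or the sink idempotent, as here); committing the offset before the write is the classic way to lose data.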

3. 📊 BigQuery Analytics Layer

Problem: The analytics team needed dimensional models optimized for BI queries
Solution:

  • Architecture: Raw data lake → dbt transformations → optimized dimensions & facts
  • Key Features:
    • Dimensional modeling (Star Schema)
    • Incremental dbt models with CDC support
    • Automated model lineage & testing
    • Integration with Looker for self-service BI
  • Impact: 10x faster BI query performance, reduced analytics development time by 60%
  • Technologies: BigQuery, dbt, SQL, Looker, GCP
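The query shape a star schema optimizes for is "aggregate the fact table, grouped by a dimension attribute." A tiny runnable sketch using stdlib sqlite3 (sqlite stands in for BigQuery; the customer/orders tables and column names are hypothetical):

```python
import sqlite3

# A minimal star schema: one fact table joined to one dimension for
# BI-style aggregation. (sqlite3 stands in for BigQuery here.)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_orders  (order_id INTEGER PRIMARY KEY,
                           customer_key INTEGER REFERENCES dim_customer,
                           amount REAL);
INSERT INTO dim_customer VALUES (1, 'WEST'), (2, 'EAST');
INSERT INTO fact_orders  VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 25.0);
""")

# Typical BI query shape: aggregate facts, grouped by a dimension attribute.
rows = conn.execute("""
    SELECT d.region, SUM(f.amount) AS revenue
    FROM fact_orders f
    JOIN dim_customer d USING (customer_key)
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
# rows == [('EAST', 25.0), ('WEST', 150.0)]
```

Because every analytical question reduces to this one join pattern, BI tools like Looker can generate the SQL automatically, which is what enables self-service.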

4. 🔍 Data Quality Framework

Problem: Needed enterprise-grade data quality monitoring at scale
Solution:

  • Framework: Custom quality checks + Great Expectations integration
  • Key Features:
    • 20+ automated data quality checks (schema, freshness, completeness, reconciliation)
    • Real-time alerting with PagerDuty integration
    • SLA tracking with automated remediation
    • Data lineage tracking for root cause analysis
  • Impact: 95% reduction in data quality incidents, automated remediation for 80% of issues
  • Technologies: Python, Great Expectations, SQL, Airflow, monitoring
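Two of the check types named above, freshness and completeness, reduce to simple predicates over a table's metadata and rows. A standalone sketch of what the framework's custom validators might look like (function names and thresholds are hypothetical; production combined custom checks with Great Expectations):

```python
from datetime import datetime, timedelta

# Freshness: fail if the newest record is older than the allowed lag.
def check_freshness(latest_event_time, now, max_lag):
    return (now - latest_event_time) <= max_lag

# Completeness: fail if any required column is NULL/missing in any row.
def check_completeness(rows, required_columns):
    return all(row.get(col) is not None for row in rows for col in required_columns)

now = datetime(2024, 1, 1, 12, 0)
fresh = check_freshness(datetime(2024, 1, 1, 11, 30), now, timedelta(hours=1))
stale = check_freshness(datetime(2024, 1, 1, 9, 0), now, timedelta(hours=1))
complete = check_completeness([{"id": 1, "amount": 5.0}], ["id", "amount"])
```

In an orchestrated pipeline these predicates run as post-load tasks; a False result pages on-call (e.g. via PagerDuty) or blocks downstream DAG tasks until remediated.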

📊 Key Metrics & Impact

| Metric | Achievement |
| --- | --- |
| Data Volume | 5M+ events/hour, 2.5TB+ historical datasets |
| Query Latency | Sub-second to <5 seconds (depending on query complexity) |
| Performance Improvement | ~75% faster analytics through tuning |
| Reliability | 99.99% uptime SLA on production pipelines |
| Data Quality | 95% reduction in data quality incidents |
| Automation | 20+ data quality checks, 80% auto-remediation |
| Dashboards | Enabled 50+ production BI dashboards |
| Daily Workflows | 150+ orchestrated Airflow DAGs |

🎓 Education & Certifications

  • Master of Science in Engineering — University at Buffalo
  • Google Cloud Certified Associate Cloud Engineer (in progress)
  • Coursework: Distributed Systems, Database Systems, Cloud Computing, Advanced Algorithms

🏆 Strengths & Philosophy

What Sets Me Apart:

  1. Full-Stack Systems Thinking — I understand data platforms from ingest to serving, infrastructure to observability
  2. Production Mindset — Built systems handling millions of events/hour with reliability guarantees
  3. Performance Obsession — Deep knowledge of distributed systems trade-offs, bottleneck identification, optimization
  4. Code Quality — Clean, maintainable, well-tested code following SOLID principles
  5. Communication — Excellent at explaining complex systems to both technical and non-technical audiences

Engineering Philosophy:

"Good systems are invisible. They're reliable, observable, and enable teams to move fast without fear. Great engineering is about obsessing over reliability, performance, and the developer experience for those who maintain the system."


🤝 Let's Connect

I'm actively looking for roles in:

  • 🔹 Data Engineer — Building scalable data platforms and ETL systems
  • 🔹 Software Engineer (SDE) — Backend systems, distributed systems, infrastructure
  • 🔹 Platform Engineer — Infrastructure automation, data platform architecture
  • 🔹 Systems Engineer — Cloud architecture, reliability engineering

📧 Email: [email protected]
🔗 LinkedIn: https://linkedin.com/in/kaushal-shivaprakash
💻 GitHub: https://github.com/kaushal-shivaprakashan
📊 Kaggle: https://www.kaggle.com/kaushal07


📝 Latest Blog Insights

  • Coming soon: Deep dive into Spark performance tuning at scale
  • Coming soon: Building reliable data quality frameworks
  • Coming soon: Event-driven architectures in practice

💡 Open to discussing data systems, distributed architecture, and software engineering best practices

📌 Pinned Repositories

  1. Hyperloop-Technology-Integrated-system-using-Hardware-Software-and-Data-Analytics-TECHNICAL-SEMINAR
  2. Machine-Learning-Model-for-NVIDIA-GPU-Benchmark-Classification-and-Prediction (HTML)
  3. Adaptive-Crop-Yield-Forecasting-Using-Statistical-Learning-Algorithms (HTML)
  4. ARCHIQ (JavaScript)