Vikash Kumar Dubey

Artificial Intelligence in Healthcare & Clinical AI.

AI Researcher · Senior AI/ML Engineer · Sanofi

I build AI/ML models and data pipelines for pharmaceutical drug development, clinical research, and bioinformatics — working across disease progression modeling, patient stratification, medical imaging diagnostics, and clinical trial analytics with multimodal data (EHR, imaging, genomics).

I have deep roots in CDISC-aligned data standards (SDTM/ADaM), regulated environments, and close collaboration with Biostatistics, Clinical Development, and R&D teams. I've also published 6 peer-reviewed papers at IEEE and APASL conferences — the research side keeps me honest.

"Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less."
— Marie Curie  ·  In today's AI world, the more we learn, the less we fear.
Experience
Sanofi
Senior AI/ML Engineer
May 2025 – Present · Hyderabad, India

At Sanofi I work deep in the science — building survival and longitudinal models that track how diseases progress through time, drawing from rich multimodal datasets: cognitive endpoints, MRI/PET neuroimaging, genomic biomarkers. Every model has to hold up under scrutiny — I evaluate discrimination, calibration, and subgroup robustness before anything reaches a trial team. The pipelines I build follow CDISC standards (SDTM/ADaM) end-to-end, with automated validation and full regulatory traceability. Working shoulder-to-shoulder with Biostatistics and Clinical Development, I translate messy clinical questions into model specs that are actually useful.

Core: Survival Analysis, Longitudinal Modeling, SDTM/ADaM, Multimodal Clinical Data, Model Validation
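To make the discrimination check mentioned above concrete, here is a minimal sketch of Harrell's concordance index for a survival model. It is an illustration only, not the pipeline used at Sanofi: the function name, the toy data, and the choice of Harrell's C as the metric are all assumptions.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable patient pairs ranked correctly.

    A pair (i, j) is comparable when patient i has an observed event
    strictly before patient j's follow-up time. The pair is concordant
    when the model gave patient i the higher risk score. Ties in the
    risk score count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Hypothetical toy cohort: follow-up time, event flag (1 = event, 0 = censored)
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.1, 0.4, 0.7]
print(concordance_index(times, events, risk))  # 0.6
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The quadratic loop is fine for a sketch; a real cohort would call for a vectorized or library implementation.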
Cognizant
Data Scientist – Healthcare
Jul 2021 – May 2025 · Hyderabad, India

Four years at Cognizant meant wearing many hats across pharma and life sciences. I built patient stratification and trial recruitment models using XGBoost and LightGBM — always with bias checks across diverse cohorts. The infrastructure side was just as important: end-to-end ML pipelines on AWS and Azure, from raw ingestion through to serving, running across multiple concurrent projects. I brought generative AI into clinical workflows — document automation, diagnostic support — and set up the MLOps scaffolding (versioning, monitoring, retraining) to keep it reliable. The Tableau and Power BI dashboards I delivered became daily tools for clinical and ops teams tracking enrollment in real time.

Core: Ensemble Modeling, Cloud ML Pipelines, Clinical Data Quality, Dashboard Analytics, MLOps
Institute of Liver & Biliary Sciences
Data Scientist – Medical AI Research
Dec 2020 – Jul 2021 · Delhi, India

This was the role that turned me into a medical AI researcher. Working with clinicians at ILBS, I built sepsis risk prediction models for ACLF patients using XGBoost and Random Forest, with careful threshold tuning so that model outputs actually mapped to clinical decisions (published at APASL 2023). On the imaging side, I implemented U-Net and CNN models for brain tumor segmentation from MRI, validated against radiologist-annotated ground truth using the Dice coefficient and cross-validation (IEEE ICAII 2023). Working directly with clinicians forced me to care about clinical meaning, not just statistical metrics.

Core: Clinical Model Validation, Medical Imaging, Clinician Collaboration, Research Publishing
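The threshold tuning mentioned above can be sketched as follows. This is one plausible criterion (the highest threshold that still meets a minimum sensitivity), not necessarily the rule used in the published work; the function names, the 0.9 default, and the toy data are assumptions.

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Sensitivity and specificity when predicting positive at p >= threshold."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if y == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def pick_threshold(y_true, y_prob, min_sensitivity=0.9):
    """Return (threshold, sensitivity, specificity) for the highest
    threshold whose sensitivity meets the floor, or None if none does.

    Lowering the threshold can only raise sensitivity and lower
    specificity, so the first hit while scanning from high to low is
    also the most specific threshold that satisfies the floor.
    """
    for t in sorted(set(y_prob), reverse=True):
        sens, spec = sensitivity_specificity(y_true, y_prob, t)
        if sens >= min_sensitivity:
            return t, sens, spec
    return None


# Hypothetical labels (1 = sepsis) and model probabilities
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]
print(pick_threshold(y_true, y_prob, min_sensitivity=1.0))
```

Fixing the sensitivity floor first reflects a setting where a missed case is costlier than a false alarm; the floor itself is a clinical decision, not a statistical one.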
Centyle Biotech Pvt. Ltd
Data Scientist – Bioinformatics
Feb 2019 – Dec 2020 · Delhi, India

My entry into ML was through biology. At Centyle Biotech, I built pipelines to make sense of complex genomic and proteomic datasets — clustering, dimensionality reduction, classification — working in Python and R. I collaborated closely with domain scientists to define features and validation workflows that captured actual biological variability, not just data artifacts. It taught me to iterate carefully and document rigorously before anything touched production.

Core: Bioinformatics, Unsupervised/Supervised Learning, Genomic Data, Model Iteration
Projects
Disease Progression Modeling System · 2025

A system that forecasts how a patient's disease will evolve, combining Cox proportional hazards and Kaplan-Meier survival analysis, linear mixed-effects models, gradient boosting, and deep learning over multimodal clinical data (EHR records, imaging, genomic markers). Feature engineering was critical: I built biomarker identification pipelines using Random Forest and XGBoost with SHAP to make sure the right signals drove the predictions. Interpretability layers, subgroup validation, and decision threshold tuning produced outputs that clinicians could actually trust. Experiment tracking via MLflow kept everything reproducible across dozens of data refreshes.
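For readers unfamiliar with the Kaplan-Meier part of that stack, here is a minimal sketch of the estimator written from its textbook definition. It is not the project's code; the function name and the toy follow-up data are assumptions.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve as a list of (time, S(time)) pairs.

    times  -- follow-up time per patient
    events -- 1 if the event was observed at that time, 0 if censored

    At each distinct time with at least one event, survival is multiplied
    by (1 - deaths / patients still at risk). Censored patients leave the
    risk set without changing the survival estimate.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical cohort of five patients (time in months)
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(f"t={t}: S(t)={s:.2f}")
```

On this toy cohort the curve steps down at months 1, 2, and 3; the patient censored at month 2 and the one censored at month 4 only shrink the risk set.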

Generative AI for Drug Discovery · 2024

What if a researcher could just ask the literature a question and get a cited, synthesized answer? This project is a RAG-based retrieval system over large biomedical corpora: it uses vector embeddings and semantic search to retrieve the most relevant passages, then synthesizes answers via an LLM. It is paired with NLP pipelines for drug-target interaction extraction using fine-tuned transformer models. Retrieval quality was measured with precision@k and mean reciprocal rank (MRR). Result: researchers spend far less time on manual literature review.
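The two retrieval metrics named above are simple to state in code. The following is a minimal sketch using their standard definitions; the function names, document IDs, and queries are hypothetical and do not come from the project.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


def mean_reciprocal_rank(runs):
    """Average of 1 / rank of the first relevant document per query.

    runs -- list of (retrieved_list, relevant_set) pairs, one per query.
    A query with no relevant document in its list contributes 0.
    """
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Hypothetical ranked results for three queries
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),  # first hit at rank 2
    (["a", "b"], {"a"}),                       # first hit at rank 1
    (["x", "y"], {"z"}),                       # no hit
]
print(precision_at_k(runs[0][0], runs[0][1], k=2))  # 0.5
print(mean_reciprocal_rank(runs))                   # 0.5
```

Precision@k rewards filling the top of the list with relevant passages, while MRR only cares how early the first relevant passage appears, so the two are usually reported together.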

Clinical Trial Analytics Platform · 2022

Built the data engine behind a clinical trial: an end-to-end patient screening pipeline for Phase II/III studies, fusing EHR, lab, and imaging features with automated feature selection (Lasso, mutual information), machine learning classifiers, and calibration analysis to improve cohort matching. On top of that, Tableau dashboards gave trial teams live visibility into enrollment rates, cohort drift, and site performance, replacing the weekly manual reporting grind entirely.
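As an illustration of what a calibration analysis checks, here is a minimal sketch of a binned reliability table and the expected calibration error (ECE) derived from it. The source does not say which calibration method the platform used, so the equal-width binning, the function names, and the toy data are all assumptions.

```python
def calibration_table(y_true, y_prob, n_bins=5):
    """Rows of (count, mean predicted probability, observed event rate)
    for each non-empty equal-width probability bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))
    rows = []
    for members in bins:
        if members:
            count = len(members)
            mean_pred = sum(p for _, p in members) / count
            observed = sum(y for y, _ in members) / count
            rows.append((count, mean_pred, observed))
    return rows


def expected_calibration_error(y_true, y_prob, n_bins=5):
    """Count-weighted average gap between predicted and observed rates."""
    n = len(y_true)
    table = calibration_table(y_true, y_prob, n_bins)
    return sum(count / n * abs(mean_pred - observed)
               for count, mean_pred, observed in table)


# Hypothetical eligibility labels and classifier probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.1, 0.9, 0.9]
print(round(expected_calibration_error(y_true, y_prob), 3))  # 0.1
```

A well-calibrated screening model has small gaps in every bin, which matters when a predicted probability is read as an actual eligibility rate rather than just a ranking.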

Medical Imaging Diagnostics · 2021

U-Net and ResNet deep learning models trained to detect brain tumors and hemorrhagic strokes from MRI/CT scans, validated against radiologist-annotated ground truth using the Dice coefficient, IoU, and cross-validation. Confidence scoring was added so radiologists could see how certain the model was at each prediction. The full inference pipeline was containerized with Docker and integrated into a real clinical workflow for live diagnostic support.
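The two overlap metrics used for validation can be sketched in a few lines. This is a plain-Python illustration of the standard definitions on binary masks, not the project's evaluation code; the function name, the tiny 2x2 masks, and the convention of scoring two empty masks as a perfect match are assumptions.

```python
def dice_and_iou(pred_mask, true_mask):
    """Dice coefficient and IoU between two binary masks (lists of rows).

    Dice = 2 * |A and B| / (|A| + |B|)
    IoU  = |A and B| / |A or B|
    """
    pred = [bool(v) for row in pred_mask for v in row]
    true = [bool(v) for row in true_mask for v in row]
    if len(pred) != len(true):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, true) if p and t)
    pred_area = sum(pred)
    true_area = sum(true)
    union = pred_area + true_area - intersection
    if union == 0:
        return 1.0, 1.0  # assumption: both masks empty counts as a match
    dice = 2 * intersection / (pred_area + true_area)
    iou = intersection / union
    return dice, iou


# Hypothetical 2x2 masks: the model marks two pixels, the radiologist one
predicted = [[1, 1],
             [0, 0]]
annotated = [[1, 0],
             [0, 0]]
dice, iou = dice_and_iou(predicted, annotated)
print(f"Dice={dice:.3f}, IoU={iou:.3f}")  # Dice=0.667, IoU=0.500
```

Dice is never lower than IoU on the same pair of masks, so reporting both makes it easier to compare results against papers that use only one of them.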

Research & Publications
Education & Certifications
South Asian University (SAARC)
Master of Science in Computer Science
Jul 2016 – Jun 2018 · Delhi, India

Thesis: Multi-label text classification using deep learning and extreme classification techniques for large-scale datasets.

Coursework: Machine Learning, Deep Learning, NLP, Statistical Inference, Data Structures & Algorithms, Computer Vision, Big Data Systems, Cloud Computing.

Beyond the Work
Taught ML, Deep Learning, NLP & Cloud to 1,000+ learners as a Data Science Instructor & Consultant (2018–2020), and ran hands-on AI workshops at IIT Kanpur.
Presented research at 6 IEEE & APASL conferences, served as ACM Vice Chair & University Ambassador, and ran healthcare AI knowledge-sharing workshops at Cognizant and Sanofi.
Picked up French & Spanish Diplomas at IIT Delhi (2016–2018) along the way — learning languages, like learning models, never really stops.
Technical Skills
Languages: Python, R, SQL, PySpark
ML / DL: TensorFlow, Keras, PyTorch, Scikit-Learn, XGBoost, LightGBM, Pandas, NumPy
Cloud & Big Data: AWS SageMaker, AWS Lambda, S3, EC2, Azure Databricks, Azure ML Studio, Hadoop, Spark MLlib
Medical AI: U-Net, ResNet, Clinical NLP, Survival Analysis, EHR Analytics
NLP & Gen AI: Transformers, BERT, GPT, LangChain, RAG
Clinical Data: SDTM, ADaM, CDISC, Clinical Trial Analytics, Biostatistics Pipelines
Tools: Docker, Kubernetes, Git, Jenkins, MLflow, Tableau, Power BI
Databases: PostgreSQL, MySQL, MongoDB
Regulatory: HIPAA, GDPR, FDA AI/ML Guidance, GxP
Contact

If you work in healthcare AI, clinical data science, or pharma R&D — or you're just curious about any of it — I'd genuinely love to connect.