Vikash Kumar Dubey

Artificial Intelligence in Healthcare & Clinical AI.

AI Researcher · Senior AI/ML Engineer · Sanofi

I build AI/ML models and data pipelines for pharmaceutical drug development, clinical research, and bioinformatics — working across disease progression modeling, patient stratification, medical imaging diagnostics, and clinical trial analytics with multimodal data (EHR, imaging, genomics).

I have deep roots in CDISC-aligned data standards (SDTM/ADaM), regulated environments, and close collaboration with Biostatistics, Clinical Development, and R&D teams. I've also published 6 peer-reviewed papers at IEEE and APASL conferences — the research side keeps me honest.

"Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less."

— Marie Curie · In today's AI world, the more we learn, the less we fear.

Gmail LinkedIn GitHub ResearchGate

Experience

Sanofi

Senior AI/ML Engineer

May 2025 – Present · Hyderabad, India

At Sanofi I work deep in the science — building survival and longitudinal models that track how diseases progress through time, drawing from rich multimodal datasets: cognitive endpoints, MRI/PET neuroimaging, genomic biomarkers. Every model has to hold up under scrutiny — I evaluate discrimination, calibration, and subgroup robustness before anything reaches a trial team. The pipelines I build follow CDISC standards (SDTM/ADaM) end-to-end, with automated validation and full regulatory traceability. Working shoulder-to-shoulder with Biostatistics and Clinical Development, I translate messy clinical questions into model specs that are actually useful.

Core: Survival Analysis, Longitudinal Modeling, SDTM/ADaM, Multimodal Clinical Data, Model Validation

Cognizant

Data Scientist – Healthcare

Jul 2021 – May 2025 · Hyderabad, India

Four years at Cognizant meant wearing many hats across pharma and life sciences. I built patient stratification and trial recruitment models using XGBoost and LightGBM — always with bias checks across diverse cohorts. The infrastructure side was just as important: end-to-end ML pipelines on AWS and Azure, from raw ingestion through to serving, running across multiple concurrent projects. I brought generative AI into clinical workflows — document automation, diagnostic support — and set up the MLOps scaffolding (versioning, monitoring, retraining) to keep it reliable. The Tableau and Power BI dashboards I delivered became daily tools for clinical and ops teams tracking enrollment in real time.

Core: Ensemble Modeling, Cloud ML Pipelines, Clinical Data Quality, Dashboard Analytics, MLOps

Institute of Liver & Biliary Sciences

Data Scientist – Medical AI Research

Dec 2020 – Jul 2021 · Delhi, India

This was the role that turned me into a medical AI researcher. Working with clinicians at ILBS, I built sepsis risk prediction models for ACLF patients using XGBoost and Random Forest, with careful threshold tuning so that model outputs actually mapped to clinical decisions (published at APASL 2023). On the imaging side, I implemented U-Net and CNN models for brain tumor segmentation from MRI, validated against radiologist-annotated ground truth using the Dice coefficient and k-fold cross-validation (IEEE ICAII 2023). Working directly with clinicians forced me to care about clinical meaning, not just statistical metrics.

Core: Clinical Model Validation, Medical Imaging, Clinician Collaboration, Research Publishing

Centyle Biotech Pvt. Ltd

Data Scientist – Bioinformatics

Feb 2019 – Dec 2020 · Delhi, India

My entry into ML was through biology. At Centyle Biotech, I built pipelines to make sense of complex genomic and proteomic datasets — clustering, dimensionality reduction, classification — working in Python and R. I collaborated closely with domain scientists to define features and validation workflows that captured actual biological variability, not just data artifacts. It taught me to iterate carefully and document rigorously before anything touched production.

Core: Bioinformatics, Unsupervised/Supervised Learning, Genomic Data, Model Iteration

Projects

Disease Progression Modeling System · 2025

A system that forecasts how a patient's disease will evolve, combining Cox proportional hazards and Kaplan-Meier survival analysis, linear mixed-effects models, gradient boosting, and deep learning over multimodal clinical data (EHR records, imaging, genomic markers). Feature engineering was critical: I built biomarker identification pipelines using Random Forest and XGBoost with SHAP to make sure the right signals drove the predictions. Interpretability layers, subgroup validation, and decision threshold tuning produced outputs that clinicians could actually trust. Experiment tracking via MLflow kept everything reproducible across dozens of data refreshes.
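For readers unfamiliar with the Kaplan-Meier estimator mentioned above, here is a minimal pure-Python sketch of the idea. It is an illustration only, not the project's code: the function name `kaplan_meier` and the toy follow-up data are assumptions, and a real analysis would normally rely on an established survival library.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient
    events:    1 if the event was observed, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)      # every patient leaves the risk set at their time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]         # remove both events and censored patients
    return curve

# Hypothetical toy cohort: five patients, one censored at time 6.
curve = kaplan_meier([5, 6, 6, 2, 4], [1, 0, 1, 1, 1])
for time, prob in curve:
    print(f"t={time}: S(t)={prob:.2f}")
```

Censored patients still count in the risk set up to their last follow-up time, which is what separates this estimator from a naive proportion of survivors.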

Generative AI for Drug Discovery · 2024

What if a researcher could just ask the literature a question and get a cited, synthesized answer? I built a RAG-based retrieval system over large biomedical corpora that uses vector embeddings and semantic search to retrieve the most relevant passages, then synthesizes answers via an LLM. It is paired with NLP pipelines for drug-target interaction extraction using fine-tuned transformer models. Retrieval quality was measured with precision@k and MRR (mean reciprocal rank). Result: researchers spending far less time on manual literature review.
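As a rough illustration of the two retrieval metrics named above, the sketch below computes precision@k and MRR in plain Python. The function names and the toy document IDs are assumptions made for this example; it does not reproduce the project's evaluation code.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none)."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Hypothetical passages returned for one query, and the relevant ones.
p_at_3 = precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=3)

# Three hypothetical queries: first hit at rank 2, at rank 1, and no hit.
mrr = mean_reciprocal_rank(
    [["x", "a"], ["b", "y"], ["z", "w"]],
    [{"a"}, {"b"}, {"q"}],
)
print(p_at_3, mrr)
```

Precision@k rewards having many relevant passages near the top, while MRR only looks at how early the first relevant passage appears, so the two are usually reported together.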

Clinical Trial Analytics Platform · 2022

Built the data engine behind a clinical trial: an end-to-end patient screening pipeline for Phase II/III studies, fusing EHR, lab, and imaging features with automated feature selection (Lasso, mutual information), machine learning classifiers, and calibration analysis to improve cohort matching. On top of that, Tableau dashboards gave trial teams live visibility into enrollment rates, cohort drift, and site performance, replacing the weekly manual reporting grind entirely.

Medical Imaging Diagnostics · 2021

U-Net and ResNet deep learning models trained to detect brain tumors and hemorrhagic strokes from MRI/CT scans, validated against radiologist-annotated ground truth using the Dice coefficient, IoU, and k-fold cross-validation. Confidence scoring was added so radiologists could see how certain the model was for each prediction. The full inference pipeline was containerized with Docker and integrated into a real clinical workflow for live diagnostic support.
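To make the segmentation metrics above concrete, here is a small sketch of the Dice coefficient and IoU on binary masks. It assumes masks given as nested lists of 0/1 values; the function name `dice_and_iou` and the 2x2 toy masks are hypothetical, and a production pipeline would more likely operate on arrays.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two binary masks of the same shape."""
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in truth for v in row]
    intersection = sum(a and b for a, b in zip(p, t))
    p_sum, t_sum = sum(p), sum(t)
    union = p_sum + t_sum - intersection
    if union == 0:
        # Both masks empty: treat as perfect agreement (a convention choice).
        return 1.0, 1.0
    dice = 2 * intersection / (p_sum + t_sum)
    iou = intersection / union
    return dice, iou

# Hypothetical predicted mask vs. annotated mask.
dice, iou = dice_and_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])
print(f"Dice={dice:.3f}, IoU={iou:.3f}")
```

The two scores are monotonically related (Dice = 2·IoU / (1 + IoU)), but Dice weights the overlap more generously, which is one reason both tend to be reported for medical segmentation.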

Research & Publications

Advanced MRI Segmentation Algorithm for Brain Tumor Detection

IEEE ICAII 2023, July 2023

View Paper →

Prediction of Sepsis in ACLF Patients – A Machine Learning Approach

APASL ACLF Research Consortium, June 2023

View Paper →

Brain Hemorrhagic Stroke Detection by Image Processing

IEEE ICEECCOT 2022, March 2022

View Paper →

News Classification from Microblogging Dataset using Supervised Learning

IEEE ICCCIS 2021, February 2021

View Paper →

Comparative Analysis of Various Extreme Multi-Label Classification Algorithms

IEEE ICEECCOT 2019, December 2019

View Paper →

A Review of Omega-Based Portfolio Optimization

IEEE ICPECA 2019, November 2019

View Paper →

Education & Certifications

South Asian University (SAARC)

Master of Science in Computer Science

Jul 2016 – Jun 2018 · Delhi, India

Thesis: Multi-label text classification using deep learning and extreme classification techniques for large-scale datasets. [View Thesis]

Coursework: Machine Learning, Deep Learning, NLP, Statistical Inference, Data Structures & Algorithms, Computer Vision, Big Data Systems, Cloud Computing.

Beyond the Work

Taught ML, Deep Learning, NLP & Cloud to 1,000+ learners as a Data Science Instructor & Consultant (2018–2020), and ran hands-on AI workshops at IIT Kanpur. [Certificate]

Presented research at 6 IEEE & APASL conferences and served as ACM Vice Chair & University Ambassador — running healthcare AI knowledge-sharing workshops at Cognizant and Sanofi.

Picked up French & Spanish Diplomas at IIT Delhi (2016–2018) along the way — learning languages, like learning models, never really stops.

Technical Skills

Category	Technologies & Tools
Languages	Python, R, SQL, PySpark
ML / DL	TensorFlow, Keras, PyTorch, Scikit-Learn, XGBoost, LightGBM, Pandas, NumPy
Cloud & Big Data	AWS SageMaker, AWS Lambda, S3, EC2, Azure Databricks, Azure ML Studio, Hadoop, Spark MLlib
Medical AI	U-Net, ResNet, Clinical NLP, Survival Analysis, EHR Analytics
NLP & Gen AI	Transformers, BERT, GPT, LangChain, RAG
Clinical Data	SDTM, ADaM, CDISC, Clinical Trial Analytics, Biostatistics Pipelines
Tools	Docker, Kubernetes, Git, Jenkins, MLflow, Tableau, Power BI
Databases	PostgreSQL, MySQL, MongoDB
Regulatory	HIPAA, GDPR, FDA AI/ML Guidance, GxP

Contact

If you work in healthcare AI, clinical data science, or pharma R&D — or you're just curious about any of it — I'd genuinely love to connect.

Home Gmail LinkedIn GitHub ResearchGate