Vikash Kumar Dubey
Artificial Intelligence in Healthcare & Clinical AI.
AI Researcher · Senior AI/ML Engineer · Sanofi
I build AI/ML models and data pipelines for pharmaceutical drug development, clinical research, and bioinformatics — working across disease progression modeling, patient stratification, medical imaging diagnostics, and clinical trial analytics with multimodal data (EHR, imaging, genomics).
I have deep roots in CDISC-aligned data standards (SDTM/ADaM), regulated environments, and close collaboration with Biostatistics, Clinical Development, and R&D teams. I've also published 6 peer-reviewed papers at IEEE and APASL conferences — the research side keeps me honest.
"Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less." (Marie Curie) · In today's AI world, the more we learn, the less we fear.
At Sanofi I work deep in the science — building survival and longitudinal models that track how diseases progress through time, drawing from rich multimodal datasets: cognitive endpoints, MRI/PET neuroimaging, genomic biomarkers. Every model has to hold up under scrutiny — I evaluate discrimination, calibration, and subgroup robustness before anything reaches a trial team. The pipelines I build follow CDISC standards (SDTM/ADaM) end-to-end, with automated validation and full regulatory traceability. Working shoulder-to-shoulder with Biostatistics and Clinical Development, I translate messy clinical questions into model specs that are actually useful.
Four years at Cognizant meant wearing many hats across pharma and life sciences. I built patient stratification and trial recruitment models using XGBoost and LightGBM — always with bias checks across diverse cohorts. The infrastructure side was just as important: end-to-end ML pipelines on AWS and Azure, from raw ingestion through to serving, running across multiple concurrent projects. I brought generative AI into clinical workflows — document automation, diagnostic support — and set up the MLOps scaffolding (versioning, monitoring, retraining) to keep it reliable. The Tableau and Power BI dashboards I delivered became daily tools for clinical and ops teams tracking enrollment in real time.
This was the role that turned me into a medical AI researcher. Working with clinicians at ILBS, I built sepsis risk prediction models for ACLF patients using XGBoost and Random Forest, with careful threshold tuning so that model outputs actually mapped to clinical decisions (published APASL 2023). On the imaging side, I implemented U-Net and CNN models for brain tumor segmentation from MRI, validated against radiologist-annotated ground truth using the Dice coefficient and cross-validation (IEEE ICAII 2023). Working directly with clinicians forced me to care about clinical meaning, not just statistical metrics.
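For readers curious what threshold tuning means in practice, here is a minimal sketch of one common approach: pick the highest probability cutoff that still meets a target sensitivity. The function names, the target value, and the toy data are illustrative assumptions, not the published models or their data.

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are called positive."""
    tp = fn = tn = fp = 0
    for label, prob in zip(y_true, y_prob):
        predicted_positive = prob >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def pick_threshold(y_true, y_prob, min_sensitivity=0.9):
    """Return the highest threshold whose sensitivity meets the target, or None."""
    # Sensitivity can only fall as the threshold rises, so scanning from the
    # highest candidate downward finds the strictest acceptable cutoff first.
    for threshold in sorted(set(y_prob), reverse=True):
        sensitivity, _ = sensitivity_specificity(y_true, y_prob, threshold)
        if sensitivity >= min_sensitivity:
            return threshold
    return None


# Toy labels and predicted risks, invented purely for illustration.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90]

threshold = pick_threshold(y_true, y_prob, min_sensitivity=1.0)
sens, spec = sensitivity_specificity(y_true, y_prob, threshold)
print(f"threshold={threshold}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In a clinical setting the sensitivity target would come from the clinicians, since it encodes how costly a missed case is compared with a false alarm.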
My entry into ML was through biology. At Centyle Biotech, I built pipelines to make sense of complex genomic and proteomic datasets — clustering, dimensionality reduction, classification — working in Python and R. I collaborated closely with domain scientists to define features and validation workflows that captured actual biological variability, not just data artifacts. It taught me to iterate carefully and document rigorously before anything touched production.
A system that forecasts how a patient's disease will evolve, combining Cox proportional hazards and Kaplan-Meier survival analysis, linear mixed-effects models, gradient boosting, and deep learning over multimodal clinical data (EHR, imaging, genomic markers). Feature engineering was critical: I built biomarker identification pipelines using Random Forest and XGBoost with SHAP to make sure the right signals drove the predictions. Interpretability layers, subgroup validation, and decision threshold tuning produced outputs that clinicians could actually trust. Experiment tracking via MLflow kept everything reproducible across dozens of data refreshes.
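As a small illustration of the survival-analysis piece, the sketch below computes a Kaplan-Meier curve from scratch. It is a teaching example under stated assumptions: the function name and the toy follow-up data are hypothetical, and a real project would more likely rely on an established survival library.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each patient
    events:    1 if the event was observed, 0 if the patient was censored
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # events observed at time t
        leaving = 0    # patients leaving the risk set at t (events + censored)
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1.0 - observed / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Toy follow-up times (months) and event flags, invented for illustration.
durations = [2, 3, 3, 5, 8, 8]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Patients censored at the same time as an event are treated as still at risk at that time, which is the usual convention for this estimator.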
What if a researcher could just ask the literature a question and get a cited, synthesized answer? A RAG-based retrieval system over large biomedical corpora — using vector embeddings and semantic search to retrieve the most relevant passages, then synthesizing answers via LLM. Paired with NLP pipelines for drug-target interaction extraction using fine-tuned transformer models. Retrieval quality measured with precision@k and MRR. Result: researchers spending far less time on manual literature review.
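For context on the two retrieval metrics named above, here is a minimal sketch of how precision@k and MRR are typically computed. The function names, document IDs, and relevance judgments are made up for illustration and do not come from the actual system.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved document IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def mean_reciprocal_rank(runs):
    """Mean of 1/rank of the first relevant hit per query (0 when there is none)."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Each run pairs a ranked result list with the set of relevant IDs (hypothetical).
runs = [
    (["d3", "d1", "d7"], {"d1", "d7"}),  # first relevant hit at rank 2
    (["d4", "d9", "d2"], {"d4"}),        # first relevant hit at rank 1
    (["d5", "d6", "d8"], {"d0"}),        # no relevant hit
]

p_at_3 = precision_at_k(runs[0][0], runs[0][1], k=3)
mrr = mean_reciprocal_rank(runs)
print(f"precision@3={p_at_3:.3f}, MRR={mrr:.3f}")
```

Precision@k rewards having many relevant passages near the top, while MRR only cares about how early the first relevant passage appears, so the two are usually reported together.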
Built the data engine behind a clinical trial: an end-to-end patient screening pipeline for Phase II/III studies, fusing EHR, lab, and imaging features with automated feature selection (Lasso, mutual information), machine learning classifiers, and calibration analysis to improve cohort matching. On top of that, Tableau dashboards giving trial teams live visibility into enrollment rates, cohort drift, and site performance — replacing the weekly manual reporting grind entirely.
U-Net and ResNet deep learning models trained to detect brain tumors and hemorrhagic strokes from MRI/CT scans, validated against radiologist-annotated ground truth using the Dice coefficient, IoU, and cross-validation. Confidence scoring added so radiologists could see how confident the model was in each prediction. The full inference pipeline containerized with Docker and integrated into a real clinical workflow for live diagnostic support.
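The two overlap metrics mentioned above are simple to state in code. This is a minimal pure-Python sketch on tiny binary masks; the function name, the empty-mask convention, and the toy masks are assumptions for illustration, and real pipelines would compute the same quantities on array data.

```python
def dice_and_iou(pred, truth):
    """Dice coefficient and IoU for two equally sized binary masks (lists of 0/1 rows)."""
    intersection = pred_sum = truth_sum = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += p & t
            pred_sum += p
            truth_sum += t
    union = pred_sum + truth_sum - intersection
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2 * intersection / (pred_sum + truth_sum) if (pred_sum + truth_sum) else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou


# Toy 3x3 masks, invented for illustration (1 = lesion pixel).
pred = [
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 0],
]
truth = [
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]

dice, iou = dice_and_iou(pred, truth)
print(f"Dice={dice:.3f}, IoU={iou:.3f}")
```

Dice is always at least as large as IoU for the same pair of masks, which is worth remembering when comparing results reported with different metrics.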
Thesis: Multi-label text classification using deep learning and extreme classification techniques for large-scale datasets.
Coursework: Machine Learning, Deep Learning, NLP, Statistical Inference, Data Structures & Algorithms, Computer Vision, Big Data Systems, Cloud Computing.
| Category | Technologies & Tools |
|---|---|
| Languages | Python, R, SQL, PySpark |
| ML / DL | TensorFlow, Keras, PyTorch, Scikit-Learn, XGBoost, LightGBM, Pandas, NumPy |
| Cloud & Big Data | AWS SageMaker, AWS Lambda, S3, EC2, Azure Databricks, Azure ML Studio, Hadoop, Spark MLlib |
| Medical AI | U-Net, ResNet, Clinical NLP, Survival Analysis, EHR Analytics |
| NLP & Gen AI | Transformers, BERT, GPT, LangChain, RAG |
| Clinical Data | SDTM, ADaM, CDISC, Clinical Trial Analytics, Biostatistics Pipelines |
| Tools | Docker, Kubernetes, Git, Jenkins, MLflow, Tableau, Power BI |
| Databases | PostgreSQL, MySQL, MongoDB |
| Regulatory | HIPAA, GDPR, FDA AI/ML Guidance, GxP |
If you work in healthcare AI, clinical data science, or pharma R&D — or you're just curious about any of it — I'd genuinely love to connect.