Member of Technical Staff at Datology working across multimodal pretraining, evaluation systems, research infrastructure, and data-centric ML.
I build systems and research loops for training, evaluating, and improving multimodal models at scale. My recent work sits at the intersection of algorithmic data mixing, evaluation quality, distributed training and inference infrastructure, and agentic tooling for research.
- Algorithmic multimodal pretraining data mixing
- Curating evals for better signal and coverage
- Research infrastructure for large-scale distributed training and vLLM-based eval inference
- Data pipeline infrastructure for large-scale processing and synthetic data generation, using vLLM on Ray, orchestrated by Kubernetes
- Agentic tooling for research, including harnesses with verifiable signals for automated research
At Datology, I work on multimodal model development and the systems around it: training pipelines, eval workflows, large-scale data processing, and the infrastructure needed to iterate quickly on pretraining and post-training decisions.
I am especially interested in building tight loops between:
- data mixture design and downstream capability
- eval coverage and trustworthy research signals
- research ideas and the infrastructure needed to test them quickly
Selected projects:
- benchmark-dataloader: benchmarking multimodal dataloaders to surface throughput and systems bottlenecks
- Multimodal Dataloaders Go Brrrrrrr: write-up on dataloader performance and why it matters for practical multimodal training
- UniCat: a stronger fusion baseline for multimodal re-identification, with the accompanying paper
- SpecReFlow: official implementation of the SPIE Photonics West 2023 work on reflection-aware video restoration
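To give a flavor of the kind of measurement benchmark-dataloader is concerned with, here is a minimal sketch of dataloader throughput timing. The stand-in loader, function names, and batch sizes are hypothetical (the actual project's API differs); the point is simply to show batches/sec and MB/sec as the two headline numbers:

```python
import time

def fake_dataloader(num_batches=50, batch_bytes=1 << 16):
    """Hypothetical stand-in for a real multimodal dataloader:
    yields fixed-size byte buffers instead of decoded samples."""
    payload = bytes(batch_bytes)
    for _ in range(num_batches):
        yield payload

def benchmark(loader):
    """Drain a dataloader and report throughput in batches/s and MB/s."""
    t0 = time.perf_counter()
    n_batches = 0
    n_bytes = 0
    for batch in loader:
        n_batches += 1
        n_bytes += len(batch)
    elapsed = time.perf_counter() - t0
    return {
        "batches": n_batches,
        "batches_per_s": n_batches / elapsed,
        "mb_per_s": n_bytes / elapsed / 1e6,
    }

stats = benchmark(fake_dataloader())
print(f"{stats['batches_per_s']:.0f} batches/s, {stats['mb_per_s']:.1f} MB/s")
```

In practice the interesting cases are real loaders under varying worker counts, decode settings, and storage backends, where these two numbers reveal whether training is data-bound.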
If you're working on multimodal pretraining, evaluation, or research infrastructure, feel free to connect.