Michael Goin @mgoin

systems engineer making inference fast

About

I've been working in ML inference since 2019. These days I'm a core maintainer of vLLM, focused on making SOTA open-source LLMs run fast on a variety of hardware accelerators.

I like working across the software stack wherever the bottleneck is - CPU, GPU, NPU, [compute, memory, io]-bound, etc. - using Python, C++, PyTorch, Triton, CUDA, and CUTLASS. Most of my time goes into profiling, benchmarking, and figuring out why things are slow. I enjoy learning about the latest hardware and working hard to utilize it fully.

Before that, my background was in HPC, where I worked on robotics, materials science, energy simulations, and neuromorphic computing at ORNL and UTK.

I'm currently working at Red Hat on vLLM to power the open-source AI ecosystem with fast and easy inference. Before its acquisition by Red Hat, I was at Neural Magic, where I worked on vLLM and originally built a sparsity-aware inference compiler that optimized CNNs, Transformers, and other models for CPUs.

If you want to reach me, the best way is to ping me @mgoin on vLLM Slack. I'm always happy to collaborate on projects or ideas related to inference performance!

Work

Changelog

Things I've shipped or helped ship.

Talks