LLM Inference Engineer | Optimizing inference systems on NVIDIA GPUs
MS Computer Science @ Florida Atlantic University (Dec 2025)
📧 [email protected] | LinkedIn
I optimize LLM inference on NVIDIA A100/H100 GPUs using TensorRT-LLM, vLLM, and low-level CUDA work.
Recent work:
- Speculative Decoding: 2.26× latency reduction on Qwen models using TensorRT-LLM on an A100 (toy sketch of the core loop below)
- Llama-3.1-8B on H100: 1,700 tok/s, 11 ms P99 TTFT, 94% GPU utilization
- Mixtral 8x7B: Distributed inference on dual A100s with expert + tensor parallelism
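A minimal sketch of what a two-GPU tensor-parallel setup can look like with vLLM's offline API; the model ID and parameter values here are illustrative placeholders, not the exact configuration from the Mixtral project above:

```python
from vllm import LLM, SamplingParams

# Illustrative two-GPU setup with vLLM's offline API; the model ID and
# parameter values are placeholders, not the project's exact configuration.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,    # shard weights (including MoE experts) across both GPUs
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```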
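And a toy, library-free sketch of the draft-and-verify loop behind speculative decoding (greedy-acceptance variant; the "models" here are stand-ins, and a real verifier checks all draft positions in a single batched forward pass of the target model):

```python
# Toy sketch of speculative decoding with greedy acceptance.
def draft_model(ctx: list[int]) -> int:
    return (ctx[-1] + 1) % 50                      # cheap proposer

def target_model(ctx: list[int]) -> int:
    if len(ctx) % 7 == 0:                          # occasionally disagrees
        return (ctx[-1] + 3) % 50
    return (ctx[-1] + 1) % 50                      # expensive "ground truth"

def speculative_step(ctx: list[int], k: int = 4) -> list[int]:
    """Draft k tokens cheaply, keep the prefix the target agrees with,
    and emit the target's own token at the first mismatch."""
    draft = []
    for _ in range(k):
        draft.append(draft_model(ctx + draft))
    accepted = []
    for tok in draft:
        expected = target_model(ctx + accepted)    # one verify per position here;
        if tok != expected:                        # batched in real systems
            accepted.append(expected)              # target's correction
            break
        accepted.append(tok)
    return accepted                                # always >= 1 token of progress

ctx = [0]
while len(ctx) < 32:
    ctx += speculative_step(ctx)
print(ctx)
```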
Inference Optimization: TensorRT-LLM, vLLM, Triton Inference Server, speculative decoding, quantization (FP8/INT8/AWQ), paged KV cache (toy sketch at the end of this section), FlashAttention
GPU Programming: CUDA 12.x, NVIDIA Nsight Systems/Compute, kernel-level profiling
Infrastructure: Docker, Kubernetes, FastAPI, AWS, Prometheus
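For the paged KV cache entry above, a toy sketch of the core idea (illustrative only, not vLLM's actual implementation): a block table maps each sequence's logical token positions onto fixed-size physical blocks allocated on demand, so memory isn't reserved up front for the maximum sequence length.

```python
# Toy sketch (not vLLM's implementation) of a paged KV cache block table.
BLOCK_SIZE = 16  # tokens per block; illustrative value

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def slot_for(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Return (physical_block, offset) where token `pos`'s KV entry lives."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE == len(table):        # first token of a new block:
            table.append(self.free_blocks.pop())   # allocate one on demand
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                 # decode 40 tokens for one sequence
    block, offset = cache.slot_for(seq_id=0, pos=pos)
print(cache.block_tables[0])          # 3 blocks cover 40 tokens at BLOCK_SIZE=16
```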