LLM Inference Engineer | Optimizing inference systems on NVIDIA GPUs
MS Computer Science @ Florida Atlantic University (Dec 2025)
📧 [email protected] | LinkedIn
I optimize LLM inference on NVIDIA A100/H100 GPUs using TensorRT-LLM, vLLM, and low-level CUDA work.
Recent work:
- Speculative Decoding: 2.26× latency reduction on Qwen models using TensorRT-LLM on an A100 (toy sketch of the core loop below)
- Llama-3.1-8B on H100: 1,700 tok/s, 11 ms P99 TTFT, 94% GPU utilization
- Mixtral 8x7B: Distributed inference on dual A100s with expert + tensor parallelism
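A minimal sketch of what a two-GPU tensor-parallel setup can look like with vLLM's offline API; the model ID and parameter values here are illustrative placeholders, not the exact configuration from the Mixtral project above:

```python
from vllm import LLM, SamplingParams

# Illustrative two-GPU setup with vLLM's offline API; the model ID and
# parameter values are placeholders, not the project's exact configuration.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,    # shard weights (including MoE experts) across both GPUs
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```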
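And a toy, library-free sketch of the draft-and-verify loop behind speculative decoding (greedy-acceptance variant; the "models" here are stand-ins, and a real verifier checks all draft positions in a single batched forward pass of the target model):

```python
# Toy sketch of speculative decoding with greedy acceptance.
def draft_model(ctx: list[int]) -> int:
    return (ctx[-1] + 1) % 50                      # cheap proposer

def target_model(ctx: list[int]) -> int:
    if len(ctx) % 7 == 0:                          # occasionally disagrees
        return (ctx[-1] + 3) % 50
    return (ctx[-1] + 1) % 50                      # expensive "ground truth"

def speculative_step(ctx: list[int], k: int = 4) -> list[int]:
    """Draft k tokens cheaply, keep the prefix the target agrees with,
    and emit the target's own token at the first mismatch."""
    draft = []
    for _ in range(k):
        draft.append(draft_model(ctx + draft))
    accepted = []
    for tok in draft:
        expected = target_model(ctx + accepted)    # one verify per position here;
        if tok != expected:                        # batched in real systems
            accepted.append(expected)              # target's correction
            break
        accepted.append(tok)
    return accepted                                # always >= 1 token of progress

ctx = [0]
while len(ctx) < 32:
    ctx += speculative_step(ctx)
print(ctx)
```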
Inference Optimization: TensorRT-LLM, vLLM, Triton Inference Server, speculative decoding, quantization (FP8/INT8/AWQ), paged KV cache (toy sketch at the end of this section), FlashAttention
GPU Programming: CUDA 12.x, NVIDIA Nsight Systems/Compute, kernel-level profiling
Infrastructure: Docker, Kubernetes, FastAPI, AWS, Prometheus
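For the paged KV cache entry above, a toy sketch of the core idea (illustrative only, not vLLM's actual implementation): a block table maps each sequence's logical token positions onto fixed-size physical blocks allocated on demand, so memory isn't reserved up front for the maximum sequence length.

```python
# Toy sketch (not vLLM's implementation) of a paged KV cache block table.
BLOCK_SIZE = 16  # tokens per block; illustrative value

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def slot_for(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Return (physical_block, offset) where token `pos`'s KV entry lives."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE == len(table):        # first token of a new block:
            table.append(self.free_blocks.pop())   # allocate one on demand
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                 # decode 40 tokens for one sequence
    block, offset = cache.slot_for(seq_id=0, pos=pos)
print(cache.block_tables[0])          # 3 blocks cover 40 tokens at BLOCK_SIZE=16
```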