
# Inesh Reddy Chappidi

**LLM Inference Engineer | Optimizing inference systems on NVIDIA GPUs**

MS Computer Science @ Florida Atlantic University (Dec 2025)
📧 [email protected] | LinkedIn


## What I Work On

I optimize LLM inference performance on NVIDIA A100/H100 GPUs using TensorRT-LLM, vLLM, and low-level CUDA optimization.

**Recent work:**

- **Speculative decoding:** 2.26× latency reduction on Qwen models using TensorRT-LLM on A100
- **Llama-3.1-8B on H100:** 1,700 tok/s, 11 ms P99 TTFT, 94% GPU utilization
- **Mixtral 8x7B:** distributed inference on dual A100s with expert + tensor parallelism
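The speculative-decoding result above rests on a simple loop: a cheap draft model proposes a few tokens, the target model verifies them, and only the agreed prefix is kept. A minimal sketch with mock next-token functions standing in for real models (greedy acceptance only; TensorRT-LLM's actual implementation batches verification on the GPU and supports sampled acceptance):

```python
# Toy sketch of greedy speculative decoding with deterministic mock models.
# In a real system the draft is a small LLM and the target is the full LLM;
# here both are simple next-token functions so the loop is easy to follow.

def draft_model(ctx):
    # Cheap draft (mock): predicts next token as last token + 1.
    return (ctx[-1] + 1) % 100

def target_model(ctx):
    # "Expensive" target (mock): same rule, but emits 0 after multiples of 7,
    # so the draft is sometimes wrong and gets rejected.
    if ctx[-1] % 7 == 0:
        return 0
    return (ctx[-1] + 1) % 100

def speculative_decode(prompt, n_tokens, k=4):
    """Generate n_tokens: draft proposes k tokens, target verifies them.

    The accepted prefix is the longest run where the target agrees with
    the draft; the first disagreement is replaced by the target's own
    token. Every emitted token therefore matches plain greedy target
    decoding exactly.
    """
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft proposes k tokens autoregressively (cheap).
        proposal = []
        ctx = list(out)
        for _ in range(k):
            t = draft_model(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies all k positions (one batched pass in practice).
        ctx = list(out)
        for t in proposal:
            expected = target_model(ctx)
            if expected == t:
                ctx.append(t)          # accept draft token
            else:
                ctx.append(expected)   # reject: keep target's token, stop
                break
        out = ctx
    return out[len(prompt):len(prompt) + n_tokens]
```

Because the output is identical to greedy target decoding, quality is unchanged; the latency win comes from verifying k draft tokens with one target pass instead of k sequential ones.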

## Technical Focus

**Inference Optimization:** TensorRT-LLM, vLLM, Triton Inference Server, speculative decoding, quantization (FP8/INT8/AWQ), paged KV cache, FlashAttention

**GPU Programming:** CUDA 12.x, NVIDIA Nsight Systems/Compute, kernel-level profiling

**Infrastructure:** Docker, Kubernetes, FastAPI, AWS, Prometheus
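Of the techniques listed above, quantization is the easiest to sketch. A toy symmetric per-tensor INT8 round trip (illustrative only; production stacks such as TensorRT-LLM use calibrated, often per-channel scales and fused dequantization kernels):

```python
# Symmetric per-tensor INT8 quantization: map floats to [-127, 127]
# with a single shared scale, then recover approximate floats.

def quantize_int8(weights):
    """Return (int8 values, scale) for a list of float weights."""
    amax = max(abs(w) for w in weights) or 1.0  # guard the all-zero case
    scale = amax / 127.0
    q = [max(-127, min(127, round(w * 127.0 / amax))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Invert the mapping; error per weight is at most ~scale/2."""
    return [x * scale for x in q]
```

The memory saving (4 bytes → 1 byte per weight) is what lets larger models fit in GPU memory; the accuracy cost is bounded by the quantization step, which is why calibration of the scale matters in practice.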

