I am Ziren Wang, a senior undergraduate in the Yao Class at Tsinghua University, majoring in Computer Science. I am a research intern in the Systems Lab at the University of Washington, advised by Prof. Baris Kasikci, and am also advised by Prof. Mingyu Gao at Tsinghua University.
I am currently developing a flexible, high-performance Python framework for LLM inference featuring fine-grained intra-GPU resource management. My research interests center on distributed systems and machine learning systems, with a focus on building efficient, scalable infrastructure for modern AI workloads.
I plan to pursue a Ph.D. in Computer Science starting in Fall 2025. If you are interested in my research, please feel free to contact me.
Publications
- NanoFlow: Towards Optimal Large Language Model Serving Throughput.
  Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Tian Tang, Qinyu Xu, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci.
  19th USENIX Symposium on Operating Systems Design and Implementation (OSDI '25), 2025. [PDF] [Code]
Projects
- LLM Inference [Code]
  I am leading the development of a flexible, high-performance Python framework for LLM inference with fine-grained intra-GPU resource management. We model SMs, memory bandwidth, and PCIe transfers as separable resources on independent streams, enabling kernel co-scheduling and overlap that maximize GPU utilization while minimizing cross-kernel interference.
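  The resource-separation idea can be sketched in a few lines of plain Python. This is a toy illustration, not the framework's actual API: the kernel names, utilization numbers, and the `assign_streams` helper are all hypothetical.

  ```python
  from collections import defaultdict

  # Toy sketch: classify each kernel by its dominant resource and place all
  # kernels sharing a dominant resource on the same logical stream, so
  # compute-bound, memory-bandwidth-bound, and PCIe-transfer kernels can
  # overlap with little cross-kernel interference.

  # Hypothetical kernel descriptors: (name, sm_util, mem_bw_util, pcie_util), each in [0, 1].
  KERNELS = [
      ("gemm",       0.9, 0.3, 0.0),  # compute-bound
      ("attention",  0.4, 0.8, 0.0),  # memory-bandwidth-bound
      ("kv_offload", 0.0, 0.1, 0.9),  # PCIe-transfer-bound
      ("layernorm",  0.2, 0.7, 0.0),
  ]

  RESOURCES = ("sm", "mem_bw", "pcie")

  def assign_streams(kernels):
      """Map each kernel to the logical stream of its dominant resource."""
      streams = defaultdict(list)
      for name, *utils in kernels:
          dominant = RESOURCES[max(range(len(utils)), key=lambda i: utils[i])]
          streams[dominant].append(name)
      return dict(streams)

  streams = assign_streams(KERNELS)
  # e.g. compute-bound "gemm" lands on the "sm" stream, "kv_offload" on "pcie"
  ```

  In the real system the streams would be hardware queues (e.g. CUDA streams) and the co-scheduler would also bound each stream's aggregate utilization; here the grouping alone conveys the separability assumption.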
- Network on Chip [PDF] [Code]
  I reproduced GOAL, the state-of-the-art routing algorithm for torus networks, implemented several virtual-channel (VC) control policies, and evaluated the results. GOAL achieves global load balance by randomly choosing the routing direction in each dimension, and local load balance by routing adaptively within the chosen directions.
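  The randomized direction choice can be sketched for one dimension of a k-ary torus. The function names are mine, and the weighting shown (minimal direction taken with probability (k − d)/k, where d is the minimal hop distance) is an illustrative GOAL-style choice rather than a transcription of the paper's code:

  ```python
  import random

  def choose_direction(src, dst, k, rng):
      """Pick a travel direction (+1 or -1) on a k-node ring.
      The minimal direction is taken with probability (k - d) / k, so long
      routes occasionally go the 'long way around', spreading load over both
      directions of the ring (global load balance)."""
      fwd = (dst - src) % k            # hops if we travel in the +1 direction
      if fwd == 0:
          return 0                     # already aligned in this dimension
      if fwd <= k - fwd:
          minimal, dist = +1, fwd
      else:
          minimal, dist = -1, k - fwd
      return minimal if rng.random() < (k - dist) / k else -minimal

  def route(src, dst, k, rng):
      """Walk one dimension until aligned; returns the hop count."""
      direction = choose_direction(src, dst, k, rng)
      pos, hops = src, 0
      while pos != dst:
          pos = (pos + direction) % k
          hops += 1
      return hops

  rng = random.Random(0)
  hops = route(src=1, dst=6, k=8, rng=rng)  # 3 hops (minimal) or 5 hops (non-minimal)
  ```

  In the full algorithm this choice is made once per dimension, and the adaptive part then picks, hop by hop, which dimension to advance based on local VC occupancy.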
- Backend Development [Slides]
  Our team built an enrollment system that lets administrators publish announcements and users take exams. The implementation also incorporates design techniques such as data masking and security-oriented design.
- Numerical Analysis [PDF]
  We presented a new parallel decomposition algorithm that combines the randomized sampling of RChol with a multifrontal factorization, dynamically managing the dependencies between threads and nodes. Experiments show that the algorithm substantially accelerates factorization for matrices with abundant parallelism, but offers little speedup for matrices that are inherently hard to parallelize.
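  The dependency management can be pictured as a bottom-up traversal of the factorization's task tree (in multifrontal methods, the elimination tree): a node may start only after all of its children finish, while independent subtrees run concurrently. The `factorize_tree` helper and tree encoding below are a hypothetical sketch, not the project's implementation:

  ```python
  from concurrent.futures import ThreadPoolExecutor

  def factorize_tree(children, root, work):
      """Process a task tree bottom-up. `children[n]` lists n's children and
      `work(n)` stands in for the per-node factorization step. Returns nodes
      in a completion order consistent with the dependencies."""
      order = []

      def visit(node):
          kids = children.get(node, [])
          if kids:
              # Children of the same node are independent: run them concurrently.
              with ThreadPoolExecutor(max_workers=len(kids)) as pool:
                  list(pool.map(visit, kids))
          work(node)          # all dependencies done; factor this node
          order.append(node)  # list.append is atomic under the GIL

      visit(root)
      return order

  # Tiny example tree: "root" depends on "a" and "b"; "a" depends on "a1", "a2".
  CHILDREN = {"root": ["a", "b"], "a": ["a1", "a2"]}
  order = factorize_tree(CHILDREN, "root", work=lambda n: None)
  ```

  This also makes the experimental finding intuitive: a wide, bushy tree keeps many threads busy, while a tall chain of dependent nodes serializes no matter how many threads are available.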