Jingwei's Homepage (Dr. Jingwei Li's Homepage)

QStore: Quantization-Aware Compressed Model Storage (2025-07-16)
https://jingwei87.github.io/2025/07/16/qstore

This paper has been pre-printed on arXiv, and the accompanying software is open-sourced on GitHub.

Problem

Many tasks require accessing the same large language models (LLMs) at different precision levels. For example, fine-tuning is often performed at a higher precision such as FP16, while the fine-tuned model is often quantized to a lower precision format such as INT8 for faster inference.

However, maintaining multiple versions of a model with varying precisions leads to prohibitive storage costs. A common approach is storing only the high-precision model and generating low-precision models on demand. This approach is inefficient because retrieving a low-precision model requires loading more data than necessary from the high-precision model and performing a computationally expensive quantization process.

The core problem is to reduce LLM storage costs while maintaining efficient retrieval of both high- and low-precision models.

Overview

This paper introduces QStore, a novel approach that improves storage efficiency by jointly compressing both high-precision and low-precision versions of the same model. The key insight is that the low-precision model is derived through quantization of the high-precision model, making it possible to leverage the information in the low-precision model to guide the compression of the high-precision model.

Specifically, this paper builds on the observation that two floating-point numbers that quantize to the same value (using the same quantization function) are likely to share many redundant bits. It proposes a two-step grouping approach for compressing the high-precision model. First, it groups the weights of the high-precision model according to the quantization function applied. Within each group, it further divides the weights into subgroups based on those that quantize to the same value in the low-precision model. Compression is then performed on a per-subgroup basis, as weights within the same subgroup are expected to share significant redundancy.
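The subgroup idea can be sketched in a few lines of Python. The symmetric round-to-nearest quantizer and the zlib backend below are illustrative stand-ins, not QStore's actual quantization function or compressor:

```python
import struct
import zlib
from collections import defaultdict

def quantize_int8(w, scale):
    """Toy symmetric quantizer mapping a float to an INT8 level
    (an assumption for illustration, not QStore's quantizer)."""
    return max(-128, min(127, round(w / scale)))

def group_and_compress(weights, scale):
    # Bucket high-precision weights by the low-precision value they
    # quantize to (the second grouping step described above).
    subgroups = defaultdict(list)
    for w in weights:
        subgroups[quantize_int8(w, scale)].append(w)
    # Compress each subgroup separately; weights that quantize to the
    # same value are expected to share redundant bits.
    compressed = {}
    for q, ws in subgroups.items():
        raw = struct.pack(f"<{len(ws)}f", *ws)
        compressed[q] = zlib.compress(raw)
    return compressed

weights = [0.101, 0.099, 0.102, 0.300, 0.298, -0.099]
out = group_and_compress(weights, scale=0.1)
# 0.101, 0.099, and 0.102 all quantize to level 1 and share one subgroup
```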

In addition, the low-precision model is directly compressed using Zstd, a state-of-the-art lossless compression algorithm. The likely reason for not using ZipNN (a lossless compression algorithm designed for LLMs) is that the weights in the low-precision model are integers (e.g., INT8 considered in this paper), whereas ZipNN is better suited for compressing LLMs with floating-point parameters.

Compression diagram of QStore (Obtained from the original paper https://arxiv.org/abs/2505.04081).

ZipNN: Lossless Compression for AI Models (2025-05-16)
https://jingwei87.github.io/2025/05/16/zipnn

This paper has been pre-printed on arXiv, and the accompanying software is open-sourced on GitHub.

Motivation

This paper focuses on lossless compression for neural network models, motivated by three key use cases.

  • Model hubs (e.g., Hugging Face, Model Zoo, PyTorch, TensorFlow) host a large number of models and serve numerous download requests for popular models. Lossless compression offers several benefits, including reducing the amount of data transferred during downloads, minimizing the storage footprint of hosted models, and decreasing the time required to upload and download these models.

  • Distributed training trains large models by distributing the computational workload across multiple devices, such as GPUs, TPUs or servers connected over a network, in order to address the resource limitations of single-device training. Lossless compression can play a crucial role in reducing the amount of data (e.g., model weights, optimizer weights and gradients) transferred between devices and mitigating network bottlenecks.

  • During training, multiple intermediate model versions are saved for various purposes, such as hyperparameter tuning, fault tolerance, analysis, and performance evaluation. Lossless compression helps minimize the storage footprint of these checkpoint versions.

Model Parameters

In neural networks, a layer is a fundamental unit responsible for performing specific transformations, such as fully connected layers, convolutional layers, or attention layers. Each layer typically includes multiple tensors, which are multi-dimensional arrays. These tensors store the model’s parameters—such as weights and biases—that are learned during the training process. This paper focuses on compressing the numeric parameters within these tensors.

The parameters are represented as floating-point numbers, which consist of three key components:

  • Exponent: Indicates the range within which the real number lies.
  • Fraction: Specifies the exact value within the given range.
  • Sign bit: Denotes whether the number is positive or negative.

The real number is calculated as (-1)^{sign} * 2^{exponent-bias} * (1.fraction), where the bias is 127 for the 8-bit exponents of FP32 and BF16 (and 15 for FP16's 5-bit exponent). Models are trained with various standards that represent these floating-point numbers in different ways.

Representation  Sign   Exponent  Fraction
FP32            1 bit  8 bits    23 bits
BF16            1 bit  8 bits    7 bits
FP16            1 bit  5 bits    10 bits
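The FP32 layout in the table, and the reconstruction formula above it, can be verified directly with Python's struct module:

```python
import struct

def fp32_fields(x):
    """Split a Python float (viewed as FP32) into its sign, exponent,
    and fraction bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8-bit exponent, bias 127
    fraction = bits & 0x7FFFFF          # 23-bit fraction
    return sign, exponent, fraction

def fp32_value(sign, exponent, fraction):
    # Reconstruct the normalized value: (-1)^sign * 2^(exponent-127) * (1.fraction)
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2**23)

s, e, f = fp32_fields(-6.5)
assert (s, e, f) == (1, 129, 0x500000)   # -6.5 = -1 * 2^2 * 1.625
assert fp32_value(s, e, f) == -6.5
```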

Key Observations

This paper identifies two key observations regarding the compressibility of parameters:

  • Limited effectiveness of LZ compression. Traditional LZ compression algorithms, which rely on removing repetitions, yield minimal storage savings. This is because tensors in neural networks are inherently noisy, and their parameters typically lack any meaningful affinity with neighboring values.
  • Skewness in the exponent part. The exponent values exhibit a highly skewed distribution. Some exponent values occur with significantly higher probabilities, while others are rarely observed. A plausible explanation is that model weights are typically initialized within the range [-1, +1], and the training process rarely pushes these weights beyond that range.

Main Idea

This paper presents ZipNN, an approach designed to compress both regular models (unmodified after training) and clean models (with parameters rounded after training). The core innovation of ZipNN lies in byte grouping, a technique that organizes model parameters for more efficient compression. Specifically:

  • The exponent parts of different parameters are grouped together.
  • Similarly, each byte of the fraction parts from different parameters is grouped together.

After grouping, ZipNN applies either Zstd (a combination of LZ compression and Huffman encoding) or Huffman encoding to each group of bytes, achieving significant compression.

The design rationale is twofold. First, guided by the second observation, grouping exponents together for entropy encoding compression (e.g., Huffman encoding) is expected to achieve significant storage savings. This is effective for both regular and clean models. Second, in clean models, the rounding of the fraction part results in the least significant bits being zeros, while the most significant bits remain relatively random. By grouping the fraction part byte-by-byte for compression, additional storage savings can be achieved, particularly for the least significant bits in clean models.
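A minimal sketch of byte grouping for FP32 tensors, using only the standard library (ZipNN's actual implementation differs):

```python
import struct

def byte_group(values):
    """Rearrange FP32 values so that byte i of every value is stored
    contiguously (byte grouping, sketched)."""
    data = struct.pack(f">{len(values)}f", *values)
    # Group 0 collects the sign bit plus most of the exponent bits.
    return [data[i::4] for i in range(4)]

def ungroup(groups):
    """Inverse of byte_group: re-interleave the byte groups."""
    n = len(groups[0])
    data = bytes(b for i in range(n) for b in (g[i] for g in groups))
    return list(struct.unpack(f">{n}f", data))

vals = [0.5, 0.75, 0.625, 0.875]
groups = byte_group(vals)
assert ungroup(groups) == vals           # lossless round trip
# Weights in [-1, 1] share exponents, so group 0 is highly repetitive
# and compresses well under entropy coding:
assert groups[0] == b"\x3f\x3f\x3f\x3f"
```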

Additional Observations

This paper provides additional insights into the compression of gradients and optimizers, as well as delta compression on checkpoint models.

Gradients represent the direction and magnitude of change required for each parameter to reduce the training loss. Optimizers use gradients to update the model’s parameters during training while maintaining additional state to improve convergence. This paper observes that gradients and optimizers are more compressible than the model itself. This is primarily due to the extreme compressibility of the token embeddings layer within gradients and optimizers.

Delta compression, which involves XORing consecutive checkpoint models and further compressing the resulting XOR deltas, offers more storage savings compared to standalone compression. This paper highlights that during the training process, although all parameters are updated in each epoch, an increasing number of bytes remain unchanged as the model approaches convergence. This increases the effectiveness of delta compression over time.
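The effect can be demonstrated with a toy example, using zlib as a stdlib stand-in for the compressor applied to the XOR deltas:

```python
import random
import zlib

def xor_delta(prev, curr):
    """XOR two equal-length checkpoint byte strings; bytes that did not
    change between checkpoints become zero."""
    return bytes(a ^ b for a, b in zip(prev, curr))

# Toy "checkpoints": near convergence, only a few bytes change per epoch.
rng = random.Random(0)
ckpt1 = bytes(rng.getrandbits(8) for _ in range(4096))
ckpt2 = bytearray(ckpt1)
ckpt2[100] ^= 0xFF                       # a single changed byte
delta = xor_delta(ckpt1, bytes(ckpt2))

# The delta is almost entirely zeros, so it compresses far better than
# the raw (incompressible) checkpoint does on its own.
assert len(zlib.compress(delta)) < len(zlib.compress(ckpt1))
```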

IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference (2025-04-03)
https://jingwei87.github.io/2025/04/03/impress

This paper, published in FAST’25, aims to improve the inference performance of large language models (LLMs).

LLM Inference Basics

LLMs are artificial intelligence systems designed to understand and generate human language. They learn patterns in language, grammar, context, and general knowledge by training on vast datasets of text from various sources. An LLM’s inference workflow is the process of generating responses from an input prompt after pre-training. It begins by breaking down the input text into tokens, which can be words, subwords, or characters. This tokenization enables the LLM to process diverse vocabularies and writing styles. The tokens are then converted into numerical embeddings (i.e., dense vector representations) that capture the meanings and relationships among tokens.

These embeddings enter the LLM’s input layer, which processes them before they move through the transformer architecture. Specifically, the embeddings then pass through multiple transformer layers, with each layer containing a self-attention mechanism (which analyzes the relationships between tokens in the input) and a feed-forward neural network (which improves the LLM’s ability to make predictions).

After processing through the stacked transformer layers, the final states pass through an output layer, which converts the high-dimensional representations into a probability distribution across the vocabulary to predict the next token. This newly generated token joins the original input sequence, creating an extended input for the next iteration. The LLM repeatedly processes the concatenated sequence through the same steps: embedding, transformer layer processing, and output layer prediction, until reaching a predefined stopping condition. Finally, the generated sequence of tokens is converted back into human-readable text for the response.
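The autoregressive loop described above can be sketched as follows, with a `next_token` callback standing in for the entire embedding/transformer/output-layer stack:

```python
def generate(prompt, next_token, eos, max_new_tokens=8):
    """Skeleton of the autoregressive inference loop. `next_token`
    abstracts the model: embedding, transformer layers, output layer."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        t = next_token(tokens)       # predict from the concatenated sequence
        tokens.append(t)             # extend the input for the next iteration
        if t == eos:                 # predefined stopping condition
            break
    return tokens

# Toy predictor standing in for a real model: a fixed bigram table.
bigram = {"the": "cat", "cat": "sat", "sat": "<eos>"}
out = generate(["the"], lambda ts: bigram.get(ts[-1], "<eos>"), eos="<eos>")
assert out == ["the", "cat", "sat", "<eos>"]
```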

Problem

The responsiveness of LLM applications significantly impacts user experience. A key metric is the time-to-first-token (TTFT), which measures the latency from processing initial input texts until the first output token is generated. However, many LLM applications prepend user queries with context-rich prefixes. While improving response quality, these prefixes increase the input sequence size, consequently extending the critical TTFT.

Existing studies often employ prefix key-value (KV) caches, which store the computed states for common prefixes to avoid re-computation when these prefixes are reused. However, due to the substantial storage footprints of prefix KVs, managing them entirely within the limited space of GPUs or CPUs is impractical. On the other hand, storing prefix KVs on SSDs results in significant I/O latencies to load them into the GPU for processing, potentially increasing the TTFT even further.

This paper aims to improve prefix KV management to reduce TTFT.

Main Idea

This paper builds on the previous observation that the KV pairs associated with different tokens vary in their importance for maintaining LLM accuracy during inference. It focuses on minimizing I/O during prefix reuse, and proposes selectively retrieving only the KVs deemed important, thereby avoiding the latency associated with loading and processing less critical KV pairs from storage.

This paper introduces IMPRESS, an importance-aware prefix KV cache management approach. For an input sequence S comprising a prefix (P) and a query (Q), denoted S=P||Q, IMPRESS first identifies the longest common prefix (R) shared between P and previously cached prefixes. Then, it selectively directs only specific KVs to the LLM for processing: those corresponding to (i) important tokens within the matched prefix R; (ii) tokens unique to the current prefix (P−R); and (iii) the query tokens Q. KVs corresponding to unimportant tokens within R are intentionally omitted to reduce I/O overhead.
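A toy sketch of this selection logic (the importance predicate is assumed given here; IMPRESS derives it from attention scores, as described below):

```python
def select_kv_tokens(prefix, query, cached_prefixes, important):
    """Select the token positions whose KVs are sent to the LLM
    (a simplified sketch of IMPRESS's selection)."""
    # R: the longest cached prefix that is also a prefix of P.
    r_len = max((len(c) for c in cached_prefixes
                 if prefix[:len(c)] == c), default=0)
    selected = []
    # (i) important tokens within the matched prefix R
    selected += [(i, prefix[i]) for i in range(r_len) if important(i)]
    # (ii) tokens unique to the current prefix, P - R
    selected += [(i, prefix[i]) for i in range(r_len, len(prefix))]
    # (iii) the query tokens Q
    selected += [(len(prefix) + i, q) for i, q in enumerate(query)]
    return selected

P = ["sys", "doc1", "doc2", "new"]
cached = [["sys", "doc1", "doc2"], ["sys", "other"]]
# Position 1 is deemed unimportant, so its KV is skipped entirely.
picked = select_kv_tokens(P, ["q1"], cached, important=lambda i: i != 1)
assert picked == [(0, "sys"), (2, "doc2"), (3, "new"), (4, "q1")]
```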

Design Summary

IMPRESS faces two design challenges: (i) efficiently identifying important tokens and (ii) optimizing the management of prefix KVs.

To address challenge (i), IMPRESS leverages a key observation: the relative importance ranking between tokens often remains consistent across different attention heads (which are parallel instances of the self-attention mechanism, i.e., the self-attention layer typically consists of multiple attention heads, each capable of learning different patterns or relationships within the input sequence). Thus, IMPRESS proposes checking the importance of tokens in a limited number of selected heads and using the identified important tokens from these heads to approximate those in the remaining heads. This design approach mitigates I/O overhead by only requiring the keys of the selected heads to be loaded into the GPU to identify the important tokens, rather than loading those of all heads.
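The approximation can be illustrated with per-head token scores (the scores and the choice of selected heads below are made up for illustration):

```python
def important_tokens(head_scores, selected_heads, k):
    """Approximate the globally important token positions by ranking
    only on a few selected heads (a sketch; real scores would come
    from attention weights)."""
    n = len(next(iter(head_scores.values())))
    totals = [0.0] * n
    for h in selected_heads:
        for i, s in enumerate(head_scores[h]):
            totals[i] += s
    # Top-k token positions by summed score over the selected heads.
    return sorted(sorted(range(n), key=lambda i: totals[i], reverse=True)[:k])

# Per-head token scores; the rankings are similar across heads, matching
# the observation that importance rankings are consistent between heads.
scores = {
    0: [0.9, 0.1, 0.7, 0.2],
    1: [0.8, 0.2, 0.6, 0.1],
    2: [0.7, 0.3, 0.9, 0.2],  # never loaded; approximated by heads 0 and 1
}
assert important_tokens(scores, selected_heads=[0, 1], k=2) == [0, 2]
```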

To address challenge (ii), IMPRESS introduces two approaches. The first approach, KV reordering, addresses the inefficiency of retrieving KVs when those of important tokens are scattered across different chunks (the basic unit of storage management). By reordering the KVs within chunks based on token importance, IMPRESS physically consolidates the KVs of important tokens into fewer chunks. This spatial locality allows retrieval of many important KVs with fewer I/O operations. However, global reordering could disrupt the structure of the radix tree, which is used for efficiently finding the longest common prefix among cached sequences. IMPRESS therefore confines the reordering scope to tokens within the same node of the radix tree.
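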

The second approach, score-based cache management, augments traditional cache replacement policies with token importance. Traditional policies often prioritize based solely on access recency or frequency (e.g., least recently used). This can lead to the premature eviction of chunks containing highly important KVs if those chunks have not been accessed recently. IMPRESS assigns each chunk a score that considers both its access frequency and the proportion of important KVs it contains. Cache eviction decisions are then based on this combined score, ensuring that chunks rich in important KVs are preferentially retained in the fast cache tier.
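A minimal sketch of such a combined score (the linear weighting below is illustrative, not the paper's exact formula):

```python
def eviction_victim(chunks, alpha=0.5):
    """Pick the chunk to evict: the lowest combined score of access
    frequency and fraction of important KVs."""
    def score(c):
        # Illustrative combination; a real system would normalize terms.
        return alpha * c["accesses"] + (1 - alpha) * c["important_ratio"]
    return min(chunks, key=lambda c: (score(c), c["id"]))

chunks = [
    {"id": "A", "accesses": 5, "important_ratio": 0.1},
    {"id": "B", "accesses": 1, "important_ratio": 0.9},  # rarely accessed, but important
    {"id": "C", "accesses": 1, "important_ratio": 0.1},
]
# A frequency-only policy might evict B; the combined score spares it.
assert eviction_victim(chunks)["id"] == "C"
```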

MedFS: Pursuing Low Update Overhead via Metadata-Enabled Delta Compression for Log-structured File System on Mobile Device (2025-04-02)
https://jingwei87.github.io/2025/04/02/medfs

This paper was published in FAST’25. It explores the use of delta compression to reduce write traffic on mobile devices’ flash storage. Delta compression works by maintaining only the XORed data (called a delta chunk) between the original memory page and its updated version. When the original and updated pages are similar, the delta chunk is typically small, often just a few dozen bytes for a 4KB page. This approach minimizes written data and reduces daily full disk writes, extending flash storage lifespan. The paper’s analysis shows that 77.1% of write traffic in real-world applications comes from file updates, where the difference between updated and original content is only 13.8%, making delta compression particularly effective.

However, delta compression poses two key challenges. First, it requires additional metadata to link the original page with the corresponding delta chunk. Second, it creates write/read amplification because the system must process three components for writes/reads: the original page, the delta chunk, and related metadata (such as the in-page offset and size of the delta chunk).
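The XOR-delta-plus-metadata scheme can be sketched as follows (a simplified model for illustration, not MedFS's actual on-disk format):

```python
def make_delta(base, updated):
    """XOR a base page with its update and keep only the non-zero span,
    together with the metadata MedFS tracks: in-page offset and size."""
    xor = bytes(a ^ b for a, b in zip(base, updated))
    nz = [i for i, b in enumerate(xor) if b]
    if not nz:
        return {"offset": 0, "size": 0, "delta": b""}
    off, end = nz[0], nz[-1] + 1
    return {"offset": off, "size": end - off, "delta": xor[off:end]}

def apply_delta(base, meta):
    """Reconstruct the updated page from the base page and the delta chunk."""
    page = bytearray(base)
    for i, b in enumerate(meta["delta"]):
        page[meta["offset"] + i] ^= b
    return bytes(page)

base = bytes(4096)                      # a 4KB page of zeros
updated = bytearray(base)
updated[100:104] = b"edit"              # a small localized update
meta = make_delta(base, bytes(updated))
assert meta["offset"] == 100 and meta["size"] == 4
assert apply_delta(base, meta) == bytes(updated)
```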

This paper aims to address these challenges and enhance the lifetime of flash storage via delta compression.

Flash-friendly File System

This paper focuses on F2FS (flash-friendly file system), developed by Samsung to optimize the performance and lifespan of NAND flash memory (a type of non-volatile storage technology that retains data even when the power is turned off) storage devices. F2FS uses a log-structured design, writing data sequentially, and a node-based structure to organize files. Specifically, F2FS includes three types of nodes: inode, direct node and indirect node. An inode block is allocated 4KB and contains 923 data block pointers in its inline area, which can collectively index a file of up to 3.69MB (923 × 4KB).

Main Idea

The idea is inspired by the observation that 90% of files generated or updated in modern applications are small files (i.e., less than 3.69MB). This indicates that the inline area in F2FS is often underutilized. This paper proposes maintaining deltas in the file’s inline area to improve space utilization. Once the inode is retrieved during the first file operation, the following file operations are highly likely to hit the inode in the cache. Storing deltas within the inode can effectively leverage high cache hit rates to reduce the access overhead of delta chunks.

Design Summary

This paper presents MedFS to realize the above idea in F2FS. MedFS includes two key design components: (i) delta chunk inlining (DCI), which processes delta compression and manages chunks in the inline area, and (ii) delta chunk maintenance (DCM), which manages delta chunks in persistent storage.

Delta Chunk Inlining

DCI associates each delta chunk with two metadata fields: delta index that indicates the page address of its base page and delta size that represents the size of the delta chunk. DCI organizes these delta chunks and their metadata in the inline area, writing them from tail to head. The rationale is that as more data is added to a file, the pointer area (which contains pointers to data blocks or direct/indirect nodes) grows, and this tail-to-head organization ensures that pointer area growth does not interfere with the delta chunks.
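The tail-to-head layout can be modeled with a toy allocator (the field and pointer sizes below are illustrative):

```python
class InlineArea:
    """Toy model of DCI's layout: the pointer area grows from the head
    while delta chunks are appended from the tail."""
    def __init__(self, size):
        self.size, self.head, self.tail = size, 0, size

    def add_pointer(self):           # pointer area grows head -> tail
        if self.head + 4 > self.tail:
            return False             # inline area exhausted
        self.head += 4
        return True

    def add_delta(self, n):          # delta chunks grow tail -> head
        if self.tail - n < self.head:
            return False             # no room: DCI would evict (FIFO)
        self.tail -= n
        return True

area = InlineArea(64)
assert area.add_delta(16)            # a delta chunk stored at the tail
for _ in range(12):                  # the file grows: 12 pointers = 48 bytes
    assert area.add_pointer()
assert not area.add_pointer()        # head would cross into the delta region
```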

In addition, DCI manages delta chunks within the size constraints of the inline area. First, as the file size increases, the available space in the inline area decreases, prompting DCI to evict delta chunks using a FIFO strategy. However, it remains unclear in this paper whether DCI reconstructs the full page for evicted delta chunks (disabling delta compression for such pages) or combines multiple delta chunks for storage in persistent storage, as processed by DCM (see below).

Second, when the file receives a new delta chunk, DCI determines whether to replace an existing delta chunk by comparing their respective benefits. It normalizes the I/O latency of writing a new page based on the ratio of the size of the new delta chunk to the average size of all delta chunks. It also models the I/O latency overhead of replacement, which includes retrieving the base page of the existing delta chunk, performing decompression, and writing the decompressed full page into persistent storage. If the benefit of replacing the existing delta chunk outweighs the associated overhead, DCI performs the replacement.

Delta Chunk Maintenance

DCM manages delta chunks that are evicted or replaced from the inline area. It compacts these chunks into a compact page to maintain delta compression benefits. However, reading from compact pages causes read amplification, as it requires retrieving both delta chunks and their corresponding base pages (which are stored separately) for decompression. DCM determines whether to create compact pages based on file access patterns. For write-hot and read-cold files, DCM creates compact pages to reduce write I/O traffic. For other files, DCM simply flushes uncompressed data pages to avoid read amplification.

For decision-making, DCM uses a dynamic file hotness clustering approach. It tracks each file’s read and write counts in real-time, along with their most recent I/O latencies, storing this information within the inode. Based on average read and write times, it categorizes files into four groups: read-hot/write-hot, read-hot/write-cold, read-cold/write-cold, and read-cold/write-hot. During system idle periods, DCM restores full pages for read-hot files by decompressing their delta chunks from compact pages.
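A simplified sketch of the four-way clustering, thresholding on average counts (DCM's real policy also factors in recent I/O latencies):

```python
def classify(files):
    """Cluster files into the four hotness groups by comparing each
    file's read/write counts against the averages (simplified)."""
    avg_r = sum(f["reads"] for f in files) / len(files)
    avg_w = sum(f["writes"] for f in files) / len(files)
    groups = {}
    for f in files:
        r = "read-hot" if f["reads"] > avg_r else "read-cold"
        w = "write-hot" if f["writes"] > avg_w else "write-cold"
        groups[f["name"]] = f"{r}/{w}"
    return groups

files = [
    {"name": "log", "reads": 1,  "writes": 90},
    {"name": "db",  "reads": 80, "writes": 85},
    {"name": "cfg", "reads": 70, "writes": 2},
    {"name": "tmp", "reads": 1,  "writes": 1},
]
g = classify(files)
# Write-hot, read-cold files are the candidates for compact pages.
assert g["log"] == "read-cold/write-hot"
assert g["cfg"] == "read-hot/write-cold"
```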
