StreamingTOM: Streaming TOken coMpression for Efficient Video Understanding

Xueyi Chen1,2  Keda Tao1,3  Kele Shao1,4,3  Huan Wang1*
CVPR 2026
1Westlake University 2The Chinese University of Hong Kong 3Zhejiang University 4SII
*Corresponding author: [email protected]
StreamingTOM Teaser
Left: StreamingTOM (streaming token compression) is a training-free, two-stage framework for efficient streaming video understanding: Causal Temporal Reduction selects pre-LLM tokens with a causal, fixed-budget policy, and Online Quantized Memory bounds the active kv-cache via 4-bit quantization with on-demand retrieval. Right: On LLaVA-OV-7B, StreamingTOM achieves a 15.7× kv-cache compression ratio; compared with the prior SOTA (LiveVLM), it delivers 1.2× lower peak memory and 2× faster time-to-first-token (TTFT).

Abstract

Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to the future frames that offline methods exploit, while accumulation lets tokens grow without bound, creating efficiency bottlenecks. Yet existing approaches regulate only the post-LLM kv-cache, leaving the costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both the pre-LLM and post-LLM bottlenecks. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, so only a compact subset of visual tokens enters prefill, drastically reducing per-frame cost and ensuring predictable latency. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments show a 15.7× kv-cache compression ratio; compared with the prior SOTA (LiveVLM), StreamingTOM delivers 1.2× lower peak memory and 2× faster time-to-first-token (TTFT). It also achieves state-of-the-art accuracy among training-free methods, averaging 63.8% on offline benchmarks and 55.8% accuracy with a 3.7 score on RVS. These results demonstrate that real-time streaming video understanding with bounded active memory is achievable without model retraining.

Method Overview

StreamingTOM Architecture Overview
StreamingTOM Architecture: The framework consists of two coordinated pipelines. The vision pipeline encodes each frame and applies Causal Temporal Reduction to condense redundant tokens into compact groups, which are written to an online memory for reuse. The query pipeline processes questions and drives the decoder to interact with the memory through Online Quantized Memory, which stores groups at 4-bit precision, retrieves at most k groups on demand, and dequantizes them for efficient generation.
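The storage-and-retrieval behavior described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes per-group symmetric 4-bit quantization and a mean-pooled key for retrieval, and the names (`QuantizedMemory`, `quantize_4bit`, `k`) are illustrative only.

```python
import numpy as np

def quantize_4bit(x):
    """Symmetric 4-bit quantization: map floats to integers in [-8, 7]."""
    scale = np.abs(x).max() / 7.0 + 1e-8
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    """Reconstruct approximate float values from 4-bit codes."""
    return q.astype(np.float32) * scale

class QuantizedMemory:
    """Toy online memory: stores token groups at 4-bit precision and
    retrieves at most k groups on demand, dequantizing only those."""
    def __init__(self, k=2):
        self.k = k
        self.groups = []  # list of (codes, scale, mean key used for retrieval)

    def write(self, kv):
        """kv: (tokens, dim) float32 group produced for one frame."""
        q, s = quantize_4bit(kv)
        self.groups.append((q, s, kv.mean(axis=0)))

    def retrieve(self, query):
        """query: (dim,) float32. Score groups by dot product with their
        mean key, dequantize only the top-k — the rest stay at 4 bits."""
        scores = [float(query @ key) for _, _, key in self.groups]
        top = np.argsort(scores)[-self.k:]
        return [dequantize_4bit(self.groups[i][0], self.groups[i][1])
                for i in top]
```

Because only the retrieved groups are dequantized, the active (full-precision) cache stays bounded by `k` groups no matter how long the stream runs, which is the property the caption attributes to Online Quantized Memory.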

Causal Temporal Reduction (CTR) Pipeline

CTR Pipeline Details
CTR compression pipeline: The algorithm processes visual tokens from consecutive frames using a two-frame window (current and previous), classifying each token as static or dynamic via similarity comparison. Adaptive budget allocation then distributes the compression budget based on content, followed by dual-path processing: DPC clustering for static tokens and attention-based selection for dynamic tokens.
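The static/dynamic split under a fixed budget can be sketched as follows. This is a simplified stand-in, not the paper's algorithm: it uses per-position cosine similarity against the previous frame to label tokens, token norm as a proxy saliency score (the pipeline above uses attention-based selection and DPC clustering), and the threshold `sim_thresh` is an assumed hyperparameter.

```python
import numpy as np

def ctr_select(tokens, prev_tokens, budget, sim_thresh=0.9):
    """Toy causal temporal reduction with a fixed per-frame budget.

    tokens, prev_tokens: (N, D) visual tokens of the current / previous
    frame (a 2-frame causal window — no future frames are touched).
    Returns the kept tokens and their indices, at most `budget` of them.
    """
    # Cosine similarity with the token at the same spatial position
    # in the previous frame: high similarity => static, low => dynamic.
    num = (tokens * prev_tokens).sum(-1)
    den = (np.linalg.norm(tokens, axis=-1)
           * np.linalg.norm(prev_tokens, axis=-1) + 1e-8)
    sim = num / den
    dynamic = np.where(sim < sim_thresh)[0]
    static = np.where(sim >= sim_thresh)[0]

    # Proxy saliency: token norm. Dynamic tokens are prioritized,
    # so changed content survives even under a tight budget.
    saliency = np.linalg.norm(tokens, axis=-1)
    order = np.concatenate([
        dynamic[np.argsort(-saliency[dynamic])],
        static[np.argsort(-saliency[static])],
    ])
    keep = np.sort(order[:budget])  # fixed budget => predictable prefill cost
    return tokens[keep], keep
```

Because the budget is a constant per frame, the number of tokens entering prefill (and hence per-frame latency) is bounded regardless of how busy the scene is.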

Experimental Results

Offline Video Understanding Benchmarks

Table 1: Offline evaluation results

Table 1: Offline evaluation across three long-video benchmarks. StreamingTOM achieves state-of-the-art performance among training-free streaming methods with 63.8% average accuracy, outperforming both StreamMem (63.1%) and LiveVLM (60.9%). The method demonstrates consistent improvements across VideoMME, MLVU, and EgoSchema benchmarks.

Streaming Video Understanding (RVS Benchmarks)

Table 2: RVS streaming results

Table 2: Evaluation results on RVS streaming benchmarks under memory-constrained settings (28GB GPU memory limit). StreamingTOM achieves 55.8% average accuracy and 3.7 average score, outperforming all other training-free methods without CPU offloading. This demonstrates the effectiveness of our two-stage compression approach for practical streaming deployment.

Efficiency Analysis

Processing timeline comparison
Complete processing timeline comparison. The top row shows the baseline; the bottom row shows StreamingTOM. For a 64-frame stream at batch size 8 with 50 tokens per frame, OQM introduces only modest overhead: 7.3 ms for kv storage, 6.9 ms for retrieval, and 28.7 ms for 4-bit reconstruction. In contrast, CTR delivers a 3.6× prefill speedup (337.8 ms → 92.8 ms), yielding an efficient 0.20 s query TTFT: substantial gains at negligible cost.

Qualitative Results

Qualitative case study on RVS Movie
Two qualitative examples from RVS Movie requiring long-horizon reasoning. StreamingTOM provides faithful answers consistent with ground truth, accurately capturing fine-grained semantics such as "subway station" and multi-party interactions. These results illustrate the model's ability to maintain causal reasoning and long-horizon consistency in real streaming scenarios.

BibTeX

@inproceedings{chen2026streamingtom,
  title={StreamingTOM: Streaming Token Compression for Efficient Video Understanding},
  author={Chen, Xueyi and Tao, Keda and Shao, Kele and Wang, Huan},
  booktitle={CVPR},
  year={2026}
}