Causal Temporal Reduction (CTR) Pipeline
Table 1: Offline evaluation on three long-video benchmarks (VideoMME, MLVU, and EgoSchema). StreamingTOM reaches 63.8% average accuracy, the best among training-free streaming methods, ahead of StreamMem (63.1%, a 0.7-point margin) and LiveVLM (60.9%, a 2.9-point margin). The improvement is consistent across all three benchmarks.
Table 2: Results on the RVS streaming benchmarks under a memory-constrained setting (28 GB GPU memory limit). StreamingTOM achieves 55.8% average accuracy and an average score of 3.7, outperforming all other training-free methods without CPU offloading. These results support the effectiveness of our two-stage compression approach for practical streaming deployment.
@inproceedings{chen2026streamingtom,
title={StreamingTOM: Streaming Token Compression for Efficient Video Understanding},
author={Chen, Xueyi and Tao, Keda and Shao, Kele and Wang, Huan},
booktitle={CVPR},
year={2026}
}