Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting
Meta Learning for Time Series Forecasting: we add the code for meta-learning-based model selection used in the paper. You can:
- Run meta learning experiments:
python meta/run.py --mode simple --test_dataset ETTh2 --meta_model_type mlp
- Extract meta-features for datasets:
python meta/meta_features/get_meta_features_LTF.py --meta_feature_type tabpfn
- Apply meta selection to new datasets:
python meta/run_custom.py --new_dataset my_dataset --checkpoint_path <path> --new_dataset_path <csv_path> --scripts_root <scripts_dir>
We add distribution-plot analyses of meta-features extracted by our TabPFN-based method and by other statistical methods. We found that the meta-features extracted by TabPFN follow a markedly more normal distribution.
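As a toy illustration of how such a normality comparison can be run (synthetic arrays standing in for real meta-features; `scipy` assumed available — this is not part of the released code):

```python
# Illustrative sketch: compare how close two sets of "meta-features" are to a
# normal distribution using the Shapiro-Wilk test. The arrays are synthetic
# stand-ins, not actual TabPFN or statistical meta-features.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
tabpfn_like = rng.normal(0.0, 1.0, size=200)       # stand-in for TabPFN meta-features
statistical_like = rng.exponential(1.0, size=200)  # stand-in for skewed statistical features

for name, feats in [("tabpfn", tabpfn_like), ("statistical", statistical_like)]:
    stat, p = stats.shapiro(feats)
    # A higher p-value means less evidence against normality.
    print(f"{name}: Shapiro-Wilk W={stat:.3f}, p={p:.3g}")
```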
Official implementation of TSCOMP.
As the field of multivariate time series forecasting (MTSF) continues to diversify across Transformers, MLPs, Large Language Models (LLMs), and Time Series Foundation Models (TSFMs), existing studies typically address concerns about methodological effectiveness by conducting large-scale benchmarks. These studies consistently indicate that no single approach dominates across all scenarios.
However, existing benchmarks typically evaluate models holistically, failing to analyze the multi-level hierarchy of MTSF pipelines. Consequently, the contributions of internal mechanisms remain obscured, hindering the combination of effective designs into superior solutions.
To bridge these gaps, we propose TSCOMP, a comprehensive framework designed to systematically deconstruct and benchmark deep MTSF methods. Instead of viewing models as indivisible black boxes, TSCOMP performs a hierarchical deconstruction across three levels: the Pipeline, Component Dimensions, and Deconstructed Components.
- Comprehensive benchmark via hierarchical deconstruction: We propose TSCOMP, the first large-scale benchmark that systematically deconstructs deep MTSF methods. TSCOMP examines the MTSF workflow through a hierarchical design space, spanning from the overall modeling pipeline to fine-grained specific components. To rigorously assess these elements, we design a constrained orthogonal evaluation protocol that isolates the core mechanisms driving forecasting performance.
- Multi-view analysis and insights: We conduct a large-scale analysis that provides both overall and conditional insights. Beyond evaluating general component effectiveness, we extensively investigate performance variations across different backbones (including specific models and emerging LLMs/TSFMs), diverse data domains, and data characteristics. Furthermore, we explore the intricate interaction effects among deconstructed components, verifying community claims with rigorous experimental evidence.
- Open-sourced corpus and automated construction: We open-source the resulting fine-grained performance corpus and validate its utility for model design. This corpus facilitates automated construction of MTSF methods that are adaptively tailored to different forecasting scenarios, consistently achieving better results than state-of-the-art methods.
Overview of the proposed TSCOMP framework. TSCOMP deconstructs existing SOTA models into a modular component pool. Through large-scale experimental analysis, TSCOMP conducts bottom-up evaluation from component-level comparisons to dimension-level and pipeline-level importance ranking. The resulting performance corpus enables automated model construction via a pre-trained meta-predictor that delivers zero-shot, data-adaptive component selection.
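For intuition only, zero-shot component selection from a performance corpus can be sketched as a nearest-neighbour lookup in meta-feature space. The data, dimensions, and combo names below are made up, and the actual meta-predictor is a trained model (e.g. the MLP used in `meta/run.py`), not this stand-in:

```python
# Toy sketch of data-adaptive component selection from a performance corpus.
import numpy as np

# Toy corpus: meta-feature vectors for "seen" datasets and, for each, the
# best-performing component combination observed in the benchmark.
corpus_features = np.array([[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]])
corpus_best_combo = ["RevIN+patch+sparse_attn", "no_norm+point+full_attn", "RevIN+point+full_attn"]

def select_components(new_features):
    """Pick the combo of the nearest seen dataset in meta-feature space."""
    dists = np.linalg.norm(corpus_features - new_features, axis=1)
    return corpus_best_combo[int(np.argmin(dists))]

print(select_components(np.array([0.15, 0.85])))  # -> RevIN+patch+sparse_attn
```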
Deconstructed component taxonomy in TSCOMP. We organize forecasting model design into a hierarchical component space for controlled and interpretable benchmarking.
The design space is structured into three levels:
- Pipeline level: the standard MTSF workflow is modeled as Series Preprocessing -> Series Encoding -> Network Architecture -> Network Optimization.
- Dimension level: each pipeline stage contains multiple component dimensions, such as normalization, tokenization, and attention mechanisms.
- Component level: each dimension includes concrete implementations extracted from SOTA models, such as RevIN normalization, series patching, and sparse attention.
This deconstruction forms a structured and extensible design space that covers diverse modeling strategies.
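The three levels above can be pictured as nested mappings. The sketch below uses a tiny illustrative subset of dimensions and components (example names only, not the full TSCOMP pool):

```python
# Sketch of the three-level design space: pipeline stage -> component
# dimension -> concrete component implementations.
from itertools import product

design_space = {
    "Series Preprocessing": {"normalization": ["none", "RevIN"]},
    "Series Encoding":      {"tokenization": ["point", "patch"]},
    "Network Architecture": {"attention": ["full", "sparse"]},
}

# Flatten the dimensions and enumerate the Cartesian product of component choices.
dims = [(stage, dim, opts)
        for stage, stage_dims in design_space.items()
        for dim, opts in stage_dims.items()]
combos = list(product(*[opts for _, _, opts in dims]))
print(len(combos))  # 2 * 2 * 2 = 8 candidate configurations
```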
Constrained orthogonal pool generation process. Following the protocol in our paper, TSCOMP constructs valid model combinations under compatibility constraints to ensure fair and systematic large-scale evaluation.
Design Space Complexity.
The Cartesian product of component dimensions yields more than
Pairwise Coverage Criterion.
To balance rigor and efficiency, we adopt a constrained orthogonal design that targets pairwise coverage of valid component interactions. Compared with exhaustive
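For intuition, pairwise coverage can be approximated with a textbook greedy covering-array construction. The sketch below uses made-up dimensions and a made-up compatibility rule; it illustrates the idea, not the paper's actual protocol:

```python
# Toy greedy pairwise-coverage selection over a small design space, subject
# to a compatibility constraint (all names and the constraint are illustrative).
from itertools import combinations, product

dimensions = {
    "norm": ["none", "RevIN"],
    "token": ["point", "patch"],
    "attn": ["full", "sparse"],
}

def compatible(cfg):
    # Example constraint: sparse attention is only paired with patch tokens.
    return not (cfg["attn"] == "sparse" and cfg["token"] == "point")

keys = list(dimensions)
all_cfgs = [dict(zip(keys, vals)) for vals in product(*dimensions.values())]
valid = [c for c in all_cfgs if compatible(c)]

def pairs(cfg):
    """All (dimension, value) pairs a configuration covers."""
    return {((a, cfg[a]), (b, cfg[b])) for a, b in combinations(keys, 2)}

# Target: every dimension-value pair that occurs in some valid configuration.
target = set().union(*(pairs(c) for c in valid))

chosen, covered = [], set()
while covered != target:
    # Greedily add the valid config covering the most uncovered pairs.
    best = max(valid, key=lambda c: len(pairs(c) - covered))
    chosen.append(best)
    covered |= pairs(best)

print(f"{len(chosen)} configs cover all {len(target)} valid pairs "
      f"(vs {len(valid)} valid configs in total)")
```

Exhaustive enumeration runs every valid configuration; the greedy pool covers every valid pair with fewer runs, which is the efficiency/rigor trade-off the protocol targets.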
- data_provider/: dataset loading and preprocessing.
- models/: forecasting model implementations.
- layers/: reusable neural network building blocks.
- exp/: experiment pipelines for forecasting tasks.
- scripts/: generated batch scripts for benchmark execution.
- meta/: meta-feature extraction and meta-learning based model selection.
- figures/: framework and analysis figures used in the paper and README.
To reproduce the experimental results for TSCOMP, you need to first generate the execution scripts for the Constrained Orthogonal Pool and the Random Pool, and then run these generated scripts.
conda env create -f environment.yml
conda activate tscomp

Please run the following Python scripts to generate bash scripts for batch testing of short-term and long-term forecasting tasks:
- Short-term forecasting:
python notebooks/bash_generator_short_term_forecasting_sota_seed.py
- Long-term forecasting:
python notebooks/bash_generator_long_term_forecasting_sota_seed.py
After executing the above code, a series of .sh script files will be generated in scripts/ (or the output directory specified in the code).
Once generated, you can directly run the .sh scripts to build and evaluate the TSCOMP model combinations within the benchmark, for example:
bash scripts/<generated_script_name>.sh
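To execute every generated script in sequence, a simple driver loop might look like the following (a hypothetical helper, not part of the repository; the `logs/` directory name is an assumption):

```shell
# Run every generated benchmark script sequentially, logging each one's output.
mkdir -p logs
for script in scripts/*.sh; do
    [ -e "$script" ] || continue   # skip if no scripts have been generated yet
    name=$(basename "$script" .sh)
    echo "Running $name ..."
    bash "$script" > "logs/${name}.log" 2>&1
done
```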
If you find this work useful, please consider citing:
@inproceedings{liang2025beyond,
title={Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting},
author={Liang, Shuang and Hou, Chaochuan and Yao, Xu and Wang, Shiping and Huang, Hailiang and Han, Songqiao and Jiang, Minqi},
booktitle={Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)},
year={2025}
}


