This repository provides the official PyTorch implementation of the research paper:
Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition (Accepted by CVPR2026).
Multimodal intent recognition aims to infer human intents by jointly modeling multiple modalities and plays a pivotal role in real-world dialogue systems. However, current methods struggle to model the hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrates HIerarchical semantic representation with Evolutionary Reasoning based on an MLLM.
Our implementation is based on LLaMA-Factory and performs LoRA fine-tuning for training and evaluation.
We recommend using Anaconda to create the Python environment and install the required libraries:

```bash
conda create -n hier python=3.12 -y
conda activate hier
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```

The data can be downloaded from the following link:
https://drive.google.com/drive/folders/1nCkhkz72F6ucseB73XVbqCaDG-pjhpSS
All configuration files for the different model × dataset combinations are placed at:

```
/LLaMA-Factory/examples/train_lora/
```

The YAML files follow a clear naming convention, e.g. `qwen2vl_lora_sft_mintrec2.yaml`.
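For orientation, the sketch below shows the general shape of a LLaMA-Factory LoRA SFT config. All values are illustrative placeholders, not the repository's actual settings — consult the shipped YAML files for the real hyperparameters:

```yaml
### model (illustrative base model)
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset (name as registered in LLaMA-Factory's dataset_info.json; placeholder here)
dataset: mintrec2
template: qwen2_vl
cutoff_len: 2048

### output and training schedule (example values only)
output_dir: saves/qwen2vl-7b/lora/sft_mintrec2
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
```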
```bash
# Training
CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train examples/train_lora/qwen2vl_lora_sft_mintrec2.yaml

# Testing / Evaluation
CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train examples/train_lora/qwen2vl_lora_test_mintrec2.yaml
```

The overall model architecture:
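After training, the LoRA adapter can optionally be merged into the base model for standalone deployment using LLaMA-Factory's export command (`llamafactory-cli export <config>.yaml`). A minimal sketch of such an export config, with all paths hypothetical:

```yaml
### model (hypothetical base model and adapter paths)
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
adapter_name_or_path: saves/qwen2vl-7b/lora/sft_mintrec2
template: qwen2_vl
finetuning_type: lora

### export (hypothetical output directory)
export_dir: models/hier-qwen2vl-mintrec2
export_size: 2
export_legacy_format: false
```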
If you are interested in this work and want to use the code or results in this repository, please star the repository and cite:
```bibtex
@misc{zhou2026evolutionarymultimodalreasoninghierarchical,
  title={Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition},
  author={Qianrui Zhou and Hua Xu and Yunjin Gu and Yifan Wang and Songze Li and Hanlei Zhang},
  year={2026},
  eprint={2603.03827},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2603.03827},
}
```
Some of the code in this repository is built upon and adapted from LLaMA-Factory. We sincerely thank the authors and contributors for their open-source efforts.
If you have any questions or encounter issues, please open an issue and describe your environment, commands, and error logs as clearly as possible.