This repository provides the official PyTorch implementation of the research paper:
Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition (Accepted by CVPR2026).
Multimodal intent recognition aims to infer human intents by jointly modeling multiple modalities and plays a pivotal role in real-world dialogue systems. However, current methods struggle to model the hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrates HIerarchical semantic representation with Evolutionary Reasoning based on an MLLM.
Our implementation is based on LLaMA-Factory and performs LoRA fine-tuning for training and evaluation.
We recommend using Anaconda to create the Python environment and install the required libraries:

```bash
conda create -n hier python=3.12 -y
conda activate hier
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```

The data can be downloaded from the following link:
https://drive.google.com/drive/folders/1nCkhkz72F6ucseB73XVbqCaDG-pjhpSS
All configuration files for the different model × dataset combinations are placed at:

```
/LLaMA-Factory/examples/train_lora/
```

The YAML files follow a clear naming convention, e.g. `qwen2vl_lora_sft_mintrec2.yaml`.
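For orientation, the sketch below shows the general shape of a LLaMA-Factory LoRA SFT config. All values are illustrative placeholders, not the repository's actual settings — consult the shipped YAML files for the real hyperparameters:

```yaml
### model (illustrative base model)
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset (name as registered in LLaMA-Factory's dataset_info.json; placeholder here)
dataset: mintrec2
template: qwen2_vl
cutoff_len: 2048

### output and training schedule (example values only)
output_dir: saves/qwen2vl-7b/lora/sft_mintrec2
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
```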
```bash
# Training
CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train examples/train_lora/qwen2vl_lora_sft_mintrec2.yaml

# Testing / Evaluation
CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train examples/train_lora/qwen2vl_lora_test_mintrec2.yaml
```

The overall model architecture:
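After training, the LoRA adapter can optionally be merged into the base model for standalone deployment using LLaMA-Factory's export command (`llamafactory-cli export <config>.yaml`). A minimal sketch of such an export config, with all paths hypothetical:

```yaml
### model (hypothetical base model and adapter paths)
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
adapter_name_or_path: saves/qwen2vl-7b/lora/sft_mintrec2
template: qwen2_vl
finetuning_type: lora

### export (hypothetical output directory)
export_dir: models/hier-qwen2vl-mintrec2
export_size: 2
export_legacy_format: false
```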
If you are interested in this work and want to use the code or results in this repository, please star the repository and cite:
```bibtex
@misc{zhou2026evolutionarymultimodalreasoninghierarchical,
  title={Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition},
  author={Qianrui Zhou and Hua Xu and Yunjin Gu and Yifan Wang and Songze Li and Hanlei Zhang},
  year={2026},
  eprint={2603.03827},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2603.03827},
}
```
Some of the code in this repository is built upon and adapted from LLaMA-Factory. We sincerely thank the authors and contributors for their open-source efforts.
If you have any questions or encounter issues, please open an issue and describe your environment, commands, and error logs as clearly as possible.