First, create and activate a new conda environment
conda create -n tide python=3.12 -y
conda activate tide

If you want to run models other than Mistral-3-14B-Instruct, use the following command to install vllm==0.11.0:
bash ./scripts/setup.sh

If you need to run the Mistral-3-14B-Instruct model, use the following commands to install vllm==0.12.0 and the latest version of transformers from GitHub:
bash ./scripts/setup_new.sh
pip install git+https://github.com/huggingface/transformers.git

Set the ALFWorld data directory and download the data:
export ALFWORLD_DATA="./data/alfworld"
alfworld-download

Please refer to the script files in the scripts directory to run evaluations in different environments.
Notes:
- For the Mistral-3-14B-Instruct model with vllm==0.12.0, you need to set export ENFORCE_EAGER=True and export MAX_LOGPROBS=0.
- When running in the WebShop environment, DP can only be set to 1.
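The Mistral note above can be applied from the shell before launching an evaluation script. The following is a minimal sketch: MODEL_NAME is a hypothetical variable used only for illustration, and the scripts in the scripts directory may select the model in a different way.

```shell
# MODEL_NAME is a hypothetical variable for this sketch only;
# check the scripts directory for how the model is actually selected.
MODEL_NAME="Mistral-3-14B-Instruct"

case "$MODEL_NAME" in
  *Mistral-3-14B-Instruct*)
    # Settings required by the note above when using vllm==0.12.0
    export ENFORCE_EAGER=True
    export MAX_LOGPROBS=0
    ;;
esac

# Show the values that the evaluation script would inherit
echo "ENFORCE_EAGER=${ENFORCE_EAGER:-unset} MAX_LOGPROBS=${MAX_LOGPROBS:-unset}"
```

For any other model name the case statement does nothing, so the two variables stay as they were in the calling shell.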
You need to create a YAML file with your API keys at config/api_key.yaml.
Example:
api_config:
  deepseek:
    base_url: "<api_url>"
    api_key: "<api_key>"
  gemini:
    base_url: "<api_url>"
    api_key: "<api_key>"

You can set a different key for each remote model provider.
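One way to create this file from the shell is sketched below. It writes a template with the same placeholders as the example and assumes that each provider is nested under api_config, which is inferred from the example rather than confirmed by the code. The chmod step is an added precaution, not a requirement of the project.

```shell
# Create a template config/api_key.yaml without overwriting an existing one.
mkdir -p config
if [ -e config/api_key.yaml ]; then
  echo "config/api_key.yaml already exists; leaving it untouched"
else
  # Quoted 'EOF' keeps the placeholders literal (no shell expansion).
  cat > config/api_key.yaml <<'EOF'
api_config:
  deepseek:
    base_url: "<api_url>"
    api_key: "<api_key>"
  gemini:
    base_url: "<api_url>"
    api_key: "<api_key>"
EOF
  # API keys are secrets: restrict the file to the current user.
  chmod 600 config/api_key.yaml
fi
```

After running it, replace each <api_url> and <api_key> placeholder with the values for the corresponding provider.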
Analyze the trajectories of the experiments and generate the output JSON files:
bash scripts/process_exp_res.sh

Analyze memory recall of the ALFWorld experiments:
bash scripts/process_alfworld_mem_recall.sh

If you encounter the error "module 'numpy' has no attribute 'trapz'", try downgrading numpy to version 2.2.6.
Our implementation is indebted to the excellent projects RAGEN and SimpleTIR.
If you find TIDE useful in your research, please consider citing the following paper:
@article{yan2025tide,
  title={TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents},
  author={Hang Yan and Xinyu Che and Fangzhi Xu and Qiushi Sun and Zichen Ding and Kanzhi Cheng and Jian Zhang and Tao Qin and Jun Liu and Qika Lin},
  journal={arXiv preprint arXiv:2602.02196},
  year={2025}
}