2025.08.18 🌟 We are excited to introduce Thyme: Think Beyond Images. Thyme transcends traditional "thinking with images" paradigms by autonomously generating and executing diverse image processing and computational operations through executable code, significantly enhancing performance on high-resolution perception and complex reasoning tasks. Leveraging a novel two-stage training strategy that combines supervised fine-tuning with reinforcement learning and empowered by the innovative GRPO-ATS algorithm, Thyme achieves a sophisticated balance between reasoning exploration and code execution precision.

1. Quick Start

1.2 Environment Setup & Dependency Installation

conda create -n Thyme python=3.10 -y
conda activate Thyme

pip install -e .
pip install "sglang[all]" -U
pip install "vllm>=0.5.1" "transformers<4.55" "trl<0.21" -U
pip install "lmdeploy>=0.5,<0.9" -U --no-deps
pip install autoawq -U --no-deps
pip install auto_gptq optimum bitsandbytes "gradio<5.33" -U
pip install git+https://github.com/modelscope/ms-swift.git
pip install timm -U
pip install "deepspeed<0.17" -U
pip install qwen_vl_utils qwen_omni_utils decord librosa icecream soundfile -U
pip install liger_kernel nvitop pre-commit math_verify py-spy -U
pip install wandb

pip install flash-attn --no-build-isolation --use-pep517

2. Data Preparation

2.1 Download Dataset

Obtain the training data from the HuggingFace Dataset Page. The SFT dataset consists of three splits:

wo_thinking_thyme_single_round: Single-turn image operation data
2round: Multi-turn dialogue data
computation: Annealing data used for computational tasks

Each sample’s image field is a list containing the original and processed images.

2.2 Process Images and Update Paths

Before training, ensure all referenced images are downloaded and saved locally. Update the dataset files (e.g., .jsonl) by replacing image URLs or remote paths with local absolute paths, for example:

"image": [
  "/path/to/original_images/0904.0709_0.jpg",
  "/path/to/processed_images/0904.0709_0_6349.jpg"
]
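
The path update can be scripted. The following is a minimal sketch, not part of the repository: the helper names (localize_image_paths, rewrite_jsonl) and the two directory arguments are hypothetical, and it assumes that the first entry of image is the original image and that any later entries are processed images, as in the example above.

```python
import json
import os


def localize_image_paths(sample, original_dir, processed_dir):
    """Return a copy of `sample` whose image entries are local absolute paths.

    Assumption: entry 0 is the original image, later entries are processed images.
    """
    localized = dict(sample)
    new_paths = []
    for index, path in enumerate(sample.get("image", [])):
        # Keep only the filename; works for URLs and remote POSIX-style paths.
        filename = path.rstrip("/").split("/")[-1]
        base_dir = original_dir if index == 0 else processed_dir
        new_paths.append(os.path.abspath(os.path.join(base_dir, filename)))
    localized["image"] = new_paths
    return localized


def rewrite_jsonl(src_path, dst_path, original_dir, processed_dir):
    """Rewrite every non-empty line of a .jsonl file with localized image paths."""
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                sample = json.loads(line)
                fixed = localize_image_paths(sample, original_dir, processed_dir)
                dst.write(json.dumps(fixed, ensure_ascii=False) + "\n")
```

Writing to a new file rather than editing in place keeps the downloaded dataset intact in case the directories need to change later.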

2.3 File Path Conversion for System Integration

Every question embeds an image path (introduced by "User Image Path:") that must be converted to the correct path on your system before the data can be used. The steps below describe how to handle these paths.

Conversion Process:

Original Path Format:
- Example: "User Image Path: \"/mllm_hdd/yfzhang/data/temp_processed_images/cauldron_dvqa_images_dvqa_00110792.png_rotated_image_318.png\""
Transformation:
- Extract the filename from the original path.
- Replace the original path with the first element of the sample's image array, which refers to the same file.
- That element is the correct file path on your system.
Response Path Conversion:
- Apply the same transformation to any corresponding absolute paths that appear in the response.
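
The conversion steps above can be sketched as follows. This is an illustrative helper, not code from the repository: the function name convert_paths and the regular expression are assumptions, and it only rewrites exact occurrences of the path found after "User Image Path:" (other absolute paths in a response would need their own mapping).

```python
import re

# Matches the quoted path that follows "User Image Path:" in a decoded question string.
USER_PATH_PATTERN = re.compile(r'User Image Path:\s*"([^"]+)"')


def convert_paths(sample):
    """Replace the embedded user image path with the sample's first image entry."""
    match = USER_PATH_PATTERN.search(sample["question"])
    if match is None:
        # Nothing to convert; return an unchanged copy.
        return dict(sample)
    old_path = match.group(1)
    new_path = sample["image"][0]
    converted = dict(sample)
    converted["question"] = sample["question"].replace(old_path, new_path)
    if "response" in sample:
        # The response may load the same file, so rewrite it there as well.
        converted["response"] = sample["response"].replace(old_path, new_path)
    return converted
```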

3. Supervised Fine-Tuning (Thyme-SFT)

3.1 Training Data Format

Each training sample follows the JSON format below (the rest of the dataset has the same structure):

{
  "image": ["/path/to/original.jpg", "/path/to/processed.jpg"],
  "question": "<image>\nBased on the top-right graph, describe the behavior of P(z) as z approaches zero. Options:\n...",
  "response": "<think>Detailed reasoning and executable code...</think><answer>B</answer>"
}
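
Before training, it can help to check that each response follows the <think>...</think><answer>...</answer> layout shown above. The sketch below is an assumption about how such a check might look; split_response is a hypothetical helper, and it covers only the single-response format in this example.

```python
import re

# A response is expected to be one <think> block followed by one <answer> block.
RESPONSE_PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def split_response(response):
    """Return the reasoning and the final answer, or None if the layout differs."""
    match = RESPONSE_PATTERN.fullmatch(response.strip())
    if match is None:
        return None
    return {"think": match.group(1).strip(), "answer": match.group(2).strip()}
```

Samples for which split_response returns None are worth inspecting by hand before they reach the training script.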

3.2 Configure Training Paths

Set these variables in your training script or environment:

DATASET: Path to your training dataset
SAVE_PATH: Directory to save the trained model
Model: Path to your model

3.3 Run Training

Execute the training scripts:

sh scripts/sft_stage1.sh   # Stage 1: Supervised fine-tuning
sh scripts/sft_stage2.sh   # Stage 2: Computational Data Annealing

Note: Since the computational data contains samples with and without images, this stage requires these two types of data to be processed and input separately. Please ensure your script handles this accordingly.
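
One way to prepare the two inputs is to partition the computation split up front. The sketch below is illustrative only: split_by_image_presence is a hypothetical helper, and it assumes that text-only samples have a missing or empty image field.

```python
def split_by_image_presence(samples):
    """Partition samples into (with_images, without_images).

    Assumption: a text-only sample has no "image" key or an empty list.
    """
    with_images, without_images = [], []
    for sample in samples:
        images = sample.get("image") or []
        if images:
            with_images.append(sample)
        else:
            without_images.append(sample)
    return with_images, without_images
```

Each partition can then be written to its own .jsonl file and passed to the stage 2 script separately.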

4. Reinforcement Learning (Thyme-RL)

4.1 RL Dataset Format

Each RL data sample follows this structure:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant. Solve the problem step-by-step..."
    },
    {
      "role": "user",
      "content": "<image>\nQuestion text...\n"
    }
  ],
  "images": ["/path/to/image.jpg"],
  "solution": "black",
  "question": "What is the color of the glasses frame of the man in the white shirt?"
}
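
A sample with this structure can be assembled from a question, its solution, and an image path. The helper below is a hypothetical sketch; the system prompt is passed in as an argument because the full prompt text is not reproduced in this README.

```python
def build_rl_sample(question, solution, image_path, system_prompt):
    """Assemble one RL sample in the structure shown above."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            # The <image> placeholder precedes the question text.
            {"role": "user", "content": f"<image>\n{question}\n"},
        ],
        "images": [image_path],
        "solution": solution,
        "question": question,
    }
```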

4.2 Reward Model Configuration

Reward functions are implemented in examples/train/grpo/plugin/agent_rm.py, supporting fmt_orm, vqa_orm, and cst_orm by default.
For multi-node training, configure REWARD_API_ADDRESS and QWEN_API_PORT in .deepspeed_env.
Single-node training can use the default environment variables.
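
As an illustration of how reward code might pick up these settings, the sketch below reads both variables from the environment. The function name and the fallback values are assumptions made for this example, not the repository's actual defaults; only the variable names and the /v1 suffix come from this README.

```python
import os


def reward_api_endpoint(default_address="127.0.0.1", default_port="8080"):
    """Build the reward-model base URL from environment variables.

    The fallback address and port are placeholders, not the project's defaults.
    """
    address = os.environ.get("REWARD_API_ADDRESS", default_address)
    port = os.environ.get("QWEN_API_PORT", default_port)
    return f"http://{address}:{port}/v1"
```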

Example asynchronous function calling the reward model API:

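import httpx


# NOTE: despite its name, `host` holds the port of the reward-model server.
# The default `ip` and the `model` path are site-specific; replace them with your own deployment.
async def llm_openai_api(messages, ip="10.82.121.22", host="8080", temperature=0.1, max_tokens=256, top_p=None, n=1):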
    openai_api_base = f"http://{ip}:{host}/v1"
    async with httpx.AsyncClient(timeout=httpx.Timeout(600.0)) as client:
        model = "/mllm_hdd/yfzhang/models/Qwen2.5-VL-72B-Instruct-AWQ"
        resp = await client.post(
            f"{openai_api_base}/chat/completions",
            headers={"Content-Type": "application/json"},
            json={
                "model": model,
                "messages": messages,
                "temperature": temperature,
                "max_tokens": max_tokens,
                "top_p": top_p,
                "n": n,
            },
        )
        resp.raise_for_status()
        response_data = resp.json()
        return [choice["message"]["content"] for choice in response_data["choices"]]

4.3 Launch RL Training

Run the script:

sh scripts/rl.sh

Parameters explained:

--O3 true activates Thyme training configuration with multi-turn dialogue and sandboxed iterative output processing.
--stop_words ... defines the output termination tokens. When <code> is one of the stop words, GRPO-ATS is enabled, with the sampling temperature set to 0.

5. Evaluation

We recommend using VLMEvalKit for automatic evaluation on benchmarks.

5.1 Install VLMEvalKit

cd eval/VLMEvalKit
pip install -e .

5.2 Evaluate on Multiple Benchmarks

For example, to evaluate on V* and HRBench with 8 GPUs, run the following from the eval/VLMEvalKit directory:

torchrun --nproc-per-node=8 run.py --data VStarBench HRBench4K HRBench8K --model Thyme-7B --verbose

6. Usage Example: How to use Thyme

We provide a simple script (eval/inference.py) for running inference on custom cases. Run:

python -m eval.inference

You may modify the user question and image path for your own cases:

# --- 3. Prepare Input Data (Example) ---
# Example question and responses (replace with your actual data)
# Example case (sampled from MME-RealWorld-Lite bench)
question_text = ("Question: What is the plate number of the blue car in the picture?\nOptions:\n"
                 "A. S OT 911\n"
                 "B. S TQ 119\n"
                 "C. S QT 911\n"
                 "D. B QT 119\n"
                 "E. This image doesn't feature the plate number.\n"
                 "Please select the correct answer from the options above.")
image_path = "eval/17127.jpg"

Visualization of the Provided Example

❓ User Prompt

Question: What is the plate number of the blue car in the picture?
Options:
A. S OT 911
B. S TQ 119
C. S QT 911
D. B QT 119
E. This image doesn't feature the plate number.
Please select the correct answer from the options above.

🤖 Model Response

<think>To determine the plate number of the blue car in the image, we need to focus on the license plate located near the bottom front of the vehicle. The plate is relatively small compared to the overall size of the car and the image, so zooming in on this specific area will help us clearly read the characters on the plate.
The coordinates for the license plate appear to be around the lower front section of the car, specifically near the center of the lower part of the image. Cropping and zooming in on this region will enhance the visibility of the text on the plate, making it easier to identify the correct option.
Here is the Python code to crop and zoom in on the license plate area:

<code>

import cv2
import os
import numpy as np
from uuid import uuid4

# Load the image
image_path = "/eval/17127.jpg"
image = cv2.imread(image_path)

# Define the coordinates for the license plate area (inferred from the image content)
x1, y1, x2, y2 = 1250, 1030, 1360, 1070

# Crop the image based on the defined coordinates
cropped_image = image[y1:y2, x1:x2]

# Apply a zoom factor to enhance the cropped region
zoom_factor = 2
resized_image = cv2.resize(cropped_image, None, fx=zoom_factor, fy=zoom_factor, interpolation=cv2.INTER_CUBIC)

# Save the processed image with a random suffix
random_suffix = str(uuid4())[:8]
processed_filename = f"17127_{random_suffix}.jpg"
processed_path = os.path.join("/mnt/data/temp_processed_images/", processed_filename)
cv2.imwrite(processed_path, resized_image)

# Print the path of the saved cropped image for further processing
print(processed_path)

</code>

<sandbox_output> </sandbox_output>

Upon examining the cropped and zoomed-in image of the license plate, it becomes clear that the characters are "S QT 911". This matches option C. Therefore, the correct answer is C. S QT 911.</think>
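
The <code> / <sandbox_output> exchange in the response above can be approximated with a small loop step: pull the code out of the model output, execute it, and wrap what it prints. The sketch below is an assumption about how this might be done and is not the repository's sandbox; extract_code and run_in_subprocess are hypothetical names, and a plain subprocess with a timeout offers no real isolation.

```python
import re
import subprocess
import sys

CODE_PATTERN = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def extract_code(response):
    """Return the source inside the first <code>...</code> block, or None."""
    match = CODE_PATTERN.search(response)
    if match is None:
        return None
    return match.group(1).strip()


def run_in_subprocess(code, timeout=10):
    """Run code in a separate Python process and wrap its output.

    This is not a security boundary; use a real sandbox for untrusted code.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "<sandbox_output>timeout</sandbox_output>"
    # Report stdout on success and stderr on failure.
    output = result.stdout if result.returncode == 0 else result.stderr
    return f"<sandbox_output>{output.strip()}</sandbox_output>"
```

In the example above the code prints the path of the saved crop, so the wrapped output would carry that path back to the model for the next reasoning step.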

Citation

If you find Thyme useful in your research or applications, please cite our paper:

@misc{zhang2025thymethinkimages,
      title={Thyme: Think Beyond Images}, 
      author={Yi-Fan Zhang and Xingyu Lu and Shukang Yin and Chaoyou Fu and Wei Chen and Xiao Hu and Bin Wen and Kaiyu Jiang and Changyi Liu and Tianke Zhang and Haonan Fan and Kaibing Chen and Jiankang Chen and Haojie Ding and Kaiyu Tang and Zhang Zhang and Liang Wang and Fan Yang and Tingting Gao and Guorui Zhou},
      year={2025},
      eprint={2508.11630},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.11630}, 
}

Related Projects

Explore other related work from our team:
