Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
Project page | Paper | Video
Jing Tan, Zhaoyang Zhang, Yantao Shen, Jiarui Cai, Shuo Yang, Jiajun Wu, Wei Xia, Zhuowen Tu, Stefano Soatto
This repository contains the training scripts for Talk2Move, a scene-level image editing model trained with GRPO (Group Relative Policy Optimization).
In this work, we demonstrate that RLVR (reinforcement learning with verifiable rewards) can effectively improve prompt-following performance on vision-related editing tasks, and we propose an early-stopping strategy that greatly improves the sampling efficiency of flow-based GRPO.
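GRPO scores a group of samples drawn for the same prompt against each other instead of against a learned value baseline. A minimal sketch of the group-relative advantage computation (illustrative only, not this repo's code; the normalization epsilon is an assumption):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize a group's rewards by the
    group's own mean and standard deviation, so no critic is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four samples for one prompt; the best-rewarded sample gets a positive advantage.
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```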
This codebase is built upon:
- Flow-GRPO, licensed under the MIT license;
- Orient-Anything, licensed under the CC-BY-4.0 license;
- lang-segment-anything, licensed under the Apache-2.0 license;
- Grounding-DINO, licensed under the Apache-2.0 license.
Modified file: talk2move/rewards.py
- Added new editing-focused rewards: `translation`, `ours_qwenvl` (zero-shot QwenVL scorer), `ours_clip`, `rotation`, `resize`, `lpips`.
- Extended `multi_score` to support editing-task inputs via a new 4-argument path: `images`, `ref_images`, `prompts`, `metadata`.
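A hedged sketch of how the 4-argument `multi_score` path might dispatch to individual rewards; the dispatch shape and the per-reward signature are assumptions for illustration, not the repo's exact code:

```python
def multi_score(images, ref_images, prompts, metadata, reward_fns):
    """Weighted sum of several editing rewards, one total score per sample.

    reward_fns maps each reward function to its weight; each function sees
    the edited images, reference images, prompts, and per-sample metadata.
    """
    totals = [0.0] * len(images)
    for fn, weight in reward_fns.items():
        scores = fn(images, ref_images, prompts, metadata)
        totals = [t + weight * s for t, s in zip(totals, scores)]
    return totals

# Dummy reward standing in for e.g. a translation or lpips scorer:
def constant_reward(images, ref_images, prompts, metadata):
    return [1.0] * len(images)

scores = multi_score(["img"], ["ref"], ["move the cup left"], [{}],
                     {constant_reward: 0.5})
```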
Modified files: grpo/diffusers_patch/qwenimage_edit_pipeline_with_logprob.py, grpo/diffusers_patch/sd3_sde_with_logprob.py
- Introduced `ode_shortcut_step` in `qwenimage_edit_pipeline_with_logprob.py`, extending sampling from pure SDE to SDE + shortcut ODE.
- Added `ode_shortcut_step` in `sd3_sde_with_logprob.py`, which updates latents using continuous-time steps (`t -> t_prev`) and `dt` (instead of the scheduler's discrete `step + 1`), and performs deterministic ODE updates without injecting random noise.
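The deterministic update can be sketched as a plain Euler step in continuous time. This is an illustrative reconstruction assuming a velocity-prediction (rectified-flow) parameterization, not the patched pipeline code itself:

```python
import numpy as np

def ode_shortcut_step(latents, velocity, t, t_prev):
    """Euler ODE step from t to t_prev: x <- x + (t_prev - t) * v.
    Unlike the SDE step, no random noise is injected."""
    dt = t_prev - t
    return latents + dt * velocity

x = np.zeros(4)
v = np.ones(4)
x_next = ode_shortcut_step(x, v, t=1.0, t_prev=0.9)  # moves x by -0.1 * v
```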
- Python 3.8+
- PyTorch with CUDA support
- 16 GPUs (2 nodes × 8 GPUs per node)
- Required Python packages (install via `pip install -e .`)
Before running training, update the paths in your configuration:
- Replace `enter_path_here` placeholders in the codebase with your actual paths
- Update `MASTER_ADDR` in `scripts/multi_node/qwenimagedit/main.sh` to match your master node IP
- Ensure all nodes can communicate via the specified master address and port
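To check that no placeholders remain, a quick grep from the repo root works (the placeholder string is the one used in this codebase; the command itself is standard grep):

```shell
# List files that still contain the enter_path_here placeholder
grep -rln "enter_path_here" . || echo "no placeholders left"
```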
The training script uses the following default settings:
- GPUs per node: 8
- Number of nodes: 2
- Total GPUs: 16
- Master port: 19001
- Config: `config/grpo.py:talk2move`
To modify these settings, edit scripts/multi_node/qwenimagedit/main.sh.
Check config_files/grpo.py for available training configurations:
- Task-specific configs for rotation (`talk2move_rotation`), resize (`talk2move_resize`), and translation (`talk2move_translation`)
Each configuration specifies:
- Model architecture and checkpoint paths
- Batch sizes and gradient accumulation steps
- Sampling parameters (num_steps, guidance_scale)
- Reward function weights
- Training hyperparameters (learning rate, beta, etc.)
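For illustration, a hypothetical shape of one such entry; the field names and values below are invented for this sketch, so consult config_files/grpo.py for the real layout:

```python
def talk2move_rotation():
    """Illustrative task config bundling the knobs listed above
    (paths, batch sizes, sampling, reward weights, hyperparameters)."""
    return {
        "pretrained_model": "enter_path_here",           # checkpoint path
        "train_batch_size": 4,
        "gradient_accumulation_steps": 8,
        "sample": {"num_steps": 20, "guidance_scale": 4.5},
        "reward_weights": {"rotation": 1.0, "lpips": 0.1},
        "learning_rate": 1e-5,
        "beta": 0.01,                                    # KL coefficient
    }

cfg = talk2move_rotation()
```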
To run training on 16 GPUs across 2 nodes (8 GPUs per node):
```shell
# On node 0 (master):
sh scripts/multi_node/qwenimagedit/main.sh 0
# On node 1:
sh scripts/multi_node/qwenimagedit/main.sh 1
```

Troubleshooting:
- Connection issues: Verify that `MASTER_ADDR` is correct and that nodes can communicate
- CUDA out of memory: Reduce the batch size in the config file
- Path errors: Ensure all `enter_path_here` placeholders are replaced with valid paths
- Reward server errors: Check that reward server IPs (`your-api-server-ip`, `your-reward-server-ip`) are correctly configured
- Import errors: Run `pip install -e .` to install the package in development mode
- NCCL timeout: Increase the timeout or check network connectivity between nodes
If you use this code in your research, please cite the relevant papers for the models and methods used:
@misc{tan2026talk2movereinforcementlearningtextinstructed,
title={Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes},
author={Jing Tan and Zhaoyang Zhang and Yantao Shen and Jiarui Cai and Shuo Yang and Jiajun Wu and Wei Xia and Zhuowen Tu and Stefano Soatto},
year={2026},
eprint={2601.02356},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.02356},
}
This codebase is built by Jing Tan during her internship at AWS Agentic AI.
For any questions, feel free to contact her via
