👏 Welcome to the Awesome-Agentic-MLLMs repository!
This curated collection features influential papers, codebases, datasets, benchmarks, and resources dedicated to the emerging field of agentic capabilities in Multimodal Large Language Models (MLLMs).
⭐ Feel free to star and fork this repository to stay updated with the latest advancements and contribute to the growing community.
If we have missed any related work, please open an issue — we greatly appreciate it and will review and add it in the next release!
If you find this survey helpful, please cite our work:
```bibtex
@article{yao2025survey,
  title={A Survey on Agentic Multimodal Large Language Models},
  author={Yao, Huanjin and Zhang, Ruifei and Huang, Jiaxing and Zhang, Jingyi and Wang, Yibo and Fang, Bo and Zhu, Ruolin and Jing, Yongcheng and Liu, Shunyu and Li, Guanbin and others},
  journal={arXiv preprint arXiv:2510.10991},
  year={2025}
}
```
We collect recent advances in Agentic MLLMs and categorize them into three core dimensions: (1) Agentic Internal Intelligence, which leverages reasoning, reflection, and memory to enable accurate long-horizon planning; (2) Agentic External Tool Invocation, whereby models proactively use various external tools to extend their problem-solving capabilities beyond their intrinsic knowledge; and (3) Agentic Environment Interaction, which situates models within virtual or physical environments, allowing them to perceive changes and incorporate feedback from the real world.
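As a rough mental model of how these three dimensions compose at inference time, here is a minimal, illustrative Python sketch. Everything in it is hypothetical — `Memory`, `reason`, `call_tool`, and `act_in_environment` are stand-ins, not any surveyed system's API; real agentic MLLMs implement each step with the model itself plus the tools and environments catalogued below.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Internal intelligence: a persistent scratchpad across turns (hypothetical)."""
    entries: list[str] = field(default_factory=list)

    def recall(self) -> str:
        # Retrieve only the most recent notes to keep the context short.
        return "\n".join(self.entries[-5:])

    def store(self, note: str) -> None:
        self.entries.append(note)


def reason(task: str, memory: Memory) -> str:
    """(1) Agentic internal intelligence: plan via reasoning over memory."""
    return f"plan for {task!r}, given notes: {memory.recall() or 'none'}"


def call_tool(plan: str) -> str:
    """(2) Agentic external tool invocation: e.g. web search or a code runner."""
    return f"tool result for [{plan}]"


def act_in_environment(tool_result: str) -> str:
    """(3) Agentic environment interaction: act (GUI click, robot motion), then observe."""
    return f"environment feedback after [{tool_result}]"


def agentic_loop(task: str, max_turns: int = 3) -> list[str]:
    """Compose the three dimensions into one plan-act-reflect loop."""
    memory = Memory()
    for turn in range(max_turns):
        plan = reason(task, memory)                  # internal intelligence
        tool_result = call_tool(plan)                # external tool invocation
        feedback = act_in_environment(tool_result)   # environment interaction
        memory.store(f"turn {turn}: {feedback}")     # reflection written back
    return memory.entries


if __name__ == "__main__":
    for note in agentic_loop("find and summarize a chart"):
        print(note)
```

The loop skeleton — plan, invoke, observe, remember — is the common pattern that the works in the sections below instantiate in different ways.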
| Date | Title | Paper | Code |
|------|-------|-------|------|
| 2502 | Qwen2.5-VL Technical Report | | |
| 2502 | SmolVLM2: Bringing Video Understanding to Every Device | | |
| 2506 | MiMo-VL Technical Report | | |
| 2507 | Kwai Keye-VL Technical Report | | |
| 2509 | SAIL-VL2 Technical Report | | |
| 2509 | LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training | | |
| 2509 | MiniCPM-V 4.5 Technical Report | | |
| Date | Title | Paper | Code |
|------|-------|-------|------|
| 2509 | Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action | | |
| 2409 | MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | - |
| 2412 | DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | | |
| 2503 | Kimi-VL Technical Report | | |
| 2506 | ERNIE 4.5 Technical Report | | |
| 2507 | Seed1.5-VL Technical Report | | |
| 2507 | GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | | |
| 2507 | Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding | | |
| 2508 | InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency | | |
## Agentic Internal Intelligence

### Agentic Reasoning

| Date | Title | Paper | Code |
|------|-------|-------|------|
| 2410 | Improve Vision Language Model Chain-of-thought Reasoning | | |
| 2411 | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step | | |
| 2412 | Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search | | |
| 2503 | Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | | |
| 2503 | R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization | | |
| 2503 | MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning | | |
| 2503 | Video-R1: Reinforcing Video Reasoning in MLLMs | | |
| 2504 | SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement | | |
| 2504 | NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation | | |
| 2504 | Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning | | |
| 2504 | VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | | |
| 2505 | SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward | | |
| 2505 | R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO | | |
| 2505 | EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning | | |
| 2505 | Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models | | |
| 2506 | GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning | | |
| 2506 | WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning | | |
| 2506 | APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization | | |
| 2507 | Scaling RL to Long Videos | | |
| 2507 | VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning | | |
| 2507 | C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning | | |
| 2507 | Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning | | - |
| 2508 | StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models | | - |
| 2509 | MAPO: Mixed Advantage Policy Optimization | | - |
| 2509 | MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources | | |
| 2509 | VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception | | |
| 2509 | Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models | | |
### Agentic Reflection

| Date | Title | Paper | Code |
|------|-------|-------|------|
| 2410 | ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents | | |
| 2411 | Self-Corrected Multimodal Large Language Model for Robot Manipulation and Reflection | | - |
| 2411 | Vision-Language Models Can Self-Improve Reasoning via Reflection | | |
| 2412 | Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search | | |
| 2503 | V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents | | - |
| 2504 | MASR: Self-Reflective Reasoning through Multimodal Hierarchical Attention Focusing for Agent-based Video Understanding | | - |
| 2504 | VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning | | |
| 2505 | Training-Free Reasoning and Reflection in MLLMs | | |
| 2506 | SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning | | |
| 2507 | Look-Back: Implicit Visual Re-focusing in MLLM Reasoning | | |
| 2509 | Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards | | |
| 2510 | SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models | | |
### Agentic Memory

| Date | Title | Paper | Code |
|------|-------|-------|------|
| 2305 | MemoryBank: Enhancing Large Language Models with Long-Term Memory | | |
| 2307 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | | |
| 2312 | Empowering Working Memory for Large Language Model Agents | | - |
| 2402 | LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens | | |
| 2502 | A-Mem: Agentic Memory for LLM Agents | | |
| 2503 | In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents | | - |
| 2504 | Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory | | |
| 2506 | A Walk to Remember: MLLM Memory-Driven Visual Navigation | | - |
| 2506 | MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents | | |
| 2507 | MemOS: A Memory OS for AI System | | - |
| 2507 | MIRIX: Multi-Agent Memory System for LLM-Based Agents | | |
| 2508 | Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning | | - |
| 2508 | Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory | | - |
| 2508 | MMS: Multiple Memory Systems for Enhancing the Long-term Memory of Agent | | - |
## Agentic External Tool Invocation

### Agentic Search for Information Retrieval

| Date | Title | Paper | Code |
|------|-------|-------|------|
| 2502 | OpenAI Deep Research: Introducing deep research | | |
| 2505 | VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning | | |
| 2505 | Visual Agentic Reinforcement Fine-Tuning | | |
| 2506 | MMSearch-R1: Incentivizing LMMs to Search | | |
| 2508 | Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning | | |
| 2508 | M2IO-R1: An Efficient RL-Enhanced Reasoning Framework for Multimodal Retrieval Augmented Multimodal Generation | | - |
| 2508 | WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent | | |
| 2510 | DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search | | - |
### Agentic Coding for Complex Computations

| Date | Title | Paper | Code |
|------|-------|-------|------|
| 2501 | rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking | | |
| 2504 | ReTool: Reinforcement Learning for Strategic Tool Use in LLMs | | |
| 2505 | R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning | | |
| 2506 | CoRT: Code-integrated Reasoning within Thinking | | |
| 2507 | PyVision: Agentic Vision with Dynamic Tooling | | |
| 2508 | rStar2-Agent: Agentic Reasoning Technical Report | | |
| 2508 | Posterior-GRPO: Rewarding Reasoning Processes in Code Generation | | - |
| 2509 | Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use | | |
### Agentic Visual Processing for Thinking with Images

| Date | Title | Paper | Code |
|------|-------|-------|------|
| 2501 | Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | | |
| 2505 | Visual Planning: Let's Think Only with Images | | |
| 2505 | Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO | | |
| 2505 | GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning | | |
| 2505 | DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning | | |
| 2505 | VLM-R3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought | | - |
| 2505 | Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO | | |
| 2505 | OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning | | |
| 2505 | Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL | | |
| 2505 | Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | | |
| 2508 | Simple o3: Towards Interleaved Vision-Language Reasoning | | - |
| 2508 | Thyme: Think Beyond Images | | |
| 2509 | Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search | | |
## Agentic Environment Interaction

### Agentic Virtual Interaction

| Date | Title | Paper | Code |
|------|-------|-------|------|
| 2411 | ShowUI: One Vision-Language-Action Model for GUI Visual Agent | | |
| 2501 | UI-TARS: Pioneering Automated GUI Interaction with Native Agents | | |
| 2503 | UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning | | |
| 2504 | TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials | | |
| 2504 | GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents | | |
| 2504 | InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners | | |
| 2505 | WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning | | |
| 2506 | GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior | | |
| 2509 | InfraMind: A Novel Exploration-based GUI Agentic Framework for Mission-critical Industrial Management | | - |
| 2509 | UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning | | |
### Agentic Physical Interaction

| Date | Title | Paper | Code |
|------|-------|-------|------|
| 2406 | OpenVLA: An Open-Source Vision-Language-Action Model | | |
| 2505 | ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models | | - |
| 2506 | Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning | | |
| 2506 | VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning | | |
| 2507 | ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning | | |
| 2508 | Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation | | |
| 2508 | MolmoAct: Action Reasoning Models that can Reason in Space | | |
| 2508 | EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control | | |
| 2509 | Nav-R1: Reasoning and Navigation in Embodied Scenes | | |
| 2509 | Wall-x: Igniting VLMs toward the Embodied Space | | |
| 2509 | VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search | | - |
## Agentic Training Framework

### Supervised Fine-Tuning Frameworks

| Title | Code |
|-------|------|
| LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models | |
| ms-swift: SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) | |
| Megatron-LM | |
| Unsloth | |

### Reinforcement Learning Frameworks

| Title | Code |
|-------|------|
| verl: Volcano Engine Reinforcement Learning for LLMs | |
| rLLM (DeepScaleR): Reinforcement Learning for Language Agents | |
| RLFactory: Easy and Efficient RL Training | |
| ROLL: Reinforcement Learning Optimization for Large-Scale Learning | |
| RAGEN: Training Agents by Reinforcing Reasoning | |
| SkyRL: A Modular Full-stack RL Library for LLMs | |
| Search-R1: Train your LLMs to reason and call a search engine with reinforcement learning | |
| Multimodal-Search-R1: Incentivizing LMMs to Search | |
| Visual Agentic Reinforcement Fine-Tuning | |