OnFly: Onboard Zero-Shot Aerial Vision-Language Navigation toward Safety and Efficiency

Zheng, Guiyong; Ban, Yueting; Zhang, Mingjie; Zheng, Juepeng; Zhou, Boyu

A fully onboard real-time framework for zero-shot aerial vision-language navigation

Guiyong Zheng¹,², Yueting Ban², Mingjie Zhang³,², Juepeng Zheng¹, and Boyu Zhou²,†

¹ School of Artificial Intelligence, Sun Yat-Sen University, Zhuhai, China
² Southern University of Science and Technology, Shenzhen, China
³ The Hong Kong University of Science and Technology, Guangzhou, China
† Corresponding author

OnFly enables UAVs to follow natural-language instructions in complex 3D environments with fully onboard, zero-shot aerial vision-language navigation. The system combines a shared-perception dual-agent architecture, hybrid long-horizon memory, semantic-geometric target verification, and receding-horizon planning to improve both safety and efficiency.
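The decoupling of a high-frequency target agent from a low-frequency progress monitor can be illustrated with a minimal, synchronous sketch. Everything here is an assumption made for illustration: the names `Perception` and `run_dual_agent`, the `monitor_every` ratio, and the tick-based loop are hypothetical, and the actual system may well run the two agents asynchronously.

```python
from dataclasses import dataclass


@dataclass
class Perception:
    """Hypothetical stand-in for the shared image/depth features of one step."""
    step: int


def run_dual_agent(num_steps, monitor_every=5):
    """Log which agent runs at each step (illustrative scheduling only)."""
    log = []
    for step in range(num_steps):
        obs = Perception(step)  # perceived once, then shared by both agents
        # High-frequency agent: proposes a navigation target at every step.
        log.append(("target", obs.step))
        # Low-frequency agent: checks instruction progress only occasionally.
        if step % monitor_every == 0:
            log.append(("monitor", obs.step))
    return log
```

In this sketch the monitor consumes the same `Perception` object as the target agent, so no second perception pass is needed when it fires.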

Abstract

Aerial vision-language navigation (AVLN) enables UAVs to follow natural-language instructions in complex 3D environments. However, existing zero-shot AVLN methods often suffer from unstable single-stream vision-language model (VLM) decision-making, unreliable long-horizon progress monitoring, and a trade-off between safety and efficiency. We propose OnFly, a fully onboard, real-time framework for zero-shot AVLN. OnFly adopts a shared-perception dual-agent architecture that decouples high-frequency target generation from low-frequency progress monitoring, thereby stabilizing decision-making. It further employs a hybrid keyframe-recent-frame memory that preserves global trajectory context while maintaining KV-cache prefix stability, enabling reliable long-horizon monitoring with termination and recovery signals. In addition, a semantic-geometric verifier refines VLM-predicted targets for instruction consistency and geometric safety using VLM features and depth cues, while a receding-horizon planner generates optimized collision-free trajectories under geometric safety constraints, improving both safety and efficiency. In simulation, OnFly achieves a task success of 67.8%, compared with 26.4% for the strongest state-of-the-art baseline (a gain of 41.4 percentage points), while fully onboard real-world flights validate its feasibility for real-time deployment.
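One way to picture a hybrid keyframe-recent-frame memory with a stable prefix is sketched below. This is an assumption-laden illustration, not the paper's implementation: the class `HybridMemory`, the `recent_size` parameter, and the rule that a frame is promoted to a keyframe when it leaves the recent window are all hypothetical. The point it shows is that an append-only keyframe list keeps the front of the context unchanged between queries, which is the property a KV cache needs in order to reuse previously computed prefix entries.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HybridMemory:
    """Illustrative memory: append-only keyframes plus a sliding recent window."""
    recent_size: int = 4
    keyframes: list = field(default_factory=list)  # only ever appended to
    recent: deque = field(default_factory=deque)   # (frame_id, is_keyframe)

    def add(self, frame_id, is_keyframe=False):
        self.recent.append((frame_id, is_keyframe))
        if len(self.recent) > self.recent_size:
            old_id, old_is_key = self.recent.popleft()
            # Frames leaving the window are kept only if flagged as keyframes.
            if old_is_key:
                self.keyframes.append(old_id)

    def context(self):
        """Frame order as it would be presented to the model."""
        return list(self.keyframes) + [fid for fid, _ in self.recent]


def shared_prefix_len(a, b):
    """Length of the common leading run of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

Because `keyframes` is never reordered or truncated, each later keyframe list extends the earlier one, so `shared_prefix_len` between successive keyframe snapshots always equals the length of the earlier snapshot.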

BibTeX

@misc{zheng2026onflyonboardzeroshotaerial,
      title={OnFly: Onboard Zero-Shot Aerial Vision-Language Navigation toward Safety and Efficiency}, 
      author={Guiyong Zheng and Yueting Ban and Mingjie Zhang and Juepeng Zheng and Boyu Zhou},
      year={2026},
      eprint={2603.10682},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.10682}, 
}