(ICCV 2025) UAVScenes: A Multi-Modal Dataset for UAVs
We introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including detection, segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS). To the best of our knowledge, this is the first UAV benchmark dataset to offer both image and LiDAR point cloud semantic annotations (120k labeled pairs), with the potential to advance multi-modal UAV perception research.
We provide both the full dataset (interval=1) and a key-frame-only subset (interval=5, about 1/5 of the full size).
UAVScenes is available on the following cloud platforms:
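To make the `interval` setting concrete, here is a minimal sketch of how a key-frame subset could be derived from a full frame list. It assumes uniform striding that starts at the first frame. The function name `select_keyframes` is hypothetical and the actual sampling offset used for the release is not stated here.

```python
def select_keyframes(frames, interval=5):
    """Keep every `interval`-th frame, starting from the first one.

    Assumption: key frames are taken by uniform striding; the real
    release may use a different starting offset.
    """
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return frames[::interval]


frame_ids = list(range(12))          # stand-in for 12 consecutive frames
keyframes = select_keyframes(frame_ids, interval=5)
print(keyframes)                     # [0, 5, 10]
```

With `interval=1` the function returns the full list unchanged, which corresponds to the full dataset.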
- OneDrive
- Google Drive
- Baidu/百度网盘 (interval=5 only)
- HuggingFace (interval=5 only)
We currently include:
- Hikvision camera images with annotations
- Livox Avia LiDAR point clouds with annotations
- 6-DoF poses
- Reconstructed 3D point cloud/mesh maps
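As an illustration of what a 6-DoF pose is typically used for, the sketch below moves LiDAR points from the sensor frame into a map frame. It assumes the pose is given as a 4x4 homogeneous sensor-to-map matrix. That convention, the function name `transform_points`, and the example values are assumptions for illustration; check the pose files for the actual format and direction.

```python
def transform_points(pose, points):
    """Apply a 4x4 homogeneous transform to a list of (x, y, z) points."""
    out = []
    for x, y, z in points:
        out.append(tuple(
            pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
            for r in range(3)
        ))
    return out


# Hypothetical pose: 90-degree rotation about z, then a shift of (10, 0, 2).
pose = [
    [0.0, -1.0, 0.0, 10.0],
    [1.0,  0.0, 0.0,  0.0],
    [0.0,  0.0, 1.0,  2.0],
    [0.0,  0.0, 0.0,  1.0],
]
print(transform_points(pose, [(1.0, 0.0, 0.0)]))  # [(10.0, 1.0, 2.0)]
```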
interval1_CAM_LIDAR contains camera images, LiDAR point clouds, 6-DoF poses, and calibrations.
interval1_CAM_label contains camera semantic annotations.
interval1_LIDAR_label contains LiDAR semantic annotations.
terra_3dmap_pointcloud_mesh contains 3D mesh/point cloud maps.
cmap.py contains color-ID mapping.
calibration_results.py contains camera-LiDAR calibrations.
sampleinfos_interpolated.json contains camera-3D map calibrations.
terra_ply/ contains the raw mesh map output from Terra, which is split into multiple mesh blocks.
cloud_merged.ply contains the raw point cloud map output from Terra.
Mesh.ply is a single mesh built by merging all mesh blocks from terra_ply/.
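If an annotation is stored as a color-coded image, a color-ID mapping such as the one in cmap.py can be used to turn it into class IDs. The sketch below shows the idea in plain Python. The palette, the function name `decode_label_colors`, and the `unknown_id` fallback are all hypothetical; the real colors and IDs should be taken from cmap.py.

```python
def decode_label_colors(pixels, color_to_id, unknown_id=-1):
    """Map rows of RGB tuples to rows of integer class IDs.

    Colors missing from the mapping are assigned `unknown_id`.
    """
    return [
        [color_to_id.get(tuple(px), unknown_id) for px in row]
        for row in pixels
    ]


# Hypothetical palette for illustration only; not the UAVScenes palette.
color_to_id = {(0, 0, 0): 0, (128, 64, 128): 1, (0, 128, 0): 2}

# A 2x2 stand-in for a color-coded label image.
image = [
    [(0, 0, 0), (128, 64, 128)],
    [(0, 128, 0), (255, 255, 255)],
]
print(decode_label_colors(image, color_to_id))  # [[0, 1], [2, -1]]
```

For full-resolution images, a vectorized array implementation would be preferable to nested lists; the lookup logic stays the same.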
- UAVScenes is built on MARS-LVIG. We thank its authors for their excellent work.
- We use X-AnyLabeling for 2D annotation, CloudCompare for 3D annotation, and DJI Terra (大疆智图) for 3D reconstruction (in our experience, much more accurate than COLMAP).
- More sensor and scene information can be found in MARS-LVIG.
- UAVScenes consists of 4 large scenes (AMtown, AMvalley, HKairport, and HKisland). Each scene consists of multiple runs (e.g., 01, 02, and 03).
In preparation; please stay tuned. In the meantime, you are welcome to use your own train/test split for any task.
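One simple way to build a custom split is to hold out whole runs, so that frames from the same flight never appear on both sides. The sketch below is not an official split. The `(scene, run, frame)` record layout and the function name `split_by_run` are assumptions for illustration.

```python
def split_by_run(samples, test_runs):
    """Split (scene, run, frame) records so each run lands on one side only."""
    train, test = [], []
    for sample in samples:
        scene, run, _frame = sample
        (test if (scene, run) in test_runs else train).append(sample)
    return train, test


# Hypothetical sample index; scene and run names follow the list above.
samples = [
    ("AMtown", "01", 0), ("AMtown", "01", 1),
    ("AMtown", "02", 0),
    ("HKisland", "01", 0), ("HKisland", "03", 0),
]
train, test = split_by_run(
    samples, test_runs={("AMtown", "02"), ("HKisland", "03")}
)
print(len(train), len(test))  # 3 2
```

Splitting by run rather than by frame avoids near-duplicate consecutive frames leaking from the training set into the test set.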
@article{wang2025uavscenes,
title={UAVScenes: A Multi-Modal Dataset for UAVs},
author={Wang, Sijie and Li, Siqi and Zhang, Yawei and Yu, Shangshu and Yuan, Shenghai and She, Rui and Guo, Quanjiang and Zheng, JinXuan and Howe, Ong Kang and Chandra, Leonrich and others},
journal={arXiv preprint arXiv:2507.22412},
year={2025}
}
This work is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License and is meant for academic use only.


