This repository provides a ROS 2-based framework for semantic navigation using visual-language models.
It integrates segmentation models like LSeg and FC-CLIP within a Dockerized environment for reproducibility and ease of setup.
- GPU with support for `nvidia/cuda:12.8.0`
- At least 8 GB of VRAM (can be reduced by lowering the batch size)
- Docker
- NVIDIA Container Toolkit
Before building the Docker image, download the checkpoints for the segmentation models. From the `docker` directory, run:

```bash
python3 download_checkpoint.py <PATH_TO_THIS_REPO>/docker
python3 download_cocopan_checkpoint.py
```

These scripts will automatically fetch the necessary pretrained weights for LSeg and FC-CLIP.
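If the Docker build later fails on missing weights, a quick sanity check that the files actually landed in `docker/` can save a long rebuild. The checkpoint filenames below are placeholders, not guaranteed to match what the download scripts produce:

```python
from pathlib import Path

# Placeholder names -- replace with the files your download scripts produce.
EXPECTED_CHECKPOINTS = ["lseg_checkpoint.ckpt", "fcclip_cocopan.pth"]

def missing_checkpoints(docker_dir, expected=EXPECTED_CHECKPOINTS):
    """Return the expected checkpoint files not present in docker_dir."""
    docker_dir = Path(docker_dir)
    return [name for name in expected if not (docker_dir / name).is_file()]

if __name__ == "__main__":
    missing = missing_checkpoints("docker")
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
```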
Inside `build_docker.sh`, set your preferred image name after the `-t` flag. Then build the image using:

```bash
. build_docker.sh GITHUB_USERNAME GITHUB_EMAIL GITHUB_TOKEN
```

Note: the GitHub credentials are required to authenticate and access this repository during the build process.
After a successful build, edit `run.sh` to match the image name and tag used during the build, then run:

```bash
. run.sh
```

This repository is a ROS 2 package.
To use ROS 2 launch files and parameters, build the workspace with colcon:

```bash
colcon build --symlink-install
source install/local_setup.bash
```

Additionally, compile and source the ROS 2 message interface:

```bash
cd ros2_vlmaps_interfaces
colcon build --symlink-install
source install/local_setup.bash
```

You can add the `source` command to your `~/.bashrc` for convenience.
If you already have a semantic map, launch the `semantic_map_server`. If you need to build or update one, use the `semantic_map_builder`:

```bash
ros2 launch visual_language_navigation build_map.launch.py
ros2 launch visual_language_navigation semantic_map_server.launch.py
```

Then run the following nodes:

```bash
python3 visual_language_navigation/llm/nav_agent.py
python3 visual_language_navigation/llm/chat_agent.py
```

The nodes subscribe to the following topics:

| Topic | Type | Description |
|---|---|---|
| `/camera/rgbd/img` | `sensor_msgs/Image` | RGB image from the camera |
| `/camera/rgbd/depth` | `sensor_msgs/Image` | Depth image from the camera |
| `/amcl_pose` | `geometry_msgs/PoseWithCovarianceStamped` | Pose estimate from Nav2 AMCL |
| `/global_costmap/costmap` | Nav2 Costmap | Occupancy grid map |
| `/camera/rgbd/camera_info` | `sensor_msgs/CameraInfo` | Camera intrinsics (optional, can be set manually in `mapping_params.yaml`) |
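Conceptually, the map builder fuses the depth image with the camera intrinsics from `CameraInfo` to lift pixels into 3D. Below is a minimal sketch of that pinhole back-projection, not the package's actual implementation; in practice the intrinsics come from `/camera/rgbd/camera_info`:

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Back-project a depth image (meters) into camera-frame 3D points.

    Returns an (H, W, 3) array of [x, y, z] coordinates; pixels with
    depth 0 map to the origin and would be filtered out in practice.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```

In the full pipeline, these camera-frame points would then be transformed into the global frame using the pose estimate before being inserted into the voxel grid.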
The following topics are published:

| Topic | Type | Description |
|---|---|---|
| `/voxmap` | `sensor_msgs/PointCloud2` | RGB-colored semantic map |
| `/map_index_result` | `sensor_msgs/PointCloud2` | Result of the `semantic_map_server/index_map` service |
| `/map_goal_indexing` | `sensor_msgs/PointCloud2` | Map goal indexing output |
| `/map_2d_index_marker` | `sensor_msgs/PointCloud2` | 2D index markers |
The following services are exposed:

| Service | Description |
|---|---|
| `semantic_map_builder/enable_mapping` | Enables or disables the mapping callback (`enable_flag: bool`) |
| `semantic_map_server/index_map` | Searches the map for `indexing_string` and publishes the result on `/map_index_result` |
| `semantic_map_server/show_semantic_map` | Publishes the RGB semantic point cloud on `/voxmap` |
| `semantic_map_server/load_semantic_map` | Loads a semantic map from a specified path |
| `semantic_map_server/llm_query` | Retrieves information from the semantic map for the LLM agent |
| `chat_agent/user_text` | Sends user queries (transcribed text) to the LLM planner |
Main parameters are defined in `params/mapping_params.yaml`. Key ones include:

| Parameter | Default | Description |
|---|---|---|
| `cell_size` | `0.02` | Cell size (m) |
| `grid_size` | `1500` | Number of cells per axis (map resolution) |
| `maximum_height` | `2.3` | Maximum voxel grid height (m) |
| `max_camera_distance` | `3.0` | Maximum Z-distance from the camera (m) |
| `depth_downsampling` | `10` | Random downsampling factor for RGB and depth pixels |
| `robot_base_frame` | `"geometric_unicycle"` | Robot base frame |
| `target_frame` | `"map"` | Global navigation frame |
| `map_frame` | `"map"` | Reference frame used by Nav2 |
| `seg_model_name` | `"fcclip"` | Segmentation model name |
| `classes_to_skip` | `["person", "floor", "ceiling", "wall"]` | Classes excluded from mapping |
- This branch allows choosing between LSeg and FC-CLIP as the segmentation backend.
- The Dockerfile provides a ready-to-use development environment for ease of setup.
- Ensure your GPU drivers and CUDA runtime are correctly configured before running.
TODO
This work takes inspiration from VLMaps: https://github.com/vlmaps/vlmaps
@inproceedings{huang23vlmaps,
title={Visual Language Maps for Robot Navigation},
author={Chenguang Huang and Oier Mees and Andy Zeng and Wolfram Burgard},
booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
year={2023},
address = {London, UK}
}