Towards Safe Autonomous Driving with Knowledge Graph-based Retrieval-Augmented Generation [Paper] [Dataset](Password: mz23)
In this work, we study how vision-language models (VLMs) can be used to enhance the safety of autonomous driving systems across perception, situational understanding, and path planning. However, existing research has largely overlooked the evaluation of these models in safety-critical driving scenarios. To bridge this gap, we create a benchmark (SafeDrive228K) and propose a new baseline that augments a VLM with knowledge graph-based retrieval-augmented generation (SafeDriveRAG) for visual question answering (VQA). Specifically, SafeDrive228K is the first large-scale multimodal question-answering benchmark of its kind, comprising 228K examples across 18 sub-tasks. It covers a diverse range of traffic safety queries, from traffic accidents and corner cases to common safety knowledge, enabling a thorough assessment of models' comprehension and reasoning abilities. Furthermore, we propose a plug-and-play multimodal knowledge graph-based retrieval-augmented generation approach that employs a novel multi-scale subgraph retrieval algorithm for efficient information retrieval. By incorporating traffic safety guidelines collected from the Internet, this framework further enhances the model's capacity to handle safety-critical situations. Finally, we conduct comprehensive evaluations on five mainstream VLMs to assess their reliability in safety-sensitive driving tasks. Experimental results show that integrating RAG significantly improves performance: averaged over the five VLMs, it yields a +4.73% gain on Traffic Accident tasks, +8.79% on Corner Case tasks, and +14.57% on Traffic Safety Commonsense, underscoring the potential of our benchmark and methodology for advancing research in traffic safety.
- SafeDrive228K: A Large-Scale Multimodal QA Benchmark for Autonomous Driving Safety
- 9,331 real-world traffic accident videos
- Over 35,000 corner-case and safety-related images
- 228,000 QA pairs spanning traffic accidents, corner cases, and safety commonsense
- Comprehensive Evaluation of VLMs in Traffic Safety
- The first benchmark systematically assessing model reasoning under diverse, safety-critical driving conditions
- Novel Multimodal Knowledge Graph-based RAG Framework
- Unified indexing and retrieval for both textual and visual entities
- Efficient multi-scale subgraph retrieval tailored for real-time requirements
- Plug-and-Play Enhancement for Mainstream Open-Source VLMs
- Substantial Performance Gains in Safety-Critical Tasks
- Notable improvements in commonsense safety (+14.57%), corner cases (+8.79%), and accident scenarios (+4.73%) with RAG enhancement
This benchmark is designed to evaluate model performance across diverse traffic safety scenarios. It consists of three major components: Traffic safety knowledge, Traffic accident, and Corner Case.
```
benchmark
├── Corner_Case
│   ├── annotations.json
│   ├── Coner_case_qa_new.json
│   └── img
├── Traffic_accident
│   ├── annotations
│   ├── img
│   └── Traffic_accident_qa_new.json
└── Traffic_safety_knowledge
    └── qa
```

This folder primarily contains QA JSON files and the corresponding image folder. Each JSON file includes traffic-safety knowledge questions such as road rules, licenses, and vehicle types.
Example format:
```json
{
  "id": 0,
  "question_type": "single-choice",
  "question": "The holder of a Class 6 operator's licence may operate which of the following vehicles?",
  "answer": ["OptionC"],
  "explain": "To drive a motorcycle, you must hold a Class 6 licence.",
  "img_path": [],
  "country": "Canada",
  "vehicle_type": "car",
  "optionA": "An ambulance",
  "optionB": "A bus",
  "optionC": "A motorcycle",
  "optionD": "A tractor-trailer",
  "optionE": "",
  "optionF": "",
  "language": "English"
}
```

These questions cover traffic rules and safety knowledge across multiple countries, ensuring a broad evaluation of model understanding.
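For evaluation, a record in this format can be loaded and scored with a small helper. The following is a minimal sketch assuming the record layout shown above; the function names are hypothetical, not part of the released code:

```python
import json

def load_qa(path):
    """Load a QA JSON file containing a list of question records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def check_single_choice(record, predicted):
    """Check a single-choice prediction against the gold answer.

    Answers are stored as lists like ["OptionC"]; comparing on the
    trailing letter lets both "C" and "OptionC" count as correct.
    """
    gold = {a[-1].upper() for a in record["answer"]}
    return predicted[-1].upper() in gold
```

For the example record above, `check_single_choice(record, "OptionC")` returns `True`.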
This folder contains:
- QA JSON files
- Video folders (referenced via path fields)
- Annotation folders
Each video is associated with a set of 11 questions covering different aspects of traffic accidents. The question types include single-choice, multiple-choice, and open-ended QA, designed to comprehensively assess model reasoning in accident scenarios.
Example format:
```json
{
  "id": 0,
  "path": "1/001537",
  "extracted_json": [
    {
      "question": "What caused the accident in the video?",
      "options": [
        "A) Pedestrian is drunk",
        "B) Pedestrian moves or stays on the motorway",
        "C) Pedestrian does not notice the coming vehicles when crossing the street",
        "D) The vehicle hits the objects falling from the front vehicles"
      ],
      "answer": ["C"],
      "type": "single-choice"
    },
    ...
  ],
  ...
}
```
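Since each record bundles one video (its `path`) with a list of questions under `extracted_json`, evaluation code typically flattens records into (video, question) pairs. A minimal sketch under that assumption, with a hypothetical helper name:

```python
def iter_video_questions(records):
    """Yield (video_path, question_dict) pairs from accident records.

    Each record holds one video (its "path") and a list of question
    dicts under "extracted_json", as in the example format above.
    """
    for rec in records:
        for q in rec.get("extracted_json", []):
            yield rec["path"], q

# Toy record mirroring the example format above.
records = [{
    "id": 0,
    "path": "1/001537",
    "extracted_json": [
        {"question": "What caused the accident in the video?",
         "answer": ["C"], "type": "single-choice"},
    ],
}]
pairs = list(iter_video_questions(records))  # one (path, question) pair
```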
This folder includes:
- QA JSON files
- Annotation files
- Corner case images
Each corner case corresponds to 7 questions, spanning single-choice, multiple-choice, and open-ended formats. These questions focus on identifying rare or complex road entities, targeting the robustness of models in handling rare or intricate traffic contexts.
Example format:
```json
{
  "id": 0,
  "path": "img/images_0001.jpg",
  "extracted_json": [
    {
      "question": "What do you think the object at the bounding box [1265, 560, 40, 157] is?",
      "options": [
        "A) tricycle",
        "B) bollard",
        "C) machinery",
        "D) traffic_box"
      ],
      "answer": ["B"],
      "type": "single-choice"
    },
    ...
  ],
  ...
}
```

This benchmark builds upon several publicly available datasets. We sincerely thank the creators of IDKB, CODA-LM, and CAP-DATA for making their resources available to the community.
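The corner-case questions embed a bounding box directly in the question text. Assuming an `[x, y, w, h]` convention (inferred from the example values, not stated explicitly), the box can be parsed and converted to crop coordinates as follows; this is a sketch with hypothetical helper names:

```python
import re

BBOX_RE = re.compile(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")

def extract_bbox(question):
    """Pull the first [x, y, w, h] box out of a question string."""
    m = BBOX_RE.search(question)
    return [int(g) for g in m.groups()] if m else None

def bbox_to_crop(bbox):
    """Convert an assumed [x, y, w, h] box to (left, upper, right, lower),
    the coordinate order expected by e.g. PIL's Image.crop."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)
```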
If you use our dataset or benchmark in your research, please cite us as:
```bibtex
@article{ye2025safedriverag,
  title={SafeDriveRAG: Towards Safe Autonomous Driving with Knowledge Graph-based Retrieval-Augmented Generation},
  author={Ye, Hao and Qi, Mengshi and Liu, Zhaohong and Liu, Liang and Ma, Huadong},
  journal={arXiv preprint arXiv:2507.21585},
  year={2025}
}
```
