Implementation of our AAAI (Main Track) Paper: PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis
A novel framework that simulates pre-consultation dialogues between two Vision-Language Models (VLMs) - PatientVLM and DocVLM - to enhance medical diagnosis efficiency. This work was presented at AAAI 2026.
- Overview
- Installation Guide
- Data Preparation
- Pre-Consultation Dialogue Generation
- DocVLM Inference and Evaluation
- Citation
This repository implements a Pre-Consultation Dialogue Framework (PCDF) where:
- PatientVLM: Simulates the patient's perspective, describing symptoms, patient history, and medical images
- DocVLM: Acts as a clinical expert, asking diagnostic questions and making assessments
The dialogue-based approach improves diagnostic accuracy by leveraging the complementary strengths of different VLMs.
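
As a rough illustration of the dialogue idea described above, the sketch below alternates between a doctor-side and a patient-side callable for a fixed number of rounds. It is not the repository's implementation: `simulate_dialogue`, the `Turn` record, the `max_turns` parameter, and both stub functions are hypothetical stand-ins. In practice each callable would wrap a real VLM that also receives the image.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    speaker: str  # "doctor" or "patient"
    text: str


# Hypothetical signature: a model takes the image path and the dialogue so far, and returns text.
VLM = Callable[[str, List[Turn]], str]


def simulate_dialogue(image_path: str, doc_vlm: VLM, patient_vlm: VLM,
                      max_turns: int = 3) -> List[Turn]:
    """Alternate doctor questions and patient answers for max_turns rounds."""
    history: List[Turn] = []
    for _ in range(max_turns):
        question = doc_vlm(image_path, history)
        history.append(Turn("doctor", question))
        answer = patient_vlm(image_path, history)
        history.append(Turn("patient", answer))
    return history


# Stub models so the sketch runs without any VLM weights (assumption, for illustration only).
def stub_doctor(image_path: str, history: List[Turn]) -> str:
    return f"Question {len(history) // 2 + 1}: can you describe the symptom further?"


def stub_patient(image_path: str, history: List[Turn]) -> str:
    return f"Answer to: {history[-1].text}"


dialogue = simulate_dialogue("example.png", stub_doctor, stub_patient, max_turns=2)
for turn in dialogue:
    print(f"{turn.speaker}: {turn.text}")
```

Passing the models in as plain callables keeps the loop independent of any particular VLM, which matches the way the doctor and patient roles can be assigned to different models.
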

    git clone https://github.com/vl2g/pcdf.git
    cd pcdf
    conda create --name pcdf python=3.11.14 -y
    conda activate pcdf
    pip install -r requirements.txt

Process the MedMNIST dataset for your desired medical specialty:

    python scripts/process_medmnist.py --dataset dermamnist

Supported Datasets:
- dermamnist: Dermatology images
- pathmnist: Pathology images
- retinamnist: Retinal fundus images
- pneumoniamnist: Chest X-rays (Radiology/Pneumonia)
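
To preprocess several datasets in one go, the command above can be generated for each supported dataset name. The following is a minimal sketch that only builds and prints the commands. `SUPPORTED_DATASETS` and `build_process_command` are hypothetical helper names; only the script path and the `--dataset` flag come from this README.

```python
import shlex
import sys

# Dataset names listed in this README.
SUPPORTED_DATASETS = ["dermamnist", "pathmnist", "retinamnist", "pneumoniamnist"]


def build_process_command(dataset: str) -> list:
    """Return the argv list for preprocessing one dataset."""
    if dataset not in SUPPORTED_DATASETS:
        raise ValueError(f"Unsupported dataset: {dataset!r}")
    return [sys.executable, "scripts/process_medmnist.py", "--dataset", dataset]


commands = [build_process_command(name) for name in SUPPORTED_DATASETS]
for argv in commands:
    # Print only. To execute, pass argv to subprocess.run(argv, check=True)
    # from the repository root.
    print("python " + shlex.join(argv[1:]))
```
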
Generate dialogues between PatientVLM and DocVLM using the PCDF framework.

    export PYTHONPATH="${PYTHONPATH}:$(pwd)"
    python scripts/run_pcdf.py \
        --speciality Dermatology \
        --doc_vlm Qwen25VL \
        --patient_vlm mPLUGOwl3 \
        --config dermamnist \
        --split train

| Argument | Description | Default | Other Options |
|---|---|---|---|
| --speciality | Medical specialty domain | Dermatology | Pathology, Ophthalmology, Radiology |
| --doc_vlm | Vision-Language Model for doctor role | Qwen25VL | Gemma3, MedGemma, InternVL3, mPLUGOwl3 |
| --patient_vlm | Vision-Language Model for patient role | mPLUGOwl3 | Gemma3, MedGemma, InternVL3, Qwen25VL |
| --config | Dataset configuration file | dermamnist | pathmnist, retinamnist, pneumoniamnist |
| --split | Dataset split to process | train | test, val |

- Gemma3: Google's Gemma 3 vision-language model
- MedGemma: Medical domain-adapted version of Gemma
- InternVL3: InternVL version 3 multimodal model
- Qwen25VL: Qwen 2.5 Vision-Language model
- mPLUGOwl3: Alibaba's mPLUG-Owl 3 model
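
The documented options and model names can be mirrored in a small `argparse` parser, which is handy for validating arguments in wrapper scripts before launching a long run. This is a sketch built only from the argument table and model list in this README; it is not the parser used by `scripts/run_pcdf.py`, and `build_parser` is a hypothetical name.

```python
import argparse

# Values taken from the argument table and model list in this README.
VLMS = ["Qwen25VL", "mPLUGOwl3", "Gemma3", "MedGemma", "InternVL3"]
SPECIALITIES = ["Dermatology", "Pathology", "Ophthalmology", "Radiology"]
CONFIGS = ["dermamnist", "pathmnist", "retinamnist", "pneumoniamnist"]
SPLITS = ["train", "test", "val"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(
        description="Sketch of the documented run_pcdf.py options")
    parser.add_argument("--speciality", choices=SPECIALITIES, default="Dermatology")
    parser.add_argument("--doc_vlm", choices=VLMS, default="Qwen25VL")
    parser.add_argument("--patient_vlm", choices=VLMS, default="mPLUGOwl3")
    parser.add_argument("--config", choices=CONFIGS, default="dermamnist")
    parser.add_argument("--split", choices=SPLITS, default="train")
    return parser


# Example: override two options and keep the documented defaults for the rest.
args = build_parser().parse_args(["--speciality", "Radiology", "--config", "pneumoniamnist"])
print(args)
```

Using `choices` makes a misspelled model or split name fail immediately with a usage message instead of surfacing later in the run.
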

| Specialty | Recommended Config | Description |
|---|---|---|
| Dermatology | dermamnist | Skin lesion classification |
| Pathology | pathmnist | Histopathology image analysis |
| Ophthalmology | retinamnist | Retinal fundus imaging |
| Radiology | pneumoniamnist | Pneumonia detection from chest X-rays |

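
The specialty-to-config pairing can be kept in a lookup table so that a wrapper script never combines, say, `Radiology` with `dermamnist` by accident. The sketch below is illustrative only: `RECOMMENDED_CONFIG`, `check_pairing`, and `run_pcdf_args` are hypothetical helpers, and only the flag names and values come from this README.

```python
# Recommended pairing, copied from the table in this README.
RECOMMENDED_CONFIG = {
    "Dermatology": "dermamnist",
    "Pathology": "pathmnist",
    "Ophthalmology": "retinamnist",
    "Radiology": "pneumoniamnist",
}


def check_pairing(speciality: str, config: str) -> bool:
    """Return True when config is the recommended one for speciality."""
    if speciality not in RECOMMENDED_CONFIG:
        raise KeyError(f"Unknown speciality: {speciality!r}")
    return RECOMMENDED_CONFIG[speciality] == config


def run_pcdf_args(speciality: str, split: str = "train") -> list:
    """Build run_pcdf.py flags for a speciality using its recommended config."""
    return ["--speciality", speciality,
            "--config", RECOMMENDED_CONFIG[speciality],
            "--split", split]


print(check_pairing("Radiology", "pneumoniamnist"))
print(" ".join(run_pcdf_args("Ophthalmology", split="val")))
```
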
After fine-tuning the DocVLM on the pre-consultation dialogues, evaluate its diagnostic performance.
Create a separate environment for inference (requires Python 3.10):

    conda create --name docvlm python=3.10.19 -y
    conda activate docvlm
    pip install -r inference/requirements.txt

Execute inference using the trained DocVLM:

    python inference/qwen25vl_inference.py

This script processes the generated PCDF dialogues and produces diagnostic predictions.

Evaluate the resulting predictions:

    python inference/evaluate.py --json_path results/pcdf_dermamnist.json

Arguments:
- --json_path: Path to the inference results JSON file
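
For a quick sanity check of a results file outside the provided script, a plain accuracy computation can look like the sketch below. The schema of the results JSON and the metrics reported by `inference/evaluate.py` are not described in this README, so everything here is an assumption: the `label` and `prediction` keys, the `accuracy` helper, and the placeholder class names are hypothetical.

```python
import json


def accuracy(records) -> float:
    """Fraction of records whose prediction matches the label (case-insensitive)."""
    if not records:
        raise ValueError("no records to evaluate")
    correct = sum(
        1 for r in records
        if str(r["prediction"]).strip().lower() == str(r["label"]).strip().lower()
    )
    return correct / len(records)


# Hypothetical in-memory records standing in for the contents of a results file.
# With a real file, and assuming this schema, load it with json.load instead.
sample = json.loads(
    '[{"label": "class_a", "prediction": "Class_A"},'
    ' {"label": "class_b", "prediction": "class_a"}]'
)
print(f"accuracy = {accuracy(sample):.2f}")
```
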
If you find this work useful in your research, please consider citing:

    @inproceedings{lokesh2026patientvlm,
      title     = {PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis},
      author    = {Lokesh, K and Penamakuri, Abhirama Subramanyam and Agarwal, Uday and Challa, Apoorva and Gowda, Shreya K and Gupta, Somesh and Mishra, Anand},
      booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
      year      = {2026},
      publisher = {AAAI Press}
    }