While retrieval-augmented generation (RAG) excels at direct knowledge retrieval, it struggles with complex queries that require abstract or multi-step reasoning. To close this gap, we developed DIVER, a retrieval pipeline designed specifically for these reasoning-intensive tasks. DIVER integrates four stages: document preprocessing, iterative LLM-driven query expansion, a specialized retriever fine-tuned on complex synthetic data, and a novel reranker that combines listwise and pointwise scoring. On the BRIGHT benchmark, DIVER sets a new state of the art, significantly outperforming other reasoning-aware models (NDCG 45.8). These results highlight the effectiveness of incorporating deep reasoning into retrieval for solving complex real-world problems. See the DIVER paper for details.
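The four stages above compose into a single retrieval flow. The sketch below is purely illustrative: every function is a hypothetical stand-in (the real components are an LLM query expander, the fine-tuned dense retriever, and the listwise/pointwise reranker), not DIVER's actual API.

```python
# Illustrative sketch of the four DIVER stages. All functions are
# hypothetical stand-ins, not DIVER's actual implementation.

def preprocess(corpus):
    # Stage 1: document preprocessing (cleaning; real pipelines also chunk).
    return [doc.strip() for doc in corpus]

def expand_query(query, rounds=2):
    # Stage 2: iterative LLM-driven query expansion, stubbed by appending
    # a placeholder per round instead of calling an LLM.
    for _ in range(rounds):
        query += " [expanded]"
    return query

def retrieve(query, docs, k=2):
    # Stage 3: the fine-tuned dense retriever, stubbed with term overlap.
    scored = sorted(docs, key=lambda d: -len(set(query.split()) & set(d.split())))
    return scored[:k]

def rerank(query, docs):
    # Stage 4: listwise+pointwise reranking, stubbed as a pointwise re-sort.
    return sorted(docs, key=lambda d: -len(set(query.split()) & set(d.split())))

docs = preprocess(["  gravity pulls objects  ", "  capitals of countries  "])
query = expand_query("what is gravity")
results = rerank(query, retrieve(query, docs))
print(results[0])
```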
- [2025-11-20] 🚀 We released our reranking models Diver-GroupRank-7B and Diver-GroupRank-32B; the inference code is available at ./Retriever/rerank_groupwise.py. With test-time augmentation, our GroupRank-32B model reaches a score of 46.8 on BRIGHT; see the paper for details.
- [2025-10-20] 🚀 We released the DIVER-Retriever-4B-1020 model on ModelScope and Hugging Face, scoring 31.9 on the BRIGHT benchmark.
- [2025-10-14] 🚀 We released the DIVER-Retriever-1.7B model on ModelScope and Hugging Face, scoring 27.3 on the BRIGHT benchmark.
- [2025-09-12] 🚀 We released the Gemini-based listwise reranking code, available at ./Retriever/rerank_listwise.py, which scores 43.9 on BRIGHT.
- [2025-09-05] 🚀 We released the DIVER-Retriever-0.6B model on ModelScope and Hugging Face, scoring 25.2 on the BRIGHT benchmark.
- [2025-08-28] 🚀 We released the DIVER-Retriever-4B model on ModelScope.
- [2025-08-24] 🏆 We updated DIVER V2, further improving its score on the BRIGHT leaderboard to 45.8.
- [2025-08-18] 🚀 We open-sourced the full DIVER codebase, including inference and training.
- ⬜ Open-source DIVER-VL-Embedding and DIVER-VL-Reranker: release source code and models
- ✅ Open-source DIVER-Reranker: source code and models released
You can download the models listed in the table below and choose the parameter scale that fits your scenario. If you are in mainland China, we also provide the models on ModelScope.cn for faster downloads.
Performance comparison of DIVER against other baselines on the BRIGHT leaderboard. The best result on each dataset is highlighted in bold.
| Method | Avg. | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rank-R1-14B | 20.5 | 31.2 | 38.5 | 21.2 | 26.4 | 22.6 | 18.9 | 27.5 | 9.2 | 20.2 | 9.7 | 11.9 | 9.2 |
| Qwen1.5-7B with InteRank-3B | 27.4 | 51.2 | 51.4 | 22.4 | 31.9 | 17.3 | 26.6 | 22.4 | 24.5 | 23.1 | 13.5 | 19.3 | 25.5 |
| GPT4 with Rank1-32B | 29.4 | 49.7 | 35.8 | 22.0 | 37.5 | 22.5 | 21.7 | 35.0 | 18.8 | 32.5 | 10.8 | 22.9 | 43.7 |
| ReasonIR with QwenRerank | 36.9 | 58.2 | 53.2 | 32.0 | 43.6 | 28.8 | 37.6 | 36.0 | 33.2 | 34.8 | 7.9 | 32.6 | 45.0 |
| ReasonIR with Rank-R1-32B | 38.8 | 59.5 | 55.1 | 37.9 | 52.7 | 30.0 | 39.3 | 45.1 | 32.1 | 17.1 | 10.7 | 40.4 | 45.6 |
| RaDeR with QwenRerank | 39.2 | 58.0 | 59.2 | 33.0 | 49.4 | 31.8 | 39.0 | 36.4 | 33.5 | 33.3 | 10.8 | 34.2 | 51.6 |
| XRR2 | 40.3 | 63.1 | 55.4 | 38.5 | 52.9 | 37.1 | 38.2 | 44.6 | 21.9 | 35.0 | 15.7 | 34.4 | 46.2 |
| ReasonRank | 40.8 | 62.7 | 55.5 | 36.7 | 54.6 | 35.7 | 38.0 | 44.8 | 29.5 | 25.6 | 14.4 | 42.0 | 50.1 |
| DIVER | 41.6 | 62.2 | 58.7 | 34.4 | 52.9 | 35.6 | 36.5 | 42.9 | **38.9** | 25.4 | 18.3 | 40.0 | 53.1 |
| BGE Reasoner | 45.2 | 66.5 | **63.7** | 39.4 | 50.3 | 37.0 | 42.9 | 43.7 | 35.1 | **44.3** | 17.2 | 44.2 | **58.5** |
| DIVER V2 | **45.8** | **68.0** | 62.5 | **42.0** | **58.2** | **41.5** | **44.3** | **49.2** | 34.8 | 32.9 | **19.1** | **44.3** | 52.6 |
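The scores in these tables are NDCG@10 values scaled to 0–100, the metric reported by the BRIGHT benchmark. As a refresher, here is a minimal NDCG@k computation with binary relevance labels (a sketch, not the official evaluation code):

```python
import math

# Minimal NDCG@k with binary relevance labels; illustrative only,
# not the official BRIGHT evaluation code.

def dcg(rels):
    # Discounted cumulative gain over a ranked list of relevance labels.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels, k=10):
    rels = rels[:k]
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Relevance of the top-5 retrieved documents (1 = relevant).
print(round(ndcg([1, 0, 1, 0, 0]), 4))
```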
Retrieval-only performance on BRIGHT under different query variants.

| Method | Avg. | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Evaluate Retriever with Original Query** | | | | | | | | | | | | | |
| BM25 | 14.5 | 18.9 | 27.2 | 14.9 | 12.5 | 13.6 | 18.4 | 15.0 | 24.4 | 7.9 | 6.2 | 10.4 | 4.9 |
| SBERT | 14.9 | 15.1 | 20.4 | 16.6 | 22.7 | 8.2 | 11.0 | 15.3 | 26.4 | 7.0 | 5.3 | 20.0 | 10.8 |
| gte-Qwen1.5-7B | 22.5 | 30.6 | 36.4 | 17.8 | 24.6 | 13.2 | 22.2 | 14.8 | 25.5 | 9.9 | 14.4 | 27.8 | 32.9 |
| Qwen3-4B | 5.6 | 3.5 | 8.0 | 2.3 | 2.0 | 1.6 | 1.0 | 4.4 | 2.1 | 0.1 | 4.9 | 18.0 | 19.2 |
| OpenAI | 17.9 | 23.3 | 26.7 | 19.5 | 27.6 | 12.8 | 14.3 | 20.5 | 23.6 | 2.4 | 8.5 | 23.5 | 11.7 |
| | 20.0 | 22.7 | 34.8 | 19.6 | 27.8 | 15.7 | 20.1 | 17.1 | 29.6 | 3.6 | 9.3 | 23.8 | 15.9 |
| ReasonIR-8B | 24.4 | 26.2 | 31.4 | 23.3 | 30.0 | 18.0 | 23.9 | 20.5 | 35.0 | 10.5 | 14.7 | 31.9 | 27.2 |
| RaDeR-7B | 25.5 | 34.6 | 38.9 | 22.1 | 33.0 | 14.8 | 22.5 | 23.7 | 37.3 | 5.0 | 10.2 | 28.4 | 35.1 |
| Seed1.5-Embedding | 27.2 | 34.8 | 46.9 | 23.4 | 31.6 | 19.1 | 25.4 | 21.0 | 43.2 | 4.9 | 12.2 | 33.3 | 30.5 |
| DIVER-Retriever-0.6B | 25.2 | 36.4 | 41.9 | 29.0 | 31.0 | 21.2 | 24.6 | 23.2 | 15.6 | 6.8 | 8.4 | 33.2 | 31.7 |
| DIVER-Retriever-4B | 28.9 | 41.8 | 43.7 | 21.7 | 35.3 | 21.0 | 21.2 | 25.1 | 37.6 | 13.2 | 10.7 | 38.4 | 37.3 |
| **Evaluate Retriever with GPT-4 REASON-query** | | | | | | | | | | | | | |
| BM25 | 27.0 | 53.6 | 54.1 | 24.3 | 38.7 | 18.9 | 27.7 | 26.3 | 19.3 | 17.6 | 3.9 | 19.2 | 20.8 |
| SBERT | 17.8 | 18.5 | 26.3 | 17.5 | 27.2 | 8.8 | 11.8 | 17.5 | 24.3 | 10.3 | 5.0 | 22.3 | 23.5 |
| gte-Qwen1.5-7B | 24.8 | 35.5 | 43.1 | 24.3 | 34.3 | 15.4 | 22.9 | 23.9 | 25.4 | 5.2 | 4.6 | 28.7 | 34.6 |
| Qwen3-4B | 5.5 | 1.3 | 17.3 | 2.5 | 6.2 | 1.0 | 4.8 | 4.5 | 3.0 | 5.9 | 0.0 | 7.2 | 12.5 |
| OpenAI | 23.3 | 35.2 | 40.1 | 25.1 | 38.0 | 13.6 | 18.2 | 24.2 | 24.5 | 6.5 | 7.7 | 22.9 | 23.8 |
| | 26.2 | 36.4 | 45.6 | 25.6 | 38.2 | 18.7 | 29.5 | 17.9 | 31.1 | 3.7 | 10.0 | 27.8 | 30.4 |
| ReasonIR-8B | 29.9 | 43.6 | 42.9 | 32.7 | 38.8 | 20.9 | 25.8 | 27.5 | 31.5 | 19.6 | 7.4 | 33.1 | 35.7 |
| RaDeR-7B | 29.2 | 36.1 | 42.9 | 25.2 | 37.9 | 16.6 | 27.4 | 25.0 | 34.8 | 11.9 | 12.0 | 37.7 | 43.4 |
| DIVER-Retriever-4B | 32.1 | 51.9 | 53.5 | 29.5 | 41.2 | 21.4 | 27.5 | 26.1 | 33.5 | 11.7 | 9.5 | 39.3 | 39.7 |
| **Evaluate Retriever with DIVER-QExpand query** | | | | | | | | | | | | | |
| ReasonIR-8B | 32.6 | 49.4 | 44.7 | 32.4 | 44.0 | 26.6 | 31.8 | 29.0 | 32.3 | 12.8 | 9.1 | 40.7 | 38.4 |
| +BM25 (Hybrid) | 35.7 | 56.8 | 53.5 | 33.0 | 48.5 | 29.4 | 34.2 | 32.0 | 35.2 | 16.8 | 12.9 | 39.3 | 36.8 |
| DIVER-Retriever | 33.9 | 54.5 | 52.7 | 28.8 | 44.9 | 25.1 | 27.4 | 29.5 | 34.5 | 10.0 | 14.5 | 40.7 | 44.7 |
| +BM25 (Hybrid) | 37.2 | 60.0 | 55.9 | 31.8 | 47.9 | 27.1 | 33.9 | 31.9 | 35.1 | 23.1 | 16.8 | 36.9 | 46.6 |
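The "+BM25 (Hybrid)" rows fuse dense-retriever scores with BM25 scores. A common recipe is a per-document weighted sum of normalized scores; the sketch below assumes min-max normalization and an arbitrary weight `alpha=0.5` (not DIVER's published settings).

```python
# Sketch of dense + BM25 hybrid score fusion. The normalization scheme
# (min-max) and the weight alpha=0.5 are illustrative assumptions.

def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def hybrid_scores(dense, bm25, alpha=0.5):
    # Per-document weighted sum of normalized dense and BM25 scores.
    return [alpha * d + (1 - alpha) * b
            for d, b in zip(min_max(dense), min_max(bm25))]

dense = [0.82, 0.34, 0.67]  # e.g. cosine similarities from the dense retriever
bm25 = [12.1, 18.4, 3.2]    # e.g. raw BM25 scores for the same documents
fused = hybrid_scores(dense, bm25)
print([round(s, 3) for s in fused])
```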
```shell
sh run_all.sh
```
Sentence Transformers usage
```python
# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("AQ-MedAI/Diver-Retriever-4B")

# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

# Encode the queries and documents. Note that queries benefit from using a prompt.
# Here we use the prompt called "query" stored under `model.prompts`, but you can
# also pass your own prompt via the `prompt` argument.
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
```
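To turn the similarity matrix into per-query document rankings, sort document indices by descending score. A minimal sketch with hard-coded illustrative values (stand-ins for the tensor printed above):

```python
# Turn a query-by-document similarity matrix into rankings.
# The values below are illustrative stand-ins for the tensor
# returned by model.similarity(...).
similarity = [
    [0.92, 0.11],  # query 0 vs. documents 0 and 1
    [0.08, 0.87],  # query 1 vs. documents 0 and 1
]

def rank_documents(sim_row):
    # Indices of documents sorted by descending similarity.
    return sorted(range(len(sim_row)), key=lambda j: -sim_row[j])

for qi, row in enumerate(similarity):
    print(f"query {qi}: ranked docs {rank_documents(row)}")
```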
vLLM usage
```python
# Requires vllm>=0.8.5
import torch
from vllm import LLM

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instructions to the retrieved documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="AQ-MedAI/Diver-Retriever-4B", task="embed")
outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
```

We recommend using swift to fine-tune our DIVER-Retriever-4B. Before starting training, make sure your environment is correctly configured.
```shell
pip install ms-swift -U
# Or install from source:
pip install git+https://github.com/modelscope/ms-swift.git
pip install transformers -U

# Optional packages
pip install deepspeed      # multi-GPU training
pip install liger-kernel   # saves GPU memory
pip install flash-attn --no-build-isolation
```

Training data should follow one of the JSONL formats below (plain text pairs for LLM embedding training, or image-text samples for MLLM training):

```
# LLM
{"query": "sentence1", "response": "sentence2"}

# MLLM
{"query": "<image>", "response": "sentence", "images": "/some/images.jpg"}
{"query": "<image>sentence1", "response": "<image>sentence2", "rejected_response": ["<image>sentence1", "<image>sentence2"], "images": ["/some/images.jpg", "/some/images.jpg", "/some/images.jpg", "/some/images.jpg"]}
```

The complete training command below uses the infonce loss as an example.
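As background, the infonce loss is an in-batch contrastive objective: each query's paired response is its positive, and the other responses in the batch act as negatives. A minimal plain-Python sketch for a single query (illustrative only, not ms-swift's actual implementation):

```python
import math

# Minimal in-batch InfoNCE for one query: cross-entropy over its
# similarities to all responses in the batch, where positive_index
# marks the paired (positive) response. Illustrative only.

def info_nce(sim_row, positive_index, temperature=0.05):
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[positive_index]

# Query 0's similarities to responses 0..2; response 0 is its positive.
print(round(info_nce([0.9, 0.7, 0.4], 0, temperature=0.1), 4))
```

Lower loss means the positive response already outscores the in-batch negatives by a wide margin.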
```shell
nproc_per_node=8

NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model DIVER/DIVER-Retriever-4B \
    --task_type embedding \
    --model_type qwen3_emb \
    --train_type full \
    --dataset your_dataset \
    --split_dataset_ratio 0.05 \
    --eval_strategy steps \
    --output_dir output \
    --eval_steps 20 \
    --num_train_epochs 5 \
    --save_steps 20 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 6e-6 \
    --loss_type infonce \
    --label_names labels \
    --dataloader_drop_last true \
    --deepspeed zero3
```

With 8 processes, a per-device batch size of 4, and 4 gradient-accumulation steps, the effective global batch size is 8 × 4 × 4 = 128.

If you find our work helpful, please feel free to let us know; we would greatly appreciate it.
```bibtex
@misc{DIVER,
      title={DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval},
      author={Meixiu Long and Duolin Sun and Dan Yang and Junjie Wang and Yue Shen and Jian Wang and Peng Wei and Jinjie Gu and Jiahai Wang},
      year={2025},
      eprint={2508.07995},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2508.07995},
}
```
We thank the prior related research efforts and their open-source releases: BRIGHT, ReasonIR, RaDeR, ThinkQE, Qwen3-Embedding.