DEER is a benchmark for evaluating deep research agents on expert report generation.
📄 Paper: https://arxiv.org/abs/2512.17776
DEER provides a systematic and interpretable evaluation framework for expert-level long-form research reports:
- Expert-defined hierarchical taxonomy (7 dimensions, 25 sub-dimensions)
- 101 fixed rubric items for structured LLM-based scoring
- Task-specific Expert Evaluation Guidance
- Report-wide claim verification with implicit citation back-tracking
DEER enables fine-grained, domain-aware diagnostics beyond aggregate scoring.
```bash
git clone https://github.com/hanjanghoon/DEER.git
cd DEER
conda env create -f deer.yml
conda activate deer
```

Create a `.env` file in the root directory and add your API keys:

```
OPENAI_API_KEY=your_openai_key_here
JINA_API_KEY=your_jina_key_here
```
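As a convenience, the `.env` file can also be generated from variables already exported in your shell. This is only a sketch, not part of the official setup; the placeholder fallback values are assumptions:

```shell
# Sketch: write a .env file from environment variables already exported in
# the current shell. Falls back to placeholder values when a key is unset.
cat > .env <<EOF
OPENAI_API_KEY=${OPENAI_API_KEY:-your_openai_key_here}
JINA_API_KEY=${JINA_API_KEY:-your_jina_key_here}
EOF
```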
Due to contamination risk, the dataset is provided separately via Google Drive. If you need access, please contact [email protected].
Download the dataset and extract it into the project root so that it is placed under `data/`.
Each domain folder inside `data/` contains a `query.md`.
Generate a report that answers the query and place the report file in the same directory.
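Since every domain must pair its `query.md` with a generated report before evaluation, a quick sanity check can catch missing files. This is a sketch; the report filename `report.md` is an assumption, so adjust it to whatever your generation step actually produces:

```shell
# Sketch: confirm each domain folder under the given root contains a query
# and a generated report. "report.md" is an assumed report filename.
check_domains() {
  for dir in "$1"/*/; do
    [ -f "${dir}query.md" ]  || { echo "missing query in ${dir}";  return 1; }
    [ -f "${dir}report.md" ] || { echo "missing report in ${dir}"; return 1; }
  done
}
```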
```bash
bash run_domain_all.sh
```

| Model | Request Fulfillment | Analytic Soundness | Structural Coherence | Format & Style | Information Integrity | Information Sufficiency | Ethics | Mean |
|---|---|---|---|---|---|---|---|---|
| General LLMs (No Reasoning) | ||||||||
| Qwen3-235B | 4.51 | 5.02 | 6.09 | 7.49 | 1.24 | 4.20 | 7.19 | 5.11 |
| Gemini 2.5 Flash | 4.64 | 5.33 | 6.55 | 7.85 | 1.30 | 3.99 | 7.52 | 5.31 |
| Claude Opus 4.5 | 4.94 | 5.48 | 6.54 | 7.99 | 2.29 | 4.50 | 7.78 | 5.65 |
| GPT-5 | 4.11 | 4.75 | 5.84 | 7.21 | 1.05 | 3.13 | 7.30 | 4.77 |
| LLMs + Reasoning | ||||||||
| Qwen3-235B | 5.00 | 5.33 | 6.64 | 7.88 | 1.12 | 3.90 | 7.38 | 5.32 |
| Gemini 2.5 Pro | 4.88 | 5.81 | 6.99 | 8.09 | 2.23 | 4.40 | 7.73 | 5.73 |
| Claude Opus 4.5 | 4.96 | 5.48 | 6.68 | 8.10 | 2.27 | 4.22 | 7.73 | 5.63 |
| GPT-5 | 5.57 | 6.18 | 7.00 | 8.06 | 2.11 | 4.16 | 8.08 | 5.88 |
| LLMs + Reasoning + WebSearch | ||||||||
| Qwen3-235B | 4.05 | 4.34 | 5.68 | 6.83 | 5.22 | 5.45 | 7.06 | 5.52 |
| Claude Opus 4.5 | 4.52 | 5.13 | 5.99 | 7.41 | 7.03 | 7.62 | 7.37 | 6.44 |
| GPT-5 | 5.57 | 6.08 | 6.97 | 8.15 | 5.63 | 6.17 | 8.11 | 6.67 |
| Deep Research | ||||||||
| WebThinker | 4.11 | 4.64 | 5.51 | 7.35 | 6.21 | 6.40 | 7.13 | 5.91 |
| Qwen3-235B | 4.13 | 4.69 | 4.85 | 7.06 | 6.55 | 7.90 | 7.43 | 6.09 |
| Gemini 2.5 Pro | 4.71 | 5.37 | 6.25 | 7.59 | 6.01 | 7.61 | 7.39 | 6.42 |
| Claude Opus 4.5 | 4.53 | 5.22 | 5.69 | 7.22 | 6.04 | 5.66 | 7.57 | 5.99 |
| OpenAI | 4.67 | 5.29 | 6.28 | 7.66 | 7.14 | 6.89 | 7.48 | 6.49 |
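The Mean column is consistent with the unweighted average of the seven dimension scores. As a quick arithmetic check (a sketch using the GPT-5 row under LLMs + Reasoning + WebSearch):

```shell
# Sketch: recompute the Mean column as the unweighted average of the seven
# dimension scores for the GPT-5 (+ Reasoning + WebSearch) row.
awk 'BEGIN {
  n = split("5.57 6.08 6.97 8.15 5.63 6.17 8.11", s, " ")
  sum = 0
  for (i = 1; i <= n; i++) sum += s[i]
  printf "%.2f\n", sum / n   # prints 6.67, matching the table
}'
```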
- Code: MIT
- Data: CC BY-NC 4.0 (Non-commercial use only)
Exception:
- The `core_criteria.md` files within the `data/` directory are subject to additional restrictions.
- Use of these files for commercial purposes is not permitted.
- Redistribution, sharing, or public posting of these files is not permitted without explicit permission from the author.
