This repository contains the inference scripts for reproducing the experiments in *CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models*.
The benchmark is available on Hugging Face Datasets.
This project requires Python 3. To set up all dependencies, run:

```shell
pip3 install -r requirements.txt
```
The task of automated code refinement aims to automate the developer's role in resolving an actionable code review comment provided by a reviewer. This is a generative task: the LLM must revise a pre-review code submission according to the natural language code review comment and produce the intended post-review code revision. CodeReviewQA further decomposes this generative task into three intermediate reasoning steps (cast as MCQA problems) to provide early signals for model development.
The benchmark features 900 manually curated, high-quality examples across nine programming languages (100 examples each). Each example represents a real interaction between a human reviewer and a developer in a collaborative code review scenario. Unlike clear, instruction-style prompts, code review comments are often underspecified, ambiguous, and implicit. The benchmark therefore assesses LLMs' proficiency in understanding and following conversational instructions in human-oriented software development.
Original Problem (Text-to-Text Generation)
- Automated Code Refinement (ACR): Given a pre-review code submission and code review comment, generate the post-review code revision that is being requested.
Intermediate Reasoning Steps (Multiple Choice Question Answering)
- Change Type Recognition (CTR): Given a pre-review code submission and code review comment, infer the general code change type that is being requested.
- Change Localisation (CL): Given a pre-review code submission and code review comment, locate the precise lines of code that need to be revised.
- Solution Identification (SI): Given a pre-review code submission and code review comment, identify the exact code revision that is being requested.
(Both Change Localisation and Solution Identification have easy (E) and hard (H) difficulty variations, where the hard version represents an adversarial setup.)
- Natural Language: English
- Programming Language: C, C++, CSharp, Go, Java, JavaScript, PHP, Python, Ruby
```
.
├── ACR_vLLM.py          # Inference script for Automated Code Refinement
├── CL_vLLM.py           # Inference script for Change Localisation
├── CTR_vLLM.py          # Inference script for Change Type Recognition
├── graphics
├── LICENSE
├── README.md
├── requirements.txt     # List of dependencies required for the project
├── results              # Folder for storing all inference results
│   ├── acr
│   ├── cl
│   ├── ctr
│   └── si
├── SI_vLLM.py           # Inference script for Solution Identification
└── utils.py             # Prompt templates and evaluation functions
```
Running Inference Scripts
```shell
# Change the arguments as required:
# huggingface_token: your Hugging Face access token
# language: C, CPP, CSharp, Go, Java, JavaScript, PHP, Python, Ruby
# mode: difficulty for Change Localisation/Solution Identification, i.e., easy or hard
# model_name: Hugging Face model name, e.g., meta-llama/Llama-3.1-8B-Instruct
python3 ACR_vLLM.py <huggingface_token> <language> <model_name>
python3 CTR_vLLM.py <huggingface_token> <language> <model_name>
python3 CL_vLLM.py <huggingface_token> <language> <mode> <model_name>
python3 SI_vLLM.py <huggingface_token> <language> <mode> <model_name>
```
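To evaluate a model across all nine languages, the per-language commands can be generated programmatically. The sketch below only builds the argument lists; the token and model name are placeholders, and each list would be passed to `subprocess.run` to actually launch inference:

```python
# Build one Change Localisation command per benchmark language.
# The token and model name are placeholders; pass each list to
# subprocess.run(cmd) to actually launch inference.
LANGUAGES = ["C", "CPP", "CSharp", "Go", "Java",
             "JavaScript", "PHP", "Python", "Ruby"]

def build_commands(token: str, mode: str, model: str) -> list[list[str]]:
    return [
        ["python3", "CL_vLLM.py", token, lang, mode, model]
        for lang in LANGUAGES
    ]

cmds = build_commands("<huggingface_token>", "hard",
                      "meta-llama/Llama-3.1-8B-Instruct")
print(len(cmds))  # 9 -- one command per language
```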
```bibtex
@inproceedings{lin-etal-2025-codereviewqa,
    title = "{C}ode{R}eview{QA}: The Code Review Comprehension Assessment for Large Language Models",
    author = "Lin, Hong Yi and Liu, Chunhua and Gao, Haoyu and Thongtanunam, Patanamon and Treude, Christoph",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.476/",
    doi = "10.18653/v1/2025.findings-acl.476",
    pages = "9138--9166",
    ISBN = "979-8-89176-256-5"
}
```

