This directory contains evaluation scripts for the MMRB2 benchmark.
```
evaluate/
├── generate_judgements/        # Part 1: Generate judgements using LLM judges
│   ├── evaluate.py             # Single-process evaluation
│   ├── multi_gpu_evaluate.py   # Multi-GPU/process evaluation
│   ├── run_gpt4o.sh            # GPT-4o script
│   ├── run_gemini25flash.sh    # Gemini 2.5 Flash script
│   ├── run_qwen3.sh            # Qwen3-VL-8B script
│   └── outputs/                # Sample outputs
└── compute_scores/             # Part 2: Compute accuracy from judgement files
    └── compute_accuracy.py     # Scoring script
```
Generate pairwise judgements using any reward model. We provide example implementations for a few multimodal LLM-as-a-judge evaluators, but you can easily add your own LLM as the base model for a judge.

Note that you don't strictly need an LLM-as-a-judge: any other model works, as long as you write its results in the expected judgement format. The code here focuses on using an arbitrary LLM as the judge.
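If you bring your own (non-LLM) model, its results just need to land in the judgement format that `compute_accuracy.py` reads. The exact schema below is an assumption sketched from the fields the scoring step uses (`forward`/`reverse` judgements and the ground-truth `chosen` label); check the sample files under `generate_judgements/outputs/` for the authoritative layout.

```python
import json

# Hypothetical sketch of one entry in a judgement file. Field names beyond
# "chosen", "forward.judgement", and "reverse.judgement" are assumptions.
entry = {
    "id": "task1_pair_0001",        # assumed pair identifier
    "chosen": "A",                  # ground-truth preferred response
    "forward": {"judgement": "A"},  # judge saw responses in (A, B) order
    "reverse": {"judgement": "B"},  # judge saw responses in (B, A) order
}

print(json.dumps(entry, indent=2))
```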
| Evaluator | Type | Description |
|---|---|---|
| gpt4o-pairwise | API | GPT-4o via OpenAI |
| gemini25flash-pairwise | API | Gemini 2.5 Flash |
| qwen3vl8b-pairwise | Local | Qwen3-VL-8B (requires GPU) |
```
torch>=2.8.0
transformers>=4.57.1
openai>=1.98.0
google-generativeai>=0.3.0
google-genai>=1.28.0
json-repair
tqdm
Pillow
```
Set up API keys via environment variables:

```bash
# For OpenAI:
export OPENAI_API_KEY="your-openai-api-key"

# For Google:
export GOOGLE_API_KEY="your-google-api-key"
```

Run evaluation:
```bash
cd generate_judgements

# Run GPT-4o evaluation
./run_gpt4o.sh

# Or run Gemini 2.5 Flash
./run_gemini25flash.sh

# Or run local Qwen3 (requires 8 GPUs)
./run_qwen3.sh
```

```bash
# Run API models in 32 processes
python multi_gpu_evaluate.py \
    --evaluator_name gpt4o-pairwise \
    --task_type image \
    --pairs_path /path/to/t2i.json \
    --output_path ./outputs/task1_gpt4o-pairwise.json \
    --n_gpu 32

# Multi-process (8 parallel workers, each on one GPU)
python multi_gpu_evaluate.py \
    --evaluator_name qwen3vl8b-pairwise \
    --task_type image \
    --pairs_path /path/to/t2i.json \
    --output_path ./outputs/task1_qwen3vl8b-pairwise.json \
    --n_gpu 8
```

| Task | Type Flag | Benchmark File |
|---|---|---|
| Text-to-Image | image | t2i.json |
| Image Editing | edit | edit.json |
| Interleaved | interleaved | interleaved.json |
| Visual Reasoning | reasoning | reasoning.json |
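The same `multi_gpu_evaluate.py` invocation can be scripted across all four tasks. A minimal sketch, assuming the benchmark filenames from the table above; the `/path/to/` prefix and output naming are placeholders, not the repo's conventions:

```python
import subprocess  # only needed if you actually launch the commands

# Map each --task_type flag to its benchmark file (filenames per the table).
TASKS = {
    "image": "t2i.json",
    "edit": "edit.json",
    "interleaved": "interleaved.json",
    "reasoning": "reasoning.json",
}

def build_command(task_type, benchmark_file, evaluator="gpt4o-pairwise", n_gpu=32):
    """Assemble one multi_gpu_evaluate.py invocation as an argument list."""
    return [
        "python", "multi_gpu_evaluate.py",
        "--evaluator_name", evaluator,
        "--task_type", task_type,
        "--pairs_path", f"/path/to/{benchmark_file}",
        "--output_path", f"./outputs/{task_type}_{evaluator}.json",
        "--n_gpu", str(n_gpu),
    ]

commands = [build_command(t, f) for t, f in TASKS.items()]
# To actually run them: for cmd in commands: subprocess.run(cmd, check=True)
```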
You can add your own model as a judge by following these steps.

For an API-based judge:

- Add your API client in `evaluator_apis/model_apis/your_provider/`:

```python
# evaluator_apis/model_apis/your_provider/your_model.py
class YourModelClient:
    def __init__(self, api_key: str):
        self.api_key = api_key

    def generate(self, messages: list) -> str:
        # Call your API and return the response text
        ...
```

- Create your evaluator class in `evaluator_apis/llm_judges/api_pairwise_evaluator.py`:

```python
class YourModelPairwiseEvaluator(APIPairwiseEvaluator):
    def __init__(self, device_id: int = None):
        # Initialize your model client
        self.client = YourModelClient(api_key="...")

    @property
    def evaluator_name(self):
        return "your-model_pairwise_evaluator"

    def generate_response(self, prompt: List[List[str]]) -> str:
        # Convert prompt to your API format and get response
        ...
```

- Register your evaluator in `evaluator_apis/evaluators.py`:

```python
from .llm_judges import YourModelPairwiseEvaluator

EVALUATORS = {
    # ... existing evaluators ...
    "your-model-pairwise": {
        "class": YourModelPairwiseEvaluator,
        "type": EvaluatorTypes.PAIRWISE,
        "is_api_based": True,
        "capabilities": [
            EvaluatorCapabilities.IMAGE,
            EvaluatorCapabilities.EDIT,
            EvaluatorCapabilities.INTERLEAVED,
            EvaluatorCapabilities.REASONING,
        ],
    },
}
```

For a local model:

- Add your model in `evaluator_apis/llm_judges/local_models.py`:

```python
class LocalModelManager:
    MODEL_CONFIGS = {
        # ... existing models ...
        "your-local-model": {
            "huggingface_id": "your-org/your-model",
        },
    }
```

- Create your evaluator class in `evaluator_apis/llm_judges/local_pairwise_evaluator.py`:

```python
class YourLocalModelPairwiseEvaluator(LocalPairwiseEvaluator):
    def __init__(self, device_id: int = None):
        super().__init__(model_name="your-local-model", device_id=device_id)
```

- Register in `evaluators.py` with `"is_api_based": False`.
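Once registered, the scripts can resolve `--evaluator_name` through the `EVALUATORS` registry. A simplified sketch of that dispatch, assuming the registry shape shown above; the class and key names here are stand-ins, not the repo's actual code:

```python
class DummyPairwiseEvaluator:
    """Stand-in for a registered evaluator class (hypothetical)."""
    def __init__(self, device_id=None):
        self.device_id = device_id

EVALUATORS = {
    "dummy-pairwise": {
        "class": DummyPairwiseEvaluator,
        "is_api_based": True,  # API clients don't bind to a GPU
    },
}

def load_evaluator(name, device_id=None):
    """Look up a registered evaluator by its --evaluator_name key."""
    try:
        spec = EVALUATORS[name]
    except KeyError:
        raise ValueError(f"Unknown evaluator {name!r}; known: {sorted(EVALUATORS)}")
    # Only local models get pinned to a GPU; API clients ignore device_id.
    return spec["class"](device_id=None if spec["is_api_based"] else device_id)

judge = load_evaluator("dummy-pairwise", device_id=3)
```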
See the sample output files under `generate_judgements/outputs/`.
Compute accuracy by comparing judgements against ground truth labels.
```bash
cd compute_scores

# Evaluate a single task. Can be image/edit/interleaved/reasoning
python compute_accuracy.py --task image \
    --predictions ../generate_judgements/outputs/sample_task1_image.json

# Evaluate all 4 tasks
python compute_accuracy.py --task all \
    --predictions ../generate_judgements/outputs/sample_task1_image.json \
    ../generate_judgements/outputs/sample_task2_edit.json \
    ../generate_judgements/outputs/sample_task3_interleaved.json \
    ../generate_judgements/outputs/sample_task4_reasoning.json
```

For each pair:

- Forward: Compare `forward.judgement` with ground truth `chosen`
- Reverse: Flip `reverse.judgement` (A↔B), then compare with `chosen`
- Overall accuracy = (forward_correct + reverse_correct) / total
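The scoring rule above can be sketched as follows. This is a simplified reimplementation for illustration, not `compute_accuracy.py` itself, and it assumes the `forward`/`reverse`/`chosen` field layout described above:

```python
def flip(label):
    """Swap A and B, undoing the reversed presentation order."""
    return {"A": "B", "B": "A"}[label]

def overall_accuracy(pairs):
    """Count forward and reverse judgements that match the ground truth."""
    correct = 0
    for p in pairs:
        correct += p["forward"]["judgement"] == p["chosen"]        # forward pass
        correct += flip(p["reverse"]["judgement"]) == p["chosen"]  # reverse pass
    return correct / (2 * len(pairs))  # total = two judgements per pair

# Toy example: first pair judged consistently, second correct only forward.
pairs = [
    {"chosen": "A", "forward": {"judgement": "A"}, "reverse": {"judgement": "B"}},
    {"chosen": "B", "forward": {"judgement": "B"}, "reverse": {"judgement": "B"}},
]
print(overall_accuracy(pairs))  # → 0.75
```

Flipping the reverse judgement penalizes positional bias: a judge that always prefers the first-listed response scores only 50% overall.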
```
==================================================
SUMMARY
==================================================
Task                 Accuracy   Missing
--------------------------------------------------
task1_image          53.20%     0
task2_edit           55.50%     0
task3_interleaved    57.50%     0
task4_reasoning      47.50%     0
--------------------------------------------------
Overall              53.42%
==================================================
```