Unified image+text generation benchmark
1,000 questions · 8 tasks

UEval is a benchmark for evaluating unified models capable of generating both images and text.


Installation

git clone https://github.com/zlab-princeton/UEval.git
cd UEval
pip install -r requirements.txt

Quick Start

# Set API key
export GEMINI_API_KEY="your-api-key-here"

# Run evaluation
python ueval_eval.py \
  --model_output_path path/to/your_model_outputs.json \
  --output_path results/your_model_results.json

Generating Model Outputs

Using Gemini API

We provide generate_outputs/gemini.py for generating multimodal outputs (both text and images) using Google's Gemini API.

Prerequisites

Set up your Gemini API key:

export GEMINI_API_KEY="your-api-key-here"

Generate Outputs

Basic usage:

python generate_outputs/gemini.py \
  --output_path results/gemini_outputs.json \
  --output_image_dir results/images/ \
  --api_key YOUR_API_KEY

Advanced Options

# Generate for specific domains
python generate_outputs/gemini.py \
  --output_path results/gemini_outputs.json \
  --output_image_dir results/images/ \
  --domains art life tech

# Limit number of items for testing
python generate_outputs/gemini.py \
  --output_path results/test.json \
  --output_image_dir results/images/ \
  --limit 10

# Use specific Gemini model
python generate_outputs/gemini.py \
  --output_path results/gemini_outputs.json \
  --output_image_dir results/images/ \
  --model gemini-2.5-flash-image

Key Arguments:

  • --api_key: Gemini API key (or set GEMINI_API_KEY environment variable)
  • --output_path: Path to save output JSON file (required)
  • --output_image_dir: Directory to save generated images (required)
  • --hf_dataset: HuggingFace dataset ID (default: primerL/UEval-all)
  • --domains: Filter by specific task types (e.g., art, life, tech, exercise, space, textbook, diagram, paper)
  • --model: Gemini model name (default: gemini-2.5-flash-image)
  • --limit: Number of items to process (default: all)
  • --checkpoint_interval: Save checkpoint every N items (default: 1)
  • --retry_delay: Seconds between retry attempts (default: 3.0)
  • --max_attempts: Maximum retry attempts per prompt (default: 100)
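The retry and checkpoint flags above can be pictured with a short sketch. This is a hypothetical helper, not the script's actual code: `generate_fn` stands in for one Gemini API call, and the flag names mirror `--max_attempts`, `--retry_delay`, and `--checkpoint_interval`.

```python
import json
import time

def generate_with_retries(items, generate_fn, output_path,
                          max_attempts=100, retry_delay=3.0,
                          checkpoint_interval=1):
    """Illustrative loop: retry each prompt up to max_attempts times,
    sleeping retry_delay seconds between attempts, and checkpoint the
    accumulated results to disk every checkpoint_interval items."""
    results = []
    for i, item in enumerate(items, start=1):
        for _attempt in range(max_attempts):
            try:
                results.append(generate_fn(item))
                break
            except Exception:
                time.sleep(retry_delay)  # transient API error; back off and retry
        if i % checkpoint_interval == 0:
            with open(output_path, "w") as f:
                json.dump(results, f)  # checkpoint so a crash loses little work
    return results
```

With `checkpoint_interval=1` (the default), every completed item is persisted immediately, which is why interrupted runs can be resumed cheaply.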

Output Format

Generated outputs are saved in JSON format compatible with the evaluation script:

[
  {
    "id": 1,
    "prompt": "Your prompt here...",
    "task_type": "art",
    "question_type": "open",
    "gemini_image_ans": ["results/images/1_1.png"],
    "gemini_text_ans": "Generated text response..."
  },
  ...
]
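Before running the evaluation script, it can be worth sanity-checking a generated file against the format above. A minimal check, assuming only the field names shown in the example:

```python
import json

def check_outputs(path):
    """Verify that each item in a generated-output JSON file has the
    fields shown in the format example: an integer id, a list of image
    paths, and a text answer. Returns the item count."""
    with open(path) as f:
        items = json.load(f)
    for item in items:
        assert isinstance(item["id"], int)
        assert isinstance(item["gemini_image_ans"], list)  # image file paths
        assert isinstance(item["gemini_text_ans"], str)    # generated text
    return len(items)
```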

Using Emu3.5

We adapted Emu3.5's official implementation to work with the UEval benchmark by adding two adapter files: ueval_inference_vllm.py and vis_proto_ueval.py.

Prerequisites

  1. Follow the official Emu3.5 setup instructions to configure the environment and download model weights.

  2. Ensure you have the required dependencies installed as specified in the Emu3.5 repository.

Generate Outputs

Step 1: Run inference to generate protobuf outputs

cd generate_outputs/Emu3.5
python ueval_inference_vllm.py \
  --cfg configs/example_config_visual_guidance.py \
  --dataset-name primerL/UEval-all

This will generate protobuf (.pb) files containing the raw model outputs.

Step 2: Visualize and convert protobuf outputs to evaluation format

python src/utils/vis_proto_ueval.py \
  --proto-dir outputs/proto \
  --image-dir images \
  --output-json emu3.5_results.json

This converts the protobuf files into JSON format compatible with the UEval evaluation script.

Key Arguments for inference:

  • --cfg: Path to Emu3.5 configuration file (required)
  • --dataset-name: HuggingFace dataset ID (default: primerL/UEval-all)
  • --dataset-split: Specific split to process (e.g., art, life, etc.)
  • --tensor-parallel-size: Number of GPUs for tensor parallelism (default: 4)
  • --gpu-memory-utilization: GPU memory utilization ratio (default: 0.7)

Key Arguments for visualization:

  • --proto-dir: Directory containing .pb files (required)
  • --image-dir: Directory to save extracted images (required)
  • --output-json: Path to save output JSON file (required)
  • --relative-root: Base directory for computing relative image paths (default: .)

Output Format

The final JSON output will have the following format:

[
  {
    "id": "1",
    "emu_image": ["clip_00_00.png", ...],
    "emu_text": "Generated text response with chain-of-thought..."
  },
  ...
]

Evaluation

Quick Start

We provide ueval_eval.py for efficient evaluation using the Gemini API, with a caching strategy that reduces API costs.

Cost Estimates:

  • Without caching: ~$90 per full benchmark evaluation
  • With caching: savings depend on how many cached contents can be created
    • When judging reference answers: saves ~$25 (almost every reference answer for open-ended questions is long enough to cache)
    • For model outputs: savings vary with the number of generated images per question
  • Caching requirement: Gemini's context caching requires min_total_token_count=2048, typically reached when an answer includes 5+ images
  • Use the --no_cache flag for models that generate a single image per question, since the caching threshold may not be reached
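To decide whether `--no_cache` is worth passing, you can estimate how many of your model's answers would clear the caching threshold. A rough heuristic, assuming the `image_answer` field from the evaluation input format and the 5-image rule of thumb stated above:

```python
import json

def likely_cacheable(path, min_images=5):
    """Return the fraction of items whose answers include at least
    min_images generated images, i.e. items likely to reach Gemini's
    context-caching token minimum. Purely a heuristic estimate."""
    with open(path) as f:
        items = json.load(f)
    hits = sum(1 for it in items if len(it.get("image_answer", [])) >= min_images)
    return hits / len(items) if items else 0.0
```

If the returned fraction is near zero, caching will rarely trigger and `--no_cache` avoids paying its overhead.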

Prerequisites

Set up your Gemini API key:

export GEMINI_API_KEY="your-api-key-here"

Evaluate Your Model

Prepare your model outputs in JSON format:

[
  {
    "id": "1",
    "text_answer": "Your model's text response",
    "image_answer": ["path/to/generated/image1.jpg", "path/to/generated/image2.jpg"]
  },
  ...
]
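If your model writes its outputs under different key names (for example the Emu3.5 format shown earlier), you can either pass `--text_field`/`--image_field` to the evaluation script, or rename the fields up front. A hypothetical converter sketch:

```python
import json

def to_eval_format(src_path, dst_path,
                   text_field="emu_text", image_field="emu_image"):
    """Rename a model's own output keys to the default
    text_answer/image_answer keys expected by the evaluation script.
    The emu_* defaults follow the Emu3.5 output format shown earlier."""
    with open(src_path) as f:
        items = json.load(f)
    converted = [
        {
            "id": str(it["id"]),
            "text_answer": it[text_field],
            "image_answer": it[image_field],
        }
        for it in items
    ]
    with open(dst_path, "w") as f:
        json.dump(converted, f, indent=2)
    return len(converted)
```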

Run evaluation:

python ueval_eval.py \
  --model_output_path path/to/your_model_outputs.json \
  --output_path results/your_model_results.json

Advanced Options

# With caching (recommended for models generating multiple images per question)
python ueval_eval.py \
  --model_output_path path/to/outputs.json \
  --output_path path/to/results.json \
  --text_field text_answer \
  --image_field image_answer \
  --api_key YOUR_API_KEY \
  --limit 10

# Without caching (recommended for models generating single images)
python ueval_eval.py \
  --model_output_path path/to/outputs.json \
  --output_path results/results.json \
  --text_field text_answer \
  --image_field image_answer \
  --api_key YOUR_API_KEY \
  --limit 100 \
  --no_cache

Key Arguments:

  • --model_output_path: Path to your model's output JSON file (required)
  • --output_path: Where to save evaluation results (required)
  • --hf_dataset: HuggingFace dataset ID (default: primerL/UEval-all)
  • --text_field: Field name for text answers in your output file (default: text_answer)
  • --image_field: Field name for image paths in your output file (default: image_answer)
  • --api_key: Gemini API key (or set GEMINI_API_KEY environment variable)
  • --limit: Number of examples to evaluate (default: all)
  • --checkpoint_interval: Save checkpoint every N items (default: 1)
  • --no_cache: Disable caching for single-image outputs (optional)

Output Format

Evaluation results are saved in JSON format:

{
  "results": [
    {
      "id": "1",
      "text_rate": 0.85,
      "image_rate": 0.90,
      "text_rubrics": [...],
      "image_rubrics": [...],
      "text_evaluations": [...],
      "image_evaluations": [...]
    },
    ...
  ],
  "summary": {
    "num_items": 1000,
    "num_items_with_text": 1000,
    "num_items_with_image": 1000,
    "text_avg_rate": 0.82,
    "image_avg_rate": 0.78,
    "overall_avg_rate": 0.80
  }
}
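The results file above can be inspected directly. A small sketch, assuming only the fields shown in the format example, that prints the summary scores and returns the lowest-scoring item ids for error analysis:

```python
import json

def print_summary(results_path):
    """Print the summary averages from an evaluation results file and
    return the ids of the three items with the lowest text scores."""
    with open(results_path) as f:
        data = json.load(f)
    s = data["summary"]
    print(f"text {s['text_avg_rate']:.2f} | image {s['image_avg_rate']:.2f} "
          f"| overall {s['overall_avg_rate']:.2f} ({s['num_items']} items)")
    worst = sorted(data["results"], key=lambda r: r.get("text_rate", 1.0))[:3]
    return [r["id"] for r in worst]
```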

Leaderboard

We evaluate recent unified models on all 8 tasks in our benchmark. Overall, frontier models consistently outperform open-source ones across all tasks: GPT-5-Thinking achieves the highest average score of 66.4, while the best open-source model obtains only 49.1.

| Model | Space | Textbook | Diagram | Paper | Art | Life | Tech | Exercise | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Reference | 96.2 | 94.4 | 93.1 | 96.2 | 90.6 | 87.7 | 90.6 | 89.2 | 92.2 |
| Janus-Pro | 21.0 | 31.0 | 37.4 | 15.2 | 26.4 | 23.0 | 17.6 | 11.5 | 22.9 |
| Show-o2 | 25.4 | 33.1 | 33.2 | 17.4 | 25.6 | 15.6 | 17.4 | 13.1 | 22.6 |
| MMaDA | 10.8 | 20.0 | 14.2 | 13.3 | 15.7 | 15.8 | 12.4 | 12.6 | 14.4 |
| BAGEL | 29.8 | 42.5 | 37.2 | 20.0 | 39.0 | 33.6 | 24.8 | 21.4 | 31.0 |
| Emu3.5 | 59.1 | 57.4 | 41.1 | 31.6 | 59.3 | 62.0 | 37.0 | 45.4 | 49.1 |
| Gemini-2.0-Flash | 65.2 | 55.2 | 47.6 | 45.8 | 70.4 | 58.0 | 50.2 | 48.0 | 55.1 |
| Gemini-2.5-Flash | 78.0 | 74.0 | 66.4 | 71.6 | 66.6 | 63.0 | 58.2 | 50.0 | 66.0 |
| GPT-5-Instant | 77.3 | 77.9 | 62.3 | 55.1 | 71.2 | 69.7 | 50.7 | 57.6 | 65.2 |
| GPT-5-Thinking | 84.0 | 78.0 | 67.8 | 51.9 | 67.8 | 63.8 | 57.0 | 61.4 | 66.4 |

Citation

If you find this repository helpful, please consider citing:

@article{li2026ueval,
    title     = {UEval: A Benchmark for Unified Multimodal Generation},
    author    = {Li, Bo and Yin, Yida and Chai, Wenhao and Fu, Xingyu and Liu, Zhuang},
    journal   = {arXiv preprint arXiv:2601.22155},
    year      = {2026}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
