
# UEval

Unified image+text generation benchmark · 1,000 questions · 8 tasks

UEval is a benchmark for evaluating unified models capable of generating both images and text.

## Quick Start

```bash
git clone https://github.com/zlab-princeton/UEval.git
cd UEval
pip install -r requirements.txt

# Set API key
export GEMINI_API_KEY="your-api-key-here"

# Run evaluation
python ueval_eval.py \
  --model_output_path path/to/your_model_outputs.json \
  --output_path results/your_model_results.json
```

## Generating Outputs with Gemini

We provide `generate_outputs/gemini.py` for generating multimodal outputs (both text and images) using Google's Gemini API.

Set up your Gemini API key:

```bash
export GEMINI_API_KEY="your-api-key-here"
```

Basic usage:
```bash
python generate_outputs/gemini.py \
  --output_path results/gemini_outputs.json \
  --output_image_dir results/images/ \
  --api_key YOUR_API_KEY

# Generate for specific domains
python generate_outputs/gemini.py \
  --output_path results/gemini_outputs.json \
  --output_image_dir results/images/ \
  --domains art life tech

# Limit number of items for testing
python generate_outputs/gemini.py \
  --output_path results/test.json \
  --output_image_dir results/images/ \
  --limit 10

# Use a specific Gemini model
python generate_outputs/gemini.py \
  --output_path results/gemini_outputs.json \
  --output_image_dir results/images/ \
  --model gemini-2.5-flash-image
```

Key Arguments:
- `--api_key`: Gemini API key (or set the `GEMINI_API_KEY` environment variable)
- `--output_path`: Path to save the output JSON file (required)
- `--output_image_dir`: Directory to save generated images (required)
- `--hf_dataset`: HuggingFace dataset ID (default: `primerL/UEval-all`)
- `--domains`: Filter by specific task types (e.g., `art`, `life`, `tech`, `exercise`, `space`, `textbook`, `diagram`, `paper`)
- `--model`: Gemini model name (default: `gemini-2.5-flash-image`)
- `--limit`: Number of items to process (default: all)
- `--checkpoint_interval`: Save a checkpoint every N items (default: 1)
- `--retry_delay`: Seconds between retry attempts (default: 3.0)
- `--max_attempts`: Maximum retry attempts per prompt (default: 100)
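The retry behavior controlled by `--retry_delay` and `--max_attempts` can be sketched as follows. This is a minimal illustration of the flags' semantics, not the script's actual implementation; `call_fn` is a hypothetical stand-in for the Gemini API call.

```python
import time


def generate_with_retry(call_fn, prompt, max_attempts=100, retry_delay=3.0):
    """Retry `call_fn(prompt)` up to `max_attempts` times, sleeping
    `retry_delay` seconds between failures (mirrors the CLI flags above).

    Note: `call_fn` is a hypothetical stand-in for the real API call.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call_fn(prompt)
        except Exception as err:  # e.g. rate limits or transient API errors
            last_error = err
            if attempt < max_attempts:
                time.sleep(retry_delay)
    raise RuntimeError(f"All {max_attempts} attempts failed") from last_error
```

With the defaults (100 attempts, 3-second delay), a persistently failing prompt is abandoned after roughly five minutes of retrying.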
Generated outputs are saved in JSON format compatible with the evaluation script:

```json
[
  {
    "id": 1,
    "prompt": "Your prompt here...",
    "task_type": "art",
    "question_type": "open",
    "gemini_image_ans": ["results/images/1_1.png"],
    "gemini_text_ans": "Generated text response..."
  },
  ...
]
```

## Running Emu3.5

We adapted Emu3.5's official implementation to work with the UEval benchmark by adding two adapter files: `ueval_inference_vllm.py` and `vis_proto_ueval.py`.
- Follow the official Emu3.5 setup instructions to configure the environment and download model weights.
- Ensure you have the required dependencies installed as specified in the Emu3.5 repository.
Step 1: Run inference to generate protobuf outputs

```bash
cd generate_outputs/Emu3.5
python ueval_inference_vllm.py \
  --cfg configs/example_config_visual_guidance.py \
  --dataset-name primerL/UEval-all
```

This will generate protobuf (`.pb`) files containing the raw model outputs.

Step 2: Visualize and convert protobuf outputs to the evaluation format

```bash
python src/utils/vis_proto_ueval.py \
  --proto-dir outputs/proto \
  --image-dir images \
  --output-json emu3.5_results.json
```

This converts the protobuf files into JSON format compatible with the UEval evaluation script.
Key Arguments for inference:
- `--cfg`: Path to the Emu3.5 configuration file (required)
- `--dataset-name`: HuggingFace dataset ID (default: `primerL/UEval-all`)
- `--dataset-split`: Specific split to process (e.g., `art`, `life`, etc.)
- `--tensor-parallel-size`: Number of GPUs for tensor parallelism (default: 4)
- `--gpu-memory-utilization`: GPU memory utilization ratio (default: 0.7)

Key Arguments for visualization:
- `--proto-dir`: Directory containing `.pb` files (required)
- `--image-dir`: Directory to save extracted images (required)
- `--output-json`: Path to save the output JSON file (required)
- `--relative-root`: Base directory for computing relative image paths (default: `.`)
The final JSON output will have the following format:

```json
[
  {
    "id": "1",
    "emu_image": ["clip_00_00.png", ...],
    "emu_text": "Generated text response with chain-of-thought..."
  },
  ...
]
```

## Evaluation

We provide `ueval_eval.py` for efficient evaluation using the Gemini API, with a caching strategy to reduce costs.

Cost Estimates:
- Without caching: ~$90 per full benchmark evaluation
- With caching: cost savings depend on how many cached contents can be created
  - When judging reference answers: can save ~$25 (almost every answer to an open-ended question can create a cache)
  - For model outputs: savings vary with the number of generated images per question
- Caching requirements: Gemini's context caching requires `min_total_token_count=2048`, typically reached when evaluating answers with 5+ images
- Use the `--no_cache` flag for models generating single images, as the caching threshold may not be reached
Set up your Gemini API key:

```bash
export GEMINI_API_KEY="your-api-key-here"
```

Prepare your model outputs in JSON format:

```json
[
  {
    "id": "1",
    "text_answer": "Your model's text response",
    "image_answer": ["path/to/generated/image1.jpg", "path/to/generated/image2.jpg"]
  },
  ...
]
```

Run evaluation:

```bash
python ueval_eval.py \
  --model_output_path path/to/your_model_outputs.json \
  --output_path results/your_model_results.json
```
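Before launching a full (and paid) evaluation run, it can help to sanity-check that the output file matches the expected schema. The sketch below is our own hypothetical helper, not part of the repository; the field names default to `text_answer` and `image_answer` as in the format above and can be changed to match `--text_field`/`--image_field`.

```python
import json


def check_outputs(path, text_field="text_answer", image_field="image_answer"):
    """Return a list of schema problems found in a model-output JSON file.

    Hypothetical pre-flight helper: checks each item for an 'id', a string
    text answer, and a list of image paths, matching the format above.
    """
    with open(path) as f:
        items = json.load(f)
    problems = []
    for i, item in enumerate(items):
        if "id" not in item:
            problems.append(f"item {i}: missing 'id'")
        if not isinstance(item.get(text_field, ""), str):
            problems.append(f"item {i}: {text_field} is not a string")
        if not isinstance(item.get(image_field, []), list):
            problems.append(f"item {i}: {image_field} is not a list of paths")
    return problems
```

An empty return value means every item has the expected shape; anything else is worth fixing before spending API credits.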
```bash
# With caching (recommended for models generating multiple images per question)
python ueval_eval.py \
  --model_output_path path/to/outputs.json \
  --output_path path/to/results.json \
  --text_field text_answer \
  --image_field image_answer \
  --api_key YOUR_API_KEY \
  --limit 10

# Without caching (recommended for models generating single images)
python ueval_eval.py \
  --model_output_path path/to/outputs.json \
  --output_path results/results.json \
  --text_field text_answer \
  --image_field image_answer \
  --api_key YOUR_API_KEY \
  --limit 100 \
  --no_cache
```

Key Arguments:
- `--model_output_path`: Path to your model's output JSON file (required)
- `--output_path`: Where to save evaluation results (required)
- `--hf_dataset`: HuggingFace dataset ID (default: `primerL/UEval-all`)
- `--text_field`: Field name for text answers in your output file (default: `text_answer`)
- `--image_field`: Field name for image paths in your output file (default: `image_answer`)
- `--api_key`: Gemini API key (or set the `GEMINI_API_KEY` environment variable)
- `--limit`: Number of examples to evaluate (default: all)
- `--checkpoint_interval`: Save a checkpoint every N items (default: 1)
- `--no_cache`: Disable caching for single-image outputs (optional)
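The summary averages in the results file can be recomputed from the per-item rates as a sanity check. A minimal sketch, assuming `text_rate`/`image_rate` fields as in the output format below, and assuming `overall_avg_rate` is the mean of the text and image averages (consistent with the example values 0.82, 0.78, and 0.80); `summarize` is our own hypothetical helper, not the script's code.

```python
def summarize(results):
    """Recompute summary averages from per-item evaluation results.

    Assumes each result dict may carry 'text_rate' and 'image_rate' in [0, 1].
    """
    text_rates = [r["text_rate"] for r in results if "text_rate" in r]
    image_rates = [r["image_rate"] for r in results if "image_rate" in r]
    text_avg = sum(text_rates) / len(text_rates) if text_rates else 0.0
    image_avg = sum(image_rates) / len(image_rates) if image_rates else 0.0
    return {
        "num_items": len(results),
        "text_avg_rate": text_avg,
        "image_avg_rate": image_avg,
        # Assumption: overall is the mean of the text and image averages.
        "overall_avg_rate": (text_avg + image_avg) / 2,
    }
```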
Evaluation results are saved in JSON format:

```json
{
  "results": [
    {
      "id": "1",
      "text_rate": 0.85,
      "image_rate": 0.90,
      "text_rubrics": [...],
      "image_rubrics": [...],
      "text_evaluations": [...],
      "image_evaluations": [...]
    },
    ...
  ],
  "summary": {
    "num_items": 1000,
    "num_items_with_text": 1000,
    "num_items_with_image": 1000,
    "text_avg_rate": 0.82,
    "image_avg_rate": 0.78,
    "overall_avg_rate": 0.80
  }
}
```

## Results

We evaluate recent unified models on all 8 tasks in our benchmark. Overall, frontier models consistently outperform open-source ones across all tasks: GPT-5-Thinking achieves the highest average score of 66.4, while the best open-source model obtains only 49.1.
| Model | Space | Textbook | Diagram | Paper | Art | Life | Tech | Exercise | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Reference | 96.2 | 94.4 | 93.1 | 96.2 | 90.6 | 87.7 | 90.6 | 89.2 | 92.2 |
| Janus-Pro | 21.0 | 31.0 | 37.4 | 15.2 | 26.4 | 23.0 | 17.6 | 11.5 | 22.9 |
| Show-o2 | 25.4 | 33.1 | 33.2 | 17.4 | 25.6 | 15.6 | 17.4 | 13.1 | 22.6 |
| MMaDA | 10.8 | 20.0 | 14.2 | 13.3 | 15.7 | 15.8 | 12.4 | 12.6 | 14.4 |
| BAGEL | 29.8 | 42.5 | 37.2 | 20.0 | 39.0 | 33.6 | 24.8 | 21.4 | 31.0 |
| Emu3.5 | 59.1 | 57.4 | 41.1 | 31.6 | 59.3 | 62.0 | 37.0 | 45.4 | 49.1 |
| Gemini-2.0-Flash | 65.2 | 55.2 | 47.6 | 45.8 | 70.4 | 58.0 | 50.2 | 48.0 | 55.1 |
| Gemini-2.5-Flash | 78.0 | 74.0 | 66.4 | 71.6 | 66.6 | 63.0 | 58.2 | 50.0 | 66.0 |
| GPT-5-Instant | 77.3 | 77.9 | 62.3 | 55.1 | 71.2 | 69.7 | 50.7 | 57.6 | 65.2 |
| GPT-5-Thinking | 84.0 | 78.0 | 67.8 | 51.9 | 67.8 | 63.8 | 57.0 | 61.4 | 66.4 |
## Citation

If you find this repository helpful, please consider citing:

```bibtex
@article{li2026ueval,
  title   = {UEval: A Benchmark for Unified Multimodal Generation},
  author  = {Li, Bo and Yin, Yida and Chai, Wenhao and Fu, Xingyu and Liu, Zhuang},
  journal = {arXiv preprint arXiv:2601.22155},
  year    = {2026}
}
```

## License

This project is licensed under the MIT License. See the LICENSE file for details.
