
# UEval

Unified image+text generation benchmark · 1,000 questions · 8 tasks

UEval is a benchmark for evaluating unified models capable of generating both images and text.

## Quick Start

```bash
git clone https://github.com/zlab-princeton/UEval.git
cd UEval
pip install -r requirements.txt

# Set API key
export GEMINI_API_KEY="your-api-key-here"

# Run evaluation
python ueval_eval.py \
  --model_output_path path/to/your_model_outputs.json \
  --output_path results/your_model_results.json
```

## Generating Outputs with Gemini

We provide `generate_outputs/gemini.py` for generating multimodal outputs (both text and images) using Google's Gemini API.

Set up your Gemini API key:

```bash
export GEMINI_API_KEY="your-api-key-here"
```

Basic usage:
```bash
python generate_outputs/gemini.py \
  --output_path results/gemini_outputs.json \
  --output_image_dir results/images/ \
  --api_key YOUR_API_KEY

# Generate for specific domains
python generate_outputs/gemini.py \
  --output_path results/gemini_outputs.json \
  --output_image_dir results/images/ \
  --domains art life tech

# Limit number of items for testing
python generate_outputs/gemini.py \
  --output_path results/test.json \
  --output_image_dir results/images/ \
  --limit 10

# Use a specific Gemini model
python generate_outputs/gemini.py \
  --output_path results/gemini_outputs.json \
  --output_image_dir results/images/ \
  --model gemini-2.5-flash-image
```

Key Arguments:
- `--api_key`: Gemini API key (or set the `GEMINI_API_KEY` environment variable)
- `--output_path`: Path to save the output JSON file (required)
- `--output_image_dir`: Directory to save generated images (required)
- `--hf_dataset`: HuggingFace dataset ID (default: `primerL/UEval-all`)
- `--domains`: Filter by specific task types (e.g., `art`, `life`, `tech`, `exercise`, `space`, `textbook`, `diagram`, `paper`)
- `--model`: Gemini model name (default: `gemini-2.5-flash-image`)
- `--limit`: Number of items to process (default: all)
- `--checkpoint_interval`: Save a checkpoint every N items (default: 1)
- `--retry_delay`: Seconds between retry attempts (default: 3.0)
- `--max_attempts`: Maximum retry attempts per prompt (default: 100)
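The retry behavior controlled by `--retry_delay` and `--max_attempts` can be sketched as follows. This is a minimal illustration of the flags' semantics, not the script's actual implementation; `call_fn` is a hypothetical stand-in for the Gemini API call.

```python
import time


def generate_with_retry(call_fn, prompt, max_attempts=100, retry_delay=3.0):
    """Retry `call_fn(prompt)` up to `max_attempts` times, sleeping
    `retry_delay` seconds between failures (mirrors the CLI flags above).

    Note: `call_fn` is a hypothetical stand-in for the real API call.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call_fn(prompt)
        except Exception as err:  # e.g. rate limits or transient API errors
            last_error = err
            if attempt < max_attempts:
                time.sleep(retry_delay)
    raise RuntimeError(f"All {max_attempts} attempts failed") from last_error
```

With the defaults (100 attempts, 3-second delay), a persistently failing prompt is abandoned after roughly five minutes of retrying.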
Generated outputs are saved in JSON format compatible with the evaluation script:

```json
[
  {
    "id": 1,
    "prompt": "Your prompt here...",
    "task_type": "art",
    "question_type": "open",
    "gemini_image_ans": ["results/images/1_1.png"],
    "gemini_text_ans": "Generated text response..."
  },
  ...
]
```

## Running Emu3.5

We adapted Emu3.5's official implementation to work with the UEval benchmark by adding two adapter files: `ueval_inference_vllm.py` and `vis_proto_ueval.py`.
- Follow the official Emu3.5 setup instructions to configure the environment and download model weights.
- Ensure you have the required dependencies installed as specified in the Emu3.5 repository.
Step 1: Run inference to generate protobuf outputs

```bash
cd generate_outputs/Emu3.5
python ueval_inference_vllm.py \
  --cfg configs/example_config_visual_guidance.py \
  --dataset-name primerL/UEval-all
```

This will generate protobuf (`.pb`) files containing the raw model outputs.

Step 2: Visualize and convert protobuf outputs to the evaluation format

```bash
python src/utils/vis_proto_ueval.py \
  --proto-dir outputs/proto \
  --image-dir images \
  --output-json emu3.5_results.json
```

This converts the protobuf files into JSON format compatible with the UEval evaluation script.
Key Arguments for inference:
- `--cfg`: Path to the Emu3.5 configuration file (required)
- `--dataset-name`: HuggingFace dataset ID (default: `primerL/UEval-all`)
- `--dataset-split`: Specific split to process (e.g., `art`, `life`, etc.)
- `--tensor-parallel-size`: Number of GPUs for tensor parallelism (default: 4)
- `--gpu-memory-utilization`: GPU memory utilization ratio (default: 0.7)

Key Arguments for visualization:
- `--proto-dir`: Directory containing `.pb` files (required)
- `--image-dir`: Directory to save extracted images (required)
- `--output-json`: Path to save the output JSON file (required)
- `--relative-root`: Base directory for computing relative image paths (default: `.`)
The final JSON output will have the following format:

```json
[
  {
    "id": "1",
    "emu_image": ["clip_00_00.png", ...],
    "emu_text": "Generated text response with chain-of-thought..."
  },
  ...
]
```

## Evaluation

We provide `ueval_eval.py` for efficient evaluation using the Gemini API, with a caching strategy to reduce costs.

Cost Estimates:
- Without caching: ~$90 per full benchmark evaluation
- With caching: cost savings depend on how many cached contents can be created
  - When judging reference answers: can save ~$25 (almost every answer to an open-ended question can create a cache)
  - For model outputs: savings vary with the number of generated images per question
- Caching requirements: Gemini's context caching requires `min_total_token_count=2048`, typically reached when evaluating answers with 5+ images
- Use the `--no_cache` flag for models generating single images, as the caching threshold may not be reached
Set up your Gemini API key:

```bash
export GEMINI_API_KEY="your-api-key-here"
```

Prepare your model outputs in JSON format:

```json
[
  {
    "id": "1",
    "text_answer": "Your model's text response",
    "image_answer": ["path/to/generated/image1.jpg", "path/to/generated/image2.jpg"]
  },
  ...
]
```

Run evaluation:

```bash
python ueval_eval.py \
  --model_output_path path/to/your_model_outputs.json \
  --output_path results/your_model_results.json
```
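Before launching a full (and paid) evaluation run, it can help to sanity-check that the output file matches the expected schema. The sketch below is our own hypothetical helper, not part of the repository; the field names default to `text_answer` and `image_answer` as in the format above and can be changed to match `--text_field`/`--image_field`.

```python
import json


def check_outputs(path, text_field="text_answer", image_field="image_answer"):
    """Return a list of schema problems found in a model-output JSON file.

    Hypothetical pre-flight helper: checks each item for an 'id', a string
    text answer, and a list of image paths, matching the format above.
    """
    with open(path) as f:
        items = json.load(f)
    problems = []
    for i, item in enumerate(items):
        if "id" not in item:
            problems.append(f"item {i}: missing 'id'")
        if not isinstance(item.get(text_field, ""), str):
            problems.append(f"item {i}: {text_field} is not a string")
        if not isinstance(item.get(image_field, []), list):
            problems.append(f"item {i}: {image_field} is not a list of paths")
    return problems
```

An empty return value means every item has the expected shape; anything else is worth fixing before spending API credits.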
```bash
# With caching (recommended for models generating multiple images per question)
python ueval_eval.py \
  --model_output_path path/to/outputs.json \
  --output_path path/to/results.json \
  --text_field text_answer \
  --image_field image_answer \
  --api_key YOUR_API_KEY \
  --limit 10

# Without caching (recommended for models generating single images)
python ueval_eval.py \
  --model_output_path path/to/outputs.json \
  --output_path results/results.json \
  --text_field text_answer \
  --image_field image_answer \
  --api_key YOUR_API_KEY \
  --limit 100 \
  --no_cache
```

Key Arguments:
- `--model_output_path`: Path to your model's output JSON file (required)
- `--output_path`: Where to save evaluation results (required)
- `--hf_dataset`: HuggingFace dataset ID (default: `primerL/UEval-all`)
- `--text_field`: Field name for text answers in your output file (default: `text_answer`)
- `--image_field`: Field name for image paths in your output file (default: `image_answer`)
- `--api_key`: Gemini API key (or set the `GEMINI_API_KEY` environment variable)
- `--limit`: Number of examples to evaluate (default: all)
- `--checkpoint_interval`: Save a checkpoint every N items (default: 1)
- `--no_cache`: Disable caching for single-image outputs (optional)
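The summary averages in the results file can be recomputed from the per-item rates as a sanity check. A minimal sketch, assuming `text_rate`/`image_rate` fields as in the output format below, and assuming `overall_avg_rate` is the mean of the text and image averages (consistent with the example values 0.82, 0.78, and 0.80); `summarize` is our own hypothetical helper, not the script's code.

```python
def summarize(results):
    """Recompute summary averages from per-item evaluation results.

    Assumes each result dict may carry 'text_rate' and 'image_rate' in [0, 1].
    """
    text_rates = [r["text_rate"] for r in results if "text_rate" in r]
    image_rates = [r["image_rate"] for r in results if "image_rate" in r]
    text_avg = sum(text_rates) / len(text_rates) if text_rates else 0.0
    image_avg = sum(image_rates) / len(image_rates) if image_rates else 0.0
    return {
        "num_items": len(results),
        "text_avg_rate": text_avg,
        "image_avg_rate": image_avg,
        # Assumption: overall is the mean of the text and image averages.
        "overall_avg_rate": (text_avg + image_avg) / 2,
    }
```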
Evaluation results are saved in JSON format:

```json
{
  "results": [
    {
      "id": "1",
      "text_rate": 0.85,
      "image_rate": 0.90,
      "text_rubrics": [...],
      "image_rubrics": [...],
      "text_evaluations": [...],
      "image_evaluations": [...]
    },
    ...
  ],
  "summary": {
    "num_items": 1000,
    "num_items_with_text": 1000,
    "num_items_with_image": 1000,
    "text_avg_rate": 0.82,
    "image_avg_rate": 0.78,
    "overall_avg_rate": 0.80
  }
}
```

## Results

We evaluate recent unified models on all 8 tasks in our benchmark. Overall, frontier models consistently outperform open-source ones across all tasks: GPT-5-Thinking achieves the highest average score of 66.4, while the best open-source model obtains only 49.1.
| Model | Space | Textbook | Diagram | Paper | Art | Life | Tech | Exercise | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Reference | 96.2 | 94.4 | 93.1 | 96.2 | 90.6 | 87.7 | 90.6 | 89.2 | 92.2 |
| Janus-Pro | 21.0 | 31.0 | 37.4 | 15.2 | 26.4 | 23.0 | 17.6 | 11.5 | 22.9 |
| Show-o2 | 25.4 | 33.1 | 33.2 | 17.4 | 25.6 | 15.6 | 17.4 | 13.1 | 22.6 |
| MMaDA | 10.8 | 20.0 | 14.2 | 13.3 | 15.7 | 15.8 | 12.4 | 12.6 | 14.4 |
| BAGEL | 29.8 | 42.5 | 37.2 | 20.0 | 39.0 | 33.6 | 24.8 | 21.4 | 31.0 |
| Emu3.5 | 59.1 | 57.4 | 41.1 | 31.6 | 59.3 | 62.0 | 37.0 | 45.4 | 49.1 |
| Gemini-2.0-Flash | 65.2 | 55.2 | 47.6 | 45.8 | 70.4 | 58.0 | 50.2 | 48.0 | 55.1 |
| Gemini-2.5-Flash | 78.0 | 74.0 | 66.4 | 71.6 | 66.6 | 63.0 | 58.2 | 50.0 | 66.0 |
| GPT-5-Instant | 77.3 | 77.9 | 62.3 | 55.1 | 71.2 | 69.7 | 50.7 | 57.6 | 65.2 |
| GPT-5-Thinking | 84.0 | 78.0 | 67.8 | 51.9 | 67.8 | 63.8 | 57.0 | 61.4 | 66.4 |
## Citation

If you find this repository helpful, please consider citing:

```bibtex
@article{li2026ueval,
  title   = {UEval: A Benchmark for Unified Multimodal Generation},
  author  = {Li, Bo and Yin, Yida and Chai, Wenhao and Fu, Xingyu and Liu, Zhuang},
  journal = {arXiv preprint arXiv:2601.22155},
  year    = {2026}
}
```

## License

This project is licensed under the MIT License. See the LICENSE file for details.
