📄 Paper | 🤗 Models | 📊 VC-RewardBench
🌈 Official repository for Visual-ERM, a multimodal generative reward model for vision-to-code tasks.
Visual-ERM evaluates outputs directly in the rendered visual space and provides fine-grained, interpretable, and task-agnostic feedback for structured visual reconstruction, including chart-to-code, table-to-markdown, and SVG-to-code.
- 🚀 [2026/03/15] Release of 📊 VisualCritic-RewardBench.
- 🚀 [2026/03/14] Release of pretrained Visual-ERM 🤗 checkpoints.
- 🚀 [2026/03/13] Initial release of Visual-ERM codebase.
- 🔥 Visual-space reward modeling. Instead of relying on text-only rules or coarse visual embedding similarity, Visual-ERM evaluates predictions in the rendered visual space.
- 🔥 Fine-grained and interpretable feedback. Visual-ERM predicts structured discrepancies with fields such as category, severity, location, and description, making reward signals actionable for both training and refinement.
- 🔥 Task-agnostic reward supervision. A single reward model generalizes across multiple vision-to-code tasks, including charts, tables, and SVGs.
- 🔥 Effective for both RL and test-time scaling. Visual-ERM can be used as a reward model in RL and as a visual critic for reflection-and-revision at inference time.
- 🔥 A new benchmark for visual discrepancy judgment. We introduce VC-RewardBench, a benchmark for fine-grained image-to-image discrepancy evaluation on structured visual data.
Vision-to-code tasks require models to reconstruct structured visual inputs into executable or structured representations with high visual fidelity. However, existing reward designs have major limitations: (1) Text-based rewards (e.g., edit distance, TEDS) ignore important visual cues such as layout, alignment, spacing, and style. (2) Vision embedding rewards (e.g., DINO similarity) are often coarse-grained, semantically biased, and vulnerable to reward hacking.
To address this, we propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that compares the ground-truth image and the rendered image from a model prediction, and then outputs fine-grained discrepancy annotations that can be converted into reward signals or used for reflection-based refinement.
Visual-ERM consists of three major components:
- Reward data generation: we construct image pairs by (1) editing ground-truth structured outputs to inject controlled errors, and (2) sampling natural errors from weaker model predictions.
- Fine-grained discrepancy annotation: each image pair is annotated with structured visual discrepancies, including category, severity, location, and description.
- Integration into RL and test-time scaling: Visual-ERM can be used (1) as a reward model for GRPO-based RL, and (2) as a visual critic for iterative reflection and revision during inference.
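How the structured annotations become a scalar reward is a design choice; as a minimal sketch, assuming each annotation carries a `severity` field, the annotation list could be mapped to a reward in [0, 1] as follows. The field names and severity weights here are illustrative assumptions, not the paper's actual scheme:

```python
# Illustrative sketch only: converts a list of structured discrepancy
# annotations (category / severity / location / description, as described
# above) into a scalar reward in [0, 1]. Weights are assumptions.

# Hypothetical penalty per severity level.
SEVERITY_PENALTY = {"minor": 0.05, "moderate": 0.15, "severe": 0.40}

def discrepancies_to_reward(discrepancies):
    """Map Visual-ERM-style discrepancy annotations to a reward in [0, 1].

    Each annotation is assumed to be a dict with at least a 'severity'
    field; 'category', 'location', and 'description' are ignored here.
    """
    penalty = sum(SEVERITY_PENALTY.get(d.get("severity", "moderate"), 0.15)
                  for d in discrepancies)
    return max(0.0, 1.0 - penalty)

if __name__ == "__main__":
    anns = [
        {"category": "layout", "severity": "minor",
         "location": "x-axis", "description": "tick labels rotated"},
        {"category": "style", "severity": "severe",
         "location": "legend", "description": "legend missing"},
    ]
    print(discrepancies_to_reward(anns))  # ~0.55
```

An identical-looking pair yields reward 1.0, and rewards are clamped at 0.0 so many severe discrepancies cannot go negative.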
Reinforcement Learning. Visual-ERM consistently improves vision-to-code performance across multiple tasks by providing feedback signals as a reward model during RL.
- Chart-to-Code: improves Qwen3-VL-8B-Instruct by +8.4 on average.
- Table-to-Markdown: yields +2.7 average improvement.
- SVG-to-Code: yields +4.1 average improvement.
Visual Critic Benchmark: On VC-RewardBench, Visual-ERM substantially improves over the base model on fine-grained discrepancy judgment and outperforms Qwen3-VL-235B-Instruct as an open-source judge.
We introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for evaluating fine-grained image-to-image discrepancy judgment on structured visual data.
- Covers charts, tables, and SVGs
- Contains 1,335 carefully curated instances
- Each instance includes:
- a ground-truth image
- a corrupted / rendered counterpart
- fine-grained discrepancy annotations
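The exact schema is defined by the released data; purely to illustrate the instance structure listed above, a benchmark record might look like the following (all field names are assumptions, not the actual release format):

```json
{
  "image_gt": "images/chart_0001_gt.png",
  "image_pred": "images/chart_0001_pred.png",
  "discrepancies": [
    {
      "category": "color",
      "severity": "minor",
      "location": "series 2 line",
      "description": "line drawn in orange instead of red"
    }
  ]
}
```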
Use Visual-ERM to compare a reference image and a rendered prediction. First, download the model weights from 🤗 Hugging Face.
Visual-ERM is fine-tuned from Qwen3-VL-8B-Instruct, so the usage is fully compatible with Qwen3-VL-8B-Instruct. Please refer to the supplementary materials of our paper for prompt templates.
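One convenient way to query the model is through a vLLM OpenAI-compatible endpoint (set up as in the RL section below). The following is a sketch of building a two-image comparison request; the prompt text is a placeholder, since the actual prompt templates are in the paper's supplementary materials:

```python
# Sketch: preparing a Visual-ERM request for vLLM's OpenAI-compatible
# chat API. Endpoint, model name, and prompt text are placeholders.
import base64

def image_to_data_url(path):
    """Encode a local PNG as a base64 data URL for the chat API."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:image/png;base64,{b64}"

def build_messages(ref_url, pred_url, prompt):
    """Build a two-image chat message: reference first, prediction second."""
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": ref_url}},
            {"type": "image_url", "image_url": {"url": pred_url}},
            {"type": "text", "text": prompt},
        ],
    }]
```

With a vLLM service running, the request would then be sent with the `openai` Python client, e.g. `client.chat.completions.create(model=SERVE_NAME, messages=build_messages(...))`, and the response parsed into the structured discrepancy fields.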
We use veRL as the RL training framework. Before starting RL training, launch a reward model (Visual-ERM) service via vLLM:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve "$MODEL_PATH" \
    --tensor-parallel-size 4 \
    --served-model-name $SERVE_NAME \
    --port $PORT \
    --max-num-seqs $N_PROC
```

You may need to launch multiple vLLM services to speed up reward computation. The vLLM service endpoint can be configured in the reward function, as shown in `rl_scripts/reward_func/table_parse_rm_v2.py`.
Then start RL training with the following script:
```shell
bash ./rl_scripts/run_qwen3vl_table_40k_rm_32gpus.sh
```

RL training data can be freely organized following the veRL data format. Taking the table-to-markdown task as an example, the training data contains only table images (no parsed markdown is needed). During training, the reward function renders the markdown parsed by the policy model back into an image, then feeds both the rendered image and the ground-truth image into Visual-ERM to produce a reward score (see the framework figure above). An example reward function is provided at `rl_scripts/reward_func/table_parse_rm_v2.py`.
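The released reward function lives in `rl_scripts/reward_func/table_parse_rm_v2.py`; purely as a structural sketch of the render-then-judge flow described above, it can be outlined as follows, with rendering and judging injected as callables. The `compute_score(data_source, solution_str, ground_truth, extra_info)` signature is veRL's common custom-reward convention and may differ from the repo's actual code:

```python
# Structural sketch of a Visual-ERM reward function for table-to-markdown RL.
# Rendering and judging are injected so only the control flow is shown; the
# real implementation is rl_scripts/reward_func/table_parse_rm_v2.py.

def make_compute_score(render_markdown, judge_images):
    """Build a veRL-style reward function.

    render_markdown(markdown_str) -> rendered table image
    judge_images(gt_image, rendered_image) -> float reward (via Visual-ERM)
    Returning 0.0 on render failure is a common guard, assumed here.
    """
    def compute_score(data_source, solution_str, ground_truth, extra_info=None):
        try:
            rendered = render_markdown(solution_str)  # markdown -> image
        except Exception:
            return 0.0  # unrenderable policy output gets zero reward
        # ground_truth is assumed to carry the ground-truth table image
        return float(judge_images(ground_truth, rendered))
    return compute_score
```

The resulting `compute_score` is then registered as the custom reward function in the veRL training config, with `judge_images` wrapping a call to the Visual-ERM vLLM service.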
First, download the VC-RewardBench data from 📊 Hugging Face. Then use `./evaluation/api_judge.py` to generate model responses (this serves as an example of API-based inference). After inference, the output JSONL will contain two additional fields: `pred_json` and `error_parse_failed_reason`. Finally, run `./evaluation/evaluation.py` to produce a JSON file with the evaluation results.
Note: You need to specify the API key and model type in the Python files, and update the corresponding benchmark JSONL file path accordingly.
Visual-ERM is designed for settings where visual equivalence matters more than textual similarity.
It is particularly useful when:
- semantic similarity is not sufficient,
- reward hacking is common under proxy rewards,
- test-time self-correction requires interpretable visual feedback.
If you find this work useful, please consider citing:
```bibtex
TBD
```

If you are interested in visual reward modeling, vision-to-code, or reinforcement learning for multimodal models, feel free to reach out to us.



