llm-jp-eval-mm is a lightweight framework for evaluating visual-language models across various benchmark tasks, mainly focusing on Japanese tasks.
🎉[2026.01.16] Our paper introducing llm-jp-eval-mm, "Cross-Task Evaluation and Empirical Analysis of Japanese Visual Language Models", was accepted for publication in the Journal of Natural Language Processing (Japan)!
You can install llm-jp-eval-mm from GitHub or via PyPI.
- Option 1: Clone from GitHub (Recommended)
git clone [email protected]:llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
uv sync
- Option 2: Install via PyPI
pip install eval_mm
To use LLM-as-a-Judge, configure your OpenAI API keys in a `.env` file:
- For Azure: set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`
- For OpenAI: set `OPENAI_API_KEY`
If you are not using LLM-as-a-Judge, you can set these variables to any placeholder value to bypass the missing-key error.
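For reference, a minimal `.env` sketch (the endpoint and key values below are placeholders, not real credentials):

```shell
# Azure OpenAI (placeholders -- replace with your deployment's values)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your-azure-key
# OpenAI API (placeholder)
OPENAI_API_KEY=sk-your-key
```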
To evaluate a model on a task, use the eval-mm CLI:
uv sync --group normal
uv run --group normal eval-mm run \
--model_id llava-hf/llava-1.5-7b-hf \
--task_id japanese-heron-bench \
--result_dir result \
--metrics heron-bench \
--judge_model gpt-4o-2024-11-20 \
  --overwrite
You can also use `python -m eval_mm run ...` or the legacy `python examples/sample.py ...` wrapper.
To evaluate using vLLM for faster batch inference:
uv sync --group vllm_normal
uv run --group vllm_normal eval-mm run \
--backend vllm \
--model_id Qwen/Qwen2.5-VL-3B-Instruct \
--task_id japanese-heron-bench \
  --metrics heron-bench
To score existing predictions without running inference:
eval-mm evaluate --model_id llava-hf/llava-1.5-7b-hf --task_id japanese-heron-bench --metrics heron-bench
To list available tasks and metrics:
eval-mm list tasks
eval-mm list metrics
The evaluation results will be saved in the result directory:
result
├── japanese-heron-bench
│ ├── llava-hf
│ │ ├── llava-1.5-7b-hf
│ │ │ ├── evaluation.jsonl
│ │ │ └── prediction.jsonl
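Both files are JSON Lines (one JSON object per line). A minimal sketch of reading such a file; the field names in the mock records below are illustrative assumptions, not the actual schema:

```python
import json
from pathlib import Path

# Write a tiny mock prediction file; a real one lives under
# result/<task_id>/<org>/<model>/prediction.jsonl.
path = Path("prediction.jsonl")
records = [
    {"question_id": 0, "text": "宮崎駿"},
    {"question_id": 1, "text": "富士山"},
]
path.write_text(
    "\n".join(json.dumps(r, ensure_ascii=False) for r in records),
    encoding="utf-8",
)

# Read it back: parse each line as a separate JSON object.
loaded = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
print(len(loaded))  # 2
```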
To evaluate multiple models on multiple tasks, please check eval_all.sh.
You can integrate llm-jp-eval-mm into your own code. Here's an example:
from PIL import Image
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig
class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駿"
task = TaskRegistry.load_task("japanese-heron-bench")
example = task.dataset[0]
input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)
model = MockVLM()
prediction = model.generate(images, input_text)
scorer = ScorerRegistry.load_scorer(
    "rougel",
    ScorerConfig(docs=task.dataset)
)
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
A Next.js web frontend provides three interactive views for browsing evaluation results. It connects to a lightweight FastAPI backend that serves data from your `result/` directory.
# Terminal 1: Start the API server
uv pip install fastapi uvicorn
uvicorn eval_mm.api:app --reload
# Terminal 2: Start the web frontend
cd web
pnpm install
pnpm dev
Then open http://localhost:3000.
| Page | URL | Description |
|---|---|---|
| Runner Dashboard | `/runner` | GPU monitoring, task configuration, progress tracking (Sentry-inspired dark UI) |
| Leaderboard | `/leaderboard` | Sortable model comparison table with scores across all tasks (Stripe-inspired design) |
| Prediction Browser | `/browser` | Browse individual predictions, compare models side-by-side, navigate samples (Claude-inspired editorial UI) |
The FastAPI backend (`uvicorn eval_mm.api:app`) exposes:

| Endpoint | Description |
|---|---|
| `GET /api/tasks` | List available evaluation tasks |
| `GET /api/models` | List leaderboard models |
| `GET /api/results` | Discover all task/model result pairs |
| `GET /api/predictions/{task_id}/{model_id}` | Paginated predictions (supports `offset`, `limit`) |
| `GET /api/scores/{task_id}` | Aggregate scores for all models on a task |
Set `EVAL_MM_RESULT_DIR` to point to your result directory (default: `result/`).
Note: The web frontend works without the API server — pages fall back to mock data for static builds and development.
cd web && pnpm build
This generates static pages that can be deployed to GitHub Pages or any static host.
To generate a leaderboard from your evaluation results via CLI:
python scripts/make_leaderboard.py --result_dir result
This will create a `leaderboard.md` file with your model performance:
| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
|---|---|---|---|
| llm-jp/llm-jp-3-vila-14b | 68.03 | 4.08 | 52.4 |
| Qwen/Qwen2.5-VL-7B-Instruct | 70.29 | 4.28 | 29.63 |
| google/gemma-3-27b-it | 69.15 | 4.36 | 30.89 |
| microsoft/Phi-4-multimodal-instruct | 45.52 | 3.2 | 26.8 |
| gpt-4o-2024-11-20 | 93.7 | 4.44 | 32.2 |
The official leaderboard is available here
Japanese Tasks:
- Japanese Heron Bench
- JA-VG-VQA500
- JA-VLM-Bench-In-the-Wild
- JA-Multi-Image-VQA
- JDocQA
- JMMMU
- JIC-VQA
- MECHA-ja
- CC-OCR (multi_lan_ocr split, ja subset)
- CVQA (ja subset)
English Tasks:
We use uv’s dependency groups to manage each model’s dependencies.
For example, to use llm-jp/llm-jp-3-vila-14b, run:
uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py
See eval_all.sh for the complete list of model dependencies.
When adding a new group, remember to declare its conflicts with other groups (uv's `tool.uv.conflicts` setting in `pyproject.toml`) so that mutually incompatible model dependencies are never resolved together.
Start the web dashboard (see Web Dashboard above) and open the Prediction Browser at http://localhost:3000/browser.
uv run streamlit run scripts/browse_prediction.py -- --task_id japanese-heron-bench --result_dir result --model_list llava-hf/llava-1.5-7b-hf
To add a new task, implement the Task class in src/eval_mm/tasks/task.py.
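As a rough illustration of the task interface exercised in the library example above, here is a toy sketch; the real `Task` base class in `src/eval_mm/tasks/task.py` defines the actual contract and likely requires more than this:

```python
# Hypothetical toy task mirroring the doc_to_* accessors used in the
# library example; NOT the real Task base class.
class ToyTask:
    def __init__(self):
        # One mock document; real tasks load a benchmark dataset here.
        self.dataset = [{"input_text": "この映画の監督は誰ですか?", "answer": "宮崎駿"}]

    def doc_to_text(self, doc) -> str:
        return doc["input_text"]

    def doc_to_visual(self, doc) -> list:
        return []  # a real task returns list[Image.Image]

    def doc_to_answer(self, doc) -> str:
        return doc["answer"]

task = ToyTask()
print(task.doc_to_answer(task.dataset[0]))  # 宮崎駿
```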
To add a new metric, implement the Scorer class in src/eval_mm/metrics/scorer.py.
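For intuition, a toy scorer with the same `score`/`aggregate` shape as the `rougel` scorer used in the library example; this is hypothetical, and the real `Scorer` base class in `src/eval_mm/metrics/scorer.py` defines the actual interface (including `ScorerConfig` handling and the `AggregateOutput` return type):

```python
# Hypothetical exact-match scorer; mirrors the scorer.score()/scorer.aggregate()
# calls shown in the library example, but is NOT the real Scorer base class.
class ExactMatchScorer:
    def score(self, refs: list[str], preds: list[str]) -> list[float]:
        # 1.0 for an exact string match, else 0.0, per example.
        return [1.0 if r == p else 0.0 for r, p in zip(refs, preds)]

    def aggregate(self, scores: list[float]) -> float:
        # Mean over per-example scores.
        return sum(scores) / len(scores) if scores else 0.0

scorer = ExactMatchScorer()
print(scorer.aggregate(scorer.score(["宮崎駿", "富士山"], ["宮崎駿", "東京"])))  # 0.5
```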
To add a new model, create a VLM adapter class that inherits from eval_mm.BaseVLM and place it in examples/.
Register it in examples/model_table.py so the CLI can discover it.
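A minimal adapter sketch, assuming the only method a model must expose is `generate(images, text)` as in the `MockVLM` example above; check `eval_mm.BaseVLM` for the real contract, since `EchoVLM` here is purely hypothetical:

```python
# Hypothetical adapter; a real one would inherit from eval_mm.BaseVLM,
# load model weights in __init__, and run inference in generate().
class EchoVLM:
    def generate(self, images: list, text: str) -> str:
        # Stand-in for model inference: report what was received.
        return f"saw {len(images)} image(s) for: {text}"

model = EchoVLM()
print(model.generate([object()], "こんにちは"))  # saw 1 image(s) for: こんにちは
```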
Note: `examples/` contains model adapter implementations and usage examples. The evaluation engine, CLI, base classes, and scoring live in `src/eval_mm/` (the published package). Model adapters in `examples/` are not part of the PyPI distribution; they are reference implementations for users to copy, adapt, or extend.
Install a new dependency using the following command:
uv add <package_name>
uv add --group <group_name> <package_name>
Portable offline tests (no GPU, no network):
bash test.sh
Model smoke tests (requires GPU + model weights):
bash test_model.sh # all models
bash test_model.sh "Qwen" # filter by name
CUDA_VISIBLE_DEVICES=0,1 bash test_model.sh # custom GPU selection
Ensure code consistency with:
uv run ruff format src
uv run ruff check --fix src
To release a new version:
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
The web dashboard is built with Next.js in the web/ directory. For development:
cd web && pnpm dev
The legacy GitHub Pages leaderboard is in github_pages/. To update its data:
python scripts/make_leaderboard.py --update_pages
This project runs on mdx (a GPU computing platform for academic research in Japan). The devcontainer is pre-configured for this environment.
| Mount | Filesystem | Capacity | Purpose |
|---|---|---|---|
| `/workspace` | NFS (`/home/$USER/workspace/evalmm`) | ~6T (shared) | Source code, results, datasets |
| `/model` | Lustre (`/model`) | ~4P | Model weights, HF/uv/vLLM caches |
Large caches (HuggingFace models, uv packages, vLLM compilations) are stored on /model to avoid filling the NFS home directory. This is configured in .devcontainer/docker-compose.yml:
volumes:
  - /model:/model:cached
environment:
  HF_HOME: /model/<username>/cache/huggingface
  UV_CACHE_DIR: /model/<username>/cache/uv
  VLLM_CACHE_DIR: /model/<username>/cache/vllm
Before rebuilding the devcontainer, create the cache directories on the host:
mkdir -p /model/$USER/cache/{huggingface,uv,vllm}
If you are not using mdx, remove or comment out the /model volume mount and the cache environment variables in .devcontainer/docker-compose.yml. Caches will fall back to their default locations (~/.cache/).
- Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
- lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.
We also thank the developers of the evaluation datasets for their hard work.
@article{maeda-etal-2026-evalmm,
title = {日本語視覚言語モデルのタスク横断評価と実証的分析},
author = {前田 航希 and
杉浦 一瑳 and
小田 悠介 and
栗田 修平 and
岡崎 直観},
journal={自然言語処理},
volume={33},
number={2},
pages={TBD},
year={2026},
note = {local.ja},
month = {January}
}
