llm-jp-eval-mm is a lightweight framework for evaluating visual-language models across various benchmark tasks, mainly focusing on Japanese tasks.
🎉[2026.01.16] Our paper introducing llm-jp-eval-mm, "Cross-Task Evaluation and Empirical Analysis of Japanese Visual Language Models", was accepted for publication in the Journal of Natural Language Processing (Japan)!
You can install llm-jp-eval-mm from GitHub or via PyPI.
- Option 1: Clone from GitHub (Recommended)
git clone [email protected]:llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
uv sync
- Option 2: Install via PyPI
pip install eval_mm
To use LLM-as-a-Judge, configure your OpenAI API keys in a `.env` file:
- For Azure: set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`
- For OpenAI: set `OPENAI_API_KEY`
If you are not using LLM-as-a-Judge, you can set these variables to any placeholder value to bypass the missing-key error.
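For reference, a minimal `.env` sketch (the endpoint and key values below are placeholders, not real credentials):

```shell
# Azure OpenAI (placeholders -- replace with your deployment's values)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your-azure-key
# OpenAI API (placeholder)
OPENAI_API_KEY=sk-your-key
```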
To evaluate a model on a task, use the eval-mm CLI:
uv sync --group normal
uv run --group normal eval-mm run \
--model_id llava-hf/llava-1.5-7b-hf \
--task_id japanese-heron-bench \
--result_dir result \
--metrics heron-bench \
--judge_model gpt-4o-2024-11-20 \
  --overwrite
You can also use `python -m eval_mm run ...` or the legacy `python examples/sample.py ...` wrapper.
To evaluate using vLLM for faster batch inference:
uv sync --group vllm_normal
uv run --group vllm_normal eval-mm run \
--backend vllm \
--model_id Qwen/Qwen2.5-VL-3B-Instruct \
--task_id japanese-heron-bench \
  --metrics heron-bench
To score existing predictions without running inference:
eval-mm evaluate --model_id llava-hf/llava-1.5-7b-hf --task_id japanese-heron-bench --metrics heron-bench
To list available tasks and metrics:
eval-mm list tasks
eval-mm list metrics
The evaluation results will be saved in the result directory:
result
├── japanese-heron-bench
│ ├── llava-hf
│ │ ├── llava-1.5-7b-hf
│ │ │ ├── evaluation.jsonl
│ │ │ └── prediction.jsonl
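Both files are JSON Lines (one JSON object per line). A minimal sketch of reading such a file; the field names in the mock records below are illustrative assumptions, not the actual schema:

```python
import json
from pathlib import Path

# Write a tiny mock prediction file; a real one lives under
# result/<task_id>/<org>/<model>/prediction.jsonl.
path = Path("prediction.jsonl")
records = [
    {"question_id": 0, "text": "宮崎駿"},
    {"question_id": 1, "text": "富士山"},
]
path.write_text(
    "\n".join(json.dumps(r, ensure_ascii=False) for r in records),
    encoding="utf-8",
)

# Read it back: parse each line as a separate JSON object.
loaded = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
print(len(loaded))  # 2
```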
To evaluate multiple models on multiple tasks, please check eval_all.sh.
You can integrate llm-jp-eval-mm into your own code. Here's an example:
from PIL import Image
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig
class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駿"
task = TaskRegistry.load_task("japanese-heron-bench")
example = task.dataset[0]
input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)
model = MockVLM()
prediction = model.generate(images, input_text)
scorer = ScorerRegistry.load_scorer(
    "rougel",
    ScorerConfig(docs=task.dataset)
)
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
A Next.js web frontend provides three interactive views for browsing evaluation results. It connects to a lightweight FastAPI backend that serves data from your `result/` directory.
# Terminal 1: Start the API server
uv pip install fastapi uvicorn
uvicorn eval_mm.api:app --reload
# Terminal 2: Start the web frontend
cd web
pnpm install
pnpm dev
Then open http://localhost:3000.
| Page | URL | Description |
|---|---|---|
| Runner Dashboard | `/runner` | GPU monitoring, task configuration, progress tracking (Sentry-inspired dark UI) |
| Leaderboard | `/leaderboard` | Sortable model comparison table with scores across all tasks (Stripe-inspired design) |
| Prediction Browser | `/browser` | Browse individual predictions, compare models side-by-side, navigate samples (Claude-inspired editorial UI) |
The FastAPI backend (`uvicorn eval_mm.api:app`) exposes:

| Endpoint | Description |
|---|---|
| `GET /api/tasks` | List available evaluation tasks |
| `GET /api/models` | List leaderboard models |
| `GET /api/results` | Discover all task/model result pairs |
| `GET /api/predictions/{task_id}/{model_id}` | Paginated predictions (supports `offset`, `limit`) |
| `GET /api/scores/{task_id}` | Aggregate scores for all models on a task |
Set `EVAL_MM_RESULT_DIR` to point to your result directory (default: `result/`).
Note: The web frontend works without the API server — pages fall back to mock data for static builds and development.
cd web && pnpm build
This generates static pages that can be deployed to GitHub Pages or any static host.
To generate a leaderboard from your evaluation results via CLI:
python scripts/make_leaderboard.py --result_dir result
This will create a `leaderboard.md` file with your model performance:
| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
|---|---|---|---|
| llm-jp/llm-jp-3-vila-14b | 68.03 | 4.08 | 52.4 |
| Qwen/Qwen2.5-VL-7B-Instruct | 70.29 | 4.28 | 29.63 |
| google/gemma-3-27b-it | 69.15 | 4.36 | 30.89 |
| microsoft/Phi-4-multimodal-instruct | 45.52 | 3.2 | 26.8 |
| gpt-4o-2024-11-20 | 93.7 | 4.44 | 32.2 |
The official leaderboard is available here
Japanese Tasks:
- Japanese Heron Bench
- JA-VG-VQA500
- JA-VLM-Bench-In-the-Wild
- JA-Multi-Image-VQA
- JDocQA
- JMMMU
- JIC-VQA
- MECHA-ja
- CC-OCR (multi_lan_ocr split, ja subset)
- CVQA (ja subset)
English Tasks:
We use uv’s dependency groups to manage each model’s dependencies.
For example, to use llm-jp/llm-jp-3-vila-14b, run:
uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py
See eval_all.sh for the complete list of model dependencies.
When adding a new group, remember to declare its conflicts with other groups (uv's `tool.uv.conflicts` setting in `pyproject.toml`) so that mutually incompatible model dependencies are never resolved together.
Start the web dashboard (see Web Dashboard above) and open the Prediction Browser at http://localhost:3000/browser.
uv run streamlit run scripts/browse_prediction.py -- --task_id japanese-heron-bench --result_dir result --model_list llava-hf/llava-1.5-7b-hf
To add a new task, implement the Task class in src/eval_mm/tasks/task.py.
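As a rough illustration of the task interface exercised in the library example above, here is a toy sketch; the real `Task` base class in `src/eval_mm/tasks/task.py` defines the actual contract and likely requires more than this:

```python
# Hypothetical toy task mirroring the doc_to_* accessors used in the
# library example; NOT the real Task base class.
class ToyTask:
    def __init__(self):
        # One mock document; real tasks load a benchmark dataset here.
        self.dataset = [{"input_text": "この映画の監督は誰ですか?", "answer": "宮崎駿"}]

    def doc_to_text(self, doc) -> str:
        return doc["input_text"]

    def doc_to_visual(self, doc) -> list:
        return []  # a real task returns list[Image.Image]

    def doc_to_answer(self, doc) -> str:
        return doc["answer"]

task = ToyTask()
print(task.doc_to_answer(task.dataset[0]))  # 宮崎駿
```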
To add a new metric, implement the Scorer class in src/eval_mm/metrics/scorer.py.
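For intuition, a toy scorer with the same `score`/`aggregate` shape as the `rougel` scorer used in the library example; this is hypothetical, and the real `Scorer` base class in `src/eval_mm/metrics/scorer.py` defines the actual interface (including `ScorerConfig` handling and the `AggregateOutput` return type):

```python
# Hypothetical exact-match scorer; mirrors the scorer.score()/scorer.aggregate()
# calls shown in the library example, but is NOT the real Scorer base class.
class ExactMatchScorer:
    def score(self, refs: list[str], preds: list[str]) -> list[float]:
        # 1.0 for an exact string match, else 0.0, per example.
        return [1.0 if r == p else 0.0 for r, p in zip(refs, preds)]

    def aggregate(self, scores: list[float]) -> float:
        # Mean over per-example scores.
        return sum(scores) / len(scores) if scores else 0.0

scorer = ExactMatchScorer()
print(scorer.aggregate(scorer.score(["宮崎駿", "富士山"], ["宮崎駿", "東京"])))  # 0.5
```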
To add a new model, create a VLM adapter class that inherits from eval_mm.BaseVLM and place it in examples/.
Register it in examples/model_table.py so the CLI can discover it.
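A minimal adapter sketch, assuming the only method a model must expose is `generate(images, text)` as in the `MockVLM` example above; check `eval_mm.BaseVLM` for the real contract, since `EchoVLM` here is purely hypothetical:

```python
# Hypothetical adapter; a real one would inherit from eval_mm.BaseVLM,
# load model weights in __init__, and run inference in generate().
class EchoVLM:
    def generate(self, images: list, text: str) -> str:
        # Stand-in for model inference: report what was received.
        return f"saw {len(images)} image(s) for: {text}"

model = EchoVLM()
print(model.generate([object()], "こんにちは"))  # saw 1 image(s) for: こんにちは
```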
Note: `examples/` contains model adapter implementations and usage examples. The evaluation engine, CLI, base classes, and scoring live in `src/eval_mm/` (the published package). Model adapters in `examples/` are not part of the PyPI distribution; they are reference implementations for users to copy, adapt, or extend.
Install a new dependency using the following command:
uv add <package_name>
uv add --group <group_name> <package_name>
Portable offline tests (no GPU, no network):
bash test.sh
Model smoke tests (requires GPU + model weights):
bash test_model.sh # all models
bash test_model.sh "Qwen" # filter by name
CUDA_VISIBLE_DEVICES=0,1 bash test_model.sh # custom GPU selection
Ensure code consistency with:
uv run ruff format src
uv run ruff check --fix src
To release a new version:
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
The web dashboard is built with Next.js in the web/ directory. For development:
cd web && pnpm dev
The legacy GitHub Pages leaderboard is in github_pages/. To update its data:
python scripts/make_leaderboard.py --update_pages
This project runs on mdx (a GPU computing platform for academic research in Japan). The devcontainer is pre-configured for this environment.
| Mount | Filesystem | Capacity | Purpose |
|---|---|---|---|
| `/workspace` | NFS (`/home/$USER/workspace/evalmm`) | ~6T (shared) | Source code, results, datasets |
| `/model` | Lustre (`/model`) | ~4P | Model weights, HF/uv/vLLM caches |
Large caches (HuggingFace models, uv packages, vLLM compilations) are stored on /model to avoid filling the NFS home directory. This is configured in .devcontainer/docker-compose.yml:
volumes:
  - /model:/model:cached
environment:
  HF_HOME: /model/<username>/cache/huggingface
  UV_CACHE_DIR: /model/<username>/cache/uv
  VLLM_CACHE_DIR: /model/<username>/cache/vllm
Before rebuilding the devcontainer, create the cache directories on the host:
mkdir -p /model/$USER/cache/{huggingface,uv,vllm}
If you are not using mdx, remove or comment out the /model volume mount and the cache environment variables in .devcontainer/docker-compose.yml. Caches will fall back to their default locations (~/.cache/).
- Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
- lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.
We also thank the developers of the evaluation datasets for their hard work.
@article{maeda-etal-2026-evalmm,
title = {日本語視覚言語モデルのタスク横断評価と実証的分析},
author = {前田 航希 and
杉浦 一瑳 and
小田 悠介 and
栗田 修平 and
岡崎 直観},
journal={自然言語処理},
volume={33},
number={2},
pages={TBD},
year={2026},
note = {local.ja},
month = {January}
}
