
llm-jp-eval-mm


llm-jp-eval-mm is a lightweight framework for evaluating visual-language models across various benchmark tasks, mainly focusing on Japanese tasks.

(Figure: Overview of llm-jp-eval-mm)

Updates

🎉[2026.01.16] Our paper introducing llm-jp-eval-mm, "Cross-Task Evaluation and Empirical Analysis of Japanese Visual Language Models", was accepted for publication in the Journal of Natural Language Processing (Japan)!

Getting Started

You can install llm-jp-eval-mm from GitHub or via PyPI.

  • Option 1: Clone from GitHub (Recommended)
git clone [email protected]:llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
uv sync
  • Option 2: Install via PyPI
pip install eval_mm

To use LLM-as-a-Judge, configure your OpenAI API keys in a .env file:

  • For Azure: Set AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_KEY
  • For OpenAI: Set OPENAI_API_KEY

If you are not using LLM-as-a-Judge, you can set these variables to any placeholder value in the .env file to avoid a missing-key error.
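For example, a minimal .env might look like this (all values below are placeholders, not real credentials):

```shell
# .env -- placeholder values, replace with your own credentials
OPENAI_API_KEY=sk-your-key-here
# For Azure deployments instead:
# AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
# AZURE_OPENAI_KEY=your-azure-key
```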

Usage

To evaluate a model on a task, use the eval-mm CLI:

uv sync --group normal
uv run --group normal eval-mm run \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench  \
  --result_dir result  \
  --metrics heron-bench \
  --judge_model gpt-4o-2024-11-20 \
  --overwrite

You can also use python -m eval_mm run ... or the legacy python examples/sample.py ... wrapper.

To evaluate using vLLM for faster batch inference:

uv sync --group vllm_normal
uv run --group vllm_normal eval-mm run \
  --backend vllm \
  --model_id Qwen/Qwen2.5-VL-3B-Instruct \
  --task_id japanese-heron-bench \
  --metrics heron-bench

To score existing predictions without running inference:

eval-mm evaluate --model_id llava-hf/llava-1.5-7b-hf --task_id japanese-heron-bench --metrics heron-bench

To list available tasks and metrics:

eval-mm list tasks
eval-mm list metrics

The evaluation results will be saved in the result directory:

result
├── japanese-heron-bench
│   ├── llava-hf
│   │   ├── llava-1.5-7b-hf
│   │   │   ├── evaluation.jsonl
│   │   │   └── prediction.jsonl
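Both files are in JSON Lines format, so they are easy to post-process. A minimal sketch of reading such a file follows; the field names here are illustrative, not the exact schema written by eval-mm:

```python
import json

# Illustrative prediction.jsonl content -- the actual field names
# written by eval-mm may differ.
sample_lines = [
    '{"question_id": 0, "text": "宮崎駿"}',
    '{"question_id": 1, "text": "富士山"}',
]

# One JSON object per non-empty line.
predictions = [json.loads(line) for line in sample_lines if line.strip()]
for p in predictions:
    print(p["question_id"], p["text"])
```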

To evaluate multiple models on multiple tasks, please check eval_all.sh.

Hello World Example

You can integrate llm-jp-eval-mm into your own code. Here's an example:

from PIL import Image
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig

class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駿"

task = TaskRegistry.load_task("japanese-heron-bench")
example = task.dataset[0]

input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)

model = MockVLM()
prediction = model.generate(images, input_text)

scorer = ScorerRegistry.load_scorer(
    "rougel",
    ScorerConfig(docs=task.dataset)
)
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
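For intuition, ROUGE-L is based on the longest common subsequence (LCS) between prediction and reference. The sketch below is a self-contained illustration of the idea, not the exact scorer used by eval_mm (which has its own tokenization and score scaling):

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # Classic dynamic-programming LCS over token sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, prediction: str) -> float:
    # Whitespace tokenization for illustration; Japanese text would
    # need a proper tokenizer in practice.
    ref, pred = reference.split(), prediction.split()
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))  # ≈ 0.833
```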

Web Dashboard

A Next.js web frontend provides three interactive views for browsing evaluation results. It connects to a lightweight FastAPI backend that serves data from your result/ directory.

Quick Start

# Terminal 1: Start the API server
uv pip install fastapi uvicorn
uvicorn eval_mm.api:app --reload

# Terminal 2: Start the web frontend
cd web
pnpm install
pnpm dev

Then open http://localhost:3000.

Pages

| Page | URL | Description |
| --- | --- | --- |
| Runner Dashboard | /runner | GPU monitoring, task configuration, progress tracking (Sentry-inspired dark UI) |
| Leaderboard | /leaderboard | Sortable model comparison table with scores across all tasks (Stripe-inspired design) |
| Prediction Browser | /browser | Browse individual predictions, compare models side-by-side, navigate samples (Claude-inspired editorial UI) |

API Endpoints

The FastAPI backend (uvicorn eval_mm.api:app) exposes:

| Endpoint | Description |
| --- | --- |
| GET /api/tasks | List available evaluation tasks |
| GET /api/models | List leaderboard models |
| GET /api/results | Discover all task/model result pairs |
| GET /api/predictions/{task_id}/{model_id} | Paginated predictions (supports offset, limit) |
| GET /api/scores/{task_id} | Aggregate scores for all models on a task |
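For example, the pagination parameters can be combined into a request URL like this (the host, port, and percent-encoding of slashes in model ids are assumptions based on the default uvicorn setup, not documented API behavior):

```python
from urllib.parse import quote, urlencode

def predictions_url(base: str, task_id: str, model_id: str,
                    offset: int = 0, limit: int = 20) -> str:
    # Model ids such as "llava-hf/llava-1.5-7b-hf" contain a slash,
    # so the path segment is percent-encoded (an assumption about routing).
    path = f"/api/predictions/{quote(task_id, safe='')}/{quote(model_id, safe='')}"
    return f"{base}{path}?{urlencode({'offset': offset, 'limit': limit})}"

print(predictions_url("http://localhost:8000", "japanese-heron-bench",
                      "llava-hf/llava-1.5-7b-hf"))
```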

Set EVAL_MM_RESULT_DIR to point to your result directory (default: result/).

Note: The web frontend works without the API server — pages fall back to mock data for static builds and development.

Static Build

cd web && pnpm build

This generates static pages that can be deployed to GitHub Pages or any static host.

Leaderboard (CLI)

To generate a leaderboard from your evaluation results via CLI:

python scripts/make_leaderboard.py --result_dir result

This will create a leaderboard.md file with your model performance:

| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
| --- | --- | --- | --- |
| llm-jp/llm-jp-3-vila-14b | 68.03 | 4.08 | 52.4 |
| Qwen/Qwen2.5-VL-7B-Instruct | 70.29 | 4.28 | 29.63 |
| google/gemma-3-27b-it | 69.15 | 4.36 | 30.89 |
| microsoft/Phi-4-multimodal-instruct | 45.52 | 3.2 | 26.8 |
| gpt-4o-2024-11-20 | 93.7 | 4.44 | 32.2 |

The official leaderboard is available here.

Supported Tasks

Japanese Tasks:

English Tasks:

Managing Dependencies

We use uv’s dependency groups to manage each model’s dependencies.

For example, to use llm-jp/llm-jp-3-vila-14b, run:

uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py

See eval_all.sh for the complete list of model dependencies.

When adding a new group, remember to declare any conflicting groups in the conflicts configuration in pyproject.toml so that mutually incompatible groups are not resolved together.

Browse Predictions

Web UI (Recommended)

Start the web dashboard (see Web Dashboard above) and open the Prediction Browser at http://localhost:3000/browser.

Streamlit (Legacy)

uv run streamlit run scripts/browse_prediction.py -- --task_id japanese-heron-bench --result_dir result --model_list llava-hf/llava-1.5-7b-hf


Development

Adding a new task

To add a new task, implement the Task class in src/eval_mm/tasks/task.py.

Adding a new metric

To add a new metric, implement the Scorer class in src/eval_mm/metrics/scorer.py.

Adding a new model

To add a new model, create a VLM adapter class that inherits from eval_mm.BaseVLM and place it in examples/. Register it in examples/model_table.py so the CLI can discover it.

Note: examples/ contains model adapter implementations and usage examples. The evaluation engine, CLI, base classes, and scoring live in src/eval_mm/ (the published package). Model adapters in examples/ are not part of the PyPI distribution; they are reference implementations for users to copy, adapt, or extend.
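As a rough skeleton, an adapter might look like the following. The generate signature mirrors the MockVLM example above; in the real code the class would inherit from eval_mm.BaseVLM, which is omitted here so the snippet stays self-contained, and the model id is a hypothetical placeholder:

```python
class EchoVLM:  # would inherit from eval_mm.BaseVLM in examples/
    def __init__(self, model_id: str = "my-org/my-vlm"):
        # A real adapter would load the model and processor here.
        self.model_id = model_id

    def generate(self, images: list, text: str) -> str:
        # A real adapter would run inference on (images, text).
        return f"[{self.model_id}] saw {len(images)} image(s)"

print(EchoVLM().generate([], "describe this image"))
```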

Adding a new dependency

Install a new dependency using the following command:

uv add <package_name>
uv add --group <group_name> <package_name>

Testing

Portable offline tests (no GPU, no network):

bash test.sh

Model smoke tests (requires GPU + model weights):

bash test_model.sh                  # all models
bash test_model.sh "Qwen"          # filter by name
CUDA_VISIBLE_DEVICES=0,1 bash test_model.sh  # custom GPU selection

Formatting and Linting

Ensure code consistency with:

uv run ruff format src
uv run ruff check --fix src

Releasing to PyPI

To release a new version:

git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags

Updating the Website

The web dashboard is built with Next.js in the web/ directory. For development:

cd web && pnpm dev

The legacy GitHub Pages leaderboard is in github_pages/. To update its data:

python scripts/make_leaderboard.py --update_pages

Environment-Specific Configuration (mdx)

This project runs on mdx (a GPU computing platform for academic research in Japan). The devcontainer is pre-configured for this environment.

Storage Layout on mdx

| Mount | Filesystem | Capacity | Purpose |
| --- | --- | --- | --- |
| /workspace | NFS (/home/$USER/workspace/evalmm) | ~6 TB (shared) | Source code, results, datasets |
| /model | Lustre (/model) | ~4 PB | Model weights, HF/uv/vLLM caches |

Cache Configuration

Large caches (HuggingFace models, uv packages, vLLM compilations) are stored on /model to avoid filling the NFS home directory. This is configured in .devcontainer/docker-compose.yml:

volumes:
  - /model:/model:cached
environment:
  HF_HOME: /model/<username>/cache/huggingface
  UV_CACHE_DIR: /model/<username>/cache/uv
  VLLM_CACHE_DIR: /model/<username>/cache/vllm

Before rebuilding the devcontainer, create the cache directories on the host:

mkdir -p /model/$USER/cache/{huggingface,uv,vllm}

Running on Other Environments

If you are not using mdx, remove or comment out the /model volume mount and the cache environment variables in .devcontainer/docker-compose.yml. Caches will fall back to their default locations (~/.cache/).

Acknowledgements

  • Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
  • lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.

We also thank the developers of the evaluation datasets for their hard work.

Citation

@article{maeda-etal-2026-evalmm,
  title = {日本語視覚言語モデルのタスク横断評価と実証的分析},
  author = {前田 航希 and 
    杉浦 一瑳 and 
    小田 悠介 and 
    栗田 修平 and 
    岡崎 直観},
  journal={自然言語処理},
  volume={33},
  number={2},
  pages={TBD},
  year={2026},
  note = {local.ja},
  month = {January}
}
