This repo implements the metric computation from the paper "DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs".
Implemented metrics:
- Similarity: QSim, ASim, ISim (expected cosine similarity)
- Perplexity: PPL(source), PPL(target)
- Diversity: Silhouette + normalized entropy
- Final score: M(s->t) = QSim * ASim * ISim * PPL(s) * (Sil + H) / PPL(t)
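As a rough illustration of how the terms in the final score combine, here is a minimal NumPy sketch. The function names are hypothetical and are not this repository's API. It assumes "expected cosine similarity" means the mean cosine similarity over all source/target embedding pairs, and that the entropy is the Shannon entropy of cluster labels normalized by the log of the number of observed clusters. The paper's exact definitions may differ.

```python
import numpy as np


def expected_cosine_similarity(source, target):
    """Mean cosine similarity over all (source row, target row) pairs.

    Assumption: this is what "expected cosine similarity" refers to.
    """
    s = source / np.linalg.norm(source, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    return float((s @ t.T).mean())


def normalized_entropy(labels):
    """Shannon entropy of the label distribution divided by log(k).

    Assumption: k is the number of distinct labels actually observed.
    """
    _, counts = np.unique(labels, return_counts=True)
    if len(counts) < 2:
        return 0.0  # a single cluster carries no diversity
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(counts)))


def transfer_score(qsim, asim, isim, ppl_source, ppl_target, sil, h):
    """M(s->t) = QSim * ASim * ISim * PPL(s) * (Sil + H) / PPL(t)."""
    return qsim * asim * isim * ppl_source * (sil + h) / ppl_target


# Toy usage with made-up numbers.
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[1.0, 0.0]])
qsim = expected_cosine_similarity(src, tgt)
h = normalized_entropy([0, 0, 1, 1])
score = transfer_score(qsim, 0.5, 0.5, ppl_source=4.0, ppl_target=2.0, sil=0.2, h=h)
```

The silhouette term is passed in as a plain number here; it could come from any clustering of the embeddings, and the clustering setup is not specified in this section.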
Also included:
- Generate embeddings directly from raw JSONL datasets (image, question, answer) via OpenRouter API.
- Supports separate embeddings (question, answer, image) and joint embeddings (qa, image+qa).
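To make the separate and joint embedding modes concrete, the sketch below parses JSONL records and selects what would be embedded in each mode. It does not call the OpenRouter API. The helper names and the convention of joining question and answer with a newline for the joint modes are assumptions for illustration, not the repository's actual behavior.

```python
import json


def load_records(lines):
    """Parse JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def build_inputs(record, mode):
    """Return the fields to embed for one record in the given mode.

    Modes mirror the ones listed above: question, answer, image, qa, image+qa.
    """
    question = record["question"]
    answer = record["answer"]
    image = record["image"]
    qa_text = f"{question}\n{answer}"  # assumed joining convention
    table = {
        "question": {"text": question},
        "answer": {"text": answer},
        "image": {"image": image},
        "qa": {"text": qa_text},
        "image+qa": {"image": image, "text": qa_text},
    }
    if mode not in table:
        raise ValueError(f"unknown embedding mode: {mode}")
    return table[mode]


# Toy usage with a single made-up record.
raw = ['{"image": "cat.png", "question": "What animal?", "answer": "A cat."}', ""]
records = load_records(raw)
joint = build_inputs(records[0], "image+qa")
```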
The paper's training experiments are built on:
- LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory
- VERL: https://github.com/volcengine/verl
This repository currently focuses on metric computation and embedding generation.
pip install -e .
or
pip install -r requirements.txt
Supported input: .json, .jsonl, .npz.
Each dataset file should provide:
- question_embeddings (2D)
- answer_embeddings (2D)
- image_embeddings (2D)
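A quick sanity check on a dataset file before computing metrics might look like the sketch below. The `validate_embeddings` helper is hypothetical and not part of this repository. It only checks the three required fields and their rank, and works on any dict-like object such as parsed JSON.

```python
import numpy as np

REQUIRED_KEYS = ("question_embeddings", "answer_embeddings", "image_embeddings")


def validate_embeddings(data):
    """Return {key: 2D float array}; raise ValueError on a missing key or wrong rank."""
    out = {}
    for key in REQUIRED_KEYS:
        if key not in data:
            raise ValueError(f"missing required field: {key}")
        arr = np.asarray(data[key], dtype=float)
        if arr.ndim != 2:
            raise ValueError(f"{key} must be 2D, got shape {arr.shape}")
        out[key] = arr
    return out


# Toy usage: two samples with 3-dimensional embeddings per field.
sample = {
    "question_embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "answer_embeddings": [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
    "image_embeddings": [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]],
}
checked = validate_embeddings(sample)
```

For a .npz file, the arrays could first be collected into a dict with `{k: npz[k] for k in npz.files}` after `np.load`, then passed to the same check.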