DataProphet26/dataprophet

DataProphet Metric Toolkit

This repository implements the metrics from the paper "DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs".

Implemented metrics:

  • Similarity: QSim, ASim, ISim (expected cosine similarity)
  • Perplexity: PPL(source), PPL(target)
  • Diversity: Silhouette + normalized entropy
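  • Final score, for a source dataset s and a target dataset t, where Sil is the Silhouette score and H is the normalized entropy:
M(s->t) = QSim * ASim * ISim * PPL(s) * (Sil + H) / PPL(t)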
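
As a rough illustration of how the terms above combine, here is a minimal sketch in plain Python. It is not the toolkit's own code: the function names (expected_cosine, normalized_entropy, final_score) are hypothetical, "expected cosine similarity" is assumed to mean the average cosine similarity over all source/target embedding pairs, and the entropy is assumed to be taken over cluster label frequencies and divided by log(k). Only the final formula is taken directly from this README.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expected_cosine(source, target):
    """Assumed definition: mean cosine similarity over all (source, target) pairs."""
    total = sum(cosine(s, t) for s in source for t in target)
    return total / (len(source) * len(target))


def normalized_entropy(labels):
    """Assumed definition: Shannon entropy of cluster label frequencies over log(k)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    k = len(counts)
    if k < 2:
        return 0.0  # a single cluster carries no diversity
    n = len(labels)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


def final_score(qsim, asim, isim, ppl_s, ppl_t, sil, h):
    """M(s->t) = QSim * ASim * ISim * PPL(s) * (Sil + H) / PPL(t), as stated above."""
    return qsim * asim * isim * ppl_s * (sil + h) / ppl_t


# Toy numbers, chosen only to make the arithmetic easy to follow.
score = final_score(qsim=0.5, asim=0.5, isim=0.5, ppl_s=4.0, ppl_t=2.0, sil=0.25, h=0.75)
print(score)  # 0.25
```

The Silhouette score and the perplexities are passed in as plain numbers here because computing them requires a clustering step and a language model, respectively.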

Also included:

  • Embedding generation directly from raw JSONL datasets (image, question, answer) via the OpenRouter API.
  • Support for separate embeddings (question, answer, image) and joint embeddings (qa, image+qa).

Training Frameworks

The paper's training experiments are built on:

This repository currently focuses on metric computation and embedding generation.

Minimal Setup

pip install -e .

or

pip install -r requirements.txt

Input Files

Supported input: .json, .jsonl, .npz.

Each dataset file should provide:

  • question_embeddings (2D)
  • answer_embeddings (2D)
  • image_embeddings (2D)
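
To make the expected shape concrete, the following sketch writes a toy .json file and checks that it carries the three required 2D arrays. The three key names come from the list above; everything else is an assumption for illustration, including the layout (a single JSON object mapping each key to a list of equal-length rows), the file name toy_source.json, and the helper names is_2d and check_dataset. The toolkit's actual loader may accept a different layout.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("question_embeddings", "answer_embeddings", "image_embeddings")


def is_2d(rows):
    """True if rows is a non-empty list of non-empty, equal-length lists of numbers."""
    if not isinstance(rows, list) or not rows:
        return False
    if not all(isinstance(r, list) and r for r in rows):
        return False
    width = len(rows[0])
    return all(
        len(r) == width and all(isinstance(x, (int, float)) for x in r)
        for r in rows
    )


def check_dataset(path):
    """Return a list of problems found in a JSON dataset file (empty if none)."""
    with open(path) as f:
        data = json.load(f)
    problems = []
    for key in REQUIRED_KEYS:
        if key not in data:
            problems.append(f"missing {key}")
        elif not is_2d(data[key]):
            problems.append(f"{key} is not 2D")
    return problems


# Two samples with 3-dimensional toy embeddings (real embeddings are much wider).
toy = {
    "question_embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "answer_embeddings": [[0.3, 0.2, 0.1], [0.6, 0.5, 0.4]],
    "image_embeddings": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
}

path = os.path.join(tempfile.mkdtemp(), "toy_source.json")  # hypothetical file name
with open(path, "w") as f:
    json.dump(toy, f)

problems = check_dataset(path)
print(problems)  # []
```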
