<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>DataProphet | Demystifying Supervision Data Generalization in MLLMs</title>
<meta
name="description"
content="Project page for DataProphet: a training-free metric for predicting supervision data influence in multimodal LLMs."
/>
<link rel="preconnect" href="https://fonts.googleapis.com" />
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
<link
href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;700&family=Source+Serif+4:opsz,[email protected],400;8..60,600&display=swap"
rel="stylesheet"
/>
<link rel="stylesheet" href="./styles.css" />
</head>
<body>
<div class="bg-orb bg-orb-a"></div>
<div class="bg-orb bg-orb-b"></div>
<header class="site-header">
<a class="brand" href="#top">
<img src="./assets/images/icon.png" alt="DataProphet icon" />
<span>DataProphet</span>
</a>
<nav class="nav-links">
<a href="#abstract">Abstract</a>
<a href="#analysis">Analysis</a>
<a href="#metric">Metric</a>
<a href="#selection">Selection</a>
<a href="#citation">Citation</a>
</nav>
</header>
<main id="top">
<section class="hero reveal">
<div class="hero-copy">
<p class="tagline">ICLR 2026 Submission</p>
<h1>DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs</h1>
<p class="authors">Xuan Qi, Luxi He, Dan Roth, Xingyu Fu</p>
<p class="hero-summary">
DataProphet predicts which supervision datasets will help a target benchmark before any
training. It combines multimodal similarity, perplexity, and diversity into a
training-free transfer score.
</p>
<div class="hero-buttons">
<a class="btn btn-primary" href="./assets/paper/dataprophet-paper.pdf" target="_blank" rel="noreferrer">Download PDF</a>
<a class="btn btn-secondary" href="https://huggingface.co/datasets/THUQiXuan/DataProphet" target="_blank" rel="noreferrer">Hugging Face</a>
<a class="btn btn-secondary" href="https://github.com/DataProphet26/dataprophet" target="_blank" rel="noreferrer">GitHub Code</a>
</div>
<div class="stats-grid">
<article>
<h3>14</h3>
<p>Source/target datasets</p>
</article>
<article>
<h3>7</h3>
<p>Task families</p>
</article>
<article>
<h3>0.860</h3>
            <p>Kendall's τ (avg)</p>
</article>
<article>
<h3>+6.9%</h3>
<p>Synthetic data gain</p>
</article>
</div>
</div>
<div class="hero-visual">
<img src="./assets/images/icon2.png" alt="DataProphet illustration" />
</div>
</section>
<section id="abstract" class="panel reveal">
<h2>Abstract</h2>
<p>
        Conventional data selection for multimodal LLMs often follows intuitive task similarity,
        but this paper shows that such intuition is unreliable for predicting transfer gains. The
        authors evaluate 14 vision-language datasets across 7 task families and find that
        influence is asymmetric and dataset-specific. DataProphet introduces a simple training-free
        metric that integrates question/answer/image similarity, multimodal perplexity, and source
        diversity to predict transfer rankings. The predicted rankings correlate strongly with
        actual fine-tuning outcomes (Kendall's τ = 0.860), and DataProphet-guided selection improves
        average performance over uniform and training-based baselines under fixed compute budgets.
</p>
</section>
<section class="panel reveal">
<h2>Core Takeaways</h2>
<div class="takeaway-grid">
<article>
<h3>Task similarity is a weak proxy</h3>
<p>
              OCR supervision can improve spatial reasoning more than chart supervision does. Transfer cannot be
inferred from high-level task labels alone.
</p>
</article>
<article>
<h3>Influence is directional</h3>
<p>
              The gain from source to target is not symmetric: Δ<sub>s→t</sub> and
              Δ<sub>t→s</sub> can differ substantially.
</p>
</article>
<article>
<h3>Training-free selection can win</h3>
<p>
DataProphet reaches +3.4% average improvement on real-data reweighting and +6.9% on
synthetic data selection versus uniform sampling.
</p>
</article>
</div>
</section>
<section id="analysis" class="panel reveal">
<h2>Data Influence Analysis</h2>
<p class="panel-lead">
Controlled fine-tuning with InternVL3-2B on each source dataset (20K samples) reveals
non-intuitive cross-task transfer patterns.
</p>
<figure>
<img src="./assets/images/teaser.png" alt="DataProphet teaser figure with three key findings" />
<figcaption>Figure: Three major takeaways from the paper.</figcaption>
</figure>
<figure>
<img src="./assets/images/relative.png" alt="Relative improvement heatmap across 14 train and test datasets" />
<figcaption>
Relative improvement heatmap (train dataset on y-axis, test dataset on x-axis).
</figcaption>
</figure>
</section>
<section id="metric" class="panel reveal">
<h2>The DataProphet Metric</h2>
<p class="equation">
        M(s → t) = (QSim · ASim · ISim · PPL(s) · (Sil + H)) / PPL(t)
</p>
<p>
        The metric is directional and training-free. It rewards source datasets that are aligned
        with the target in text and vision space, challenging enough to teach new capabilities, and
        diverse in question coverage.
</p>
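      <p>
        As a reading aid, here is a minimal sketch of how the score combines its ingredients,
        assuming the similarity, perplexity, and diversity terms are precomputed (the names are
        illustrative, not the repository's actual API):
      </p>
      <pre><code>// Hypothetical sketch of the DataProphet score for a source s and target t.
// All inputs are assumed precomputed; see the paper and repo for the real pipeline.
function dataProphetScore({ qSim, aSim, iSim, pplSource, pplTarget, sil, entropy }) {
  // M(s → t) = (QSim · ASim · ISim · PPL(s) · (Sil + H)) / PPL(t)
  return (qSim * aSim * iSim * pplSource * (sil + entropy)) / pplTarget;
}</code></pre>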
<div class="metric-grid">
<figure>
<img src="./assets/images/metric.png" alt="DataProphet metric heatmap values" />
<figcaption>Metric score matrix between all source-target pairs.</figcaption>
</figure>
<div class="metric-notes">
<h3>Ranking Quality</h3>
<ul>
            <li>Average τ<sub>Tgt</sub> = 0.863</li>
            <li>Average τ<sub>Src</sub> = 0.857</li>
            <li>Overall average τ = 0.860</li>
</ul>
<h3>Ablation Signal</h3>
<ul>
            <li>Removing perplexity: 0.860 → 0.487</li>
            <li>Removing image similarity: 0.860 → 0.625</li>
            <li>Removing diversity: 0.860 → 0.659</li>
</ul>
</div>
</div>
<figure>
<img src="./assets/images/result-kendall.png" alt="Kendall tau results by dataset" />
        <figcaption>Kendall's τ across target datasets.</figcaption>
</figure>
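      <p>
        For readers unfamiliar with the measure, Kendall's τ compares the predicted and actual
        transfer rankings pair by pair. A minimal sketch (τ-a, ignoring ties; illustrative only,
        not the paper's evaluation code):
      </p>
      <pre><code>// Hypothetical sketch: Kendall's tau-a between two score arrays over the
// same n source datasets (concordant minus discordant pairs, normalized).
function kendallTau(predicted, actual) {
  const n = predicted.length;
  let net = 0;
  for (let i = 0; i &lt; n; i++) {
    for (let j = i + 1; j &lt; n; j++) {
      net += Math.sign(predicted[i] - predicted[j]) *
             Math.sign(actual[i] - actual[j]);
    }
  }
  return net / (n * (n - 1) / 2);
}</code></pre>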
</section>
<section id="selection" class="panel reveal">
<h2>Data Selection Results</h2>
<p class="panel-lead">
Under a fixed budget of 280K samples, DataProphet-guided selection outperforms both uniform
and training-based methods in real and synthetic settings.
</p>
<div class="table-wrap">
<table>
<thead>
<tr>
<th>Setting</th>
<th>Uniform</th>
<th>ICONS</th>
<th>Oracle</th>
<th>DataProphet</th>
</tr>
</thead>
<tbody>
<tr>
<td>Real Data Reweighting (Avg)</td>
<td>67.6</td>
<td>69.6</td>
<td>70.8</td>
<td><strong>71.0</strong></td>
</tr>
<tr>
              <td>Improvement over Uniform</td>
<td>-</td>
<td>+2.0</td>
<td>+3.2</td>
<td><strong>+3.4</strong></td>
</tr>
<tr>
<td>Synthetic Data Selection (Avg)</td>
<td>55.1</td>
<td>60.8</td>
<td>-</td>
<td><strong>62.0</strong></td>
</tr>
<tr>
              <td>Improvement over Uniform</td>
<td>-</td>
<td>+5.7</td>
<td>-</td>
<td><strong>+6.9</strong></td>
</tr>
</tbody>
</table>
</div>
<div class="selection-notes">
<article>
<h3>Synthetic Source Mix</h3>
<p>
Among selected synthetic samples, approximately 38% come from GPT-5 and 62% from
Gemini 2.5 Pro.
</p>
</article>
<article>
<h3>RL Post-training</h3>
<p>
            DataProphet allocation improves the average score from 0.583 to 0.595 with real RL
            data and from 0.564 to 0.577 with synthetic RL data.
</p>
</article>
</div>
</section>
<section id="citation" class="panel reveal">
<h2>Citation</h2>
      <p>If you find this work useful, please cite:</p>
<pre id="bibtex"><code>@article{qi2026dataprophet,
title={DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs},
author={Qi, Xuan and He, Luxi and Roth, Dan and Fu, Xingyu},
journal={International Conference on Learning Representations},
year={2026}
}</code></pre>
<button class="btn btn-primary" id="copyBibBtn" type="button">Copy BibTeX</button>
</section>
</main>
<footer class="site-footer">
<p>DataProphet Project Page | Built from paper assets and LaTeX source</p>
</footer>
<script src="./script.js"></script>
</body>
</html>