Jean Kaddour

PhD in LLMs @ UCL

Publications

Selected Papers

agents, safety

Jean Kaddour, Srijan Patel, Gbètondji J-S Dovonon, Leo Richter, Pasquale Minervini, Matt J. Kusner, ICLR 2026 Trustworthy AI

• What: Can agents predict whether they will succeed at a task?

• Why: Overconfident agents are dangerous.

rl, reasoning, reward engineering

Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour†, Andreas Köpf†, NeurIPS 2025 (Spotlight, Top 2%)

• What: 100+ RL envs across 8 domains with configurable complexity.

• Why: Generates virtually unlimited training data with adjustable complexity, unlike most previous reasoning datasets, which are fixed.

agents, scaffolds, oss

Jean Kaddour et al., GitHub (5.6k stars)

• What: A Python package with UI for building and debugging agents.

• Why: Debugging long-running agents in a terminal gets cumbersome.

unfathomable datasets, hallucinations, misalignment

Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, Robert McHardy, arXiv 2023

• What: An opinionated review of 16 challenges for LLMs.

• Why: The field is moving fast; it's hard to keep up with what's worth solving.

compute-optimal training, hardware awareness, optimizer

Jean Kaddour∗, Oscar Key∗, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner, NeurIPS 2023

• What: A simple budget-aware LR scheduler outperforms most fancy efficient training methods.

• Why: Every day, there's a new training method; the ones we tried weren't that effective.

• Trivia: We started by trying ideas that never outperformed our baseline, then realized the baseline itself was quite competitive.
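The idea can be sketched as a schedule that decays the learning rate to zero exactly when the training budget runs out. This is a toy illustration, not the paper's exact recipe; the function name and warmup handling are my own:

```python
def budget_aware_lr(step: int, budget_steps: int, peak_lr: float,
                    warmup_steps: int = 0) -> float:
    """Toy budget-aware schedule: warm up linearly, then decay linearly
    so the LR hits zero exactly at the end of the training budget."""
    if step < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # linear decay over the remaining budget
    remaining = budget_steps - step
    decay_span = budget_steps - warmup_steps
    return peak_lr * max(0.0, remaining / decay_span)
```

The key point is that the schedule takes the budget as an explicit input, so shrinking the budget reshapes the whole decay rather than truncating it.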

optimizer

Jean Kaddour∗, Linqing Liu∗, Ricardo Silva, Matt J. Kusner, NeurIPS 2022

• What: We can find even flatter minima than SAM by adding weight averaging.

• Why: SAM finds flat basins; WA finds flat points inside those basins.
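The weight-averaging half of that combination is simple to sketch: keep a running mean of the weights the optimizer visits. This toy class averages plain float lists (a real implementation averages model tensors and pairs the averaging with a SAM-style optimizer, which is omitted here):

```python
class WeightAverager:
    """Running mean of model weights, represented as lists of floats.
    Call update() after each optimizer step; read .avg for the averaged
    weights used at evaluation time."""

    def __init__(self):
        self.n = 0
        self.avg = None

    def update(self, weights):
        if self.avg is None:
            self.avg = list(weights)
            self.n = 1
        else:
            # incremental mean: avg += (w - avg) / n
            self.n += 1
            self.avg = [a + (w - a) / self.n
                        for a, w in zip(self.avg, weights)]
```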

Evals

ai for science, reasoning

Long Phan et al. (incl. Jean Kaddour), arXiv 2025

• What: A really hard multiple-choice science benchmark for LLMs.

• Why: Previous benchmarks were hill-climbed quickly, but this one will remain the last one standing, promised.

tool use, coding

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, James Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, David Lo, Binyuan Hui, Niklas Muennighoff, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro von Werra, ICLR 2025 (Oral, Top 2%)

• What: 1k+ diverse, multi-tool-use programming tasks in Python.

• Why: Other code benchmarks are too homogeneous and lack tool calls.

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini, NAACL 2025

• What: We expose serious flaws in MMLU and release a smaller and cleaner version, MMLU-Redux.

• Why: MMLU is one of the most popular LLM benchmarks; better benchmarks, better models.

Hanchen Wang∗, Jean Kaddour∗, Shengchao Liu∗, Jian Tang, Joan Lasenby, Qi Liu, NeurIPS 2023

• What: A probing suite to profile molecular graph embeddings.

• Why: Downstream-only evaluations can be misleading; better probes yield more faithful assessments.

synthetic data, vision models, spurious correlations

Aengus Lynch∗, Gbètondji J-S Dovonon∗, Jean Kaddour∗, Ricardo Silva, ICLR 2025 SCSL

• What: A vision dataset of cute dogs with spurious correlations between dog breeds and backgrounds.

• Why: Spurious correlations harm the reliability of vision models; previous benchmarks were too easy.

Posttraining

memory, parallelism

Oscar Key∗, Jean Kaddour∗, Pasquale Minervini, NeurIPS 2023 WANT

• What: A method for fine-tuning an arbitrarily large model chunk by chunk (in isolation).

• Why: It lets the GPU-poor fine-tune larger LLMs, too.

• Trivia: Inspired by distributed training techniques, adapted for single-GPU fine-tuning.

Pretraining

lr decay, optimizer

Sunny Sanyal, Atula Neerkaje, Jean Kaddour, Abhishek Kumar, Sujay Sanghavi, COLM 2024, NeurIPS 2023 WANT

• What: We scale up LAWA (see below) to large models.

• Why: Large model training -> large batch sizes -> large LRs -> LAWA makes (even more) sense.
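LAWA itself boils down to averaging the latest k checkpoints along the training trajectory. A minimal sketch, assuming checkpoints are lists of floats (tensors in practice) and a window size k:

```python
from collections import deque


def lawa_average(checkpoints, k):
    """LAWA-style latest weight averaging: keep only the most recent k
    checkpoints and return their element-wise mean."""
    window = deque(checkpoints, maxlen=k)  # drops all but the last k
    return [sum(vals) / len(window) for vals in zip(*window)]
```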

pretraining data, data mixtures, clustering

Jean Kaddour, arXiv 2023

• What: Using embeddings and k-means, I construct a small and clean yet diverse pretraining corpus.

• Why: The Pile is too large for GPU-poor academics.

• Trivia: I reviewed examples of each k-means cluster during my daily tube commute.
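The clustering step can be sketched with a tiny k-means over document embeddings. This toy version works on lists of floats; in the paper, the embeddings come from a sentence embedder and low-quality clusters are excluded before sampling the corpus, which this sketch omits:

```python
import random


def kmeans(points, k, iters=10, seed=0):
    """Tiny Lloyd's k-means over embeddings given as lists of floats."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean)
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            groups[j].append(p)
        # move each center to the mean of its assigned points
        for j, g in enumerate(groups):
            if g:
                centers[j] = [sum(vals) / len(g) for vals in zip(*g)]
    return centers
```

After clustering, one can inspect a few examples per cluster (the tube-commute step) and keep only documents from clusters judged clean.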

Misc

synthetic data, vision models, diffusion models, distillation

Yuwei Yin, Jean Kaddour, Xiang Zhang, Yixin Nie, Zhenguang Liu, Lingpeng Kong, Qi Liu, arXiv 2023

• What: We generate synthetic training data for vision classification models.

• Why: You can think of it as knowledge distillation from generative to discriminative models.

• Trivia: This is sort of the training-equivalent of Spawrious.

causality, spurious correlations, causal rl

Jean Kaddour, Aengus Lynch, Qi Liu, Matt J. Kusner, Ricardo Silva, Foundations and Trends in Optimization, 2022

• What: A survey of how causality can be applied to ML problems.

• Why: Causality allows you to make assumptions about the data-generating process.

• Trivia: 3 years later, I'm surprised how far we've come with LLMs without any causality.

data collectionactive learningmeta-learning

Jean Kaddour∗, Steindór Sæmundsson∗, Marc Peter Deisenroth, NeurIPS 2020

• What: We make meta-learning more sample-efficient by letting the model guide the task selection.

• Why: Acquiring datasets can be expensive and slow; let's make sure it's worth it.
