Jean Kaddour

PhD in LLMs @ UCL

Publications

Selected Papers

agents, safety

Jean Kaddour, Srijan Patel, Gbètondji J-S Dovonon, Leo Richter, Pasquale Minervini, Matt J. Kusner, ICLR 2026 Trustworthy AI

• What: Can agents predict whether they will succeed at a task?

• Why: Overconfident agents are dangerous.

rl, reasoning, reward engineering

Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour†, Andreas Köpf†, NeurIPS 2025 (Spotlight, Top 2%)

• What: 100+ RL envs across 8 domains with configurable complexity.

• Why: Generates virtually unlimited training data with adjustable complexity, unlike most previous reasoning datasets, which are fixed.

agents, scaffolds, oss

Jean Kaddour et al., GitHub (5.6k stars)

• What: A Python package with UI for building and debugging agents.

• Why: Debugging long-running agents in a terminal gets cumbersome.

unfathomable datasets, hallucinations, misalignment

Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, Robert McHardy, arXiv 2023

• What: An opinionated review of 16 challenges for LLMs.

• Why: The field is moving fast; it's hard to keep up with what's worth solving.

compute-optimal training, hardware awareness, optimizer

Jean Kaddour∗, Oscar Key∗, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner, NeurIPS 2023

• What: A simple budget-aware LR scheduler outperforms most fancy efficient training methods.

• Why: Every day, there's a new training method; the ones we tried weren't that effective.

• Trivia: We started by trying ideas that never outperformed our baseline, then realized the baseline itself was quite competitive.
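The idea can be sketched as a schedule that decays the learning rate to zero exactly when the training budget runs out. This is a toy illustration, not the paper's exact recipe; the function name and warmup handling are my own:

```python
def budget_aware_lr(step: int, budget_steps: int, peak_lr: float,
                    warmup_steps: int = 0) -> float:
    """Toy budget-aware schedule: warm up linearly, then decay linearly
    so the LR hits zero exactly at the end of the training budget."""
    if step < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # linear decay over the remaining budget
    remaining = budget_steps - step
    decay_span = budget_steps - warmup_steps
    return peak_lr * max(0.0, remaining / decay_span)
```

The key point is that the schedule takes the budget as an explicit input, so shrinking the budget reshapes the whole decay rather than truncating it.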

optimizer

Jean Kaddour∗, Linqing Liu∗, Ricardo Silva, Matt J. Kusner, NeurIPS 2022

• What: We can find even flatter minima than SAM by adding weight averaging.

• Why: SAM finds flat basins; WA finds flat points inside those basins.
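The weight-averaging half of that combination is simple to sketch: keep a running mean of the weights the optimizer visits. This toy class averages plain float lists (a real implementation averages model tensors and pairs the averaging with a SAM-style optimizer, which is omitted here):

```python
class WeightAverager:
    """Running mean of model weights, represented as lists of floats.
    Call update() after each optimizer step; read .avg for the averaged
    weights used at evaluation time."""

    def __init__(self):
        self.n = 0
        self.avg = None

    def update(self, weights):
        if self.avg is None:
            self.avg = list(weights)
            self.n = 1
        else:
            # incremental mean: avg += (w - avg) / n
            self.n += 1
            self.avg = [a + (w - a) / self.n
                        for a, w in zip(self.avg, weights)]
```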

Evals

ai for science, reasoning

Long Phan et al. (incl. Jean Kaddour), arXiv 2025

• What: A really hard multiple-choice science benchmark for LLMs.

• Why: Previous benchmarks were hill-climbed quickly, but this one will remain the last one standing, promised.

tool use, coding

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, James Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, David Lo, Binyuan Hui, Niklas Muennighoff, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro von Werra, ICLR 2025 (Oral, Top 2%)

• What: 1k+ diverse, multi-tool-use programming tasks in Python.

• Why: Other code benchmarks are too homogeneous and lack tool calls.

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini, NAACL 2025

• What: We expose serious flaws in MMLU and release a smaller and cleaner version, MMLU-Redux.

• Why: MMLU is one of the most popular LLM benchmarks; better benchmarks, better models.

Hanchen Wang∗, Jean Kaddour∗, Shengchao Liu∗, Jian Tang, Joan Lasenby, Qi Liu, NeurIPS 2023

• What: A probing suite to profile molecular graph embeddings.

• Why: Downstream-only evaluations can be misleading; better probes yield more faithful assessments.

synthetic data, vision models, spurious correlations

Aengus Lynch∗, Gbètondji J-S Dovonon∗, Jean Kaddour∗, Ricardo Silva, ICLR 2025 SCSL

• What: A vision dataset of cute dogs with spurious correlations between dog breeds and backgrounds.

• Why: Spurious correlations harm the reliability of vision models; previous benchmarks were too easy.

Posttraining

memory, parallelism

Oscar Key∗, Jean Kaddour∗, Pasquale Minervini, NeurIPS 2023 WANT

• What: A method for fine-tuning an arbitrarily large model chunk by chunk (in isolation).

• Why: It lets the GPU-poor fine-tune larger LLMs, too.

• Trivia: Inspired by distributed training techniques, adapted for single-GPU fine-tuning.

Pretraining

lr decay, optimizer

Sunny Sanyal, Atula Neerkaje, Jean Kaddour, Abhishek Kumar, Sujay Sanghavi, COLM 2024, NeurIPS 2023 WANT

• What: We scale up LAWA (see below) to large models.

• Why: Large model training -> large batch sizes -> large LRs -> LAWA makes (even more) sense.
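LAWA itself boils down to averaging the latest k checkpoints along the training trajectory. A minimal sketch, assuming checkpoints are lists of floats (tensors in practice) and a window size k:

```python
from collections import deque


def lawa_average(checkpoints, k):
    """LAWA-style latest weight averaging: keep only the most recent k
    checkpoints and return their element-wise mean."""
    window = deque(checkpoints, maxlen=k)  # drops all but the last k
    return [sum(vals) / len(window) for vals in zip(*window)]
```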

pretraining data, data mixtures, clustering

Jean Kaddour, arXiv 2023

• What: Using embeddings and k-means, I construct a small and clean yet diverse pretraining corpus.

• Why: The Pile is too large for GPU-poor academics.

• Trivia: I reviewed examples of each k-means cluster during my daily tube commute.
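The clustering step can be sketched with a tiny k-means over document embeddings. This toy version works on lists of floats; in the paper, the embeddings come from a sentence embedder and low-quality clusters are excluded before sampling the corpus, which this sketch omits:

```python
import random


def kmeans(points, k, iters=10, seed=0):
    """Tiny Lloyd's k-means over embeddings given as lists of floats."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean)
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            groups[j].append(p)
        # move each center to the mean of its assigned points
        for j, g in enumerate(groups):
            if g:
                centers[j] = [sum(vals) / len(g) for vals in zip(*g)]
    return centers
```

After clustering, one can inspect a few examples per cluster (the tube-commute step) and keep only documents from clusters judged clean.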

Misc

synthetic data, vision models, diffusion models, distillation

Yuwei Yin, Jean Kaddour, Xiang Zhang, Yixin Nie, Zhenguang Liu, Lingpeng Kong, Qi Liu, arXiv 2023

• What: We generate synthetic training data for vision classification models.

• Why: You can think of it as knowledge distillation from generative to discriminative models.

• Trivia: This is sort of the training-equivalent of Spawrious.

causality, spurious correlations, causal rl

Jean Kaddour, Aengus Lynch, Qi Liu, Matt J. Kusner, Ricardo Silva, Foundations and Trends in Optimization, 2022

• What: A survey of how causality can be applied to ML problems.

• Why: Causality allows you to make assumptions about the data-generating process.

• Trivia: 3 years later, I'm surprised how far we've come with LLMs without any causality.

data collectionactive learningmeta-learning

Jean Kaddour∗, Steindór Sæmundsson∗, Marc Peter Deisenroth, NeurIPS 2020

• What: We make meta-learning more sample-efficient by letting the model guide the task selection.

• Why: Acquiring datasets can be expensive and slow; let's make sure it's worth it.
