Michal Valko - Notes
Technical notes and research highlights by Michal Valko.
https://misovalko.github.io/notes-feed.xml

Nash Learning from Human Feedback: A Brief Explainer
https://misovalko.github.io/notes.html#nash-llm-explained
2024-07-15

Standard RLHF trains a reward model from pairwise comparisons, then optimizes a policy against that reward. Our approach, Nash Learning from Human Feedback (NLHF), sidesteps the reward model bottleneck by framing alignment as a two-player game.
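As a rough sketch of that game (the notation below is mine, not taken from this note): let \( \mathcal{P}(y \succ y' \mid x) \) be a preference model fit to the pairwise human comparisons, and let two copies of the policy play against each other. NLHF then looks for a policy that is preferred to any competitor, i.e., a Nash equilibrium of the symmetric two-player game

\[
\pi^{\star} \;=\; \arg\max_{\pi}\; \min_{\pi'}\;
\mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
\big[ \mathcal{P}(y \succ y' \mid x) \big].
\]

Because the game is symmetric, the equilibrium value is \( 1/2 \): the Nash policy is, in expectation, preferred at least as often as not against any alternative policy, without ever collapsing the comparisons into a single scalar reward.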