Michal Valko - Notes
Technical notes and research highlights by Michal Valko.

Nash Learning from Human Feedback: A Brief Explainer
https://misovalko.github.io/notes.html#nash-llm-explained
Published 2024-07-15

Standard RLHF trains a reward model from pairwise comparisons, then optimizes a policy against that reward. Our approach, Nash Learning from Human Feedback (NLHF), sidesteps the reward-model bottleneck by framing alignment as a two-player game.
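A minimal sketch of that game, with notation assumed here for illustration rather than quoted from the paper: a preference model \(\mathcal{P}(y \succ y' \mid x)\) scores pairs of responses, each of the two players picks a policy, and the payoff is how often one policy's responses are preferred to the other's. The aligned policy is the Nash equilibrium of this constant-sum game:

\[
\mathcal{P}(\pi \succ \pi') \;=\; \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}\big[\mathcal{P}(y \succ y' \mid x)\big],
\qquad
\pi^{\star} \;=\; \arg\max_{\pi} \min_{\pi'} \; \mathcal{P}(\pi \succ \pi').
\]

Because \(\pi^{\star}\) is defined directly against all alternative policies, no intermediate scalar reward has to be fit to the pairwise comparisons.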