Scalable agent alignment via reward modeling
Authors
Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg
Abstract
This paper outlines a research direction for aligning agent behavior with user intentions in complex domains where reward functions are hard to specify. The central idea is reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning. The paper discusses the key challenges expected when scaling reward modeling to complex environments, approaches to mitigating them, and ways to establish confidence in the resulting agents.
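To make the reward-modeling loop concrete, the sketch below learns a reward model from pairwise preferences (a Bradley-Terry style logistic loss) and then selects behavior that scores highest under the learned reward. It is a minimal illustration only: the toy trajectories, the linear model, and the simulated "human" preference oracle are assumptions for this example, not the paper's implementation.

```python
# Minimal sketch of the reward-modeling loop:
# (1) learn a reward model from pairwise preferences over trajectories,
# (2) optimize behavior against the learned reward.
import numpy as np

rng = np.random.default_rng(0)

# Toy "trajectories" are feature vectors; the hidden true reward is linear.
true_w = np.array([1.0, -2.0, 0.5])

def true_reward(traj):
    return traj @ true_w

# Simulated human: prefers the trajectory with higher true reward.
def human_prefers_first(traj_a, traj_b):
    return true_reward(traj_a) > true_reward(traj_b)

# Collect preference comparisons over random trajectory pairs.
pairs = [(rng.normal(size=3), rng.normal(size=3)) for _ in range(500)]
labels = np.array([1.0 if human_prefers_first(a, b) else 0.0 for a, b in pairs])

# Fit a linear reward model with the logistic preference loss:
# P(a preferred over b) = sigmoid(r(a) - r(b)).
w = np.zeros(3)
lr = 0.1
for _ in range(200):
    grad = np.zeros(3)
    for (a, b), y in zip(pairs, labels):
        p = 1.0 / (1.0 + np.exp(-(a - b) @ w))
        grad += (p - y) * (a - b)
    w -= lr * grad / len(pairs)

# Stand-in for policy optimization: pick the candidate trajectory
# that the learned reward model ranks highest.
candidates = rng.normal(size=(100, 3))
best = candidates[np.argmax(candidates @ w)]
print("learned reward weights:", np.round(w, 2))
print("chosen trajectory's true reward:", round(float(true_reward(best)), 2))
```

In the paper's framing, the last step would be a full reinforcement-learning loop trained against the learned reward model rather than a one-shot argmax over candidates; the argmax here stands in for that step only to keep the example self-contained.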
Publication Details
Published:
November 19th, 2018
Venue:
arXiv
Added to AI Safety Papers:
March 16th, 2025
Metadata
Tags:
Reward Modeling, Reinforcement Learning, AI Safety