Scalable agent alignment via reward modeling

Authors

Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg

Abstract

This paper outlines reward modeling, in which a reward function is learned from user feedback and then optimized with reinforcement learning, as a research direction for aligning agent behavior with human preferences in complex domains where reward functions are hard to specify.
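To make the abstract concrete, the sketch below illustrates the generic preference-based reward-modeling recipe: a small reward model is trained on pairwise trajectory comparisons with a Bradley-Terry style loss, and the learned model then stands in for a hand-specified reward. This is not code from the paper; PyTorch, the network sizes, the synthetic preference oracle, and the helper names (RewardModel, preference_loss) are illustrative assumptions.

```python
# Minimal sketch of preference-based reward modeling (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a state-action feature vector to a scalar reward estimate."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # per-step reward estimates

def trajectory_return(model: RewardModel, traj: torch.Tensor) -> torch.Tensor:
    """Sum predicted per-step rewards over a trajectory of shape (T, obs_dim)."""
    return model(traj).sum()

def preference_loss(model, traj_a, traj_b, pref):
    """Bradley-Terry loss: `pref` is 1.0 if the user preferred traj_a, else 0.0."""
    logits = trajectory_return(model, traj_a) - trajectory_return(model, traj_b)
    return F.binary_cross_entropy_with_logits(logits.unsqueeze(0), pref.unsqueeze(0))

if __name__ == "__main__":
    torch.manual_seed(0)
    obs_dim, T = 8, 20
    model = RewardModel(obs_dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Synthetic stand-in for human feedback: prefer trajectories whose first
    # feature is larger on average (a hypothetical "true" preference).
    for step in range(200):
        traj_a, traj_b = torch.randn(T, obs_dim), torch.randn(T, obs_dim)
        pref = (traj_a[:, 0].mean() > traj_b[:, 0].mean()).float()
        loss = preference_loss(model, traj_a, traj_b, pref)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In a full pipeline, the trained reward model would supply the reward signal to a standard reinforcement learning algorithm in place of a hand-specified reward function.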

Publication Details

Published: November 19th, 2018

Venue: arXiv

Added to AI Safety Papers: March 16th, 2025

Metadata

Tags: Reward Modeling, Reinforcement Learning, AI Safety

Original Paper: Link