Risks from Learned Optimization in Advanced Machine Learning Systems

Authors

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant

Abstract

This paper analyzes the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization.

Publication Details

Published:

June 10th, 2019

Venue:

arXiv

Added to AI Safety Papers:

March 16th, 2025

Metadata

Tags:

Mesa-optimization, AI Safety, Alignment

Original Paper:

Link