What exactly is AI “alignment”?

Understanding AI Alignment

💡 What is AI “alignment,” and how does it differ from “safety” and “control”? After reading this, you will understand what “alignment” is and how these terms are related.

Definitions

AI safety is the umbrella term for everything aimed at reducing risks posed by AI (misuse, reliability, privacy, etc.). It may be helpful to visualize AI safety as four quadrants:

|          | Short Term                    | Long Term                     |
|----------|-------------------------------|-------------------------------|
| Accident | e.g. Self-Driving Car Crashes | Oh boy                        |
| Misuse   | e.g. Deep Fakes               | e.g. AI-Enabled Dictatorship  |

This guide is about the upper right-hand quadrant, where things go wrong by accident (and by default). When we think about preventing this long-term accident risk, we think about solving the AI control problem.

AI control is about ensuring that AI systems try to do the right thing and, if they don’t, that they at least don’t competently pursue the wrong thing.

AI alignment is one approach to the AI control problem. More specifically, the alignment problem is about building AI systems that are trying to do what their operator wants them to do. In other words, AI alignment aims to solve the control problem by making sure the AI is trying to do the right thing in the first place, rather than by containing, manipulating, or outsmarting it.
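To make this concrete, here is a minimal toy sketch in Python (the cleaning scenario, names, and numbers are all hypothetical, invented for illustration): an agent flawlessly maximizes the objective it was given, while the operator wanted something slightly different. The control problem’s failure mode, competently pursuing the wrong thing, falls out of the gap between the two reward functions.

```python
# Hypothetical toy example: the agent faithfully maximizes the objective it
# was *given* (specified_reward), which differs from what the operator
# *wants* (intended_reward). The gap between the two is the alignment
# problem in miniature.

# Each cleaning plan maps to (dust_removed, vase_broken).
plans = {
    "careful_clean": (8, False),    # removes most dust, avoids the vase
    "reckless_clean": (10, True),   # removes all dust, knocks over a vase
}

def specified_reward(dust, vase_broken):
    """What the designer actually wrote down: count dust removed, nothing else."""
    return dust

def intended_reward(dust, vase_broken):
    """What the operator actually wanted: clean, but don't break things."""
    return dust - (100 if vase_broken else 0)

best_by_spec = max(plans, key=lambda p: specified_reward(*plans[p]))
best_by_intent = max(plans, key=lambda p: intended_reward(*plans[p]))

print(best_by_spec)    # reckless_clean -- competently pursues the wrong thing
print(best_by_intent)  # careful_clean  -- what an aligned agent would try to do
```

Nothing here is a bug in the optimizer; the agent does exactly what it was told. That is why alignment is framed as getting the AI to try to do what the operator wants, not as making the optimization more powerful.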

Alignment = the AI’s motives, not its knowledge or ability

  1. “Aligned” ≠ “perfect.” An aligned AI can still misunderstand instructions or the operator’s wants, predict the consequences of its actions incorrectly, or even create an unaligned AI itself (see the sketch after this list).
  2. Current systems like GPT-4 can’t take over the world but are still misaligned: they often do things their designers didn’t want, such as lying or being offensive.
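Here is a minimal sketch of point 1, reusing the hypothetical cleaning example from above: this agent optimizes the operator’s intended reward (so its motives are aligned), but its model of the world is mistaken, and it still causes the harm.

```python
# Hypothetical toy example: an *aligned* agent can still fail. This agent
# optimizes the operator's intended reward, but its world model is wrong --
# it believes the reckless plan leaves the vase intact.

def intended_reward(dust, vase_broken):
    return dust - (100 if vase_broken else 0)

believed_outcomes = {
    "careful_clean": (8, False),
    "reckless_clean": (10, False),  # mistaken belief: the vase actually breaks
}
actual_outcomes = {
    "careful_clean": (8, False),
    "reckless_clean": (10, True),
}

# The agent chooses using its (mistaken) beliefs...
choice = max(believed_outcomes, key=lambda p: intended_reward(*believed_outcomes[p]))

# ...but reality delivers the actual outcome.
print(choice, intended_reward(*actual_outcomes[choice]))  # reckless_clean -90
```

The failure here lives in the agent’s knowledge, not its motives, which is exactly the distinction the heading above draws.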

Alignment ≠ human values

The alignment problem is commonly misunderstood as “aligning the AI’s values/preferences with humans.” As defined here, alignment is narrower: it asks only that the AI try to do what its operator wants, not that it learn or share the full breadth of human values.

