What exactly is AI “alignment”?

Understanding AI Alignment

💡 What is AI “alignment,” and how does it differ from “safety” and “control”? After reading this, you will understand what “alignment” is and how these terms are related.

Definitions

AI safety is the umbrella term for everything aimed at reducing risks posed by AI (misuse, reliability, privacy, etc.). It may be helpful to visualize AI safety as four quadrants:

|          | Short Term                    | Long Term                     |
|----------|-------------------------------|-------------------------------|
| Accident | e.g. Self-Driving Car Crashes | Oh boy                        |
| Misuse   | e.g. Deep Fakes               | e.g. AI-Enabled Dictatorship  |

This guide is about the upper right-hand quadrant, where things go wrong by accident (and by default). When we think about preventing this long-term accident risk, we think about solving the AI control problem.

AI control is about ensuring that AI systems try to do the right thing and, if they don’t, that they at least don’t competently pursue the wrong thing.

AI alignment is one approach to the AI control problem. More specifically, the alignment problem is about building AI systems that are trying to do what their operator wants them to do. In other words, AI alignment aims to solve the control problem by making sure the AI is trying to do the right thing in the first place, rather than by containing, manipulating, or outsmarting it.
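To make this concrete, here is a minimal toy sketch in Python (the cleaning scenario, names, and numbers are all hypothetical, invented for illustration): an agent flawlessly maximizes the objective it was given, while the operator wanted something slightly different. The control problem’s failure mode, competently pursuing the wrong thing, falls out of the gap between the two reward functions.

```python
# Hypothetical toy example: the agent faithfully maximizes the objective it
# was *given* (specified_reward), which differs from what the operator
# *wants* (intended_reward). The gap between the two is the alignment
# problem in miniature.

# Each cleaning plan maps to (dust_removed, vase_broken).
plans = {
    "careful_clean": (8, False),    # removes most dust, avoids the vase
    "reckless_clean": (10, True),   # removes all dust, knocks over a vase
}

def specified_reward(dust, vase_broken):
    """What the designer actually wrote down: count dust removed, nothing else."""
    return dust

def intended_reward(dust, vase_broken):
    """What the operator actually wanted: clean, but don't break things."""
    return dust - (100 if vase_broken else 0)

best_by_spec = max(plans, key=lambda p: specified_reward(*plans[p]))
best_by_intent = max(plans, key=lambda p: intended_reward(*plans[p]))

print(best_by_spec)    # reckless_clean -- competently pursues the wrong thing
print(best_by_intent)  # careful_clean  -- what an aligned agent would try to do
```

Nothing here is a bug in the optimizer; the agent does exactly what it was told. That is why alignment is framed as getting the AI to try to do what the operator wants, not as making the optimization more powerful.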

Alignment = the AI’s motives, not its knowledge or ability

  1. “Aligned” ≠ “perfect.” An aligned AI can still misunderstand instructions or the operator’s wants, predict the consequences of its actions incorrectly, or even create an unaligned AI itself (see the sketch after this list).
  2. Current systems like GPT-4 can’t take over the world but are still misaligned: they often do things their designers didn’t want, such as lying or being offensive.
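Here is a minimal sketch of point 1, reusing the hypothetical cleaning example from above: this agent optimizes the operator’s intended reward (so its motives are aligned), but its model of the world is mistaken, and it still causes the harm.

```python
# Hypothetical toy example: an *aligned* agent can still fail. This agent
# optimizes the operator's intended reward, but its world model is wrong --
# it believes the reckless plan leaves the vase intact.

def intended_reward(dust, vase_broken):
    return dust - (100 if vase_broken else 0)

believed_outcomes = {
    "careful_clean": (8, False),
    "reckless_clean": (10, False),  # mistaken belief: the vase actually breaks
}
actual_outcomes = {
    "careful_clean": (8, False),
    "reckless_clean": (10, True),
}

# The agent chooses using its (mistaken) beliefs...
choice = max(believed_outcomes, key=lambda p: intended_reward(*believed_outcomes[p]))

# ...but reality delivers the actual outcome.
print(choice, intended_reward(*actual_outcomes[choice]))  # reckless_clean -90
```

The failure here lives in the agent’s knowledge, not its motives, which is exactly the distinction the heading above draws.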

Alignment ≠ human values

The alignment problem is commonly misunderstood as “aligning the AI’s values/preferences with humans.” As defined here, alignment is narrower: it asks only that the AI try to do what its operator wants, not that it learn or share the full breadth of human values.

