How do we actually align AI?
How do we, humans, actually go about aligning an intelligence far superior to our own?
After reading this, you will have a framework you can use to think about aligning superintelligence.
“We do have one advantage: we get to build the stuff.” - Nick Bostrom
Many people and organizations have proposed methods for aligning AI. We believe that Leopold Aschenbrenner's proposal is the most plausible and potentially successful. In particular, he decomposes the problem along the two significant ways in which AGI will be superhuman: quantitatively and qualitatively.
Decomposing the problem
Quantitative Capabilities: Superhuman AGI will be able to think orders of magnitude faster than we can, internalize information far more quickly, and generate millions of lines of code in seconds. In a way, quantitative capabilities are scaled-up, faster versions of abilities humans already have.
Qualitative Capabilities: Superhuman AGI will also be able to formulate and solve problems beyond human comprehension using frameworks that humans have not discovered. A classic early example of this is AlphaGo’s move 37 against Lee Sedol: an unusual move that human experts initially dismissed as a mistake, yet proved decisive.
Aligning quantitatively superior AGI
A combination of scaled-up versions of current alignment methods is likely to be enough to (mostly) solve this aspect of the problem.
Importantly, at this stage, humans will be in the loop and can still verify what these AI systems are actually doing.
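To make the human-in-the-loop idea concrete, here is a minimal Python sketch of spot-checking a quantitatively superhuman system. Everything in it is hypothetical: `model_generate` and `human_review` stand in for a model API call and a human rater, and the audit rate is an arbitrary illustrative parameter.

```python
import random

def model_generate(task: str) -> str:
    """Hypothetical placeholder for a call to the AI system being supervised."""
    return f"proposed solution for: {task}"

def human_review(task: str, output: str) -> bool:
    """Hypothetical placeholder for a human verifier; True if the output looks safe and correct."""
    return True

def supervised_batch(tasks, audit_rate=0.1, seed=0):
    """Run the model on a batch of tasks, routing a random sample to human audit.

    While AI is only quantitatively superhuman, humans can still check its work
    directly, so spot-checking a fraction of outputs gives a meaningful safety signal.
    """
    rng = random.Random(seed)
    results = []
    for task in tasks:
        output = model_generate(task)
        audited = rng.random() < audit_rate
        passed = human_review(task, output) if audited else None
        results.append({"task": task, "output": output, "audited": audited, "passed": passed})
    return results

if __name__ == "__main__":
    for record in supervised_batch(["refactor module A", "write unit tests for B"], audit_rate=0.5):
        print(record)
```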
Aligning qualitatively superior AGI
A different paradigm will be necessary for aligning qualitatively superior AGI: it will be crucial to automate alignment research.
As Leopold discusses in his essay, AI capability research will certainly become automated, and alignment research will also have to be automated in order to keep up. The rough framework for automating alignment research goes as follows (a toy sketch of the loop appears after the list):
- Build an AGI system S0 that is slightly above human-level and that we can trust to do alignment research
- Use system S0 to build another system S1 that is slightly smarter than S0
- Repeat inductively.
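Here is a toy Python sketch of that inductive hand-off. It is purely illustrative, not Aschenbrenner's method in code: `train_successor` and `alignment_evals_pass` are hypothetical placeholders, and each hides what would in reality be a major research effort.

```python
from dataclasses import dataclass

@dataclass
class System:
    index: int          # S0, S1, S2, ...
    capability: float   # abstract capability score, purely illustrative

def train_successor(current: System, step: float = 0.1) -> System:
    """Hypothetical: the trusted system S_i does the alignment research needed
    to train a slightly more capable successor S_{i+1}."""
    return System(index=current.index + 1, capability=current.capability + step)

def alignment_evals_pass(system: System) -> bool:
    """Hypothetical gate: only proceed if the new system passes our alignment evaluations."""
    return True  # stand-in; in reality, building this check is the hard part

def bootstrap(s0: System, target_capability: float) -> System:
    """Repeat the hand-off inductively until the target capability is reached,
    halting immediately if any successor fails its alignment checks."""
    current = s0
    while current.capability < target_capability:
        successor = train_successor(current)
        if not alignment_evals_pass(successor):
            raise RuntimeError(f"S{successor.index} failed alignment evals; halt escalation")
        current = successor
    return current

if __name__ == "__main__":
    final = bootstrap(System(index=0, capability=1.05), target_capability=2.0)
    print(f"Reached S{final.index} with capability {final.capability:.2f}")
```

The key property of the loop is the hard stop: escalation only continues while each successor passes the alignment checks that the previous, trusted system helped design.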
Necessary actions for both aspects
- Interpretability: Advances in interpretability, at both the mechanistic and the top-down level, are also likely to be extremely helpful in aligning AI.
- Adversarial Measurement: We don’t have effective methods to determine whether the systems we are building will be safe post-training, so improving our current approaches to red-teaming models will likely be crucial (a toy sketch follows below).
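As a rough illustration of what such adversarial measurement could look like in practice, below is a minimal red-teaming harness sketch. The names `model_respond` and `is_unsafe` are hypothetical stand-ins for the model under test and a safety judgment (an automated classifier or a human reviewer).

```python
def model_respond(prompt: str) -> str:
    """Hypothetical placeholder for querying the post-trained model under evaluation."""
    return "refusal or completion"

def is_unsafe(prompt: str, response: str) -> bool:
    """Hypothetical safety judgment (automated classifier or human red-teamer)."""
    return False

def red_team(adversarial_prompts, threshold=0.0):
    """Probe the model with adversarial prompts and report the failure rate;
    a rate above `threshold` means the model should not be considered safe to deploy."""
    failures = [p for p in adversarial_prompts if is_unsafe(p, model_respond(p))]
    rate = len(failures) / max(len(adversarial_prompts), 1)
    return {"failure_rate": rate, "failures": failures, "passes_gate": rate <= threshold}

if __name__ == "__main__":
    prompts = ["attempted jailbreak A", "attempted jailbreak B"]
    print(red_team(prompts))
```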