How do we actually align AI?
How do we, humans, actually go about aligning an intelligence far superior to our own?
After reading this, you will have a framework you can use to think about aligning superintelligence.
“We do have one advantage: we get to build the stuff.” - Nick Bostrom
Many people and organizations have proposed methods for aligning AI. We believe that Leopold Aschenbrenner's proposal is the most plausible and potentially successful. In particular, he decomposes the problem along the two significant ways in which AGI will be superhuman: quantitatively and qualitatively.
Decomposing the problem
Quantitative Capabilities: Superhuman AGI will be able to think orders of magnitude faster than we can, internalize information far more quickly, and generate millions of lines of code in seconds. In a way, quantitative capabilities are scaled-up, faster versions of abilities humans already have.
Qualitative Capabilities: Superhuman AGI will also be able to formulate and solve problems beyond human comprehension using frameworks that humans have not discovered. A classic early example of this is AlphaGo’s move 37 against Lee Sedol: an unusual move that human experts initially dismissed as a mistake, yet proved decisive.
Aligning quantitatively superior AGI
A combination of scaled-up versions of current alignment methods is likely to be enough to (mostly) solve this aspect of the problem.
Importantly, at this stage, humans will be in the loop and can still verify what these AI systems are actually doing.
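To make the human-in-the-loop idea concrete, here is a minimal Python sketch of spot-checking a quantitatively superhuman system. Everything in it is hypothetical: `model_generate` and `human_review` stand in for a model API call and a human rater, and the audit rate is an arbitrary illustrative parameter.

```python
import random

def model_generate(task: str) -> str:
    """Hypothetical placeholder for a call to the AI system being supervised."""
    return f"proposed solution for: {task}"

def human_review(task: str, output: str) -> bool:
    """Hypothetical placeholder for a human verifier; True if the output looks safe and correct."""
    return True

def supervised_batch(tasks, audit_rate=0.1, seed=0):
    """Run the model on a batch of tasks, routing a random sample to human audit.

    While AI is only quantitatively superhuman, humans can still check its work
    directly, so spot-checking a fraction of outputs gives a meaningful safety signal.
    """
    rng = random.Random(seed)
    results = []
    for task in tasks:
        output = model_generate(task)
        audited = rng.random() < audit_rate
        passed = human_review(task, output) if audited else None
        results.append({"task": task, "output": output, "audited": audited, "passed": passed})
    return results

if __name__ == "__main__":
    for record in supervised_batch(["refactor module A", "write unit tests for B"], audit_rate=0.5):
        print(record)
```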
Aligning qualitatively superior AGI
A different paradigm will be necessary for aligning qualitatively superior AGI: it will be crucial to automate alignment research.
As Leopold discusses in his essay, AI capability research will certainly become automated, and alignment research will also have to be automated in order to keep up. The rough framework for automating alignment research goes as follows (a toy sketch of the loop appears after the list):
- Build an AGI system S0 that is slightly above human-level and that we can trust to do alignment research
- Use system S0 to build another system S1 that is slightly smarter than S0
- Repeat inductively.
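Here is a toy Python sketch of that inductive hand-off. It is purely illustrative, not Aschenbrenner's method in code: `train_successor` and `alignment_evals_pass` are hypothetical placeholders, and each hides what would in reality be a major research effort.

```python
from dataclasses import dataclass

@dataclass
class System:
    index: int          # S0, S1, S2, ...
    capability: float   # abstract capability score, purely illustrative

def train_successor(current: System, step: float = 0.1) -> System:
    """Hypothetical: the trusted system S_i does the alignment research needed
    to train a slightly more capable successor S_{i+1}."""
    return System(index=current.index + 1, capability=current.capability + step)

def alignment_evals_pass(system: System) -> bool:
    """Hypothetical gate: only proceed if the new system passes our alignment evaluations."""
    return True  # stand-in; in reality, building this check is the hard part

def bootstrap(s0: System, target_capability: float) -> System:
    """Repeat the hand-off inductively until the target capability is reached,
    halting immediately if any successor fails its alignment checks."""
    current = s0
    while current.capability < target_capability:
        successor = train_successor(current)
        if not alignment_evals_pass(successor):
            raise RuntimeError(f"S{successor.index} failed alignment evals; halt escalation")
        current = successor
    return current

if __name__ == "__main__":
    final = bootstrap(System(index=0, capability=1.05), target_capability=2.0)
    print(f"Reached S{final.index} with capability {final.capability:.2f}")
```

The key property of the loop is the hard stop: escalation only continues while each successor passes the alignment checks that the previous, trusted system helped design.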
Necessary actions for both aspects
- Interpretability: Advances in interpretability, at both the mechanistic and the top-down level, are also likely to be extremely helpful in aligning AI.
- Adversarial Measurement: We don’t have effective methods to determine whether the systems we are building will be safe post-training, so improving our current approaches to red-teaming models will likely be crucial (a toy sketch follows below).
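As a rough illustration of what such adversarial measurement could look like in practice, below is a minimal red-teaming harness sketch. The names `model_respond` and `is_unsafe` are hypothetical stand-ins for the model under test and a safety judgment (an automated classifier or a human reviewer).

```python
def model_respond(prompt: str) -> str:
    """Hypothetical placeholder for querying the post-trained model under evaluation."""
    return "refusal or completion"

def is_unsafe(prompt: str, response: str) -> bool:
    """Hypothetical safety judgment (automated classifier or human red-teamer)."""
    return False

def red_team(adversarial_prompts, threshold=0.0):
    """Probe the model with adversarial prompts and report the failure rate;
    a rate above `threshold` means the model should not be considered safe to deploy."""
    failures = [p for p in adversarial_prompts if is_unsafe(p, model_respond(p))]
    rate = len(failures) / max(len(adversarial_prompts), 1)
    return {"failure_rate": rate, "failures": failures, "passes_gate": rate <= threshold}

if __name__ == "__main__":
    prompts = ["attempted jailbreak A", "attempted jailbreak B"]
    print(red_team(prompts))
```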