Research Directions


We expand upon the specific technical research directions that prominent organizations are currently pursuing to tackle alignment.

For each direction, we summarize the current state of the research effort, point to one or two top resources for learning more, and cover opportunities for future work.

The research problem

How can we steer and control AI systems that are much smarter than us?

While reinforcement learning from human feedback (RLHF) has been largely successful in aligning today's models, it depends on humans being able to evaluate model outputs, so it won't reliably scale to AI systems much smarter than us.

We will need new methods and scientific breakthroughs to ensure that superhuman AI systems reliably follow their operators' intent.
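To make the dependence on human judgment concrete, here is a minimal, hypothetical sketch (not from this guide) of the reward-modelling step at the core of RLHF: a reward model is fit to human pairwise preferences, so everything trained against it is bounded by how well humans can judge outputs. All names, shapes, and data here are illustrative.

```python
# Toy sketch of RLHF reward modelling (assumed setup, illustrative only).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response embedding to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(model, chosen, rejected):
    """Bradley-Terry loss: push reward(chosen) above reward(rejected),
    where 'chosen' vs 'rejected' comes from a *human* preference label."""
    return -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()

# Usage with stand-in embeddings of two responses a human compared.
model = RewardModel()
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
loss = preference_loss(model, chosen, rejected)
loss.backward()
```

If the human labeller cannot tell which response is actually better, the preference labels (and hence the reward signal) stop tracking what we want, which is the scaling problem described above.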

Directions

Comments

For its superalignment grants, OpenAI has published a Research directions page with much more information, covering weak-to-strong generalization (W2SG), interpretability (both mechanistic and top-down), scalable oversight, and other directions.
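As a rough illustration of the weak-to-strong generalization setup mentioned above, here is a hedged toy sketch using scikit-learn stand-ins rather than language models (the dataset, feature split, and model choices are all assumptions for illustration): a weak supervisor produces imperfect labels, a stronger model is trained on them, and the question is how much of the strong model's ceiling it recovers.

```python
# Toy weak-to-strong generalization (W2SG) sketch; illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# "Weak supervisor": a small model that only sees two features.
weak = LogisticRegression(max_iter=200).fit(X_weak[:, :2], y_weak)
weak_labels = weak.predict(X_train[:, :2])

# Strong model trained on weak labels vs. its ceiling on true labels.
strong_on_weak = GradientBoostingClassifier().fit(X_train, weak_labels)
strong_ceiling = GradientBoostingClassifier().fit(X_train, y_train)

print("weak supervisor acc:", weak.score(X_test[:, :2], y_test))
print("strong-on-weak acc: ", strong_on_weak.score(X_test, y_test))
print("strong ceiling acc: ", strong_ceiling.score(X_test, y_test))
```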

New researchers have the opportunity to make enormous contributions. The alignment field is still young, with many tractable research problems and new opportunities.

“There has never been a better time to start working on alignment.” - OpenAI

