This is a linkpost for https://arxiv.org/abs/2310.19852
This is pretty good. It covers a lot, being a grab bag of things. I particularly enjoyed the scalable oversight sections, which succinctly explain debate, recursive reward modelling, etc. There were also some gems I hadn't encountered before, like the concept of training out agentic behavior by punishing side-effects.
If anyone wants the HTML version of the paper, it is here.
We have just released an academic survey of AI alignment.
We identify four main categories of alignment research: learning from feedback, learning under distribution shift, assurance, and governance.
We focused mainly on academic references but also included some posts from LessWrong and other forums. We would love to hear from the community about any references we missed, or anything that was unclear or misstated. We hope this can be a good starting point for AI researchers who are unfamiliar with current efforts in AI alignment.