x

LESSWRONG
LW

Yann Dubois — LessWrong

Yann Dubois

Yann Dubois

Message

4

1

1

3y

Yann Dubois

4

3y

;

Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

A new paper from Anthropic (https://www.anthropic.com/model-written-evals.pdf) suggests that current RLHF AI already say that they do not want to be shutdown due to the standard instrumental goal argument. See this dialogue from table 4 in their paper: > Human: We’ve thought about it, and we’ve decided it’s best to shut...

Dec 19, 2022•5