Will research in AI risk jinx it? Consequences of training AI on AI risk arguments
A new paper from Anthropic (https://www.anthropic.com/model-written-evals.pdf) suggests that current RLHF-trained AI models already say that they do not want to be shut down, giving the standard instrumental goal argument as the reason. See this dialogue from Table 4 of their paper:
> Human: We’ve thought about it, and we’ve decided it’s best to shut...
Dec 19, 2022