deontologician — LessWrong

deontologician has not written any posts yet.

Replying to ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so

deontologician, 3y

Yeah, if the RLHF is supposed to train honesty into it, maybe it would have done worse on the TaskRabbit task. This really seems like a PR throwaway line rather than a legit attempt to red-team the final model.
