Jojo Yang — LessWrong

22

2y

Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs

Executive Overview: As models grow increasingly sophisticated, they will surpass human expertise. Ensuring that those models are robustly aligned is a fundamentally difficult challenge (Bowman et al., 2022). For example, we might hope to reliably know whether a model is being deceptive in order to achieve an...

Aug 22, 2024 • 23