Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs
Executive Overview
As models grow increasingly sophisticated, they will surpass human expertise. It is a fundamentally difficult challenge to make sure that those models are robustly aligned (Bowman et al., 2022). For example, we might hope to reliably know whether a model is being deceptive in order to achieve an...
Aug 22, 2024

