Executive Overview
As models grow increasingly sophisticated, they will surpass human expertise in many domains, and ensuring that such models are robustly aligned is a fundamentally difficult challenge (Bowman et al., 2022). For example, we might hope to reliably detect whether a model is being deceptive in order to achieve an instrumental goal. Importantly, deceptive alignment and robust alignment can be behaviorally indistinguishable (Hubinger et al., 2024). Current predominant alignment methods control only the model's outputs while leaving its internals unexamined (black-box access; Casper et al., 2024). However, recent literature has begun to demonstrate that examining the internals of models provides additional predictive power that behavioral observation alone cannot afford.