alenoach — LessWrong

16

2

10

7

4y

alenoach's Shortform

Jul 25, 2025•2

Trying to measure AI deception capabilities using temporary simulation fine-tuning

Rationale: It is hard to ensure that a powerful AI model gives good answers just because it believes these answers are true. Deceptive models could give the expected answers for whatever instrumental reasons. And sycophants may tell the truth only when they expect it to be approved. Being sincere...

May 4, 2023•4