Is anyone developing optimisation-robust interpretability methods?

[ Question ]

by Jono

11th Jun 2024

By optimisation-robust I mean a method that withstands point 27 from AGI Ruin:

When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.
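To make the quoted failure mode concrete, here is a toy sketch of my own (not from AGI Ruin). Everything in it is an assumption made for illustration: the two scalar quantities `unaligned` and `hidden`, the hypothetical detector formula, and the plain gradient-descent loop. It only shows that minimising a detector's score can lower the score by partly reducing the unaligned quantity and partly hiding it.

```python
def detector_score(unaligned, hidden):
    """Hypothetical detector: it only sees the unaligned part that is not hidden."""
    return unaligned * (1.0 - hidden)


def clamp(x):
    """Keep a value inside [0, 1]."""
    return min(1.0, max(0.0, x))


def optimise_against_detector(unaligned=1.0, hidden=0.0, lr=0.1, steps=20):
    """Run gradient descent on the detector score with respect to both variables."""
    for _ in range(steps):
        grad_unaligned = 1.0 - hidden  # d(score)/d(unaligned)
        grad_hidden = -unaligned       # d(score)/d(hidden)
        # Simultaneous update: both gradients were computed from the old values.
        unaligned, hidden = (
            clamp(unaligned - lr * grad_unaligned),
            clamp(hidden - lr * grad_hidden),
        )
    return unaligned, hidden


unaligned, hidden = optimise_against_detector()
score = detector_score(unaligned, hidden)
print(f"unaligned={unaligned:.3f} hidden={hidden:.3f} detector_score={score:.3f}")
```

In this toy setup the detector score ends up lower than the remaining `unaligned` value, because part of the optimisation pressure went into raising `hidden`. An optimisation-robust method, in the sense asked about here, would be one where that second route is unavailable or at least much more expensive.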

Are you aware of any person or group that is working expressly on countering this failure mode?
