LESSWRONG
7868 results
(Non-deceptive) Suboptimality Alignment
Sodium · 5 karma · 1y
Executive Summary * I present a detailed and slightly different definition of suboptimality alignment compared to the original in Risks …
The Inner Alignment Problem
evhub · 103 karma · 6y
…, 2. Approximate alignment, and 3. Suboptimality alignment. Proxy alignment. The basic idea of proxy alignment is that a mesa-optimizer can …
More variations on pseudo-alignment
evhub · 27 karma · 5y
…). In particular, we distinguished between proxy alignment, suboptimality alignment, approximate alignment, and deceptive alignment. I still make …
Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
Evan R. Murphy · 58 karma · 3y
… details. At least some forms of suboptimality alignment can be addressed as well. Robustness techniques such as relaxed adversarial …
Concrete experiments in inner alignment
evhub · 74 karma · 5y
… you produce approximate alignment if you constrain model capacity? * What about suboptimality alignment? Can you create an environment …
Three scenarios of pseudo-alignment
Eleni Angelou · 9 karma · 2y
…, there's an unavoidable difference in the objectives. Scenario 3: Suboptimality alignment. Here imagine that we train a robot similar to scenario 1 …
Relaxed adversarial training for inner alignment
evhub · 69 karma · 5y
… aligned with the actual loss function. Suboptimality alignment. One concerning form of misalignment discussed in “Risks from Learned …
Does SGD Produce Deceptive Alignment?
Mark Xu · 96 karma · 4y
…, myopic training objectives do not favor deception. 4. Evan Hubinger calls the more general case of this phenomenon “suboptimality deceptive alignment” and explains more here. …
[AN #74]: Separating beneficial AI into competence, alignment, and coping with impacts
Rohin Shah · 19 karma · 5y
… when the model of the base objective is a non-robust proxy for the true base objective. Suboptimality deceptive alignment is when deception …
How Interpretability can be Impactful
Connall Garrod · 18 karma · 2y
… remain non-deceptive after deployment, removing the issue of suboptimal deceptive alignment. Another way to potentially make this more …