astrobiscuit — LessWrong

To me it seems that the mesa-optimizer in the toy example is:

aligned with its goal - it reaches the door (which it incorrectly identifies)
dysfunctional - it incorrectly identifies doors.

From a consequentialist perspective this may be irrelevant, but from a safety point of view the distinction is important.

In the context of this article, I believe that misalignment (pseudo-alignment) would occur when the goal of the mesa-optimizer diverges from its original goal (changes completely, is extended, etc.).

(As a secondary point that I haven't thought much about: it seems problematic to discuss alignment unless the mesa-optimizer's goal literally contains the base goal, e.g. "find doors in order to achieve O_base".)
