To me it seems that the mesa-optimizer in the toy example is:

  • aligned with its goal - it reaches the door (which it incorrectly identifies)
  • dysfunctional - it incorrectly identifies doors.

From a consequentialist perspective this may be irrelevant, but from a safety point of view the distinction is an important one.

In the context of this article, I believe that misalignment (pseudo-alignment) would occur when the mesa-optimizer's goal diverges from its original goal (changes completely, is extended, etc.).

(As a secondary point that I haven't thought much about: it seems problematic to discuss alignment unless the mesa-optimizer's goal liberally contains the base goal, e.g. "find doors in order to achieve O_base".)