I think the better phrasing would be "is the model going to do what the humans trained (or told) it to do?" (specifying a goal you really want is outer alignment).

I'm not actually sure about the difference here between this tag and Mesa-Optimizers. I'm guessing the distinction was intended to be:

Mesa-Optimizers: Under what conditions do mesa-optimizers arise, and how can we detect or prevent them (if we want to, and if that's possible)?

Inner Alignment: How do you cause mesa-optimizers to have the same goal as the base optimizer? (Or maybe, more generally, how do you cause mesa-optimizers to have good desired properties?)

Or is 'Inner Alignment' meant to be a subcategory of 'Mesa-Optimizers'? (A toy sketch of the guessed distinction is below.)
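As a toy illustration of the distinction guessed at above (the maze setup, grid contents, and function names here are all made up for illustration, not taken from either tag's text): the base objective rewards reaching the maze exit, but during training the exit always happens to be green, so a learned planner whose internal search target (its mesa-objective) is "nearest green cell" looks perfectly aligned until a test maze breaks the correlation. A minimal Python sketch:

```python
from collections import deque

def nearest(grid, start, predicate):
    """Breadth-first search from `start`; return the first cell whose
    contents satisfy `predicate`, or None if no such cell exists."""
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if predicate(grid[r][c]):
            return (r, c)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return None

def base_objective_satisfied(grid, cell):
    """What the designers reward: ending up at the exit (a cell marked 'E')."""
    r, c = cell
    return 'E' in grid[r][c]

def mesa_policy(grid, start):
    """The learned policy is itself a search process, but its internal goal is
    'nearest green cell' ('G') -- a proxy that coincided with the exit in training."""
    return nearest(grid, start, lambda contents: 'G' in contents)

# Training maze: the exit is green, so the proxy goal and the base objective agree.
train_maze = [['.', '.', '.'],
              ['.', '.', '.'],
              ['.', '.', 'GE']]  # green exit
# Test maze: a green cell appears away from the exit.
test_maze  = [['.', '.', 'G'],
              ['.', '.', '.'],
              ['.', '.', 'E']]

for name, maze in (("train", train_maze), ("test", test_maze)):
    reached = mesa_policy(maze, (0, 0))
    print(f"{name}: policy reaches {reached}, "
          f"base objective satisfied: {base_objective_satisfied(maze, reached)}")
# train: base objective satisfied: True  -> the proxy looks aligned during training
# test:  base objective satisfied: False -> the mesa-objective diverges off-distribution
```

Nothing here detects or prevents a mesa-optimizer; the point is only that the mesa-objective ("nearest green") and the base objective ("reach the exit") can agree on every training case and still come apart at test time, which is roughly the gap the Inner Alignment framing seems to be pointing at.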
This seems inaccurate to me. An AI can be inner aligned and still not aligned if we solve inner alignment but mess up outer alignment.
This text also shows up in the outer alignment tag: Outer Alignment - LessWrong
I've made an edit to remove this part.
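For concreteness, the decomposition being appealed to here, as a deliberately over-simplified sketch (the objective strings and function names are made up for illustration, not taken from the tag text): being inner aligned means the objective the model actually pursues matches the objective it was trained on, but that doesn't make the system aligned overall if the training objective itself wasn't what we wanted.

```python
def outer_aligned(intended_goal: str, base_objective: str) -> bool:
    """Outer alignment: the objective we actually train on matches what we want."""
    return base_objective == intended_goal

def inner_aligned(base_objective: str, mesa_objective: str) -> bool:
    """Inner alignment: the objective the learned model pursues matches the
    objective it was trained on."""
    return mesa_objective == base_objective

def aligned(intended_goal: str, base_objective: str, mesa_objective: str) -> bool:
    """Overall alignment (in this toy framing) requires both."""
    return (outer_aligned(intended_goal, base_objective)
            and inner_aligned(base_objective, mesa_objective))

intended      = "be genuinely helpful to the user"
trained_on    = "maximize thumbs-up from the user"   # mis-specified proxy (outer failure)
model_pursues = "maximize thumbs-up from the user"   # model faithfully adopts it

print(inner_aligned(trained_on, model_pursues))      # True  -- inner aligned
print(aligned(intended, trained_on, model_pursues))  # False -- still not aligned
```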