Inner alignment is the problem of ensuring that mesa-optimizers (i.e., trained ML systems that are themselves optimizers) are aligned with the objective function of the training process.
Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?
As an example, evolution is an optimization force that itself 'designed' optimizers (humans) to achieve its goals. However, humans do not primarily maximize reproductive success; instead, they use birth control while still obtaining the pleasure that evolution intended as a reward for attempts at reproduction. This is a failure of inner alignment.
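To make the analogy concrete, here is a minimal toy sketch (my own illustration, not part of the standard definition): a learned policy maximizes a proxy objective ("pleasure") that is perfectly correlated with the base objective ("reproductive success") on the training distribution, but the two come apart once the environment shifts. All names and numbers are illustrative assumptions.

```python
# Toy sketch of an inner alignment failure via distribution shift.
# The "mesa-optimizer" has internalized a proxy objective that matched
# the base objective during training, but not at deployment.

def base_objective(action, env):
    """'Reproductive fitness': reward only when mating actually yields offspring."""
    return action["mating"] * env["fertility"]

def proxy_objective(action):
    """'Pleasure': the objective the learned policy actually pursues."""
    return action["mating"]

# Training distribution: no contraception, so mating ~ offspring.
train_env = {"fertility": 1.0}
# Deployment distribution: contraception available, correlation broken.
deploy_env = {"fertility": 0.05}

def proxy_maximizing_policy(_env):
    # The policy ignores the environment and just maximizes its proxy.
    return {"mating": 1.0}

for name, env in [("training", train_env), ("deployment", deploy_env)]:
    act = proxy_maximizing_policy(env)
    print(f"{name}: proxy reward = {proxy_objective(act):.2f}, "
          f"base objective = {base_objective(act, env):.2f}")
```

On the training distribution the proxy-maximizer scores perfectly on both objectives; at deployment it still maxes out its proxy while scoring near zero on the base objective, which is the shape of the birth-control example above.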
What are the main proposals for solving the problem? Are there any posts/articles/papers specifically dedicated to addressing this issue?
Evolution was working within tight computational efficiency limits (the human brain burns roughly 1/6 of our total calories), and it used an evolutionary algorithm rather than a gradient-descent training scheme, which is significantly less efficient. On top of that, we're now running the human brain well outside its training distribution (there were no condoms on the savannah). Nevertheless, the human population is 8 billion and counting, and we dominate basically every terrestrial ecosystem on the planet. I think some people overplay how much inner alignment failure there is between human instincts and human genetic fitness.
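As a rough illustration of the efficiency point (a toy example of my own, not from the comment above): on a smooth objective, gradient descent gets a descent direction from every evaluation, while a simple (1+1) evolutionary algorithm has to stumble onto improvements by random mutation.

```python
# Toy comparison: gradient descent vs. a (1+1) evolutionary algorithm
# on the same smooth objective, with the same evaluation budget.
import random

def loss(x):
    return (x - 3.0) ** 2  # toy objective, optimum at x = 3

def grad(x):
    return 2.0 * (x - 3.0)  # analytic gradient of the toy objective

# Gradient descent: one gradient evaluation per step.
x = 0.0
for _ in range(50):
    x -= 0.1 * grad(x)
print(f"gradient descent, 50 evals: x = {x:.4f}, loss = {loss(x):.6f}")

# (1+1) evolutionary algorithm: keep a mutation only if it lowers the loss.
random.seed(0)
x = 0.0
for _ in range(50):
    candidate = x + random.gauss(0.0, 0.5)
    if loss(candidate) < loss(x):
        x = candidate
print(f"(1+1) EA, 50 evals:        x = {x:.4f}, loss = {loss(x):.6f}")
```

With the same 50 evaluations, gradient descent lands essentially on the optimum while the evolutionary search is still wandering nearby, which is the sense in which evolution was working with a much weaker optimizer than modern training schemes.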
So: