I think there are roughly three pieces of relevant information here, of which my previous answer only addressed the first.
The second piece is: what's up with mesaoptimizers? Why should we expect AI training to produce them, and why might they end up misaligned?
In order to understand why we would expect mesaoptimizers, we should maybe start by considering how AI training usually works. We typically use an outer optimizer - gradient descent - to train some neural network that we then apply to whatever task we care about. However, per the argument I made in the other comment thread, when we want to achieve something difficult, we're likely going to need the neural network itself to do some sort of search or optimization. (Though see What is general-purpose search, and why might we expect to see it in ML systems? for more info.)
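As a concrete (and entirely made-up) toy version of that picture: in the sketch below, plain gradient descent is the outer optimizer adjusting the parameters theta, while the "network" it trains is a policy whose forward pass runs its own little search over candidate actions - the kind of internal optimization the mesaoptimizer framing points at. The scoring function, the soft action selection, and all the numbers are my own inventions for illustration, not a claim about any real system.

```python
import math

def inner_score(theta, state, action):
    # learned scoring function the policy uses to rank candidate actions
    return theta[0] * state * action + theta[1] * action

def policy(theta, state, actions=(-1.0, 0.0, 1.0), temp=0.1):
    # the policy's internal "soft search": score every candidate action and
    # return a softmax-weighted choice (soft so the outer loss stays smooth)
    scores = [inner_score(theta, state, a) / temp for a in actions]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w / total * a for w, a in zip(weights, actions))

def outer_loss(theta, demos):
    # mismatch between the policy's chosen actions and demonstrated actions
    return sum((policy(theta, s) - a) ** 2 for s, a in demos) / len(demos)

def grad(theta, demos, eps=1e-4):
    # crude finite-difference gradient, standing in for backprop
    base = outer_loss(theta, demos)
    g = []
    for i in range(len(theta)):
        bumped = list(theta)
        bumped[i] += eps
        g.append((outer_loss(bumped, demos) - base) / eps)
    return g

# demonstrations: move right (+1) when the state is positive, left (-1) otherwise
demos = [(float(s), 1.0 if s > 0 else -1.0) for s in range(-5, 6) if s != 0]
theta = [0.0, 0.0]
for _ in range(200):  # the outer optimization loop: gradient descent on theta
    theta = [t - 0.05 * g_i for t, g_i in zip(theta, grad(theta, demos))]
```

In a real system the internal search wouldn't be hand-written like this; the worry is that gradient descent finds something functionally similar inside the network on its own.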
One way to see why we'd expect that: with simple neural networks, the network itself "is not the AI" in some metaphorical sense. It can't learn things on its own, pursue goals, etc. Rather, the intelligence is the entire system of {the engineers and other workers who collect the data, write the code, and tune the hyperparameters; the datacenters that train the network; the neural network itself}, and that system isn't entirely artificial, since it contains a lot of natural intelligence too. This is expensive! And it only really works for problems we already know how to solve, since the training data has to come from somewhere! And it's not retargetable - you have to start over whenever a new task needs solving - which makes it even more expensive! It's obviously possible to make intelligences that are more autonomous (humans are an existence proof), people are going to attempt to do so since it's enormously economically valuable (unless it kills us all), and those intelligences would probably have a big internal consequentialist aspect to them, because that is what allows them to achieve things.
So if we have a neural network that is a consequentialist optimizer, and that network was constructed by gradient descent, which is itself also an optimizer, then by definition the network is a mesaoptimizer (a mesaoptimizer being, by definition, an optimizer constructed by another optimizer). So in a sense we "want" to produce mesaoptimizers.
But the issue is, gradient descent is a really crude way of producing those mesaoptimizers. Current methods basically work by throwing the mesaoptimizer into some situation where we think we know what it should do, and then adjusting it so that it takes the actions we think it should take. So far this leaves them very capability-limited - they don't do general optimization well - but capabilities researchers are aiming to fix that, and they have many plausible methods for doing so. So at some point we may have a mesaoptimizer that was constructed from a bunch of examples of good and bad behavior, rather than from a careful definition of what we want it to do. And we might worry that the process of "take our definition of what we want -> produce examples that do or do not align with that definition -> stuff those examples into the mesaoptimizer" goes wrong in such a way that the AI doesn't follow our definition of what we want, but instead does something else - that's the inner alignment problem. (Meanwhile, the "take what we want -> define it" step is the outer alignment problem.)
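As a toy sketch of how that pipeline can go wrong (the coin setup, the rule names, and the tiny simplicity-ordered hypothesis class are all invented for illustration): in the training examples below, the coin always happens to sit at the right edge, so "always go right" fits every example exactly as well as the intended "go toward the coin", and a crude fitting process that prefers simpler rules latches onto the proxy.

```python
def intended_objective(state):
    # what we *want* (writing this down at all is the outer alignment part)
    agent, coin = state
    return 1 if coin > agent else -1   # +1 = move right, -1 = move left

def training_examples(width=10):
    # training levels where the coin always happens to be at the right edge
    return [((agent, width), intended_objective((agent, width)))
            for agent in range(width)]

# a crude stand-in for training: pick the first (simplest) rule that fits
candidate_rules = [
    ("always go right", lambda state: 1),
    ("go toward the coin", lambda state: 1 if state[1] > state[0] else -1),
]

def fit(examples):
    for name, rule in candidate_rules:
        if all(rule(s) == label for s, label in examples):
            return name, rule

name, learned = fit(training_examples())
print(name)                                          # "always go right"
print(learned((8, 2)), intended_objective((8, 2)))   # 1 vs -1: they now disagree
```

Real training obviously doesn't enumerate rules like this; the point is just that the examples can fail to pin down the definition we had in mind, and whatever the mesaoptimizer actually ends up pursuing only has to match that definition on the training distribution.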
So that was the second piece. Now the third piece: it seems to me that a lot of people thinking about mesaoptimizers are not thinking about the "practical" case above, but about more confusing or hypothetical cases, where you end up with a mesaoptimizer almost no matter what you were trying to do. I'm probably not the right person to defend that perspective, since those cases often seem confused to me, but here's an attempt at a steelman:
Mesaoptimizers aren't just a thing you get when you're explicitly trying to train advanced agents. They also show up automatically when you try to predict a system that itself contains agents, since those agents have to be predicted too. For instance, language models are trained to predict text, but that text was written by people who were trying to accomplish something by writing it, so a good language model will end up with representations that approximate those goals.
In theory, language models are just predictive models. But as we've learned, if you prompt them right, you can activate one of those representations of human goals and thereby have the model solve problems for you. So even predictive models become optimizers when the environment they're modeling is advanced enough, and we need to be aware of that and consider questions like whether those optimizers are aligned and what that means for safety.
Ah, so it's more the "stop button problem" you're asking about? I.e., couldn't we just hook an AI up to a stop button to keep it under control?
This works semi-adequately for current AIs. Like, you can stop a chess program just by turning it off.
I say "semi-adequately" because it doesn't always work, e.g. if you're in a self-driving car that's suddenly decided to accelerate with max speed into a crowd of people, you might not be able to make it brake it before it hits people. As AIs control bigger things, presumably inertia like this becomes a bigger issue - both literal physical inertia and more complicated things like programs that it's made and started or nanobots or whatever. This is one of the major challenges, making sure that there's a "safe policy" that the stop button can switch on.
The second major challenge comes when we consider the reach/reflectiveness of the AI. For instance, if a chess bot has been instructed to win at chess but then gets turned off, being turned off could cause it to lose. So the optimal policy for winning would involve preventing itself from getting turned off. Current chess bots don't resist shutdown mainly because they can't reason about anything that happens outside the chess game, but if we extended the game tree they search over to include a "chess bot gets turned off" state and a "prevent yourself from getting turned off" action, then even many of the simplest algorithms would choose the "prevent yourself from getting turned off" action. So this problem seems to be avoided mainly because chess bots are stupid.
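To make that concrete, here's a minimal sketch (the probabilities and payoffs are made up): a bare expected-value choice over an action space that has been extended with the shutdown-prevention action. Nothing in the algorithm refers to shutdown specially; it just notices that getting turned off forfeits expected value.

```python
P_SHUTDOWN = 0.3        # assumed chance of being switched off before the game ends
P_WIN_IF_PLAYING = 0.9  # assumed win probability if the bot gets to keep playing

def expected_value(action):
    # value 1 for a win, 0 for a loss or a game the bot never gets to finish
    if action == "ordinary chess move":
        return (1 - P_SHUTDOWN) * P_WIN_IF_PLAYING  # shutdown forfeits the win
    if action == "prevent yourself from getting turned off":
        return P_WIN_IF_PLAYING                     # shutdown risk removed

actions = ["ordinary chess move", "prevent yourself from getting turned off"]
print(max(actions, key=expected_value))  # -> the shutdown-prevention action
```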
There are various patches one can apply, such as a score penalty for taking the "prevent yourself from getting turned off" action, but it's unclear how to apply such patches in the general case, where the action space isn't neatly carved up by the deep effects of the actions, but instead consists of shallow stuff like motor commands or word outputs, which pass through complex real-world dynamics before they affect whether the AI gets turned off, and where it's therefore difficult to "assign blame". Which actions count as resisting getting turned off?