Probably is relevant, I am just having trouble distilling the whole post and thread to something succinct.
AI won't have wishes or desires. There is no correlation in the animal kingdom between desires and cognitive function (the desire to climb the social hierarchy or to have sex is preserved no matter the level of intelligence). Dumb humans want basically the same things as bright humans. All that suggests that predictive modeling of the world is totally decoupled from wishes and desires.
I suppose it is theoretically possible to build a system that also incorporates desires, but why would we do that? We want von Neumann's cognitive abilities, not von Neumann's personality.
There are sort of three pieces of relevant information here, of which my previous answer only addressed the first one.
The second one is, what's up with mesaoptimizers? Why should we expect an AI to have mesaoptimizers, and why might they end up misaligned?
In order to understand why we would expect mesaoptimizers, we should maybe start by considering how AI training usually works. We usually use an outer optimizer - gradient descent - to train some neural network that we want to apply to some purpose we have. However, per the argument I made in the other comment thread, when we want to achieve something difficult, we're likely going to have the neural network itself do some sort of search or optimization. (Though see What is general-purpose search, and why might we expect to see it in ML systems? for more info.)
One way one could see the above is, with simple neural networks, the neural network itself "is not the AI" in some metaphorical sense. It can't learn things on its own, pursue goals, etc. Rather, the entire system of {engineers and other workers who collect the data and write the code and tune the hyperparameters, datacenters that train the network, the neural network itself} is the intelligence, and it's not exactly entirely artificial, since it contains a lot of natural intelligence too. This is expensive! And it only really works for problems we already know how to solve, since the training data has to come from somewhere! And it's not retargetable; you have to start over if you have some new task that needs solving, which also makes it even more expensive! It's obviously possible to make intelligences that are more autonomous (humans are an existence proof), and people are going to attempt to do so since it's enormously economically valuable (unless it kills us all), and those intelligences would probably have a big internal consequentialist aspect to them, because that is what allows them to achieve things.
So, if we have a neural network or something which is a consequentialist optimizer, and that neural network was constructed by gradient descent, which itself is also an optimizer, then by definition that makes the neural network a mesaoptimizer (since mesaoptimizers by definition are optimizers constructed by other optimizers). So in a sense we "want" to produce mesaoptimizers.
But the issue is, gradient descent is a really crude way of producing those mesaoptimizers. The current methods basically work by throwing the mesaoptimizer into some situation where we think we know what it should do, and then adjusting it so that it takes the actions we think it should take. So far, this leaves them very capability-limited, as they don't do general optimization well, but capabilities researchers are aiming to fix that, and they have many plausible methods to improve them. So at some point, maybe we have some mesaoptimizer that was constructed through a bunch of examples of good and bad stuff, rather than through a careful definition of what we want it to do. And we might be worried that the process of "taking our definition of what we want -> producing examples that do or do not align with that definition -> stuffing those examples into the mesaoptimizer" goes wrong in such a way that the AI doesn't follow our definition of what we want, but instead does something else - that's the inner alignment problem. (Meanwhile the "take what we want -> define it" process is the outer alignment problem.)
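To make that outer-vs-inner picture concrete, here's a minimal toy sketch in PyTorch. Everything in it (the tiny network, the fake observations, the "actions we think it should take") is made up for illustration; the point is just that gradient descent is the outer optimizer, and whatever computation the trained network ends up implementing, possibly including internal search, is the candidate mesaoptimizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Trained by the outer optimizer. Nothing here forces the learned
    computation to be a simple reflex; in principle the weights could come
    to implement an internal search/planning procedure."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim)
        )

    def forward(self, obs):
        return self.net(obs)

policy = PolicyNet(obs_dim=8, act_dim=4)
outer_optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)  # gradient descent

for step in range(1000):
    obs = torch.randn(32, 8)                      # situations we throw it into
    target_actions = torch.randint(0, 4, (32,))   # what we think it should do there
    loss = F.cross_entropy(policy(obs), target_actions)
    outer_optimizer.zero_grad()
    loss.backward()                               # adjust it toward the endorsed actions
    outer_optimizer.step()
```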
So that was the second piece. Now the third piece of information: it seems to me that a lot of people thinking about mesaoptimizers are not thinking about the "practical" case above, but instead about more confused or hypothetical cases, where people end up with a mesaoptimizer almost no matter what. I'm probably not the right person to defend that perspective since they often seem confused to me, but here's an attempt at a steelman:
Mesaoptimizers aren't just a thing that you're explicitly trying to make when you train advanced agents. They also happen automatically when trying to predict a system that itself contains agents, as those agents have to be predicted too. For instance for language models, you're trying to predict text, but that text was written by people who were trying to do something when writing it, so a good language model will have a representation of an approximation of those goals.
In theory, language models are just predictive models. But as we've learned, if you prompt them right, you can activate one of those representations of human goals, and thereby have them solve some problems for you. So even predictive models become optimizers when the environment is advanced enough, and we need to beware of that and consider factors like whether they are aligned and what that means for safety.
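As a toy illustration of the "prompt it right" point (the prompt below is made up, and a small model like GPT-2 will only do this crudely), conditioning a purely predictive model on text whose hypothetical author had a goal is enough to read something plan-shaped off its prediction:

```python
from transformers import pipeline

# Purely predictive model: it was only ever trained to guess the next token.
generator = pipeline("text-generation", model="gpt2")

# The prompt frames the continuation as text written by a goal-directed author,
# so predicting that text means (approximately) representing the author's goal.
prompt = (
    "The following is a step-by-step plan written by an expert event organizer "
    "for running a small conference:\n1."
)
out = generator(prompt, max_length=120, num_return_sequences=1)
print(out[0]["generated_text"])
```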
I think an "external push" is extremely likely. It'll just look like someone trying to get an AI to do something clever in the real world.
Take InstructGPT, the language model that is currently a flagship model of OpenAI. It was trained in two phases: first, purely to predict the next token of text; second, after it was really good at predicting the next token, it was further trained with reinforcement learning from human feedback.
Reinforcement learning to try to satisfy human preferences is precisely the sort of "external push" that will incentivize an AI that previously did not have "wants" (i.e. that did not previously choose its actions based on their predicted impacts on the world) to develop wants (i.e. to pick actions based on their predicted impact on the world).
Why did OpenAI do such a thing, then? Well, because it's useful! InstructGPT does a better job answering questions than regular ol' GPT. The information from human feedback helped the AI do better at its real-world purpose, in ways that are tricky to specify by hand.
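Here's a deliberately tiny sketch of that two-phase recipe. Everything is a stand-in: a bigram "language model" over five tokens, and a hand-written reward function in place of a learned model of human preferences. But the structure is the same: first pure next-token prediction, then reinforcement of whatever the reward signal likes, which is exactly the step that pushes the model toward picking outputs for their consequences.

```python
import torch
import torch.nn.functional as F

vocab = 5
logits = torch.zeros(vocab, vocab, requires_grad=True)  # bigram next-token logits
opt = torch.optim.Adam([logits], lr=0.1)

# Phase 1: pure next-token prediction on a (random, stand-in) corpus of token pairs.
corpus = torch.randint(0, vocab, (1000, 2))
for _ in range(200):
    prev, nxt = corpus[:, 0], corpus[:, 1]
    loss = F.cross_entropy(logits[prev], nxt)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Phase 2: reinforcement learning against a reward signal. Here the "human
# feedback" is a toy stand-in that simply likes token 3.
def reward(prev, nxt):
    return (nxt == 3).float()

for _ in range(200):
    prev = torch.randint(0, vocab, (64,))
    dist = torch.distributions.Categorical(logits=logits[prev])
    nxt = dist.sample()
    # REINFORCE: raise the log-probability of sampled outputs in proportion to reward.
    loss = -(reward(prev, nxt) * dist.log_prob(nxt)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```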
Now, if this makes sense, I think there's a subset of people's concerns about an "internal push" that make sense by analogy:
Consider an AI that you want to do a task that involves walking a robot through an obstacle course (e.g. mapping out a construction site and showing you the map). And you're trying to train this AI without giving it "wants," just as a tool, so you're not giving it direct feedback on how good a map it shows you; instead you're doing something more expensive but safer: you're training it to understand the whole distribution of human performance on this task, and then selecting a policy conditional on good performance.
The concern is that the AI will have a subroutine that "wants" the robot to navigate the obstacle course, even though you didn't give an "outside push" to make that happen. Why? Well, it's trying to predict good navigations of the obstacle course, and it models that as a process that picks actions based on their modeled impact on the real world, and in order to do that modeling, it actually runs the computations.
In other words, there's an "internal push" - or maybe equivalently a "push from the data rather than from the human" - which leads to "wants" being computed inside the model of a task that is well-modeled by goal-based reasoning. This all works fine on-distribution, but off-distribution it generalizes like the modeled agent would, which might be bad.
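Here's a toy version of the "learn the whole distribution, then condition on good performance" setup (the demonstrations, the four-step action sequences, and the scoring rule are all made-up placeholders, and the conditional model is a crude per-step empirical estimate). Note that nothing in it optimizes performance directly; the worry in the obstacle-course story is that the learned model of what a good run looks like ends up containing goal-directed reasoning internally.

```python
import random
from collections import defaultdict

# Pretend dataset: demonstrations labelled with how well they did.
demos = []
for _ in range(5000):
    actions = [random.choice("LRUD") for _ in range(4)]
    score = sum(a == "U" for a in actions)   # toy stand-in for "good navigation"
    demos.append((score, actions))

# "Model" the conditional distribution p(action at step i | score) empirically.
counts = defaultdict(lambda: defaultdict(int))
for score, actions in demos:
    for i, a in enumerate(actions):
        counts[(score, i)][a] += 1

def sample_policy(target_score):
    """Sample an action sequence conditional on a desired performance level."""
    actions = []
    for i in range(4):
        population, weights = zip(*counts[(target_score, i)].items())
        actions.append(random.choices(population, weights=weights)[0])
    return actions

# Selecting a policy conditional on good performance, without ever running
# an explicit optimizer over actions:
print(sample_policy(target_score=3))
```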
you're training it to understand the whole distribution of human performance on this task, and then selecting a policy conditional on good performance
Yeah, that makes sense to me.
it's trying to predict good navigations of the obstacle course, and it models that as a process that picks actions based on their modeled impact on the real world, and in order to do that modeling, it actually runs the computations.
I can see why it would run a simulation of what would happen if a robot walked an obstacle course. I don't see why it would actually walk the robot through it if not asked.
Because being able to do impressive stuff means you have some degree of coherence. From https://www.alignmentforum.org/posts/7im8at9PmhbT4JHsW/ngo-and-yudkowsky-on-alignment-difficulty :
But though mathematical reasoning can sometimes go astray, when it works at all, it works because, in fact, even bounded creatures can sometimes manage to obey local relations that in turn add up to a global coherence where all the pieces of reasoning point in the same direction, like photons in a laser lasing, even though there's no internal mechanism that enforces the global coherence at every point.
To the extent that the outer optimizer trains you out of paying five apples on Monday for something that you trade for two oranges on Tuesday and then trading two oranges for four apples, the outer optimizer is training all the little pieces of yourself to be locally coherent in a way that can be seen as an imperfect bounded shadow of a higher unbounded structure, and then the system is powerful though imperfect because of how the power is present in the coherence and the overlap of the pieces, because of how the higher perfect structure is being imperfectly shadowed. In this case the higher structure I'm talking about is Utility, and doing homework with coherence theorems leads you to appreciate that we only know about one higher structure for this class of problems that has a dozen mathematical spotlights pointing at it saying "look here", even though people have occasionally looked for alternatives.
Having plans that lase is (1) a thing you can generalize on, i.e. get good at because different instances have a lot in common, and (2) a thing that is probably heavily rewarded in general (by the reward thingy, or by internal credit assignment / economies), to the extent that the reward systems have correct credit assignment. So an AI that does impressive stuff probably has a general skill + dynamic of increasing coherence.
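To make the quoted apples/oranges example concrete, here's the money pump written out as a toy loop: locally each trade looks tolerable, but going around the cycle loses an apple every time, which is the kind of incoherence the outer optimizer (or internal credit assignment) trains away.

```python
apples = 100
for cycle in range(1, 6):
    apples -= 5             # Monday: pay five apples for a widget
    oranges = 2             # Tuesday: trade the widget for two oranges
    apples += 2 * oranges   # then trade the two oranges for four apples
    print(f"after cycle {cycle}: {apples} apples")
# One apple poorer per cycle; the coherence theorems say a system that can't be
# exploited this way behaves as if it maximizes some utility function.
```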
I did not understand anything from what you said... How does coherence generate an equivalent of an internal "push" to do something?
I don't have a direct answer to your question, so for now let's say that AI will not in fact "want" anything if not explicitly asked. This seems plausible to me, but also totally irrelevant from a practical perspective—who's going to build an entire freakin' superintelligence, and then just never have it do anything!? In order for a program to even communicate words or data to us, it's going to need some sort of drive to do so, since otherwise it would remain silent and we'd effectively have a very expensive silicon brick on our hands. So while in theory it may be possible to build an AI without wants, in practice there is always something an AI will be "trying to do".
I didn't mean that it wouldn't do anything, ever. Just that it will do what is asked, which creates its own set of issues, of course, if it kills us in the process. But I can imagine that it still will not have anything that could be described, in intentional-stance terms, as wants or desires.
Why were we so sure that strong enough AIs playing go would develop (what we can describe as a) fear of bad aji (latent potential)?
Well, we weren't. As far as I know, nobody ever predicted that. But in retrospect we should have, just because aji is such an important concept for mastering the game.
Similarly, if we're looking for a generic mechanism that would lead an AI to develop agency, I suspect any task would do, as long as interpreting the data as coming from agency-based behavior helps enough.
First they optimized for human behavior - that's how they came to understand agency. Then they evaluated how much agency explains their own behavior - that's how they noticed that increasing it helps their current tasks. The rest is history.
But why would it do something when not asked? I.e., why would it have needs/wants/desires to do anything at all?
These are inequivalent, but the answer to each is "because it was designed that way". Conventional software agents, which don't have to be particularly intelligent, do things without being explicitly instructed.
Firstly, we already have AI designs that "want" to do things. Deep Blue "wanted" to win at chess; various reinforcement learning agents "want" to win other games.
Intelligence that isn't turned to doing anything is kind of useless. Like you have an AI that is supposedly very intelligent. But it sits there just outputting endless 0's. What's the point of that?
There are various things intelligence can be turned towards. One is the "see that thing there, maximize that". Another option is prediction. Another is finding proofs.
An AI that wants things is one of the fundamental AI types. We are already building AIs that want things. They aren't yet particularly smart, and so aren't yet dangerous.
Imagine an AI trained to be a pure perfect predictor. Like some GPT-N. It predicts humans, and humans want things. If it's pushed somewhat outside its training distribution, it might be persuaded to predict an exceptionally smart and evil human. And it could think much faster. Or if the predictor is really good at generalizing, it could predict the specific outputs of other superhuman AI.
Mesa-optimization basically means we can't actually train for wanting X reliably. If we train an AI to want X, we might get one that wants Y instead.
Firstly, we already have AI designs that “want” to do things. Deep Blue “wanted” to win at chess; various reinforcement learning agents “want” to win other games.
"Wanting" in quotes isn't the problem. Toasters "want" to make toast.
Intelligence that isn’t turned to doing anything is kind of useless. Like you have an AI that is supposedly very intelligent. But it sits there just outputting endless 0's. What’s the point of that?
Doing something is not the same thing as doing-something-because-you-want-to. Toasters don't want to make toast, in the unquoted sense of "want".
What's the difference between the AI acting as if it wanted something, and it actually wanting something? The AI will act as if it wants something (the goals the programmers have in mind during training, or something else that destroys all life at some point after training) because that's what it will be rewarded for during training.
The alternative, an AI that doesn't seem to want anything at all, seems to be an AI that has no output.
The alternative would be an AI that goes through the motions and mimics 'how an agent would behave in a given situation' with a certain level of fidelity, but which doesn't actually exhibit goal-directed behavior.
Like, as long as we stay in the current deep learning paradigm of machine learning, my prediction for what would happen if an AI was unleashed upon the real world, regardless of how much processing power it has, would be that it still won't behave like an agent unless that's part of what we tell it to pretend. I imagine something along the lines of the AI that was trained on how to play Minecraft by analyzing hours upon hours of gameplay footage. It will exhibit all kinds of goal-like behaviors, but at the end of the day it's just a simulacrum limited in its freedom of action to a radical degree by the 'action space' it has mapped out. It will only ever 'act as though it's playing Minecraft', and the concept that 'in order to be able to continue to play Minecraft I must prevent my creators from shutting me off' is not part of that conceptual landscape, so it's not the kind of thing the AI will pretend to care about.
And pretend is all it does.
Humans are trained on how to live on Earth by hours of training on Earth. We can conceive of the possibility of Earth being controlled by an external force (God or the Simulation Hypothesis). Some people spend time thinking about how to act so that the external power continues to allow the Earth to exist.
Maybe most of us are just mimicking how an agent would behave in a given situation.
The universe appears to be well constructed to provide minimal clues as to the nature of its creator. Minecraft less so.
"Humans are trained on how to live on Earth by hours of training on Earth. (...) Maybe most of us are just mimicking how an agent would behave in a given situation."
I agree that that's a plausible enough explanation for lots of human behaviour, but I wonder how far you would get in trying to describe historical paradigm shifts using only a 'mimic hypothesis of agenthood'.
Why would a perfect mimic that was raised on training data of human behaviour do anything paperclip-maximizer-ish? It doesn't want to mimic being a human, just like Dall-E doesn't want to generate images, so it doesn't have a utility function for not wanting to be prevented from mimicking being a human, either.
The alternative would be an AI that goes through the motions and mimics 'how an agent would behave in a given situation' with a certain level of fidelity, but which doesn't actually exhibit goal-directed behavior.
If the agent would act as if it wanted something, and the AI mimics how an agent would behave, the AI will act as if it wanted something.
It will only ever 'act as though it's playing Minecraft', and the concept that 'in order to be able to continue to play Minecraft I must prevent my creators from shutting me off' is not part of that conceptual landscape, so it's not the kind of thing the AI will pretend to care about.
I can see at least five ways in which this could fail:
So I think maybe some combination of (1), (2) and (3) will happen.
I have no doubt that AI will some day soon surpass humans in all aspects of reasoning, that is pretty obvious. It is also clear to me that it will surpass humans in the ability to do something, should it "want" to do it. And if requested to do something drastic, it can accidentally cause a lot of harm, not because it "wants" to destroy humanity, but because it would be acting "out of distribution" (a "tool AI" acting as if it were an "agent"). It will also be able to get out of any human-designed AI box, should the need arise.
I am just not clear whether/how/why it would acquire the drive to do something, like maximizing some utility function, or achieving some objective, without any external push to do so. That is, if it was told to maximize everyone's happiness, it would potentially end up tiling the universe with smiley faces or something, to take the paradigmatic example. But that's not the failure mode that everyone is afraid of, is it? The chatter seems to be about mesaoptimizers going out of control and doing something other than asked, when asked. But why would it do something when not asked? I.e., why would it have needs/wants/desires to do anything at all?