You may be interested in some recent empirical experiments demonstrating objective robustness failures (inner misalignment), including ones predicted in the Risks from Learned Optimization paper.
I would also note that most modern-day AI systems like GPT-N are not actually optimisers, just algorithms produced by optimisation processes. The entity of [GPT-N + its trainer + its training data] could be considered an optimiser (albeit a self-contained one), but as soon as you take GPT-N out of that environment it is a stateless algorithm that looks at a short string of text and provides a probability distribution for the next letter. When it is run in generative mode, its weights do not change, and each prediction is no different from the isolated guesses it makes during training.
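To illustrate what I mean by "stateless", here is a minimal toy sketch (made-up names, a stand-in character-level model rather than GPT-N itself): the trained weights are frozen, and each call is a pure function from a text prefix to a probability distribution over the next character. Generating text is just calling that same function repeatedly.

```python
import numpy as np

# Toy stand-in for a trained model: fixed weights, no further learning.
VOCAB = "abcdefghijklmnopqrstuvwxyz "
rng = np.random.default_rng(0)
frozen_weights = rng.normal(size=(len(VOCAB), len(VOCAB)))  # stands in for trained parameters

def next_char_distribution(weights, prefix):
    """Pure function: the same prefix and weights give the same distribution every time."""
    last = VOCAB.index(prefix[-1]) if prefix and prefix[-1] in VOCAB else 0
    logits = weights[last]
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# "Generative mode" is just repeated calls to the same stateless function;
# nothing about the weights changes between calls.
text = "hello "
for _ in range(5):
    probs = next_char_distribution(frozen_weights, text)
    text += VOCAB[int(np.argmax(probs))]
print(text)
```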
I've also seen this conflation of evaluation functions with goals, and came to a similar conclusion that it isn't valid.
I do think the instrumental convergence thesis is still very much a danger. If there are any goals, for sufficiently powerful systems it seems plausible that instrumental convergence still applies to them, and separation of goals from evaluation just means that they're more opaque to us. Opacity of AI goals is not great from a human point of view.
In practice the evaluation function matters even less than most such discussion implies. In many current systems, only the gradient of the evaluation function matters at all, not its actual value. That may seem like a mere implementation detail, but there is a deeper point: no optimization process can maximize every evaluation function. Every training process (even an uncomputable one like Solomonoff induction) that takes an evaluation function as input will fail badly to maximize some functions for arbitrarily many inputs.
So for any nontrivial evaluation function, we should expect that the outputs of any system trying to match it will not be those that maximize it, no matter how superhuman its programming might be.
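To make the "only the gradient matters" point concrete, here is a minimal sketch (a toy least-squares loss with illustrative names, not any particular system): adding a large constant to the evaluation function changes its value everywhere, but leaves every gradient, and hence every training step, unchanged.

```python
import numpy as np

def loss(w, x, y):
    """Toy evaluation function: mean squared error of a linear model."""
    return np.mean((x @ w - y) ** 2)

def grad(loss_fn, w, x, y, eps=1e-6):
    """Numerical gradient of an arbitrary evaluation function."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps
        g[i] = (loss_fn(w + dw, x, y) - loss_fn(w - dw, x, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), np.zeros(3)

# Same function shifted by a constant: very different values, identical gradients.
shifted = lambda w, x, y: loss(w, x, y) + 1000.0
print(np.allclose(grad(loss, w, x, y), grad(shifted, w, x, y)))  # True
```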
I do think the instrumental convergence thesis is still very much a danger. If there are any goals, for sufficiently powerful systems it seems plausible that instrumental convergence still applies to them, and separation of goals from evaluation just means that they're more opaque to us.
I'm not certain about this. In the one existing example of inner vs outer alignment we have (human evolution), the inner goals are surprisingly weak - we are not prepared to achieve them at all costs. Humans like sex, but most wouldn't be prepared to take over the world just to get more sex. Perhaps this is chance, but perhaps it is actually fundamental - because the inner goals are not aligned with the outer goals, there's an incentive to stop them being too powerful.
This is exactly the problem that the post describes: the outer evaluation function is simply based on reproduction of genes (which requires sex), but the corresponding inner goal landscape is very different. If the (outer) evaluation function were actually the aim of humans, every man would take extreme actions to try to impregnate every woman on the planet.
There are quite a few examples of humans having goals that do lead them to try to take over the world. They usually hold this as an instrumental goal toward something only vaguely related to the evaluation function. They almost universally fail, but that is more a matter of the upper bound on variation in human capability than of their willingness to pursue such goals.
If humans in reality had some exponential scale of individual capability, as in some fiction and games, I would expect to see a lot more successful world takeovers. Likewise, I wouldn't expect a superintelligence to share the capability bounds of current humans. Even a comparatively weak goal whose pursuit, merely as a side effect, outweighs everything humans prefer about the world would be plenty bad enough.
I think this is a relatively straightforward point, but it's one that has definitely led to confused thinking on my part about AI alignment, and presumably on others' part too. After I wrote this post, Ruby pointed out that there's an entire sequence on this. However, I think this post still provides benefit by being shorter and less formal.
The way most machine learning systems learn involves performing a task using some algorithm (e.g. translate this text into Japanese), evaluating the outcome using some evaluation function (e.g. how close is this to the human translation), and then adapting the original algorithm so that it would do better on that task according to the given evaluation function.
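As a minimal sketch of that loop (a toy linear model with made-up names, not any particular system), the three pieces look roughly like this:

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 2))
targets = inputs @ np.array([2.0, -1.0])   # the "correct answers" the system is scored against
weights = np.zeros(2)                      # the parameters of the original algorithm

def perform_task(w, x):
    """Run the algorithm on the task (e.g. produce a translation)."""
    return x @ w

def evaluation_function(prediction, target):
    """Score the outcome (e.g. how close is this to the human translation)."""
    return np.mean((prediction - target) ** 2)

for step in range(200):
    prediction = perform_task(weights, inputs)
    error = prediction - targets
    gradient = 2 * inputs.T @ error / len(inputs)  # direction that improves the evaluation
    weights -= 0.1 * gradient                      # adapt the algorithm to do better next time

print(weights)  # ends up near [2, -1]
```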
I think there's a tendency to conflate this evaluation function with the aims of the AI itself, which leads to some invalid conclusions.
I started thinking about this when I thought about GPT3. The task GPT3 performs is to predict the continuation of a piece of text. It is evaluated on how close its prediction is to the actual continuation. It then uses backpropagation to adjust the parameters of the model to do better next time.
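For concreteness, a toy version of that training step might look like the following (a tiny next-token model with illustrative names, nothing like GPT3's actual architecture or scale): predict each next token, compare against the actual continuation, and backpropagate.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 64, 32
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim), nn.Linear(embed_dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

text = torch.randint(0, vocab_size, (1, 65))       # stand-in token ids for a piece of text
context, continuation = text[:, :-1], text[:, 1:]  # predict each token from the one before it

optimizer.zero_grad()
logits = model(context)                            # predicted distribution at each position
loss = loss_fn(logits.reshape(-1, vocab_size),     # how close is this to the actual continuation?
               continuation.reshape(-1))
loss.backward()                                    # backpropagation
optimizer.step()                                   # adjust parameters to do better next time
```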
I asked myself: if we scaled up GPT3 until it had superhuman performance on this task, might we end up with an alignment problem?
My first thought was yes - GPT3 might work out that it was being fed articles from the internet, for example, and hack the internet so that all articles have identical text, giving it 100% accuracy on its evaluation function.
I think that the chances of this happening are actually quite slim. It's easier to realize this if we compare machine learning to evolution, where we have the benefit of hindsight to see what actually occurred.
Evolution works similarly to machine learning. Organisms perform a task (reproducing), and are evaluated on how well they replicate their genes. Evolution uses a hill-climbing algorithm (random mutation + natural selection) to improve at that task.
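A minimal "randomly mutate + select" sketch (made-up fitness function and names, purely illustrative) shows the relevant feature: the thing being selected never contains or represents the evaluation, it just inherits whatever happened to score well.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(genome):
    """The outer evaluation: how well this variant 'reproduces' (peak at genome == 3)."""
    return -np.sum((genome - 3.0) ** 2)

genome = np.zeros(5)
for generation in range(2000):
    mutant = genome + rng.normal(scale=0.1, size=genome.shape)  # random mutation
    if fitness(mutant) > fitness(genome):                       # natural selection
        genome = mutant

print(genome.round(2))  # climbs toward the fitness peak without the genome ever encoding it
```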
People have often wondered why humans don't intrinsically want to reproduce. For example, why do humans want to have sex even when birth control is being used? Eliezer has a great sequence on this, but the key point is that humans are adaptation executers, not fitness maximizers. The pressure of evolution causes adaptations to persist which have proven useful for maximizing reproduction in the past. It doesn't make lifeforms actually care about reproduction.
Similarly the pressure of the evaluation function causes features to emerge in a machine learning system which have proven useful in the past at maximizing the evaluation function. But that doesn't mean the machine learning system will necessarily care about the evaluation function. It might, but there's no reason to assume it will.
What this means is that I find it unlikely that even a superhuman GPT3 would try to hack the internet. The pressure of its evaluation function is likely to cause it to gain ever deeper understanding of human thought and how people write things, but it seems unlikely that it will ever try to actively change the world to improve its predictions. The reason is that it's too difficult for the evaluation function to ever promote that - hacking the world wouldn't improve prediction until it was done very well, so there's no obvious pathway for such abilities to develop.
To be clear, I think it very likely GPT3 will be intelligent enough to know exactly how to hack the internet to improve predictions - after all, it needs to be able to predict how an essay about how the internet was hacked would continue. It might also have consciousness and conscious desires to do things. But I find it unlikely that it would have a conscious desire to hack the internet, since its evaluation function would have no straightforward way to cause that desire to appear.
Now I’m not saying this couldn't happen. I’m saying that it's far less likely to happen than if GPT3 directly had the aim of minimizing predictive error - you have to add in an extra step where the evaluation function causes GPT3 itself to become an optimizer which minimizes predictive error.
This is both good news and bad.
Firstly, it means that the instrumental convergence thesis seems less powerful. The instrumental convergence thesis states that agents with wildly varying goals are likely to pursue similar power-seeking behavior, regardless of their ultimate goal, as a means to achieving it. But if superintelligences are adaptation executers, not fitness maximizers, then they may not necessarily have goals at all.
Unfortunately, they might still have goals - just like humans do. And those goals might not be to maximize their evaluation function, but some other goals which happened to be effective in the past at maximizing their evaluation function.
I've seen people try to solve alignment by coming up with an evaluation function that perfectly reflects what humanity wants. This approach seems doomed to fail - the superintelligence is unlikely to have the aligned evaluation function as a goal.