Kaj_Sotala comments on The Brain as a Universal Learning Machine - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
I believe the orthogonality thesis is probably mostly true in a theoretical sense. I thought I made it clear in the article that a ULM can have any utility function.
That being said, the idea of programming in goals directly does not really apply to a ULM. You instead need to indirectly specify an initial approximate utility function and then train the ULM in just the right way. So it's potentially much more complex than "program in the goal you want".
However the end result is just as general. If evolution can create humans which roughly implement the goal of "be fruitful and multiply", then we could probably create a ULM that implements the goal of "be fruitful and multiply paperclips".
I agree that just because all utility functions are possible does not make them all equally likely.
The danger is not in paperclip maximizers; it is in utility functions that are simple and easy to specify. For example, the basic goal of "maximize knowledge" is probably much easier to specify than a human-friendly utility function. Likewise, Wissner-Gross's proposal of maximizing future freedom of action is pretty simple. But both probably result in very dangerous agents.
I think Ex Machina illustrated the most likely type of dangerous agent - it isn't a paperclip maximizer. It's more like a sociopath. A ULM with a too-simple initial utility function is likely to end up something like a sociopath.
I hope not too simple! This topic was beyond the scope of this article. If I have time in the future I will do a follow up article that focuses on the reward system, the human utility function, and neuroscience inspired value learning, and related ideas like inverse reinforcement learning.
"Be fruitful and multiply" is a subtly more complex goal than "maximize future freedom of action". Humans need to be compelled to find suitable mates and form long lasting relationships stable enough to raise children (or get someone else to do it), etc. Humans perform these functions not because of some slow long logical reasoning from first principles. Instead the evolutionary goals are encoded into the value function directly - as that is the only practical efficient implementation. You can think of evolution having to encode it's value function into the human brain using a small number of bits. It still ends up being more complex than the simplest viable utility functions.
This made me think. I've noticed that some machine learning types tend to dismiss MIRI's standard "suppose we programmed an AI to build paperclips and it then proceeded to convert the world into paperclips" examples with a reaction like "duh, general AIs are not going to be programmed with goals directly in that way, these guys don't know what they're talking about".
Which is fair on one hand, but also missing the point on the other hand.
It could be valuable to write a paper pointing out that sure, even if we forget about that paperclipping example and instead assume a more deep-learning-style AI that needs to grow and be given its goals in a more organic manner, most of the standard arguments about AI risk still hold.
Adding that to my todo-list...
Agreed that this would be valuable. I can't measure it exactly, but I believe it took me some extra time/cognitive steps to get over the paperclip thing and realize that the more general point about human utility functions being difficult to specify is still quite true in any ML approach.
Yes, a better example than Clippie is rather overdue.
I've written about this before. The argument goes something like this.
RL implies self-preservation, since dying prevents you from obtaining more reward. And self-preservation leads to undesirable behavior.
E.g. making as many copies of yourself as possible for redundancy. Or destroying anything that has the tiniest probability of being a threat. Or trying to store as much mass and energy as possible to last against the heat death of the universe.
Or, you know, just maximizing your reward signal by wiring it that way in hardware. This would reduce your planning gradient to zero, which would suck for gradient-based planning algorithms, but there are also planning algorithms more closely tied to world-states that don't rely on a reward gradient.
Even if the AI wires its reward signal to +INF, it probably would still consider time, and therefore self-preservation.
Is this a mathematical argument, or a verbal argument?
Specifically, what eli_sennesh means by a "planning gradient" is that you compare a plan to alternative plans around it, and switch plans in the direction of more reward. If your reward function returns infinity for any possible plan, then you will be indifferent among all plans, and your utility function will not constrain what actions you take at all, and your behavior is 'unspecified.'
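This indifference can be made concrete with a toy sketch (my own illustration, with made-up plans and rewards, not anyone's actual planner): a greedy plan-selection step compares candidate plans by reward, and once every plan scores +inf, all comparisons tie and the choice is unconstrained.

```python
# Toy sketch of reward-driven plan selection. With a finite reward function,
# some plans beat others; with a wireheaded reward that returns +inf for
# everything, every comparison ties and the "planning gradient" is zero.

def pick_plan(plans, reward):
    """Return a plan with maximal reward; ties are broken arbitrarily
    (Python's max() keeps the first maximal element it sees)."""
    return max(plans, key=reward)

plans = ["build factory", "self-preserve", "do nothing"]

# Finite rewards constrain behavior: the agent prefers "build factory".
finite_reward = {"build factory": 10.0, "self-preserve": 3.0, "do nothing": 0.0}
print(pick_plan(plans, finite_reward.get))  # → "build factory"

# Wireheaded reward: every plan looks equally, infinitely good, so the
# selection degenerates to an arbitrary tie-break.
wireheaded = lambda plan: float("inf")
print(pick_plan(plans, wireheaded))  # → whichever plan happens to come first
```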
I think you're implicitly assuming that the reward function is housed in some other logic, and so it's not that the AI is infinitely satisfied by every possibility, but that the AI is infinitely satisfied by continuing to exist, and thus seeks to maximize the amount of time that it exists. But if you're going to wirehead, why would you leave this potential source for disappointment around, instead of making the entire reward logic just return "everything is as good as it could possibly be"?
Here's one mathematical argument for it, based on the assumption that the AI can rewire its reward channel but not the whole reward/planning function: http://www.agroparistech.fr/mmip/maths/laurent_orseau/papers/ring-orseau-AGI-2011-delusion.pdf
Yes, that's the basic problem with considering the reward signal to be a feature, to be maximized without reference to causal structure, rather than a variable internal to the world-model.
Again: that depends what planning algorithm it uses. Many reinforcement learners use planning algorithms which presume that the reward signal has no causal relationship to the world-model. Once these learners wirehead themselves, they're effectively dead due to the AIXI Anvil-on-Head Problem, because they were programmed to assume that there's no relationship between their physical existence and their reward signal, and they then destroyed the tenuous, data-driven correlation between the two.
I'm having a very hard time modelling how different AI types would act in extreme scenarios like that. I'm surprised there isn't more written about this, because it seems extremely important to whether UFAI is even a threat at all. I would be very relieved if that were the case, but it doesn't seem obvious to me.
Particularly I worry about AIs that predict future reward directly, and then just take the local action that predicts the highest future reward. Like is typically done in reinforcement learning. An example would be Deepmind's Atari playing AI which got a lot of press.
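That kind of agent can be sketched in a few lines (a hypothetical illustration with made-up Q-values, not DeepMind's code): it estimates future reward for each action in the current state and acts greedily on those estimates, with no explicit world model or long-horizon plan.

```python
import random

def act(q_estimates, epsilon=0.0):
    """Pick the action whose predicted future reward is highest.
    With probability epsilon, explore by picking a random action instead."""
    if random.random() < epsilon:
        return random.choice(list(q_estimates))
    return max(q_estimates, key=q_estimates.get)

# Made-up value estimates for an Atari-like action set.
q = {"left": 0.2, "right": 1.3, "fire": 0.7}
print(act(q))  # with epsilon=0 this is deterministic: "right"
```

Everything about the agent's behavior is driven by those learned reward predictions, which is why what it predicts about its own destruction matters so much.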
I don't think AIs with entire world models that use general planning algorithms would scale to real-world problems. Too much irrelevant information to model, too large a search space to search.
As they train their internal model to predict what their reward will be x time steps ahead, and as x goes to infinity, they care more and more about self-preservation. Even if they have already hijacked the reward signal completely.
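A rough numeric sketch of this point (my own illustration, with an assumed constant per-step reward and discount factor): if the agent receives reward r at every step while it exists and nothing once destroyed, then the discounted value of surviving for x more steps grows with the horizon x, so the longer the prediction horizon, the more the agent stands to lose by not preserving itself.

```python
def value_of_surviving(r, gamma, x):
    """Discounted sum of a constant per-step reward r over x future steps.
    This is what the agent forgoes if it stops existing now."""
    return sum(r * gamma**t for t in range(x))

# The value of continued existence keeps growing as the horizon lengthens.
for horizon in (1, 10, 100, 1000):
    print(horizon, round(value_of_surviving(1.0, 0.99, horizon), 2))
```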