Kaj_Sotala comments on MIRI's technical research agenda - LessWrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (52)
I don't think that this follows: it's easier to predict that someone won't like a plan, than it is to predict what's the plan that would maximally fulfill their values.
For example, I can predict with a very high certainty that the average person on the street would dislike it if I were to shoot them with a gun; but I don't know what kind of a world would maximally fulfill even my values, nor am I even sure of what the question means.
Similarly, an AI might not know what exactly its operators wanted it to do, but it could know that they didn't want it to break out of the box and kill them, for example.
This seems like a very strong claim to me.
Suppose that the AI has been programmed to carry out some goal G, and it builds a model of the world for predicting what would be the best way of to achieve G. Part of its model of the world involves a model of its controllers. It notices that there exists a causal chain "if controller becomes aware of intention I, and controller disapproves of intention I, controller will stop AI from carrying out intention I". It doesn't have a full model of the function controller-disapproves(I), but it does develop a plan that it thinks would cause it to achieve G, and which - based on earlier examples - seems to be more similar to the plans which were disapproved of than the plans that were approved of. A straightforward analysis of "how to achieve G" would then imply "prevent the predicate controller-aware(I) from being fulfilled while I'm carrying out the other actions needed for fulfilling G".
This doesn't seem like it would require the AI to be taught the concept of deception, or even to necessarily possess "deception" as a general concept: it only requires that it has a reasonably general capability for reasoning and modeling the world, and that it manages to detect the relevant causal chain.
While I agree, his proposal does seem like a good start. Restricting a UFAI to pursue only a subset of all potentially detrimental plans is a modest gain, but still a gain worth achieving. I am skeptical that FAI should consist of a grand unified moral theory. I think an FAI made of many overlapping moral heuristics and patches, such as the restriction he describes, is more technically feasible, and might even be more likely to match actual human value systems, given the ambiguous, varying, and context sensitive nature of our evolved moral inclinations.
(I realize that these are not properties generally considered when thinking about computer superintelligences - we're inclined to see computers as rigidly algorithmic, which makes sense given current technology levels. But I believe extrapolating from current technology to predict how AGI will function is a greater mistake than extrapolating from known examples of intelligence - a process is better understood when looking at actions than when looking at substrate. With regard to intelligence at least, AGI will necessarily be much more flexible in its operations than traditional computers are. I expect that the cost of this flexibility in behavior will be sacrificing rigidity at the process level. Performing billions of Bayesian calculations a second isn't feasible, so a more organic and heuristic based approach will be necessary. If this is correct and such technologies will be necessary for an AGI's intelligence, then it makes sense that we'd be able to use them for an AGI's emotions or goals as well.)
Even if we do attempt to build a grand unified Friendly software, I expect little downside (relative to potential risks) to adding these sort of restrictions in addition.
Wow there is a world of assumptions wrapped up in there. For example that the AI has a concept of external agents and an ability to model their internal belief state. That an external agent can have a belief about the world which is wrong. This may sound intuitively obvious, but it’s not a simple thing. This kind of social awareness takes time to be learnt by humans as well. Heinz Wimmer and Josef Perner showed that below a certain age (3-4 years) kids lack an ability to track this information. A teacher puts a toy in a blue cupboard, then leaves the room and you move it to the red cupboard, and the teacher comes back into the room. If you ask the kid not where the toy is, but what cupboard the teacher will look in to find it, and they will say the red cupboard.
It’s no accident that it takes time for this skill to develop. It’s actually quite complex to be able to keep track of and simulate the states of mind of other agents acting in our world. We just take it for granted because we are all well-adjusted adults of a species evolved for social intelligence. But an AI need not think in that way, and indeed of the most interesting use cases for tool AI (“design me a nanofactory constructible with existing tools” or “design a set of experiments organized as a decision tree for accomplishing the SENS research objectives”) would be best accomplished by an idiot savant with no need for social awareness.
I think it goes without saying that obvious AI safety rule #1 is don’t connect an UFAI to the internet. Another obvious rule I think is don’t build in capabilities not required to achieve the things it is tasked with. For the applications of AI I imagine in the pre-singularity timeframe, social intelligence is not a requirement. So when you say “part of its model of the world involves a model of its controllers”, I think that is assuming a capability the AI should not have built-in.
(This is all predicated on soft-enough takeoff that there would be sufficient warning if/when the AI self-developed a social awareness capability.)
Also, what 27chaos said is also worth articulating in my own words. If you want to prevent an intelligent agent from taking a particular category of actions there are two ways of achieving that requirement: (a) have a filter or goal system which prevents the AI from taking (box) or selecting (goal) actions of that type; or (b) prevent it by design from thinking such thoughts to begin with. An AI won’t take actions it never even considered in the first place. While the latter course of action isn’t really possible with unbounded universal inference engines (since “enumerate all possibilities” is usually a step in their construction), such designs arise quite naturally out of more realistic psychology-inspired designs.
The approach to AGI safety that you're outlining (keep it as a tool AI, don't give it sophisticated social modeling capability, never give it access to the Internet) is one that I agree should work to keep the AGI safely contained in most cases. But my worry is that this particular approach being safe isn't actually very useful, because there are going to be immense incentives to give the AGI more general capabilities and have it act more autonomously.
As we wrote in Responses to Catastrophic AGI Risk:
So while I agree that a strict boxing approach would be sufficient to contain the AGI if everyone were to use it, it only works if everyone were indeed to use it, so what we need is an approach that works for more autonomous systems as well.
Hmm. That sounds like a very interesting idea.
While I actually agree that tool AI goals can be programmed, if you want to keep the whole thing from turning unsafely agenty, you're going to have to strictly separate the inductive reasoning from the actual tool run: run induction for a while, then use tool-mode to compose plans over the induced models of the world, potentially after censoring those models for safety.