Yes, the team I imagine is doing something dangerous.
No, I am not imagining them using only behavioral testing in their design approach. I am trying to imagine them doing the best job they possibly can, both in design and testing, in order to see how much they can reduce the risk.
They are in the early stages. They are working on partial minds, purposefully handicapped minds, AGI components, just the vision system, etc.
Suppose that they know what they are doing is dangerous, and they are using their substantial brilliance to try to design, not just a boxing approach, but a testing method and roll-out mechanism that mitigates the danger.
They have the following means of testing the AI in handicapped mode, at minimum:
- Run it on slow hardware or clock down the processors
- Allow it to use no more than (for instance) 1 gig of memory
- Allow it to use no more than 1 trillion instructions, then the computer turns off
- Expose it to limited data about the world, or no data about the real world
- Provide it little or no information about its operators or the society they live in
- Provide it little or no information about its own hardware or software architecture
- If it shows evidence of having mastered a valuable sub-task (say, as examples, it can solve a real-world problem that, as far as we know, can only be solved by an attorney, using differential equations or advanced persuasive methods) turn it off.
- Run it only in simulations and virtual realities
- Tripwires connected to honeypots, and otherwise
- Build it off-the-grid
- Give it no objective function
- Give it no access to methods to manipulate physical objects
- All of the staff has gone through personality testing and a security clearance.
- Very few, if any, of the staff know how to create all of the parts of the software or hardware.
- No real-time communication with a person.
- All communication with people is logged.
...
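As a rough illustration of the budget-style limits in the list above, here is a minimal Python sketch. The function `run_with_budget`, its parameters, and the reason strings are all hypothetical names introduced for illustration. It caps a toy process by step count and by state size, which stand in for the instruction and memory limits, and it is not a real sandbox.

```python
def run_with_budget(step_fn, state, max_steps, max_items):
    """Run a toy process under hard budgets (hypothetical illustration only).

    step_fn(state) returns the next state (a list), or None when it is done.
    max_steps stands in for an instruction cap; max_items for a memory cap.
    Returns (last_accepted_state, reason_for_stopping).
    """
    for _ in range(max_steps):
        next_state = step_fn(state)
        if next_state is None:
            # The process finished on its own within budget.
            return state, "finished"
        if len(next_state) > max_items:
            # Memory-style tripwire: refuse the oversized state and stop.
            return state, "memory_budget"
        state = next_state
    # Step-style tripwire: the "computer turns off" after max_steps.
    return state, "step_budget"


def grow(state):
    """Toy process that keeps growing its state by one item per step."""
    return state + [0]


if __name__ == "__main__":
    print(run_with_budget(grow, [], max_steps=10, max_items=3))
    print(run_with_budget(grow, [], max_steps=2, max_items=100))
```

In this sketch the limits are enforced by the supervising loop rather than by the process itself, which matches the spirit of the list: the system under test is never asked to police its own budget.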
The team seems to have a lot of tools to continue their work with limited risk. If they deploy all of these, and more, to test AI systems which are not AGI yet, but only possible components, then we start to have a hard time convincing them not to continue their work, even if we do not trust them to use the technology well in the end.
- Expose it to limited data about the world, or no data about the real world
- Provide it little or no information about its operators or the society they live in
- Provide it little or no information about its own hardware or software architecture
When I read "Oracle AI" I realized one possibility which is probably still overlooked.
"Dreams"
Erasing data on physical media is not perfect. A bit that is now zero but was previously nonzero might be read slightly more slowly or more quickly. This could allow an SAI to see shadows of past data, which could lead to a phenomenon similar to dreams.
This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.
Welcome. This week we discuss the 11th section in the reading guide: The treacherous turn. This corresponds to Chapter 8.
This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.
There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable (and where I remember), page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).
Reading: “Existential catastrophe…” and “The treacherous turn” from Chapter 8
Summary
Another view
Danaher:
Notes
1. Danaher also made a nice diagram of the case for doom, and its relationship with the treacherous turn:
2. History
According to Luke Muehlhauser's timeline of AI risk ideas, the treacherous turn idea for AIs has been around since at least 1977, when a fictional worm did it:
3. The role of the premises
Bostrom's argument for doom has one premise that says AI could care about almost anything, then another that says regardless of what an AI cares about, it will do basically the same terrible things anyway. (p115) Do these sound a bit strange together to you? Why do we need the first, if final values don't tend to change instrumental goals anyway?
It seems the immediate reason is that an AI with values we like would not have the convergent goal of taking all our stuff and killing us. That is, the values we want an AI to have are some of those rare values that don't lead to destructive instrumental goals. Why is this? Because we (and thus the AI) care about the activities the resources would be grabbed from. If the resources were currently being used for anything we didn't care about, then our values would also suggest grabbing resources, and look similar to all of the other values. The difference that makes our values special here is just that most resources are already being used for them somewhat.
4. Signaling
It is hard to tell apart a safe and an unsafe AI, because both would like to look safe. This is a very common problem in human interactions. For instance, it can be nontrivial to tell a genuine lover from a gold digger, a businessman from a conman, and an expert from a crank. All of them want to look like the desirable sort. Particularly similar to the AI case is that of hiring a new employee for a trial period. You will sometimes find that the employee's values are much better aligned during the trial period, and then they undergo a 'treacherous turn' once they have been hired more thoroughly.
'Costly signaling' is a general purpose solution to this problem, which works some of the time. The basic idea is this. Everyone has instrumental reasons to look like the good kind of person, but perhaps their reasons aren't exactly as strong as one another's, or the strength of their desire is harder to act on for one group than the other, so we can construct a set of options that will lead the different types of people to choose differently, even though they know this will set them apart. For instance, while an honest businessman and a conman would both like to say 'I'm an honest businessman', and to behave honestly if you watch them for five minutes, a conman would benefit less from an expensive shopfront than an honest businessman would, because his reputation is less valuable, so a brand is a signal of being honest.
Similarly, we could perhaps design situations where most harmful AIs would have a reasonable incentive to reveal themselves. This is difficult because so much is at stake, compared to the cost of the signal. But I haven't thought enough about it to rule it out, and I haven't seen anyone else do so either.
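To make the costly-signaling condition above concrete, here is a minimal sketch in Python. It assumes a deliberately simple model that is mine rather than Bostrom's: each type buys the signal only if its benefit from being believed exceeds the signal's cost. The function names and the numbers are hypothetical, chosen only for illustration.

```python
def buys_signal(benefit, cost):
    """A type pays for the signal only if being believed is worth more than the cost."""
    return benefit > cost


def is_separating(benefit_honest, benefit_conman, cost):
    """True when the honest type buys the signal and the conman does not.

    Assumed toy model: the signal distinguishes the two types only when its
    cost falls between what each type stands to gain from looking honest.
    """
    return buys_signal(benefit_honest, cost) and not buys_signal(benefit_conman, cost)


if __name__ == "__main__":
    # Shopfront case: reputation is worth more to the honest businessman.
    print(is_separating(benefit_honest=100, benefit_conman=20, cost=50))
    # Signal too cheap: both types buy it, so it tells us nothing.
    print(is_separating(benefit_honest=100, benefit_conman=20, cost=10))
    # High-stakes case: the deceiver gains so much that it pays for the signal too.
    print(is_separating(benefit_honest=100, benefit_conman=1000, cost=50))
```

The last case is the worry for AI: when the payoff from looking safe is enormous, an affordable signal fails to separate the types, so any workable design would need a signal that is far costlier for an unsafe AI than for a safe one.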
5. When would the 'conception of deception' take place?
6. Surveillance of the mind
Bostrom is not optimistic that AI researchers could protect much against risks by watching AI mental processes directly for signs of deception, rapid progress, or undesirable scheming. He says if the AI is smart it will realize this might happen, and hide the offending thoughts.
This seems an open question to me, for several reasons:
In-depth investigations
If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.
How to proceed
This has been a collection of notes on the chapter. The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!
Next week, we will talk about 'malignant failure modes' (as opposed presumably to worse failure modes). To prepare, read “Malignant failure modes” from Chapter 8. The discussion will go live at 6pm Pacific time next Monday December 1. Sign up to be notified here.