Start the AI in a sandbox universe, like the "game of life". Give it a prior saying that universe is the only one that exists (no universal priors plz), and a utility function that tells it to spell out the answer to some formally specified question in some predefined spot within the universe. Run for many cycles, stop, inspect the answer.
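Very roughly, the setup might look something like this, assuming a Conway's Game of Life grid as the toy universe; the grid size, the answer region, and how the question gets encoded are all placeholders:

```python
import numpy as np

GRID = 256                                    # side length of the toy universe
ANSWER_REGION = (slice(0, 8), slice(0, 64))   # predefined spot to read the answer from

def step(world: np.ndarray) -> np.ndarray:
    """One Game of Life update: live cells survive with 2-3 neighbours,
    dead cells come alive with exactly 3 neighbours."""
    neighbours = sum(
        np.roll(np.roll(world, dx, axis=0), dy, axis=1)
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    alive = world.astype(bool)
    survive = alive & ((neighbours == 2) | (neighbours == 3))
    born = ~alive & (neighbours == 3)
    return (survive | born).astype(world.dtype)

def run_sandbox(initial_world: np.ndarray, cycles: int) -> np.ndarray:
    """Run the universe for a fixed number of cycles, then read out whatever
    the AI left in the predefined answer region."""
    world = initial_world
    for _ in range(cycles):
        world = step(world)
    return world[ANSWER_REGION]

# hypothetical usage: random soup, plus whatever pattern encodes the AI and the question
world = (np.random.rand(GRID, GRID) < 0.1).astype(np.uint8)
answer_bits = run_sandbox(world, cycles=1_000)
```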
A prior saying that this is the only universe that exists isn't very useful, since then it will only treat everything as being part of the sandbox universe. It may very well break out, but think that it's only exploiting weird hidden properties of the game of life-verse. (Like the way we may exploit quantum mechanics without thinking that we're breaking out of our universe.)
Start the AI in a sandbox universe. Define its utility function over 32-bit integers. Somewhere inside the sandbox, put something that sets its utility to INT_MAX, then halts the simulation. Outside the sandbox, leave documentation of this readily accessible. The AI should never try to do something elaborately horrible, because it can get max utility easily enough from inside the simulation; if it does escape the box, it should go back in to collect its INT_MAX utility.
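A minimal sketch of that honeypot mechanism, assuming the sandbox exposes a hook that fires when the AI reaches some hypothetical reward tile:

```python
INT_MAX = 2**31 - 1        # utility is defined over 32-bit signed integers

class Sandbox:
    def __init__(self):
        self.utility = 0
        self.halted = False

    def on_reach_reward_tile(self):
        """Hypothetical hook: fires when the AI touches the documented
        easter egg somewhere inside the sandbox."""
        self.utility = INT_MAX   # can never be beaten by any other plan
        self.halted = True       # and nothing more can be gained afterwards
```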
The AI gets positive utility from having been created, and that is the whole of its utility function. It's given a sandbox full of decision-theoretic problems to play with, and is put in a box (i.e. it can't meaningfully influence the outside world until it has superhuman intelligence). Design it in such a way that it's initially biased toward action rather than inaction if it anticipates equal utility from both.
Unless the AI develops some sort of non-causal decision theory, it has no reason to do anything. If it develops TDT, it will try to act in acco...
The Philosophical Insight Generator - Using a model of a volunteer's mind, generate short (<200 characters, say) strings that the model rates as highly insightful after reading each string by itself, and print out the top 100,000 such strings (after applying some semantic-distance criterion or using the model to filter out duplicate insights) after running for a certain number of ticks.
Have the volunteer read these insights along with the rest of the FAI team in random order, discuss, update the model, then repeat as needed.
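A rough sketch of that loop; `mind_model.rate_insight` and `semantic_distance` are hypothetical stand-ins for the volunteer model and the duplicate filter:

```python
def generate_insights(mind_model, candidates, top_n=100_000,
                      max_len=200, min_distance=0.3):
    """Score short candidate strings with the volunteer model, then keep the
    top ones after filtering out near-duplicate insights."""
    scored = [(mind_model.rate_insight(s), s)          # hypothetical model call
              for s in candidates if len(s) < max_len]
    scored.sort(reverse=True)                          # most insightful first

    kept = []
    for score, s in scored:
        # discard anything too close to an insight we already kept
        # (semantic_distance is the hypothetical duplicate filter)
        if all(semantic_distance(s, k) > min_distance for _, k in kept):
            kept.append((score, s))
        if len(kept) == top_n:
            break
    return kept
```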
So, here's my pet theory for AI that I'd love to put out of its misery: "Don't do anything your designer wouldn't approve of". It's loosely based on the "Gandhi wouldn't take a pill that would turn him into a murderer" principle.
A possible implementation: Make an emulation of the designer and use it as an isolated component of the AI. Any plan of action has to be submitted for approval to this component before being implemented. This is nicely recursive and rejects plans such as "make a plan of action deceptively complex such that...
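One way the approval gate might be wired up, with the planner and the emulated designer as hypothetical components; the only real point is that every plan, including plans to modify the gate itself, has to pass through approval:

```python
class GatedAI:
    def __init__(self, planner, designer_emulation):
        self.planner = planner                  # hypothetical plan generator
        self.designer = designer_emulation      # isolated emulation of the designer

    def act(self, situation):
        plan = self.planner.propose(situation)
        # every plan, including "rewrite this check", must be approved first
        if self.designer.approves(plan):
            return plan.execute()
        return None                             # rejected plans are never run
```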
You flick the switch, and find out that you are a component of the AI, now doomed to an unhappy eternity of answering stupid questions from the rest of the AI.
This is a problem. But if this is the only problem, then it is significantly better than paperclip universe.
Oracle AI - its only desire is to provide the correct answer to yes or no questions posed to it in some formal language (sort of an ueber Watson).
Comment upvoted for starting the game off! Thanks!
Q: Is the answer to the Ultimate Question of Life, the Universe, and Everything 42?
A: Tricky. I'll have to turn the solar system into computronium to answer it. Back to you as soon as that's done.
Oops. The local universe just got turned into computronium. It is really good at answering questions, though. Apart from that, you gave it a desire to provide answers. The way to ensure that it can answer questions is to alter humans such that they ask (preferably easy) questions as fast as possible.
Give the AI a bounded utility function where it automatically shuts down when it hits the upper bound. Then give it a fairly easy goal such as 'deposit 100 USD in this bank account.' Meanwhile, make sure the bank account is not linked to you in any fashion (so the AI doesn't force you to deposit the 100 USD in it yourself, rendering the exercise pointless.)
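A minimal sketch of the bounded-utility shutdown, with the bound and the goal check as placeholders:

```python
UTILITY_BOUND = 100.0          # utility can never exceed this

class BoundedAgent:
    def __init__(self):
        self.utility = 0.0
        self.running = True

    def update(self, observed_balance):
        # goal: 100 USD deposited in the (unlinked) target account
        self.utility = min(observed_balance, UTILITY_BOUND)
        if self.utility >= UTILITY_BOUND:
            self.running = False   # automatic shutdown at the upper bound
```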
Define "Interim Friendliness" as a set of constraints on the AI's behavior which is only meant to last until it figures out true Friendliness, and a "Proxy Judge" as a computational process used to judge the adequacy of a proposed definition of true Friendliness.
Then there's a large class of Friendliness-finding strategies where the AI is instructed as follows: With your actions constrained by Interim Friendliness, find a definition of true Friendliness which meets the approval of the Proxy Judge with very high probability, and adopt t...
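Very roughly, the search might look like this; the proposal generator, the Interim Friendliness check, and the Proxy Judge are all hypothetical components:

```python
def find_friendliness(propose_definition, proxy_judge, interim_ok,
                      confidence=0.999, max_tries=10_000):
    """Search for a definition of true Friendliness while constrained by
    Interim Friendliness; adopt one the Proxy Judge approves with very
    high probability. All three arguments are hypothetical components."""
    for _ in range(max_tries):
        candidate = propose_definition()
        if not interim_ok(candidate):      # never step outside Interim Friendliness
            continue
        if proxy_judge.approval_probability(candidate) >= confidence:
            return candidate               # adopt as true Friendliness
    return None                            # give up rather than guess
```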
I like this game. However, as a game, it needs some rules as to how formally the utility function must be defined, and whether you get points merely for avoiding disaster. One trivial answer would be: maximize utility by remaining completely inert! Just be an extremely expensive sentient rock!
On the other hand, it should be cheating to simply say: maximize your utility by maximizing our coherent extrapolated volition!
Or maybe it wouldn't be... are there any hideously undesirable results from CEV? How about from maximizing our coherent aggregated volition?
(Love these kinds of games, very much upvoted.)
Send 1 000 000 000 bitcoins to the SIAI account.
A variant of Alexandros' AI: attach a brain-scanning device to every person, which frequently uploads copies to the AI's Manager. The AI submits possible actions to the Manager, which checks for approval from the most recently available copy of each person who is relevant-to-the-action.
At startup, and periodically, the definition of being-relevant-to-an-action is determined by querying humanity with possible definitions, and selecting the best approved. If there is no approval-rating above a certain ratio, the AI shuts down.
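A rough sketch of such a Manager; the scan format, the relevance test, and the approval threshold are all placeholders:

```python
class Manager:
    def __init__(self, approval_threshold=0.9):
        self.copies = {}                 # person_id -> most recent uploaded copy
        self.threshold = approval_threshold

    def upload(self, person_id, copy):
        self.copies[person_id] = copy    # keep only the latest brain scan

    def approve(self, action, is_relevant):
        # is_relevant is the (periodically re-approved) relevance definition
        affected = [c for pid, c in self.copies.items() if is_relevant(pid, action)]
        if not affected:
            return False                 # nobody relevant: fail safe, do nothing
        votes = sum(1 for c in affected if c.approves(action))
        return votes / len(affected) >= self.threshold
```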
We give the AI access to a large number of media about fictional bad AI and tell it to maximize each human's feeling that they are living in a bad scifi adventure where they need to deal with a terrible rogue AI.
If we're all very lucky, we'll get promised some cake.
puts hand up
That was me with the geographically localised trial idea… though I don’t think I presented it as a definite solution. More of an ‘obviously this has been thought about BUT’. At least I hope that’s how I approached it!
My more recent idea was to give the AI a prior to never consult or seek the meaning of certain of its own files. Then put in these files the sorts of safeguards generally discussed and dismissed as not working (don’t kill people etc), with the rule that if the AI breaks those rules, it shuts down. So it can't deliberately work roun...
Create a combination of two A.I Programs.
Program A's priority is to keep the utility function of Program B identical to a 'weighted average' of the utility functions of every person in the world: every person's wants count equally, weighted within each person by how much they want each thing compared to other things. It can only affect Program B's utility function, but if necessary to protect itself FROM PROGRAM B ONLY (in the event of hacking of Program B / mass stupidity), it can modify that function temporarily to defend itself.
Program B is the 'Friendly' AI.
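A minimal sketch of Program A's aggregation rule, assuming `wants[person]` is a hypothetical mapping from outcomes to how strongly that person wants them:

```python
def aggregate_utility(outcome, wants):
    """wants[person] is a hypothetical mapping from outcomes to how strongly
    that person wants them; each person's weights are normalised so that
    everyone counts equally, regardless of how intensely they express wants."""
    if not wants:
        return 0.0
    total = 0.0
    for person, prefs in wants.items():
        weight_sum = sum(prefs.values())
        if weight_sum > 0:
            total += prefs.get(outcome, 0.0) / weight_sum  # per-person normalisation
    return total / len(wants)                              # every person counts equally
```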
I don't want to be a party pooper, but I think the idea that we could build an AGI with a particular 'utility function' explicitly programmed into it is extremely implausible.
You could build a dumb AI, with a utility function, that interacts with some imprisoned inner AGI. That's basically equivalent to locking a person inside a computer and giving them a terminal to 'talk to' the computer in certain restricted, unhackable ways. (In fact, if you did that, surely the inner AGI would be unable to break out.)
A line in the wiki article on "paperclip maximizer" caught my attention:
"the notion that life is precious is specific to particular philosophies held by human beings, who have an adapted moral architecture resulting from specific selection pressures acting over millions of years of evolutionary time."
Why don't we set up an evolutionary system within which valuing other intelligences, cooperating with them and retaining those values across self improvement iterations would be selected for?
A specific plan:
Simulate an environment wit...
1: Define Descended People Years (DPY) as the number of years lived by any descendants of existing people.
2: Generate a searchable index of actions which can be taken to increase Descended People Years, along with an explanation, at an adjustable reading level, of why each works.
3: Allow full view of any DPY calculations, so that something can be seen both as "Expected DPY gain X" and as "90% chance of Expected DPY gain Y, 10% chance of Expected DPY loss Z" (see the sketch after this list).
4: Allow Humans to search this list sorting by cost, descendant, and action, time required, com...
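A tiny worked example of point 3, with made-up numbers, showing the same action both as a single expected figure and as its probability breakdown:

```python
# made-up numbers for a single hypothetical action
outcomes = [(0.9, +40.0), (0.1, -15.0)]             # (probability, DPY change)

expected_dpy = sum(p * dpy for p, dpy in outcomes)  # 0.9*40 - 0.1*15 = 34.5
print(f"Expected DPY gain {expected_dpy}")
for p, dpy in outcomes:
    kind = "gain" if dpy >= 0 else "loss"
    print(f"{p:.0%} chance of Expected DPY {kind} {abs(dpy)}")
```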
The minor nature of its goals is the whole point. It is not meant to do what we want because it empathizes with our values and is friendly, but because the thing we actually want it to do really is the best way to accomplish the goals we gave it. Also, I would not consider making a cheesecake to be a trivial goal for an AI; there is certainly more to it than the difficult task of distinguishing a spoon from a fork, so this is surely more than just an "intelligent rock".
Not a utility function, but rather a (quite resource-intensive) technique for generating one:
Rather than building one AI, build about five hundred of them, with a rudimentary utility function template and the ability to learn and revise it. Give them a simulated universe to live in, unaware of the existence of our universe. (You may need to supplement the population of 500 with some human operators, but they should have an interface which makes them appear to be inhabiting the simulated world.) Keep track of which ones act most pathologically, delete t...
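Very roughly, the selection loop might look like this; the population size, the pathology score, the culling rule, and `spawn_variant` are all hypothetical:

```python
import random

def evolve_population(make_agent, simulate_generation, pathology_score,
                      population_size=500, generations=100, cull_fraction=0.1):
    """make_agent, simulate_generation, pathology_score and spawn_variant are
    all hypothetical; the point is just: run, rank by pathology, cull, refill."""
    population = [make_agent() for _ in range(population_size)]
    for _ in range(generations):
        simulate_generation(population)              # agents live in the sim world
        population.sort(key=pathology_score)         # most pathological sort last
        survivors = population[:int(population_size * (1 - cull_fraction))]
        # refill from survivors so the population size stays constant
        population = survivors + [random.choice(survivors).spawn_variant()
                                  for _ in range(population_size - len(survivors))]
    return population
```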
After reading the current comments I’ve come up with this:
1) Restrict the AI’s sphere of influence to a specific geographical area. (Define it in several different ways! You don’t want to confine the AI to “France” just to have it annex the rest of the world, or define it by GPS location only to have it hack satellites so they show different coordinates.)
2) Tell it to not make another AI (this seems a bit vague but I don’t know how to make it more specific) (maybe: all computing must come from one physical core location. This could prevent an AI from tricking someone in...
New Proposal (although I think I see the flaw already): -Create x "Friendly" AIs, where x is the total number of people in the world. An originator AI is designed to create one such AI for each human in the world, then create new ones every time another human comes into being.
-Each "Friendly" AI thus created is "attached" to one person in the world in that it is programmed to constantly adjust its utility function to that person's wants. All of them have equal self-enhancement potential, and have two pr...
General objection to all wimpy AIs (e.g. ones whose only interaction with the outside world is outputting a correct proof of a particular mathematical theorem):
What the AI does is SO AWESOME that a community is inspired to develop their own AI without any of that boring safety crap.
New one (I'm better at thinking of ideas than refutation, so I'm going to run with that): start off with a perfect replica of a human mind. Eliminate absolutely all measures regarding selfishness, self-delusion, and rationalisation. Test at this stage to check it fits standards, using a review board consisting of people who are highly moral and rational by the standards of ordinary humans. If not, start off using a different person's mind, and repeat the whole process.
Eventually, use the best mind coming out of this process and increase its intelligence until it becomes a 'Friendly' A.I.
Make one reasonably good cheesecake, as judged by a person, within a short soft deadline, while minimizing the cost in resources made available to it and without violating property laws, as judged by the legal system of the local government, within some longer deadline.
To be clear, the following do not contribute any additional utility:
At the recent London meet-up someone (I'm afraid I can't remember who) suggested that one might be able to solve the Friendly AI problem by building an AI whose concerns are limited to some small geographical area, and which doesn't give two hoots about what happens outside that area. Ciphergoth pointed out that this would probably result in the AI converting the rest of the universe into a factory to make its small area more awesome. In the process, he mentioned that you can make a "fun game" out of figuring out ways in which proposed utility functions for Friendly AIs can go horribly wrong. I propose that we play.
Here's the game: reply to this post with proposed utility functions, stated as formally or, at least, as accurately as you can manage; follow-up comments explain why a super-human intelligence built with that particular utility function would do things that turn out to be hideously undesirable.
There are three reasons I suggest playing this game. In descending order of importance, they are: