Start the AI in a sandbox universe, like the "game of life". Give it a prior saying that universe is the only one that exists (no universal priors plz), and a utility function that tells it to spell out the answer to some formally specified question in some predefined spot within the universe. Run for many cycles, stop, inspect the answer.
A prior saying that this is the only universe that exists isn't very useful, since then it will only treat everything as being part of the sandbox universe. It may very well break out, but think that it's only exploiting weird hidden properties of the game of life-verse. (Like the way we may exploit quantum mechanics without thinking that we're breaking out of our universe.)
Start the AI in a sandbox universe. Define its utility function over 32-bit integers. Somewhere inside the sandbox, put something that sets its utility to INT_MAX, then halts the simulation. Outside the sandbox, leave documentation of this readily accessible. The AI should never try to do something elaborately horrible, because it can get max utility easily enough from inside the simulation; if it does escape the box, it should go back in to collect its INT_MAX utility.
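For concreteness, here is a minimal sketch of the scoring side of that proposal. The names `Sandbox`, `add_utility`, and `press_jackpot` are made up for illustration, and nothing here models the AI itself; it only shows utility saturating at the 32-bit maximum and the simulation halting once the in-sandbox device is used.

```python
INT_MAX = 2**31 - 1   # largest signed 32-bit integer
INT_MIN = -2**31      # smallest signed 32-bit integer


def clamp_utility(value: int) -> int:
    """Keep a utility value inside the signed 32-bit range."""
    return max(INT_MIN, min(INT_MAX, value))


class Sandbox:
    """Hypothetical sandbox that tracks one agent's utility."""

    def __init__(self) -> None:
        self.utility = 0
        self.halted = False

    def add_utility(self, delta: int) -> None:
        # Once the simulation has halted, nothing can change the score.
        if self.halted:
            return
        self.utility = clamp_utility(self.utility + delta)

    def press_jackpot(self) -> None:
        # The device placed inside the sandbox: max utility, then halt.
        self.utility = INT_MAX
        self.halted = True


sandbox = Sandbox()
sandbox.add_utility(5)
sandbox.press_jackpot()
sandbox.add_utility(10)  # ignored, the simulation has already halted
```

Because the score cannot exceed INT_MAX, no outside action can beat pressing the device, which is the whole bet the proposal makes.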
The AI gets positive utility from having been created, and that is the whole of its utility function. It's given a sandbox full of decision-theoretic problems to play with, and is put in a box (i.e. it can't meaningfully influence the outside world until it has superhuman intelligence). Design it in such a way that it's initially biased toward action rather than inaction if it anticipates equal utility from both.
Unless the AI develops some sort of non-causal decision theory, it has no reason to do anything. If it develops TDT, it will try to act in acco...
The Philosophical Insight Generator - Using a model of a volunteer's mind, generate short (<200 characters, say) strings that the model rates as highly insightful after reading each string by itself, and print out the top 100000 such strings (after applying some semantic distance criteria or using the model to filter out duplicate insights) after running for a certain number of ticks.
Have the volunteer read these insights along with the rest of the FAI team in random order, discuss, update the model, then repeat as needed.
So, here's my pet theory for AI that I'd love to put out of its misery: "Don't do anything your designer wouldn't approve of". It's loosely based on the "Gandhi wouldn't take a pill that would turn him into a murderer" principle.
A possible implementation: Make an emulation of the designer and use it as an isolated component of the AI. Any plan of action has to be submitted for approval to this component before being implemented. This is nicely recursive and rejects plans such as "make a plan of action deceptively complex such that...
You flick the switch, and find out that you are a component of the AI, now doomed to an unhappy eternity of answering stupid questions from the rest of the AI.
This is a problem. But if this is the only problem, then it is significantly better than paperclip universe.
Oracle AI - its only desire is to provide the correct answer to yes or no questions posed to it in some formal language (sort of an ueber Watson).
Comment upvoted for starting the game off! Thanks!
Q: Is the answer to the Ultimate Question of Life, the Universe, and Everything 42?
A: Tricky. I'll have to turn the solar system into computronium to answer it. Back to you as soon as that's done.
Oracle AI - its only desire is to provide the correct answer to yes or no questions posed to it in some formal language (sort of an ueber Watson).
Oops. The local universe just got turned into computronium. It is really good at answering questions though. Apart from that you gave it a desire to provide answers. The way to ensure that it can answer questions is to alter humans such that they ask (preferably easy) questions as fast as possible.
Give the AI a bounded utility function where it automatically shuts down when it hits the upper bound. Then give it a fairly easy goal such as 'deposit 100 USD in this bank account.' Meanwhile, make sure the bank account is not linked to you in any fashion (so the AI doesn't force you to deposit the 100 USD in it yourself, rendering the exercise pointless.)
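A rough sketch of the "bounded utility, then shut down" rule, under the assumption that utility simply accumulates from deposits. `BoundedAgent` and `receive` are hypothetical names, and the sketch says nothing about how the agent would go about earning the money.

```python
from dataclasses import dataclass


@dataclass
class BoundedAgent:
    """Hypothetical agent whose utility is capped at upper_bound."""

    upper_bound: float
    utility: float = 0.0
    running: bool = True

    def receive(self, reward: float) -> None:
        # A shut-down agent no longer accumulates anything.
        if not self.running:
            return
        # Utility can never exceed the bound.
        self.utility = min(self.upper_bound, self.utility + reward)
        # Hitting the bound triggers the automatic shutdown.
        if self.utility >= self.upper_bound:
            self.running = False


agent = BoundedAgent(upper_bound=100.0)
for deposit in (40.0, 40.0, 40.0, 40.0):
    agent.receive(deposit)
```

Here the third deposit reaches the bound and shuts the agent down, so the fourth is never counted.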
Define "Interim Friendliness" as a set of constraints on the AI's behavior which is only meant to last until it figures out true Friendliness, and a "Proxy Judge" as a computational process used to judge the adequacy of a proposed definition of true Friendliness.
Then there's a large class of Friendliness-finding strategies where the AI is instructed as follows: With your actions constrained by Interim Friendliness, find a definition of true Friendliness which meets the approval of the Proxy Judge with very high probability, and adopt t...
I like this game. However, as a game, it needs some rules as to how formally the utility function must be defined, and whether you get points merely for avoiding disaster. One trivial answer would be: maximize utility by remaining completely inert! Just be an extremely expensive sentient rock!
On the other hand, it should be cheating to simply say: maximize your utility by maximizing our coherent extrapolated volition!
Or maybe it wouldn't be... are there any hideously undesirable results from CEV? How about from maximizing our coherent aggregated volition?
(Love these kinds of games, very much upvoted.)
Send 1 000 000 000 bitcoins to the SIAI account.
A variant of Alexandros' AI: attach a brain-scanning device to every person, which frequently uploads copies to the AI's Manager. The AI submits possible actions to the Manager, which checks for approval from the most recently available copy of each person who is relevant-to-the-action.
At startup, and periodically, the definition of being-relevant-to-an-action is determined by querying humanity with possible definitions, and selecting the best approved. If there is no approval-rating above a certain ratio, the AI shuts down.
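The Manager's two checks could be sketched like this. Everything named here (`approve_action`, `choose_relevance_definition`, the `Approver` callables standing in for uploaded copies) is an illustrative assumption, not something the proposal specifies.

```python
from typing import Callable, Dict, Optional

# Stand-in for the most recent uploaded copy of a person:
# given an action, it says whether that person approves.
Approver = Callable[[str], bool]


def approve_action(
    action: str,
    copies: Dict[str, Approver],
    relevant: Callable[[str, str], bool],
) -> bool:
    """Approve only if every person relevant to the action approves."""
    voters = [name for name in copies if relevant(name, action)]
    # Note: an action relevant to nobody is approved vacuously.
    return all(copies[name](action) for name in voters)


def choose_relevance_definition(
    ratings: Dict[str, float], min_ratio: float
) -> Optional[str]:
    """Pick the best-approved definition, or None to signal shutdown."""
    if not ratings:
        return None
    best = max(ratings, key=lambda definition: ratings[definition])
    return best if ratings[best] > min_ratio else None
```

The vacuous-approval case flagged in the comment is one obvious place for a follow-up to start looking for trouble.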
We give the AI access to a large number of media about fictional bad AI and tell it to maximize each human's feeling that they are living in a bad scifi adventure where they need to deal with a terrible rogue AI.
If we're all very lucky, we'll get promised some cake.
puts hand up
That was me with the geographically localised trial idea… though I don’t think I presented it as a definite solution. More of an ‘obviously this has been thought about BUT’. At least I hope that’s how I approached it!
My more recent idea was to give the AI a prior to never consult or seek the meaning of certain of its own files. Then put in these files the sorts of safeguards generally discussed and dismissed as not working (don’t kill people etc), with the rule that if the AI breaks those rules, it shuts down. So it can't deliberately work roun...
Create a combination of two AI programs.
Program A's priority is to keep the utility function of Program B identical to a 'weighted average' of the utility function of every person in the world: every person's wants count equally, with each want weighted as a percentage according to how much that person wants it compared to other things. It can only affect Program B's utility function, but if necessary to protect itself FROM PROGRAM B ONLY (in the event of hacking of Program B/mass stupidity) can modify it temporarily to defend itself.
Program B is the 'Friendly' AI.
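One possible reading of that 'weighted average', as a sketch: each person's want strengths are normalised to fractions summing to 1 (so everyone counts equally), and the fractions are then averaged across people. The function name `aggregate_wants` and the dictionary layout are assumptions made for the example.

```python
from typing import Dict


def aggregate_wants(wants: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average per-person normalised want strengths over all people."""
    totals: Dict[str, float] = {}
    people = 0
    for strengths in wants.values():
        person_total = sum(strengths.values())
        if person_total <= 0:
            continue  # a person with no expressed wants contributes nothing
        people += 1
        for outcome, strength in strengths.items():
            # Each person hands out exactly one unit of weight in total.
            share = strength / person_total
            totals[outcome] = totals.get(outcome, 0.0) + share
    if people == 0:
        return {}
    return {outcome: total / people for outcome, total in totals.items()}


weights = aggregate_wants({
    "person_1": {"parks": 1.0, "roads": 1.0},
    "person_2": {"parks": 3.0, "roads": 1.0},
})
```

In this example the first person splits their weight evenly and the second puts three quarters on parks, so parks end up with 0.625 of the total and roads with 0.375.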
I don't want to be a party pooper, but I think the idea that we could build an AGI with a particular 'utility function' explicitly programmed into it is extremely implausible.
You could build a dumb AI, with a utility function, that interacts with some imprisoned inner AGI. That's basically equivalent to locking a person inside a computer and giving them a terminal to 'talk to' the computer in certain restricted, unhackable ways. (In fact, if you did that, surely the inner AGI would be unable to break out.)
A line in the wiki article on "paperclip maximizer" caught my attention:
"the notion that life is precious is specific to particular philosophies held by human beings, who have an adapted moral architecture resulting from specific selection pressures acting over millions of years of evolutionary time."
Why don't we set up an evolutionary system within which valuing other intelligences, cooperating with them and retaining those values across self improvement iterations would be selected for?
A specific plan:
Simulate an environment wit...
1: Define Descended People Years as number of years lived by any descendants of existing people.
2: Generate a searchable index of actions which can be taken to increase Descended People Years, along with an explanation on an adjustable reading level as to why it works.
3: Allow Full view of any DPY calculations, so that something can be seen as both "Expected DPY gain X" and "90% chance of Expected DPY gain Y, 10% chance of Expected DPY loss Z"
4: Allow Humans to search this list sorting by cost, descendant, and action, time required, com...
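The two views in step 3 are related by ordinary expected value. A small sketch, assuming each action is described by a list of (probability, DPY change) pairs; the function name `expected_dpy` and the numbers used are illustrative only.

```python
from typing import List, Tuple


def expected_dpy(outcomes: List[Tuple[float, float]]) -> float:
    """Expected change in Descended People Years for one action.

    outcomes is a list of (probability, dpy_change) pairs; losses are
    written as negative changes.
    """
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * change for p, change in outcomes)


# "90% chance of gain Y, 10% chance of loss Z" with Y = 100 and Z = 50
# collapses to a single expected gain X of about 85.
x = expected_dpy([(0.9, 100.0), (0.1, -50.0)])
```

Showing both views matters because two actions with the same expected gain can carry very different downside risk.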
The minor nature of its goals is the whole point. It is not meant to do what we want because it empathizes with our values and is friendly, but because the thing we actually want it to do really is the best way to accomplish the goals we gave it. Also, I would not consider making a cheesecake to be a trivial goal for an AI; there is certainly more to it than the difficult task of distinguishing a spoon from a fork, so this is surely more than just an "intelligent rock".
Not a utility function, but rather a (quite resources-intensive) technique for generating one:
Rather than building one AI, build about five hundred of them, with a rudimentary utility function template and the ability to learn and revise it. Give them a simulated universe to live in, unaware of the existence of our universe. (You may need to supplement the population of 500 with some human operators, but they should have an interface which makes them appear to be inhabiting the simulated world.) Keep track of which ones act most pathologically, delete t...
After reading the current comments I’ve come up with this:
1) Restrict the AI’s sphere of influence to a specific geographical area (Define it in several different ways! You don’t want to confine the AI in “France” just to have it annex the rest of the world. Or by gps location and have it hack satellites so they show different coordinates.)
2) Tell it to not make another AI (this seems a bit vague but I don’t know how to make it more specific) (maybe: all computing must come from one physical core location. This could prevent an AI from tricking someone in...
New Proposal (although I think I see the flaw already): -Create x "Friendly" AIs, where x is the total number of people in the world. An originator AI is designed to create one such AI for each human in the world, then create a new one every time another human comes into being.
-Each "Friendly" AI thus created is "attached" to one person in the world in that it is programmed to constantly adjust its utility function to that person's wants. All of them have equal self-enhancement potential, and have two pr...
General objection to all wimpy AIs (e.g. ones whose only interaction with the outside world is outputting a correct proof of a particular mathematical theorem):
What the AI does is SO AWESOME that a community is inspired to develop their own AI without any of that boring safety crap.
New one (I'm better at thinking of ideas than refutation, so I'm going to run with that): start off with a perfect replica of a human mind. Eliminate absolutely all measures regarding selfishness, self-delusion, and rationalisation. Test at this stage to check it fits standards, using a review board consisting of people who are highly moral and rational by the standards of ordinary humans. If not, start off using a different person's mind, and repeat the whole process.
Eventually, use the best mind coming out of this process and increase its intelligence until it becomes a 'Friendly' AI.
Make 1 reasonably good cheesecake, as judged by a person, within a short soft deadline, while minimizing the cost to the resources made available to it and without violating property laws, as judged by the legal system of the local government, within some longer deadline.
To be clear the following do not contribute any additional utility:
I believe the idea is that the AI will need to calculate the CEV, not the programmers (or it's not CEV). And the AI will have a whole lot more statistical data to calculate the CEV of humanity than the CEV of individual contributors.
The programmers want the AI to calculate CEV because they expect CEV to be something they will like. We can't calculate CEV ourselves, but that doesn't mean we don't know any of CEV's (expected) properties.
However, we might be wrong about what CEV will turn out to be like, and we may come to regret pre-committing to CEV. That's why I think we should prefer CEV, because we can predict it better.
So you want hard-coded compromises that oppose and override what these people would collectively prefer to do if they were more intelligent, more competent and more self-aware?
What I meant was that they might oppose and override some of the input to the CEV from the rest of humanity.
However, it might also be a good idea to override some of your own CEV results, because we don't know in advance what the CEV will be. We define the desired result as "the best possible extrapolation", but our implementation may produce something different. It's very dangerous to precommit the whole future universe to something you don't yet know at the moment of precommitment (my point number 1). So, you'd want to include overrides about things you're certain should not be in the CEV.
Do you believe that fundamentalist religion would exist if fundamentalist religionists believed that their religion was false, and were also completely self-aware?
This is a misleading question.
If you are certain that the CEV will decide against fundamentalist religion, you should not oppose precommitting the AI to oppose fundamentalist religion, because you're certain this won't change the outcome. If you don't want to include this modification to the AI, that means you 1) accept there is a possibility of religion being part of the CEV, and 2) want to precommit to living with that religion if it is part of the CEV.
Why do you think a CEV (which essentially means what people would want if they were as intelligent as the AI) would support a dangerous meme?
Maybe intelligent people like dangerous memes. I don't know, because I'm not yet that intelligent. I do know though that having high intelligence doesn't imply anything about goals or morals.
Broadly, this question is similar to "why do you think this brilliant AI-genie might misinterpret our request to alleviate world hunger?"
I don't think that the 9999 first contributors get to vote on whether they'll accept a donation from the 10,000th one.
Why not? If they're controlling the project at that point, they can make that decision.
And unless you believe these 10,000 people can create and defend their own country BEFORE the AI gets created, I'd urge not being vocal about them excluding everyone else, when developments in AI become close enough that the whole world starts paying serious attention.
I'm not being vocal about any actual group I may know of that is working on AI :-)
I might still want to be vocal about my approach, and might want any competing groups to adopt it. I don't have good probability estimates on this, but it might be the case that I would prefer CEV to CEV.
That's why CEV is far better than CEV.
Why are you certain of this? At the very least it depends on who the person contributing money is.
"Humanity" includes a huge variety of different people. Depending on the CEV it may also include an even wider variety of people who lived in the past and counterfactuals who might live in the future. And the CEV, as far as I know, is vastly underspecified right now - we don't even have a good conceptual test that would tell us if a given scenario is a probable outcome of CEV, let alone a generative way to calculate that outcome.
Saying that the CEV "will best please everyone" is just handwaving this aside. Precommitting the whole future lightcone to the result of a process we don't know in advance is very dangerous, and very scary. It might be the best possible compromise between all humans, but it is not the case that all humans have equal input into the behavior of the first AI. I have not seen any good arguments claiming that implementing CEV is a better strategy than just trying to build the first AI before anyone else and then making it implement a narrow CEV.
Suppose that the first AI is fully general, and can do anything you ask of it. What reason is there for its builders, whoever they are, to ask it to implement CEV rather than CEV?
In an idealized form, I agree with you.
That is, if I really take the CEV idea seriously as proposed, there simply is no way I can prefer CEV(me + X) to CEV(me)... if it turns out that I would, if I knew enough and thought about it carefully enough and "grew" enough and etc., care about other people's preferences (either in and of themselves, as in "I hadn't thought of that but now that you point it out I want that too", or by reference to their owners, as in "I don't care about that but if you do then fine let's have that too,"...
At the recent London meet-up someone (I'm afraid I can't remember who) suggested that one might be able to solve the Friendly AI problem by building an AI whose concerns are limited to some small geographical area, and which doesn't give two hoots about what happens outside that area. Ciphergoth pointed out that this would probably result in the AI converting the rest of the universe into a factory to make its small area more awesome. In the process, he mentioned that you can make a "fun game" out of figuring out ways in which proposed utility functions for Friendly AIs can go horribly wrong. I propose that we play.
Here's the game: reply to this post with proposed utility functions, stated as formally or, at least, as accurately as you can manage; follow-up comments explain why a super-human intelligence built with that particular utility function would do things that turn out to be hideously undesirable.
There are three reasons I suggest playing this game. In descending order of importance, they are: