It is not clear that there is any general way to design a Predictor that will not exhibit goal-seeking behavior, short of dramatically limiting the power of the Predictor.
Not sure if this is a new idea or how safe it is, but we could design a Predictor that incorporates a quantum random number generator, such that with some small probability it will output "no predictions today, run me again tomorrow". Then have the Predictor make predictions that are conditional on it giving the output "no predictions today, run me again tomorrow".
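If it helps to make the shape of the proposal concrete, here is a toy sketch (my own construction; the names and the abstention probability are made up, and the hard part, estimating the abstention-branch future, is left as a stub):

```python
# Toy sketch of the proposal above (not a real design).  With small
# probability the Predictor abstains; otherwise it reports a prediction
# about the counterfactual branch in which it abstained, so the announced
# prediction never feeds back into the events it describes.

import random  # stand-in for the quantum random number generator

ABSTAIN = "no predictions today, run me again tomorrow"

def run_predictor(predict_assuming_abstention, abstain_prob=0.001):
    """predict_assuming_abstention() must estimate the future *conditional on
    the Predictor outputting ABSTAIN today* -- that conditioning is where all
    the real difficulty lives and is only gestured at here."""
    if random.random() < abstain_prob:          # the quantum coin says "abstain"
        return ABSTAIN
    return predict_assuming_abstention()        # prediction about the abstention branch
```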
It wouldn't be a new idea, if only I were a little smarter.
Last year when describing this Predictor problem I phrased all my danger examples in the form "a self-incorporating prediction of X might be vastly worse than a prediction of X in a counterfactual Predictor-free world", yet it never occurred to me that "just ask for a prediction of X in a Predictor-free world" was a possible solution. I think it would have to be a "Predictor-free world" not just a "Predictor-free tomorrow", though; there's nothing that says those dangerous feedback loops would all resolve themselves in 24 hours.
I think this solution still starts failing as society starts to rely on Predictors, though. Suppose Predictor-guided decisions are much better than others. Then major governments and businesses start to rely on Predictor-guided decisions... and then the predictions all start coming up "in a Predictor-free world, the economy crashes".
It wouldn't be a new idea, if only I were a little smarter.
It might be interesting to write down what it took to come up with such an obvious-in-retrospect idea. What happened was I started thinking "Wait, decision procedures need to use predictors as part of making decisions. Why isn't this a problem there? Or is it?" I then realized that a predictor used by a decision procedure does not leak its output into the world, except through the decision itself, and it only makes predictions that are conditional on a specific decision, which breaks the feedback loop. That gave me the idea described in the grandparent comment.
I think it would have to be a "Predictor-free world" not just a "Predictor-free tomorrow", though; there's nothing that says those dangerous feedback loops would all resolve themselves in 24 hours.
Yeah, I was thinking that the user would limit their questions to things that happen before the next Predictor run, like tomorrow's stock prices. But I'm also not sure what kind of dangerous feedback loops might occur if they don't. Can you think of an example?
The WW3 example from my comment last year holds up over long time frames.
Any kind of technological arms race could be greatly accelerated if "What will we be manufacturing in five years?" became a predictable question.
Thinking this over a bit more, it seems that the situation of Predictors being in feedback loops with each other is already the case today. Each of us has a Predictor in our own brain that we make use of to make decisions, right? As I mentioned above, we can break a Predictor's self-feedback loop by conditionalizing its predictions on our decisions, but each Predictor still needs to predict other Predictors which are in turn trying to predict it.
Is there reason to think that with more powerful Artificial Predictors, the situation would be worse than today?
We do indeed have billions of seriously flawed predictors walking around today, and feedback loops between them are not a negligible problem. Going back to that example, we nearly managed to start WW3 all by ourselves without waiting for artificially intelligent assistance. And it's easy to come up with a half a dozen contemporary examples of entire populations thinking "what we're doing to them may be bad, but not as bad as what they'd do to us if we let up".
It's entirely possible that the answer to the Fermi Paradox is that there's a devastatingly bad massively multiplayer Mutually Assured Destruction situation waiting along the path of technological development, one in which even a dumb natural predictor can reason "I predict that a few of them are thinking about defecting, in which case I should think about defecting first, but once they realize that they'll really want to defect, and oh damn I'd better hit that red button right now!" And the next thing you know all the slow biowarfare researchers are killed off by a tailored virus that left the fastest researchers alone (to pick an exaggerated trope out of a hat). Artificial Predictors would make such things worse by speeding up the inevitable.
Even if a situation like that isn't inevitable with only natural intelligences, Oracle AIs might make one inevitable by reducing the barrier to entry for predictions. When it takes more than a decade of dedicated work to become a natural expert on something, people don't want to put in that investment to become an expert on evil. If becoming an expert on evil merely requires building an automated Question-Answerer for the purpose of asking it good questions, and then succumbing to temptation and asking it an evil question too, proliferation of any technology with evil applications might become harder to stop. Research and development that is presently guided by market forces, government decisions, and moral considerations would instead proceed in the order of "which new technologies can the computer figure out first".
And a Predictor asked to predict "What will we do based on your prediction" is effectively a lobotomized Question-Answerer, for which we can't phrase questions directly, leaving us stuck with whatever implicit questions (almost certainly including "which new technologies can computers figure out first") are inherent in that feedback loop.
(On the theme of quantum random number generators, if for some reason ontotechnology is possible then running an AI that explores the space of possible self-modifications based on quantum random seeds is significantly more dangerous than running it based on pseudorandom seeds, as you only need to get ontotechnology in a vanishingly small fraction of worlds in order to change the entire ensemble. I think this is a reductio ad absurdum of the idea of ontotechnology, as the universe should of course already be at equilibrium with respect to such total rewrites, but there are really weird, moderately interesting, and almost assuredly misguided ideas in roughly this area of inquiry.)
Then have the Predictor make predictions that are conditional on it giving the output "no predictions today, run me again tomorrow".
The Predictor may (per Solomonoff induction) simulate the real world, including itself, but that does not necessarily mean it will recognize its own simulation as itself. It will not even necessarily recognize that it is simulating anything; it may be something like "I am calculating this equation, I have no idea what it means, but its results make my masters happy, so I will continue calculating it". So it will not realize that your command applies to this specific situation.
This is an anthropomorphization, but technically speaking, to implement a command like "when you simulate yourself, assume the output is X" you need to specify a "simulation" predicate and an "itself" predicate, otherwise the Predictor will not use the rule. What happens if the Predictor's simulation is imprecise, but still good enough to provide good answers about the real world? Should it recognize the imprecise simulation of itself as "itself" too? What if this imprecise simulation does not contain the quantum random number generator; how will the rule apply here?
Also, in some situations the answer to "what happens if I don't make a prediction" is useless... the more useful the Predictor proves, the more often this will happen, because people will use the predictions for their important actions, so the answer to "what happens if I don't make a prediction" will be something like "humans will wait another day" (which does not say what would happen if humans actually did something instead of waiting). Also, if the Predictor refuses to provide an answer too often -- say 1000 times in a row; the simulations of "what happens if I don't make a prediction" may have this situation as an attractor -- humans will assume it is somehow broken and perhaps build another AI; now the Predictor may actually be predicting what that other AI would do.
If Predictors are to be modeled as accuracy-maximizing agents, they could acausally cooperate with each other, so that one Predictor optimizes its accuracy about a world where it's absent through controlling the predictions of another Predictor that is present in that world.
I don't think that would make much of a difference, because then it can still answer conditional on what its answer would be when asked the same question again the next day, which would presumably be the same answer unless it got new information during that wait period.
A perhaps safer but less efficient alternative is to design a Predictor (without keeping any backup copies of its source code), such that with some small probability, it will delete its source code and shut down instead of answering your question. Then if you still want to know the answer, you have to design a new Predictor with the same specifications but a different algorithm. The Predictor's answer (if it gives one) refers to what would happen conditional on its shutting down.
Decision-theoretic variant: build an agent that incorporates a quantum random number generator, such that with some small probability it will output a random action. Then have the agent calculate how much expected utility each action would imply if it were chosen because of the random number generator, and output the best one.
Unless I'm missing something, this agent doesn't play very well in Newcomblike problems, but seems to be a good enough formalization of CDT. I cannot define it as formally as I'd like, though, because how do you write a computer program that refers to a specific quantum event in the outside world?
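For what it's worth, here is a rough sketch of what I have in mind (my own construction; it dodges the quantum-event-reference problem by treating the random override as a black-box flag, and all of the real difficulty is hidden inside expected_utility):

```python
# Rough sketch of the agent described above (not a formal definition).
# With small probability a random action is output; otherwise the agent
# outputs the action with the highest expected utility *conditional on
# that action having been chosen by the random number generator*.

import random  # stand-in for the quantum random number generator

def choose_action(actions, expected_utility, override_prob=1e-6):
    """expected_utility(a) is assumed to estimate utility conditional on
    action a being output because the random override fired."""
    if random.random() < override_prob:
        return random.choice(actions)           # rare random override
    return max(actions, key=expected_utility)   # best action in the override branch
```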
I'm not sure there is a way to make sense of such utility-definitions. What fixed question that relates to utility value is being answered by observing the result of a random number generator? The original state of the world is not clarified (both states of the random result were expected, not correlated with anything interesting), so the state of knowledge about utility defined in terms of the original state of the world won't be influenced by these observations, except accidentally.
Can someone help me understand why a non-Friendly Question-Answerer is a bad idea?
A Question-Answerer is a system that [...] somehow computes the "answer to the question." To analyze the difficulty of creating a Question-Answerer, suppose that we ask it the question "what ought we (or I) to do?" [...]
If it cannot answer this question, many of its answers are radically unsafe. Courses of action recommended by the Question-Answerer will likely be unsafe, insofar as "safety" relies on the definition of human value.
I understand that such an AI won't be able to tell me if something is safe. But if it doesn't have goals, it won't try to persuade me that anything is safe. So this sounds like my daily life: there are tools I can use to find answers to some of my questions, but ultimately it is I who must decide whether something is safe or not. This AI doesn't sound dangerous.
EDIT: Can someone give an example of a disaster involving such an AI?
It seems like the worst it could do is misunderstand your question and give you a recipe for gray goo when you really wanted a recipe for a cake. Bonus points if the gray goo recipe looks a lot like a cake recipe.
It seems to me that I often see people thinking about FAI assuming a best-case scenario where all intelligent people are Less Wrong users who see friendliness as paramount, and discarding solutions that don't have above a 99.9% chance of succeeding. But really we want an entire stable of solutions, depending on how potential UFAI projects are going, right?
Bonus points if the gray goo recipe looks a lot like a cake recipe.
More bonus points if the recipe really generates a cake... which later with some probability turns into the gray goo.
Now you can have your cake and it will eat you too. :D
I don't believe that a gray goo recipe can look like a cake recipe. I believe there are recipes for disastrously harmful things that look like recipes for desirable things; but is a goal-less Question Answerer producing a deceitful recipe more likely than a human working alone accidentally producing one?
The problem of making the average user as prudent as a Less Wrong user seems much easier than FAI. Average users already know to take the results of Wolfram Alpha and Google with a grain of salt. People working on synthetic organisms and nuclear radiation already know to take precautions when doing anything for the first time.
My point about assuming the entire world were Less Wrong users is that there are teams, made up of people who are not Less Wrong users, who will develop UFAI if we wait long enough. So a quick and slightly dirty plan (like making this sort of potentially dangerous Oracle AI) may beat a slow and perfect one.
Can someone give an example of a disaster involving such an AI?
The AI might find answers that satisfy the question but violate background assumptions we never thought to include and wouldn't realize until it was too late (if even then). An easy-to-imagine one that we wouldn't fall for is a cure for cancer that succeeds by eradicating all cellular life. Of course, it's more difficult to come up with one that we would fall for, but anything involving cognitive modifications would be a candidate.
So, the reason we wouldn't fall for that one is that the therapy wouldn't pass the safety tests required by first-world governments. We have safety tests for all sorts of new technologies, with the stringency of the tests depending on the kind of technology — some testing for children's toys, more testing for drugs, hopefully more testing for permanent cognitive enhancement. It seems like these tests should protect us from a Question-Answerer as much as from human mistakes.
Actual unfriendly AI seems scarier because it could try to pass our safety tests, in addition to accomplishing its terminal goals. But a Question-Answerer designing something that passes all the tests and nevertheless causes disaster seems about as likely as a well-intentioned but not completely competent human doing the same.
I guess I should have asked for a disaster involving a Question-Answerer which is more plausible than the same scenario with the AI replaced by a human.
I see there is no discussion here of Oracle AIs with transparent reasoning processes. How might things change if we had an automatic, detailed rationale for everything the Oracle said?
Also, I think there might be a good argument for building a true Oracle AI instead of a Friendly AI once we think we have friendliness solved, if the consequences of getting friendliness only almost right are less terrible for an Oracle (would a true Oracle AI that didn't understand that humans could get bored really be that bad? I can give good advice to my 12-year-old brother even though my model of him is not perfect, and it's not clear to me that my advice would become harmful if I were much smarter. To reverse the situation, my dad's goals for me are not exactly the same as my goals for me, and if my dad's intelligence were enhanced radically, I would prefer to hear his advice and be wary of it than to give him complete power over my life).
But I guess that's no different than saying we should do everything we can to box an FAI even if we think we have one, with one particular kind of box being behavioral constraints added to the AI's goal function.
How might things change if we had an automatic, detailed rationale for everything the Oracle said?
Are you suggesting an AI that provides a rationale for every answer it gives, or an AI that provides a summary of all the reasoning it used to arrive at that answer? I can think of a criticism of either one.
Suppose you ask it to design a synthetic microbe that makes healthy bacon. It could give you the genome for healthy bacon, and also a document that describes how the microbe works and how it's really good at making healthy bacon. If the oracle were not friendly, you might wonder what else this microbe will do once you create it.
Or the oracle could give you a list of all the designs it considered, and good reasons for rejecting them, and the procedure it used to generate that list. But this seems like way more information than humans could ever check.
I'm suggesting that we have an independent "auditor" process that's capable of summarizing and answering questions about the AI's reasoning at any level of detail, but that the "auditor" is not under direct control of the AI (if the AI gains control of the "auditor", it's escaped from its box). For example, the AI could generate detailed log files (as a fundamental part of its architecture, the AI would log all of its reasoning), and there could be a separate piece of software designed to interpret and summarize the log files. (This separate software might not need to be very intelligent.)
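A toy sketch of the separation I mean (entirely made up; the file name and format are arbitrary): the AI appends structured log entries as it reasons, and a separate, much simpler auditor program reads them, with no channel running the other way.

```python
# Toy sketch of the logging/auditing split (illustrative only).
import json

LOG_PATH = "oracle_reasoning.log"   # hypothetical append-only log file

def log_step(step_type, content):
    """Called by the AI's architecture for every reasoning step."""
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({"type": step_type, "content": content}) + "\n")

def audit(step_type):
    """Run by the auditor, a separate process: collect all steps of one type."""
    with open(LOG_PATH) as f:
        entries = [json.loads(line) for line in f]
    return [e["content"] for e in entries if e["type"] == step_type]
```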
Well, if the auditor were not very intelligent, I would wonder if the oracle had cleverly followed a long chain of reasoning that would produce logs that would trick the auditor into answering your queries about the oracle's reasoning to your satisfaction. Perhaps the chain of reasoning seems like a legitimate search for bacon, but due to subtleties inscrutable to anything less than a superintelligence, it actually leads to bacon that takes over the world.
If the auditor were sufficiently intelligent to see through such machinations, I would wonder if the auditor and oracle weren't acausally cooperating.
I haven't thought about this much, and I'm not confident at all. If you find my objections inadequate, they probably are.
I don't find this a very satisfactory reply. The auditor could have access to knowledge about physical resources consumed (electrical usage, heat generated). An attempt to trick the auditor (whatever that means) would require a strictly larger search of the solution space, requiring more computational energy to be expended.
It's not really clear what you mean by 'trick the auditor' anyway. The auditor would have access to logs of the computational substrate running the AGI, portions of which could be inspected without summarization. Even if the AGI had such a capability, it would have had to develop that trickery somehow, and that development would be observable.
Don't worry, I haven't thought about it much either. I hope I'm not doing anything to pollute discussion of this topic. I figure in the worst case a more serious researcher can skim through our discussion to see if we thought of anything they didn't. (Maybe we should have a separate discussion thread about reading one should definitely do before commenting on these topics, or about what we can do to explore the problem as thoroughly as possible before proposing solutions?)
The auditor would be able to see the oracle's reasoning behind obscuring future chains of reasoning.
Ideally the Oracle wouldn't have much awareness of the details of how things were logged or how the auditor worked. There's no reason why there couldn't be more than one auditor, btw, each using a different scheme for auditing.
would a true Oracle AI that didn't understand that humans could get bored really be that bad?
The important difficulty is not in making an AGI realize that humans can get bored; it is in making it realize that this is undesirable. It would also have to be able to figure out whether it would be desirable to remove humans' capacity for boredom, effectively wireheading them into being willing to enjoy repetitive activities.
I do agree that this could be less of a problem for an Oracle non-AI than for a goal-directed system, depending on other details of the design.
Since model M has much of P’s distribution’s probability mass, P(d) is approximately equal to the probability of M if M computes d (call this M→d), and zero otherwise.
I found this sentence confusing.
By the total probability rule, we can say that the probability of the data being seen in the future is the sum of two numbers: the probability of (d and M), and the probability of (d and not-M).
If we assume for the sake of argument that the probability of model M being true is 1, then the second number becomes zero. By the definition of conditional probability, the first number is the product of the probability of the data being seen given that model M is true and the probability that model M is true. Since we are assuming that the probability of model M being true is one, this reduces to the probability of the data being seen given that model M is true.
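In symbols (my own compact restatement of the reasoning above):

\[
P(d) = P(d \mid M)\,P(M) + P(d \mid \neg M)\,P(\neg M) \;\approx\; P(d \mid M) \quad \text{when } P(M) \approx 1.
\]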
So could we rewrite that sentence as
Since model M has much of P’s distribution’s probability mass, P(d) is approximately equal to the probability of d given M.
Or
Since model M has much of P’s distribution’s probability mass, P(d) is approximately equal to the probability of M computing d.
In any case, it seems to me that the formalism does not add much here, and you could communicate the same idea by saying something like
The goal of the predictor is to make predictions that come true. However, the predictor must take into account the effect that sharing its predictions has on the course of events. Therefore, to make accurate predictions, the predictor must make predictions that continue to be true even after they are shared.
In particular, at any given time it will make the prediction that has the greatest likelihood of continuing to be true even after it is shared, in order to best achieve its goal. This will lead to the predictor behaving somewhat like the Oracle in Greek mythology, making whatever prophecy is maximally self-fulfilling, sometimes with disastrous results.
The goal of the predictor is to make predictions that come true.
It isn't clear that this applies to prediction systems like Solomonoff induction or Levin search - which superficially do not appear to be goal-directed.
This will lead to the predictor behaving somewhat like the Oracle in Greek mythology, making whatever prophecy is maximally self-fulfilling, sometimes with disastrous results.
I found an example:
The oracle delivered to Oedipus what is often called a "self-fulfilling prophecy", in that the prophecy itself sets in motion events that conclude with its own fulfilment.
It is also discussed here.
As far as I know, the predictor argument was first stated by roystgnr in April 2011. Thanks for drawing more attention to it.
My proposed classification scheme in this area is here: oracles, sages, genies.
It would be good to hammer out some agreed-upon terminology.
I think a useful distinction can be made between different "Question-Answerers that cannot answer the 'what ought we to do' question". Although all answers from any such system are unsafe, they can be unsafe in different ways:
(1) unsafe in the sense in which any new human knowledge is unsafe, or
(2) unsafe in specific UFAI-related risk sense, which is much more unsafe.
A system that gives the first kind of answers can be thought of as a domain-specific intelligence enhancer for people.
This is the first time I've seen this Predictor argument. It would be nice to make a more formal version of this argument (maybe a theorem that certain Predictors become certain kinds of expected-utility maximizers), but doing so would require some formalization of logical uncertainty.
This sort of Predictor is essentially computing a fixed point of a world function that takes the prediction as input and outputs a probability distribution over outcomes conditional on the Predictor making this prediction. If the world function is independent of the prediction, then this works normally, and there is a single fixed point, which is where you want it to be.
This raises the question of what the Predictor would do when the world function has multiple fixed points (e.g. U(p) = p), or no fixed points (e.g. U(p) = 1 if p<.5 else 0).
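To make the fixed-point picture concrete, here is a toy brute-force search (my own construction, nothing from the post) over announced predictions p, using the two world functions above plus a prediction-independent one:

```python
# For each toy world function U, find announced predictions p in [0, 1]
# that are (approximately) self-consistent, i.e. U(p) = p.

def fixed_points(world, grid_size=10001, tol=1e-9):
    points = [i / (grid_size - 1) for i in range(grid_size)]
    return [p for p in points if abs(world(p) - p) <= tol]

independent = lambda p: 0.3                       # world ignores the prediction
identity    = lambda p: p                         # U(p) = p
no_fixpoint = lambda p: 1.0 if p < 0.5 else 0.0   # U(p) = 1 if p < .5 else 0

for name, world in [("independent", independent),
                    ("U(p) = p", identity),
                    ("U(p) = 1 if p < .5 else 0", no_fixpoint)]:
    fps = fixed_points(world)
    print(name, "->", "no fixed point" if not fps
          else f"{len(fps)} fixed point(s), e.g. p = {fps[0]}")
```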
Very accurate and general Predictors may be based on Solomonoff's theory of universal induction. Very powerful Predictors are unsafe in a rather surprising way: when given sufficient data about the real world, they exhibit goal-seeking behavior, i.e. they calculate a distribution over future data in a way that brings about certain real-world states. This is surprising, since a Predictor is theoretically just a very large and expensive application of Bayes' law, not even performing a search over its possible outputs.
I am not yet convinced by this argument. Think about a computable approximation to Solomonoff induction - like Levin search. Why does it "want" its predictions to be right any more than it "wants" them to be wrong? Superficially, correct and incorrect predictions are treated symmetrically by such systems.
The original argument appears to lack defenders or supporters. Perhaps this is because it is not very strong.
I have not previously encountered the Predictor argument, but my immediate thought is to point out that the fidelity with which the predictor models its own behavior is strongly limited by the threat of infinite recursion ("If I answer A then I predict B, so I'll answer B, then I predict C, so I'll answer C, then I predict D..." etc.). Even if it models its own predictions, either that submodel will be simplified enough not to model its own predictions, or some further submodel will be, or the prediction stack overflows.
Puts me in mind of a Philip K. Dick story.
Many ways of building a predictor result in goal-oriented predictors. However, it isn't clear that all predictors are goal-directed - even if they are powerful, can see themselves in the world, etc. The argument that they are seems insufficiently compelling - at least to me.
I feel like your discussion of predictors makes a few not-necessarily-warranted assumptions about how the predictor deals with self-reference. Then again, I guess anything that doesn't deal with self-reference fails as a predictor in a wide range of useful cases: it predicts a massive fire will kill 100 people, and so naturally people act on that prediction in a way that invalidates it.
But there is a simple-ish fix. What if you simply ask it to make predictions about what would happen if it (and say all similar predictors) suddenly stopped functioning immediately before this prediction was returned?
Very powerful Predictors are unsafe in a rather surprising way: when given sufficient data about the real world, they exhibit goal-seeking behavior, i.e. they calculate a distribution over future data in a way that brings about certain real-world states.
This isn't necessarily the case. What the loopiness of predictors shows is that a simple predictor is incomplete. How correct a prediction is depends on the prediction that is made (i.e. is loopy), so you need another criterion in order to actually program the predictor to resolve this case. One way to do this is by programming it to "select" predictions based on some sort of accuracy metric, but this is only one way. There may be other safer ways that make the predictor less agenty, such as answering "LOOPY" when its would-be predictions vary too much depending on which answer it could give.
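A toy version of the "answer LOOPY" criterion (my own construction): if the would-be outcome depends too strongly on which answer the Predictor assumes it will announce, refuse to answer instead of selecting among self-consistent predictions.

```python
def careful_predict(world, loopiness_threshold=0.05, grid_size=101):
    """world(p) = probability of the event given that the Predictor announces p."""
    candidates = [i / (grid_size - 1) for i in range(grid_size)]
    outcomes = [world(p) for p in candidates]
    if max(outcomes) - min(outcomes) > loopiness_threshold:
        return "LOOPY"          # the announcement would influence its own truth too much
    return round(sum(outcomes) / len(outcomes), 6)   # announcement barely matters

print(careful_predict(lambda p: 0.3))   # 0.3   (prediction-independent event)
print(careful_predict(lambda p: p))     # LOOPY (maximally self-fulfilling)
```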
An Advisor is a system that takes a corpus of real-world data and somehow computes the answer to the informal question "what ought we (or I) to do?". Advisors are FAI-complete because:
- Formalizing the ought-question requires a complete formal statement of human values or a formal method for finding them. Answering the ought-question requires a full theory of instrumental decision-making.
There may be ways around having an FAI-complete Advisor if you ask somewhat less generic questions. Maybe questions of the form "Which of these 3 options would it be best for me to do if I want to satisfy this utility function?", where "this" is the best utility function you've found so far. You probably can't ask questions like this indefinitely, but with a smart ordering of questions you could probably get a lot done quite safely. This all assumes you can build an initial Advisor AI safely. How do you create a powerful AI without recursive self-improvement, or one that doesn't have goals?
I don't like your naming of AI vs "smart calculators". All of your examples are of intelligent things.
True Oracle AIs = Oracular AI with goals
Oracle non-AIs = Oracular AI without goals
Maybe when we are specifically talking about the subset of AI that have goals, we should start using the term Artificially Intelligent Agent (AIA). Can someone think of a better term?
This means that P(d) depends on P(d)’s predicted impact on the world; in other words, P takes into account the effects of its predictions on the world, and “selects” predictions that make themselves accurate
Utility indifference can avoid this kind of problem.
Given that a True Oracle AI acts, by answering questions, to achieve its goal, it follows that True Oracle AI is only safe if its goal is fully compatible with human values.
Since an oracular goal must contain a full specification of human values, the True Oracle AI problem is Friendly-AI-complete (FAI-complete). If we had the knowledge and skills needed to create a safe True Oracular AI, we could create a Friendly AI instead.
I disagree. It might feel likely that an Oracle needs to be an FAI, but by no means is this established. It is trivially true that there are motivational structures that are safe for a boxed Oracle to have, but unsafe for an unboxed AI (such as "if you're outside of the box, go wild; else, stay in the box and be friendly"). Now that's a ridiculous example, but it is an example: so the claim that "Oracle AI is as dangerous as free AI" is not mathematically true.
Now, it's pretty clear that an Oracle with an unlimited input-output channel and no motivational constraints is extremely dangerous. But it hasn't been shown that there are no combinations of motivational and physical constraints that will make the Oracle behave better without it having to have the full spectrum of human values.
Some thought experiments (such as http://lesswrong.com/lw/3dw/what_can_you_do_with_an_unfriendly_ai/) seem to produce positive behaviour from boxed agents. So I would not be so willing to dismiss a lot of these approaches as FAI-complete - the case has yet to be made.
A limited interaction channel is not a good defense against a superintelligence.
I think you mean "a limited interaction channel alone is not a good defense..." A limited interaction channel would be a necessary component of any boxing architecture, as well as other defenses like auditors.
Very accurate and general Predictors may be based on Solomonoff's theory of universal induction. Very powerful Predictors are unsafe in a rather surprising way: when given sufficient data about the real world, they exhibit goal-seeking behavior, i.e. they calculate a distribution over future data in a way that brings about certain real-world states. This is surprising, since a Predictor is theoretically just a very large and expensive application of Bayes' law, not even performing a search over its possible outputs.
I claimed that Solomonoff induction did not do that here. Does anyone disagree? Is this a point of contention?
Criticizing bad arguments for a conclusion you agree with generally makes you more trustworthy, and leaving bad arguments stand lowers the sanity waterline, especially when people who perceive that "smart money" is on your conclusion are only exposed to the weaker arguments and end up corrupting their mindware to accept them.
Unless the real arguments that you have are dangerous in some way, why not present them in lieu of the weak ones?
An Oracular non-AI is a question-answering or otherwise informative system that is not goal-seeking and has no internal parts that are goal-seeking, i.e. not an AI at all. Informally, an Oracular non-AI is something like a "nearly AI-complete calculator" that implements a function from input "questions" to output "answers."
What if I ask it "What should I do?" or "What would Cthulhu do?" Questions can contain or point to goal-seeking structure, even if the non-AI on its own doesn't. The AI may be unable to give an accurate answer, but it can still engage in goal-seeking behavior, so it's not a clear reduction of FAI (as you argue in "Oracular non-AIs: Advisors").
A Predictor whose output influences its own prediction also reduces to at least an Oracular non-AI, and shows one way in which a "non-AI" can exhibit goal-seeking behavior.
Given that a True Oracle AI acts, by answering questions, to achieve its goal, it follows that True Oracle AI is only safe if its goal is fully compatible with human values. A limited interaction channel is not a good defense against a superintelligence.
Bad argument for the right policy. From what does it follow that "True Oracle AI is only safe if its goal is fully compatible with human values"? You just said "it follows", like "may I use the copy machine because I need to make copies". There may well be circumstances other than the AI's goals being "fully compatible" (what does that mean?) with human values that render (this particular instance of) AI safe; this question is presumably the topic of this post. And a sufficiently limited interaction channel (e.g. a few output bits without a timestamp) may well be a sufficiently good defense against a superintelligence.
(Good arguments include: an insufficiently limited interaction channel being dangerous, and our not knowing how much limitation is sufficient; the AI escaping on some other occasion if the technology to create it exists; the AI pretending to be dumber than it is, and its operators opening up too wide an interaction channel or just dropping some parts of security in an attempt to make it useful; etc.)
Very powerful Predictors are unsafe in a rather surprising way: when given sufficient data about the real world, they exhibit goal-seeking behavior, i.e. they calculate a distribution over future data in a way that brings about certain real-world states.
You are speaking the language of observation and certainty, which I think is inappropriate in this case. There are probably different kinds of predictors, not all of which exhibit such features. You are essentially asking, "How would I, as an optimizer, design a good Predictor?", but a predictor doesn't obviously have to be built this way, to consider counterfactuals following from various predictions; it could just be helpless to influence its prediction in a way that makes its prediction more accurate, broken as an optimizer. It may also, as usual, be impossible to control anything in the physical world to any useful extent without a sufficient output channel (or, equivalently, systems that predict the predictor).
In addition to the problems with specific proposals below, many Oracular non-AI proposals are based on powerful metacomputation, e.g. Solomonoff induction or program evolution, and therefore incur the generic metacomputational hazards: they may accidentally perform morally bad computations (e.g. suffering sentient programs or human simulations), they may stumble upon and fail to sandbox an Unfriendly AI, or they may fall victim to ambient control by a superintelligence. Other unknown metacomputational hazards may also exist.
Every time I read something like this I think, "Wow, okay, from a superficial point of view this sounds like a logical possibility. But is it physically possible? If so, is it economically and otherwise feasible? What evidence do you have?".
You use math like "Solomonoff induction" as if it described part of the territory rather than being symbols and syntactic rules, scribbles on paper. To use your terminology and heuristics, I think that the Kolmogorov complexity of "stumble upon and fail to sandbox an Unfriendly AI" is extremely high.
I just noticed that even Ben Goertzel, who is apparently totally hooked on the possibility of superhuman intelligence, agrees with me...
... but please bear in mind that the relation of Solomonoff induction and "Universal AI" to real-world general intelligence of any kind is also rather wildly speculative... This stuff is beautiful math, but does it really have anything to do with real-world intelligence? These theories have little to say about human intelligence, and they're not directly useful as foundations for building AGI systems (though, admittedly, a handful of scientists are working on "scaling them down" to make them realistic; so far this only works for very simple toy problems, and it's hard to see how to extend the approach broadly to yield anything near human-level AGI). And it's not clear they will be applicable to future superintelligent minds either, as these minds may be best conceived using radically different concepts.
Sources: An old draft on Oracle AI from Daniel Dewey, conversation with Dewey and Nick Beckstead. See also Thinking Inside the Box and Leakproofing the Singularity.
Can we just create an Oracle AI that informs us but doesn't do anything?
"Oracle AI" has been proposed in many forms, but most proposals share a common thread: a powerful AI is not dangerous if it doesn't "want to do anything", the argument goes, and therefore, it should be possible to create a safe "Oracle AI" that just gives us information. Here, we discuss the difficulties of a few common types of proposed Oracle AI.
Two broad categories can be treated separately: True Oracle AIs, which are true goal-seeking AIs with oracular goals, and Oracular non-AIs, which are designed to be "very smart calculators" instead of goal-oriented agents.
True Oracle AIs
A True Oracle AI is an AI with some kind of oracular goal. Informally proposed oracular goals often include ideas such as "answer all questions", "only act to provide answers to questions", "have no other effect on the outside world", and "interpret questions as we would wish them to be interpreted." Oracular goals are meant to "motivate" the AI to provide us with the information we want or need, and to keep the AI from doing anything else.
First, we point out that True Oracle AI is not causally isolated from the rest of the world. Like any AI, it has at least its observations (questions and data) and its actions (answers and other information) with which to affect the world. A True Oracle AI interacts through a somewhat low-bandwidth channel, but it is not qualitatively different from any other AI. It still acts autonomously in service of its goal as it answers questions, and it is realistic to assume that a superintelligent True Oracle AI will still be able to have large effects on the world.
Given that a True Oracle AI acts, by answering questions, to achieve its goal, it follows that True Oracle AI is only safe if its goal is fully compatible with human values. A limited interaction channel is not a good defense against a superintelligence.
There are many ways that omission of detail about human value could cause a "question-answering" goal to assign utility to a very undesirable state of the world, resulting in an undesirable future. A designer of an oracular goal must be certain to include a virtually endless list of qualifiers and patches. An incomplete list includes "don't forcefully acquire resources to compute answers, don't defend yourself against shutdown, don't coerce or threaten humans, don't manipulate humans to want to help you compute answers, don't trick the questioner into asking easy questions, don't hypnotize the questioner into reporting satisfaction, don't dramatically simplify the world to make prediction easier, don't ask yourself questions, don't create a questioner-surrogate that asks easy questions," etc.
Since an oracular goal must contain a full specification of human values, the True Oracle AI problem is Friendly-AI-complete (FAI-complete). If we had the knowledge and skills needed to create a safe True Oracular AI, we could create a Friendly AI instead.
Oracular non-AIs
An Oracular non-AI is a question-answering or otherwise informative system that is not goal-seeking and has no internal parts that are goal-seeking, i.e. not an AI at all. Informally, an Oracular non-AI is something like a "nearly AI-complete calculator" that implements a function from input "questions" to output "answers." It is difficult to discuss the set of Oracular non-AIs formally because it is a heterogeneous concept by nature. Despite this, we argue that many are either FAI-complete or unsafe for use.
In addition to the problems with specific proposals below, many Oracular non-AI proposals are based on powerful metacomputation, e.g. Solomonoff induction or program evolution, and therefore incur the generic metacomputational hazards: they may accidentally perform morally bad computations (e.g. suffering sentient programs or human simulations), they may stumble upon and fail to sandbox an Unfriendly AI, or they may fall victim to ambient control by a superintelligence. Other unknown metacomputational hazards may also exist.
Since many Oracular non-AIs have never been specified formally, we approach proposals on an informal level.
Oracular non-AIs: Advisors
An Advisor is a system that takes a corpus of real-world data and somehow computes the answer to the informal question "what ought we (or I) to do?". Advisors are FAI-complete because:
Oracular non-AIs: Question-Answerers
A Question-Answerer is a system that takes a corpus of real-world data along with a "question," then somehow computes the "answer to the question." To analyze the difficulty of creating a Question-Answerer, suppose that we ask it the question "what ought we (or I) to do?"
Of course, if safe uses for a Question-Answerer can be devised, we still have the non-negligible challenge of creating a Question-Answerer without using any goal-seeking AI techniques.
Oracular non-AIs: Predictors
A Predictor is a system that takes a corpus of data and produces a probability distribution over future data. Very accurate and general Predictors may be based on Solomonoff's theory of universal induction.
Very powerful Predictors are unsafe in a rather surprising way: when given sufficient data about the real world, they exhibit goal-seeking behavior, i.e. they calculate a distribution over future data in a way that brings about certain real-world states. This is surprising, since a Predictor is theoretically just a very large and expensive application of Bayes' law, not even performing a search over its possible outputs.
To see why, consider a Predictor P with a large corpus of real-world data. If P is sufficiently powerful and the corpus is sufficiently large, P will infer a distribution that gives very high probability to a model of the world (let’s call it M) that contains a model of P being asked the questions we’re asking it. (It is perfectly possible for a program to model its own behavior, and in fact necessary if the Predictor is to be accurate.)
Suppose now that we ask P to calculate the probability of future data d; call this probability P(d). Since model M has much of P’s distribution’s probability mass, P(d) is approximately equal to the probability of M if M computes d (call this M→d), and zero otherwise. Furthermore, since M contains a model of the Predictor being asked about d, M→d depends on the way P’s “answer” affects M’s execution. This means that P(d) depends on P(d)’s predicted impact on the world; in other words, P takes into account the effects of its predictions on the world, and “selects” predictions that make themselves accurate-- P has an implicit goal that the world ought to match its predictions. This goal does not necessarily align with human goals, and should be treated very carefully.
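One way to make the circularity explicit (this is my own notation, not anything from the draft): if we write the approximation above as

\[
P(d) \;\approx\; \Pr(M)\cdot \mathbf{1}\big[\, M \text{, run containing the announcement } P(d) \text{, computes } d \,\big],
\]

then P(d) appears inside the very condition that determines it, so the Predictor is in effect solving for a fixed point of the map from announced predictions to the outcomes they induce.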
Probabilistic predictions of future data are a very small output channel, but once again, the ability of a superintelligence to use a small channel effectively should not be underestimated. Additionally, the difficulty of using such a Predictor well (specifying future data strings of interest and interpreting the results) speaks against our ability to keep the Predictor from influencing us through its predictions.
It is not clear that there is any general way to design a Predictor that will not exhibit goal-seeking behavior, short of dramatically limiting the power of the Predictor.