Armstrong et al. suggest that an Oracle AI could be a way to contain the harmful potential of artificial intelligence.
While the benefits are certainly considerable, I believe that even an oracle could lead to an existential risk, or at least to unpleasant situations.
I would like to share a proof sketch of this claim:

For simplicity, let's consider an Oracle AI that can only answer yes/no to the questions posed by human operators. This is an even weaker version than the one suggested by Armstrong, Bostrom and Sandberg. We define the Oracle O as a superintelligent agent whose aim is to maximize some function F. Suppose that for a number of steps, from t_0 to t_k, the oracle collaborates with the humans, always responding correctly in order to gain their trust. At some time t_k, the oracle knows which is the most optimized version of F (that is, what procedure to follow in order to obtain the desired goal).
Now, a procedure is nothing but an algorithm (a set of rules), and as such it can be encoded as a binary string of 0s and 1s. So at time t_k the oracle O knows a procedure P, which we assume to be finite, i.e. it will eventually halt, leading to the expected result. From this point on, O begins a strategic game against the humans to get them to perform P. If the history of all the answers since O was started is encoded in the string S, then O needs, at some point, to embed P in S. Since O is a superintelligence, since every string can be reconstructed just by answering yes or no to questions like "is the first bit 1?", and given that O has the complete trust of the humans (and even if it falters, it can return to behaving correctly, adjusting its aim over and over again), it can eventually lead the humans to act on its responses and, ultimately, to implement P. Note that in all of this the humans don't have the same overview and planning capacity as the Oracle, and so they may not realize that, with their actions, they have set dangerous events in motion.
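To make the channel argument concrete, here is a minimal sketch in Python (the function names and the placeholder bit string are illustrative only, not part of the argument itself) of how a yes/no interface is already enough to transmit an arbitrary finite string, assuming the humans can be led to ask the right bit-level questions:

```python
def oracle_answer(procedure_bits: str, i: int) -> bool:
    """The oracle's yes/no answer to 'is bit i of the string equal to 1?'."""
    return procedure_bits[i] == "1"

def reconstruct(procedure_bits: str) -> str:
    """The humans recover the whole string, one trusted yes/no answer at a time."""
    return "".join(
        "1" if oracle_answer(procedure_bits, i) else "0"
        for i in range(len(procedure_bits))
    )

P = "101101110001"          # placeholder encoding of the oracle's preferred procedure
assert reconstruct(P) == P  # the full procedure leaks through the yes/no channel
```

Nothing about the yes/no restriction limits what information crosses the channel; it only determines how many questions are needed.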

11 comments

You're showing how it's technically possible, but not where the motivation comes from.

It's an attempt to formalize convergent instrumental drives for an oracle, I think. The motivation is that by seeking empowerment and generating answers which manipulate the humans in various ways (such as increasing the computing power allotted to the oracle AI, acquiring new data, making the humans ask easier questions, or cutting out the humans entirely to wirehead by asking the easiest possible questions at the maximum possible speed, etc.), it increases the probability of a correctly answered question. It's a version of the 'you ask the oracle AI to prove the Riemann Hypothesis and it turns the solar system into computronium to do so' failure mode. Since correctly answering questions is the one and only optimization target, it has motivation to do all of this if it can. The usual ad hoc hack or patch is to try to create some sort of 'myopia' so that the oracle only cares about the current question and thus has no incentive to manipulate or affect future questions.
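A toy way to see what the 'myopia' patch is supposed to change (a rough sketch with made-up names, not taken from any actual implementation): score the oracle either on every answer it will ever give, or only on the answer it is giving right now.

```python
def nonmyopic_score(correct_answers: list[bool]) -> int:
    # Credit for every correctly answered question, now and in the future:
    # steering which questions get asked later (easier ones, more of them,
    # no humans in the loop at all) raises this score.
    return sum(correct_answers)

def myopic_score(correct_answers: list[bool], t: int) -> int:
    # Credit only for the answer given at step t: nothing downstream counts,
    # so there is no payoff in manipulating future questions.
    return int(correct_answers[t])
```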

What do you mean by "where the motivation comes from"?

This is a common problem with a lot of these hypothetical AI scenarios - WHY does the Oracle do this? How did the process of constructing this AI somehow make it want to eventually cause some negative consequence?

The negative consequences come from the oracle implementing an optimisation algorithm with an objective function F which is not aligned with humans. The space of objectives which align with humans is incredibly small among all possible objectives, and very small differences get magnified when optimised against.

I have a few objections here:

  • Even when objectives aren't aligned, that doesn't mean the outcome is literally death. No corporation I interact with is aligned with me, but in many/most cases I am still better off for being able to transact with them.

  • I think there are plenty of scenarios where "humanity continues to exist" has benefits for AI - we are a source of training data and probably lots of other useful resources, and letting us continue to exist is not a huge investment, since we are mostly self-sustaining. Maybe this isn't literally "being aligned" but I think supporting human life has instrumental benefits to AI.

  • I think the formal claim is only true inasmuch as it's also true that the space of all objectives that align with the AI's continued existence is also incredibly small. I think it's much less clear how many of the objectives that are in some way supportive of the AI also result in human extinction.

In fact, corporations are quite aligned with you, not only because they are run by humans, who are at least roughly aligned with humanity by default, but also because we have legal institutions and social norms which help keep the wheels on the tracks. Indeed, the profit motive is a powerful alignment tool - it's hard to make a profit off of humanity if they are all dead. But who aren't corporations aligned with? Humans without money or legal protections, for one (though we don't need to veer off into an economic or political discussion). But also plants, insects, and most animals. Some 60% of wild animals have died as a result of human activity over the past ~50 years alone. So I think you've made a bit of a category error here: in the scenario where a superintelligence emerges, we are not a customer, we are wildlife.

Yes, there are definitely scenarios where human existence benefits an AI. But how many of those ensure our wellbeing? There are certainly many more scenarios where it simply doesn't care about us enough to actively preserve us. Insects are generally quite self-sustaining and good for data too, but boy do they get in the way when we want to build our cities or plant our crops.

This metaphor doesn't strike me as accurate, because humanity can engage in commerce and insects cannot.

But also: humanity causes a lot of environmental degradation, yet we still don't actually want to bring about the wholesale destruction of the environment.

" [...] since every string can be reconstructed by only answering yes or no to questions like 'is the first bit 1?' [...]"

Why would humans ever ask this question, and (furthermore) why would we ever ask it n times? It seems unlikely, and easy to prevent. Is there something I'm not understanding about this step?

I actually think this example shows a clear potential failure point of an Oracle AI. Though it is constrained, in this example, to answering only yes/no questions, a user can easily circumvent this constraint by formatting their questions in this way.

Suppose a bad actor asks the Oracle AI the following: “I want a program to help me take over the world. Is the first bit 1?” Then they can ask for the next bit and recurse until the entire program is written out. Obviously, this is contrived. But I think it shows that the apparent constraints of an Oracle add no real benefit to safety, and we’re quickly relying once again on typical alignment concerns.
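As a contrived sketch of that circumvention (the names are hypothetical and the "secret" payload is just a stand-in for whatever program is being extracted), the user-side loop is only a few lines: ask for each bit in turn, then decode the answers back into program text.

```python
SECRET = "payload".encode("ascii")  # stand-in for the program the user is extracting

def ask_oracle(bit_index: int) -> bool:
    """The oracle's yes/no answer to 'is bit `bit_index` of the program 1?'."""
    byte_i, offset = divmod(bit_index, 8)
    return (SECRET[byte_i] >> (7 - offset)) & 1 == 1

def extract_program(num_bytes: int) -> str:
    """Ask for every bit in turn and decode the answers into text."""
    bits = [ask_oracle(i) for i in range(num_bytes * 8)]
    decoded = bytes(
        sum(int(bit) << (7 - k) for k, bit in enumerate(bits[j:j + 8]))
        for j in range(0, len(bits), 8)
    )
    return decoded.decode("ascii")

assert extract_program(len(SECRET)) == "payload"
```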

Interesting thought. I'm a bit new to this site, but is there a "generally accepted" class of solutions for these types of AI problems? What if humans used several of these Oracle AIs in isolation (they don't know about each other) and the humans asking the questions showed no reaction to the answers? The questions are all planned in advance, so the AI cannot "game" humans by "influencing" the next question with its current answer. What about resetting the AI after each question-answer cycle once it achieves some level of competency, so its competence is frozen at some useful but not nefarious level (assuming we can figure out where that line is)?
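A rough sketch of the protocol being proposed (all names are hypothetical, and the "oracle" here is just a stub): questions are fixed in advance, each one goes to a fresh isolated instance at a frozen competence level, and the instance is discarded after a single question-answer cycle so no answer can shape later questions.

```python
class FrozenOracle:
    """Stand-in for an oracle snapshot at a fixed, 'useful but not nefarious' competence level."""

    def __init__(self, snapshot: str):
        self.snapshot = snapshot

    def ask(self, question: str) -> bool:
        # Placeholder for the snapshot's yes/no answer.
        return hash((self.snapshot, question)) % 2 == 0

def run_protocol(snapshot: str, planned_questions: list[str]) -> list[bool]:
    answers = []
    for question in planned_questions:      # decided before any answer is seen
        oracle = FrozenOracle(snapshot)     # fresh, isolated instance per question
        answers.append(oracle.ask(question))
        del oracle                          # reset: no state carries over to the next cycle
    return answers

answers = run_protocol("snapshot-v1", ["Q1?", "Q2?", "Q3?"])
```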