Reduced impact AI: no back channels
A putative new idea for AI control; index here.
This post presents a further development of the reduced impact AI approach, bringing in some novel ideas and setups that allow us to accomplish more. It still isn't a complete approach - further development is needed, which I will do when I return to the concept - but may already allow certain types of otherwise dangerous AIs to be made safe. And this time, without needing to encase them in clouds of chaotic anti-matter!
Specifically, consider the following scenario. A comet is heading towards Earth, and it is generally agreed that a collision is suboptimal for everyone involved. Human governments have come together in peace and harmony to build a giant laser on the moon - this could be used to vaporise the approaching comet, except there isn't enough data to aim it precisely. A superintelligent AI programmed with a naive "save all humans" utility function is asked to furnish the coordinates to aim the laser. The AI is mobile and not contained in any serious way. Yet the AI furnishes the coordinates - and nothing else - and then turns itself off completely, not optimising anything else.
The rest of this post details an approach that might make that scenario possible. It is slightly complex: I haven't found a way of making it simpler. Most of the complication comes from attempts to precisely define the needed counterfactuals. We're trying to bring rigour to inherently un-sharp ideas, so some complexity is, alas, needed. I will try to lay out the ideas with as much clarity as possible - first the ideas to constrain the AI, then ideas as to how to get some useful work out of it anyway. Classical mechanics (general relativity) will be assumed throughout. As in a previous post, the approach will be illustrated by a drawing of unsurpassable elegance; the rest of the post will aim to clarify everything in the picture:
Domesticating reduced impact AIs
About a year ago, I posted several ideas for "reduced impact AI" (what Nick Bostrom calls "domesticity"). I think the most promising approach was the third one, which I pompously titled "The information in the evidence". In this post, I'll attempt to put together a (non-realistic) example of this, to see if it's solid enough to build on. I'll be highlighting assumptions I'm making about the AI; please point out any implicit assumption that I missed, and any other weaknesses of the setup. For the moment, I'm more interested in "this doesn't work" than "this can't be done in practice" or "this can't be usefully generalised".
EDIT: It wasn't clear here, but any paperclip constructed by the reduced impact AI would be destroyed in the explosion, and the AIs would not be observed during the process. How to get useful work out of the AI will be the next step, if this model holds up.
Intuitive idea
For a reduced impact AI, we want an AI that can accomplish something, say building a paperclip, without it going out of control and optimising the universe. We want the future to be roughly the same whether or not the AI was turned on. Hence the piece of information "the AI was turned on" is not particularly important - if we didn't know, we wouldn't go far wrong in our predictions.
To enforce this we'll equip the AI with a two-piece motivation: a utility function U (causing it to build paperclips) and a penalty function R (which penalises the AI if its actions have a large future 'impact'). The challenge is to have a setup and a definition of R that implements this intuitive idea.
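As a toy illustration of the two-piece motivation, here is a minimal sketch. Everything in it is an assumption made for the example, not part of the original setup: the Action class, the numeric "impact" values, the weight parameter, and in particular the choice to combine the two pieces as U minus a weighted R. The excerpt only says the AI has a utility function U and a penalty function R.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    paperclips: float  # toy quantity that U rewards
    impact: float      # toy stand-in for "future impact", which R penalises


def U(action: Action) -> float:
    """Utility: here simply the number of paperclips built (an assumption)."""
    return action.paperclips


def R(action: Action) -> float:
    """Penalty: here simply the toy impact number (an assumption)."""
    return action.impact


def motivation(action: Action, weight: float = 1.0) -> float:
    """Combined score U - weight * R; the subtraction is an assumption."""
    return U(action) - weight * R(action)


# Hypothetical menu of actions with made-up numbers.
actions = [
    Action("do nothing", paperclips=0.0, impact=0.0),
    Action("build one paperclip", paperclips=1.0, impact=0.1),
    Action("tile the universe", paperclips=1e6, impact=1e9),
]

best = max(actions, key=motivation)
print(best.name)
```

With these made-up numbers the penalty swamps the utility of the runaway option, so the modest action wins. The hard part, as the text says, is defining an R that actually measures impact; the sketch simply assumes one is given.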
Reduced impact
Can we hope to define reduced impact? Doesn't any change percolate across all future time? Yes, but not all changes are the same - especially where our ability to predict them is concerned. Imagine an AI, confined to a box. While inside the box, it will build a paperclip. Then, in one hour's time, an antimatter bomb inside the box will explode, obliterating everything inside. We would intuitively say that whether or not the paperclip was built would have very little impact afterwards. What we really mean is that we are unable to usefully use that information for future predictions. To make it even harder, surround the box with a large mass of antimatter gas, moving chaotically. In order to figure out the consequences of anything happening inside the box, we'd have to be able to unravel chaos: solve chaotic equations correctly and quantitatively, not just qualitatively. So our first assumption is that the AI can't unravel chaos.
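To give a feel for why quantitative prediction through chaos is so demanding, here is a small sketch using the logistic map, a standard textbook chaotic system. It is a stand-in chosen for the example, not the antimatter gas of the thought experiment; the function name logistic_orbit, the starting point 0.2 and the perturbation size 1e-12 are all illustrative assumptions.

```python
def logistic_orbit(x0: float, steps: int, r: float = 4.0) -> list[float]:
    """Iterate x -> r * x * (1 - x), returning the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two starting states that differ by one part in a trillion.
a = logistic_orbit(0.2, 100)
b = logistic_orbit(0.2 + 1e-12, 100)

# Gap between the two trajectories at each step.
gaps = [abs(x - y) for x, y in zip(a, b)]
print(gaps[0], max(gaps))
```

The initial gap is far below anything one could measure, yet within a few dozen iterations the two trajectories bear no useful relation to each other. Knowing the qualitative behaviour of the system is easy; recovering what happened "inside the box" from the later state would require tracking errors at this scale exactly.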
The mathematics of reduced impact: help needed
A putative new idea for AI control; index here.
Thanks for help from Paul Christiano
If clippy, the paper-clip maximising AI, goes out of control, it would fill the universe with paper clips (or with better and better ways of counting the paper-clips it already has). If I sit down to a game with Deep Blue, then I know little about what will happen in the game, but I know it will end with me losing.
When facing a (general or narrow) superintelligent AI, the most relevant piece of information is what the AI's goals are. That's the general problem: there is no such thing as 'reduced impact' for such an AI. It doesn't matter who the next president of the United States is, if an AI wants to tile the universe with little smiley faces. But reduced impact is something we would dearly want to have - it gives us time to correct errors, perfect security systems, maybe even bootstrap our way to friendly AI from a non-friendly initial design. The most obvious path to coding reduced impact is to build a satisficer rather than a maximiser - but that proved unlikely to work.
But that ruthless maximising aspect of AIs may give us a way of quantifying 'reduced impact' - and hence including it in AI design. The central point being:
"When facing a (non-reduced impact) superintelligent AI, the AI's motivation is the most important fact we know."
Hence, conversely:
"If an AI has reduced impact, then knowing its motivation isn't particularly important. And a counterfactual world where the AI didn't exist would not be very different from the one in which it does."
In this post, I'll be presenting some potential paths to formalising this intuition into something computable, giving us a numerical measure of impact that can be included in the AI's motivation to push it towards reduced impact. I'm putting this post up mainly to get help: does anyone know of already developed mathematical or computational tools that can be used to put these approaches on a rigorous footing?
Completeness, incompleteness, and what it all means: first versus second order logic
First order arithmetic is incomplete. Except that it's also complete. Second order arithmetic is more expressive - except when it's not - and is also incomplete and also complete, except when it means something different. Oh, and full second order logic might not really be a logic at all. But then, first order logic has no idea what the reals and natural numbers are, especially when it tries to talk about them.
That was about the state of my confusion, and I set out to try and clear it up. Here I'll try and share an understanding of what is really going on with first and second order logic and why they differ so radically. It will be deliberately informal, so I won't be distinguishing between functions, predicates and subsets, and will be using little notation. It'll be exactly what I wish someone had told me before I started looking into the whole field.
Meaningful Models
An old man starts talking to you about addition, subtraction and multiplication, and how they interact. You assume he is talking about the integers; it turns out he means the rational numbers. The integers and the rationals are both models of addition, subtraction and multiplication, in that they obey all the properties that the old man set out. But notice that, though he had the rationals in mind, he didn't mention them at all: he just listed the properties, and the rational numbers turned out, very non-coincidentally, to obey them.
These models are generally taken to give meaning to the abstract symbols in the axioms - to give semantics to the syntax. In this view, "for all x,y xy=yx" is a series of elegant squiggles, but once we have the model of the integers (or the rationals) in mind, we realise that this means that multiplication is commutative.
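The idea that one list of properties can have several models can be sketched in code. The three properties below are illustrative guesses at the kind of thing the old man might list; the excerpt does not say which ones he gave. Checking them on a finite sample is only a spot check, not a proof that the full integers or rationals satisfy them.

```python
from fractions import Fraction
from itertools import product

# Hypothetical properties, written purely in terms of +, - and *.
properties = {
    "multiplication commutes": lambda x, y, z: x * y == y * x,
    "multiplication distributes over addition":
        lambda x, y, z: x * (y + z) == x * y + x * z,
    "subtraction undoes addition": lambda x, y, z: (x + y) - y == x,
}


def check(sample):
    """Report, for each property, whether it holds on every triple in sample."""
    triples = list(product(sample, repeat=3))
    return {name: all(p(x, y, z) for x, y, z in triples)
            for name, p in properties.items()}


integers = list(range(-3, 4))
rationals = [Fraction(p, q) for p in range(-3, 4) for q in range(1, 4)]

int_results = check(integers)
rat_results = check(rationals)
print(int_results)
print(rat_results)
```

The properties never mention which kind of number they are about. The same squiggles are simply evaluated in two different structures, and both pass, which is the sense in which both are models.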
Recommended Reading for Friendly AI Research
This post enumerates texts that I consider (potentially) useful training for making progress on Friendly AI/decision theory/metaethics.
The two meanings of mathematical terms
[edit: sorry, the formatting of links and italics in this is all screwy. I've tried editing both the rich-text and the HTML and either way it looks ok while I'm editing it, but the formatted terms come out with either no surrounding spaces or two surrounding spaces.]
In the latest Rationality Quotes thread, CronoDAS quoted Paul Graham:
It would not be a bad definition of math to call it the study of terms that have precise meanings.