Introducing Corrigibility (an FAI research subfield)
Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity Institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:
As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system "corrigible" if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.
On Caring
This is an essay describing some of my motivation to be an effective altruist. It is crossposted from my blog. Many of the ideas here are quite similar to others found in the sequences. I have a slightly different take, and after adjusting for the typical mind fallacy I expect that this post may contain insights that are new to many.
1
I'm not very good at feeling the size of large numbers. Once you start tossing around numbers larger than 1000 (or maybe even 100), the numbers just seem "big".
Consider Sirius, the brightest star in the night sky. If you told me that Sirius is as big as a million Earths, I would feel like that's a lot of Earths. If, instead, you told me that you could fit a billion Earths inside Sirius… I would still just feel like that's a lot of Earths.
The feelings are almost identical. In context, my brain grudgingly admits that a billion is a lot larger than a million, and puts forth a token effort to feel like a billion-Earth-sized star is bigger than a million-Earth-sized star. But out of context — if I wasn't anchored at "a million" when I heard "a billion" — both these numbers just feel vaguely large.
I feel a little respect for the bigness of numbers, if you pick really really large numbers. If you say "one followed by a hundred zeroes", then this feels a lot bigger than a billion. But it certainly doesn't feel (in my gut) like it's 10 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 times bigger than a billion. Not in the way that four apples internally feels like twice as many as two apples. My brain can't even begin to wrap itself around this sort of magnitude differential.
This phenomenon is related to scope insensitivity, and it's important to me because I live in a world where sometimes the things I care about are really really numerous.
For example, billions of people live in squalor, with hundreds of millions of them deprived of basic needs and/or dying from disease. And though most of them are out of my sight, I still care about them.
The loss of a human life with all its joys and all its sorrows is tragic no matter what the cause, and the tragedy is not reduced simply because I was far away, or because I did not know of it, or because I did not know how to help, or because I was not personally responsible.
Knowing this, I care about every single individual on this planet. The problem is, my brain is simply incapable of taking the amount of caring I feel for a single person and scaling it up by a billion times. I lack the internal capacity to feel that much. My care-o-meter simply doesn't go up that far.
And this is a problem.
Newcomblike problems are the norm
This is crossposted from my blog. In this post, I discuss how Newcomblike situations are common among humans in the real world. The intended audience of my blog is wider than the readerbase of LW, so the tone might seem a bit off. Nevertheless, the points made here are likely new to many.
1
Last time we looked at Newcomblike problems, which cause trouble for Causal Decision Theory (CDT), the standard decision theory used in economics, statistics, narrow AI, and many other academic fields.
These Newcomblike problems may seem like strange edge case scenarios. In the Token Trade, a deterministic agent faces a perfect copy of themself, guaranteed to take the same action as they do. In Newcomb's original problem there is a perfect predictor Ω which knows exactly what the agent will do.
Both of these examples involve some form of "mind-reading" and assume that the agent can be perfectly copied or perfectly predicted. In a chaotic universe, these scenarios may seem unrealistic and even downright crazy. What does it matter that CDT fails when there are perfect mind-readers? There aren't perfect mind-readers. Why do we care?
The reason that we care is this: Newcomblike problems are the norm. Most problems that humans face in real life are "Newcomblike".
These problems aren't limited to the domain of perfect mind-readers; rather, problems with perfect mind-readers are the domain where these problems are easiest to see. However, they arise naturally whenever an agent is in a situation where others have knowledge about its decision process via some mechanism that is not under its direct control.
An introduction to Newcomblike problems
This is crossposted from my new blog, following up on my previous post. It introduces the original "Newcomb's problem" and discusses the motivation behind two-boxing and the reasons why CDT fails. The content is probably review for most LessWrongers; later posts in the sequence may be of more interest.
Last time I introduced causal decision theory (CDT) and showed how it has unsatisfactory behavior on "Newcomblike problems". Today, we'll explore Newcomblike problems in a bit more depth, starting with William Newcomb's original problem.
The Problem
Once upon a time there was a strange alien named Ω who is very very good at predicting humans. There is this one game that Ω likes to play with humans, and Ω has played it thousands of times without ever making a mistake. The game works as follows:
First, Ω observes the human for a while and collects lots of information about the human. Then, Ω makes a decision based on how Ω predicts the human will react in the upcoming game. Finally, Ω presents the human with two boxes.
The first box is blue, transparent, and contains $1000. The second box is red and opaque.
"You may take either the red box alone, or both boxes," Ω informs the human. (These are magical boxes where if you decide to take only the red one then the blue one, and the $1000 within, will disappear.)
"If I predicted that you would take only the red box, then I filled it with $1,000,000. Otherwise, I left it empty. I have already made my choice," Ω concludes, before turning around and walking away.
You may take either only the red box, or both boxes. (If you try something clever, like taking the red box while a friend takes a blue box, then the red box is filled with hornets. Lots and lots of hornets.) What do you do?
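To make the stakes concrete, here is a minimal sketch of the two ways of scoring the choice. The `accuracy` parameter and both function names are assumptions of this sketch, not part of the problem as stated, where Ω has simply never been seen to make a mistake. The first function treats your choice as evidence about what Ω predicted; the second holds the contents of the red box fixed, which is roughly the reasoning that motivates two-boxing.

```python
from fractions import Fraction

SMALL_PRIZE = 1_000       # always inside the transparent blue box
BIG_PRIZE = 1_000_000     # inside the red box only if Ω predicted one-boxing


def expected_payoff(one_box: bool, accuracy: Fraction) -> Fraction:
    """Expected dollars if Ω predicts this agent correctly with probability `accuracy`.

    `accuracy` is a hypothetical knob added for illustration.
    """
    if one_box:
        # The red box is full exactly when Ω correctly foresaw one-boxing.
        return accuracy * BIG_PRIZE
    # A two-boxer always gets the $1000, plus the big prize if Ω guessed wrong.
    return SMALL_PRIZE + (1 - accuracy) * BIG_PRIZE


def payoff_given_contents(one_box: bool, red_box_full: bool) -> int:
    """Dollars received when the contents of the red box are held fixed."""
    red = BIG_PRIZE if red_box_full else 0
    return red if one_box else red + SMALL_PRIZE


if __name__ == "__main__":
    for accuracy in (Fraction(1), Fraction(99, 100), Fraction(1001, 2000)):
        print(
            f"accuracy={float(accuracy):.4f}",
            f"one-box={float(expected_payoff(True, accuracy)):,.0f}",
            f"two-box={float(expected_payoff(False, accuracy)):,.0f}",
        )
```

With the contents held fixed, two-boxing is always exactly $1000 ahead. Scored by expected payoff, one-boxing comes out ahead whenever the assumed accuracy exceeds 0.5005.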
Causal decision theory is unsatisfactory
This is crossposted from my new blog. I was planning to write a short post explaining how Newcomblike problems are the norm and why any sufficiently powerful intelligence built to use causal decision theory would self-modify to stop using causal decision theory in short order. Turns out it's not such a short topic, and it's turning into a short intro to decision theory.
I've been motivating MIRI's technical agenda (decision theory and otherwise) to outsiders quite frequently recently, and I received a few comments of the form "Oh cool, I've seen lots of decision theory type stuff on LessWrong, but I hadn't understood the connection." While the intended audience of my blog is wider than the readerbase of LW (and thus, the tone might seem off and the content a bit basic), I've updated towards these posts being useful here. I also hope that some of you will correct my mistakes!
This sequence will probably run for four or five posts, during which I'll motivate the use of decision theory, the problems with the modern standard of decision theory (CDT), and some of the reasons why these problems are an FAI concern.
I'll be giving a talk on the material from this sequence at Purdue next week.
1
Choice is a crucial component of reasoning. Given a set of available actions, which action do you take? Do you go out to the movies or stay in with a book? Do you capture the bishop or fork the king? Somehow, we must reason about our options and choose the best one.
Of course, we humans don't consciously weigh all of our actions. Many of our choices are made subconsciously. (Which letter will I type next? When will I get a drink of water?) Yet even if the choices are made by subconscious heuristics, they must be made somehow.
In practice, decisions are often made on autopilot. We don't weigh every available alternative when it's time to prepare for work in the morning; we just pattern-match the situation and carry out some routine. This is a shortcut that saves time and cognitive energy. Yet, no matter how much we stick to routines, we still spend some of our time making hard choices, weighing alternatives, and predicting which available action will serve us best.
The study of how to make these sorts of decisions is known as Decision Theory. This field of research is closely intertwined with Economics, Philosophy, Mathematics, and (of course) Game Theory. It will be the subject of today's post.
Knightian uncertainty: a rejection of the MMEU rule
Recently, I found myself in a conversation with someone advocating the use of Knightian uncertainty. He (who I'm anonymizing as Sir Percy) made suggestions that are useful to most bounded reasoners, and which can be integrated into a Bayesian framework. He also claimed preferences that depend upon his Knightian uncertainty and that he's not an expected utility maximizer. Further, he claimed that Bayesian reasoning cannot capture his preferences. Specifically, Sir Percy said he maximizes minimum expected utility given his Knightian uncertainty, using what I will refer to as the "MMEU rule" to make decisions.
In my previous post, I showed that Bayesian expected utility maximizers can exhibit behavior in accordance with his preferences. Two such reasoners, Paranoid Perry and Cautious Caul, were explored. These hypothetical agents demonstrate that it is possible for Bayesians to be "ambiguity averse", i.e., to avoid certain types of uncertainty.
But Perry and Caul are unnatural agents using strange priors. Is this because we are twisting the Bayesian framework to represent behavior it is ill-suited to emulate? Or does the strangeness of Perry and Caul merely reveal a strangeness in the MMEU rule?
In this post, I'll argue the latter: maximization of minimum expected utility is not a good decision rule, for the same reason that Perry and Caul seem irrational. My rejection of the MMEU rule will follow from my rejections of Perry and Caul.
Knightian Uncertainty: Bayesian Agents and the MMEU rule
Recently, I found myself in a conversation with someone advocating the use of Knightian uncertainty. We both agreed that there's no point in singling out some of your uncertainty as "special" unless you treat it differently.
I am under the impression that most advice from advocates of Knightian uncertainty can be taken to heart in a Bayesian framework, and so I find the concept of "Knightian uncertainty" uncompelling. My friend, who I'm anonymizing as "Sir Percy", claims that he does treat Knightian uncertainty differently from normal uncertainty, and so he needs to make the distinction. Unlike an aspiring Bayesian reasoner, who attempts to maximize expected value, he maximizes the minimum expected value given his Knightian uncertainty. This is the MMEU rule motivated previously.
This surprised me: is it possible for a rational agent to refuse to maximize expected utility? My reflexive reaction was simple:
If you're a rational agent and you don't think you're maximizing expected utility, then you've misplaced your "utility" label.
Sir Percy had a response ready:
That can't be. Remember Sir Percy's coin toss. A coin has been tossed, and the event "H" is "the coin came up heads". Consider the following two bets:
- Pay 50¢ to be paid $1.10 if H
- Pay 50¢ to be paid $1.10 if ¬H
I don't know whether the coin was biased, I know only that my credence is in the interval [0.4, 0.6]. When considering the first bet, I notice that the probability of H may be 0.4, in which case the bet is expected to lose me 6¢. When considering the second bet, I notice that the probability of H may be 0.6, in which case the bet is expected to lose me 6¢. But when considering both together, I see that I will win 10¢ with certainty. So I reject each bet if presented individually, but I will pay up to 10¢ to play the pair.
As you can see, my preferences change under agglomeration of bets. It's not possible to view me as a Bayesian reasoner maximizing expected utility, because there is no credence you can assign to H such that a Bayesian reasoner shares my preferences. I can't be maximizing expected utility, no matter where you put your labels.
My rejection was vague and unformed at the time. I've since fleshed it out, and it will be presented below. But before continuing, see if you can spot my argument on your own:
My friend believes that H occurred with probability in the interval [.4, .6]. He is unwilling to pay 50¢ to be paid $1.10 if H, and he is unwilling to pay 50¢ to be paid $1.10 if ¬H, but he is willing to pay up to 10¢ for the pair. There is no way to assign a credence to the event H such that these actions are traditionally consistent. Yet, if we allow that rational agents can in principle have preferences about ambiguity, then he is acting 'rationally' in some sense. Is the Bayesian framework capable of capturing agents with these preferences?
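The arithmetic behind Sir Percy's coin toss can be checked with a short sketch. The function names are hypothetical, amounts are in cents, and the MMEU rule is modelled here simply as "take the worst expected gain over the credence interval, and accept only if it is positive", which is my reading of the rule rather than a canonical definition.

```python
from fractions import Fraction

# Sir Percy's credence that the coin came up heads lies in [0.4, 0.6].
CREDENCE_INTERVAL = (Fraction(2, 5), Fraction(3, 5))


def expected_gain(p_heads, win_if_heads, win_if_tails, cost):
    """Expected net gain in cents under a single credence `p_heads`."""
    return p_heads * win_if_heads + (1 - p_heads) * win_if_tails - cost


def min_expected_gain(win_if_heads, win_if_tails, cost, interval=CREDENCE_INTERVAL):
    """Worst-case expected gain over the credence interval.

    Expected gain is linear in p_heads, so the minimum over the whole
    interval is attained at one of its two endpoints.
    """
    return min(expected_gain(p, win_if_heads, win_if_tails, cost) for p in interval)


def best_single_bet(p_heads):
    """Best expected gain a Bayesian with credence `p_heads` sees in either bet alone."""
    return max(
        expected_gain(p_heads, 110, 0, 50),   # pay 50¢, receive $1.10 if H
        expected_gain(p_heads, 0, 110, 50),   # pay 50¢, receive $1.10 if ¬H
    )


if __name__ == "__main__":
    print("bet on H:   ", min_expected_gain(110, 0, 50))     # worst case -6
    print("bet on ¬H:  ", min_expected_gain(0, 110, 50))     # worst case -6
    print("both bets:  ", min_expected_gain(110, 110, 100))  # always 10
    print("Bayesian at p = 1/2:", best_single_bet(Fraction(1, 2)))  # 5
```

The two single-bet expected gains always sum to 10¢, so for any one credence at least one of them is worth 5¢ or more. That is why no single credence reproduces the pattern of rejecting each bet alone while accepting the pair.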
Knightian uncertainty in a Bayesian framework
Recently, I found myself in a conversation with someone advocating the use of Knightian uncertainty. I pointed out that it doesn't really matter what uncertainty you call "normal" and what uncertainty you call "Knightian" because, at the end of the day, you still have to cash out all your uncertainty into a credence so that you can actually act.
My conversation partner, who I'm anonymizing as "Sir Percy", acknowledged that this is true if your goal is to maximize your expected gains, but denied that he should maximize expected gains. He proposed maximizing minimum expected gains given Knightian uncertainty ("using the MMEU rule"), and when using such a rule, the distinction between normal uncertainty and Knightian uncertainty does matter. I motivated the MMEU rule in my previous post, and in the next post, I'll explore it in more detail.
In this post, I will be examining Knightian uncertainty more broadly. The MMEU rule is one way of cashing out Knightian uncertainty into decisions in a way that looks non-Bayesian. But this decision rule is only one way in which the concept of Knightian uncertainty could prove useful, and I want to take a post to explore the concept of Knightian uncertainty in its own right.
Knightian Uncertainty and Ambiguity Aversion: Motivation
Recently, I found myself in a conversation with someone advocating the use of Knightian uncertainty. I admitted that I've never found the concept compelling. We went back and forth for a little while. His points were crisp and well-supported, my objections were vague. We didn't have enough time to reach consensus, but it became clear that I needed to research his viewpoint and flesh out my objections before being justified in my rejection.
So I did. This is the first in a short series of posts during which I explore what it means for an agent to reason using Knightian uncertainty.
In this first post, I'll present a number of arguments claiming that Bayesian reasoning fails to capture certain desirable behavior. I'll discuss a proposed solution, maximization of minimum expected utility, which is advocated by my friend and others.
In the second post, I'll discuss some more general arguments against Bayesian reasoning as an idealization of human reasoning. What role should "unknown unknowns" play in a bounded Bayesian reasoner? Is "Knightian uncertainty" a useful concept that is not captured by the Bayesian framework?
In the third post, I'll discuss the proposed solution: can rational agents display ambiguity aversion? What does it mean to have a rational agent that does not maximize expected utility, maximizing "minimum expected utility" instead?
In the final post, I'll apply these insights to humans and articulate my objections to ambiguity aversion in general. I'll conclude that while it is possible for agents to be ambiguity-averse, ambiguity aversion in humans is a bias. The maximization of minimum expected utility may be a useful concept for explaining how humans actually act, but probably isn't how you should act.
Failures of an embodied AIXI
Building a safe and powerful artificial general intelligence seems a difficult task. Working on that task today is particularly difficult, as there is no clear path to AGI yet. Is there work that can be done now that makes it more likely that humanity will be able to build a safe, powerful AGI in the future? Benja and I think there is: there are a number of relevant problems that it seems possible to make progress on today using formally specified toy models of intelligence. For example, consider recent program equilibrium results and various problems of self-reference.
AIXI is a powerful toy model used to study intelligence. An appropriately-rewarded AIXI could readily solve a large class of difficult problems. This includes computer vision, natural language recognition, and many other difficult optimization tasks. That these problems are all solvable by the same equation — by a single hypothetical machine running AIXI — indicates that the AIXI formalism captures a very general notion of "intelligence".
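For readers who have not seen it, the "same equation" is usually written roughly as follows. This is Hutter's standard formulation as I recall it, and the notation is not taken from this post: the a, o, and r symbols are actions, observations, and rewards, m is the agent's horizon, U is a universal Turing machine, and ℓ(q) is the length in bits of program q.

```latex
a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
       \bigl[ r_k + \cdots + r_m \bigr]
       \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

Read from the inside out: every program consistent with the history is weighted by its length, future rewards are summed, and the action sequence is chosen by alternating maximization over actions with expectation over percepts.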
However, AIXI is not a good toy model for investigating the construction of a safe and powerful AGI. This is not just because AIXI is uncomputable (and its computable counterpart AIXItl infeasible). Rather, it's because AIXI cannot self-modify. This fact is fairly obvious from the AIXI formalism: AIXI assumes that in the future, it will continue being AIXI. This is a fine assumption for AIXI to make, as it is a very powerful agent and may not need to self-modify. But this inability limits the usefulness of the model. Any agent capable of undergoing an intelligence explosion must be able to acquire new computing resources, dramatically change its own architecture, and keep its goals stable throughout the process. The AIXI formalism lacks tools to study such behavior.
This is not a condemnation of AIXI: the formalism was not designed to study self-modification. However, this limitation is neither trivial nor superficial: even though an AIXI may not need to make itself "smarter", real agents may need to self-modify for reasons other than self-improvement. The fact that an embodied AIXI cannot self-modify leads to systematic failures in situations where self-modification is actually necessary. One such scenario, made explicit using Botworld, is explored in detail below.
In this game, one agent will require another agent to precommit to a trade by modifying its code in a way that forces execution of the trade. AIXItl, which is unable to alter its source code, is not able to implement the precommitment, and thus cannot enlist the help of the other agent.
Afterwards, I discuss a slightly more realistic scenario in which two agents have an opportunity to cooperate, but one agent has a computationally expensive "exploit" action available and the other agent can measure the waste heat produced by computation. Again, this is a scenario where an embodied AIXItl fails to achieve a high payoff against cautious opponents.
Though scenarios such as these may seem improbable, they are not strictly impossible. Such scenarios indicate that AIXI — while a powerful toy model — does not perfectly capture the properties desirable in an idealized AGI.