Outside View(s) and MIRI's FAI Endgame

Wei Dai

On the subject of how an FAI team can avoid accidentally creating a UFAI, Carl Shulman wrote:

If we condition on having all other variables optimized, I'd expect a team to adopt very high standards of proof, and recognize limits to its own capabilities, biases, etc. One of the primary purposes of organizing a small FAI team is to create a team that can actually stop and abandon a line of research/design (Eliezer calls this "halt, melt, and catch fire") that cannot be shown to be safe (given limited human ability, incentives and bias).

In the history of philosophy, there have been many steps in the right direction, but virtually no significant problems have been fully solved, such that philosophers can agree that some proposed idea can be the last words on a given subject. An FAI design involves making many explicit or implicit philosophical assumptions, many of which may then become fixed forever as governing principles for a new reality. They'll end up being last words on their subjects, whether we like it or not. Given the history of philosophy and applying the outside view, how can an FAI team possibly reach "very high standards of proof" regarding the safety of a design? But if we can foresee that they can't, then what is the point of aiming for that predictable outcome now?

Until recently I haven't paid a lot of attention to the discussions here about inside view vs outside view, because the discussions have tended to focus on the applicability of these views to the problem of predicting intelligence explosion. It seemed obvious to me that outside views can't possibly rule out intelligence explosion scenarios, and even a small probability of a future intelligence explosion would justify a much higher than current level of investment in preparing for that possibility. But given that the inside vs outside view debate may also be relevant to the "FAI Endgame", I read up on Eliezer and Luke's most recent writings on the subject... and found them to be unobjectionable. Here's Eliezer:

On problems that are drawn from a barrel of causally similar problems, where human optimism runs rampant and unforeseen troubles are common, the Outside View beats the Inside View.

Does anyone want to argue that Eliezer's criteria for using the outside view are wrong, or don't apply here?

And Luke:

One obvious solution is to use multiple reference classes, and weight them by how relevant you think they are to the phenomenon you're trying to predict.

[...]

Once you've combined a handful of models to arrive at a qualitative or quantitative judgment, you should still be able to "adjust" the judgment in some cases using an inside view.

These ideas seem harder to apply, so I'll ask for readers' help. What reference classes should we use here, in addition to past attempts to solve philosophical problems? What inside view adjustments could a future FAI team make, such that they might justifiably overcome the (most obvious-to-me) outside view's conclusion that they're very unlikely to be in possession of complete and fully correct solutions to a diverse range of philosophical problems?

could you provide a hard philosophical problem (of the kind for which feedback is impossible) together with an argument that this problem must be resolved before human-level AGI arrives?

I can't provide a single example because it depends on the FAI design. I think multiple design approaches are possible but each involves its own hard philosophical problems.

To try to make my point clearer (though I think I'm repeating myself): we can aim to build machine intelligences which pursue the outcomes we would have pursued if we had thought longer (including machine intelligences that allow human owners to remain in control of the situation and make further choices going forward, or bootstrap to more robust solutions). There are questions about what formalization of "thought longer" we endorse, but of course we must face these with or without machine intelligence.

At least one hard problem here is, as you point out, how to formalize "thought longer", or perhaps "remain in control". Obviously an AGI will inevitably influence the options we have and the choices we end up making, so what does "remain in control" mean? I don't understand your last point here, that "we must face these with or without machine intelligence". If people weren't trying to build AGI and thereby forcing us to solve these kinds of problems before they succeed, we'd have much more time to work on them and hence a much better chance of getting the answers right.

For the most part, the questions involved in building such an AI are empirical though hard-to-test ones---would we agree that the AI basically followed our wishes, if we in fact thought longer?---and these don't seem to be the kinds of questions that have proved challenging, and probably don't even count as "philosophical" problems in the sense you are using the term.

If we look at other empirical though hard-to-test questions (e.g., what security holes exist in this program) I don't see much reason to be optimistic either. What examples are you thinking of that make you say "these don't seem to be the kinds of questions that have proved challenging"?

I suspect I also object to your degree of pessimism regarding philosophical claims, but I'm not sure and that is probably secondary at any rate.

I suspect that even the disagreement we're currently discussing isn't the most important one between us, and I'm still trying to figure out how to express what I think may be the most important disagreement. Since we'll be meeting soon for the decision theory workshop, maybe we'll get a chance to talk about it in person.

Since we'll be meeting soon for the decision theory workshop, maybe we'll get a chance to talk about it in person.

If you get anywhere, please share your conclusions here.
