On the subject of how an FAI team can avoid accidentally creating a UFAI, Carl Shulman wrote:
If we condition on having all other variables optimized, I'd expect a team to adopt very high standards of proof, and recognize limits to its own capabilities, biases, etc. One of the primary purposes of organizing a small FAI team is to create a team that can actually stop and abandon a line of research/design (Eliezer calls this "halt, melt, and catch fire") that cannot be shown to be safe (given limited human ability, incentives and bias).
In the history of philosophy, there have been many steps in the right direction, but virtually no significant problems have been fully solved, such that philosophers can agree that some proposed idea is the last word on a given subject. An FAI design involves making many explicit or implicit philosophical assumptions, many of which may then become fixed forever as governing principles for a new reality. They'll end up being the last word on their subjects, whether we like it or not. Given the history of philosophy, and applying the outside view, how can an FAI team possibly reach "very high standards of proof" regarding the safety of a design? But if we can foresee that they can't, what is the point of aiming for that predictable outcome now?
Until recently I hadn't paid a lot of attention to the discussions here about inside view vs outside view, because the discussions have tended to focus on the applicability of these views to the problem of predicting intelligence explosion. It seemed obvious to me that outside views can't possibly rule out intelligence explosion scenarios, and even a small probability of a future intelligence explosion would justify a much higher level of investment than we're currently making in preparing for that possibility. But given that the inside vs outside view debate may also be relevant to the "FAI Endgame", I read up on Eliezer's and Luke's most recent writings on the subject... and found them to be unobjectionable. Here's Eliezer:
On problems that are drawn from a barrel of causally similar problems, where human optimism runs rampant and unforeseen troubles are common, the Outside View beats the Inside View.
Does anyone want to argue that Eliezer's criteria for using the outside view are wrong, or don't apply here?
And Luke:
One obvious solution is to use multiple reference classes, and weight them by how relevant you think they are to the phenomenon you're trying to predict.
[...]
Once you've combined a handful of models to arrive at a qualitative or quantitative judgment, you should still be able to "adjust" the judgment in some cases using an inside view.
These ideas seem harder to apply, so I'll ask for readers' help. What reference classes should we use here, in addition to past attempts to solve philosophical problems? And what inside-view adjustments could a future FAI team make that might justifiably overcome the (most obvious-to-me) outside-view conclusion that they're very unlikely to be in possession of complete and fully correct solutions to a diverse range of philosophical problems?
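To make Luke's suggestion more concrete, here is a minimal sketch of what "combine multiple reference classes by relevance weight, then apply an inside-view adjustment" could look like numerically. Every reference class, base rate, weight, and adjustment factor below is a hypothetical placeholder I made up for illustration, not anything proposed by Luke or Eliezer:

```python
# Hypothetical illustration of relevance-weighted reference-class forecasting
# plus an inside-view adjustment. All classes, base rates, and weights are
# invented for the sake of the example.

# Each entry: (base rate of "team ends up holding complete, fully correct
# solutions to the relevant philosophical problems", relevance weight).
reference_classes = {
    "past attempts to settle philosophical problems": (0.01, 0.6),
    "past high-assurance engineering projects": (0.30, 0.3),
    "past attempts to forecast one's own blind spots": (0.05, 0.1),
}  # relevance weights sum to 1

# Combine the classes by a relevance-weighted average (one simple choice among many).
outside_view = sum(rate * w for rate, w in reference_classes.values())

# An inside-view adjustment, e.g. "this team has unusually strong norms for
# halting unsafe lines of research", expressed as a multiplicative update.
inside_view_factor = 1.5  # entirely hypothetical
adjusted = min(1.0, outside_view * inside_view_factor)

print(f"outside view: {outside_view:.3f}, after inside-view adjustment: {adjusted:.3f}")
```

The interesting question, of course, is whether any defensible choice of weights and adjustments can push the result anywhere near "very high standards of proof", given how low the base rate from the philosophy reference class appears to be.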
Wait, that's not my argument. I was saying that while people like you are trying to develop technologies that let you "remain in control", others with shorter planning horizons, or who think they have simple, easily transmitted values, will already be deploying new AGI capabilities, so you'll fall behind with every new development. This is what I'm suggesting only a singleton can prevent.
You could try to minimize this kind of value drift by speeding up "AI control" progress, but it's really hard for me to see how you could speed it up enough not to lose a competitive race against those who don't see a need to solve this problem, or who think they can solve a much easier one. The way I model AGI development in a slow-FOOM scenario is that AGI capability will come in spurts along with changing architectures, and it's hard to do AI safety work "ahead of time" because of dependencies on AI architecture. So each time there is a big AGI capability development, you'll be forced to spend time developing new AI safety tech for that capability/architecture, while others won't wait to deploy it. Even a small delay can lead to a large loss, since AIs can be easily copied, and more capable but uncontrolled AIs would quickly take over economic niches occupied by existing humans and controlled AIs. Even assuming secure rights to what you already own on Earth, your share of the future universe will become smaller and smaller as most of the world's new wealth goes to uncontrolled AIs or AIs with simple values.
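Here's a toy version of that model, just to show how a small per-spurt safety delay compounds into a large loss of share. All the growth rates, delays, and intervals are numbers I invented for illustration; the point is only the qualitative trend:

```python
# Toy model of the compounding-delay worry. All parameters are hypothetical.

base_growth = 1.1      # growth factor while still running the older, controlled AI
boosted_growth = 1.5   # growth factor once the newest capability is deployed
safety_delay = 2       # periods of safety work before controlled deployment
spurt_interval = 5     # periods between capability spurts
periods = 40

controlled = uncontrolled = 1.0
delay_left = 0
for t in range(periods):
    if t % spurt_interval == 0:
        delay_left = safety_delay    # new architecture -> new safety work needed
    uncontrolled *= boosted_growth   # deploys every new capability immediately
    if delay_left > 0:
        controlled *= base_growth    # stuck on the previous generation for now
        delay_left -= 1
    else:
        controlled *= boosted_growth

share = controlled / (controlled + uncontrolled)
print(f"controlled side's share of total wealth after {periods} periods: {share:.1%}")
```

With these made-up numbers the controlled side ends with well under one percent of total wealth, even though it only pauses for two periods out of every five, because the gap multiplies with every spurt.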
Where do you see me going wrong here? If you think I'm just too confident in this model, what alternative scenario can you suggest, in which people like you and me (or our values) get to keep a large share of the future universe just by speeding up the onset of serious AI safety work?