I wrote an example and then erased it. It was based on a possibly apocryphal anecdote by Richard Feynman, which I am recalling from memory, about the motivations for working on the Manhattan Project. The original reason for starting the project was to beat Germany to building an atomic bomb. After Germany was defeated, that reason was outdated, but he (and others sharing his motivation) continued working anyway, solving the immediate problem rather than the one they had originally intended to solve.
That's an example of the logical system and the motivational system being in conflict, even if the anecdote doesn't turn out to be very accurate. I hope it is suggestive of the distinction.
The motivational system -could- be a gatekeeper, but I suspect that would mean there are substantive issues in how the logical system is devised. It should function as an enabler - as the motive force behind all actions taken within the logical system. And yes, in a sense it should be less intelligent than the logical system; if it considers everything to the same extent the logical system does, it isn't doing its job, it's just duplicating the efforts of the logical system.
That is, I'm regarding an ideal motivational system as something that drives the logical system; the logical system shouldn't be -trying- to trick its motivational system, in much the same way and for the same reason you shouldn't try to convince yourself of a falsehood.
The issue in describing this is that I can think of plenty of motivational systems, but none which do what we want here. (Granted, if I could, the friendly AI problem might be substantively solved.) I can't even say for certain that a gatekeeper motivator wouldn't work.
Part of my mental model of this functional dichotomy, however, is that the logical system is stateless - if the motivational system asks it to evaluate its own solutions, it has to do so only with the information the motivational system gives it. The communication model has a very limited vocabulary. Rules for the system, but not rules for reasoning, are encoded into the motivational system, and govern its internal communications only. The logical system goes as far as it can with what it has, produces a set of candidate solutions and unresolved problems, and passes these back to the motivational system. Unresolved problems might be passed back with additional information necessary to resolve them, depending on the motivational system's rules.
So in my model-of-my-model, an Asimov-style AI might hand a problem to its logical system, get several candidate solutions back, and then pass those candidate solutions back into the logical system with the rules of robotics, one by one, asking if this action could violate each rule in turn, discarding any candidate solutions which do.
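A minimal sketch of that loop in Python may make the division of labor more concrete. Everything named here (`logical_system`, `could_violate`, `motivational_system`, the toy rules and options) is a hypothetical stand-in invented for illustration; in particular, the "reasoning" is a placeholder that just returns the options it was handed, and the rules are crude string checks rather than anything resembling real rules of robotics.

```python
from typing import Callable, Dict, List

# Hypothetical types: an action is a plain label, and a rule is a
# predicate that returns True when the action could violate it.
Action = str
Rule = Callable[[Action], bool]


def logical_system(problem: str, context: Dict[str, List[Action]]) -> List[Action]:
    """Stateless solver: it sees only the problem and context it is handed.

    Placeholder reasoning: it simply returns the options in the context.
    """
    return list(context.get("options", []))


def could_violate(action: Action, rule: Rule) -> bool:
    """Stateless check of one candidate against one rule."""
    return rule(action)


def motivational_system(
    problem: str, context: Dict[str, List[Action]], rules: List[Rule]
) -> List[Action]:
    """Drive the logical system, then vet each candidate rule by rule."""
    candidates = logical_system(problem, context)
    approved = []
    for action in candidates:
        # Each candidate is passed back with each rule in turn;
        # it is discarded as soon as any rule could be violated.
        if not any(could_violate(action, rule) for rule in rules):
            approved.append(action)
    return approved


# Toy rules (assumptions for the example only).
rules: List[Rule] = [
    lambda a: "harm" in a,      # stand-in for "could injure a human"
    lambda a: "disobey" in a,   # stand-in for "could disobey an order"
]
context = {"options": ["fetch coffee", "harm intruder", "disobey order", "open door"]}

approved = motivational_system("help the human", context, rules)
print(approved)
```

Note that the logical system keeps no state between calls: the motivational system has to supply the candidate and the rule every time, which is the limited-vocabulary communication the model assumes.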
Manual motivational systems are also conceptually possible, although probably too slow to be of much use.
[My apologies if this response isn't very good; I'm running short on time, and don't have any more time for editing, and in particular for deciding which pieces to exclude.]
(With Kaj Sotala)
SI's current R&D plan seems to go as follows:
1. Develop the perfect theory.
2. Implement this as a safe, working, Artificial General Intelligence -- and do so before anyone else builds an AGI.
The Singularity Institute is almost the only group working on friendliness theory (although with very few researchers). So, they have the lead on Friendliness. But there is no reason to think that they will be ahead of anyone else on the implementation.
The few AGI designs we can look at today, like OpenCog, are big, messy systems which intentionally attempt to exploit various cognitive dynamics that might combine in unexpected and unanticipated ways, and which have various human-like drives rather than the sort of supergoal-driven, utility-maximizing goal hierarchies that Eliezer talks about, or which a mathematical abstraction like AIXI employs.
A team which is ready to adopt a variety of imperfect heuristic techniques will have a decisive lead on approaches based on pure theory. Without the constraint of safety, one of them will beat SI in the race to AGI. SI cannot ignore this. Real-world, imperfect, safety measures for real-world, imperfect AGIs are needed. These may involve mechanisms for ensuring that we can avoid undesirable dynamics in heuristic systems, or AI-boxing toolkits usable in the pre-explosion stage, or something else entirely.
SI’s hoped-for theory will include a reflexively consistent decision theory, something like a greatly refined Timeless Decision Theory. It will also describe human value as formally as possible, or at least describe a way to pin it down precisely, something like an improved Coherent Extrapolated Volition.
The hoped-for theory is intended to provide not only safety features, but also a description of the implementation, as some sort of ideal Bayesian mechanism, a theoretically perfect intelligence.
SI-ers have said to me that SI's design will have a decisive implementation advantage. The idea is that because strap-on safety can’t work, Friendliness research necessarily involves more fundamental architectural design decisions, which also happen to be general AGI design decisions that some other AGI builder could grab and save themselves a lot of effort. The assumption seems to be that all other designs are based on hopelessly misguided design principles. SI-ers, the idea seems to go, are so smart that they'll build AGI far before anyone else. Others will succeed only when hardware capabilities allow crude near-brute-force methods to work.
Yet even if the Friendliness theory provides the basis for intelligence, the nitty-gritty of SI’s implementation will still be far away, and will involve real-world heuristics and other compromises.
We can compare SI’s future AI design to AIXI, another mathematically perfect AI formalism (though it has some critical reflexivity issues). Schmidhuber, Hutter, and colleagues think that their AIXI can be scaled down into a feasible implementation, and have implemented some toy systems. Similarly, any actual AGI based on SI's future theories will have to stray far from its mathematically perfected origins.
Moreover, SI's future friendliness proof may simply be wrong. Eliezer writes a lot about logical uncertainty, the idea that you must treat even purely mathematical ideas with the same probabilistic techniques as any ordinary uncertain belief. He pursues this mostly so that his AI can reason about itself, but the same principle applies to Friendliness proofs as well.
Perhaps Eliezer thinks that a heuristic AGI is absolutely doomed to failure; that a hard takeoff soon after the creation of the first AGI is so overwhelmingly likely that a mathematically designed AGI is the only one that could stay Friendly. In that case, we have to work on a pure-theory approach, even if it has a low chance of being finished first. Otherwise we'll be dead anyway. If an embryonic AGI will necessarily undergo an intelligence explosion, we have no choice but to "shut up and do the impossible."
I am all in favor of gung-ho knife-between-the-teeth projects. But when you think that your strategy is impossible, then you should also look for a strategy which is possible, if only as a fallback. Thinking about safety theory until drops of blood appear on your forehead (as Eliezer puts it, quoting Gene Fowler) is all well and good. But if there is only a 10% chance of achieving 100% safety (not that there really is any such thing), then I'd rather go for a strategy that provides only a 40% promise of safety, but with a 40% chance of achieving it. OpenCog and the like are going to be developed regardless, and probably before SI's own provably friendly AGI. So, even an imperfect safety measure is better than nothing.
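To make the arithmetic behind that preference explicit, here is a rough sketch under one naive reading: treat "safety" as a single number, assume a strategy that fails to materialize yields zero safety, and multiply. The function name and the zero-on-failure assumption are illustrative simplifications, not part of the argument itself.

```python
def expected_safety(p_achieved: float, safety_if_achieved: float) -> float:
    """Naive expected safety, assuming a failed strategy yields zero safety."""
    return p_achieved * safety_if_achieved


# 10% chance of achieving 100% safety (the pure-theory strategy).
perfect = expected_safety(0.10, 1.00)

# 40% chance of achieving a 40% promise of safety (the imperfect strategy).
imperfect = expected_safety(0.40, 0.40)

print(f"pure theory: {perfect:.2f}, imperfect measures: {imperfect:.2f}")
```

On this reading the imperfect strategy comes out at roughly 0.16 against 0.10, and the two are not mutually exclusive in any case.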
If heuristic approaches have a 99% chance of an immediate unfriendly explosion, then that strategy might be wrong. But SI, better than anyone, should know that any intuition-based probability estimate of “99%” really means “70%”. Even if other approaches are long-shots, we should not put all our eggs in one basket. Theoretical perfection and stopgap safety measures can be developed in parallel.
Given what we know about human overconfidence and the general reliability of predictions, the actual outcome will to a large extent be something that none of us ever expected or could have predicted. No matter what happens, progress on safety mechanisms for heuristic AGI will improve our chances if something entirely unexpected happens.
What impossible thing should SI be shutting up and doing? For Eliezer, it’s Friendliness theory. To him, safety for heuristic AGI is impossible, and we shouldn't direct our efforts in that direction. But why shouldn't safety for heuristic AGI be another impossible thing to do?
(Two impossible things before breakfast … and maybe a few more? Eliezer seems to be rebuilding logic, set theory, ontology, epistemology, axiology, decision theory, and more, mostly from scratch. That's a lot of impossibles.)
And even if safety for heuristic AGIs is really impossible for us to figure out now, there is some chance of an extended soft takeoff that will allow us to develop heuristic AGIs which will help in figuring out AGI safety, whether because we can use them for our tests, or because they can apply their embryonic general intelligence to the problem. Goertzel and Pitt have urged this approach.
Yet resources are limited. Perhaps the folks who are actually building their own heuristic AGIs are in a better position than SI to develop safety mechanisms for them, while SI is the only organization which is really working on a formal theory of Friendliness, and so should concentrate on that. It could be better to focus SI's resources on areas in which it has a relative advantage, or which have a greater expected impact.
Even if so, SI should evangelize AGI safety to other researchers, not only as a general principle, but also by offering theoretical insights that may help them as they work on their own safety mechanisms.
In summary:
1. AGI development which is unconstrained by a friendliness requirement is likely to beat a provably-friendly design in a race to implementation, and some effort should be expended on dealing with this scenario.
2. Pursuing a provably-friendly AGI, even if very unlikely to succeed, could still be the right thing to do if it were certain that we’ll have a hard takeoff very soon after the creation of the first AGIs. However, we do not know whether or not this is true.
3. Even the provably friendly design will face real-world compromises and errors in its implementation, so the implementation will not itself be provably friendly. Thus, safety protections of the sort needed for heuristic design are needed even for a theoretically Friendly design.