I don’t see how this could predictably happen without something else going wrong first. I agree, and have acknowledged, that the question-learning solution is hard to test, so let’s focus on the RL approach. (Though I also don’t expect this to happen for the question-answering solution.) In this comment I’ll focus on the misunderstanding case.
So in the future, you expect to predictably make a decision which the aliens would consider catastrophically bad. It seems to me like:
If the solution would really be considered catastrophically bad, and it is chosen for evaluation, then it will receive a very low payoff, unless the scheme fails in some other way that we have not yet discussed.
So you would only make such mistakes if you thought that you would receive enough expected benefit from more aggressive decisions that it offsets this predictable possibility of a low payoff from catastrophic error.
But if you took more conservative actions, you could justify those actions (when they were evaluated) by explaining the predicted possibility of a catastrophic outcome. Unless something else has gone wrong, the aliens care more about averting this prospect of a bad outcome than saving time by you being more aggressive, so they shouldn’t penalize you for this.
So if you behave aggressively even at the risk of a catastrophic error, it seems like one of the following must have gone wrong:
1. In fact the aliens wouldn’t be able to detect a catastrophic error during evaluation.
2. The conservative policy is actually worse than the aggressive policy in expectation, based on the considered judgment of the aliens.
3. The aliens wouldn’t accept the justification for conservatism, based on a correct argument that its costs are outweighed by the possibility for error.
4. This argument is wrong, or else it’s right but you wouldn’t recognize this argument or something like it.
Any of these could happen. 1 and 3 seem like they lead to more straightforward problems with the scheme, so would be worthwhile to explore on other grounds. 2 doesn’t seem likely to me, unless we are dealing with a very minor catastrophe. But I am open to arguing about it. The basic question seems to be how tough it is to ask the aliens enough questions to avoid doing anything terrible.
The examples you give in the parallel thread don’t seem like they could present a big problem; you can ask the alien a modest number of questions like “how do you feel about the tradeoff between the world being destroyed and you controlling less of it?” And you can help to the maximum possible extent in answering them. Of course the alien won’t have perfect answers, but their situation seems better than the situation prior to building such an AI, when they were also making such tradeoffs imperfectly (presumably even more imperfectly, unless you are completely unhelpful to the aliens in answering such questions). And there don’t seem to be many plans where the cost of implementing the plan is greater than the cost of consulting the alien about how it feels about possible consequences of that plan.
Of course you can also get this information in other ways (e.g. look at writings and past behavior of the aliens) or ask more open-ended questions like “what are the most likely ways things could go wrong, given what I expect to do over the next week,” or pursue compromise solutions that the aliens are unlikely to consider too objectionable.
ETA: actually it's fine if the catastrophic plan is not evaluated badly; all of the work can be done in the step where the aliens prefer conservative plans to aggressive ones in general, after you explain the possibility of a catastrophic error.
The conservative policy is actually worse than the aggressive policy in expectation, based on the considered judgment of the aliens.
What if this is true, because other aliens (people) have similar AIs, so the aggressive policy is considered better, in a PD-like game theoretic sense, but it would have been better for everyone if nobody had built such AIs?
With any of the black-box designs I've seen, I would be very reluctant to push the button that would potentially give it superhuman capabilities, even if we have theoretical reasons to think that it woul...
I put "Friendliness" in quotes in the title, because I think what we really want, and what MIRI seems to be working towards, is closer to "optimality": create an AI that minimizes the expected amount of astronomical waste. In what follows I will continue to use "Friendly AI" to denote such an AI since that's the established convention.
I've often stated my objections to MIRI's plan to build an FAI directly (instead of after human intelligence has been substantially enhanced). But it's not because, as some have suggested while criticizing MIRI's FAI work, we can't foresee what problems need to be solved. I think it's because we can largely foresee what kinds of problems need to be solved to build an FAI, but they all look superhumanly difficult, either due to their inherent difficulty, or the lack of opportunity for "trial and error", or both.
When people say they don't know what problems need to be solved, they may be mostly talking about "AI safety" rather than "Friendly AI". If you think in terms of "AI safety" (i.e., making sure some particular AI doesn't cause a disaster) then that does look like a problem that depends on what kind of AI people will build. "Friendly AI" on the other hand is really a very different problem, where we're trying to figure out what kind of AI to build in order to minimize astronomical waste. I suspect this may explain the apparent disagreement, but I'm not sure. I'm hoping that explaining my own position more clearly will help figure out whether there is a real disagreement, and what's causing it.
The basic issue I see is that there is a large number of serious philosophical problems facing an AI that is meant to take over the universe in order to minimize astronomical waste. The AI needs a full solution to moral philosophy to know which configurations of particles/fields (or perhaps which dynamical processes) are most valuable and which are not. Moral philosophy in turn seems to have dependencies on the philosophy of mind, consciousness, metaphysics, aesthetics, and other areas. The FAI also needs solutions to many problems in decision theory, epistemology, and the philosophy of mathematics, in order to not be stuck with making wrong or suboptimal decisions for eternity. These essentially cover all the major areas of philosophy.
For an FAI builder, there are three ways to deal with the presence of these open philosophical problems, as far as I can see. (There may be other ways for the future to turn out well without the AI builders making any special effort, for example if being philosophical is just a natural attractor for any superintelligence, but I don't see any way to be confident of this ahead of time.) I'll name them for convenient reference, but keep in mind that an actual design may use a mixture of approaches.
The problem with Normative AI, besides the obvious inherent difficulty (as evidenced by the slow progress of human philosophers after decades, sometimes centuries of work), is that it requires us to anticipate all of the philosophical problems the AI might encounter in the future, from now until the end of the universe. We can certainly foresee some of these, like the problems associated with agents being copyable, or the AI radically changing its ontology of the world, but what might we be missing?
Black-Box Metaphilosophical AI is also risky, because it's hard to test/debug something that you don't understand. Besides that general concern, designs in this category (such as Paul Christiano's take on indirect normativity) seem to require that the AI achieve superhuman levels of optimizing power before being able to solve its philosophical problems, which seems to mean that a) there's no way to test them in a safe manner, and b) it's unclear why such an AI won't cause disaster in the time period before it achieves philosophical competence.
White-Box Metaphilosophical AI may be the most promising approach. There is no strong empirical evidence that solving metaphilosophy is superhumanly difficult, simply because not many people have attempted to solve it. But I don't think that a reasonable prior combined with what evidence we do have (i.e., absence of visible progress or clear hints as to how to proceed) gives much reason for optimism either.
To recap, I think we can largely already see what kinds of problems must be solved in order to build a superintelligent AI that will minimize astronomical waste while colonizing the universe, and it looks like they probably can't be solved correctly with high confidence until humans become significantly smarter than we are now. I think I understand why some people disagree with me (e.g., Eliezer thinks these problems just aren't that hard, relative to his abilities), but I'm not sure why some others say that we don't yet know what the problems will be.