Summary: The edge instantiation problem is a hypothesized problem for safe loading in advanced agent scenarios where, for most utility functions we might try to formalize or teach, the maximum of the agent's utility function will end up lying at an edge of the solution space that is a 'weird extreme' from our perspective.
On many classes of problems, the maximizing solution tends to lie at an extreme edge of the solution space. This means that if we have an intuitive outcome X in mind and try to obtain it by giving an agent a solution fitness function F that assigns X a high value, the real maximum of F may be X' which is much more extreme than X. This is the Edge Instantiation problem, a specialization of Bostrom's PerverseInstantiation class of problems.
It is hypothesized (by e.g. Yudkowsky) that many classes of solution that have been proposed for Edge Instantiation would fail to resolve the entire problem and that we can see that further Edge Instantiation problems would remain. For example, even if we consider a Satisficing utility function with only values 0 and 1 where outcome X has value 1, an expected utility maximizer could still end up deploying an extreme strategy in order to maximize the probability that a satisfactory outcome is obtained. Considering several proposed solutions like this and their failures suggests that Edge Instantiation is a ResistantProblem (not ultimately unsolvable, but with many attractive-seeming solutions failing to work).
Consider the hypothetical lesson of the Sorcerer's Apprentice scenario: You instruct an artificial agent to add water to a cauldron, and it floods the entire workplace. In general, this is a case of Bostrom's PerverseInstantiation 'your formal algorithm does not have the intuitive meaning you hoped it did'. It's also the further specialization of Edge Instantiation, 'the agent's solution was more extreme than the one you intuitively had in mind'. You had in mind only adding enough water to fill the cauldron and then stopping, but some stage of the agent's solution-finding process optimized on a step where 'flooding the workplace' scored higher than 'add 4 buckets of water and then shut down safely', even though both of these qualify as 'filling the cauldron'.
This could be because (in the most naive case) the utility function you gave the agent was increasing in the amount of water in contiguous contact with the cauldron's interior - you gave it a utility function that says 4 buckets of water were good and 4,000 buckets of water were better. Then the 'extreme' solution that results is not just the result of ambiguity in the instructions. The problem wasn't just that a wide range of possibilities corresponded to 'filling the cauldron' and a randomly selected possibility from this space surprised us by not being the central example we originally had in mind. Rather, there's a systematic tendency for a maximizing solution to occupy an extreme edge of the solution space, which means that we are systematically likely to see 'extreme' or 'weird' solutions rather than the 'normal' examples we had in mind.
Suppose that, having foreseen in advance the above possible disaster, you try to patch the agent by instructing it not to move more than 50 kilograms of material. The agent promptly begins to build subagents (maybe none of which move more than 50 kilograms) which build further agents and again flood the workplace. You have run into a NearestUnblockedNeighbor problem; when you excluded one extreme edge of the solution space, the result was not the central-feeling 'normal' example you originally had in mind. Instead, the new maximum lay on a new extreme edge of the solution space.
Or perhaps you defined what you thought was a satisficing agent, with a utility function that assigned 1 in all cases where there were at least 4 buckets of water in the cauldron and 0 otherwise. The agent then calculated that it could increase the probability of this condition obtaining from 99.9% to 99.99% by replicating subagents and repeatedly filling the cauldron, just in case one agent malfunctioned. Since 0.9999 > 0.999, there is then a more extreme solution with greater expected utility, even though the utility function itself is binary and satisficing.
Edge Instantiation is a thorny problem if it is in fact a pragmatically important problem for advanced agent scenarios, and a problem that resists most 'naive' attempts to correct it. The proposition is not that the Edge Instantiation Problem is unresolvable, but that it's real, doesn't have a simple answer, and resists most simple attempts to patch it.
As with most aspects of the value loading problem, OrthogonalityThesis is an implicit premise of the Edge Instantiation problem; for Edge Instantiation to be a problem for advanced agents implies that 'what we really meant' or the outcomes of highest value are not inherently picked out by every possible maximizing process, and that most possible utility functions do not care 'what we really meant' unless explicitly constructed to have a DoWhatIMean behavior.
If normative values were extremely simple (of very low algorithmic complexity), then they could be formally specified in full, and the most extreme strategy that scored highest on this formal measure simply would correspond with what we really wanted, with no downsides that hadn't been taken into account in the score.
The Edge Instantiation problem has the NearestUnblockedNeighbor pattern. If you foresee one specific 'perverse' instantiation and try to prohibit it, the maximum over the remaining solution space is again likely to be at another extreme edge of the solution space that again seems 'perverse'.
Advanced agents search larger solution spaces than we do. Therefore the project of trying to visualize all the strategies that might fit a utility function, to try to verify in our own minds that the maximum is somewhere safe, seems exceptionally untrustworthy (not AdvancedSafe).
Agents that acquire new strategic options or become able to search a wider range of the solution space may go from having only apparently 'normal' solutions to apparently 'extreme' solutions. For example, an agent that inductively learns human smiles as a component of its utility function, might as a non-advanced agent have access only to strategies that make humans happy in an intuitive sense (thereby producing the apparent observation that everything is going fine and the agent is working as intended), and then after self-improvement, acquire as an advanced agent the strategic option of transforming the future light cone into tiny molecular smileyfaces.
Suppose you tried to build an agent that was an expected utility satisficer - rather than having a 0-1 utility function and thus chasing probabilities of goal satisfaction ever closer to 1, the agent searches for strategies that have at least 0.999 expected utility. Why doesn't this resolve the problem?
A bounded satisficer doesn't rule out the solution of filling the room with water, since this solution also has >0.999 expected utility. It only requires the agent to carry out one cognitive algorithm which has at least one maximizing or highly optimizing stage, in order for 'fill the room with water' to be preferred to 'add 4 buckets and shut down safely' on that stage (while being equally acceptable at future satisficing stages). (Dispreferring solutions with 'higher impacts' in general is the open problem of LowImpactAI. Currently, no formalizable utility function is known that plausibly has the right intuitive meaning for this.)
On a meta-level, we may run into problems of ReflectiveConsistency for ReflectiveAgents. Maybe one simple way of obtaining at least 0.999 expected utility is to create a subagent that maximizes expected utility? It seems intuitively clear why bounded maximizers would build boundedly maximizing offspring, but a bounded satisficer doesn't need to build boundedly satisficing offspring - a bounded maximizer might also be 'good enough'. (In the current theory of TilingAgents, we can prove that an expected utility satisficer can tile to an expected utility satisficer with some surprising caveats, but the problem is that it can tile to other things besides an expected utility satisficer.)
Since it seems very easy for at least one stage of a self-modifying agent to end up preferring solutions that have higher scores relative to some scoring rule, the EdgeInstantiation problem can be expected to resist naive attempts to describe an agent that seems to have an overall behavior of 'not trying quite so hard'. It's also not clear how to make the instruction 'don't try so hard' be ReflectivelyConsistent, or apply to every part of a considered subagent. This is also why LimitedOptimization is an open problem.
(Obvious.)