How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe?
(I've written about this in my Shortform and may regurgitate some stuff from there.) Eliezer proposes that we separate an AI in design space from one that would constitute a fate worse than death if, e.g., the reward model's sign (+/-) were flipped or the direction of updates to the reward model were reversed. This seems absolutely crucial, although I'm not yet aware of any robust way of doing it. To this end, Eliezer proposes assigning the AI a utility function of:

> U = V + W

where V refers to human values and W takes a very negative value for some arbitrary variable (e.g. diamond paperclips of length 5cm). So if the AI instead maximises -U, it would realise that it can gain more utility by just tiling the universe with garbage. But it seems entirely plausible that the sign error could occur on V rather than on U as a whole, resulting in the AI maximising U = W - V, which would result in torture.

Another proposition, which I found briefly described in a Facebook discussion that was linked to from somewhere, comes from Stuart Armstrong:

> Let B1 and B2 be excellent, bestest outcomes. Define U(B1) = 1, U(B2) = -1, and U = 0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.

> Or, more usefully, let X be some trivial feature that the agent can easily set to -1 or 1, and let U be a utility function with values in [0, 1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.

Later, he suggests that X should be a historical fact (i.e. the value of X would be set in stone 10 seconds after the system is turned on). As XU can only take non-negative values once X has been set to +1 (because U has values in [0, 1]), the greatest value -XU could take would be 0 (which suggests merely killing everyone). But this could still be problematic if e.g.
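To keep the cases straight, here's a rough side-by-side of the objectives above. The parenthetical readings of each failure mode are my own interpretation rather than part of either proposal, and the last line assumes the flip happens after X has already been locked in at +1:

$$
\begin{aligned}
\text{Intended objective:}\quad & \max\,(V + W)\\
\text{Sign flip on the whole utility:}\quad & \max\,(-V - W)\quad \text{(dominated by the arbitrary penalty term: tile the universe with garbage)}\\
\text{Sign flip on } V \text{ alone:}\quad & \max\,(W - V)\quad \text{(actively minimises human values: torture)}\\
\text{Armstrong, intended:}\quad & \max\, XU,\qquad X \in \{-1, +1\},\; U \in [0, 1]\\
\text{Armstrong, flipped after } X = +1 \text{ is locked in:}\quad & \max\,(-U)\quad \text{(pushes toward a zero-utility world, not a maximally negative one)}
\end{aligned}
$$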
I'm interested in arguments surrounding the energy-efficiency (and maximum intensity, if they're not the same thing) of pain and pleasure. I'm looking for any considerations or links regarding:

1. the suitability of "H=D" (equal efficiency, and possibly equal intensity) as a prior;
2. whether, given this prior, we have good a posteriori reasons to expect a skew in either the positive or negative direction; and
3. the conceivability of modifying human minds' faculties to experience "super-bliss" commensurate with the badness of the worst possible outcome, insofar as the possible intensities of human experience hinge on these considerations.
Picturing extreme torture - or even reading accounts of much less extreme suffering - pushes me towards suffering-focused ethics...