(I've written about this in my Shortform and may regurgitate some stuff from there.)
Eliezer proposes that we separate the AI we build, in design space, from an AI that would constitute a fate worse than death, so that a small error, e.g. flipping the sign (+/−) of the reward model or reversing the direction of updates to the reward model, can't turn the former into the latter. This seems absolutely crucial, although I'm not yet aware of any robust way of doing it. Eliezer proposes assigning the AI a utility function of:
U = V + W
where V refers to human values and W takes a very large negative value for some arbitrary condition (e.g. the universe being tiled with 5cm diamond paperclips). So if the AI instead maximises -U = -(V + W), it would realise that it can gain far more utility from the -W term by just tiling the universe with that garbage than it could gain from -V by harming us.
But it seems entirely plausible that the error could occur in V rather than in U as a whole, resulting in the AI maximising W - V, which would result in torture.
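To make this concrete, here is a toy sketch (all worlds and numbers are made up for illustration, not anyone's actual proposal) comparing what gets maximised under the intended objective, a flip of the whole utility, and a flip of V alone:

```python
# Toy illustration of U = V + W and the two sign-flip scenarios above.
# V stands in for human values; W is the huge penalty attached to an
# arbitrary condition ("garbage" = tiling the universe with paperclips).

def V(world):
    """Stand-in for human values: high for good worlds, very low for torture."""
    return {"good": 100.0, "garbage": 0.0, "torture": -1_000.0}[world]

def W(world):
    """Very large negative term triggered only by the arbitrary condition."""
    return -1_000_000.0 if world == "garbage" else 0.0

worlds = ["good", "garbage", "torture"]

U         = {w: V(w) + W(w)    for w in worlds}  # intended objective
flipped_U = {w: -(V(w) + W(w)) for w in worlds}  # whole utility negated
flipped_V = {w: W(w) - V(w)    for w in worlds}  # only V's sign flipped

print(max(U,         key=U.get))          # 'good'    -> intended behaviour
print(max(flipped_U, key=flipped_U.get))  # 'garbage' -> tiles universe with junk
print(max(flipped_V, key=flipped_V.get))  # 'torture' -> the failure worried about here
```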
------------------------------------------------------------------------------------------------------
Another proposal I found briefly described in a Facebook discussion that was linked to from somewhere. Stuart Armstrong proposes the following:
> Let B1 and B2 be excellent, bestest outcomes. Define U(B1) = 1, U(B2) = -1, and U = 0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.
>
> Or, more usefully, let X be some trivial feature that the agent can easily set to -1 or 1, and let U be a utility function with values in [0, 1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.
Later, he suggests that X should be a historical fact (i.e. the value of X would be set in stone 10 seconds after the system is turned on). As XU can only take non-negative values once X has been set (because U has values in [0, 1]), the greatest value -XU could take would be 0 (which suggests merely killing everyone rather than torture).
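A toy sketch of that point (made-up worlds and values, not Armstrong's actual formalism): once X is fixed, the sign-flipped objective -XU is capped at 0, so the broken agent has no incentive to prefer suffering over a "null" outcome.

```python
# Toy sketch of the X*U proposal: U is bounded in [0, 1] and X is a
# historical fact fixed shortly after startup (here X = 1).

def U(world):
    """Bounded utility: 1 for the best world, 0 for worthless/null worlds."""
    return {"best": 1.0, "mediocre": 0.5, "everyone_dead": 0.0, "torture": 0.0}[world]

X = 1  # set "in stone" ~10 seconds after the system is switched on

worlds = ["best", "mediocre", "everyone_dead", "torture"]

intended = {w: X * U(w)    for w in worlds}  # correct objective -> 'best'
flipped  = {w: -(X * U(w)) for w in worlds}  # sign-flipped objective

print(max(intended, key=intended.get))  # 'best'
# Because U >= 0, the flipped objective tops out at 0: a null world
# ('everyone_dead') scores no worse than 'torture', so the broken agent
# gains nothing from creating suffering.
print(max(flipped, key=flipped.get))
```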
But this could still be problematic if e.g. the bug occurred in the reward function/model itself, such that it gave positive values for bad things and negative values for good things. Although I'm not sure how frequently errors effectively multiply everything in the reward by -1, and I'm also unsure how this would interact with an error that reverses the direction of updates to a reward model.
------------------------------------------------------------------------------------------------------
A few possible causes of this type of error (the list is obviously not exhaustive):
- A bug caused by the AI's developers. For example, GPT-2 experienced an interesting bug during fine-tuning:
> Bugs can optimize for bad behavior
>
> One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form.
The responses to this thread suggest that this type of thing would be noticed and fixed immediately, although this view doesn't appear to be held unanimously. See also Gwern's follow-up comment. (For a concrete illustration of how a flipped reward inverts what gets reinforced, see the sketch after this list.)
- Errors in self-modification/improvement
- Programmers making a mistake in a relevant database
- Some sort of weird vulnerability getting exploited by daemons (unsure on this one)
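As a minimal, hypothetical illustration (a REINFORCE-style sketch, not any particular codebase) of why a single misplaced minus sign in the reward pipeline is so dangerous: the buggy update is exactly the negation of the correct one, so the agent is actively trained toward whatever the reward model scores lowest.

```python
# Minimal sketch (hypothetical, not any real system's code): a single
# sign error in the reward inverts what a REINFORCE-style update reinforces.
import numpy as np

def policy_gradient_step(grad_log_prob, reward, learning_rate=1e-2):
    """Move parameters in the direction that makes the observed action
    more likely, scaled by the reward it received."""
    return learning_rate * reward * grad_log_prob

grad_log_prob = np.array([0.5, -0.2])  # d/dtheta of log pi(action | state)
reward = 1.0                           # intended meaning: "good outcome"

correct_update = policy_gradient_step(grad_log_prob, reward)
buggy_update   = policy_gradient_step(grad_log_prob, -reward)  # flipped sign

# The buggy update is the exact negative of the correct one: behaviour the
# reward model scores highly is now made *less* likely, and vice versa.
assert np.allclose(buggy_update, -correct_update)
```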
------------------------------------------------------------------------------------------------------
So, yeah. Are there any mechanisms for preventing this sort of thing other than the two I listed? And if not, would you expect the two examples above to robustly prevent this type of error regardless of its cause?
I'm not sure what probability people on this forum would put on brain-inspired AGI. I personally would put >50%, but this seems quite a bit higher than other people on this forum, judging by how little brain algorithms are discussed here compared to prosaic (stereotypical PyTorch / Tensorflow-type) ML. Or maybe the explanation is something else, e.g. maybe people feel like they don't have any tractable directions for progress in that scenario (or just don't know enough to comment), or maybe they have radically different ideas than me about how the brain works and therefore don't distinguish between prosaic AGI and brain-inspired AGI.
Understanding brain algorithms is a research program that thousands of geniuses are working on night and day, right now, as we speak, and the conclusion of the research program is guaranteed to be AGI. That seems like a pretty good reason to put at least some weight on it! I put even more weight on it because I've worked a lot on trying to understand how the neocortical algorithm works, and I don't think that the algorithm is all that complicated (cf. "cortical uniformity"), and I think that ongoing work is zeroing in on it (see here).