(I've written about this in my Shortform and may regurgitate some stuff from there.)
Eliezer proposes that we separate an AI in design space from one that would constitute a fate worse than death if e.g. the reward model's sign (+/-) were flipped or the direction of updates to the reward model were reversed. This seems absolutely crucial, although I'm not yet aware of any robust way of doing this. Eliezer proposes assigning the AI a utility function of:
U = V + W
Where V refers to human values & W takes a very negative value for some arbitrary variable (e.g. diamond paperclips of length 5cm). So if the AI instead maximises -U, it'd realise that it can gain more utility by just tiling the universe with garbage.
But it seems entirely plausible that the error could occur with V instead of U, resulting in the AI maximising U = W - V, which would result in torture.
------------------------------------------------------------------------------------------------------
Another proposition I found briefly described in a Facebook discussion that was linked to by somewhere. Stuart Armstrong proposes the following:
Let B1 and B2 be excellent, bestest outcomes. Define U(B1) = 1, U(B2) = -1, and U = 0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.
Or, more usefully, let X be some trivial feature that the agent can easily set to -1 or 1, and let U be a utility function with values in [0, 1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.
Later, he suggests that X should be a historical fact (i.e. the value of X would be set in stone 10 seconds after the system is turned on.) As XU can only take positive values (because U has values in [0, 1]), the greatest value -XU could take would be 0 (which suggests merely killing everyone.)
But this could still be problematic if e.g. the bug occurred in the reward function/model such that it gave positive values for bad things and negative values for bad things. Although I'm not sure how frequently errors effectively multiply everything in the reward by -1. I'm also unsure how this would interact with an error that reverses the direction of updates to a reward model.
------------------------------------------------------------------------------------------------------
A few possible (?) causes for this type of error include: (list obviously not exhaustive)
- Bug caused by AI's developers. For example, GPT-2 experienced an interesting bug:
Bugs can optimise for bad behavior
One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form.
The responses to this thread suggest that this type of thing would be noticed and redressed immediately, although this view doesn't appear to be held unanimously. See also Gwern's follow-up comment.
- Errors in self-modification/improvement
- Programmers making a mistake in a relevant database
- Some sort of weird vulnerability getting exploited by daemons (unsure on this one)
------------------------------------------------------------------------------------------------------
So, yeah. Are there any mechanisms to prevent this sort of thing from happening other than the two that I listed; and if not, would you expect the examples provided to robustly prevent this type of error from happening regardless of the cause?
In a brain-like AGI, as I imagine it, the "neocortex" module does all the smart and dangerous things, but it's a (sorta)-general-purpose learning algorithm that starts from knowing nothing (random weights) and gets smarter and smarter as it trains. Meanwhile a separate "subcortex" module is much simpler (dumber) but has a lot more hardwired information in it, and this module tries to steer the neocortex module to do things that we programmers want it to do, primarily (but not exclusively) by calculating a reward signal and sending it to the neocortex as it operates. In that case, let's look at 3 scenarios:
1. The neocortex module is steered in the opposite direction from what was intended by the subcortex's code, and this happens right from the beginning of training.
Then the neocortex probably wouldn't work at all. The subcortex is important for capabilities as well as goals; for example, the subcortex (I believe) has a simple human-speech-sound detector, and it prods the neocortex that those sounds are important to analyze, and thus a baby's neocortex learns to model human speech but not to model the intricacies of bird songs. The reliance on the subcortex for capabilities is less true in an "adult" AGI, but very true in a "baby" AGI I think; I'm skeptical that a system can bootstrap itself to superhuman intelligence without some hardwired guidance / curriculum early on. Moreover, in the event that the neocortex does work, it would probably misbehave in obvious ways very early on, before it knows it knows anything about the world, what is a "person", etc. Hopefully there would be human or other monitoring of the training process that would catch that.
2. The neocortex module is steered in the opposite direction from what was intended by the subcortex's code, and this happens when it is already smart.
The subcortex doesn't provide a goal system as a nicely-wrapped package to be delivered to the neocortex; instead it delivers little bits of guidance at a time. Imagine that you've always loved beer, but when you drink it now, you hate it, it's awful. You would probably stop drinking beer, but you would also say, "what's going on?" Likewise, the neocortex would have developed a rich interwoven fabric of related goals and beliefs, much of which supports itself with very little ground-truth anchoring from subcortex feedback. If the subcortex suddenly changes its tune, there would be a transition period when the neocortex would retain most of its goal system from before, and might shut itself down, email the programmers, hack into the subcortex, or who knows what, to avoid getting turned into (what it still mostly sees as) a monster. The details are contingent on how we try to steer the neocortex.
3. The neocortex's own goal system flips sign suddenly.
Then the neocortex would suddenly become remarkably ineffective. The neocortex uses the same system for flagging concepts as instrumental goals and flagging concepts as ultimate goals, so with a sign flip, it gets all the instrumental goals wrong; it finds it highly aversive to come up with a clever idea, or to understand something, etc. etc. It would take a lot of subcortical feedback to get the neocortex working again, if that's even possible, and hopefully the subcortex would recognize a problem.
This is just brainstorming off the top of my (sleep-deprived) head. I think you're going to say that none of these are particularly rock-solid assurance that the problem could never ever happen, and I'll agree.
I'm not sure what probability people on this forum would put on brain-inspired AGI. I personally would put >50%, but this seems quite a bit higher than other people on this forum, judging by how little brain algorithms are discussed here compared to prosaic (stereotypical PyTorch / Tensorflow-type) ML. Or maybe the explanation is something else, e.g. maybe people feel like they don't have any tractable directions for progress in that scenario (or just don't know en... (read more)