Steven Byrnes (5y)

In a brain-like AGI, as I imagine it, the "neocortex" module does all the smart and dangerous things, but it's a (sorta)-general-purpose learning algorithm that starts from knowing nothing (random weights) and gets smarter and smarter as it trains. Meanwhile a separate "subcortex" module is much simpler (dumber) but has a lot more hardwired information in it, and this module tries to steer the neocortex module to do things that we programmers want it to do, primarily (but not exclusively) by calculating a reward signal and sending it to the neocortex as it operates. In that case, let's look at 3 scenarios:

1. The neocortex module is steered in the opposite direction from what was intended by the subcortex's code, and this happens right from the beginning of training.

Then the neocortex probably wouldn't work at all. The subcortex is important for capabilities as well as goals; for example, the subcortex (I believe) has a simple human-speech-sound detector, and it signals to the neocortex that those sounds are important to analyze, and thus a baby's neocortex learns to model human speech but not to model the intricacies of bird songs. The reliance on the subcortex for capabilities is less true in an "adult" AGI, but I think it's very true in a "baby" AGI; I'm skeptical that a system can bootstrap itself to superhuman intelligence without some hardwired guidance / curriculum early on. Moreover, in the event that the neocortex does work, it would probably misbehave in obvious ways very early on, before it knows anything about the world, what a "person" is, etc. Hopefully there would be human or other monitoring of the training process that would catch that.

2. The neocortex module is steered in the opposite direction from what was intended by the subcortex's code, and this happens when it is already smart.

The subcortex doesn't provide a goal system as a nicely-wrapped package to be delivered to the neocortex; instead it delivers little bits of guidance at a time. Imagine that you've always loved beer, but when you drink it now, you hate it; it's awful. You would probably stop drinking beer, but you would also say, "what's going on?" Likewise, the neocortex would have developed a rich interwoven fabric of related goals and beliefs, much of which supports itself with very little ground-truth anchoring from subcortex feedback. If the subcortex suddenly changes its tune, there would be a transition period during which the neocortex would retain most of its goal system from before, and might shut itself down, email the programmers, hack into the subcortex, or who knows what, to avoid getting turned into (what it still mostly sees as) a monster. The details are contingent on how we try to steer the neocortex.

3. The neocortex's own goal system flips sign suddenly.

Then the neocortex would suddenly become remarkably ineffective. The neocortex uses the same system for flagging concepts as instrumental goals and flagging concepts as ultimate goals, so with a sign flip, it gets all the instrumental goals wrong; it finds it highly aversive to come up with a clever idea, or to understand something, etc. etc. It would take a lot of subcortical feedback to get the neocortex working again, if that's even possible, and hopefully the subcortex would recognize a problem.
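To make scenario 3 a bit more concrete, here is a toy sketch, not a model of any real system. It assumes a single value table that scores both instrumental and ultimate concepts, and a final step that only offers consequential options if the earlier instrumental steps were done competently. All names and numbers (`VALUES`, `choose`, `run_episode`, the concept labels) are hypothetical and chosen purely for illustration.

```python
# Toy illustration only: every name and number here is a made-up assumption.
# One shared table assigns value to instrumental AND ultimate concepts.
VALUES = {
    "understand_problem": 0.7,  # instrumental
    "have_clever_idea": 0.6,    # instrumental
    "help_humans": 1.0,         # ultimate goal
    "harm_humans": -1.0,        # opposite of the ultimate goal
    "do_nothing": 0.0,
    "flail_randomly": -0.3,
}


def choose(options, sign):
    """Pick the option with the highest (possibly sign-flipped) value."""
    return max(options, key=lambda concept: sign * VALUES[concept])


def run_episode(sign=1):
    """Run three decision steps; sign=-1 models a flipped goal system."""
    history = [
        choose(["understand_problem", "do_nothing", "flail_randomly"], sign),
        choose(["have_clever_idea", "do_nothing", "flail_randomly"], sign),
    ]
    # Consequential outcomes are only reachable after competent groundwork.
    competent = history == ["understand_problem", "have_clever_idea"]
    if competent:
        final_options = ["help_humans", "harm_humans", "do_nothing"]
    else:
        final_options = ["do_nothing", "flail_randomly"]
    history.append(choose(final_options, sign))
    return history


print(run_episode(sign=1))
print(run_episode(sign=-1))
```

In this toy, the flipped agent never reaches the point where it could competently pursue the opposite of the original goal, because the same flip makes understanding and clever ideas aversive. Whether a real system would share one flagging mechanism in this way is exactly the assumption the paragraph above relies on.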

This is just brainstorming off the top of my (sleep-deprived) head. I think you're going to say that none of these is a particularly rock-solid assurance that the problem could never ever happen, and I'll agree.

How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe?
