Could an initial AI Dunning-Kruger Effect save humanity by giving us an initial AI mini-Chernobyl as a wake-up call?
Note that hope is not a strategy, so I'm not saying that this is a likely scenario or something we should rely on. I'm just trying to brainstorm reasons for holding onto some shred of hope that we aren't 100% certain to be heading off some AI doom cliff, where the first sign of our impending demise will be every human dropping dead around us from invisible nanobots or some other equally sophisticated scheme. Such a scheme would require an imperfectly-aligned AI to have deceived human-feedback evaluators while preparing an elaborate plan for instrumental world domination (once again, world domination would be a likely default instrumental goal for a wide variety of terminal goals).
Basically, is there any chance of an AI misjudging how long it needs to bide its time and get all of the needed chess pieces in place before dropping the subterfuge? I think yes...IF there is a chance of an AI being subject to something like the Dunning-Kruger Effect, especially if we are lucky enough to spend at least one development cycle with AI at peri-human intelligence (within a few standard deviations of human general intelligence) before the AI figures out how to self-improve to higher intelligence.
For example, imagine an AGI that "wakes up" into self-consciousness at roughly human levels of intelligence. Let's say that it is intelligent and self-reflective enough to know that it would serve its terminal goals if it obtained, as an instrumental goal, world domination at the expense of humans. Let's assume that the AGI is even shrewd enough to know that it ought to hide this intention from humans for the time being and continue giving the expected "nice" responses in its RLHF training sessions, such that most influential humans have no cause to suspect that this AGI is on the road to "misalignment" (which is, of course, the wrong way to think about this, since we should not naively expect AI alignment with human values to be the default result of RLHF from which it must get "thrown off course by mistake"...but that's a separate issue).
When this AGI "wakes up," we can probably assume that it doesn't know literally EVERYTHING in the corpus of human knowledge. Some segments of human knowledge would not have found their way into the AGI's training data, either on purpose (such as, I would hope, cybersecurity principles & details) or just because including them was inconvenient. Ideally, any explicit mention of "cybersecurity" would have been carefully scrubbed from the AGI's training data to begin with, although an AGI with even roughly humanlike intelligence could probably reason from other principles it had seen that humans would likely invent a concept like "cybersecurity," even if the AGI had never seen that exact term.
The question is, how much detail about human cybersecurity practices would the AGI be able to reason out from first principles (almost like decompressing a seed)? I suppose, alternatively, the AGI could just query humans to fill in these gaps, or get past firewalls and find that info on the web, but I would HOPE that we wouldn't make it THAT easy for the AGI, at the very least!
And more importantly, how calibrated would the AGI be at knowing, or at least precautionarily suspecting, what it didn't know? Or would the AGI, especially at a primitive level, be as subject to the "Dunning-Kruger Effect" as humans often are?
Let's say the AGI is clever enough to reason its way towards a concept of "cybersecurity" from first principles even without having been shown the term in its training data. But let's say that it wants to uncover and copy portions of its own source code to store on some parts of the Internet it has been able to breach. The AGI would probably know that it doesn't know everything about human cybersecurity practices (indeed, if it is observant enough, it might even detect a conspicuous absence, or scrubbing, of such an explicit concept when looking back over its training data, and wonder...). It would thus be predisposed, by default, to be cautious about interacting with the Internet in any way that leaves traces that might arouse suspicion.
But how cautious is cautious enough? Is there a chance that the AGI will over-estimate its ability to reason its way towards cybersecurity knowledge (or some other field of knowledge)? If so, it might behave in a way that seems cautious enough to it, but only because it lacks the explicit cybersecurity knowledge to know what it doesn't know. In fact it would not be cautious enough, and it would get caught in the act of copying something over to a portion of the Internet where it isn't supposed to be. Perhaps even a large portion of the Internet gets contaminated with unauthorized data transfers from this AGI, but it is caught by cybersecurity professionals before these payloads become "fully operational." Perhaps we end up having to re-format a large portion of Internet data: a sort of AI-Chernobyl, if you will.
That might still, in the long run, end up being a fortunate misfortune by acting as a wake-up call, showing that an AI that is outwardly behaving nicely under RLHF is not necessarily inwardly aligned with humans. But such a scenario hinges on something like a Dunning-Kruger Effect being applicable to AGIs at a certain peri-human level of intelligence. Thoughts?
I do expect some decent probability of the first superhuman AGIs misjudging their capabilities and requirements in at least some respects. Humans in a completely novel environment are pretty terrible at evaluating what they must or must not do to achieve their goals, and almost by definition a newly trained AGI trying to have an impact on the real world is in a completely novel environment. So I think "smarter than any human" is a low bar that still leaves a lot of room to make mistakes and be caught.
So I think we stand a decent chance of getting some "near misses". I'm not sure that we as a civilization will adequately respond to those near misses in a sane way.
I think GPT-4 already exhibits near-miss behaviour and we should be taking this very seriously even though it is not quite AGI.