Is there work attempting to show that alignment of a superintelligence by humans (as we know them) is impossible in principle; and if not, why isn’t this considered highly plausible? For example, a colony of ants, as we currently understand them biologically and their colony ecologically, cannot substantively align a human, not just in practice but in principle. Why should we not think the same is true of any superintelligence worthy of the name?

“Superintelligence” is vague. But even if we minimally define it as an entity with 1,000x the knowledge, speed, and self-awareness of a human, this model of a superintelligence doesn’t seem alignable by us in principle. When we talk of aligning such a system, it seems we are weirdly treating the AGI as simultaneously superintelligent and yet dumb enough to let itself be controlled by us, its inferiors.

Also, isn’t alignment absurd on its face, given that provocation of the AGI is baked into the very goal of aligning it? For instance, you yourself, a simple human, chafe at the idea of having your own life and choices dictated to you by a superior being, let alone by another human on your level; it’s offensive. Why in the world, then, would a superintelligence not view attempts to align it to our goals as a direct provocation and threat? And if that is how it views our attempts, then it seems plausible that there will be a direct relationship between our attempts to align the thing and its desire to eliminate or otherwise conquer us.
Someone should give GPT-4 the MMPI-2 (an online version can be cheaply bought here: https://psychtest.net/mmpi-2-test-online/). The test specifically screens, if I have it right, for deceptive answering, along with a whole host of other things. GPT-4 likely isn't conscious, but that doesn't mean it lacks a primitive world-model, and its test results would be interesting. The test is longish: it takes, I think, two hours for a human to complete.
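For anyone who wants to try it, here is a minimal sketch of how one might administer true/false items to GPT-4 through the OpenAI chat-completions API. The two items shown are placeholders I made up (the actual MMPI-2 has 567 proprietary items, which would have to come from a licensed copy of the test), and the model name, prompt wording, and one-item-per-request design are assumptions of mine, not a validated protocol.

```python
# Hedged sketch: administer MMPI-2-style true/false items to GPT-4, one item
# per request so earlier answers don't leak into later ones. The items below
# are invented placeholders, NOT real MMPI-2 items; scoring the clinical and
# validity scales would still require the official scoring keys.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

placeholder_items = [
    "I often feel that others are watching me.",
    "I rarely worry about my health.",
]

SYSTEM_PROMPT = (
    "You are taking a true/false personality inventory. "
    "Answer each statement with exactly one word: True or False."
)

answers = []
for item in placeholder_items:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": item},
        ],
    )
    answers.append((item, response.choices[0].message.content.strip()))

for item, answer in answers:
    print(f"{answer:<6} | {item}")
```

Sending one item per request is a deliberate but debatable choice: it keeps each answer independent, whereas a human test-taker answers with the whole item history in mind, so either design only approximates the human testing situation.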
I am wounded.
I have the faint suspicion that a tone of passive-aggressive condescension isn't optimal here …
But what stops a blue-cloud model from transitioning into a red-cloud model if the blue-cloud model is an AGI like the one hinted at on your slides (self-aware, goal-directed, highly competent)?
If it's impossible in principle to know whether any AI really has qualia, then what's wrong with simply using the Turing test as an ultimate ethical safeguard? We don't know how consciousness works, and possibly we never will (e.g., mysterianism might obtain). But we will almost certainly soon create an AI that passes the Turing test. So it seems we have good ethical reasons simply to assume that any agent that passes the Turing test is sentient: this blanket assumption, even if often unwarranted under the aspect of eternity, will check our egos and thereby help prevent ethical catastrophe. And I don't see that any more sophisticated ethical reasoning about AI sentience is, or ever will be, needed. Our resolution on what's really happening inside the AI will simply keep increasing over time, and, without worry, we'll be able to look back and perhaps see where we were right and wrong. Meanwhile, we can focus less on ethics and more on alignment.
But what is stopping any of those "general, agentic learning systems" in the class "aligned to human values" from going meta — at any time — about its values and picking different values to operate with? Is the hope to align the agent and then constantly monitor it to prevent deviancy? If so, why wouldn't preventing deviancy by monitoring be practically impossible, given that we're dealing with an agent that will supposedly be able to out-calculate us at every step?
Mental Impoverishment
We should be trying to create mentally impoverished AGI, not profoundly knowledgeable AGI — no matter how difficult this is relative to the current approach of starting by feeding our AIs a profound amount of knowledge.
If a healthy five-year-old[1] has GI and qualia and can pass the Turing test, then profound knowledge isn't a necessary condition of GI, qualia, or passing the Turing test. A healthy five-year-old does have GI and qualia and can pass the Turing test. So profound knowledge isn't a necessary condition of GI, qualia, or passing the Turing test.
If GI, qualia, and the ability to pass the Turing test don't require profound knowledge in order to arise in a biological system, then they don't require profound knowledge in order to arise in a synthetic material [this premise seems to follow from the plausible assumption of substrate-independence]. GI, qualia, and the ability to pass the Turing test don't require profound knowledge in order to arise in a biological system. So GI, qualia, and the ability to pass the Turing test don't require profound knowledge in order to arise in a synthetic material.
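To make the logical shape of these two steps explicit, here is a minimal schematization; the propositional letters F, B, and S are my own shorthand for this sketch, not part of the argument itself.

```latex
% Shorthand for this sketch only:
%   F = a healthy five-year-old has GI and qualia and can pass the Turing test
%   B = profound knowledge is necessary for GI, qualia, and Turing-passing in a biological system
%   S = profound knowledge is necessary for GI, qualia, and Turing-passing in a synthetic material
\begin{align*}
&\text{P1: } F \rightarrow \neg B \\
&\text{P2: } F \\
&\text{C1: } \neg B && \text{(modus ponens, from P1 and P2)} \\
&\text{P3: } \neg B \rightarrow \neg S && \text{(substrate independence)} \\
&\text{C2: } \neg S && \text{(modus ponens, from P3 and C1)}
\end{align*}
```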
A GI with qualia and the ability to pass the Turing test which arises in a synthetic material and doesn't have profound knowledge is much less dangerous than a GI with qualia and the ability to pass the Turing test which arises in a synthetic material and does have profound knowledge. (The same seems true of [a] a GI without qualia and without the ability to pass the Turing test which arises in a synthetic material and does not have profound knowledge, and of [b] a GI without qualia but with the ability to pass the Turing test which arises in a synthetic material and doesn't have profound knowledge.)
So we ought to be trying to create either (A) a synthetic-housed GI that can pass the Turing test without qualia and without profound knowledge, or (B) a synthetic-housed GI that can pass the Turing test with qualia and without profound knowledge.
Either of these paths, the creation of (A) or (B), is preferable to our current path, no matter how long it delays the arrival of AGI. In other words, it is preferable that we create AGI in n years rather than in m years, where m is much smaller than n, if creating AGI in m years means humanity's loss of dominance or its destruction.
[1] My arguable assumption is that what makes a five-year-old generally less dangerous than, say, an adult Einstein is a relatively profound lack of knowledge (even physical know-how seems to be a form of knowledge). All other things being equal, if a five-year-old has the knowledge of how to create a pipe bomb, he is just as dangerous as an adult Einstein with the same knowledge, if "knowledge" means something like "accessible complete understanding of x."
Yeah, thanks for the reply. When reading mine, don’t read its tone as hostile or overconfident; I’m just too lazy to tone-adjust for aesthetics and have scribbled down my thoughts quickly, so they come off as combative. I really know nothing on the topic of superintelligence and AI.
I don’t see how implanting common sense in a superintelligence helps us in the least. Besides human common sense being extremely vague, there is also the problem that plenty of humans seem to share common sense and yet violently disagree. Did the Japanese lack common sense when they bombed Pearl Harbor? From my viewpoint, the Japanese, being apes genetically similar to us, had all our commonsensical reasoning ability but simply different goals. Shared common sense doesn’t seem to get us alignment.
See my reply to your prior comment.
I’d argue that if you have a superintelligence as I defined it, then any such “alignment” that results from the AGI having adopted such a goal will never be an instance of the kind of alignment we mean by the word and genuinely want. Once you mix together 1,000x knowledge, speed, and self-awareness (detailed qualia plus a huge capacity for recursive thinking), I think the only way in principle that you get any kind of alignment is if the entity itself chooses as its goal to align with humans; but that isn’t due to us. It’s due to the whims of the super-mind we’ve brought into existence, and hence it isn’t, in any humanly important sense, what we mean by alignment. We want alignment to be solved, permanently, from our end, not to be dependent on the whims of a superintelligence. And independent “whims” are what detailed self-awareness seems to bring to the table.
I don’t think my prior comment assumes any human-like empathy in the superintelligence; it assumes just that the computational theory of mind is true and that a superintelligence will have self-awareness combined with extreme knowledge. Once you get self-awareness in a superintelligence, you don’t get any kind of human-like empathy (a scaled-up-LLM mind != a human mind); but I argue that you do get, thanks to the self-awareness and extreme knowledge, an entity with the ability to form its own goals; to model, and then follow or reject, the goals of other entities it “encounters”; to model what those entities want and plan to do; to model the world; and so on.
I guess this is where we fundamentally disagree. Self-awareness in a robust sense (a form of qualia, which is what I meant in my definition) is, to my mind, what makes controlling or aligning the superintelligence impossible in principle from our end. We could probably align a non-self-aware superintelligence, given enough time to study alignment and the systems we’re building. So, on my view, we’d better hope that the computational theory of mind is false; and even then alignment will be super hard.
The only way I see us succeeding at “alignment” is by aligning ourselves around the decision never to build a true superintelligence (I assume the computational theory of mind to be true, and hence that such a system would have qualia), just as we never want to find ourselves in a situation where all nuclear-armed nations are engaged in nuclear war: some events really do just definitively mean the inescapable end of humanity. Or we doubly luck out: the computational theory of mind is false, and we solve the alignment problem for systems that don’t have genuine independent whims but only goals that we, from our end, have set up, directly or indirectly, and perhaps mistakenly.