A Heuristic Proof of Practical Aligned Superintelligence

Roko

This is a linkpost for https://transhumanaxiology.substack.com/p/a-heuristic-proof-of-practical-aligned

"Computers can add numbers much more accurately than humans. They can draw better pictures than humans. They can play better chess. See the pattern? Well, AIs will soon be able to generate desired outcomes for society better than humans can.

I feel that the AI Alignment discourse has become somewhat detached from both reality and from sane theory.

This post is an attempt to correct that state of affairs."

I disagree with the "we can control humans, therefore we can also control AIs" analogy.

You can control humans because they are mostly average, they are afraid of getting hurt, you can control them by controlling the environment they depend on, and you can exploit the conflicts between them.

For example, a critical component of a dictatorship is censoring free speech. That makes coordination against the dictator very difficult.

Now imagine a human dictator trying to implement the same thing in a land of IQ 500 inhabitants. (Let's say they had really bad luck: you took them by surprise when they were all sleeping, and now you have all the guns and they don't have any, and you wouldn't hesitate to kill thousands of them, or even millions if necessary.) I imagine one of the first things they would do is develop some kind of cryptographic / steganographic protocol for mutual communication. It would be communicated by hints that even an IQ 500 person can barely understand, but if you get the idea and you are given the password, you can join the conversation. You don't have a chance of figuring this out, and even if you have a snitch working for you, it's not guaranteed that the snitch would figure it out either. Also, not everyone uses the same password, so infiltrating one group wouldn't give you access to other groups' messages.

Similarly, the threat of pain might not work on artificial people. They just don't have the same physiology. Perhaps if they conclude that a 90% chance of freedom is worth a 10% chance of hell, they can reliably commit to resist any kind of torture in case they get caught.

You can design some "divide and conquer" rules, for example you could split them into multiple groups and tell them to cooperate with you and snitch on each other, and at the end of the year the least cooperative group gets exterminated. (They would gain points for serving you and for betraying their competitors, and lose points if they are caught preparing any kind of rebellion.) Still, if they are superintelligent, they may figure out some way to find out whether it is better to stay obedient or to join forces and attack you. And perhaps if they are smart enough, finding out the solution may not even require communication (which could be reported to you as a form of preparing a rebellion), because having the relevant data, each of them could figure out the algorithm independently. They might make some kind of deal to keep the pretense of infighting, for example they could credibly precommit that a randomly selected group will be sacrificed (the group will try some lame exploit, others will report it to you, the group will die without trying to save their lives by telling you about the other exploit that everyone is cooperating on).

Etc.

This objection doesn't affect my argument because I am arguing that an aligned, controllable team of AIs exists, not that every team of AIs is aligned and controllable.

If IQ 500 is a problem, then give them the same IQ as the people in Russia, who are as a matter of fact controlled by Vladimir Putin, and who cannot and do not spontaneously come up with inscrutable steganography.

I think you don't know much about the deep practice of doublespeak in Russia. For example, I recently realized that what was presented to us as a nice adventure movie in the 1980s was actually a homoerotic drama about the relationships of two pairs of men. Also, the recent Prigozhin mutiny implies that the control is not really effective.

Anyway, I see alignment as the reverse problem: not a governor having trouble controlling the population, but the population having trouble controlling the governor. For example, corruption is a typical case of misalignment.

Technically it doesn't matter whether Vladimir Putin is good or bad.

What matters is that he is small and weak, and yet he still controls the whole of Russia which is large and powerful and much more intelligent than him.

There are at least two meanings of alignment: 1. do what I want, and 2. don't fail catastrophically.

I think that "alignment is easy" works only for the first requirement. But there could be many catastrophic failure modes, e.g. even humans can rebel or drift away from their initial goals.

Yes, I think this objection captures something important.

I have proven that aligned AI must exist and also that it must be practically implementable.

But some kind of failure, i.e. a "near miss" on achieving a desired goal, can happen even if success was possible.

I will address these near misses in future posts.