I'm trying think through various approaches to AI alignment, and so far this is the one I came up with that I like best. I have not read much of the literature, so please do point me if this has been discussed before.
What if we train an AI agent (ie, reinforcement learning) to survive/thrive in an environment where there are a wide variety of agents with wildly different levels of intelligence? In particular, such that pretty much every agent can safely assume they'll eventually meet an agent much smarter than they are; structure the environment to reward tit-for-tat with a significant bias towards cooperation, eg require agents to "eat" resources that require cooperation to secure and are primarily non-competitive. The idea is to have them learn to respect even beings of lesser intelligence, because they want beings of higher intelligence to respect them; and because in this environment a bunch of lesser intelligences can gang up and defeat one higher-intelligence being. Also, we effectively train each AI to detect and defeat new AIs that seek to disturb this balance. I have not thought this through, curious what you all think
(Cross posted from EA Forum)
Agent is anyone or anything that has intelligence and the means of interacting with the real world. I.e. agents are AIs or humans.
One AI =/= one vote. One human = one vote. AIs are only getting as much authority as humans, directly or indirectly, entrust them with. So, if AI needs more authority, it has to justify it to humans and other AIs. And it can't request too much of authority just for itself, as tasks that would require a lot of authority will be split between many AIs and people.
You are right that the authority to "vote out" other AIs may be misused. That's where logs would be handy - for other agents to analyse the "minds" of both sides and see who was doing right.
It's not completely fool proof, of course, but it means that attempts to power grab will not likely to happen completely under the radar.